gxqcn
发表于 2020-2-18 13:52:28
谢谢!
刚刚我也查到了,从 C++17 起,有 std::aligned_alloc 可用了 :)
现在看来,软件上,新的 C++ 的标准越来越给力了;
但硬件 CPU 对 64 位整数运算支持的指令,尤其是SIMD,还是相当欠缺。
mathe
发表于 2020-2-18 14:35:07
你可以试验直接使用mulx指令看看。只要多条指令之间没有数据依赖关系,硬件会自动并行化,不一定需要用SIMD
liangbch
发表于 2020-2-19 13:22:42
gxqcn 发表于 2020-2-18 11:41
有两个问题请教大家:
内部的基数是32或者64bits.
gmp.h中有如下定义,这个gmp.h在编译时生成,默认的32位环境是32,64位环境是64. 不过可以配置为其它值,有效取值是32,64和x32,最后一个不知道啥意思,我这里make时出错。
#define GMP_LIMB_BITS32
如在64位环境编译为32为的库,可使用下列命令
./configure ABI=32
happysxyf
发表于 2020-2-19 13:45:42
有啥新的进展没?
liangbch
发表于 2020-2-19 13:46:01
看了下GMP代码,GMP只在64位的版本中,在下列文件中用到了256位寄存器(ymm寄存器)的vmovdqa和movdqu指令,没有用到avx512指令。其它部分也未用到avx指令。
x86_64/fastavx/copyi.asm
x86_64/fastavx/copyd.asm
gxqcn
发表于 2020-2-19 15:44:43
我记得好多年前,GMP 的主页上提到它内部基数的位数,与系统指针的位数并不相等,好像64位下是 52bits 还是 56bits 来着,但刚才去查没查到。
另外,相同的计算结果,32位与64位版本,效率比大概是多少?
liangbch
发表于 2020-2-23 20:42:03
实际测试了一下,64位版本的效能大约为32位版本的2.6倍左右,比想象的要低一些。
我的电脑是Intel(R) Xeon(R) CPU E5-2670,sandybridge架构。
下面是测试数据。
为了便于比较,1个长度单位为32bits,时间单位为微秒。使用16进制字符串赋值,仅测试函数mpn_mul的运行时间。
例如,行"1024*1024 158 59 2.68" 表示,被乘数长度和乘数的长度都是1024*32比特,32位版本运行时间为158微秒,64位版本运行时间是58微秒,64位版本的速度是32版本的2.68倍。
len1*len2 32 bits lib(us) 64bits lib(us) 64bits/32bits
64*64 98 109 0.90
70*70 95 137 0.69
76*76 90 133 0.68
83*83 96 126 0.76
91*91 87 115 0.76
99*99 86 117 0.74
108*108 75 114 0.66
117*117 70 107 0.65
128*128 85 105 0.81
140*140 72 102 0.71
152*152 58 100 0.58
166*166 65 92 0.71
181*181 74 82 0.90
197*197 43 82 0.52
215*215 49 79 0.62
235*235 55 72 0.76
256*256 62 79 0.78
279*279 72 68 1.06
304*304 79 52 1.52
332*332 91 61 1.49
362*362 102 69 1.48
395*395 117 40 2.93
431*431 133 46 2.89
470*470 147 45 3.27
512*512 153 57 2.68
558*558 172 64 2.69
609*609 79 74 1.07
664*664 88 85 1.04
724*724 98 50 1.96
790*790 112 43 2.60
861*861 126 49 2.57
939*939 143 52 2.75
1024*1024 158 59 2.68
1117*1117 178 66 2.70
1218*1218 205 78 2.63
1328*1328 230 85 2.71
1448*1448 264 95 2.78
1579*1579 285 111 2.57
1722*1722 323 125 2.58
1878*1878 410 140 2.93
2048*2048 423 155 2.73
2233*2233 464 176 2.64
2435*2435 521 202 2.58
2656*2656 582 226 2.58
2896*2896 690 251 2.75
3158*3158 763 288 2.65
3444*3444 830 325 2.55
3756*3756 923 375 2.46
4096*4096 1041 423 2.46
4467*4467 1201 454 2.65
4871*4871 1307 500 2.61
5312*5312 1487 569 2.61
5793*5793 1629 878 1.86
6317*6317 1834 714 2.57
6889*6889 2069 825 2.51
7512*7512 2345 882 2.66
8192*8192 2786 1009 2.76
8933*8933 2911 1118 2.60
9742*9742 3257 1382 2.36
10624*10624 3691 1429 2.58
11585*11585 3739 1570 2.38
12634*12634 4516 1767 2.56
13777*13777 4804 1954 2.46
15024*15024 5088 2005 2.54
16384*16384 6253 2385 2.62
17867*17867 6504 2612 2.49
19484*19484 6835 3021 2.26
21247*21247 8121 3168 2.56
23170*23170 8149 3246 2.51
25268*25268 10851 3903 2.78
27554*27554 10377 4024 2.58
30048*30048 10720 4285 2.50
32768*32768 13784 5325 2.59
35734*35734 14027 5509 2.55
38968*38968 16639 6388 2.60
42495*42495 17409 6805 2.56
46341*46341 17479 6860 2.55
50535*50535 22117 8583 2.58
55109*55109 22461 9167 2.45
60097*60097 22850 9098 2.51
65536*65536 31509 11979 2.63
71468*71468 32104 12324 2.60
77936*77936 34722 13251 2.62
84990*84990 41021 17141 2.39
92682*92682 43234 15444 2.80
101070*101070 50055 19123 2.62
110218*110218 50431 19881 2.54
120194*120194 53285 19809 2.69
131072*131072 70961 26338 2.69
142935*142935 71727 27189 2.64
155872*155872 85549 32807 2.61
169979*169979 100685 32574 3.09
185364*185364 118300 33296 3.55
202141*202141 114053 41074 2.78
220436*220436 118086 42328 2.79
240387*240387 120861 42334 2.85
262144*262144 160000 61190 2.61
285870*285870 166417 62167 2.68
311744*311744 177583 68385 2.60
339959*339959 213000 81006 2.63
370728*370728 218389 83596 2.61
404281*404281 253199 115366 2.19
440872*440872 257377 102662 2.51
480774*480774 261366 112603 2.32
524288*524288 349563 135294 2.58
571740*571740 360441 137598 2.62
623487*623487 427958 167993 2.55
679917*679917 450731 171983 2.62
741455*741455 454241 177001 2.57
808563*808563 556654 217155 2.56
881744*881744 546546 217261 2.52
961548*961548 551027 228512 2.41
1048576*1048576 739869 295126 2.51
1143480*1143480 811496 304789 2.66
1246974*1246974 881269 349131 2.52
1359835*1359835 939293 375359 2.50
1482910*1482910 1023155 431470 2.37
1617125*1617125 1148907 479196 2.40
1763488*1763488 1251408 479356 2.61
1923097*1923097 1361736 543979 2.50
2097152*2097152 1562511 627422 2.49
2286960*2286960 1667559 655844 2.54
2493948*2493948 1861228 755660 2.46
2719670*2719670 1979308 799447 2.48
2965821*2965821 2183367 903321 2.42
3234251*3234251 2404021 990645 2.43
3526975*3526975 2648709 1070849 2.47
3846194*3846194 2811668 1134798 2.48
4194304*4194304 3270555 1332069 2.46
liangbch
发表于 2020-2-23 22:10:40
64位basecase mul的原代码,见 mpn/x86_64/coreisbr/mul_basecase.asm,AT&T格式
dnlAMD64 mpn_mul_basecase optimised for Intel Sandy bridge and Ivy bridge.
dnlContributed to the GNU project by Torbjörn Granlund.
dnlCopyright 2003-2005, 2007, 2008, 2011-2013 Free Software Foundation, Inc.
dnlThis file is part of the GNU MP Library.
dnl
dnlThe GNU MP Library is free software; you can redistribute it and/or modify
dnlit under the terms of either:
dnl
dnl * the GNU Lesser General Public License as published by the Free
dnl Software Foundation; either version 3 of the License, or (at your
dnl option) any later version.
dnl
dnlor
dnl
dnl * the GNU General Public License as published by the Free Software
dnl Foundation; either version 2 of the License, or (at your option) any
dnl later version.
dnl
dnlor both in parallel, as here.
dnl
dnlThe GNU MP Library is distributed in the hope that it will be useful, but
dnlWITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
dnlor FITNESS FOR A PARTICULAR PURPOSE.See the GNU General Public License
dnlfor more details.
dnl
dnlYou should have received copies of the GNU General Public License and the
dnlGNU Lesser General Public License along with the GNU MP Library.If not,
dnlsee https://www.gnu.org/licenses/.
include(`../config.m4')
C cycles/limb mul_1 mul_2 mul_3 addmul_2
C AMD K8,K9
C AMD K10
C AMD bull
C AMD pile
C AMD steam
C AMD bobcat
C AMD jaguar
C Intel P4
C Intel core
C Intel NHM
C Intel SBR 2.5 2.5 - 2.95
C Intel IBR 2.4 2.3 - 2.68
C Intel HWL 2.35 2.0 - 2.5
C Intel BWL
C Intel atom
C VIA nano
C The inner loops of this code are the result of running a code generation and
C optimisation tool suite written by David Harvey and Torbjorn Granlund.
C TODO
C* Fix the addmul_2 fluctuation affecting SBR.
C* Improve feed-in code, avoiding zeroing of many registers and dummy adds in
C the loops at the expense of code size.
C* Adjoin a mul_3, avoiding slow mul_1 for odd vn.
C* Consider replacing the 2-way mul_2 code with 4-way code, for a very slight
C speedup.
C* Further micro-optimise.
C When playing with pointers, set this to $2 to fall back to conservative
C indexing in wind-down code.
define(`I',`$1')
define(`rp', `%rdi')
define(`up', `%rsi')
define(`un_param',`%rdx')
define(`vp', `%rcx')
define(`vn', `%r8')
define(`un', `%rbx')
define(`w0', `%r10')
define(`w1', `%r11')
define(`w2', `%r12')
define(`w3', `%r13')
define(`n', `%rbp')
define(`v0', `%r9')
ABI_SUPPORT(DOS64)
ABI_SUPPORT(STD64)
ASM_START()
TEXT
ALIGN(16)
PROLOGUE(mpn_mul_basecase)
FUNC_ENTRY(4)
IFDOS(` mov 56(%rsp), %r8d ')
push %rbx
push %rbp
mov un_param, un C free up rdx
neg un
mov (up), %rax C shared for mul_1 and mul_2
lea (up,un_param,8), up C point at operand end
lea (rp,un_param,8), rp C point at rp
mov (vp), v0 C shared for mul_1 and mul_2
mul v0 C shared for mul_1 and mul_2
test $1, R8(vn)
jz L(do_mul_2)
L(do_mul_1):
test $1, R8(un)
jnz L(m1x1)
L(m1x0):mov %rax, w0 C un = 2, 4, 6, 8, ...
mov %rdx, w1
mov 8(up,un,8), %rax
test $2, R8(un)
jnz L(m110)
L(m100):lea 2(un), n C un = 4, 8, 12, ...
jmp L(m1l0)
L(m110):lea (un), n C un = 2, 6, 10, ...
jmp L(m1l2)
L(m1x1):mov %rax, w1 C un = 1, 3, 5, 7, ...
mov %rdx, w0
test $2, R8(un)
jz L(m111)
L(m101):lea 3(un), n C un = 1, 5, 9, ...
test n, n
js L(m1l1)
mov %rax, -8(rp)
mov %rdx, (rp)
pop %rbp
pop %rbx
FUNC_EXIT()
ret
L(m111):lea 1(un), n C un = 3, 7, 11, ...
mov 8(up,un,8), %rax
jmp L(m1l3)
ALIGN(16) C FIXME
L(m1tp):mov %rdx, w0
add %rax, w1
L(m1l1):mov -16(up,n,8), %rax
adc $0, w0
mul v0
add %rax, w0
mov w1, -24(rp,n,8)
mov -8(up,n,8), %rax
mov %rdx, w1
adc $0, w1
L(m1l0):mul v0
mov w0, -16(rp,n,8)
add %rax, w1
mov %rdx, w0
mov (up,n,8), %rax
adc $0, w0
L(m1l3):mul v0
mov w1, -8(rp,n,8)
mov %rdx, w1
add %rax, w0
mov 8(up,n,8), %rax
adc $0, w1
L(m1l2):mul v0
mov w0, (rp,n,8)
add $4, n
jnc L(m1tp)
L(m1ed):add %rax, w1
adc $0, %rdx
mov w1, I(-8(rp),-24(rp,n,8))
mov %rdx, I((rp),-16(rp,n,8))
dec R32(vn)
jz L(ret2)
lea 8(vp), vp
lea 8(rp), rp
push %r12
push %r13
push %r14
jmp L(do_addmul)
L(do_mul_2):
define(`v1', `%r14')
push %r12
push %r13
push %r14
mov 8(vp), v1
test $1, R8(un)
jnz L(m2b1)
L(m2b0):lea (un), n
xor w0, w0
mov %rax, w2
mov %rdx, w1
jmp L(m2l0)
L(m2b1):lea 1(un), n
xor w1, w1
xor w2, w2
mov %rax, w0
mov %rdx, w3
jmp L(m2l1)
ALIGN(32)
L(m2tp):mul v0
add %rax, w0
mov %rdx, w3
adc $0, w3
L(m2l1):mov -8(up,n,8), %rax
mul v1
add w1, w0
adc $0, w3
add %rax, w2
mov w0, -8(rp,n,8)
mov %rdx, w0
adc $0, w0
mov (up,n,8), %rax
mul v0
add %rax, w2
mov %rdx, w1
adc $0, w1
add w3, w2
L(m2l0):mov (up,n,8), %rax
adc $0, w1
mul v1
mov w2, (rp,n,8)
add %rax, w0
mov %rdx, w2
mov 8(up,n,8), %rax
adc $0, w2
add $2, n
jnc L(m2tp)
L(m2ed):mul v0
add %rax, w0
mov %rdx, w3
adc $0, w3
mov I(-8(up),-8(up,n,8)), %rax
mul v1
add w1, w0
adc $0, w3
add %rax, w2
mov w0, I(-8(rp),-8(rp,n,8))
adc $0, %rdx
add w3, w2
mov w2, I((rp),(rp,n,8))
adc $0, %rdx
mov %rdx, I(8(rp),8(rp,n,8))
add $-2, R32(vn)
jz L(ret5)
lea 16(vp), vp
lea 16(rp), rp
L(do_addmul):
push %r15
push vn C save vn in new stack slot
define(`vn', `(%rsp)')
define(`X0', `%r14')
define(`X1', `%r15')
define(`v1', `%r8')
L(outer):
mov (vp), v0
mov 8(vp), v1
mov (up,un,8), %rax
mul v0
test $1, R8(un)
jnz L(a1x1)
L(a1x0):mov (rp,un,8), X0
xor w0, w0
mov %rdx, w1
test $2, R8(un)
jnz L(a110)
L(a100):lea 2(un), n C un = 4, 8, 12, ...
add %rax, X0
adc $0, w1
mov (up,un,8), %rax
mul v1
mov 8(rp,un,8), X1
jmp L(lo0)
L(a110):lea (un), n C un = 2, 6, 10, ...
xor w3, w3
jmp L(lo2)
L(a1x1):mov (rp,un,8), X1
xor w2, w2
xor w1, w1
test $2, R8(un)
jz L(a111)
L(a101):lea 3(un), n C un = 1, 5, 9, ...
mov %rdx, w3
add %rax, X1
mov (up,un,8), %rax
mov 8(rp,un,8), X0
adc $0, w3
jmp L(top)
L(a111):lea 1(un), n C un = 3, 7, 11, ...
jmp L(lo3)
ALIGN(32)
L(top): mul v1
mov %rdx, w0
add %rax, X0
adc $0, w0
add w1, X1
adc $0, w3
add w2, X0
adc $0, w0
mov -16(up,n,8), %rax
mul v0
add %rax, X0
mov %rdx, w1
adc $0, w1
mov -16(up,n,8), %rax
mul v1
mov X1, -24(rp,n,8)
mov -8(rp,n,8), X1
add w3, X0
adc $0, w1
L(lo0): mov %rdx, w2
mov X0, -16(rp,n,8)
add %rax, X1
adc $0, w2
mov -8(up,n,8), %rax
add w0, X1
adc $0, w2
mul v0
L(lo3): add %rax, X1
mov %rdx, w3
adc $0, w3
mov -8(up,n,8), %rax
mul v1
add w1, X1
mov (rp,n,8), X0
adc $0, w3
mov %rdx, w0
add %rax, X0
adc $0, w0
mov (up,n,8), %rax
mul v0
add w2, X0
mov X1, -8(rp,n,8)
mov %rdx, w1
adc $0, w0
L(lo2): add %rax, X0
adc $0, w1
mov (up,n,8), %rax
add w3, X0
adc $0, w1
mul v1
mov 8(rp,n,8), X1
add %rax, X1
mov %rdx, w2
adc $0, w2
mov 8(up,n,8), %rax
mov X0, (rp,n,8)
mul v0
add w0, X1
mov %rdx, w3
adc $0, w2
add %rax, X1
mov 8(up,n,8), %rax
mov 16(rp,n,8), X0 C useless but harmless in final iter
adc $0, w3
add $4, n
jnc L(top)
L(end): mul v1
add w1, X1
adc $0, w3
add w2, %rax
adc $0, %rdx
mov X1, I(-8(rp),-24(rp,n,8))
add w3, %rax
adc $0, %rdx
mov %rax, I((rp),-16(rp,n,8))
mov %rdx, I(8(rp),-8(rp,n,8))
addl $-2, vn
lea 16(vp), vp
lea 16(rp), rp
jnz L(outer)
pop %rax C deallocate vn slot
pop %r15
L(ret5):pop %r14
pop %r13
pop %r12
L(ret2):pop %rbp
pop %rbx
FUNC_EXIT()
ret
EPILOGUE()
反汇编后的原代码,Intel格式 00000000000008a0 <__gmpn_mul_basecase>:
8a0: 53 push rbx
8a1: 55 push rbp
8a2: 48 89 d3 mov rbx,rdx
8a5: 48 f7 db neg rbx
8a8: 48 8b 06 mov rax,QWORD PTR
8ab: 48 8d 34 d6 lea rsi,
8af: 48 8d 3c d7 lea rdi,
8b3: 4c 8b 09 mov r9,QWORD PTR
8b6: 49 f7 e1 mul r9
8b9: 41 f6 c0 01 test r8b,0x1
8bd: 0f 84 d7 00 00 00 je 99a <__gmpn_mul_basecase+0xfa>
8c3: f6 c3 01 test bl,0x1
8c6: 75 1e jne 8e6 <__gmpn_mul_basecase+0x46>
8c8: 49 89 c2 mov r10,rax
8cb: 49 89 d3 mov r11,rdx
8ce: 48 8b 44 de 08 mov rax,QWORD PTR
8d3: f6 c3 02 test bl,0x2
8d6: 75 06 jne 8de <__gmpn_mul_basecase+0x3e>
8d8: 48 8d 6b 02 lea rbp,
8dc: eb 58 jmp 936 <__gmpn_mul_basecase+0x96>
8de: 48 8d 2b lea rbp,
8e1: e9 7d 00 00 00 jmp 963 <__gmpn_mul_basecase+0xc3>
8e6: 49 89 c3 mov r11,rax
8e9: 49 89 d2 mov r10,rdx
8ec: f6 c3 02 test bl,0x2
8ef: 74 13 je 904 <__gmpn_mul_basecase+0x64>
8f1: 48 8d 6b 03 lea rbp,
8f5: 48 85 ed test rbp,rbp
8f8: 78 1c js 916 <__gmpn_mul_basecase+0x76>
8fa: 48 89 47 f8 mov QWORD PTR ,rax
8fe: 48 89 17 mov QWORD PTR ,rdx
901: 5d pop rbp
902: 5b pop rbx
903: c3 ret
904: 48 8d 6b 01 lea rbp,
908: 48 8b 44 de 08 mov rax,QWORD PTR
90d: eb 3d jmp 94c <__gmpn_mul_basecase+0xac>
90f: 90 nop
910: 49 89 d2 mov r10,rdx
913: 49 01 c3 add r11,rax
916: 48 8b 44 ee f0 mov rax,QWORD PTR
91b: 49 83 d2 00 adc r10,0x0
91f: 49 f7 e1 mul r9
922: 49 01 c2 add r10,rax
925: 4c 89 5c ef e8 mov QWORD PTR ,r11
92a: 48 8b 44 ee f8 mov rax,QWORD PTR
92f: 49 89 d3 mov r11,rdx
932: 49 83 d3 00 adc r11,0x0
936: 49 f7 e1 mul r9
939: 4c 89 54 ef f0 mov QWORD PTR ,r10
93e: 49 01 c3 add r11,rax
941: 49 89 d2 mov r10,rdx
944: 48 8b 04 ee mov rax,QWORD PTR
948: 49 83 d2 00 adc r10,0x0
94c: 49 f7 e1 mul r9
94f: 4c 89 5c ef f8 mov QWORD PTR ,r11
954: 49 89 d3 mov r11,rdx
957: 49 01 c2 add r10,rax
95a: 48 8b 44 ee 08 mov rax,QWORD PTR
95f: 49 83 d3 00 adc r11,0x0
963: 49 f7 e1 mul r9
966: 4c 89 14 ef mov QWORD PTR ,r10
96a: 48 83 c5 04 add rbp,0x4
96e: 73 a0 jae 910 <__gmpn_mul_basecase+0x70>
970: 49 01 c3 add r11,rax
973: 48 83 d2 00 adc rdx,0x0
977: 4c 89 5f f8 mov QWORD PTR ,r11
97b: 48 89 17 mov QWORD PTR ,rdx
97e: 41 ff c8 dec r8d
981: 0f 84 cd 02 00 00 je c54 <__gmpn_mul_basecase+0x3b4>
987: 48 8d 49 08 lea rcx,
98b: 48 8d 7f 08 lea rdi,
98f: 41 54 push r12
991: 41 55 push r13
993: 41 56 push r14
995: e9 ef 00 00 00 jmp a89 <__gmpn_mul_basecase+0x1e9>
99a: 41 54 push r12
99c: 41 55 push r13
99e: 41 56 push r14
9a0: 4c 8b 71 08 mov r14,QWORD PTR
9a4: f6 c3 01 test bl,0x1
9a7: 75 0e jne 9b7 <__gmpn_mul_basecase+0x117>
9a9: 48 8d 2b lea rbp,
9ac: 4d 31 d2 xor r10,r10
9af: 49 89 c4 mov r12,rax
9b2: 49 89 d3 mov r11,rdx
9b5: eb 68 jmp a1f <__gmpn_mul_basecase+0x17f>
9b7: 48 8d 6b 01 lea rbp,
9bb: 4d 31 db xor r11,r11
9be: 4d 31 e4 xor r12,r12
9c1: 49 89 c2 mov r10,rax
9c4: 49 89 d5 mov r13,rdx
9c7: eb 24 jmp 9ed <__gmpn_mul_basecase+0x14d>
9c9: 0f 1f 00 nop DWORD PTR
9cc: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:
9d3: 00 00 00
9d6: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:
9dd: 00 00 00
9e0: 49 f7 e1 mul r9
9e3: 49 01 c2 add r10,rax
9e6: 49 89 d5 mov r13,rdx
9e9: 49 83 d5 00 adc r13,0x0
9ed: 48 8b 44 ee f8 mov rax,QWORD PTR
9f2: 49 f7 e6 mul r14
9f5: 4d 01 da add r10,r11
9f8: 49 83 d5 00 adc r13,0x0
9fc: 49 01 c4 add r12,rax
9ff: 4c 89 54 ef f8 mov QWORD PTR ,r10
a04: 49 89 d2 mov r10,rdx
a07: 49 83 d2 00 adc r10,0x0
a0b: 48 8b 04 ee mov rax,QWORD PTR
a0f: 49 f7 e1 mul r9
a12: 49 01 c4 add r12,rax
a15: 49 89 d3 mov r11,rdx
a18: 49 83 d3 00 adc r11,0x0
a1c: 4d 01 ec add r12,r13
a1f: 48 8b 04 ee mov rax,QWORD PTR
a23: 49 83 d3 00 adc r11,0x0
a27: 49 f7 e6 mul r14
a2a: 4c 89 24 ef mov QWORD PTR ,r12
a2e: 49 01 c2 add r10,rax
a31: 49 89 d4 mov r12,rdx
a34: 48 8b 44 ee 08 mov rax,QWORD PTR
a39: 49 83 d4 00 adc r12,0x0
a3d: 48 83 c5 02 add rbp,0x2
a41: 73 9d jae 9e0 <__gmpn_mul_basecase+0x140>
a43: 49 f7 e1 mul r9
a46: 49 01 c2 add r10,rax
a49: 49 89 d5 mov r13,rdx
a4c: 49 83 d5 00 adc r13,0x0
a50: 48 8b 46 f8 mov rax,QWORD PTR
a54: 49 f7 e6 mul r14
a57: 4d 01 da add r10,r11
a5a: 49 83 d5 00 adc r13,0x0
a5e: 49 01 c4 add r12,rax
a61: 4c 89 57 f8 mov QWORD PTR ,r10
a65: 48 83 d2 00 adc rdx,0x0
a69: 4d 01 ec add r12,r13
a6c: 4c 89 27 mov QWORD PTR ,r12
a6f: 48 83 d2 00 adc rdx,0x0
a73: 48 89 57 08 mov QWORD PTR ,rdx
a77: 41 83 c0 fe add r8d,0xfffffffe
a7b: 0f 84 cd 01 00 00 je c4e <__gmpn_mul_basecase+0x3ae>
a81: 48 8d 49 10 lea rcx,
a85: 48 8d 7f 10 lea rdi,
a89: 41 57 push r15
a8b: 41 50 push r8
a8d: 4c 8b 09 mov r9,QWORD PTR
a90: 4c 8b 41 08 mov r8,QWORD PTR
a94: 48 8b 04 de mov rax,QWORD PTR
a98: 49 f7 e1 mul r9
a9b: f6 c3 01 test bl,0x1
a9e: 75 36 jne ad6 <__gmpn_mul_basecase+0x236>
aa0: 4c 8b 34 df mov r14,QWORD PTR
aa4: 4d 31 d2 xor r10,r10
aa7: 49 89 d3 mov r11,rdx
aaa: f6 c3 02 test bl,0x2
aad: 75 1c jne acb <__gmpn_mul_basecase+0x22b>
aaf: 48 8d 6b 02 lea rbp,
ab3: 49 01 c6 add r14,rax
ab6: 49 83 d3 00 adc r11,0x0
aba: 48 8b 04 de mov rax,QWORD PTR
abe: 49 f7 e0 mul r8
ac1: 4c 8b 7c df 08 mov r15,QWORD PTR
ac6: e9 9b 00 00 00 jmp b66 <__gmpn_mul_basecase+0x2c6>
acb: 48 8d 2b lea rbp,
ace: 4d 31 ed xor r13,r13
ad1: e9 eb 00 00 00 jmp bc1 <__gmpn_mul_basecase+0x321>
ad6: 4c 8b 3c df mov r15,QWORD PTR
ada: 4d 31 e4 xor r12,r12
add: 4d 31 db xor r11,r11
ae0: f6 c3 02 test bl,0x2
ae3: 74 19 je afe <__gmpn_mul_basecase+0x25e>
ae5: 48 8d 6b 03 lea rbp,
ae9: 49 89 d5 mov r13,rdx
aec: 49 01 c7 add r15,rax
aef: 48 8b 04 de mov rax,QWORD PTR
af3: 4c 8b 74 df 08 mov r14,QWORD PTR
af8: 49 83 d5 00 adc r13,0x0
afc: eb 22 jmp b20 <__gmpn_mul_basecase+0x280>
afe: 48 8d 6b 01 lea rbp,
b02: e9 7d 00 00 00 jmp b84 <__gmpn_mul_basecase+0x2e4>
b07: 0f 1f 44 00 00 nop DWORD PTR
b0c: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:
b13: 00 00 00
b16: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:
b1d: 00 00 00
b20: 49 f7 e0 mul r8
b23: 49 89 d2 mov r10,rdx
b26: 49 01 c6 add r14,rax
b29: 49 83 d2 00 adc r10,0x0
b2d: 4d 01 df add r15,r11
b30: 49 83 d5 00 adc r13,0x0
b34: 4d 01 e6 add r14,r12
b37: 49 83 d2 00 adc r10,0x0
b3b: 48 8b 44 ee f0 mov rax,QWORD PTR
b40: 49 f7 e1 mul r9
b43: 49 01 c6 add r14,rax
b46: 49 89 d3 mov r11,rdx
b49: 49 83 d3 00 adc r11,0x0
b4d: 48 8b 44 ee f0 mov rax,QWORD PTR
b52: 49 f7 e0 mul r8
b55: 4c 89 7c ef e8 mov QWORD PTR ,r15
b5a: 4c 8b 7c ef f8 mov r15,QWORD PTR
b5f: 4d 01 ee add r14,r13
b62: 49 83 d3 00 adc r11,0x0
b66: 49 89 d4 mov r12,rdx
b69: 4c 89 74 ef f0 mov QWORD PTR ,r14
b6e: 49 01 c7 add r15,rax
b71: 49 83 d4 00 adc r12,0x0
b75: 48 8b 44 ee f8 mov rax,QWORD PTR
b7a: 4d 01 d7 add r15,r10
b7d: 49 83 d4 00 adc r12,0x0
b81: 49 f7 e1 mul r9
b84: 49 01 c7 add r15,rax
b87: 49 89 d5 mov r13,rdx
b8a: 49 83 d5 00 adc r13,0x0
b8e: 48 8b 44 ee f8 mov rax,QWORD PTR
b93: 49 f7 e0 mul r8
b96: 4d 01 df add r15,r11
b99: 4c 8b 34 ef mov r14,QWORD PTR
b9d: 49 83 d5 00 adc r13,0x0
ba1: 49 89 d2 mov r10,rdx
ba4: 49 01 c6 add r14,rax
ba7: 49 83 d2 00 adc r10,0x0
bab: 48 8b 04 ee mov rax,QWORD PTR
baf: 49 f7 e1 mul r9
bb2: 4d 01 e6 add r14,r12
bb5: 4c 89 7c ef f8 mov QWORD PTR ,r15
bba: 49 89 d3 mov r11,rdx
bbd: 49 83 d2 00 adc r10,0x0
bc1: 49 01 c6 add r14,rax
bc4: 49 83 d3 00 adc r11,0x0
bc8: 48 8b 04 ee mov rax,QWORD PTR
bcc: 4d 01 ee add r14,r13
bcf: 49 83 d3 00 adc r11,0x0
bd3: 49 f7 e0 mul r8
bd6: 4c 8b 7c ef 08 mov r15,QWORD PTR
bdb: 49 01 c7 add r15,rax
bde: 49 89 d4 mov r12,rdx
be1: 49 83 d4 00 adc r12,0x0
be5: 48 8b 44 ee 08 mov rax,QWORD PTR
bea: 4c 89 34 ef mov QWORD PTR ,r14
bee: 49 f7 e1 mul r9
bf1: 4d 01 d7 add r15,r10
bf4: 49 89 d5 mov r13,rdx
bf7: 49 83 d4 00 adc r12,0x0
bfb: 49 01 c7 add r15,rax
bfe: 48 8b 44 ee 08 mov rax,QWORD PTR
c03: 4c 8b 74 ef 10 mov r14,QWORD PTR
c08: 49 83 d5 00 adc r13,0x0
c0c: 48 83 c5 04 add rbp,0x4
c10: 0f 83 0a ff ff ff jae b20 <__gmpn_mul_basecase+0x280>
c16: 49 f7 e0 mul r8
c19: 4d 01 df add r15,r11
c1c: 49 83 d5 00 adc r13,0x0
c20: 4c 01 e0 add rax,r12
c23: 48 83 d2 00 adc rdx,0x0
c27: 4c 89 7f f8 mov QWORD PTR ,r15
c2b: 4c 01 e8 add rax,r13
c2e: 48 83 d2 00 adc rdx,0x0
c32: 48 89 07 mov QWORD PTR ,rax
c35: 48 89 57 08 mov QWORD PTR ,rdx
c39: 83 04 24 fe add DWORD PTR ,0xfffffffe
c3d: 48 8d 49 10 lea rcx,
c41: 48 8d 7f 10 lea rdi,
c45: 0f 85 42 fe ff ff jne a8d <__gmpn_mul_basecase+0x1ed>
c4b: 58 pop rax
c4c: 41 5f pop r15
c4e: 41 5e pop r14
c50: 41 5d pop r13
c52: 41 5c pop r12
c54: 5d pop rbp
c55: 5b pop rbx
c56: c3 ret
gxqcn
发表于 2020-2-24 09:11:51
liangbch 发表于 2020-2-23 20:42
实际测试了一下,64位版本的效能大约为32位版本的2.6倍左右,比想象的要低一些。
我的电脑是Intel(R) Xeo ...谢谢!非常感谢!这样,我对 x64 平台的性能有个预期目标了。
其实,从测试数据来看,2.6 倍已经是非常不错了(甚至好于预期),如果不是 OS 处理的差异,仅就算法层面来说。
在 gmp-man-6.2.0.pdf 中:The overheads grow as O(Nr), whereas in an r = 2k FFT they grow only as
O(N log r).这里的对数,可以当成以 2 为底的对数。
以比例最大的 3.55 为例(185364*185364),
\(L=185364, b=32\)
\(N(b)=L/b=5792.625\)
\(r(b)=log_2^{N(b)}=12.500\)
\(f(b)=N(b) r(b)\)
\(\D \text{ratio}=\frac{f(b)}{f(2b)}=\frac{2 r(b)}{r(b)-1}=2.174\)
替换 \(L=4194304\),可得对应的 \(\text{ratio}=2*22/(22-1)=2.095\)
从以上粗略的推导来看,
如果算法同处于 FFT 级别,随着 \(L\) 的增大, \(\text{ratio}\) 将递减,但下限始终在 \(2.0\) 之上,
而测试数据也正好印证了此结论。
gxqcn
发表于 2020-2-24 09:44:40
happysxyf 发表于 2020-2-19 13:45
有啥新的进展没?
我最近在搭框架,
已声明了一些基础操作需要的函数,尤其是一些运算符的重载接口。
比如:构造、赋值、输出、比较、四则运算、位运算的等接口。
从已完成的部分来看,比十多年前的接口更合理。
比如命名方式:参照 stl 的命名方式,全小写加下划线。
其中,四则运算,仅完成了 +、- 部分(尚待优化);*/% 还未开始;&^| 位运算仅声明。
余下的比较重要的是:乘法部分的实现,以及 C++11 线程池的支持,
前者已有积累,后者需探索一下。
在完成四则运算之后,才会继续去考虑模幂算法,探索大整数素性检测;
而后再加上一些数论或整数函数,这样才初见雏形。
页:
1
2
3
4
5
[6]
7
8
9
10
11
12