Original poster: 无心人

[Challenge] Big-number arithmetic: base selection test

Original poster | Posted on 2008-4-15 09:52:42


liangbch's code does have some genuinely efficient parts.
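Do not let the doubts of the crowd block your own insight; do not let your own opinion dismiss what others say.
Do not harm the greater good for a small private favor; do not use public opinion to vent private feelings.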
Posted on 2011-11-16 19:48:27
51# liangbch

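Here are the results on an Intel dual-core E8500. Compared with the PIV, the clock frequency has gone up only modestly, yet integer multiplication is far faster, while the SSE2 and MMX instructions have improved by a comparatively small margin.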

Function                  E8500 (3.16 GHz)
UInt480x480_ALU           310.890 ms
UInt480x480_ALU2          385.515 ms

UInt480x480_MMX           304.725 ms
UInt480x480_MMX2          353.785 ms
UInt480x480_MMX3          350.830 ms

UInt480x480_base30_ALU    275.235 ms
UInt480x480_base30_MMX    260.435 ms
UInt480x480_base30_SSE2   197.840 ms
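
As a rough illustration of why a 30-bit base can pay off, here is a minimal C sketch. It assumes that "base30" in the function names above means radix 2^30, so a 480-bit number is stored as 16 limbs of 30 bits each. The name mul480_base30 and the limb layout are hypothetical and are not taken from the benchmark package. With 30-bit limbs, each partial product is below 2^60, so a whole column of up to 16 products plus the incoming carry still fits in a 64-bit accumulator, and the carry only has to be handled once per column.

```c
#include <stddef.h>
#include <stdint.h>

#define LIMBS     16                        /* 16 limbs x 30 bits = 480 bits */
#define LIMB_BITS 30
#define LIMB_MASK ((1u << LIMB_BITS) - 1u)

/* Hypothetical sketch: r = a * b.
   All arrays are little-endian and every input limb must be < 2^30. */
void mul480_base30(const uint32_t a[LIMBS], const uint32_t b[LIMBS],
                   uint32_t r[2 * LIMBS])
{
    uint64_t acc = 0;                       /* column sum plus carry-in */

    for (size_t k = 0; k < 2 * LIMBS - 1; k++) {
        size_t lo = (k < LIMBS) ? 0 : k - (LIMBS - 1);
        size_t hi = (k < LIMBS) ? k : LIMBS - 1;

        /* At most 16 products, each < 2^60, plus a carry < 2^34:
           the total stays below 2^64, so acc cannot overflow. */
        for (size_t i = lo; i <= hi; i++)
            acc += (uint64_t)a[i] * b[k - i];

        r[k] = (uint32_t)(acc & LIMB_MASK); /* low 30 bits are the result limb */
        acc >>= LIMB_BITS;                  /* the rest carries into column k+1 */
    }
    r[2 * LIMBS - 1] = (uint32_t)acc;       /* final carry is the top limb */
}
```

With full 32-bit limbs the same column sum would need more than 64 bits, which is one plausible reason for trading two bits per limb for simpler carry handling.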
Posted on 2011-11-16 20:42:38
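Integer multiplication on the Intel dual-core E8500 really has improved a great deal; it beats the Intel Core i5. Keeping multiple versions is a must, heh.
Tested with the package from the post above: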
UInt480x480_ALU           1392.100 ms
UInt480x480_ALU2          1919.780 ms

UInt480x480_MMX            728.820 ms

UInt480x480_base30_ALU     932.190 ms
UInt480x480_base30_MMX     629.200 ms
UInt480x480_base30_SSE2    663.225 ms
Posted on 2013-6-8 23:13:08
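In theory, a CPU has multiple execution units, so reordering instructions to reduce dependencies between them should improve speed. However, my earlier experiments all showed that reordering instructions to increase parallelism has little effect, and I do not know why.

  Another proven way to optimize a loop is to reduce the control instr ...
liangbch, posted on 2008-4-11 10:41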

1. "All modern x86 processors can execute instructions out of order." This means the CPU already reorders instructions by itself, so reordering them by hand usually has little effect.
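  ; Out-of-order execution
  mov eax, [mem1]    ; load; stalls if [mem1] is not in the cache
  imul eax, 6        ; depends on the load above
  mov [mem2], eax    ; depends on the imul
  mov ebx, [mem3]    ; independent of the three instructions above
  add ebx, 2         ; depends only on the load of [mem3]
  mov [mem4], ebx    ; depends on the add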
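With out-of-order execution, if [mem1] is not in the cache, the imul cannot proceed. The fourth instruction, mov ebx, [mem3], does not depend on any of the earlier instructions, so it has already started executing in the meantime.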
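
To make the effect concrete, here is a toy scheduling model in C for the six instructions above. It is only a sketch: the latencies (200 cycles for the cache miss on [mem1], 4 for a cache hit, 3 for imul, 1 for the rest) are assumed numbers chosen for illustration, and the Insn type and schedule function are hypothetical names. It also ignores issue width and execution-port limits. Each instruction lists the one earlier instruction it depends on, and the model computes the earliest cycle at which it can start, either with out-of-order execution or with a strict in-order pipeline where a stalled instruction blocks everything behind it.

```c
#include <stddef.h>

typedef struct {
    int dep;      /* index of the earlier instruction whose result is needed, or -1 */
    int latency;  /* ASSUMED cycles until this instruction's result is available */
} Insn;

/* The six instructions from the listing, with illustrative latencies. */
static const Insn example[6] = {
    { -1, 200 },  /* mov eax, [mem1]   (assumed cache miss) */
    {  0,   3 },  /* imul eax, 6 */
    {  1,   1 },  /* mov [mem2], eax */
    { -1,   4 },  /* mov ebx, [mem3]   (assumed cache hit) */
    {  3,   1 },  /* add ebx, 2 */
    {  4,   1 },  /* mov [mem4], ebx */
};

/* Fill start[i] with the earliest start cycle of instruction i.
   If in_order is nonzero, an instruction may not start before the
   previous instruction in program order has started. */
void schedule(const Insn *code, size_t n, int in_order, int *start)
{
    for (size_t i = 0; i < n; i++) {
        int t = 0;
        if (code[i].dep >= 0)                       /* wait for the producer */
            t = start[code[i].dep] + code[code[i].dep].latency;
        if (in_order && i > 0 && start[i - 1] > t)  /* blocked by a stalled predecessor */
            t = start[i - 1];
        start[i] = t;
    }
}
```

Under these assumed latencies, the out-of-order model starts mov ebx, [mem3] at cycle 0, while the in-order model holds it back until cycle 203, behind the stalled imul chain. Writing the second chain first in the source would not change the out-of-order result, which is consistent with the observation that manual reordering gains little.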

2. Register renaming
  mov eax, [mem1]
  imul eax, 6
  mov [mem2], eax
  mov eax, [mem3]    ; reuses eax, but needs nothing from the lines above
  add eax, 2
  mov [mem4], eax
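"the CPU is able to use different physical registers for the same logical register eax." So the eax we see is only a logical name. Inside the CPU it is renamed to different physical registers, so reusing eax does not cause interference during out-of-order execution.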
This means that the above code is changed inside the CPU to a code that uses four different physical registers for eax. The first register is used for the value loaded from [mem1]. The second register is used for the output of the imul instruction. The third register is used for the value loaded from [mem3]. And the fourth register is used for the output of the add instruction. The use of different physical registers for the same logical register enables the CPU to make the last three instructions in the example independent of the first three instructions.

The CPU must have a lot of physical registers for this mechanism to work efficiently. The number of physical registers is different for different microprocessors, but you can generally assume that the number is sufficient for quite a lot of instruction reordering.
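
The renaming described in the quoted text can be sketched with a toy rename table in C. This is a simplified model under stated assumptions, not how any particular CPU implements it: every write to a logical register simply takes the next unused physical register, and physical registers are never recycled. The names RenameTable, rename_init, rename_read, rename_write and renamed_eax_registers are all hypothetical.

```c
enum { EAX, EBX, NUM_LOGICAL };

typedef struct {
    int map[NUM_LOGICAL];  /* logical register -> current physical register */
    int next_free;         /* next unused physical register (toy free list) */
} RenameTable;

void rename_init(RenameTable *rt)
{
    for (int i = 0; i < NUM_LOGICAL; i++)
        rt->map[i] = i;               /* initial architectural state */
    rt->next_free = NUM_LOGICAL;
}

/* A read uses whatever physical register currently holds the logical one. */
int rename_read(const RenameTable *rt, int logical)
{
    return rt->map[logical];
}

/* A write allocates a fresh physical register for the logical one. */
int rename_write(RenameTable *rt, int logical)
{
    rt->map[logical] = rt->next_free++;
    return rt->map[logical];
}

/* Walk the six instructions of the listing and record the physical
   register allocated by each write to eax. Returns how many were used. */
int renamed_eax_registers(int out[4])
{
    RenameTable rt;
    rename_init(&rt);
    out[0] = rename_write(&rt, EAX);  /* mov eax, [mem1] */
    (void)rename_read(&rt, EAX);      /* imul eax, 6 reads the loaded value */
    out[1] = rename_write(&rt, EAX);  /* ... and writes its product */
    (void)rename_read(&rt, EAX);      /* mov [mem2], eax */
    out[2] = rename_write(&rt, EAX);  /* mov eax, [mem3] */
    (void)rename_read(&rt, EAX);      /* add eax, 2 reads the second load */
    out[3] = rename_write(&rt, EAX);  /* ... and writes the sum */
    (void)rename_read(&rt, EAX);      /* mov [mem4], eax */
    return rt.next_free - NUM_LOGICAL;
}
```

In this model the walk allocates four distinct physical registers for eax, matching the count in the quoted text. The load from [mem3] gets a register that nothing in the first three instructions reads or writes, which is why the second chain can run independently even though the source reuses the name eax.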