找回密码
 欢迎注册
查看: 55310|回复: 28

[原创] 【日积月累】优化小技巧

[复制链接]
发表于 2010-12-13 13:01:09 | 显示全部楼层 |阅读模式

马上注册,结交更多好友,享用更多功能,让你轻松玩转社区。

您需要 登录 才可以下载或查看,没有账号?欢迎注册

×
大家都来说说吧.........,我看到哪,说到哪...... 一.消除、减少分支的数量
精华
通过应用setcc指令或奔腾处理器的条件转移指令(CMOV 或 FCMOVE)来削除分支。 比如下面这个c语言的例子:
  1. ebx = (A < B) ? CONST1 : CONST2;
复制代码
上面这个代码条件比较两个值,A 和 B. 如果为真(A
  1. cmp A, B ; condition
  2. jge L30 ; conditional branch
  3. mov ebx, CONST1
  4. jmp L31 ; unconditional branch
  5. L30:
  6. mov ebx, CONST2
  7. L31:
复制代码对于上面的汇编,我们可以用setcc指令来替换jge,消除分支,以达到优化的目的,程序如下:
  1. xor ebx, ebx ;clear ebx
  2. cmp A, B
  3. setge bl ;When ebx = 0 or 1
  4. ;OR the complement condition
  5. dec ebx ;ebx=00...00 or 11...11
  6. and ebx, (CONST2-CONST1) ;ebx=0 or(CONST2-CONST1)
  7. add ebx, min(CONST1,CONST2) ;ebx=CONST1 or CONST2
复制代码
注意一点,min(CONST1,CONST1)由常量CONST1,CONST2已定,比如min(6,8)=6; 当abs(CONST1-CONST2)为1,2,4,8时,另一种有趣的实现如下:
  1. xor ebx, ebx
  2. mov eax, A
  3. cmp eax, B
  4. setge bl
  5. lea ebx, [ebx*(CONST2-CONST1)+CONST1]
复制代码
部分机器码如下:
  1. ;=======================================================
  2. 00401000 >/\$ A1 00304000 MOV EAX,DWORD PTR DS:[403000]
  3. 00401005 |. 3B05 04304000 CMP EAX,DWORD PTR DS:[403004]
  4. 0040100B |. 7D 07 JGE SHORT good.00401014
  5. 0040100D |. BB 06000000 MOV EBX,6
  6. 00401012 |. EB 05 JMP SHORT good.00401019
  7. 00401014 |> BB 0E000000 MOV EBX,0E
  8. ;=======================================================
  9. 00401028 |. 33DB XOR EBX,EBX
  10. 0040102A |. A1 00304000 MOV EAX,DWORD PTR DS:[403000]
  11. 0040102F |. 3B05 04304000 CMP EAX,DWORD PTR DS:[403004]
  12. 00401035 |. 0F9CC3 SETL BL
  13. 00401038 |. 4B DEC EBX
  14. 00401039 |. 83E3 08 AND EBX,8
  15. 0040103C |. 83C3 06 ADD EBX,6
  16. ;=======================================================
  17. 0040104E |. 33DB XOR EBX,EBX
  18. 00401050 |. A1 00304000 MOV EAX,DWORD PTR DS:[403000]
  19. 00401055 |. 3B05 04304000 CMP EAX,DWORD PTR DS:[403004]
  20. 0040105B |. 0F9DC3 SETGE BL
  21. 0040105E |. 8D1CDD 06000000 LEA EBX,DWORD PTR DS:[EBX*8+6]
  22. ;=======================================================
复制代码
再细化,比如对于上面的程序块,通过填充其他指令或nop,实现对齐,以达到性能峰值。 特别是对于核心模块。
  1. xor ebx, ebx ;clear ebx
  2. mov eax,A
  3. cmp eax, B
  4. setl bl ;When ebx = 0 or 1
  5. ;OR the complement condition
  6. dec ebx ;ebx=00...00 or 11...11
  7. and ebx, (CONST2-CONST1) ;ebx=0 or(CONST2-CONST1)
  8. add ebx, CONST1 ;ebx=CONST1 or CONST2
复制代码
做到8字节对齐:
  1. xor ebx, ebx ;clear ebx
  2. mov eax,A
  3. nop
  4. nop
  5. cmp eax, B
  6. setl bl ;When ebx = 0 or 1
  7. ;OR the complement condition
  8. dec ebx ;ebx=00...00 or 11...11
  9. and ebx, (CONST2-CONST1) ;ebx=0 or(CONST2-CONST1)
  10. nop
  11. add ebx, CONST1 ;ebx=CONST1 or CONST2
复制代码
//----------------------------------------------------- 第二种方法是用"新"的指令CMOV,FCMOV来消除分支 比如测试ecx是否为0
  1. test ecx,ecx
  2. jne L20
  3. mov eax,ebx
  4. L20:
复制代码
可以换成:
  1. test ecx, ecx ; test the flags
  2. cmoveq eax, ebx ; if the equal flag is set, move ebx to eax
  3. L20:
复制代码
指令参考:http://www.cnblogs.com/itjin45/archive/2009/09/03/1559129.html 测试程序:
  1. .386
  2. .model flat,stdcall
  3. include user32.inc
  4. include kernel32.inc
  5. include msvcrt.inc
  6. includelib user32.lib
  7. includelib kernel32.lib
  8. includelib msvcrt.lib
  9. CONST1 equ 6
  10. CONST2 equ 14 ;7,8,10,14
  11. .data
  12. A dd 5 ;10
  13. B dd 7
  14. fmt db ' %d',0
  15. szPause db 'Pause',0
  16. .code
  17. ;****************************
  18. start:
  19. ;----------------------------
  20. ;ebx = (A < B) ? CONST1 : CONST2;
  21. ;**********法一**************
  22. mov eax,A
  23. cmp eax,B
  24. jge L30
  25. mov ebx,CONST1
  26. jmp L31
  27. L30:
  28. mov ebx,CONST2
  29. L31:
  30. invoke crt_printf,offset fmt,ebx
  31. ;**********法二**************
  32. xor ebx, ebx ;clear ebx
  33. mov eax,A
  34. cmp eax, B
  35. setl bl ;When ebx = 0 or 1
  36. ;OR the complement condition
  37. dec ebx ;ebx=00...00 or 11...11
  38. and ebx, (CONST2-CONST1) ;ebx=0 or(CONST2-CONST1)
  39. add ebx, CONST1 ;ebx=CONST1 or CONST2
  40. invoke crt_printf,offset fmt,ebx
  41. ;**********法三**************
  42. xor ebx, ebx
  43. mov eax, A
  44. cmp eax, B
  45. setge bl ; or the complement condition
  46. lea ebx, [ebx*(CONST2-CONST1)+CONST1]
  47. invoke crt_printf,offset fmt,ebx
  48. invoke crt_system,offset szPause
  49. invoke ExitProcess,0
  50. end start
复制代码
毋因群疑而阻独见  毋任己意而废人言
毋私小惠而伤大体  毋借公论以快私情
 楼主| 发表于 2010-12-13 16:15:22 | 显示全部楼层
下面这段文字很好理解,我就不翻译了,不经意看到gotoblas的kernel中提到了这个: ATHLON performance tips Now some information how to achieve peak performance on Athlon: (1) The first and most important thing is that you must put three x86 instructions into packages of exactly 8 bytes to make the decoders run as smooth as possible. Additionally these packages must be 8 byte aligned. If one of these packages for example consists of three instruction but only 7 bytes, you can use the REP prefix as natural code filler. If you already have two (longer) instructions in one package and your next does not fit, use a suitable neutral x87 instruction ("nop", "fnop" see Athlon manual) as code filler and move your instruction into the next block. Here some examples: bad! 20 00000010 DD4038 fld qword [eax+7*8] 21 00000013 D8C9 fmul st0,st1 22 00000015 D8C1 fadd st0,st1 good! 26 00000020 DD4038 fld qword [eax+7*8] 27 00000023 D8C9 fmul st0,st1 28 00000025 F3 db 0F3H 29 00000026 D8C1 fadd st0,st1 good! 33 00000030 DC4C1818 fmul qword [eax+ebx+3*8] 34 00000034 DC4020 fadd qword [eax+4*8] 35 00000037 90 nop (2) To keep instructions short and to achieve FP peak performance you MUST use 8 bit immediates only when operating with memory operands. Otherwise it is not possible to put 1 fadd + 1 fmul + 1fld/fst into a 8 byte package. Example: bad! 39 00000040 DD8038020000 fld qword [eax+71*8] 40 00000046 D8C9 fmul st0,st1 41 00000048 D8C1 fadd st0,st1 (3) In praxis it is not possible to achieve peak in 3dnow! because 3dnow! instructions are to long. Examples 45 00000050 0F6F01 movq mm0,[ecx] 46 00000053 0F0FC1B4 pfmul mm0,mm1 47 00000057 0F0FD09E pfadd mm2,mm0 (4) Athlon's prefetch instructions are not compatible with Intel's. Athlon can only prefetch into L1 cache and it is prefetch=prefetchnta=prefetcht0=prefetcht1=prefetcht2 (5) Athlon instruction scheduler is very aggressive and register renaming is very intelligent. Let them do most work. It is not problem to initiate the calculation of a dependency chain like b=b*a and c=c+b in one cycle. Additionally you can use 'a' in the next cycle without waiting that b=b*a is complete, because Athlon's register renaming makes a copy of 'a' for you internally. Example: Assuming our stack looks already like this: c0 c1 c2 c3 a0 <-top If we want to calculate the following: c0=c0+a0*b0 c1=c1+a0*b1 c2=c2+a0*b2 c3=c3+a0*b3 c0=c0+a0*b4 c1=c1+a0*b5 [...] .. the pseudo assembly code would look like this (with stack view) fld b0 c0c1c2c3a0b0 fmul b0,a0 c0c1c2c3a0b0 faddp c0,b0 c0c1c2c3a0 fld b1 c0c1c2c3a0b1 fmul b1,a0 c0c1c2c3a0b1 faddp c1,b1 c0c1c2c3a0 [...] this code runs with peak performance which is clear when you take a look on how the instructions are scheduled by the CPU cycle 0 >| | fld oo fmul oooo faddp oooo <- c0 calculated fld oo fmul oooo <- uses copy of original 'a0' faddp oooo <- c1 calculated fld oo fmul oooo faddp oooo <- c2 calculated fld oo fmul oooo faddp oooo <- c3 calculated fld oo fmul oooo faddp oooo |<- in this cycle c0 is free again (see above) and exactly here the new c0 calculation starts As you see this code needs only 6 stack registers to achieve peak performance.
毋因群疑而阻独见  毋任己意而废人言
毋私小惠而伤大体  毋借公论以快私情
 楼主| 发表于 2010-12-13 17:27:48 | 显示全部楼层
一直没找到一个比较好的处理器中文资料,今天总算找到了一个,再结合Intel文档,很不错。 Intel Nehalem-EP首发深度评测(一)   Intel Nehalem-EP首发深度评测(二)  (***)  强CPU20倍!正睿Tesla GPU计算系统评测! Intel 32nm Westmere-EP处理器首发评测
毋因群疑而阻独见  毋任己意而废人言
毋私小惠而伤大体  毋借公论以快私情
发表于 2010-12-13 21:13:02 | 显示全部楼层
这个帖子不错,也比较耐读。
毋因群疑而阻独见  毋任己意而废人言
毋私小惠而伤大体  毋借公论以快私情
发表于 2010-12-14 10:56:10 | 显示全部楼层
Mark, 等有空儿时研读研读
毋因群疑而阻独见  毋任己意而废人言
毋私小惠而伤大体  毋借公论以快私情
发表于 2010-12-14 12:26:31 | 显示全部楼层
我写了4个求一个数组最大值最小值的函数。 版本1:getMaxMin1, 普通写法, 版本2:getMaxMin_asm,使用C中嵌入汇编的用法,采用无跳转的cmovb 和 cmova 条件赋值指令 版本3: getMaxMin_MT2, 2线程求最大值最小值算法 版本4:getMaxMin_MT4,2线程求最大值最小值算法 CPU:Intel Core 2 Duo E8500 双核CPU, L2 cache 6M. OS: Windows XP, 编译器 VC++6.0, release 模式编译 从运行结果来看,发现 cmov指令并没有提高性能,应该说intel的分支预测效果很好。下面是运行结果
  1. getMaxMin1 spend 0.006152 second on search max and min in 4194304 number
  2. min=458,max=2147450654, from getMaxMin1
  3. getMaxMin_asm spend 0.006186 second on search max and min in 4194304 number
  4. min=458,max=2147450654, from getMaxMin1
  5. getMaxMin_MT2 spend 0.003353 second on search max and min in 4194304 number
  6. min=458,max=2147450654, from getMaxMin1
  7. getMaxMin_MT4 spend 0.003441 second on search max and min in 4194304 number
  8. min=458,max=2147450654, from getMaxMin1
复制代码
毋因群疑而阻独见  毋任己意而废人言
毋私小惠而伤大体  毋借公论以快私情
发表于 2010-12-14 12:27:50 | 显示全部楼层
这里给出C版和汇编版
  1. void getMaxMin1(DWORD *pData,int len,DWORD *min,DWORD *max)
  2. {
  3. int i;
  4. *min=~0;
  5. *max=0;
  6. for (i=0;i<len;i++)
  7. {
  8. if ( pData[i] > *max)
  9. *max=pData[i];
  10. if (pData[i]< *min)
  11. *min=pData[i];
  12. }
  13. }
  14. void getMaxMin_asm(DWORD *pData,int len,DWORD *min,DWORD *max)
  15. {
  16. DWORD _min, _max;
  17. __asm
  18. {
  19. mov esi, pData
  20. mov ecx, len
  21. lea edi, [esi+ecx*4]
  22. mov eax, 0xffffffff //min
  23. mov edx, 0 //max
  24. loop_start:
  25. cmp esi, edi
  26. jge next10
  27. cmp [esi], eax
  28. cmovb eax,[esi]
  29. cmp [esi], edx
  30. cmova edx,[esi]
  31. add esi, 4
  32. jmp loop_start
  33. next10:
  34. mov _min, eax
  35. mov _max, edx
  36. }
  37. *min=_min;
  38. *max=_max;
  39. }
复制代码
毋因群疑而阻独见  毋任己意而废人言
毋私小惠而伤大体  毋借公论以快私情
 楼主| 发表于 2010-12-14 13:12:00 | 显示全部楼层
有点意思,发现liangbch 与我的cpu一个型号,呵呵,我对第一个消除分支的测试了一下,发现有效果。(前后顺序不存在依赖关系) 结果:
  1. Elapsed time:0.124000 s
  2. Elapsed time:0.094000 s
  3. Elapsed time:0.094000 s
复制代码
测试代码:
  1. #include <stdio.h>
  2. #include <time.h>
  3. int main()
  4. {
  5. int i,j,ebx;
  6. double t1,t2,t3;
  7. //测试1#################
  8. j=2009;
  9. t1=clock();
  10. for(i=0;i<99990000;i+=2,j-=10)
  11. {
  12. ebx = (i < j) ? 6 : 8;
  13. }
  14. printf("Elapsed time: %f s\n",(clock()-t1)/CLOCKS_PER_SEC);
  15. //测试2#################
  16. j=2009;
  17. t2=clock();
  18. for(i=0;i<99990000;i+=2,j-=10)
  19. {
  20. // GetMaxAsm(i,j);
  21. __asm
  22. {
  23. xor ebx, ebx
  24. mov eax, i
  25. cmp eax, j
  26. setl bl
  27. dec ebx
  28. and ebx, 2
  29. add ebx, 6
  30. }
  31. }
  32. printf("Elapsed time: %f s\n",(clock()-t2)/CLOCKS_PER_SEC);
  33. //测试3#################
  34. j=2009;
  35. t3=clock();
  36. for(i=0;i<99990000;i+=2,j-=10)
  37. {
  38. // GetMaxAsm(i,j);
  39. __asm
  40. {
  41. xor ebx, ebx
  42. mov eax, i
  43. cmp eax, j
  44. setge bl
  45. lea ebx, [ebx*2+6]
  46. }
  47. }
  48. printf("Elapsed time: %f s\n",(clock()-t3)/CLOCKS_PER_SEC);
  49. system("Pause");
  50. return 0;
  51. }
复制代码
毋因群疑而阻独见  毋任己意而废人言
毋私小惠而伤大体  毋借公论以快私情
发表于 2010-12-14 15:06:24 | 显示全部楼层
刚刚查了下Intel CPU文档,发现可用SSE2指令 PMINUD, PMAXUD 求无符号整数最大最小值,用指令PMINSD, PMAXSD 求带符号整数最大最小值,有时间实现一下,看看速度如何。
毋因群疑而阻独见  毋任己意而废人言
毋私小惠而伤大体  毋借公论以快私情
发表于 2010-12-14 15:33:30 | 显示全部楼层
PMINUD, PMAXUD 不是 SSE2 指令吧?应该是 SSE4 指令了。
毋因群疑而阻独见  毋任己意而废人言
毋私小惠而伤大体  毋借公论以快私情
您需要登录后才可以回帖 登录 | 欢迎注册

本版积分规则

小黑屋|手机版|数学研发网 ( 苏ICP备07505100号 )

GMT+8, 2024-11-25 09:03 , Processed in 0.027814 second(s), 17 queries .

Powered by Discuz! X3.5

© 2001-2024 Discuz! Team.

快速回复 返回顶部 返回列表