Thread starter: G-Spider

[Original] [Accumulating Day by Day] Small optimization tricks

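Posted 2010-12-14 16:45:59
Sorry. The instruction above is indeed an SSE4 instruction; it belongs to SSE4.1. Before posting I tried to find out from the documents I had at hand which instruction set PMINUD belongs to, but found no answer, so I guessed SSE2 and posted on that basis. After the reminder in the post above I googled it and found that it is indeed an SSE4 instruction. See http://en.wikipedia.org/wiki/SSE4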
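Do not let the doubts of the crowd block your own insight; do not let your own will dismiss what others say.
Do not harm the greater good for a small private favor; do not use public opinion to gratify private feelings.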
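Posted 2010-12-14 16:50:30
If SSE2 had this instruction, some of the assembly in HugeCalc would be more compact. That is why your post caught my attention and left me a little puzzled.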
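On an SSE2-only target, the effect of PMINUD has to be emulated with several instructions, which is the extra code the post above alludes to. A minimal sketch using the compiler intrinsics from <emmintrin.h>; the helper names min_epu32_sse2 and min_u32x4 are made up for this illustration and are not taken from HugeCalc:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: per-lane unsigned 32-bit minimum using only SSE2.
   SSE2 has no unsigned dword compare, so both inputs get their sign bit
   flipped; a signed compare on the biased values then orders them as
   unsigned numbers. */
static __m128i min_epu32_sse2(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi32(INT32_MIN);
    __m128i a_gt_b = _mm_cmpgt_epi32(_mm_xor_si128(a, bias),
                                     _mm_xor_si128(b, bias));
    /* Take b in the lanes where a > b, otherwise keep a. */
    return _mm_or_si128(_mm_and_si128(a_gt_b, b),
                        _mm_andnot_si128(a_gt_b, a));
}

/* Convenience wrapper on plain arrays of four values (unaligned access). */
static void min_u32x4(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, min_epu32_sse2(va, vb));
}
```

That is two XORs, a compare, and a three-instruction blend in place of a single PMINUD.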
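Thread starter | Posted 2010-12-14 17:04:33
G-Spider posted on 2010-12-14 13:12
The earlier comparison may have been slightly unfair: in the first test ebx was a memory variable, while in the other two it was a register variable. I changed that, and the assembly versions are still a little faster. Results (the timings may fluctuate, but the overall trend is as shown):
  Elapsed time: 0.121000 s
  Elapsed time: 0.100000 s
  Elapsed time: 0.098000 s
  result=8
Test code: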
  /* Builds with 32-bit MSVC only (uses __asm inline assembly). */
  #include <stdio.h>
  #include <stdlib.h>   /* system() */
  #include <time.h>

  int main(void)
  {
      int i, j, result;
      double t1, t2, t3;

      /* Test 1: conditional expression, left to the compiler */
      result = 0;
      j = 2009;
      t1 = clock();
      for (i = 0; i < 99990000; i += 2, j -= 10)
      {
          result = (i < j) ? 6 : 8;
      }
      printf("Elapsed time: %f s\n", (clock() - t1) / CLOCKS_PER_SEC);
      //printf("result=%d\n",result);

      /* Test 2: branchless, SETL + DEC + AND + ADD */
      result = 0;
      j = 2009;
      t2 = clock();
      for (i = 0; i < 99990000; i += 2, j -= 10)
      {
          __asm
          {
              xor ebx, ebx
              mov eax, i
              cmp eax, j
              setl bl          ; ebx = 1 if i < j, else 0
              dec ebx          ; ebx = 0 if i < j, else -1
              and ebx, 2       ; ebx = 0 or 2
              add ebx, 6       ; ebx = 6 or 8
              mov result, ebx
          }
      }
      printf("Elapsed time: %f s\n", (clock() - t2) / CLOCKS_PER_SEC);
      //printf("result=%d\n",result);

      /* Test 3: branchless, SETGE + LEA */
      result = 0;
      j = 2009;
      t3 = clock();
      for (i = 0; i < 99990000; i += 2, j -= 10)
      {
          __asm
          {
              xor ebx, ebx
              mov eax, i
              cmp eax, j
              setge bl                 ; ebx = 1 if i >= j, else 0
              lea ebx, [ebx*2+6]       ; ebx = 6 or 8
              mov result, ebx
          }
      }
      printf("Elapsed time: %f s\n", (clock() - t3) / CLOCKS_PER_SEC);
      printf("result=%d\n", result);
      system("Pause");
      return 0;
  }
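The two assembly variants can also be written in portable C. A minimal sketch (the function names are made up for this illustration); whether a particular compiler turns it into SETcc/LEA or something else is not guaranteed:

```c
/* Same result as (i < j) ? 6 : 8, computed without a branch.
   A C comparison yields exactly 0 or 1, so (i >= j) plays the role of the
   SETGE result and the multiply-add mirrors LEA ebx,[ebx*2+6]. */
static int select_lea_style(int i, int j)
{
    return 6 + 2 * (i >= j);
}

/* Mirrors test 2: SETL, DEC, AND 2, ADD 6. */
static int select_mask_style(int i, int j)
{
    int mask = (i < j) - 1;   /* 0 when i < j, -1 (all bits set) otherwise */
    return (mask & 2) + 6;
}
```

Both functions return 6 when i < j and 8 otherwise, so either can replace the conditional expression in test 1.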
Thread starter | Posted 2010-12-14 17:57:09
Re #6 liangbch: I will try it too when I have time... 1. int 2. float 3. double
Thread starter | Posted 2010-12-14 20:45:42
Partial translation of the Intel optimization manual. By G-Spider, 2010-12-14. Corrections are welcome. http://blog.csdn.net/G_Spider

Software Prefetch Scheduling Distance

Determining the ideal prefetch placement in the code depends on many architectural parameters, including: the amount of memory to be prefetched, cache lookup latency, system memory latency, and estimate of computation cycle. The ideal distance for prefetching data is processor- and platform-dependent. If the distance is too short, the prefetch will not hide the latency of the fetch behind computation. If the prefetch is too far ahead, prefetched data may be flushed out of the cache by the time it is required.

Since prefetch distance is not a well-defined metric, for this discussion, we define a new term, prefetch scheduling distance (PSD), which is represented by the number of iterations. For large loops, prefetch scheduling distance can be set to 1 (that is, schedule prefetch instructions one iteration ahead). For small loop bodies (that is, loop iterations with little computation), the prefetch scheduling distance must be more than one iteration.

A simplified equation to compute PSD is deduced from the mathematical model. For a simplified equation, complete mathematical model, and methodology of prefetch distance determination, see Appendix E, "Summary of Rules and Suggestions."

Example 7-3 illustrates the use of a prefetch within the loop body. The prefetch scheduling distance is set to 3, ESI is effectively the pointer to a line, EDX is the address of the data being referenced and XMM1-XMM4 are the data used in computation. Example 7-4 uses two independent cache lines of data per iteration. The PSD would need to be increased/decreased if more/less than two cache lines are used per iteration.

Example 7-3. Prefetch Scheduling Distance

  top_loop:
      prefetchnta [edx + esi + 128*3]
      prefetchnta [edx*4 + esi + 128*3]
      ......
      ......
      movaps xmm1, [edx + esi]
      movaps xmm2, [edx*4 + esi]
      movaps xmm3, [edx + esi + 16]
      movaps xmm4, [edx*4 + esi + 16]
      ......
      ......
      add esi, 128
      cmp esi, ecx
      jl top_loop
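The same idea can be sketched in C. The example below is an illustration, not code from the Intel manual: it assumes a 64-byte cache line, uses the GCC/Clang builtin __builtin_prefetch (with MSVC, _mm_prefetch from <xmmintrin.h> would fill the same role), and the name sum_with_prefetch is made up. Each iteration consumes one line and requests the line needed PSD iterations later:

```c
#include <stddef.h>

#define LINE_FLOATS 16   /* assumed 64-byte cache line = 16 floats */
#define PSD 3            /* prefetch scheduling distance, in iterations */

/* Sums n floats, one assumed cache line per iteration, and prefetches the
   line that will be needed PSD iterations later. */
static float sum_with_prefetch(const float *data, size_t n)
{
    float total = 0.0f;
    size_t i, k;

    for (i = 0; i + LINE_FLOATS <= n; i += LINE_FLOATS) {
        size_t ahead = i + (size_t)PSD * LINE_FLOATS;
        if (ahead < n) {
            /* rw = 0 (read), locality = 0 (no temporal locality),
               similar in spirit to PREFETCHNTA */
            __builtin_prefetch(&data[ahead], 0, 0);
        }
        for (k = 0; k < LINE_FLOATS; ++k)
            total += data[i + k];
    }
    for (; i < n; ++i)   /* leftover elements that do not fill a line */
        total += data[i];
    return total;
}
```

A loop body this small is the case the text describes as needing a PSD greater than 1; the value 3 is simply copied from Example 7-3 and would have to be tuned per platform.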
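Thread starter | Posted 2010-12-22 12:09:45
I had looked at several fast memcpy() implementations before and tried them myself, and none of them seemed satisfactory... Today I came across another article, and it made me smile: http://blog.csdn.net/OJOE/archive/2010/08/18/5819921.aspx
Is it really as that author says: "Using MMS/SSE memory techniques leaves little room for optimizing memcpy, and in the early phase of execution the optimized version may even perform worse than the unoptimized one"?
There are other claims as well. Can this one really deliver a 30-70% speedup?
Courtesy of William Chan and Google. 30-70% faster than memcpy in Microsoft Visual Studio 2005.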
  /* 32-bit MSVC inline assembly. src and dest must be 16-byte aligned.
     Only whole 128-byte blocks are copied; any remainder is left untouched. */
  void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long nbytes)
  {
      __asm
      {
          mov esi, src;            //src pointer
          mov edi, dest;           //dest pointer
          mov ebx, nbytes;         //ebx is our counter
          shr ebx, 7;              //divide by 128 (8 * 128bit registers)
          jz loop_copy_end;        //fewer than 128 bytes: nothing to copy
      loop_copy:
          prefetchnta 128[ESI];    //non-temporal prefetch (an SSE instruction)
          prefetchnta 160[ESI];
          prefetchnta 192[ESI];
          prefetchnta 224[ESI];
          movdqa xmm0, 0[ESI];     //move data from src to registers
          movdqa xmm1, 16[ESI];
          movdqa xmm2, 32[ESI];
          movdqa xmm3, 48[ESI];
          movdqa xmm4, 64[ESI];
          movdqa xmm5, 80[ESI];
          movdqa xmm6, 96[ESI];
          movdqa xmm7, 112[ESI];
          movntdq 0[EDI], xmm0;    //move data from registers to dest
          movntdq 16[EDI], xmm1;
          movntdq 32[EDI], xmm2;
          movntdq 48[EDI], xmm3;
          movntdq 64[EDI], xmm4;
          movntdq 80[EDI], xmm5;
          movntdq 96[EDI], xmm6;
          movntdq 112[EDI], xmm7;
          add esi, 128;
          add edi, 128;
          dec ebx;
          jnz loop_copy;           //loop please
          sfence;                  //order the non-temporal stores
      loop_copy_end:
      }
  }
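For compilers without __asm support, roughly the same copy can be written with SSE2 intrinsics. This is a sketch under stated assumptions, not the quoted author's code: the name stream_copy_aligned is made up, both pointers must be 16-byte aligned, it moves 64 bytes per iteration instead of 128, it omits the software prefetch, and it hands the remainder to memcpy:

```c
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <stddef.h>
#include <string.h>

/* dest and src must be 16-byte aligned. Whole 64-byte chunks go through
   non-temporal stores (MOVNTDQ); the remainder falls back to memcpy. */
static void stream_copy_aligned(void *dest, const void *src, size_t nbytes)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    size_t chunks = nbytes / 64;
    size_t i;

    for (i = 0; i < chunks; ++i) {
        __m128i r0 = _mm_load_si128(s + 0);   /* aligned loads (MOVDQA) */
        __m128i r1 = _mm_load_si128(s + 1);
        __m128i r2 = _mm_load_si128(s + 2);
        __m128i r3 = _mm_load_si128(s + 3);
        _mm_stream_si128(d + 0, r0);          /* stores that bypass the cache */
        _mm_stream_si128(d + 1, r1);
        _mm_stream_si128(d + 2, r2);
        _mm_stream_si128(d + 3, r3);
        s += 4;
        d += 4;
    }
    _mm_sfence();                 /* order the streaming stores */
    memcpy(d, s, nbytes % 64);    /* copy the tail with ordinary stores */
}
```

Whether this beats the library memcpy depends on the copy size and the machine.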
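And there is this one. I tried it, and on Intel hardware it did not seem to deliver the ideal speedup it claims; still, the idea is good and probably comes from AMD's documentation. Memory copy optimization methods: http://www.freegames.com.cn/school/383/2007/27312.html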
Thread starter | Posted 2010-12-23 12:09:01
Testing shows that for memory copies the SSE-family instructions only show a clear advantage when the amount of data is large, on the order of megabytes.
Intel(R) Core(TM)2 Duo CPU E8500 3.16GHz

Test with a 151,885 KB file
  _fast_memcpy9 (SSE) elapsed time:    52 ms  49 ms  50 ms  49 ms  52 ms
  _fast_memcpy1 (movsd) elapsed time:  73 ms  71 ms  73 ms  72 ms  74 ms
---------------------------
Test with a 63 KB file
  _fast_memcpy9 (SSE) elapsed time:    24 us  10 us  10 us  10 us  10 us
  _fast_memcpy1 (movsd) elapsed time:   6 us   6 us   6 us   6 us   6 us
--------------------------------
Code:
;ml /c /coff memcpyTest.asm
;link /subsystem:console memcpyTest.obj ;5329 15044
;************************************************************
.686p
.XMM
.model flat,stdcall
option casemap:none
include windows.inc
include user32.inc
include kernel32.inc
include msvcrt.inc
includelib user32.lib
includelib kernel32.lib
includelib msvcrt.lib
.data
dwlm            dd 1000000      ;1000 gives milliseconds, 1000000 gives microseconds
fmt             db 'Elapsed time:',0dh,0ah,0
fmt1            db '%6lld us',0dh,0ah,0
;szFileName     db 'xinyu.mkv',0    ;151,885KB source file
;szOutName      db 'output.mkv',0   ;output file
szFileName      db 'test.jpg',0     ;63KB source file; measure in microseconds
szOutName       db 'output.jpg',0   ;output file
szPause         db 'Pause',0
.data?
hHandle         dd ?
hHandle1        dd ?
lpInputBuf      dd ?
lpOutputBuf     dd ?
dwStrlen        dd ?
lpNumberOfBytes dd ?
dwOldProcessP   dd ?
dwOldThreadP    dd ?
;-------------------------------------
dqTickCounter1  dq ?
dqTickCounter2  dq ?
dqFreq          dq ?
dqTime          dq ?
.code
;*************************************
; rep movsd copy. Copies dwlen/4 dwords; up to 3 trailing bytes are not copied.
_fast_memcpy1 proc lpdst,lpsrc,dwlen
;%define param esp+8+4
;%define src param+0
;%define dst param+4
;%define len param+8
    push esi
    push edi
    mov esi, lpsrc          ; source array
    mov edi, lpdst          ; destination array
    mov ecx, dwlen
    shr ecx, 2              ; convert to DWORD count
    rep movsd
    pop edi
    pop esi
    xor eax,eax
    ret
_fast_memcpy1 endp
;***************************************
; SSE2 copy with non-temporal stores. Buffers must be 16-byte aligned.
; Copies whole 128-byte blocks only: dwlen must be at least 128 and the
; trailing dwlen mod 128 bytes are not copied.
; "uses" preserves the callee-saved registers required by stdcall.
_fast_memcpy9 proc uses esi edi ebx lpdst,lpsrc,dwlen
    mov esi, lpsrc          ; src pointer
    mov edi, lpdst          ; dest pointer
    mov ebx, dwlen          ; ebx is our counter
    shr ebx, 7              ; divide by 128 (8 * 128bit registers)
    ALIGN 8
loop_copy:
    prefetchnta 128[ESI]    ; non-temporal prefetch (an SSE instruction)
    prefetchnta 160[ESI]
    prefetchnta 192[ESI]
    prefetchnta 224[ESI]
    movdqa xmm0, 0[ESI]     ; move data from src to registers
    movdqa xmm1, 16[ESI]
    movdqa xmm2, 32[ESI]
    movdqa xmm3, 48[ESI]
    movdqa xmm4, 64[ESI]
    movdqa xmm5, 80[ESI]
    movdqa xmm6, 96[ESI]
    movdqa xmm7, 112[ESI]
    movntdq 0[EDI], xmm0    ; move data from registers to dest
    movntdq 16[EDI], xmm1
    movntdq 32[EDI], xmm2
    movntdq 48[EDI], xmm3
    movntdq 64[EDI], xmm4
    movntdq 80[EDI], xmm5
    movntdq 96[EDI], xmm6
    movntdq 112[EDI], xmm7
    add esi, 128
    add edi, 128
    dec ebx
    jnz loop_copy           ; loop please
    xor eax,eax
    ret
_fast_memcpy9 endp
;*****************************************************
start:
    invoke CreateFile,offset szFileName,GENERIC_READ,FILE_SHARE_READ,\
           NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL
    .if eax == INVALID_HANDLE_VALUE
        invoke MessageBox,NULL,0,0,0
    .endif
    mov hHandle,eax
    invoke GetFileSize,eax,NULL
    mov dwStrlen,eax
    add eax,16              ; 16 bytes of slack for alignment
    invoke crt_malloc,eax
    mov lpInputBuf,eax
    mov edx,lpInputBuf
    and eax,0fh             ; already 16-byte aligned?
    jz Good1
    xor eax,edx             ; clear the low 4 bits of the pointer
    add eax,10h             ; and round up to the next 16-byte boundary
    mov lpInputBuf,eax
Good1:
    invoke RtlZeroMemory,lpInputBuf,dwStrlen
    invoke ReadFile,hHandle,lpInputBuf,dwStrlen,offset lpNumberOfBytes,NULL
    mov eax,dwStrlen
    add eax,16
    invoke crt_malloc,eax
    mov lpOutputBuf,eax
    mov edx,lpOutputBuf
    and eax,0fh
    jz Good2
    xor eax,edx
    add eax,10h
    mov lpOutputBuf,eax
Good2:
    invoke RtlZeroMemory,lpOutputBuf,dwStrlen
;----------------------------------------------------
    invoke crt_printf,offset fmt
    mov ecx,5               ; run the test 5 times
    .while ecx!=0
        push ecx
        invoke GetCurrentProcess
        invoke GetPriorityClass,eax
        mov dwOldProcessP,eax
        invoke GetCurrentThread
        invoke GetThreadPriority,eax
        mov dwOldThreadP,eax
        invoke GetCurrentProcess
        invoke SetPriorityClass,eax,REALTIME_PRIORITY_CLASS
        invoke GetCurrentThread
        invoke SetThreadPriority,eax,THREAD_PRIORITY_TIME_CRITICAL
;--------------------------------------------------
        invoke QueryPerformanceCounter,addr dqTickCounter1
        ; start of timed section
        ;invoke _fast_memcpy1,lpOutputBuf,lpInputBuf,dwStrlen
        invoke _fast_memcpy9,lpOutputBuf,lpInputBuf,dwStrlen
        ; end of timed section
        invoke QueryPerformanceCounter,addr dqTickCounter2
        invoke QueryPerformanceFrequency,addr dqFreq
        mov eax,dword ptr dqTickCounter1
        mov edx,dword ptr dqTickCounter1[4]
        sub dword ptr dqTickCounter2,eax
        sbb dword ptr dqTickCounter2[4],edx ; sbb propagates the borrow from the low dword
;----------------------------------------------------
        ; restore the original priorities
        invoke GetCurrentThread
        invoke SetThreadPriority,eax,dwOldThreadP
        invoke GetCurrentProcess
        invoke SetPriorityClass,eax, dwOldProcessP
        finit
        fild dqFreq
        fild dqTickCounter2
        fimul dwlm
        fdivr               ; st = ticks * dwlm / frequency
        fistp dqTime        ; 64-bit interval in the unit selected by dwlm (microseconds here)
;---------------------------------------------------
        invoke crt_printf,offset fmt1,dqTime
        pop ecx
        dec ecx
    .endw
    ; write the copied data to the output file
    invoke CreateFile,offset szOutName,GENERIC_WRITE,FILE_SHARE_READ,\
           NULL,CREATE_ALWAYS,FILE_ATTRIBUTE_NORMAL,NULL
    .if eax == INVALID_HANDLE_VALUE
        invoke MessageBox,NULL,0,0,0
    .endif
    mov hHandle1,eax
    invoke WriteFile,eax,lpOutputBuf,dwStrlen,offset lpNumberOfBytes,NULL
    invoke CloseHandle,hHandle
    invoke CloseHandle,hHandle1
    invoke crt_system,offset szPause
    invoke ExitProcess,0
end start
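The timing arithmetic in the listing (counter difference, scaled by dwlm, divided by the counter frequency) can be sketched in C as follows. The function names are made up for this illustration, and the values would come from QueryPerformanceCounter and QueryPerformanceFrequency on Windows:

```c
#include <stdint.h>

/* Convert a performance-counter interval to time units.
   scale = 1000 gives milliseconds, 1000000 gives microseconds,
   matching the role of dwlm in the listing. */
static int64_t ticks_to_units(int64_t start, int64_t end, int64_t freq, int64_t scale)
{
    /* One 64-bit subtraction. In 32-bit assembly this is SUB on the low
       dword followed by SBB on the high dword, so the borrow carries over. */
    int64_t ticks = end - start;
    return (ticks * scale + freq / 2) / freq;   /* rounded to nearest */
}

/* Throughput in bytes per second for nbytes copied in `micros` microseconds. */
static double bytes_per_second(uint64_t nbytes, int64_t micros)
{
    return (double)nbytes * 1e6 / (double)micros;
}
```

Applied to the figures in this post, 151,885 KB (155,530,240 bytes) in about 50 ms is roughly 3.1e9 bytes per second for the SSE version, against roughly 2.2e9 bytes per second at 72 ms for rep movsd.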
Posted 2010-12-24 10:18:17
On optimizing memcpy: Intel's official document, the Intel® 64 and IA-32 Architectures Optimization Reference Manual (revision of November 2009), contains code for several optimization approaches in section 7.7.2.4, "Optimizing Memory Copy Routines".
Thread starter | Posted 2010-12-24 12:52:39
Re #18 liangbch: Yes, I have read it. The hard part is getting the details right.
The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations:
  Alignment of data
  Proper layout of pages in memory
  Cache size
  Interaction of the translation lookaside buffer (TLB) with memory accesses
  Combining prefetch and streaming-store instructions.
movaps seems to be a little faster, and the order of the instructions also makes a difference... Aligning the loop body to 8 bytes had no effect that I could detect.
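Of the considerations listed, data alignment is the one the test programs in this thread handle by hand: they allocate 16 extra bytes and round the pointer up to a 16-byte boundary so that movdqa and movntdq can be used. A minimal C sketch of that step (the helper names are made up for this illustration):

```c
#include <stdint.h>
#include <stdlib.h>

/* Round p up to the next multiple of 16 (unchanged if already aligned). */
static void *align_up_16(void *p)
{
    uintptr_t addr = (uintptr_t)p;
    return (void *)((addr + 15u) & ~(uintptr_t)15u);
}

/* Allocate nbytes plus 16 bytes of slack and return a 16-byte-aligned
   pointer into the block. *raw receives the pointer to pass to free(). */
static void *malloc_aligned16(size_t nbytes, void **raw)
{
    *raw = malloc(nbytes + 16);
    return *raw ? align_up_16(*raw) : NULL;
}
```

The original malloc result has to be kept for free(); the assembly programs in this thread overwrite it and never free the buffers, which is harmless only because they exit right afterwards.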
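Thread starter | Posted 2010-12-24 22:13:05
Re #17 G-Spider: that version had a bug, now fixed (the copy is exact to the byte). I also added a block-prefetch variant (the "hard prefetch" approach), and the small leftover byte counts are handled with movsd.
Test platform: cpu-26661.PNG
Memory copy test with a 32.1 MB file:
  _fast_memcpy1 (movsd)                              33 ms
  _fast_memcpy9 (SSE family)                         23 ms
  _block_prefetch (hard prefetch, block_size 8KB)    22 ms
Code: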
;************************************************************
;-==-: fast_memcpyTest By G-Spider @2010
;-==-: ml /c /coff memcpyTest.asm
;-==-: link /subsystem:console memcpyTest.obj
;************************************************************
.686p
.XMM
.model flat,stdcall
option casemap:none
include windows.inc
include user32.inc
include kernel32.inc
include msvcrt.inc
includelib user32.lib
includelib kernel32.lib
includelib msvcrt.lib
BLOCK_SIZE equ 8192
.data
dwlm            dd 1000         ;1000 gives milliseconds, 1000000 gives microseconds
fmt             db 'Elapsed time:',0dh,0ah,0
fmt1            db '%6lld ms',0dh,0ah,0
szFileName      db 'xinyu.avi',0    ;32,954KB source file
szOutName       db 'output.avi',0   ;output file
;szFileName     db 'test.png',0     ;63KB source file; measure in microseconds
;szOutName      db 'output.png',0   ;output file
szPause         db 'Pause',0
.data?
hHandle         dd ?
hHandle1        dd ?
lpInputBuf      dd ?
lpOutputBuf     dd ?
dwStrlen        dd ?
lpNumberOfBytes dd ?
dwOldProcessP   dd ?
dwOldThreadP    dd ?
;-------------------------------------
dqTickCounter1  dq ?
dqTickCounter2  dq ?
dqFreq          dq ?
dqTime          dq ?
.code
;*************************************
; rep movsd copy with a byte-exact tail.
; "uses" preserves the callee-saved registers required by stdcall.
_fast_memcpy1 proc uses esi edi lpdst,lpsrc,dwlen
;%define param esp+8+4
;%define src param+0
;%define dst param+4
;%define len param+8
    mov esi, lpsrc          ; source array
    mov edi, lpdst          ; destination array
    mov ecx, dwlen
    mov eax,ecx
    and eax,3               ; leftover bytes (< 4)
    shr ecx, 2              ; convert to DWORD count
    test ecx,ecx
    jz A000
    rep movsd
A000:
    test eax,eax
    jz A001
    mov ecx,eax
    rep movsb
A001:
    xor eax,eax
    ret
_fast_memcpy1 endp
;***************************************
; SSE2 copy with non-temporal stores, 128 bytes per iteration.
; Buffers must be 16-byte aligned; the tail is copied with movsd/movsb.
_fast_memcpy9 proc uses esi edi ebx lpdst,lpsrc,dwlen
    mov esi, lpsrc          ; src pointer
    mov edi, lpdst          ; dest pointer
    mov ebx, dwlen          ; ebx is our counter
    mov ecx, ebx
    and ecx, 07fh           ; leftover bytes (< 128)
    shr ebx, 7              ; divide by 128 (8 * 128bit registers)
    test ebx,ebx
    jz A000
    ALIGN 16
loop_copy:
    prefetchnta 128[ESI]    ; non-temporal prefetch (an SSE instruction)
    prefetchnta 160[ESI]
    prefetchnta 192[ESI]
    prefetchnta 224[ESI]
    movdqa xmm0, 0[ESI]     ; move data from src to registers
    movdqa xmm1, 16[ESI]
    movdqa xmm2, 32[ESI]
    movdqa xmm3, 48[ESI]
    movdqa xmm4, 64[ESI]
    movdqa xmm5, 80[ESI]
    movdqa xmm6, 96[ESI]
    movdqa xmm7, 112[ESI]
    movntdq 0[EDI], xmm0    ; move data from registers to dest
    movntdq 16[EDI], xmm1
    movntdq 32[EDI], xmm2
    movntdq 48[EDI], xmm3
    movntdq 64[EDI], xmm4
    movntdq 80[EDI], xmm5
    movntdq 96[EDI], xmm6
    movntdq 112[EDI], xmm7
    add esi, 128
    add edi, 128
    dec ebx
    jnz loop_copy           ; loop please
    sfence                  ; order the non-temporal stores
    align 16
A000:
    mov eax, ecx
    and eax, 3              ; leftover bytes (< 4)
    shr ecx, 2              ; convert to DWORD count
    test ecx,ecx
    jz short A001
    rep movsd
A001:
    test eax,eax
    jz A002
    mov ecx,eax
    rep movsb
A002:
    xor eax,eax
    ret
_fast_memcpy9 endp
;***************************************
; Block prefetch: for each BLOCK_SIZE block, first load from every cache line
; to pull the block into cache, then copy it with non-temporal stores.
; Buffers must be 16-byte aligned.
_block_prefetch proc uses esi edi lpdst,lpsrc,dwlen
    mov edi, lpdst
    mov esi, lpsrc
    mov eax, dwlen
    mov edx, eax
    and eax, (BLOCK_SIZE-1) ; tail bytes; 4096-1=0fffh, 8192-1=1fffh, 16*1024-1=3fffh
    and edx, 0ffffe000h     ; bytes in whole blocks; mask must match BLOCK_SIZE
    test edx,edx
    jz A000
    align 16
main_loop:
    xor ecx,ecx
    align 16
prefetch_loop:
    movaps xmm0, [esi+ecx]      ; touch two 64-byte lines per iteration
    movaps xmm0, [esi+ecx+64]
    add ecx,128
    cmp ecx,BLOCK_SIZE
    jne prefetch_loop
    xor ecx,ecx
    align 16
cpy_loop:
    movdqa xmm0,[esi+ecx]
    movdqa xmm1,[esi+ecx+16]
    movdqa xmm2,[esi+ecx+32]
    movdqa xmm3,[esi+ecx+48]
    movdqa xmm4,[esi+ecx+64]
    movdqa xmm5,[esi+ecx+16+64]
    movdqa xmm6,[esi+ecx+32+64]
    movdqa xmm7,[esi+ecx+48+64]
    movntdq [edi+ecx],xmm0
    movntdq [edi+ecx+16],xmm1
    movntdq [edi+ecx+32],xmm2
    movntdq [edi+ecx+48],xmm3
    movntdq [edi+ecx+64],xmm4
    movntdq [edi+ecx+80],xmm5
    movntdq [edi+ecx+96],xmm6
    movntdq [edi+ecx+112],xmm7
    add ecx,128
    cmp ecx,BLOCK_SIZE
    jne cpy_loop
    add esi,ecx
    add edi,ecx
    sub edx,ecx
    jnz main_loop
    sfence                  ; order the non-temporal stores
    align 16
A000:
    mov ecx, eax
    and eax, 3              ; leftover bytes (< 4)
    shr ecx, 2              ; convert to DWORD count
    test ecx,ecx
    jz short A001
    rep movsd
A001:
    test eax,eax
    jz A002
    mov ecx,eax
    rep movsb
A002:
    xor eax,eax
    ret
_block_prefetch endp
;*****************************************************
start:
    invoke CreateFile,offset szFileName,GENERIC_READ,FILE_SHARE_READ,\
           NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL
    .if eax == INVALID_HANDLE_VALUE
        invoke MessageBox,NULL,0,0,0
    .endif
    mov hHandle,eax
    invoke GetFileSize,eax,NULL
    mov dwStrlen,eax
    add eax,16              ; 16 bytes of slack for alignment
    invoke crt_malloc,eax
    mov lpInputBuf,eax
    mov edx,lpInputBuf
    and eax,0fh             ; already 16-byte aligned?
    jz Good1
    xor eax,edx             ; clear the low 4 bits of the pointer
    add eax,10h             ; and round up to the next 16-byte boundary
    mov lpInputBuf,eax
Good1:
    invoke RtlZeroMemory,lpInputBuf,dwStrlen
    invoke ReadFile,hHandle,lpInputBuf,dwStrlen,offset lpNumberOfBytes,NULL
    mov eax,dwStrlen
    add eax,16
    invoke crt_malloc,eax
    mov lpOutputBuf,eax
    mov edx,lpOutputBuf
    and eax,0fh
    jz Good2
    xor eax,edx
    add eax,10h
    mov lpOutputBuf,eax
Good2:
    invoke RtlZeroMemory,lpOutputBuf,dwStrlen
;----------------------------------------------------
    invoke crt_printf,offset fmt
    mov ecx,5               ; run the test 5 times
    .while ecx!=0
        push ecx
        invoke GetCurrentProcess
        invoke GetPriorityClass,eax
        mov dwOldProcessP,eax
        invoke GetCurrentThread
        invoke GetThreadPriority,eax
        mov dwOldThreadP,eax
        invoke GetCurrentProcess
        invoke SetPriorityClass,eax,REALTIME_PRIORITY_CLASS
        invoke GetCurrentThread
        invoke SetThreadPriority,eax,THREAD_PRIORITY_TIME_CRITICAL
;--------------------------------------------------
        invoke QueryPerformanceCounter,addr dqTickCounter1
        ; start of timed section
        ;invoke _fast_memcpy1,lpOutputBuf,lpInputBuf,dwStrlen
        ;invoke _fast_memcpy9,lpOutputBuf,lpInputBuf,dwStrlen
        invoke _block_prefetch,lpOutputBuf,lpInputBuf,dwStrlen
        ; end of timed section
        invoke QueryPerformanceCounter,addr dqTickCounter2
        invoke QueryPerformanceFrequency,addr dqFreq
        mov eax,dword ptr dqTickCounter1
        mov edx,dword ptr dqTickCounter1[4]
        sub dword ptr dqTickCounter2,eax
        sbb dword ptr dqTickCounter2[4],edx ; sbb propagates the borrow from the low dword
;----------------------------------------------------
        ; restore the original priorities
        invoke GetCurrentThread
        invoke SetThreadPriority,eax,dwOldThreadP
        invoke GetCurrentProcess
        invoke SetPriorityClass,eax, dwOldProcessP
        finit
        fild dqFreq
        fild dqTickCounter2
        fimul dwlm
        fdivr               ; st = ticks * dwlm / frequency
        fistp dqTime        ; 64-bit interval in the unit selected by dwlm (milliseconds here)
;---------------------------------------------------
        invoke crt_printf,offset fmt1,dqTime
        pop ecx
        dec ecx
    .endw
    ; write the copied data to the output file
    invoke CreateFile,offset szOutName,GENERIC_WRITE,FILE_SHARE_READ,\
           NULL,CREATE_ALWAYS,FILE_ATTRIBUTE_NORMAL,NULL
    .if eax == INVALID_HANDLE_VALUE
        invoke MessageBox,NULL,0,0,0
    .endif
    mov hHandle1,eax
    invoke WriteFile,eax,lpOutputBuf,dwStrlen,offset lpNumberOfBytes,NULL
    invoke CloseHandle,hHandle
    invoke CloseHandle,hHandle1
    invoke crt_system,offset szPause
    invoke ExitProcess,0
end start
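The structure of _block_prefetch (touch every cache line of a block, then copy the block, then handle the tail) can be sketched in portable C. This is an illustration under stated assumptions rather than a translation of the listing: the name block_prefetch_copy is made up, a 64-byte cache line is assumed, and the second pass uses memcpy with ordinary stores instead of movntdq:

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_BYTES 8192   /* same block size as BLOCK_SIZE in the listing */
#define LINE_BYTES  64     /* assumed cache-line size */

/* Two-pass block copy: pass 1 reads one byte per cache line so the block is
   pulled into cache, pass 2 copies the block. Bytes that do not fill a whole
   block are copied directly at the end. */
static void block_prefetch_copy(void *dest, const void *src, size_t nbytes)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;
    size_t tail = nbytes & (BLOCK_BYTES - 1);
    size_t body = nbytes - tail;
    size_t done, off;
    volatile unsigned char sink = 0;   /* keeps the touching loads alive */

    for (done = 0; done < body; done += BLOCK_BYTES) {
        for (off = 0; off < BLOCK_BYTES; off += LINE_BYTES)
            sink = s[done + off];                     /* pass 1: touch each line */
        memcpy(d + done, s + done, BLOCK_BYTES);      /* pass 2: copy the block */
    }
    (void)sink;
    memcpy(d + body, s + body, tail);
}
```

The block size is a tuning parameter; 8 KB is the value used in this post, and the best choice presumably depends on the cache sizes of the target CPU.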