liangbch
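Posted 2010-12-14 16:45:59
Sorry, you are right: the instruction above belongs to SSE4, specifically SSE4.1.
Before posting I tried to find out from the documents I had at hand which instruction set PMINUD belongs to, but found no answer, so I guessed SSE2 and posted on that basis. After the reminder in the previous reply I googled it, and it is indeed an SSE4 instruction. See http://en.wikipedia.org/wiki/SSE4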
gxqcn
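Posted 2010-12-14 16:50:30
If SSE2 had this instruction, some of the assembly in HugeCalc would be more compact. :)
That is why your post caught my attention and left me a little puzzled.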
G-Spider
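Posted 2010-12-14 17:04:33
Replying to G-Spider's post of 2010-12-14 13:12:
The earlier comparison may have been slightly unfair: the first ebx was a memory variable, while the two later ones were register variables. I changed that, and it is still a bit faster.
Results (timings may be unstable, but the overall trend holds):
Elapsed time:0.121000 s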
Elapsed time:0.100000 s
Elapsed time:0.098000 s
result=8
Test code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main()
{
int i,j,result;
double t1,t2,t3;
//Test 1: plain C conditional #################
result=0;
j=2009;
t1=clock();
for(i=0;i<99990000;i+=2,j-=10)
{
result = (i < j) ? 6 : 8;
}
printf("Elapsed time: %f s\n",(clock()-t1)/CLOCKS_PER_SEC);
//printf("result=%d\n",result);
//Test 2: setl/dec/and/add #################
result=0;
j=2009;
t2=clock();
for(i=0;i<99990000;i+=2,j-=10)
{
__asm
{
xor ebx, ebx
mov eax, i
cmp eax, j
setl bl
dec ebx
and ebx, 2
add ebx, 6
mov result,ebx
}
}
printf("Elapsed time: %f s\n",(clock()-t2)/CLOCKS_PER_SEC);
//printf("result=%d\n",result);
//Test 3: setge/lea #################
result=0;
j=2009;
t3=clock();
for(i=0;i<99990000;i+=2,j-=10)
{
__asm
{
xor ebx, ebx
mov eax, i
cmp eax, j
setge bl
lea ebx, [ebx*2+6] ; ebx = 6 + 2*(i >= j)
mov result,ebx
}
}
printf("Elapsed time: %f s\n",(clock()-t3)/CLOCKS_PER_SEC);
printf("result=%d\n",result);
system("Pause");
return 0;
}
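For readers who do not use MSVC inline assembly, the two branchless sequences can be sketched in portable C. This is only an illustration: the function names `select_and` and `select_lea` are hypothetical, and the second function assumes that the lea in test 3 computes 6 + 2*(i >= j).

```c
#include <stdint.h>

/* Mirrors test 2: setl / dec / and 2 / add 6. */
static int select_and(int i, int j)
{
    uint32_t lt = (uint32_t)(i < j);  /* 1 if i < j, else 0 (setl bl)        */
    uint32_t m  = lt - 1u;            /* 0 if i < j, else 0xFFFFFFFF (dec)   */
    return (int)((m & 2u) + 6u);      /* 6 if i < j, else 8 (and 2, add 6)   */
}

/* Mirrors test 3: setge, then one lea that scales and adds. */
static int select_lea(int i, int j)
{
    return (i >= j) * 2 + 6;          /* 6 if i < j, else 8 */
}
```

Both functions return the same value as `(i < j) ? 6 : 8`. A modern compiler may well generate similar branch-free code from the plain conditional on its own, so any speed difference should be measured rather than assumed.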
G-Spider
Posted 2010-12-14 17:57:09
Re #6, liangbch:
I will give it a try too when I have time...
1. int type
2. float type
3. double type
G-Spider
Posted 2010-12-14 20:45:42
Partial translation of the Intel optimization manual, by G-Spider, 2010-12-14. Corrections are welcome.
http://blog.csdn.net/G_Spider
Software Prefetch Scheduling Distance
Determining the ideal prefetch placement in the code depends on many architectural parameters, including: the amount of memory to be prefetched, cache lookup latency, system memory latency, and estimate of computation cycle. The ideal distance for prefetching data is processor- and platform-dependent. If the distance is too short, the prefetch will not hide the latency of the fetch behind computation. If the prefetch is too far ahead, prefetched data may be flushed out of the cache by the time it is required.
Since prefetch distance is not a well-defined metric, for this discussion, we define a new term, prefetch scheduling distance (PSD), which is represented by the number of iterations. For large loops, prefetch scheduling distance can be set to 1 (that is, schedule prefetch instructions one iteration ahead). For small loop bodies (that is, loop iterations with little computation), the prefetch scheduling distance must be more than one iteration.
A simplified equation to compute PSD is deduced from the mathematical model. For a simplified equation, complete mathematical model, and methodology of prefetch distance determination, see Appendix E, “Summary of Rules and Suggestions.”
Example 7-3 illustrates the use of a prefetch within the loop body. The prefetch scheduling distance is set to 3, ESI is effectively the pointer to a line, EDX is the address of the data being referenced and XMM1-XMM4 are the data used in computation. Example 7-4 uses two independent cache lines of data per iteration. The PSD would need to be increased/decreased if more/less than two cache lines are used per iteration.
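Example 7-3. Prefetch Scheduling Distance
top_loop:
prefetchnta [edx + esi + 128*3]
prefetchnta [edx*4 + esi + 128*3]
......
......
movaps xmm1, [edx + esi]
movaps xmm2, [edx*4 + esi]
movaps xmm3, [edx + esi + 16]
movaps xmm4, [edx*4 + esi + 16]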
......
......
add esi, 128
cmp esi, ecx
jl top_loop
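The same idea can be sketched in C. This is a minimal illustration, not code from the Intel manual: it assumes a GCC- or Clang-compatible compiler (for `__builtin_prefetch`), the function name `sum_with_prefetch` is hypothetical, and the loop consumes 128 bytes per iteration to match the `add esi, 128` step above.

```c
#include <stddef.h>

#define LINE_FLOATS 32   /* 32 floats = 128 bytes consumed per iteration */
#define PSD 3            /* prefetch scheduling distance, in iterations  */

/* Sums n floats, requesting the data that will be needed PSD iterations later. */
static float sum_with_prefetch(const float *a, size_t n)
{
    float s = 0.0f;
    size_t i = 0;
    for (; i + LINE_FLOATS <= n; i += LINE_FLOATS) {
        /* rw = 0 (read), locality = 0 (non-temporal hint, like prefetchnta). */
        if (i + PSD * LINE_FLOATS < n)
            __builtin_prefetch(&a[i + PSD * LINE_FLOATS], 0, 0);
        for (size_t k = 0; k < LINE_FLOATS; ++k)
            s += a[i + k];
    }
    for (; i < n; ++i)   /* remainder shorter than one full iteration */
        s += a[i];
    return s;
}
```

With so little work per iteration a distance of 1 would probably be too short, which is why the sketch uses 3; the best value depends on the processor and platform, as the text says.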
G-Spider
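Posted 2010-12-22 12:09:45
I had looked at several fast memcpy() implementations before, and when I tried them myself none seemed satisfactory... Today I came across another author's article, and it made me smile...
http://blog.csdn.net/OJOE/archive/2010/08/18/5819921.aspx
Is it really as that author says: "Optimizing memcpy with MMX/SSE memory techniques leaves little room for improvement, and in the early phase of execution the optimized version even performs worse than the unoptimized one."
There are others as well:
Can it really deliver a 30-70% speedup?
Courtesy of William Chan and Google. 30-70% faster than memcpy in Microsoft Visual Studio 2005.
void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size_t)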
{
__asm
{
mov esi, src; //src pointer
mov edi, dest; //dest pointer
mov ebx, size_t; //ebx is our counter
shr ebx, 7; //divide by 128 (8 * 128bit registers)
loop_copy:
prefetchnta [esi + 128]; //prefetch the next 128 bytes of src
prefetchnta [esi + 160];
prefetchnta [esi + 192];
prefetchnta [esi + 224];
movdqa xmm0, [esi + 0]; //move data from src to registers
movdqa xmm1, [esi + 16];
movdqa xmm2, [esi + 32];
movdqa xmm3, [esi + 48];
movdqa xmm4, [esi + 64];
movdqa xmm5, [esi + 80];
movdqa xmm6, [esi + 96];
movdqa xmm7, [esi + 112];
movntdq [edi + 0], xmm0; //move data from registers to dest
movntdq [edi + 16], xmm1;
movntdq [edi + 32], xmm2;
movntdq [edi + 48], xmm3;
movntdq [edi + 64], xmm4;
movntdq [edi + 80], xmm5;
movntdq [edi + 96], xmm6;
movntdq [edi + 112], xmm7;
add esi, 128;
add edi, 128;
dec ebx;
jnz loop_copy; //loop please
loop_copy_end:
}
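}
And there is this one. I tried it, and on Intel it did not seem to deliver the perfect performance gain it claims, though the idea is good; it was probably taken from AMD material.
Optimization methods for memory copy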
http://www.freegames.com.cn/school/383/2007/27312.html
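For comparison, the load-then-streaming-store pattern of the inline-assembly routine above can be sketched with compiler intrinsics. This is an illustrative sketch, not William Chan's code: the name `copy_stream` is hypothetical, the loop is not unrolled and issues no prefetch, both pointers are assumed to be 16-byte aligned, and on targets without SSE2 the function simply falls back to `memcpy`.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Copies n bytes; dst and src must both be 16-byte aligned. */
static void copy_stream(void *dst, const void *src, size_t n)
{
    size_t done = 0;
#ifdef HAVE_SSE2
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    size_t blocks = n / 16;
    for (size_t i = 0; i < blocks; ++i) {
        __m128i v = _mm_load_si128(s + i);  /* aligned load, like movdqa       */
        _mm_stream_si128(d + i, v);         /* non-temporal store, like movntdq */
    }
    _mm_sfence();                           /* make the streaming stores visible */
    done = blocks * 16;
#endif
    /* Tail shorter than 16 bytes, or the whole buffer without SSE2. */
    memcpy((char *)dst + done, (const char *)src + done, n - done);
}
```

Unlike the assembly routine, this version also copies a length that is not a multiple of the block size.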
G-Spider
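Posted 2010-12-23 12:09:01
Testing shows that for memory copies, the SSE instructions only show a clear advantage when the data volume is large, on the order of megabytes.
Intel(R) Core (TM) 2 Duo CPU E8500 3.16GHz
Test with 151,885 KB
_fast_memcpy9 (SSE), elapsed time: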
52 ms
49 ms
50 ms
49 ms
52 ms
_fast_memcpy1 (movsd), elapsed time:
73 ms
71 ms
73 ms
72 ms
74 ms
---------------------------
Test with 63 KB
_fast_memcpy9 (SSE), elapsed time:
24 us
10 us
10 us
10 us
10 us
_fast_memcpy1 (movsd), elapsed time:
6 us
6 us
6 us
6 us
6 us
--------------------------------
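To compare the two test sizes on one scale, the timings can be converted to throughput. The helper below is a hypothetical sketch that assumes "KB" in the figures above means 1024 bytes.

```c
/* Throughput in MiB/s for `kib` KiB copied in `seconds` seconds. */
static double mib_per_s(double kib, double seconds)
{
    return kib / 1024.0 / seconds;
}
```

Taking representative timings from the runs above, `mib_per_s(151885.0, 0.050)` is roughly 2.9 GiB/s for the SSE copy against roughly 2.0 GiB/s for movsd at 72 ms, while for the 63 KB file the relation reverses: about 6 GiB/s for SSE at 10 us against about 10 GiB/s for movsd at 6 us. The small file fits in cache, which is consistent with streaming stores paying off only for large copies.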
Code:
;ml /c /coff memcpyTest.asm
;link /subsystem:console memcpyTest.obj;5329 15044
;************************************************************
.686p
.XMM
.model flat,stdcall
option casemap:none
include windows.inc
include user32.inc
include kernel32.inc
include msvcrt.inc
includelib user32.lib
includelib kernel32.lib
includelib msvcrt.lib
.data
dwlm dd 1000000 ;1000 gives milliseconds, 1000000 gives microseconds
fmt db 'Elapsed time:',0dh,0ah,0
fmt1 db '%6lld us',0dh,0ah,0
;szFileName db 'xinyu.mkv',0 ;151,885 KB source file
;szOutName db 'output.mkv',0 ;output file
szFileName db 'test.jpg',0 ;63 KB source file; use microseconds as the unit
szOutName db 'output.jpg',0 ;output file
szPause db 'Pause',0
.data?
hHandle dd ?
hHandle1 dd ?
lpInputBuf dd ?
lpOutputBuf dd ?
dwStrlen dd ?
lpNumberOfBytes dd ?
dwOldProcessP dd ?
dwOldThreadP dd ?
;-------------------------------------
dqTickCounter1 dq ?
dqTickCounter2 dq ?
dqFreq dq ?
dqTime dq ?
.code
;*************************************
_fast_memcpy1 proc lpdst,lpsrc,dwlen
;%define param esp+8+4
;%define src param+0
;%define dst param+4
;%define len param+8
push esi
push edi
mov esi, lpsrc; source array
mov edi, lpdst; destination array
mov ecx, dwlen
shr ecx, 2 ; convert to DWORD count
rep movsd
pop edi
pop esi
xor eax,eax
ret
_fast_memcpy1 endp
;***************************************
_fast_memcpy9 proc uses esi edi ebx lpdst,lpsrc,dwlen
;note: only whole 128-byte blocks are copied; both buffers must be 16-byte aligned
mov esi, lpsrc; //src pointer
mov edi, lpdst; //dest pointer
mov ebx, dwlen; //ebx is our counter
shr ebx, 7; //divide by 128 (8 * 128bit registers)
ALIGN 8
loop_copy:
prefetchnta [esi + 128]; //prefetch the next 128 bytes of src
prefetchnta [esi + 160];
prefetchnta [esi + 192];
prefetchnta [esi + 224];
movdqa xmm0, [esi + 0]; //move data from src to registers
movdqa xmm1, [esi + 16];
movdqa xmm2, [esi + 32];
movdqa xmm3, [esi + 48];
movdqa xmm4, [esi + 64];
movdqa xmm5, [esi + 80];
movdqa xmm6, [esi + 96];
movdqa xmm7, [esi + 112];
movntdq [edi + 0], xmm0; //move data from registers to dest
movntdq [edi + 16], xmm1;
movntdq [edi + 32], xmm2;
movntdq [edi + 48], xmm3;
movntdq [edi + 64], xmm4;
movntdq [edi + 80], xmm5;
movntdq [edi + 96], xmm6;
movntdq [edi + 112], xmm7;
add esi, 128;
add edi, 128;
dec ebx;
jnz loop_copy; //loop please
xor eax,eax
ret
_fast_memcpy9 endp
;*****************************************************
start:
invoke CreateFile,offset szFileName,GENERIC_READ,FILE_SHARE_READ,\
NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL
.if eax == INVALID_HANDLE_VALUE
invoke MessageBox,NULL,0,0,0
.endif
mov hHandle,eax
invoke GetFileSize,eax,NULL
mov dwStrlen,eax
add eax,16 ;room for 16-byte alignment
invoke crt_malloc,eax
mov lpInputBuf,eax
mov edx,lpInputBuf
and eax,0fh ;already 16-byte aligned?
jz Good1
xor eax,edx ;clear the low 4 bits
add eax,10h ;round up to the next 16-byte boundary
mov lpInputBuf,eax
Good1:
invoke RtlZeroMemory,lpInputBuf,dwStrlen
invoke ReadFile,hHandle,lpInputBuf,dwStrlen,offset lpNumberOfBytes,NULL
mov eax,dwStrlen
add eax,16
invoke crt_malloc,eax
mov lpOutputBuf,eax
mov edx,lpOutputBuf
and eax,0fh
jz Good2
xor eax,edx
add eax,10h
mov lpOutputBuf,eax
Good2:
invoke RtlZeroMemory,lpOutputBuf,dwStrlen
;----------------------------------------------------
invoke crt_printf,offset fmt
mov ecx,5 ;run the test 5 times
.while ecx!=0
push ecx
invoke GetCurrentProcess
invoke GetPriorityClass,eax
mov dwOldProcessP,eax
invoke GetCurrentThread
invoke GetThreadPriority,eax
mov dwOldThreadP,eax
invoke GetCurrentProcess
invoke SetPriorityClass,eax,REALTIME_PRIORITY_CLASS
invoke GetCurrentThread
invoke SetThreadPriority,eax,THREAD_PRIORITY_TIME_CRITICAL
;--------------------------------------------------
invoke QueryPerformanceCounter,addr dqTickCounter1
;start of timed section
;invoke _fast_memcpy1,lpOutputBuf,lpInputBuf,dwStrlen
invoke _fast_memcpy9,lpOutputBuf,lpInputBuf,dwStrlen
;end of timed section
invoke QueryPerformanceCounter,addr dqTickCounter2
invoke QueryPerformanceFrequency,addr dqFreq
mov eax,dword ptr dqTickCounter1
mov edx,dword ptr dqTickCounter1+4
sub dword ptr dqTickCounter2,eax
sbb dword ptr dqTickCounter2+4,edx ;64-bit subtract: dqTickCounter2 -= dqTickCounter1
;----------------------------------------------------
;restore the original priorities
invoke GetCurrentThread
invoke SetThreadPriority,eax,dwOldThreadP
invoke GetCurrentProcess
invoke SetPriorityClass,eax,dwOldProcessP
finit
fild dqFreq
fild dqTickCounter2
fimul dwlm
fdivr
fistp dqTime ;dqTime now holds the 64-bit elapsed time (in microseconds)
;---------------------------------------------------
invoke crt_printf,offset fmt1,dqTime
pop ecx
dec ecx
.endw
;write the copied data to the output file
invoke CreateFile,offset szOutName,GENERIC_WRITE,FILE_SHARE_READ,\
NULL,CREATE_ALWAYS,FILE_ATTRIBUTE_NORMAL,NULL
.if eax == INVALID_HANDLE_VALUE
invoke MessageBox,NULL,0,0,0
.endif
mov hHandle1,eax
invoke WriteFile,eax,lpOutputBuf,dwStrlen,offset lpNumberOfBytes,NULL
invoke CloseHandle,hHandle
invoke CloseHandle,hHandle1
invoke crt_system,offset szPause
invoke ExitProcess,0
end start
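The buffer setup in the listing above over-allocates by 16 bytes and rounds the pointer up so that movdqa and movntdq see 16-byte-aligned addresses. A minimal C sketch of that step follows; the function name `malloc_aligned16` and the `raw` out-parameter are hypothetical and are not part of the original program.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocates size + 16 bytes and returns a 16-byte-aligned pointer into the block.
   *raw receives the pointer that must later be passed to free(). */
static void *malloc_aligned16(size_t size, void **raw)
{
    *raw = malloc(size + 16);
    if (*raw == NULL)
        return NULL;
    uintptr_t p = (uintptr_t)*raw;
    if (p & 0x0F)                            /* and eax,0fh / jz Good1      */
        p = (p & ~(uintptr_t)0x0F) + 0x10;   /* xor eax,edx / add eax,10h   */
    return (void *)p;
}
```

Keeping the raw pointer matters because the rounded-up address cannot be handed to `free()`; the assembly program never frees its buffers, so it does not need to remember it.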
liangbch
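Posted 2010-12-24 10:18:17
Regarding optimization of the memcpy function: Intel's official document, the Intel® 64 and IA-32 Architectures Optimization Reference Manual (November 2009 revision), section 7.7.2.4 "Optimizing Memory Copy Routines", contains code for several optimization approaches.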
G-Spider
Posted 2010-12-24 12:52:39
Re #18, liangbch:
Yes, I have read it. The details are hard to get a handle on.
The memory copy algorithm can be optimized using the Streaming SIMD Extensions
with these considerations:
Alignment of data
Proper layout of pages in memory
Cache size
Interaction of the translation lookaside buffer (TLB) with memory accesses
Combining prefetch and streaming-store instructions.
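movaps seems to be somewhat faster, and the order of the instructions also has an effect... Aligning the loop body to 8 bytes showed no measurable effect.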
G-Spider
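Posted 2010-12-24 22:13:05
Re #17, G-Spider:
Bug fix: the copy is now exact to the byte. I also added a "hard prefetch" variant (each block is loaded into cache before it is copied), and small remaining byte counts are handled with movsd.
Test platform:
Memory-copy test with a 32.1 MB file: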
_fast_memcpy1 (movsd)
33 ms
_fast_memcpy9 (SSE family)
23 ms
_block_prefetch (hard prefetch, block_size 8 KB)
22 ms
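The byte-exact fix amounts to splitting the length into three parts: whole 128-byte blocks for the SSE loop, whole dwords for rep movsd, and the last 0 to 3 bytes for rep movsb. A small C sketch of that split follows; `struct copy_plan` and `plan_copy` are hypothetical names used only for illustration.

```c
#include <stddef.h>

struct copy_plan {
    size_t blocks128;  /* iterations of the 128-byte SSE loop      */
    size_t dwords;     /* rep movsd count for the remainder        */
    size_t bytes;      /* rep movsb count for the final 0..3 bytes */
};

static struct copy_plan plan_copy(size_t len)
{
    struct copy_plan p;
    size_t rest = len & 0x7F;   /* bytes left after the 128-byte blocks */
    p.blocks128 = len >> 7;     /* divide by 128                        */
    p.dwords    = rest >> 2;    /* whole dwords in the remainder        */
    p.bytes     = rest & 3;     /* trailing bytes                       */
    return p;
}
```

For any length, `blocks128 * 128 + dwords * 4 + bytes` equals `len`, which is the property the earlier version lacked.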
Code:
;************************************************************
;-==-: fast_memcpyTest By G-Spider @2010
;-==-: ml /c /coff memcpyTest.asm
;-==-: link /subsystem:console memcpyTest.obj
;************************************************************
.686p
.XMM
.model flat,stdcall
option casemap:none
include windows.inc
include user32.inc
include kernel32.inc
include msvcrt.inc
includelib user32.lib
includelib kernel32.lib
includelib msvcrt.lib
BLOCK_SIZE equ 8192
.data
dwlm dd 1000 ;1000 gives milliseconds, 1000000 gives microseconds
fmt db 'Elapsed time:',0dh,0ah,0
fmt1 db '%6lld ms',0dh,0ah,0
szFileName db 'xinyu.avi',0 ;32,954 KB source file
szOutName db 'output.avi',0 ;output file
;szFileName db 'test.png',0 ;63 KB source file; use microseconds as the unit
;szOutName db 'output.png',0 ;output file
szPause db 'Pause',0
.data?
hHandle dd ?
hHandle1 dd ?
lpInputBuf dd ?
lpOutputBuf dd ?
dwStrlen dd ?
lpNumberOfBytes dd ?
dwOldProcessP dd ?
dwOldThreadP dd ?
;-------------------------------------
dqTickCounter1 dq ?
dqTickCounter2 dq ?
dqFreq dq ?
dqTime dq ?
.code
;*************************************
_fast_memcpy1 proc lpdst,lpsrc,dwlen
;%define param esp+8+4
;%define src param+0
;%define dst param+4
;%define len param+8
mov esi, lpsrc; source array
mov edi, lpdst; destination array
mov ecx, dwlen
mov eax,ecx
and eax,3
shr ecx, 2 ; convert to DWORD count
test ecx,ecx
jz A000
rep movsd
A000:
test eax,eax
jz A001
mov ecx,eax
rep movsb
A001:
xor eax,eax
ret
_fast_memcpy1 endp
;***************************************
_fast_memcpy9 proc uses esi edi ebx lpdst,lpsrc,dwlen
mov esi, lpsrc ;src pointer
mov edi, lpdst ;dest pointer
mov ebx, dwlen ;ebx is our counter
mov ecx, ebx
and ecx, 07fh ;remaining bytes (fewer than 128)
shr ebx, 7 ;divide by 128 (8 * 128bit registers)
test ebx,ebx
jz A000
ALIGN 16
loop_copy:
prefetchnta [esi + 128] ;prefetch the next 128 bytes of src
prefetchnta [esi + 160]
prefetchnta [esi + 192]
prefetchnta [esi + 224]
movdqa xmm0, [esi + 0] ;move data from src to registers
movdqa xmm1, [esi + 16]
movdqa xmm2, [esi + 32]
movdqa xmm3, [esi + 48]
movdqa xmm4, [esi + 64]
movdqa xmm5, [esi + 80]
movdqa xmm6, [esi + 96]
movdqa xmm7, [esi + 112]
movntdq [edi + 0], xmm0 ;move data from registers to dest
movntdq [edi + 16], xmm1
movntdq [edi + 32], xmm2
movntdq [edi + 48], xmm3
movntdq [edi + 64], xmm4
movntdq [edi + 80], xmm5
movntdq [edi + 96], xmm6
movntdq [edi + 112], xmm7
add esi, 128
add edi, 128
dec ebx
jnz loop_copy ;loop over all 128-byte blocks
sfence
align 16
A000:
mov eax, ecx
and eax, 3
shr ecx, 2 ; convert to DWORD count
test ecx,ecx
jz short A001
rep movsd
A001:
test eax,eax
jz A002
mov ecx,eax
rep movsb
A002:
xor eax,eax
ret
_fast_memcpy9 endp
_block_prefetch proc lpdst,lpsrc,dwlen
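mov edi, lpdst
mov esi, lpsrc
mov eax, dwlen
mov edx, eax
and eax, (BLOCK_SIZE-1) ;4096-1=0fffh ;8192-1=1fffh;16*1024-1=3fffh
and edx, 0ffffe000h ;must match BLOCK_SIZE (mask = NOT (BLOCK_SIZE-1))
test edx,edx
jz A000
align 16
main_loop:
xor ecx,ecx
align 16
prefetch_loop:
movaps xmm0, [esi+ecx] ;touch one cache line per load (64-byte lines assumed)
movaps xmm0, [esi+ecx+64]
add ecx,128
cmp ecx,BLOCK_SIZE
jne prefetch_loop
xor ecx,ecx
align 16
cpy_loop:
movdqa xmm0, [esi+ecx]
movdqa xmm1, [esi+ecx+16]
movdqa xmm2, [esi+ecx+32]
movdqa xmm3, [esi+ecx+48]
movdqa xmm4, [esi+ecx+64]
movdqa xmm5, [esi+ecx+80]
movdqa xmm6, [esi+ecx+96]
movdqa xmm7, [esi+ecx+112]
movntdq [edi+ecx], xmm0
movntdq [edi+ecx+16], xmm1
movntdq [edi+ecx+32], xmm2
movntdq [edi+ecx+48], xmm3
movntdq [edi+ecx+64], xmm4
movntdq [edi+ecx+80], xmm5
movntdq [edi+ecx+96], xmm6
movntdq [edi+ecx+112], xmm7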
add ecx,128
cmp ecx,BLOCK_SIZE
jne cpy_loop
add esi,ecx
add edi,ecx
sub edx,ecx
jnz main_loop
sfence
align 16
A000:
mov ecx, eax
and eax, 3
shr ecx, 2 ; convert to DWORD count
test ecx,ecx
jz short A001
rep movsd
A001:
test eax,eax
jz A002
mov ecx,eax
rep movsb
A002:
xor eax,eax
ret
_block_prefetch endp
;*****************************************************
start:
invoke CreateFile,offset szFileName,GENERIC_READ,FILE_SHARE_READ,\
NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL
.if eax == INVALID_HANDLE_VALUE
invoke MessageBox,NULL,0,0,0
.endif
mov hHandle,eax
invoke GetFileSize,eax,NULL
mov dwStrlen,eax
add eax,16 ;room for 16-byte alignment
invoke crt_malloc,eax
mov lpInputBuf,eax
mov edx,lpInputBuf
and eax,0fh ;already 16-byte aligned?
jz Good1
xor eax,edx ;clear the low 4 bits
add eax,10h ;round up to the next 16-byte boundary
mov lpInputBuf,eax
Good1:
invoke RtlZeroMemory,lpInputBuf,dwStrlen
invoke ReadFile,hHandle,lpInputBuf,dwStrlen,offset lpNumberOfBytes,NULL
mov eax,dwStrlen
add eax,16
invoke crt_malloc,eax
mov lpOutputBuf,eax
mov edx,lpOutputBuf
and eax,0fh
jz Good2
xor eax,edx
add eax,10h
mov lpOutputBuf,eax
Good2:
invoke RtlZeroMemory,lpOutputBuf,dwStrlen
;----------------------------------------------------
invoke crt_printf,offset fmt
mov ecx,5 ;run the test 5 times
.while ecx!=0
push ecx
invoke GetCurrentProcess
invoke GetPriorityClass,eax
mov dwOldProcessP,eax
invoke GetCurrentThread
invoke GetThreadPriority,eax
mov dwOldThreadP,eax
invoke GetCurrentProcess
invoke SetPriorityClass,eax,REALTIME_PRIORITY_CLASS
invoke GetCurrentThread
invoke SetThreadPriority,eax,THREAD_PRIORITY_TIME_CRITICAL
;--------------------------------------------------
invoke QueryPerformanceCounter,addr dqTickCounter1
;start of timed section
;invoke _fast_memcpy1,lpOutputBuf,lpInputBuf,dwStrlen
;invoke _fast_memcpy9,lpOutputBuf,lpInputBuf,dwStrlen
invoke _block_prefetch,lpOutputBuf,lpInputBuf,dwStrlen
;end of timed section
invoke QueryPerformanceCounter,addr dqTickCounter2
invoke QueryPerformanceFrequency,addr dqFreq
mov eax,dword ptr dqTickCounter1
mov edx,dword ptr dqTickCounter1+4
sub dword ptr dqTickCounter2,eax
sbb dword ptr dqTickCounter2+4,edx ;64-bit subtract: dqTickCounter2 -= dqTickCounter1
;----------------------------------------------------
;restore the original priorities
invoke GetCurrentThread
invoke SetThreadPriority,eax,dwOldThreadP
invoke GetCurrentProcess
invoke SetPriorityClass,eax,dwOldProcessP
finit
fild dqFreq
fild dqTickCounter2
fimul dwlm
fdivr
fistp dqTime ;dqTime now holds the 64-bit elapsed time, in the unit selected by dwlm (milliseconds here)
;---------------------------------------------------
invoke crt_printf,offset fmt1,dqTime
pop ecx
dec ecx
.endw
;write the copied data to the output file
invoke CreateFile,offset szOutName,GENERIC_WRITE,FILE_SHARE_READ,\
NULL,CREATE_ALWAYS,FILE_ATTRIBUTE_NORMAL,NULL
.if eax == INVALID_HANDLE_VALUE
invoke MessageBox,NULL,0,0,0
.endif
mov hHandle1,eax
invoke WriteFile,eax,lpOutputBuf,dwStrlen,offset lpNumberOfBytes,NULL
invoke CloseHandle,hHandle
invoke CloseHandle,hHandle1
invoke crt_system,offset szPause
invoke ExitProcess,0
end start
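The two-pass structure of _block_prefetch can be sketched in portable C. This is an illustration under stated assumptions, not a translation of the routine: the name `block_copy` is hypothetical, a 64-byte cache line is assumed, the first pass touches one byte per line instead of loading into xmm0, and the second pass uses `memcpy` rather than movdqa/movntdq.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8192   /* same value as BLOCK_SIZE in the listing above */
#define CACHE_LINE 64     /* assumed cache-line size                        */

static void block_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= BLOCK_SIZE) {
        /* Pass 1: read one byte per cache line so the block is pulled into cache. */
        volatile unsigned char sink;
        for (size_t off = 0; off < BLOCK_SIZE; off += CACHE_LINE)
            sink = s[off];
        (void)sink;
        /* Pass 2: copy the block that is now cached. */
        memcpy(d, s, BLOCK_SIZE);
        s += BLOCK_SIZE;
        d += BLOCK_SIZE;
        n -= BLOCK_SIZE;
    }
    memcpy(d, s, n);   /* tail shorter than one block */
}
```

Whether the extra read pass pays off depends on the machine. In the measurements above it was only about 1 ms faster than the prefetchnta version on a 32.1 MB file.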