Exact division 严格整除法

liangbch 发表于 2016-11-25 15:48:43

这里，我介绍一种除法算法，Exact division,可翻译成严格整除法，这个帖子不但给出算法，而且给出具体的代码和优化过程及结论。

在GMP的文档中，提到一些函数名包含divexact的除法函数，如mpz_divexact_ui。见下。

void mpz_divexact_ui (mpz t q, const mpz t n, unsigned long d)
Set q to n/d. These functions produce correct results only when it is known in advance that d divides n.
These routines are much faster than the other division functions, and are the best choice when exact division is known to occur, for example reducing a rational to lowest terms.
翻译成中文
void mpz_divexact_ui (mpz t q, const mpz t n, unsigned long d)
计算n/d商并赋值给q。当事先已经知道除数d能够整除被除数n（被除数是除数的倍数），在这种情况下，这个函数能够得到正确的结果。这些例程比其他除法函数更快，当知道除数d能够整除被除数n时，使用这个函数是最佳的选择。
GMP文档中也给出了算法的描述，见下
15.2.5 Exact Division
A so-called exact division is when the dividend is known to be an exact multiple of the divisor.
Jebelean’s exact division algorithm uses this knowledge to make some significant optimizations
(see Appendix B , page 126).
The idea can be illustrated in decimal for example with 368154 divided by 543. Because the low digit of the dividend is 4, the low digit of the quotient must be 8. This is arrived at from 4X7 mod 10, using the fact 7 is the modular inverse of 3 (the low digit of the divisor), since 3X7= 1 mod 10. So 8X543 = 4344 can be subtracted from the dividend leaving 363810. Notice the low digit has become zero. The procedure is repeated at the second digit, with the next quotient digit 7 (1X7 mod 10), subtracting 7X543 = 3801, leaving 325800. And finally at the third digit with quotient digit 6 (8X7 mod 10), subtracting 6X543 = 3258 leaving 0. So the quotient is 678.
翻译成中文
15.2.5整除
一个所谓的“严格除法”是这样的，当被除数是除数的倍数时，jebelean的”严格除法“使用数论方面的知识，做一些重要的优化（见附录B [参考文献]，126页）。
这个算法，可以用十进制的例子来说明，例如368154除以543。因为被除数低位的数字为4，低位商必须为8，这是根据4X7 mod 10等于8这个方法来的。7是3（被除数的最低位）的模逆(数论中的逆元)，因为3X7=1 mod 10。8X543 = 4344，从被除数中减去4344，留下余数363810。注意最低位数字已变成零。重复这个过程在倒数第2位数字，可以得到下一位商7（1x7 mod 10）, 7x543 = 3801，余下325800。最后，得到第3位商6（8X7 mod 10），减去6x543 = 3258余下0。所以这个算式的商是678。

一点说明，正如，减去一个数等于加上这个数的补数一样，除以一个数等于乘以这个数的倒数，这里我们又发现，乘以一个数的逆元可以得到商。补数，倒数，逆元这三个概念很类似，但是又不尽相同。当进制为10时，1是9的补数，9同样是1的补数。同理，3和7互为补数。逆元是数论中模算术的概念，不是所有的数都有逆元。仅当一个数a和模m互素时，数a才有逆元。当m=10时，10以下的数字中，仅1,3,7,9这四个数有逆元。1的逆元是1, 9的逆元仍然是9，3和7互为逆元。如果你感兴趣，可从网上或者数论方面的书得到更多模算术的知识，这里不再重复。

Jebelean 方法和一般方法除法相比，最重要的区别是
1.从低位开始，先得到商的最低位，然后得到次低位。最后得到商的最高位。而通常的除法则是从高位开始试商。
2.试商时采用模算术，而不是普通的算术。理论上，这个算法可应用到基为2^B进制的系统，也可以应用到基为10^B进制的系统，但是在基为10^B进制系统中，模算术非常慢，所以没有实用价值。而在基为2^B的进制中，除以2^B的模可使用“and”指令来得到，速度非常快，即 a mod (2^B)= a and (2^B-1)。

liangbch 发表于 2016-11-25 15:55:32

我学习了下GMP的代码。完成了基为2^30 和GMP类似的除法函数。

这个算法的第一步是求n的逆元。下面函数mod_inverse，使用欧几里得辗转相除法，可得到a关于任意m的逆元。

int extgcd(DWORD a, DWORD b, DWORD *px, DWORD *py)
{
DWORD d = a;
if(b != 0)
{
   d = extgcd(b, a % b, py, px);
*py -= (a / b) * (*px);
}
else
{
   *px = 1;
   *py = 0;
}
return d;
}

int mod_inverse(DWORD a, DWORD m)
{
DWORD x,y;
extgcd(a, m, &x, &y);
return ( (m + x) % m);
}

在GMP的代码中，则是使用一次查表，再加上2次迭代来快速地求逆元。值得注意是，求逆元的迭代算法和求倒数的迭代算法非常类似。下面给出求模=2^30的逆元，下面的代码中，MASK_B20= (2^30-1)= 0x3fffffff
/* g_bi_tab is the multiplicative inverse of 2*i+1 mod 256,
ie. (g_bi_tab * (2*i+1)) % 256 == 1 */

const BYTE g_bi_tab =
{
0x01, 0xAB, 0xCD, 0xB7, 0x39, 0xA3, 0xC5, 0xEF,
0xF1, 0x1B, 0x3D, 0xA7, 0x29, 0x13, 0x35, 0xDF,
0xE1, 0x8B, 0xAD, 0x97, 0x19, 0x83, 0xA5, 0xCF,
0xD1, 0xFB, 0x1D, 0x87, 0x09, 0xF3, 0x15, 0xBF,
0xC1, 0x6B, 0x8D, 0x77, 0xF9, 0x63, 0x85, 0xAF,
0xB1, 0xDB, 0xFD, 0x67, 0xE9, 0xD3, 0xF5, 0x9F,
0xA1, 0x4B, 0x6D, 0x57, 0xD9, 0x43, 0x65, 0x8F,
0x91, 0xBB, 0xDD, 0x47, 0xC9, 0xB3, 0xD5, 0x7F,
0x81, 0x2B, 0x4D, 0x37, 0xB9, 0x23, 0x45, 0x6F,
0x71, 0x9B, 0xBD, 0x27, 0xA9, 0x93, 0xB5, 0x5F,
0x61, 0x0B, 0x2D, 0x17, 0x99, 0x03, 0x25, 0x4F,
0x51, 0x7B, 0x9D, 0x07, 0x89, 0x73, 0x95, 0x3F,
0x41, 0xEB, 0x0D, 0xF7, 0x79, 0xE3, 0x05, 0x2F,
0x31, 0x5B, 0x7D, 0xE7, 0x69, 0x53, 0x75, 0x1F,
0x21, 0xCB, 0xED, 0xD7, 0x59, 0xC3, 0xE5, 0x0F,
0x11, 0x3B, 0x5D, 0xC7, 0x49, 0x33, 0x55, 0xFF
};

DWORD _b_invert_C( DWORD *pN,int* pShift)
{
DWORDinv;
int _shift;
DWORD n=*pN;

assert (n != 0);
_shift=0;
while ( (n & 1)==0 )
{
n>>=1; _shift++;
}

*pShift=_shift;
*pN=n;
if ( n==1)
{
return 1;
}

inv = g_bi_tab[(n/2) & 0x7F];
inv = 2 * inv - inv * inv * n;
inv = 2 * inv - inv * inv * n;

return(inv & MASK_B30);
}

我还写了一个功能等价的汇编语言版本
;_declspec(naked) DWORD _b_invert_ALU( DWORD *pN, int *pShift)
__b_invert_ALU PROC

PARAM_np = 4
PARAM_pShift = 8
REG_rp EQU edi
REG_np EQU ebx
REG_shift EQU ecx

push REG_np
push REG_rp

;--------------------------------------------
movREG_np,
movREG_rp,

mov eax, DP
oreax,eax
jzinvalid_para

bsfREG_shift,eax ; check tail bit0, n= x * 2^ ecx
movDP , REG_shift ; store shift

shreax,cl ; n= (n >> shift)
movDP ,eax ; update n

cmpeax,1
jz funExit ; if ( n==1) return 1

shr eax,1 ; eax= n/2
and eax, 127

movzxecx, byte ptr _g_bi_tab ;inv= g_bi_tab

;Do "inv = 2 * inv - inv * inv * n"
mov eax,ecx ; eax=inv
mul    ecx ; edx:eax = inv*inv
mul DP ; edx:eax = inv*inv*n
shl ecx,1 ; inv=inv*2
sub ecx, eax ; inv=inv*2-inv*inv*n

; Do "inv = 2 * inv - inv * inv * n" again
mov eax,ecx ; eax=inv
mul    ecx ; edx:eax = inv*inv
mul DP ; edx:eax = inv*inv*n
shl ecx,1 ; inv=inv*2
sub ecx, eax ; inv=inv*2-inv*inv*n

mov eax,ecx
and eax,MASK_B30
funExit:
;--------------------------------------------
pop REG_rp
pop REG_np
ret

invalid_para:
mov eax,-1
mov DP ,eax
pop REG_rp
pop REG_np
ret
__b_invert_ALU ENDP

liangbch 发表于 2016-11-25 15:58:41

下面是主函数，可求一个长为len*30 bit的数除以d的商，当d是2的方幂时，调用 _mpn_B30_divrem_1_C 来实现，当d是2的方幂时，函数 _mpn_B30_divrem_1_C使用速度更快的移位算法来实现，否则使用Tudor Jebelean 的新方法，调用_mpn_B30_exact_div1_core_C 来实现。
// Exact Division,
// Reference Tudor Jebelean, <<An algorithm for exact division>>
DWORD mpn_B30_divexact_1_C (DWORD *rp, const DWORD* up, DWORD len, DWORD d)
{
DWORD di;
int shift;

assert (len >= 1);
assert (d != 0);

di= _b_invert_C( &d,&shift);
if ( di==1)
return_mpn_B30_divrem_1_C (rp,0,up,len,d,1);
else
return _mpn_B30_exact_div1_core_C (rp, up, len, d, di, shift);
}

// Exact Division,
DWORD _mpn_B30_exact_div1_core_C (DWORD *rp, const DWORD *up, DWORD n, DWORD d,DWORD di, int shift)
{
DWORDi;
DWORDc, l, u, u_next;
UINT64 prod;

assert (n >= 1);
assert (d != 0);

if (shift != 0)
{
c = 0;
u = up;
rp--;
for (i = 1; i < n; i++)
{
u_next = up;
u = ((u >> shift) | (u_next << (30 -shift))) & MASK_B30;
l = u - c;
c = (l > u); // if u<c, set c=1 else c=0
l = (l * di) & MASK_B30;
rp = l;

prod=(UINT64)(l)* (UINT64)(d);
c += (DWORD)(prod>>30);
u = u_next;
}

u = (u >> shift);
l = u - c;
l = (l * di) & MASK_B30;
rp = l;
}
else
{
u = up;
l = (u * di) & MASK_B30;
rp = l;
c = 0;
for (i = 1; i < n; i++)
{
prod=(UINT64)(l)* (UINT64)(d);
c += (DWORD)(prod>>30);
u = up;
l = u - c;
c = (l > u);// if u<c, set c=1 else c=0
l = (l * di) & MASK_B30;
rp = l;
}
}
return c;
}

作为一个追求运算速度的coder，速度达不到最快，是不会罢手的，于是又完成了使用ALU指令的汇编版本和使用XMM寄存器SSE2。下面是汇编代码，函数后缀ALU表示这个版本仅仅使用通用寄存器。

说明，当除数是奇数是（shift==0），其执行路径上的主要部分是包含一个12条指令循环，见下面的 shift0_body_start后面的指令。

;Exact Division,
; _declspec(naked)DWORD _mpn_B30_exact_div1_core_ALU (DWORD *rp, const DWORD *sp, DWORD n, DWORD d,DWORD di, int shift)

__mpn_B30_exact_div1_core_ALU PROC

PARAM_rp = 4
PARAM_sp = 8
PARAM_len = 12
PARAM_d = 16
PARAM_di = 20
PARAM_shift = 24

REG_I EQU ebx ; loop varible, the initial value is -len
REG_carry EQU ebp
REG_sp EQU esi
REG_rp EQU edi
REG_shift EQU ecx

OFS_SHIFT2 = 0
IF_SIZE = 20

push esi ; save registers
push edi
push ebx
push ebp
subesp,4

movREG_rp,
movREG_sp,
moveax,

xorREG_carry,REG_carry
or eax,eax ; 0 length,diretly return
jz FunExit1

movREG_shift, DP
movedx,30
subedx,REG_shift
mov, edx ; esp= 30 -shift

movREG_I,
leaREG_sp,
leaREG_rp,
neg REG_I

or REG_shift, REG_shift
jnzshift_not_0

shift_is_0:
mov eax, ; u = up;
mul DP ; l = (u * di)
and eax, MASK_B30 ; l = (u * di) & MASK_B30;
mov DP ,eax ; rp=l
xor REG_carry,REG_carry ; c=0
add REG_I,1
jz FunExit1

shift0_body_start:
mul DP ; edx: eax =(l)* (d);
shrd eax,edx,30 ; eax= (prod((l)* (d))>>30;
addREG_carry, eax ; c += (prod>>30);
moveax,DP ; u = up;
subeax, REG_carry ; l = u - c;
sbbREG_carry,REG_carry
imul eax, DP ; ecx= l = (l * di)
negREG_carry ; if u<c, set c=1 else c=0
andeax, MASK_B30 ; ecx= l = (l * di) & MASK_B30;
movDP ,eax ; rp =eax= l;
addREG_I,1
jnzshift0_body_start

FunExit1:
moveax, REG_carry
addesp,4
popebp ; restore registers
popebx
popedi
popesi
ret

shift_not_0:
xor REG_carry,REG_carry ; c=0
mov eax, DP ; u = up;
add REG_I, 1 ; i++
jztail_handle

shift_n0_body_start:
movecx, DP
movedx, DP ; edx=u_next
shledx,cl ; edx=u_next<< (30-shift)
movecx,
shreax, cl ; eax=(u >>shift)
or eax, edx
andeax, MASK_B30 ; u = ((u >> shift) | (u_next << (30 -shift))) & MASK_B30;
subeax, REG_carry ; l= u-c
sbbREG_carry,REG_carry
imul eax, DP ; edx:eax= l=l*di
negREG_carry ; if u<c, set c=1 else c=0
andeax, MASK_B30 ; eax= l = (l * di) & MASK_B30;
movDP ,eax ; eax=rp = l;
mulDP ; edx:eax=prod=(l)* (d);
shrd eax,edx,30 ; eax=(prod>>30)
addREG_carry, eax ; c += (prod>>30);
moveax, DP ; eax =u_next=
addREG_I,1
jnz shift_n0_body_start

tail_handle:
mov ecx,
shr eax, cl ; eax= u= u>>shift
sub eax, REG_carry ; eax=l=u-c
mul DP DP ; eax=l= l*di
and eax, MASK_B30 ; eax=l= (l * di) & MASK_B30;
mov DP , eax ; rp=l

mov eax, REG_carry ; return c

funExit2:
addesp,4
popebp ; restore registers
popebx
popedi
popesi
ret
__mpn_B30_exact_div1_core_ALU ENDP

liangbch 发表于 2016-11-25 16:17:44

divisor=odd number
   blk_len mpz_divexact_ui mpn_B30_divexact_1_C mpn_B30_divexact_1_ALU mpn_B30_divexact_1_SSE2
   8192 7.017 18.8639.810 10.234
   16384 5.697 15.3159.808 10.229
   32768 5.697 15.3159.808 10.231
   65536 5.701 15.3149.806 10.235

divisor=even number
   blk_len mpz_divexact_ui mpn_B30_divexact_1_C mpn_B30_divexact_1_ALU mpn_B30_divexact_1_SSE2
   8192 5.979 15.1929.434 10.722
   16384 5.968 15.1069.433 10.716
   32768 5.981 15.1049.437 10.713
   65536 5.973 15.2569.452 10.716

我的测试程序分别测试块长度为8K到64K是的乘法，即被除数为(blk_len*30)bit，并和GMP 32位版本的速度做了对比。因为GMP采用2^32次为基，所有其实际循环次数我的程序少1/16。但为了公平起见，blk_len仍然以30bit 为单位比较。上表给出的数据为平均每30bit的实际用时，单位为时钟周期。测试结果显示
1. ALU版本的速度快于C语言版本，优化没有白费
2. GMP版本的函数大幅领先我的程序，GMP版仅为5.7个周期，而我的ALU版为9.8个周期。
3. 当除数为偶数时，其执行路径上的循环块为21条指令，但是执行时间比只有12条指令的奇数分支还有少。
4. 对于这个程序，SSE2版本没有优势。

liangbch 发表于 2016-11-25 16:26:17

为了挖出GMP快的原因，使用最笨的办法，在调试器中打开反汇编窗口，一步一步跟踪，找出GMP对应函数的汇编代码，没费太多工夫，居然找到其代码。见下，为了便于阅读，加上了标号和少许注释。
fun_start:
D570:
mov eax, dword ptr
sub esp,14h
mov dword ptr ,esi
mov esi, dword ptr
mov dword ptr ,ebx
mov ebx,dword ptr
bsf ecx,eax
mov dword ptr ,ebp
shr eax,cl
mov edx,eax
shr eax,1
mov dword ptr ,edx
and eax,7fh
movzx ebp, byte ptr

lea eax,
imul ebp,ebp
mov dword ptr ,edi
mov edi, dword ptr
lea esi,
imul ebp, dword ptr
sub eax,ebp

lea ebp,
imul eax,eax
lea edi,
neg ebx
mov dword ptr ,edi
imul eax,dword ptr
subebp,eax
mov dword ptr ,ebp
mov eax,dword ptr
orecx,ecx ; shift=0

jneD60D ; shift!=0
jmpD5ED ; shift_0_update

D5DD:

shfit_is_0:
mul eax, dword ptr
mov eax,dword ptr
sub eax,ecx
sbb ecx,ecx
sub eax,edx
sbb ecx,0

D5ED:
shift_0_MID:
imul eax,dword ptr
mov dword ptr ,eax
neg ecx
inc ebx
jne D5DD

mov esi, dword ptr
mov edi, dword ptr
mov ebp, dword ptr
mov ebx, dword ptr
add esp,14h
ret

D60D:
shift_isnot_0:
xor ebp,ebp
xor edx,edx
inc ebx
jeD64B ; 转收尾处理

mov edi, dword ptr
shrd eax,edi, cl
jmp D633

D61C: ; loop_start
mov edi, dword ptr
mul eax, dword ptr
mov eax, dword ptr
shrd eax,edi, cl
sub eax,ebp
sbb ebp,ebp
sub eax,edx
sub ebp,0

D633:
imul eax, dword ptr
mov edi, dword ptr
neg ebp
mov dword ptr ,eax
inc ebx
jne D61c

mul eax, dword ptr
mov eax, dword ptr

D64B:
shr eax,cl
mov esi, dword ptr
sub eax,ebp
mov ebp, dword ptr
sub eax,edx
mov ebx, dword ptr
imul eax, dword ptr
mov dword ptr ,eax
mov edi, dword ptr
add esp, 14h
ret
:fun_end

可以看到，对于奇数分支，GMP中循环块包含如下11条指令。
mul eax, dword ptr
mov eax,dword ptr
sub eax,ecx
sbb ecx,ecx
sub eax,edx
sbb ecx,0
imul eax,dword ptr
mov dword ptr ,eax
neg ecx
inc ebx
jne D5DD
我的版本有12条指令，仅仅比GMP多1条指令，但是为什么速度相差很大呢？

liangbch 发表于 2016-11-25 16:48:34

1. 将
shld edx,eax,2
改为
shreax,30"
shledx,2
or edx,eax

2. 将
"a=a+b", "c-=a"
改为
"c-=b,c-=a"

最终指令数从12变为15条，增加了3条指令，但是速度却提升了，耗时从9.8个周期降到8.5个周期。

继续做实验，删除那3条移位指令，而不考虑其正确性，至此只比GMP的版本多一条指令，测试发现，耗时降为6.8个时钟周期。
最后删除一条and指令，和GMP完全相同，耗时降为6.06个时钟周期，比GMP慢了6%，和理论相符。因为GMP采用2^32，故平均耗时很小。

结论：
1. shld指令较慢，换成单精度指令shl,shr和or组合更好。
2. 降低指令依赖，尽可能晚的使用耗时指令的计算结果。
3. 虽然在多数情况下，减少指令条数可获得性能的提升。但依赖链比较长的情况下，具有较少的指令的指令组合可能比具有较多指令的指令组合（弱依赖）更慢。

liangbch 发表于 2016-11-25 16:51:44

最后，贴出SSE2版本的汇编代码。;Exact Division,
; _declspec(naked)DWORD __mpn_B30_exact_div1_core_SSE2 (DWORD *rp, const DWORD *sp, DWORD n, DWORD d,DWORD di, int shift)
__mpn_B30_exact_div1_core_SSE2 PROC
PARAM_rp = 4
PARAM_sp = 8
PARAM_len = 12
PARAM_d = 16
PARAM_di = 20
PARAM_shift = 24

REG_sp EQU esi
REG_rp EQU edi
REG_I EQU ebx

XMM_TMP EQU XMM0
XMM_TMP2 EQU XMM1
XMM_CARRY EQU XMM2

XMM_D EQU XMM4
XMM_DI EQU XMM5
XMM_SHIFT EQU XMM6
XMM_MASK EQU XMM7

IF_SIZE = 16
push esi ; save registers
push edi
push ebx
push ebp

movREG_rp,
movREG_sp,
moveax,
PXOR XMM_carry, XMM_carry
or eax,eax
jz funExit1 ; directly exit if len=0

mov eax, MASK_B30
mov ecx, DP
MOVDXMM_D , DP
MOVDXMM_DI, DP
MOVDXMM_MASK,eax
MOVDXMM_SHIFT, ecx

movREG_I,
leaREG_rp,
leaREG_sp,
negREG_I

or ecx, ecx
jnzshift_isnot_0

shift_is_0:
MOVD XMM_TMP, DP ; u = up;
PMULUDQ XMM_TMP,XMM_DI ; l = (u * di)
PAND XMM_TMP, XMM_MASK ; l = (u * di) & MASK_B30;
MOVD DP ,XMM_TMP ; rp=l
PXOR XMM_CARRY,XMM_CARRY ; c=0
addREG_I,1
jz funExit1

shift0_body_start:
PMULUDQ XMM_TMP,XMM_D ; tmp = prod =(UINT64)(l)* (UINT64)(d);
PSRLQ XMM_TMP,30 ; tmp=(DWORD)(prod>>30);

MOVD XMM_TMP2,DP ; u = sp;
PSUBD XMM_TMP2,XMM_carry ; l = u - c;
PSUBD XMM_TMP2,XMM_TMP ; tmp2 = u -c - (prod>>30)
MOVDQAXMM_CARRY,XMM_TMP2 ;carry=l
PMULUDQ XMM_TMP2,XMM_DI ;tmp= (l * di)
PSRLD XMM_CARRY,31 ; if u<c, set c=1 else c=0
PAND XMM_TMP2, XMM_MASK ; tmp= (l * di) & MASK_B30;
MOVD DP ,XMM_TMP2 ; rp =tmp2= l;
MOVDQAXMM_TMP, XMM_TMP2 ; tmp=l

add REG_I,1
jnz shift0_body_start

funExit1:
MOVDeax, XMM_carry
popebp ; restore registers
popebx
popedi
popesi
ret

shift_isnot_0:
PXOR XMM_CARRY,XMM_CARRY ; c=0
MOVD XMM_TMP, DP ; u = up;
addREG_I,1 ; i++
jz shift_n0_tail

shift_n0_body_start:
MOVD XMM_TMP2, DP ; u_next=rp
PSLLQ XMM_TMP2, 30 ; u_next<< 30
PADDQ XMM_TMP, XMM_TMP2 ; u= (u_next<<30)+u
PSRLQ XMM_TMP, XMM_SHIFT ; u = (u>>shift)
PAND XMM_TMP, XMM_MASK ; u = (u>>shift) & MASK_B30;
PSUBD XMM_TMP, XMM_CARRY ; l= u-c

MOVDQAXMM_CARRY, XMM_TMP
PMULUDQ XMM_TMP,XMM_DI ; tmp= l*di
PSRLD XMM_CARRY, 31 ; if u<c, set c=1 else c=0

MOVD XMM_TMP2, DP ; u_next=
PAND XMM_TMP, XMM_MASK ; tmp= l = (l * di) & MASK_B30;
MOVD DP ,XMM_TMP
PMULUDQ XMM_TMP,XMM_D ; tmp=l*d;
PSRLQXMM_TMP,30 ; tmp=prod>>30
PADDDXMM_CARRY,XMM_TMP ; c += (prod>>30);
MOVDQA XMM_TMP, XMM_TMP2 ; u= u_next

add REG_I,1
jnz shift_n0_body_start

shift_n0_tail:
PSRLQ XMM_TMP, XMM_SHIFT ; u = (u>>shift)
PSUBD XMM_TMP, XMM_CARRY ; tmp=l=u-c
PMULUDQ XMM_TMP, XMM_DI ; tmp=l= l*di
PAND XMM_TMP, XMM_MASK ; l= (l * di) & MASK_B30;
MOVD DP , XMM_TMP ; rp=l
MOVD eax, XMM_carry ; return c

funExit2:
popebp ; restore registers
popebx
popedi
popesi
ret
__mpn_B30_exact_div1_core_SSE2 ENDP

liangbch 发表于 2016-11-25 17:01:57

贴出算法发明者Tudor Jebelean 的论文，论文是1992年发表的，算是比较新的算法。

gxqcn 发表于 2016-12-4 11:10:27

这个算法，在 CSDN 上讨论“分数化小数”的擂台中，mathe提及了循环节“逆序输出”的算法，当时很震撼，原理应该与楼主说的一致。
后来，我借用该算法，专门写了个小程序：关于“擂台：分数化小数”，我专门写了一个快速程序（下载见内）
其原始讨论贴，我搜了下，为：擂台：分数化小数
上述讨论，是2007年5月份，距今已10年（严格点，是九年半）.

liangbch 发表于 2016-12-4 21:03:31

看了一下你的那个帖子，其实质是求离散对数。

页: [1]

数学研发论坛's Archiver

Exact division 严格整除法