一个快速得到bit位索引的算法

gxqcn · 发表于 2008-3-3 16:26:00

马上注册，结交更多好友，享用更多功能，让你轻松玩转社区。

您需要登录才可以下载或查看，没有账号？欢迎注册

×

已知32位中有n(0 <= n <= 32)个1。要求：快速找处所有的1位索引数值。一般的算法，都需要大量的跳转指令才能完成，而CPU一旦分支预测失败，将会有很大的惩罚，应尽可能避免。下面是我想到的一种完全无跳转指令的算法，抛出来与大家共享：

// 作者：GxQ，2008-03-03
// 返回返回当前bit=1的数目
DWORD find_by_GxQ(DWORD code, int index[33])
{
__asm
{
mov ecx, code;
mov edx, dword ptr index;
xor eax, eax;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 0;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 1;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 2;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 3;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 4;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 5;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 6;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 7;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 8;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 9;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 10;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 11;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 12;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 13;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 14;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 15;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 16;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 17;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 18;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 19;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 20;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 21;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 22;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 23;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 24;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 25;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 26;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 27;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 28;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 29;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 30;
adc eax, 0;
shr ecx, 1;
mov dword ptr[edx + 4*eax], 31;
adc eax, 0;
// 休止符
mov dword ptr[edx + 4*eax], -1;
}
}

复制代码

以上代码涉及的汇编指令非常少，稍微有些ALU汇编指令基础的朋友很容易揣摩出其门道。本文部分摘自：http://topic.csdn.net/u/20080217 ... c-9734e7a2b41b.html，上述代码在其68楼，是本人原创的。

可否有更美妙的算法？

liangbch · 发表于 2008-3-3 16:49:06

楼主的算法确实巧妙。

mathe · 发表于 2008-3-3 18:29:59

用汇编的话，还可以用bsr指令。

liangbch · 发表于 2008-3-3 18:43:57

当 1 的个数较少时，bsr可能有优势。当用bsr指令时，算法大概是这样的，找到一位1,然后用当前索引对数组赋值，然后清掉当前位重复上一步骤，知道所有的1全部找完。因为bsr比平常的指令慢，当1的个数较多时，需要多次调用bsr指令，故速度可能不理想。

mathe · 发表于 2008-3-3 18:53:04

1的数目多时的确优势不明显。上面代码有一个问题是所有指令之间有很强的数据依赖关系。现在CPU很强大，没有依赖关系的指令往往可以非常好的并行处理。所以可以考虑这样一个算法。将32位数拆分成两个16位数据。首先通过一段代码快速计算出低16位中1的数目。然后再分别对两部分找索引值，将它们的代码交叉起来，就可以并行处理了。只是这样对寄存器压力会比较大，估计要动用esi,edi等

无心人 · 发表于 2008-3-3 20:46:40

BSR慢在了可能是阻塞指令猜想的至于GXQ的并不影响并行因为存在寄存器重命名

mathe · 发表于 2008-3-4 07:54:24

数据依赖在eax上面，再寄存器重命名也没有用，这里有数据真依赖存在

gxqcn · 发表于 2008-3-4 07:54:47

bsr指令我早考虑过，但似乎并不适合解决该问题。寄存器重命名是需要前后没有依赖关系的，而我的代码确实有很强的数据依赖关系。这个题目由于“索引”需要判断累加，似乎很难做到并行且高效。

gxqcn · 发表于 2008-3-4 07:57:26

发完贴，才发现与 mathe 同时在回复一个问题。

mathe · 发表于 2008-3-4 08:40:57

我们知道,像下面的代码（来源自http://www.everything2.com/index.pl?node_id=1181258）可以比较快速计算出一个int中比特1的数目，稍微修改一下，就可以变成计算short中比特1的数目的代码（可能需要4个时钟周期？？）通过预先计算后半个short部分比特1的数目，我们应该可以将代码拆分成两个可以并行的部分。

unsigned numbits(unsigned int i)
{
unsigned int const MASK1 = 0x55555555;
unsigned int const MASK2 = 0x33333333;
unsigned int const MASK4 = 0x0f0f0f0f;
unsigned int const MASK8 = 0x00ff00ff;
unsigned int const MASK16 = 0x0000ffff;
i = (i&MASK1 ) + (i>>1 &MASK1 );
i = (i&MASK2 ) + (i>>2 &MASK2 );
i = (i&MASK4 ) + (i>>4 &MASK4 );
i = (i&MASK8 ) + (i>>8 &MASK8 );
i = (i&MASK16) + (i>>16&MASK16);
return i;
}

复制代码

账号		自动登录	找回密码
密码			欢迎注册

[讨论] 一个快速得到bit位索引的算法

马上注册，结交更多好友，享用更多功能，让你轻松玩转社区。

评分

撞车了:)