GotoBLAS author Kazushige Goto has left TACC
Hi All,
I put GotoBLAS2-1.13.tar.gz on TACC web page. Since I leave TACC today, this is the final version of GotoBLAS2 distribution (except I find fatal mistake).
1. Fixed incorrect calculation of ZDOT
2. Added kernel for Fujitsu SPARC VI/VII
Now total number of registered users is more than 16,000 and I would like to say thank you for using my library.
My email account will be disabled soon and I’m not able to answer your question. But this mailing list is still available, TACC staff probably will answer your question.
Thanks and good bye,
Kazushige Goto
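This is the last message Kazushige Goto left on TACC's GotoBLAS mailing list, and it shows that he has left TACC.
GotoBLAS, written by Kazushige Goto, is described as the fastest BLAS library currently available; I wonder whether he will move into GPU computing after leaving TACC.
GotoBLAS - "the, currently, fastest implementation of the Basic Linear Algebra Subroutines (BLAS)"
I looked him up on Wikipedia: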
Kazushige Goto
From Wikipedia, the free encyclopedia
Kazushige Gotō (後藤和茂, Gotō Kazushige) was a research associate at the Texas Advanced Computing Center at the University of Texas at Austin when he famously hand-optimized assembly routines for supercomputing and PC platforms that outperform the best compiler generated codes. Several of the fastest supercomputers in the world still use his implementation of the Basic Linear Algebra Subprograms (BLAS) known as “Goto BLAS”. He joined Microsoft's Technical Computing Group in 2010.
According to this, he has joined Microsoft's Technical Computing Group.
I also had a look at the TACC site:
GotoBLAS2 has been released by the Texas Advanced Computing Center as open source software under the BSD license. This product is no longer under active development by TACC, but it is being made available to the community to use, study, and extend. GotoBLAS2 uses new algorithms and memory techniques for optimal performance of the BLAS routines. The changes in this final version target new architecture features in microprocessors and interprocessor communication techniques; also, NUMA controls enhance multi-threaded execution of BLAS routines on node. The library features optimal performance on the following platforms:
Intel Nehalem and Atom systems
VIA Nanoprocessor
AMD Shanghai and Istanbul
The library includes the following features:
•Configurations for a variety of hardware platforms
•Incorporation of features of many ISAs (Instruction Set Architecture)
•Implementation of NUMA controls to assure best process affinity and memory policy
•Dynamic detection of multiple architecture components, which can be included in a single binary (for binary distributions)
Originally developed by Kazushige Goto but is no longer under active development. For questions regarding the code, contact Dr. Kent Milfeld.
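All the signs indicate that Kazushige Goto has indeed left TACC.
Coincidentally, I came across Goto BLAS in the book 《并行算法导论》 (Introduction to Parallel Algorithms); the following is excerpted from its appendix.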
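Several optimized implementations of the BLAS library are available today. The main ones for the Intel/Linux platform are:
BLAS reference implementation
A set of standard Fortran subroutines, which can be downloaded from the BLAS home page: http://www.netlib.org/blas/index.html;
ATLAS (Automatically Tuned Linear Algebra Software)
It automatically generates an optimized BLAS library on different platforms; its home page is http://math-atlas.sourceforge.net/;
Goto library
A high-performance BLAS library developed by Kazushige Goto; its home page is http://www.cs.utexas.edu/users/flame/goto/;
MKL (Math Kernel Library)
A library of basic mathematical routines that Intel optimizes specifically for its own CPUs, BLAS included; its home page is http://www.intel.com/cd/software/products/asmo-na/eng/perflib/mkl/index.htm.
The first three libraries can be downloaded free of charge, whereas Intel MKL is commercial software: commercial use requires a purchased license, while non-commercial use is free.
I have long wanted to get hold of a book on parallel algorithms.
Facing the GotoBLAS2 source code feels like facing a hedgehog: I still have not found a good way in. It is hard to trace a clear call path. Perhaps gprof can offer some clues.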
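It does not seem to use the traditional Fortran BLAS routines (the ones in the reference folder).....
It is not as clearly laid out as the original BLAS. For the x86 cores alone there are nearly 20 variants of the general matrix multiply (gemm) kernel.
http://blog.csdn.net/G_Spider/archive/2010/12/04/6054764.aspx Heh, I have dug this thread up again.
I have just built libgoto2.dll (in core2 mode) and it can be used from VC, but I had been stuck on how to call it.
Here I finally found the approach: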
Naming conventions of CLAPACK routines
libgoto2.dll is not wrapped in the CBLAS_*** interface.
Every routine is exported in the following form:
EXPORTS
caxpy=caxpy_@1
caxpy_=caxpy_@2
CAXPY=caxpy_@3
ccopy=ccopy_@4
ccopy_=ccopy_@5
...
For the record, here is one of the calling examples:
#include <stdio.h>

/* Fortran-style LAPACK prototypes: every argument is passed by pointer. */
void dgesv_(const int *N, const int *nrhs, double *A, const int *lda, int
    *ipiv, double *b, const int *ldb, int *info);
void dgels_(const char *trans, const int *M, const int *N, const int *nrhs,
    double *A, const int *lda, double *b, const int *ldb, double *work,
    const int * lwork, int *info);

int
main(void)
{
    /* 3x3 matrix A
     * 76 25 11
     * 27 89 51
     * 18 60 32
     */
    /* A is stored column by column, as Fortran expects. */
    double A[9] = {76, 27, 18, 25, 89, 60, 11, 51, 32};
    double b[3] = {10, 7, 43};
    int N = 3;
    int nrhs = 1;
    int lda = 3;
    int ipiv[3]; /* pivot indices, one per row */
    int ldb = 3;
    int info;

    dgesv_(&N, &nrhs, A, &lda, ipiv, b, &ldb, &info);
    if (info == 0) /* success: b now holds the solution */
        printf("The solution is %lf %lf %lf\n", b[0], b[1], b[2]);
    else
        fprintf(stderr, "dgesv_ fails %d\n", info);
    return info;
}
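The file is a bit large, but I still want to share it; anyone interested can dig into it, heh. (Download)
I have also put together a header file, gotoblas2.h.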
Example:
#include <stdio.h>
#include "gotoblas2.h"
#pragma comment(lib, "libgoto2.lib")

/* The prototype now comes from gotoblas2.h:
 * void dgesv_(const int *N, const int *nrhs, double *A, const int *lda, int
 *     *ipiv, double *b, const int *ldb, int *info);
 */
int
main(void)
{
    /* 3x3 matrix A
     * 76 25 11
     * 27 89 51
     * 18 60 32
     */
    /* A is stored column by column, as Fortran expects. */
    double A[9] = {76, 27, 18, 25, 89, 60, 11, 51, 32};
    double b[3] = {10, 7, 43};
    int N = 3;
    int nrhs = 1;
    int lda = 3;
    int ipiv[3]; /* pivot indices, one per row */
    int ldb = 3;
    int info;

    dgesv_(&N, &nrhs, A, &lda, ipiv, b, &ldb, &info);
    if (info == 0) /* success: b now holds the solution */
        printf("The solution is %lf %lf %lf\n", b[0], b[1], b[2]);
    else
        fprintf(stderr, "dgesv_ fails %d\n", info);
    return info;
}
Example 2:
For the meaning of the routine, see the reference.
#include <stdio.h>
#include "gotoblas2.h"
#pragma comment(lib, "libgoto2.lib")

/////////////////////////////////
// Alternative: declare it directly
// typedef struct {
//     float r, i;
// } complex;
//
// complex cdotu(int *, complex *, int *, complex *, int *);
/////////////////////////////////
int main(void)
{
    myccomplex_t x[] = { {1.0, 2.0}, {2.0, 3.0}, {3.0, 4.0} };
    myccomplex_t y[] = { {2.0, 1.0}, {4.0, 0.0}, {0.0, 9.0} };
    myccomplex_t z = {0, 0};
    int n = 3, incx = 1, incy = 1;

    /* Unconjugated complex dot product: z = sum of x[k] * y[k] */
    z = cdotu(&n, (float *)x, &incx, (float *)y, &incy);
    printf("%.2f+%.2fi\n", z.r, z.i); //-28.00+44.00i
    return 0;
}
Header file download
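The calls above were not exciting enough, so let's try one of the GotoBLAS2 kernels directly.
sasum computes the sum of the absolute values of a single-precision floating-point array.
For example, for float X1[] = {1.0, 2.0, 7.0, -8.0}; the sum of the absolute values of all elements is 18.
The sasum kernel below has been rewritten syntactically so that it can be assembled with ml.exe. The CPU must support SSE2 (the haddps instructions near the end additionally require SSE3). Save the following code as sasum_k.asm:
; GotoBLAS2 kernel: sasum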
.686p
.xmm ; SSE2 support
.model flat,c
option casemap :none
.code
sasum_k proc
push esi
push ebx
mov ecx, ;//N
mov esi, ;//X
mov ebx, ;//INCX
xorps xmm0, xmm0
test ecx, ecx
jle loc_2A0
test ebx, ebx
jle loc_2A0
xorps xmm1, xmm1
pcmpeqb xmm3, xmm3
psrld xmm3, 1
lea ebx, ds:0
cmp ebx, 4
jnz loc_1F0
sub esi, 0FFFFFF80h
cmp ecx, 3
jle loc_1B8
test esi, 4
jz short loc_68
movss xmm0, dword ptr
andps xmm0, xmm3
add esi, 4
dec ecx
jle loc_290
nop
lea esi,
loc_68:
test esi, 8
jz short loc_88
movsd xmm1, qword ptr
andps xmm1, xmm3
add esi, 8
sub ecx, 2
jle loc_290
lea esi,
loc_88:
mov eax, ecx
sar eax, 5
jle loc_148
movaps xmm4, xmmword ptr
movaps xmm5, xmmword ptr
movaps xmm6, xmmword ptr
movaps xmm7, xmmword ptr
dec eax
jle short loc_100
db 66h
nop
loc_A8:
andps xmm4, xmm3
addps xmm0, xmm4
movaps xmm4, xmmword ptr
andps xmm5, xmm3
addps xmm1, xmm5
movaps xmm5, xmmword ptr
andps xmm6, xmm3
addps xmm0, xmm6
movaps xmm6, xmmword ptr
andps xmm7, xmm3
addps xmm1, xmm7
movaps xmm7, xmmword ptr
andps xmm4, xmm3
addps xmm0, xmm4
movaps xmm4, xmmword ptr
andps xmm5, xmm3
addps xmm1, xmm5
movaps xmm5, xmmword ptr
andps xmm6, xmm3
addps xmm0, xmm6
movaps xmm6, xmmword ptr
andps xmm7, xmm3
addps xmm1, xmm7
movaps xmm7, xmmword ptr
sub esi, 0FFFFFF80h
dec eax
jg short loc_A8
lea esi,
loc_100:
andps xmm4, xmm3
addps xmm0, xmm4
movaps xmm4, xmmword ptr
andps xmm5, xmm3
addps xmm1, xmm5
movaps xmm5, xmmword ptr
andps xmm6, xmm3
addps xmm0, xmm6
movaps xmm6, xmmword ptr
andps xmm7, xmm3
addps xmm1, xmm7
movaps xmm7, xmmword ptr
andps xmm4, xmm3
addps xmm0, xmm4
andps xmm5, xmm3
addps xmm1, xmm5
andps xmm6, xmm3
addps xmm0, xmm6
andps xmm7, xmm3
addps xmm1, xmm7
sub esi, 0FFFFFF80h
nop
lea esi,
loc_148:
test ecx, 10h
jz short loc_180
movaps xmm4, xmmword ptr
andps xmm4, xmm3
addps xmm0, xmm4
movaps xmm5, xmmword ptr
andps xmm5, xmm3
addps xmm1, xmm5
movaps xmm6, xmmword ptr
andps xmm6, xmm3
addps xmm0, xmm6
movaps xmm7, xmmword ptr
andps xmm7, xmm3
addps xmm1, xmm7
add esi, 40h ; '@'
nop
lea esi,
loc_180:
test ecx, 8
jz short loc_1A0
movaps xmm4, xmmword ptr
andps xmm4, xmm3
addps xmm0, xmm4
movaps xmm5, xmmword ptr
andps xmm5, xmm3
addps xmm1, xmm5
add esi, 20h ; ' '
nop
loc_1A0:
test ecx, 4
jz short loc_1B8
movaps xmm4, xmmword ptr
andps xmm4, xmm3
addps xmm0, xmm4
add esi, 10h
lea esi,
loc_1B8:
test ecx, 2
jz short loc_1D0
movsd xmm4, qword ptr
andps xmm4, xmm3
addps xmm1, xmm4
add esi, 8
db 66h
nop
loc_1D0:
test ecx, 1
jz loc_290
movss xmm4, dword ptr
andps xmm4, xmm3
addps xmm0, xmm4
jmp loc_290
align 10h
loc_1F0:
mov eax, ecx
sar eax, 3
jle short loc_270
mov esi, esi
lea edi,
loc_200:
movss xmm4, dword ptr
add esi, ebx
andps xmm4, xmm3
addss xmm0, xmm4
movss xmm5, dword ptr
add esi, ebx
andps xmm5, xmm3
addss xmm1, xmm5
movss xmm6, dword ptr
add esi, ebx
andps xmm6, xmm3
addss xmm0, xmm6
movss xmm7, dword ptr
add esi, ebx
andps xmm7, xmm3
addss xmm1, xmm7
movss xmm4, dword ptr
add esi, ebx
andps xmm4, xmm3
addss xmm0, xmm4
movss xmm5, dword ptr
add esi, ebx
andps xmm5, xmm3
addss xmm1, xmm5
movss xmm6, dword ptr
add esi, ebx
andps xmm6, xmm3
addss xmm0, xmm6
movss xmm7, dword ptr
add esi, ebx
andps xmm7, xmm3
addss xmm1, xmm7
dec eax
jg short loc_200
nop
lea esi,
loc_270:
and ecx, 7
jle short loc_290
lea esi,
lea edi,
loc_280:
movss xmm4, dword ptr
andps xmm4, xmm3
addss xmm0, xmm4
add esi, ebx
dec ecx
jg short loc_280
loc_290:
addps xmm0, xmm1
haddps xmm0, xmm0
haddps xmm0, xmm0
nop
lea esi,
loc_2A0:
movss dword ptr , xmm0
fld dword ptr
pop ebx
pop esi
ret
sasum_k endp
end
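Next, assemble the code above with ml.exe from VS2008 or later to produce sasum_k.obj.
The batch file for the command line is as follows:
@echo off
call "D:\Microsoft Visual Studio 10.0\VC\bin\vcvars32.bat"
echo on
ml /c /coff sasum_k.asm
pause
After that, the C code only needs to be linked with this sasum_k.obj file.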
The C code is as follows:
#include <stdio.h>
#include <stdlib.h> /* system() */

float sasum_k(int, float *, int); /* declaration of the kernel in sasum_k.asm */

int
main(void)
{
    /* keep the array on a 16-byte boundary */
    __declspec(align(16)) float X1[16] = {1.0, 2.0, 7.0, -8.0, -5.0, -10.0, -9.0, 10.0, 1.0, 2.0, 7.0, -8.0, -5.0, -10.0, -9.0, 10.0};
    float I1;
    int N;
    int INCX;

    N = 16;
    INCX = 1;
    I1 = sasum_k(N, X1, INCX);
    printf(" The IASUM is %.3f\n", I1);
    system("pause");
    return 0;
}
/* result, listed as "N sum" for N = 1..16
1 1;2 3;3 10;4 18; 5 23 ;6 33;7 42;8 52;9 53;10 55;11 62;12 70;13 75;14 85;15 94;16 104
*/
Batch file with the build commands (save it as makeC.bat):
@echo off
set VS=D:\vcPackaa
call "D:\Microsoft Visual Studio 10.0\VC\bin\vcvars32.bat"
echo on
cl /c test.c
link /subsystem:console test.obj sasum_k.obj
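pause
Sigh, what a pity that something this good has come to an end.
It seems someone in China is carrying on the GotoBLAS work, although it has been renamed OpenBLAS.
Reply to #1 G-Spider: do you know of any good BLAS libraries for sparse matrix computation? Preferably ones that support OpenMP.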