
[Discussion] GotoBLAS author Kazushige Goto has left TACC

Posted 2010-12-5 17:53:29
Hi All,

I put GotoBLAS2-1.13.tar.gz on TACC web page. Since I leave TACC today, this is the final version of GotoBLAS2 distribution (except I find fatal mistake).

1. Fixed incorrect calculation of ZDOT
2. Added kernel for Fujitsu SPARC VI/VII

Now total number of registered users is more than 16,000 and I would like to say thank you for using my library. My email account will be disabled soon and I'm not able to answer your question. But this mailing list is still available, TACC staff probably will answer your question.

Thanks and good bye,
Kazushige Goto

This is the last mail Kazushige Goto left on TACC's GotoBLAS mailing list, and it shows that he has left TACC.

GotoBLAS, written by Kazushige Goto, is currently regarded as the fastest BLAS library. I wonder whether he will join the GPU computing camp after leaving TACC. GotoBLAS: "the, currently, fastest implementation of the Basic Linear Algebra Subroutines (BLAS)".

I looked him up on Wikipedia:

Kazushige Goto (from Wikipedia, the free encyclopedia)

Kazushige Gotō (後藤和茂, Gotō Kazushige) was a research associate at the Texas Advanced Computing Center at the University of Texas at Austin when he famously hand-optimized assembly routines for supercomputing and PC platforms that outperform the best compiler generated codes. Several of the fastest supercomputers in the world still use his implementation of the Basic Linear Algebra Subprograms (BLAS) known as "Goto BLAS". He joined Microsoft's Technical Computing Group in 2010.

By that account, he has joined Microsoft's Technical Computing Group.

I also checked the TACC site:

GotoBLAS2 has been released by the Texas Advanced Computing Center as open source software under the BSD license. This product is no longer under active development by TACC, but it is being made available to the community to use, study, and extend. GotoBLAS2 uses new algorithms and memory techniques for optimal performance of the BLAS routines. The changes in this final version target new architecture features in microprocessors and interprocessor communication techniques; also, NUMA controls enhance multi-threaded execution of BLAS routines on node.

The library features optimal performance on the following platforms:
•Intel Nehalem and Atom systems
•VIA Nanoprocessor
•AMD Shanghai and Istanbul

The library includes the following features:
•Configurations for a variety of hardware platforms
•Incorporation of features of many ISAs (Instruction Set Architecture)
•Implementation of NUMA controls to assure best process affinity and memory policy
•Dynamic detection of multiple architecture components, which can be included in a single binary (for binary distributions)

Originally developed by Kazushige Goto but is no longer under active development. For questions regarding the code, contact Dr. Kent Milfeld.

All signs indicate that Kazushige Goto has indeed left TACC.
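Do not let the doubts of the crowd block your own insight; do not let your own opinion dismiss the words of others.
Do not let small private favors harm the greater good; do not use public opinion to gratify private feelings.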
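Posted 2010-12-9 08:09:15
By coincidence, I just came across Goto BLAS in 《并行算法导论》 (Introduction to Parallel Algorithms). The following is excerpted from the book's appendix:

Several optimized implementations of the BLAS library exist today. The main ones for Intel/Linux platforms are:
•BLAS reference implementation: a set of standard Fortran subroutines, downloadable from the BLAS home page: http://www.netlib.org/blas/index.html
•ATLAS (Automatically Tuned Linear Algebra Software): automatically generates an optimized BLAS library on different platforms; home page: http://math-atlas.sourceforge.net/
•Goto library: a high-performance BLAS library developed by Kazushige Goto; home page: http://www.cs.utexas.edu/users/flame/goto/
•MKL (Math Kernel Library): a library of basic math routines that Intel optimized specifically for its own CPUs, BLAS included; home page: http://www.intel.com/cd/software ... rflib/mkl/index.htm

The first three can be downloaded free of charge. Intel MKL is commercial software: commercial use requires a purchase, while non-commercial use is free.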
Posted 2010-12-9 08:13:19
I have always wanted to get hold of a book on parallel algorithms.
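Original poster | Posted 2010-12-10 10:46:40
Facing the GotoBLAS2 source code is like facing a hedgehog: I still have not found a good point of entry, and it is hard to trace a clear call path. Perhaps gprof can give some clues. It does not seem to use the traditional Fortran BLAS routines (the ones in the reference folder). Unlike the original BLAS, whose logic is easy to follow, GotoBLAS2 has nearly 20 variants of the general matrix multiply (gemm) sub-block kernel for x86 cores alone. http://blog.csdn.net/G_Spider/archive/2010/12/04/6054764.aspx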
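Original poster | Posted 2011-6-20 21:19:34
Heh, digging this thread up again. I have just built libgoto2.dll (core2 mode) and it now works under VC. I had been stuck on how to call it, and finally found the idea here: Naming conventions of CLAPACK routines.

libgoto2.dll is not wrapped in a CBLAS_*** layer. Each routine is exported under three aliases of the same Fortran-style symbol (lowercase, lowercase with a trailing underscore, and uppercase):

    EXPORTS
    caxpy=caxpy_ @1
    caxpy_=caxpy_ @2
    CAXPY=caxpy_ @3
    ccopy=ccopy_ @4
    ccopy_=ccopy_ @5
    ...

Let me back up one calling example here: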
    #include <stdio.h>

    /* Fortran-style LAPACK prototypes: every argument is passed by pointer. */
    void dgesv_(const int *N, const int *nrhs, double *A, const int *lda,
                int *ipiv, double *b, const int *ldb, int *info);
    void dgels_(const char *trans, const int *M, const int *N, const int *nrhs,
                double *A, const int *lda, double *b, const int *ldb,
                double *work, const int *lwork, int *info);

    int main(void)
    {
        /* 3x3 matrix A, stored column by column (Fortran order)
         * 76 25 11
         * 27 89 51
         * 18 60 32
         */
        double A[9] = {76, 27, 18, 25, 89, 60, 11, 51, 32};
        double b[3] = {10, 7, 43};
        int N = 3;
        int nrhs = 1;
        int lda = 3;
        int ipiv[3];
        int ldb = 3;
        int info;

        dgesv_(&N, &nrhs, A, &lda, ipiv, b, &ldb, &info);
        if (info == 0) /* success: b now holds the solution */
            printf("The solution is %lf %lf %lf\n", b[0], b[1], b[2]);
        else
            fprintf(stderr, "dgesv_ fails %d\n", info);
        return info;
    }
The file is a bit large, but I will share it anyway for anyone who wants to study it, heh.
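For readers who want to see what the dgesv_ call above is doing with the column-major array, here is a minimal sketch in plain C. It is not the GotoBLAS2 or LAPACK code: solve_colmajor is a hypothetical helper written for this illustration, it handles a single right-hand side only, and it uses textbook Gaussian elimination with partial pivoting. The return value imitates the meaning of info (0 on success, k > 0 when the k-th pivot is exactly zero).

```c
#include <math.h>
#include <stddef.h>

/* Element (i, j) of a column-major matrix with leading dimension lda. */
#define AT(A, lda, i, j) ((A)[(size_t)(i) + (size_t)(j) * (size_t)(lda)])

/* Hypothetical helper, not part of GotoBLAS2 or LAPACK.
 * Solves A*x = b. A (n-by-n, column-major) is overwritten by its LU factors,
 * b is overwritten by x, and ipiv receives the 1-based pivot rows. */
int solve_colmajor(int n, double *A, int lda, int *ipiv, double *b)
{
    for (int k = 0; k < n; ++k) {
        /* Partial pivoting: find the largest magnitude in column k. */
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (fabs(AT(A, lda, i, k)) > fabs(AT(A, lda, p, k)))
                p = i;
        ipiv[k] = p + 1;
        if (AT(A, lda, p, k) == 0.0)
            return k + 1;               /* singular: zero pivot */

        if (p != k) {                   /* swap rows p and k of A and b */
            for (int j = 0; j < n; ++j) {
                double t = AT(A, lda, k, j);
                AT(A, lda, k, j) = AT(A, lda, p, j);
                AT(A, lda, p, j) = t;
            }
            double t = b[k];
            b[k] = b[p];
            b[p] = t;
        }

        /* Eliminate the entries below the pivot, keeping the multipliers. */
        for (int i = k + 1; i < n; ++i) {
            double m = AT(A, lda, i, k) / AT(A, lda, k, k);
            AT(A, lda, i, k) = m;
            for (int j = k + 1; j < n; ++j)
                AT(A, lda, i, j) -= m * AT(A, lda, k, j);
            b[i] -= m * b[k];
        }
    }

    /* Back substitution with the upper triangle. */
    for (int i = n - 1; i >= 0; --i) {
        for (int j = i + 1; j < n; ++j)
            b[i] -= AT(A, lda, i, j) * b[j];
        b[i] /= AT(A, lda, i, i);
    }
    return 0;
}
```

Calling solve_colmajor(3, A, 3, ipiv, b) with the A and b from the example should leave in b a solution comparable to the one dgesv_ returns, up to rounding.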
Original poster | Posted 2011-6-21 10:15:40
I put together a header file, gotoblas2.h. Example:
    #include <stdio.h>
    #include "gotoblas2.h"
    #pragma comment(lib,"libgoto2.lib")

    /* The prototype now comes from gotoblas2.h:
     * void dgesv_(const int *N, const int *nrhs, double *A, const int *lda,
     *             int *ipiv, double *b, const int *ldb, int *info);
     */

    int main(void)
    {
        /* 3x3 matrix A, stored column by column (Fortran order)
         * 76 25 11
         * 27 89 51
         * 18 60 32
         */
        double A[9] = {76, 27, 18, 25, 89, 60, 11, 51, 32};
        double b[3] = {10, 7, 43};
        int N = 3;
        int nrhs = 1;
        int lda = 3;
        int ipiv[3];
        int ldb = 3;
        int info;

        dgesv_(&N, &nrhs, A, &lda, ipiv, b, &ldb, &info);
        if (info == 0) /* success: b now holds the solution */
            printf("The solution is %lf %lf %lf\n", b[0], b[1], b[2]);
        else
            fprintf(stderr, "dgesv_ fails %d\n", info);
        return info;
    }
Example 2 (for the meaning of the routine, see the reference):
    #include <stdio.h>
    #include "gotoblas2.h"
    #pragma comment(lib,"libgoto2.lib")
    /////////////////////////////////
    // Direct declaration, as an alternative to the header:
    // typedef struct {
    //     float r, i;
    // } complex;
    //
    // complex cdotu(int *, complex *, int *, complex *, int *);
    /////////////////////////////////
    int main(void)
    {
        myccomplex_t x[] = { {1.0,2.0}, {2.0,3.0}, {3.0,4.0} };
        myccomplex_t y[] = { {2.0,1.0}, {4.0,0.0}, {0.0,9.0} };
        myccomplex_t z = {0, 0};
        int n = 3, incx = 1, incy = 1;

        /* Unconjugated complex dot product: sum of x[k] * y[k]. */
        z = cdotu(&n, (float*)x, &incx, (float*)y, &incy);
        printf("%.2f+%.2fi\n", z.r, z.i); //-28.00+44.00i
        return 0;
    }
Header file download: gotoblas2.rar (3.46 KB)
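The value -28.00+44.00i in Example 2 can be checked without the library. The sketch below is a plain C stand-in for the unconjugated dot product that cdotu computes. The names ref_complex and ref_cdotu are hypothetical and chosen for this illustration only; the real myccomplex_t type lives in the poster's gotoblas2.h, whose contents are not shown in the thread. Unlike the Fortran-style routine, the stand-in takes its scalar arguments by value.

```c
/* Hypothetical stand-in for the single-precision complex type. */
typedef struct {
    float r, i;
} ref_complex;

/* Unconjugated complex dot product: sum over k of x[k] * y[k].
 * Follows the usual BLAS convention that a negative increment walks the
 * vector backwards, starting from its far end. */
ref_complex ref_cdotu(int n, const ref_complex *x, int incx,
                      const ref_complex *y, int incy)
{
    ref_complex acc = {0.0f, 0.0f};
    if (n <= 0)
        return acc;

    int ix = incx >= 0 ? 0 : (1 - n) * incx;
    int iy = incy >= 0 ? 0 : (1 - n) * incy;
    for (int k = 0; k < n; ++k) {
        /* (a + bi)(c + di) = (ac - bd) + (ad + bc)i */
        acc.r += x[ix].r * y[iy].r - x[ix].i * y[iy].i;
        acc.i += x[ix].r * y[iy].i + x[ix].i * y[iy].r;
        ix += incx;
        iy += incy;
    }
    return acc;
}
```

Term by term, the three products are 0+5i, 8+12i and -36+27i, which add up to -28+44i and agree with the comment in the example.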
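Original poster | Posted 2011-6-23 18:52:01
The calls above were not exciting enough, so let's try a GotoBLAS2 kernel directly. sasum computes the sum of the absolute values of a single-precision float array. For example, for float X1[4]={1.0, 2.0, 7.0, -8.0}; the sum of the absolute values of all elements is 18. The sasum kernel below has been rewritten in MASM syntax. The CPU must support SSE2, and also SSE3, because the final reduction uses haddps. Save the following code as sasum_k.asm: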
    ; GotoBLAS2 sasum kernel
    .686p
    .xmm                                ; enable SSE instructions (haddps below is SSE3)
    .model flat,c
    option casemap :none
    .code
    sasum_k proc
        push esi
        push ebx
        mov ecx, [esp+8+4]              ; N
        mov esi, [esp+8+8]              ; X
        mov ebx, [esp+8+0ch]            ; INCX
        xorps xmm0, xmm0                ; accumulator 0
        test ecx, ecx
        jle loc_2A0                     ; N <= 0: return 0
        test ebx, ebx
        jle loc_2A0                     ; INCX <= 0: return 0
        xorps xmm1, xmm1                ; accumulator 1
        pcmpeqb xmm3, xmm3
        psrld xmm3, 1                   ; xmm3 = 7FFFFFFFh in each lane (sign-bit mask)
        lea ebx, ds:0[ebx*4]            ; stride in bytes
        cmp ebx, 4
        jnz loc_1F0                     ; non-unit stride: scalar path
        sub esi, 0FFFFFF80h             ; esi += 128
        cmp ecx, 3
        jle loc_1B8
        test esi, 4
        jz short loc_68
        movss xmm0, dword ptr [esi-80h]
        andps xmm0, xmm3
        add esi, 4
        dec ecx
        jle loc_290
        nop
        lea esi, [esi+0]
    loc_68:
        test esi, 8
        jz short loc_88
        movsd xmm1, qword ptr [esi-80h]
        andps xmm1, xmm3
        add esi, 8
        sub ecx, 2
        jle loc_290
        lea esi, [esi+0]
    loc_88:
        mov eax, ecx
        sar eax, 5                      ; number of 32-element blocks
        jle loc_148
        movaps xmm4, xmmword ptr [esi-80h]
        movaps xmm5, xmmword ptr [esi-70h]
        movaps xmm6, xmmword ptr [esi-60h]
        movaps xmm7, xmmword ptr [esi-50h]
        dec eax
        jle short loc_100
        db 66h
        nop
    loc_A8:
        andps xmm4, xmm3
        addps xmm0, xmm4
        movaps xmm4, xmmword ptr [esi-40h]
        andps xmm5, xmm3
        addps xmm1, xmm5
        movaps xmm5, xmmword ptr [esi-30h]
        andps xmm6, xmm3
        addps xmm0, xmm6
        movaps xmm6, xmmword ptr [esi-20h]
        andps xmm7, xmm3
        addps xmm1, xmm7
        movaps xmm7, xmmword ptr [esi-10h]
        andps xmm4, xmm3
        addps xmm0, xmm4
        movaps xmm4, xmmword ptr [esi]
        andps xmm5, xmm3
        addps xmm1, xmm5
        movaps xmm5, xmmword ptr [esi+10h]
        andps xmm6, xmm3
        addps xmm0, xmm6
        movaps xmm6, xmmword ptr [esi+20h]
        andps xmm7, xmm3
        addps xmm1, xmm7
        movaps xmm7, xmmword ptr [esi+30h]
        sub esi, 0FFFFFF80h
        dec eax
        jg short loc_A8
        lea esi, [esi+0]
    loc_100:
        andps xmm4, xmm3
        addps xmm0, xmm4
        movaps xmm4, xmmword ptr [esi-40h]
        andps xmm5, xmm3
        addps xmm1, xmm5
        movaps xmm5, xmmword ptr [esi-30h]
        andps xmm6, xmm3
        addps xmm0, xmm6
        movaps xmm6, xmmword ptr [esi-20h]
        andps xmm7, xmm3
        addps xmm1, xmm7
        movaps xmm7, xmmword ptr [esi-10h]
        andps xmm4, xmm3
        addps xmm0, xmm4
        andps xmm5, xmm3
        addps xmm1, xmm5
        andps xmm6, xmm3
        addps xmm0, xmm6
        andps xmm7, xmm3
        addps xmm1, xmm7
        sub esi, 0FFFFFF80h
        nop
        lea esi, [esi+0]
    loc_148:
        test ecx, 10h                   ; 16 elements left?
        jz short loc_180
        movaps xmm4, xmmword ptr [esi-80h]
        andps xmm4, xmm3
        addps xmm0, xmm4
        movaps xmm5, xmmword ptr [esi-70h]
        andps xmm5, xmm3
        addps xmm1, xmm5
        movaps xmm6, xmmword ptr [esi-60h]
        andps xmm6, xmm3
        addps xmm0, xmm6
        movaps xmm7, xmmword ptr [esi-50h]
        andps xmm7, xmm3
        addps xmm1, xmm7
        add esi, 40h
        nop
        lea esi, [esi+0]
    loc_180:
        test ecx, 8                     ; 8 elements left?
        jz short loc_1A0
        movaps xmm4, xmmword ptr [esi-80h]
        andps xmm4, xmm3
        addps xmm0, xmm4
        movaps xmm5, xmmword ptr [esi-70h]
        andps xmm5, xmm3
        addps xmm1, xmm5
        add esi, 20h
        nop
    loc_1A0:
        test ecx, 4                     ; 4 elements left?
        jz short loc_1B8
        movaps xmm4, xmmword ptr [esi-80h]
        andps xmm4, xmm3
        addps xmm0, xmm4
        add esi, 10h
        lea esi, [esi+0]
    loc_1B8:
        test ecx, 2                     ; 2 elements left?
        jz short loc_1D0
        movsd xmm4, qword ptr [esi-80h]
        andps xmm4, xmm3
        addps xmm1, xmm4
        add esi, 8
        db 66h
        nop
    loc_1D0:
        test ecx, 1                     ; 1 element left?
        jz loc_290
        movss xmm4, dword ptr [esi-80h]
        andps xmm4, xmm3
        addps xmm0, xmm4
        jmp loc_290
        align 10h
    loc_1F0:                            ; strided path, 8 elements per iteration
        mov eax, ecx
        sar eax, 3
        jle short loc_270
        mov esi, esi
        lea edi, [edi+0]
    loc_200:
        movss xmm4, dword ptr [esi]
        add esi, ebx
        andps xmm4, xmm3
        addss xmm0, xmm4
        movss xmm5, dword ptr [esi]
        add esi, ebx
        andps xmm5, xmm3
        addss xmm1, xmm5
        movss xmm6, dword ptr [esi]
        add esi, ebx
        andps xmm6, xmm3
        addss xmm0, xmm6
        movss xmm7, dword ptr [esi]
        add esi, ebx
        andps xmm7, xmm3
        addss xmm1, xmm7
        movss xmm4, dword ptr [esi]
        add esi, ebx
        andps xmm4, xmm3
        addss xmm0, xmm4
        movss xmm5, dword ptr [esi]
        add esi, ebx
        andps xmm5, xmm3
        addss xmm1, xmm5
        movss xmm6, dword ptr [esi]
        add esi, ebx
        andps xmm6, xmm3
        addss xmm0, xmm6
        movss xmm7, dword ptr [esi]
        add esi, ebx
        andps xmm7, xmm3
        addss xmm1, xmm7
        dec eax
        jg short loc_200
        nop
        lea esi, [esi+0]
    loc_270:
        and ecx, 7
        jle short loc_290
        lea esi, [esi+0]
        lea edi, [edi+0]
    loc_280:
        movss xmm4, dword ptr [esi]
        andps xmm4, xmm3
        addss xmm0, xmm4
        add esi, ebx
        dec ecx
        jg short loc_280
    loc_290:                            ; combine accumulators and reduce the lanes
        addps xmm0, xmm1
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        nop
        lea esi, [esi+0]
    loc_2A0:                            ; return the float result on the x87 stack
        movss dword ptr [esp+8+4], xmm0
        fld dword ptr [esp+8+4]
        pop ebx
        pop esi
        ret
    sasum_k endp
    end
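The central trick in the kernel is that it never branches on the sign of an element: pcmpeqb followed by psrld builds the mask 7FFFFFFFh, and each andps clears the sign bit. A scalar C sketch of the same idea is below. The names abs_by_mask and ref_sasum are hypothetical, the code assumes float is a 32-bit IEEE-754 value, and it only mirrors the kernel's two-accumulator pattern and its early exit for N <= 0 or INCX <= 0, not its unrolling or alignment handling.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Clear the sign bit of a 32-bit IEEE-754 float, the scalar analogue of
 * andps with a 7FFFFFFFh mask. Hypothetical helper for illustration. */
static float abs_by_mask(float v)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    bits &= 0x7FFFFFFFu;
    memcpy(&v, &bits, sizeof v);
    return v;
}

/* Sum of absolute values of n floats spaced incx elements apart. */
float ref_sasum(int n, const float *x, int incx)
{
    float acc0 = 0.0f, acc1 = 0.0f;   /* two accumulators, like xmm0 and xmm1 */
    int i = 0;

    if (n <= 0 || incx <= 0)
        return 0.0f;                  /* same early exit as the kernel */

    for (; i + 1 < n; i += 2) {
        acc0 += abs_by_mask(x[(size_t)i * (size_t)incx]);
        acc1 += abs_by_mask(x[(size_t)(i + 1) * (size_t)incx]);
    }
    if (i < n)                        /* odd element left over */
        acc0 += abs_by_mask(x[(size_t)i * (size_t)incx]);

    return acc0 + acc1;
}
```

For the four-element array from the post, {1.0, 2.0, 7.0, -8.0}, ref_sasum(4, X1, 1) gives 18. Because float addition is not associative, a different accumulation order may differ from the kernel in the last bits on general data, though not on small integer-valued inputs like these.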
Original poster | Posted 2011-6-23 19:01:21
Next, assemble the code above with ml.exe from VS2008 or later to produce sasum_k.obj. The command-line batch file is:
    @echo off
    call "D:\Microsoft Visual Studio 10.0\VC\bin\vcvars32.bat"
    echo on
    ml /c /coff sasum_k.asm
    pause
After that, the C code only needs to link against sasum_k.obj. The C code is:
    #include <stdio.h>
    #include <stdlib.h> /* system() */

    float sasum_k(int, float *, int); /* implemented in sasum_k.asm */

    int main(void)
    {
        __declspec(align(16)) float X1[16] = {1.0, 2.0, 7.0, -8.0, -5.0, -10.0, -9.0, 10.0,
                                              1.0, 2.0, 7.0, -8.0, -5.0, -10.0, -9.0, 10.0};
        float I1;
        int N;
        int INCX;

        N = 16;
        INCX = 1;
        I1 = sasum_k(N, X1, INCX);
        printf(" The SASUM is %.3f\n", I1);
        system("pause");
        return 0;
    }
    /* Results, listed as "N sum":
    1 1; 2 3; 3 10; 4 18; 5 23; 6 33; 7 42; 8 52; 9 53; 10 55; 11 62; 12 70; 13 75; 14 85; 15 94; 16 104
    */
Build batch file (save as makeC.bat):
    @echo off
    set VS=D:\vcPackaa
    call "D:\Microsoft Visual Studio 10.0\VC\bin\vcvars32.bat"
    echo on
    cl /c test.c
    link /subsystem:console test.obj sasum_k.obj
    pause
Attachment: GotoTest.rar (36.23 KB)
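Posted 2011-8-28 08:33:16
What a pity that such a good piece of work has come to an end. It seems someone in China is continuing the GotoBLAS work, though under the new name OpenBLAS.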
Posted 2011-8-28 14:58:10
1# G-Spider: Does the original poster know of any good BLAS libraries for sparse matrix computation? Ideally one that supports OpenMP.