Original poster | Posted 2010-6-1 20:02:08
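It is worth noting that the FFT test in the post above is a single-precision 1-D FFT. Double-precision arithmetic is much slower than single-precision, and FFT-based big-number multiplication needs double-precision floats, so the prospects for doing big-number arithmetic with CUDA may not be very good.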
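To make the precision requirement concrete, here is a minimal sketch of FFT-based multiplication in plain Python. It is not the code used in this thread; the names `fft` and `fft_multiply` and the limb base of 10^4 are assumptions chosen for illustration. Python's `float` is an IEEE double, so the sketch runs in double precision throughout.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def fft_multiply(x, y, base=10**4):
    """Multiply two non-negative ints by FFT convolution of their limbs."""
    def limbs(v):
        # Split v into little-endian digits in the given base.
        out = []
        while v:
            v, r = divmod(v, base)
            out.append(r)
        return out or [0]

    a, b = limbs(x), limbs(y)
    n = 1
    while n < len(a) + len(b):
        n *= 2
    fa = fft([complex(v) for v in a] + [0j] * (n - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (n - len(b)))
    prod = fft([p * q for p, q in zip(fa, fb)], invert=True)
    # The inverse transform is unnormalized, so divide by n. Every
    # convolution term must round to the exact integer; this is the
    # step where floating-point precision matters.
    coeffs = [round(c.real / n) for c in prod]
    # Python's big ints absorb the carries when the limbs are recombined.
    return sum(c * base**i for i, c in enumerate(coeffs))
```

With base-10^4 limbs a single limb product can already approach 10^8, which is more than the 2^24 (about 1.6 * 10^7) consecutive integers that a single-precision float can represent exactly, so the rounding step would not be reliable in single precision. The 53-bit significand of a double leaves much more headroom.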
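The following passage (quoted from the NVIDIA CUDA FAQ, version 2.1) indicates that double-precision throughput is only about 1/8 of single-precision throughput, counting multiply-add units per multiprocessor (one double-precision unit versus eight single-precision cores). By the quoted peak figures, 78 versus 933 GFLOPS, the gap is closer to 1/12.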
What are the technical specifications of the NVIDIA Tesla C1060 Processor ?
The Tesla C1060 consists of 30 multiprocessors, each of which is comprised of 8 scalar processor cores, for a total of 240 processors. There is 16KB of shared memory per multiprocessor.
Each processor has a floating point unit which is capable of performing a scalar multiply-add operation per clock cycle. Each multiprocessor also includes two special function units which execute operations such as rsqrt, rcp, log, exp and sin/cos.
The processors are clocked at 1.296 GHz. The peak computation rate accessible from CUDA is therefore around 933 GFLOPS (240 * 3 * 1.296). If you include the graphics functionality that is accessible from CUDA (such as texture interpolation), the FLOPs rate is much higher.
Each multiprocessor includes a single double precision multiply-add unit, so the double precision floating point performance is around 78 GFLOPS (30 * 2 * 1.296).
The card includes 4 GB of device memory. The maximum observed bandwidth between system and device memory is about 6GB/second with a PCI-Express 2.0 compatible motherboard.
Other GPUs in the Tesla series have the same basic architecture, but vary in the number of multiprocessors, clock speed, memory bus width and amount of memory.
See the programming guide for more details.
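The peak figures in the quoted FAQ can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name `peak_gflops` is made up for illustration, and the unit counts, flops per clock and clock rate are taken directly from the FAQ text.

```python
def peak_gflops(units, flops_per_clock, clock_ghz):
    """Peak rate = number of units * flops per unit per clock * clock (GHz)."""
    return units * flops_per_clock * clock_ghz

# Single precision: 240 cores, 3 flops per clock as counted in the FAQ.
sp = peak_gflops(240, 3, 1.296)   # about 933 GFLOPS

# Double precision: one multiply-add unit (2 flops) on each of 30 multiprocessors.
dp = peak_gflops(30, 2, 1.296)    # about 78 GFLOPS

print(f"single: {sp:.1f} GFLOPS, double: {dp:.1f} GFLOPS")
print(f"peak ratio: about {sp / dp:.0f}x")
print(f"multiply-add unit ratio: {240 // 30}x")
```

The peak ratio comes out at about 12, while counting multiply-add units alone (240 versus 30) gives 8, which matches the 1/8 figure.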