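NVIDIA CUDA has now reached version 5.0, and the latest Linux installer is packaged as a single executable file. The installation procedure differs slightly from earlier releases, but the overall steps are much the same.
This post shows how to install NVIDIA CUDA on Ubuntu Linux 12.04 LTS. NVIDIA's official CUDA 5.0 release currently supports only Ubuntu Linux 10.04 and 11.10 and has no build for 12.04, so we download the closest one, the 11.10 build, and install that.
First, create the directory where the CUDA Toolkit will be installed and change its owner to yourself (seal is my user name; substitute your own). This step is optional; I do it because I prefer to avoid root privileges where I can: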
sudo mkdir /usr/local/cuda-5.0
sudo chown seal:seal /usr/local/cuda-5.0
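The chown above hard-codes my user name. As a small sketch (not part of the original steps), the owner can be taken from the current account with id instead. The names current_owner, make_owned_dir and the RUN_AS_ROOT variable are hypothetical, introduced here only for illustration:

```shell
# Build a numeric "uid:gid" spec for the current account.
# chown accepts numeric IDs, so no user name has to be hard-coded.
current_owner() {
    printf '%s:%s\n' "$(id -u)" "$(id -g)"
}

# Create a directory and hand it to the current user.
# RUN_AS_ROOT is a hypothetical knob: set it to "sudo" for system paths.
make_owned_dir() {
    target=$1
    owner=$(current_owner)
    ${RUN_AS_ROOT:-} mkdir -p "$target" &&
    ${RUN_AS_ROOT:-} chown "$owner" "$target"
}

# Intended use for this post (commented out because it needs sudo):
# RUN_AS_ROOT=sudo make_owned_dir /usr/local/cuda-5.0
```

The owner is computed before sudo runs, so the directory ends up owned by the invoking user rather than by root.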
Download the CUDA 5.0 installer, then make it executable and run it:
chmod +x cuda_5.0.35_linux_64_ubuntu11.10-1.run
./cuda_5.0.35_linux_64_ubuntu11.10-1.run
Do you accept the previously read EULA? (accept/decline/quit):
This asks whether you accept the license terms shown above. Type "accept".
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 304.54? ((y)es/(n)o/(q)uit):
This asks whether to install the NVIDIA driver. If the driver is already installed on the system you can skip it; otherwise type "y".
Please enter the root password:
Enter the root password, which is needed to install the driver.
Install the CUDA 5.0 Toolkit? ((y)es/(n)o/(q)uit):
This asks whether to install the CUDA 5.0 Toolkit. Type "y".
Enter Toolkit Location [ default is /usr/local/cuda-5.0 ]:
Enter the Toolkit install location. To use the default path, just press Enter.
Install the CUDA 5.0 Samples? ((y)es/(n)o/(q)uit):
This asks whether to install the CUDA 5.0 samples. Type "y".
Enter CUDA Samples Location [ default is /usr/local/cuda-5.0/samples ]:
Enter the install path for the samples. To use the default path, just press Enter.
When the installation finishes, the installer prints a summary:
Driver: Installed
Toolkit: Installed in /usr/local/cuda-5.0
Samples: Installation Failed. Missing required libraries.
* Please make sure your PATH includes /usr/local/cuda-5.0/bin
* Please make sure your LD_LIBRARY_PATH
* for 32-bit Linux distributions includes /usr/local/cuda-5.0/lib
* for 64-bit Linux distributions includes /usr/local/cuda-5.0/lib64:/lib
* OR
* for 32-bit Linux distributions add /usr/local/cuda-5.0/lib
* for 64-bit Linux distributions add /usr/local/cuda-5.0/lib64 and /lib
* to /etc/ld.so.conf and run ldconfig as root
* To uninstall CUDA, remove the CUDA files in /usr/local/cuda-5.0
* Installation Complete
The summary shows that the samples were not installed. Scroll back to the earlier messages, which include these two lines:
Installing the CUDA Toolkit in /usr/local/cuda-5.0 …
Missing required library libglut.so
So the GLUT library is missing. If freeglut is not yet installed on the system, install it with apt:
sudo apt-get install freeglut3-dev
If it is already installed, skip the step above. The freeglut3-dev package puts libglut.so in /usr/lib/x86_64-linux-gnu, where the NVIDIA installer apparently does not look, so we create a symbolic link directly in /usr/lib:
sudo ln -s /usr/lib/x86_64-linux-gnu/libglut.so /usr/lib/libglut.so
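Before creating the link it can help to confirm where libglut.so actually lives. The following is a minimal sketch; find_lib_dir is a hypothetical helper, and the list of directories is my assumption about a 64-bit Ubuntu system, not something the installer documents:

```shell
# Print the first directory (from the arguments after the file name)
# that contains the named file; return non-zero if none does.
find_lib_dir() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -e "$dir/$name" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate directories are an assumption; adjust them for your system.
if found=$(find_lib_dir libglut.so /usr/lib /usr/lib64 /usr/lib/x86_64-linux-gnu); then
    echo "libglut.so found in $found"
else
    echo "libglut.so not found; install freeglut3-dev first"
fi
```

If the reported directory is /usr/lib/x86_64-linux-gnu, the symbolic link shown above is the workaround to apply.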
Then run the installer again. The driver and the CUDA Toolkit are already installed, so this time only the samples need to be installed:
./cuda_5.0.35_linux_64_ubuntu11.10-1.run
That completes the installation. The libglut.so link created earlier is no longer needed and can be deleted:
sudo rm /usr/lib/libglut.so
Next, set the environment variables:
echo 'export PATH=$PATH:/usr/local/cuda-5.0/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-5.0/lib64:/lib' >> ~/.bashrc
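CUDA is now ready to use. Because the variables above were only just added to ~/.bashrc, load them into the current shell first if you want to use CUDA right away: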
source ~/.bashrc
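One caveat with echo and >> is that running the same commands twice adds the same lines to ~/.bashrc twice. A minimal sketch of an idempotent alternative follows; append_once is a hypothetical helper name, not something provided by CUDA or Ubuntu:

```shell
# Append a line to a file only if that exact line is not already present.
# grep -x matches whole lines, -F treats the text literally (no regex).
append_once() {
    line=$1
    file=$2
    touch "$file"
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Intended use for this post (commented out so the sketch does not
# touch your real ~/.bashrc when pasted as-is):
# append_once 'export PATH=$PATH:/usr/local/cuda-5.0/bin' ~/.bashrc
# append_once 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-5.0/lib64:/lib' ~/.bashrc
```

Calling it repeatedly with the same arguments leaves a single copy of each line in the file.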
Next, let's build the samples to test the installation. First copy them into a directory of your own:
mkdir ~/tmp
cp -r /usr/local/cuda-5.0/samples ~/tmp/
Then build them:
cd ~/tmp/samples
make
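Building every sample takes a while. As a sketch, GNU make can run jobs in parallel with -j and continue past individual failures with -k. I have not verified that the CUDA 5.0 sample makefiles build cleanly in parallel, so treat this as an assumption; build_jobs and build_samples are hypothetical helper names:

```shell
# Number of parallel make jobs: the CPU count if nproc is available, else 1.
build_jobs() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        echo 1
    fi
}

# Build the samples in the given directory in parallel.
# -C changes into the directory, -k keeps going after a failed target.
build_samples() {
    samples_dir=$1
    make -C "$samples_dir" -k -j "$(build_jobs)"
}

# Intended use for this post:
# build_samples ~/tmp/samples
```

If the parallel build misbehaves, the plain make shown above remains the safe choice.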
If all goes well, once the build finishes the compiled sample executables will be in ~/tmp/samples/bin/linux/release.
cd ~/tmp/samples/bin/linux/release
ls
Output:
BlackScholes matrixMulCUBLAS
FDTD3d matrixMulDrv
FunctionPointers matrixMulDynlinkJIT
HSOpticalFlow mergeSort
MC_EstimatePiInlineP nbody
MC_EstimatePiInlineQ newdelete
MC_EstimatePiP oceanFFT
MC_EstimatePiQ particles
MC_SingleAsianOptionP postProcessGL
Mandelbrot ptxjit
MersenneTwisterGP11213 quasirandomGenerator
MonteCarloMultiGPU radixSortThrust
SobelFilter randomFog
SobolQRNG recursiveGaussian
alignedTypes reduction
asyncAPI scalarProd
bandwidthTest scan
batchCUBLAS segmentationTreeThrust
bicubicTexture shfl_scan
bilateralFilter simpleAssert
bindlessTexture simpleAtomicIntrinsics
binomialOptions simpleCUBLAS
boxFilter simpleCUFFT
boxFilterNPP simpleCallback
cdpAdvancedQuicksort simpleCubemapTexture
cdpLUDecomposition simpleDevLibCUBLAS
cdpQuadtree simpleGL
cdpSimplePrint simpleHyperQ
cdpSimpleQuicksort simpleIPC
clock simpleLayeredTexture
concurrentKernels simpleMPI
conjugateGradient simpleMultiCopy
conjugateGradientPrecond simpleMultiGPU
convolutionFFT2D simpleP2P
convolutionSeparable simplePitchLinearTexture
convolutionTexture simplePrintf
cppIntegration simpleSeparateCompilation
cudaOpenMP simpleStreams
dct8x8 simpleSurfaceWrite
deviceQuery simpleTemplates
deviceQueryDrv simpleTexture
dwtHaar1D simpleTexture3D
dxtc simpleTextureDrv
eigenvalues simpleVoteIntrinsics
fastWalshTransform simpleZeroCopy
fluidsGL smokeParticles
freeImageInteropNPP sortingNetworks
grabcutNPP stereoDisparity
histEqualizationNPP template
histogram template_runtime
imageDenoising threadFenceReduction
imageSegmentationNPP threadMigration
inlinePTX transpose
interval vectorAdd
lineOfSight vectorAddDrv
marchingCubes volumeFiltering
matrixMul volumeRender
Pick a few to try:
./deviceQueryDrv
Output:
./deviceQueryDrv Starting…
CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)
Device 0: "Quadro FX 4600"
CUDA Driver Version: 5.0
CUDA Capability Major/Minor version number: 1.0
Total amount of global memory: 768 MBytes (804978688 bytes)
(12) Multiprocessors x ( 8) CUDA Cores/MP: 96 CUDA Cores
GPU Clock rate: 1200 MHz (1.20 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 384-bit
Max Texture Dimension Sizes 1D=(8192) 2D=(65536,32768) 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Texture alignment: 256 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: No with 0 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: No
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Matrix multiplication using CUBLAS:
./matrixMulCUBLAS
Output:
[Matrix Multiply CUBLAS] -- Starting…
GPU Device 0: "Quadro FX 4600" with compute capability 1.0
MatrixA(160,320), MatrixB(160,320), MatrixC(160,320)
Computing result using CUBLAS…done.
Performance= 56.99 GFlop/s, Time= 0.287 msec, Size= 16384000 Ops
Computing result using host CPU…done.
Comparing CUBLAS Matrix Multiply with CPU results: OK
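The Performance line can be checked by hand. The printed Size matches 2 x 160 x 320 x 160, the usual operation count for a matrix product; which dimension plays which role is my reading of the numbers, not something the output states. A small sketch of the arithmetic, with hypothetical helper names matmul_ops and gflops:

```shell
# Floating-point operations for a matrix product with dimensions m, n, k:
# one multiply and one add per term, hence 2*m*n*k.
matmul_ops() {
    echo $(( 2 * $1 * $2 * $3 ))
}

# GFlop/s from an operation count and a time in milliseconds.
gflops() {
    awk -v ops="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", ops / (ms / 1000) / 1e9 }'
}

matmul_ops 160 320 160      # prints 16384000, the Size reported above
gflops 16384000 0.287       # prints 57.09
```

The small gap between 57.09 and the reported 56.99 is what you would expect if the sample computes the rate from an unrounded time and prints the time rounded to three decimals.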
Eigenvalues:
./eigenvalues
Output:
Starting eigenvalues
GPU Device 0: "Quadro FX 4600" with compute capability 1.0
Matrix size: 2048 x 2048
Precision: 0.000010
Iterations to be timed: 100
Result filename: 'eigenvalues.dat'
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 17.917166 ms
Average time step 2, one intervals: 5.539938 ms
Average time step 2, mult intervals: 0.013190 ms
Average time TOTAL: 23.516399 ms
Test Succeeded!
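Running the samples one at a time gets tedious. Here is a rough sketch that runs a chosen list and reports which ones exited cleanly. It assumes a sample signals failure through a non-zero exit status; run_samples and the LOG_DIR variable are hypothetical names introduced for this example:

```shell
# Run each named sample from bin_dir and print PASS, FAIL or SKIP.
# Output of each run goes to $LOG_DIR/<name>.log (LOG_DIR should be an
# absolute path; it defaults to /tmp).
# Returns non-zero if any sample failed.
run_samples() {
    bin_dir=$1
    shift
    log_dir=${LOG_DIR:-/tmp}
    failed=0
    for name in "$@"; do
        if [ ! -x "$bin_dir/$name" ]; then
            echo "SKIP $name"
        elif (cd "$bin_dir" && "./$name" > "$log_dir/$name.log" 2>&1); then
            echo "PASS $name"
        else
            echo "FAIL $name"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Intended use for this post:
# run_samples ~/tmp/samples/bin/linux/release deviceQueryDrv matrixMulCUBLAS eigenvalues
```

Samples that open a window, such as the OpenGL ones, are better run by hand, so I would keep the list to the non-graphical programs.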