NVIDIA CUDA has now reached version 5.0, and the latest Linux installer is packaged as a single executable file. The installation process differs slightly from earlier releases, but it is largely the same.

This post demonstrates how to install NVIDIA CUDA on Ubuntu Linux 12.04 LTS. The official CUDA 5.0 release only supports Ubuntu Linux 10.04 and 11.10, with no build for 12.04, so we download the closest match, the 11.10 build, and install that instead.

First, create the directory where the CUDA Toolkit will be installed, and change its owner to your own user (my username here is seal; this step is optional, I just habitually avoid using root privileges where possible):

sudo mkdir /usr/local/cuda-5.0
sudo chown seal:seal /usr/local/cuda-5.0

Download the CUDA 5.0 installer, then make it executable and run it:

chmod +x cuda_5.0.35_linux_64_ubuntu11.10-1.run
./cuda_5.0.35_linux_64_ubuntu11.10-1.run


The installer first displays a long license text, then asks a series of questions. Answer them in order:


Do you accept the previously read EULA? (accept/decline/quit):

Whether you accept the license terms shown above; type "accept".


Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 304.54? ((y)es/(n)o/(q)uit):

Whether to install the NVIDIA driver. If it is already installed on your system you can skip this; otherwise type "y".


Please enter the root password:

Enter the root password; it is needed to install the driver.


Install the CUDA 5.0 Toolkit? ((y)es/(n)o/(q)uit):

Whether to install the CUDA 5.0 Toolkit; type "y".


Enter Toolkit Location [ default is /usr/local/cuda-5.0 ]:

The Toolkit install location; press Enter to accept the default path.


Install the CUDA 5.0 Samples? ((y)es/(n)o/(q)uit):

Whether to install the CUDA 5.0 sample programs; type "y".


Enter CUDA Samples Location [ default is /usr/local/cuda-5.0/samples ]:

The samples install location; press Enter to accept the default path.

When the installation finishes, a summary is printed:


Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-5.0
Samples:  Installation Failed. Missing required libraries.
* Please make sure your PATH includes /usr/local/cuda-5.0/bin
* Please make sure your LD_LIBRARY_PATH
*   for 32-bit Linux distributions includes /usr/local/cuda-5.0/lib
*   for 64-bit Linux distributions includes /usr/local/cuda-5.0/lib64:/lib
* OR
*   for 32-bit Linux distributions add /usr/local/cuda-5.0/lib
*   for 64-bit Linux distributions add /usr/local/cuda-5.0/lib64 and /lib
* to /etc/ld.so.conf and run ldconfig as root
* To uninstall CUDA, remove the CUDA files in /usr/local/cuda-5.0
* Installation Complete

The samples apparently failed to install. Scroll back through the earlier output and you will find these two lines:


Installing the CUDA Toolkit in /usr/local/cuda-5.0 ...
   Missing required library libglut.so

It turns out the glut library is missing. If freeglut has not been installed on your system yet, install it with apt:

sudo apt-get install freeglut3-dev

If it is already installed, skip the step above. The freeglut3-dev package places libglut.so under /usr/lib/x86_64-linux-gnu, where NVIDIA's installer apparently cannot find it, so create a symbolic link directly in /usr/lib:

sudo ln -s /usr/lib/x86_64-linux-gnu/libglut.so /usr/lib/libglut.so
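If you want to double-check the link, `ls -l /usr/lib/libglut.so` should show it pointing at the x86_64-linux-gnu copy. The snippet below sketches the same link-and-verify pattern against a throwaway file, so it runs on any machine; on the real system the paths are the ones above:

```shell
# Demonstrate the fix on stand-in paths: create a fake "library",
# link to it from another location, and confirm the link resolves.
dir=$(mktemp -d)
touch "$dir/real-libglut.so"                  # stands in for the real library
ln -s "$dir/real-libglut.so" "$dir/libglut.so"
[ -e "$dir/libglut.so" ] && echo "LINK OK"    # -e follows the symlink to its target
rm -r "$dir"
```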

Now run the installer again to install the samples. Since the driver and the CUDA Toolkit are already installed, answer "n" to those two questions and "y" only for the samples:

./cuda_5.0.35_linux_64_ubuntu11.10-1.run

That completes the installation. The libglut.so link we just created is no longer needed and can be deleted:

sudo rm /usr/lib/libglut.so

Next, set the environment variables:

echo 'export PATH=$PATH:/usr/local/cuda-5.0/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-5.0/lib64:/lib' >> ~/.bashrc

With that, CUDA is ready to use. Since the environment variables were only just added, remember to reload your shell configuration before using them:

source ~/.bashrc
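To confirm the new PATH entry is visible in the current shell, you can check it directly. This is a minimal sketch; it assumes the default install path /usr/local/cuda-5.0 and re-exports the variable so it also works in a fresh shell:

```shell
# Append the CUDA bin directory (the same line that was added to ~/.bashrc).
export PATH="$PATH:/usr/local/cuda-5.0/bin"

# Split PATH on ':' and look for an exact match of the CUDA bin directory.
if echo "$PATH" | tr ':' '\n' | grep -qx '/usr/local/cuda-5.0/bin'; then
    echo "PATH OK"
fi
```

If everything is in place, `nvcc --version` should now also work.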

Next, let's build the sample programs to test the installation. First copy the samples into a directory of your own:

mkdir ~/tmp
cp -r /usr/local/cuda-5.0/samples ~/tmp/

Then build them:

cd ~/tmp/samples
make

If all goes well, the compiled sample executables will appear in ~/tmp/samples/bin/linux/release once the build finishes.

cd ~/tmp/samples/bin/linux/release
ls

The output:

BlackScholes              matrixMulCUBLAS
FDTD3d                    matrixMulDrv
FunctionPointers          matrixMulDynlinkJIT
HSOpticalFlow             mergeSort
MC_EstimatePiInlineP      nbody
MC_EstimatePiInlineQ      newdelete
MC_EstimatePiP            oceanFFT
MC_EstimatePiQ            particles
MC_SingleAsianOptionP     postProcessGL
Mandelbrot                ptxjit
MersenneTwisterGP11213    quasirandomGenerator
MonteCarloMultiGPU        radixSortThrust
SobelFilter               randomFog
SobolQRNG                 recursiveGaussian
alignedTypes              reduction
asyncAPI                  scalarProd
bandwidthTest             scan
batchCUBLAS               segmentationTreeThrust
bicubicTexture            shfl_scan
bilateralFilter           simpleAssert
bindlessTexture           simpleAtomicIntrinsics
binomialOptions           simpleCUBLAS
boxFilter                 simpleCUFFT
boxFilterNPP              simpleCallback
cdpAdvancedQuicksort      simpleCubemapTexture
cdpLUDecomposition        simpleDevLibCUBLAS
cdpQuadtree               simpleGL
cdpSimplePrint            simpleHyperQ
cdpSimpleQuicksort        simpleIPC
clock                     simpleLayeredTexture
concurrentKernels         simpleMPI
conjugateGradient         simpleMultiCopy
conjugateGradientPrecond  simpleMultiGPU
convolutionFFT2D          simpleP2P
convolutionSeparable      simplePitchLinearTexture
convolutionTexture        simplePrintf
cppIntegration            simpleSeparateCompilation
cudaOpenMP                simpleStreams
dct8x8                    simpleSurfaceWrite
deviceQuery               simpleTemplates
deviceQueryDrv            simpleTexture
dwtHaar1D                 simpleTexture3D
dxtc                      simpleTextureDrv
eigenvalues               simpleVoteIntrinsics
fastWalshTransform        simpleZeroCopy
fluidsGL                  smokeParticles
freeImageInteropNPP       sortingNetworks
grabcutNPP                stereoDisparity
histEqualizationNPP       template
histogram                 template_runtime
imageDenoising            threadFenceReduction
imageSegmentationNPP      threadMigration
inlinePTX                 transpose
interval                  vectorAdd
lineOfSight               vectorAddDrv
marchingCubes             volumeFiltering
matrixMul                 volumeRender

Pick a few to test:

./deviceQueryDrv

The output:

./deviceQueryDrv Starting...
CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)
Device 0: "Quadro FX 4600"
CUDA Driver Version: 5.0
CUDA Capability Major/Minor version number: 1.0
Total amount of global memory: 768 MBytes (804978688 bytes)
(12) Multiprocessors x ( 8) CUDA Cores/MP: 96 CUDA Cores
GPU Clock rate: 1200 MHz (1.20 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 384-bit
Max Texture Dimension Sizes 1D=(8192) 2D=(65536,32768) 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Texture alignment: 256 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: No with 0 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: No
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Matrix multiplication with CUBLAS:

./matrixMulCUBLAS

The output:

[Matrix Multiply CUBLAS] -- Starting...
GPU Device 0: "Quadro FX 4600" with compute capability 1.0
MatrixA(160,320), MatrixB(160,320), MatrixC(160,320)
Computing result using CUBLAS...done.
Performance= 56.99 GFlop/s, Time= 0.287 msec, Size= 16384000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: OK

Eigenvalues:

./eigenvalues

The output:

Starting eigenvalues
GPU Device 0: "Quadro FX 4600" with compute capability 1.0

Matrix size: 2048 x 2048
Precision: 0.000010
Iterations to be timed: 100
Result filename: 'eigenvalues.dat'
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 17.917166 ms
Average time step 2, one intervals: 5.539938 ms
Average time step 2, mult intervals: 0.013190 ms
Average time TOTAL: 23.516399 ms
Test Succeeded!
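Beyond the bundled samples, you can verify the toolchain end to end with a small program of your own. Below is a minimal vector-addition kernel; it is not one of the NVIDIA samples, and the file name and sizes are arbitrary choices for illustration. Save it as vectoradd.cu:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    // Host buffers, filled with known values.
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // Device buffers.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // 256 threads per block, enough blocks to cover all n elements.
    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    // Spot-check the result on the host: c[i] should be 3*i.
    int ok = 1;
    for (int i = 0; i < n; ++i)
        if (hc[i] != 3.0f * i) { ok = 0; break; }
    printf("%s\n", ok ? "Test PASSED" : "Test FAILED");

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return ok ? 0 : 1;
}
```

Compile it with nvcc (available thanks to the PATH setting above) and run it:

nvcc -o vectoradd vectoradd.cu
./vectoradd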