Setting Up an AIGC GPU Environment on Ubuntu 22.04 LTS (NVIDIA + CUDA + cuDNN)
1. Server Information
1.1 Hardware Configuration
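| Server name    | CPU (Intel® Xeon® W-2245) | GPU (NVIDIA Quadro RTX 6000 24G) | Memory (GB) | Disk (GB) |
|----------------|---------------------------|----------------------------------|-------------|-----------|
| AIGC-Precision | 8 cores / 16 threads      | 24G × 2                          | 64          | 2000      |

1.2 Software Configuration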
Download pages:

NVIDIA GPU driver: https://www.nvidia.cn/download/driverResults.aspx/223630/cn/
NVIDIA driver and CUDA version compatibility table: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
CUDA Toolkit 12.4 Downloads: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=runfile_local
cuDNN Downloads: https://developer.nvidia.com/cudnn-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local
NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

Versions used in this guide:

Operating system: Ubuntu 22.04.4 LTS (Jammy Jellyfish)
GPU driver: https://cn.download.nvidia.com/XFree86/Linux-x86_64/550.67/NVIDIA-Linux-x86_64-550.67.run
CUDA: https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
cuDNN: https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb
Miniconda: https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Docker: 26.0.0

2. System Initialization
2.1 Install Common Tools

    # Update the system, kernel, and installed packages
    sudo apt-get -y update && sudo apt-get -y upgrade && apt list --upgradable && sudo apt autoremove

    # Back up the default apt sources
    sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak$(date '+%Y%m%d%H%M%S')

    # Install common tools
    sudo apt -y install lsb-release openssh-server vim jq net-tools \
        git expect dkms autoconf nmon ansible screen

    # Show system version information
    uname -a && cat /proc/version && lsb_release -a && cat /etc/*release

    # Turn off swap for the current session
    sudo swapoff -a

    # Back up /etc/fstab
    sudo cp -p /etc/fstab /etc/fstab.bak$(date '+%Y%m%d%H%M%S')

    # Turn off swap permanently by commenting out the swap entries
    sudo sed -ri '/^[^#]*swap/s@^@#@' /etc/fstab
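Before running the sed expression against the real /etc/fstab, it can be tried on a throwaway copy. The following is only a sketch: the sample fstab entries and the temporary file are made up for the demonstration.

```shell
# Build a throwaway fstab-like file (the entries are hypothetical).
demo_fstab=$(mktemp)
cat > "$demo_fstab" <<'EOF'
UUID=1111-2222 / ext4 defaults 0 1
/swap.img none swap sw 0 0
# /old-swap none swap sw 0 0
EOF

# Same expression as in the guide: prefix '#' to every line that mentions
# "swap" before any '#'; lines that are already commented stay untouched.
sed -ri '/^[^#]*swap/s@^@#@' "$demo_fstab"

cat "$demo_fstab"
```

Only the active swap entry gains a leading '#'; the root filesystem line and the already-commented line are left as they were.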
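2.1.1 Session Timeout and History Format with Time, User, and IP (Recommended)

Open the system-wide profile for editing:

    sudo vi /etc/profile    # alternatives: /etc/bashrc or /etc/profile.d/env.sh

Append the following:

    # Remote login timeout
    TMOUT=300          # 60*5 = 300 seconds
    # History settings: time + user + IP
    HISTFILESIZE=2000  # maximum number of lines kept in the history file
    HISTSIZE=2000      # maximum number of commands kept in the session's history list
    IP=`who -u am i 2>/dev/null | awk '{print $NF}' | sed -e 's/[()]//g'`  # client IP
    if [ -z "$IP" ]    # fall back to the local hostname when no IP was found
    then
        IP=`hostname`
    fi
    HISTTIMEFORMAT="%F %T $IP:`whoami` "  # format of the history output
    export HISTTIMEFORMAT

Apply the changes, then reload the history file and empty it:

    source /etc/profile
    history -r && echo > ~/.bash_history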
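The client-IP extraction in the profile snippet can be checked in isolation. This is a minimal sketch: the helper name client_ip_from_who and the sample line are assumptions for illustration, with the sample laid out the way who -u typically prints an SSH session.

```shell
# Extract the client address from one line of `who -u am i` output.
# awk picks the last field, e.g. "(192.168.1.10)"; sed strips the parentheses.
client_ip_from_who() {
    awk '{print $NF}' | sed -e 's/[()]//g'
}

# Hypothetical sample line for a remote session.
sample='alice    pts/0        2024-03-26 13:20   .          2345 (192.168.1.10)'
printf '%s\n' "$sample" | client_ip_from_who
```

When who prints nothing at all, the pipeline yields an empty string, which is the case the hostname fallback in the profile snippet covers.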
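2.2 Install the NVIDIA Driver

Disable the bundled open-source nouveau driver by editing the modprobe blacklist:

    sudo vi /etc/modprobe.d/blacklist.conf

Append the following to the end of blacklist.conf:

    blacklist nouveau
    options nouveau modeset=0

Rebuild the initramfs and reboot:

    sudo update-initramfs -u && sudo reboot

After the reboot, check that nouveau is no longer loaded (no output means it has been disabled):

    lsmod | grep nouveau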
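For use in a script, the same check can return an exit status instead of relying on reading the output by eye. A small sketch, assuming a helper named nouveau_loaded (not part of any tool) and a made-up lsmod excerpt in place of the real output:

```shell
# Succeeds (exit 0) if the module list on stdin contains the nouveau module.
nouveau_loaded() {
    grep -q '^nouveau[[:space:]]'
}

# On the server: lsmod | nouveau_loaded
# Here a hypothetical lsmod excerpt stands in for the real output.
sample_lsmod='Module                  Size  Used by
drm_kms_helper        311296  1 i915
video                  65536  1 i915'

if printf '%s\n' "$sample_lsmod" | nouveau_loaded; then
    echo "nouveau is still loaded"
else
    echo "nouveau is not loaded"
fi
```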
Install gcc-12 and g++-12. Check the current version first:

    gcc --version | grep -e 'gcc'

On this machine the GCC version was 11, while the recommended version is 12.

    sudo apt install -y gcc-12 g++-12
    # Register both installed versions with update-alternatives
    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 11 --slave /usr/bin/g++ g++ /usr/bin/g++-11 --slave /usr/bin/gcov gcov /usr/bin/gcov-11
    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12 --slave /usr/bin/g++ g++ /usr/bin/g++-12 --slave /usr/bin/gcov gcov /usr/bin/gcov-12

To select a specific GCC version manually, run the interactive selector:

    sudo update-alternatives --config gcc

    There are 2 choices for the alternative gcc (providing /usr/bin/gcc).

      Selection    Path             Priority   Status
    ------------------------------------------------------------
    * 0            /usr/bin/gcc-12   12        auto mode
      1            /usr/bin/gcc-11   11        manual mode
      2            /usr/bin/gcc-12   12        manual mode

    Press <enter> to keep the current choice[*], or type selection number: 0
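To confirm which major version the gcc command now resolves to, the first line of gcc --version can be parsed. This is a sketch under assumptions: gcc_major is a hypothetical helper, and the sample line imitates what Ubuntu's GCC 11 package prints.

```shell
# Print the major version taken from the first line of `gcc --version`.
gcc_major() {
    awk 'NR == 1 { split($NF, v, "."); print v[1] }'
}

# On the server: gcc --version | gcc_major
# Hypothetical first line from Ubuntu's GCC 11 package:
printf '%s\n' 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0' | gcc_major
```

After switching the alternative to gcc-12, the same pipeline on the server should print 12.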
To avoid the following error during driver installation, create a cc symlink that points to gcc:

    ERROR: Unable to find the development tool cc in your path; please make sure that you have the package 'gcc' installed. If gcc is installed on your system, then please check that cc is in your PATH.

    sudo ln -s /usr/bin/gcc /usr/bin/cc
    # To remove the cc symlink later:
    sudo rm /usr/bin/cc
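Install the NVIDIA driver. The installer options used here are:

    -no-x-check        do not abort the installation if an X server is running
    -no-nouveau-check  do not abort the installation if the nouveau driver is in use
    -no-opengl-files   install only the driver files, without the OpenGL files

    chmod a+x *.run
    sudo sh NVIDIA-Linux-x86_64-550.67.run -no-x-check -no-nouveau-check -no-opengl-files

Verify that the driver was installed successfully:

    nvidia-smi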
    Tue Mar 26 13:25:25 2024
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  Quadro RTX 6000                Off |   00000000:17:00.0 Off |                  Off |
    | 33%   27C    P8             11W /  260W |       6MiB /  24576MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   1  Quadro RTX 6000                Off |   00000000:65:00.0  On |                  Off |
    | 34%   28C    P8             17W /  260W |      53MiB /  24576MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A      2574      G   /usr/lib/xorg/Xorg                              4MiB |
    |    1   N/A  N/A      2574      G   /usr/lib/xorg/Xorg                             51MiB |
    +-----------------------------------------------------------------------------------------+
2.3 Install the CUDA Toolkit

Run the CUDA Toolkit 12.4 installer downloaded earlier:

    chmod a+x *.run
    sudo sh cuda_12.4.0_550.54.14_linux.run
Because the NVIDIA driver is already installed, deselect the Driver option here. If it is left selected, the installer installs the driver together with the CUDA Toolkit.

    CUDA Installer
    - [ ] Driver
         [ ] 550.54.14
    + [X] CUDA Toolkit 12.4
      [X] CUDA Demo Suite 12.4
      [X] CUDA Documentation 12.4
    - [ ] Kernel Objects
         [ ] nvidia-fs
      Options
      Install

    Up/Down: Move | Left/Right: Expand | 'Enter': Select | 'A': Advanced options
After a successful installation the installer prints the summary below; the CUDA environment variables still have to be configured.

    ===========
    = Summary =
    ===========

    Driver:   Not Selected
    Toolkit:  Installed in /usr/local/cuda-12.4/

    Please make sure that
     -   PATH includes /usr/local/cuda-12.4/bin
     -   LD_LIBRARY_PATH includes /usr/local/cuda-12.4/lib64, or, add /usr/local/cuda-12.4/lib64 to /etc/ld.so.conf and run ldconfig as root

    To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.4/bin
    ***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 550.00 is required for CUDA 12.4 functionality to work.
    To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
        sudo <CudaInstaller>.run --silent --driver

    Logfile is /var/log/cuda-installer.log
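The installer's warning states that CUDA 12.4 needs a driver of at least version 550.00. One way to check that condition in a script is a version comparison built on GNU sort -V. The sketch below assumes a helper named version_ge and hard-codes the installed version from this guide; on a real machine it could instead be read with nvidia-smi --query-gpu=driver_version --format=csv,noheader.

```shell
# True when version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort), which Ubuntu ships by default.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="550.00"   # minimum stated in the CUDA 12.4 installer summary
installed="550.67"  # assumption: the driver version installed in this guide

if version_ge "$installed" "$required"; then
    echo "driver $installed satisfies >= $required"
else
    echo "driver $installed is older than $required"
fi
```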
Append the following to the end of /etc/profile:

    sudo vim /etc/profile

    # CUDA
    export PATH=/usr/local/cuda-12.4/bin${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
    export CUDA_HOME=/usr/local/cuda-12.4

Refresh the linker cache and reload the profile:

    sudo ldconfig && source /etc/profile
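The ${VAR:+:${VAR}} form in the export lines is worth a closer look: it expands to a colon plus the old value only when the variable is already set and non-empty, so no stray ':' is left behind when it is empty. A minimal sketch, using a hypothetical helper named prepend_path:

```shell
# Print "$1" prepended to the colon-separated list "$2".
# ${existing:+:${existing}} adds ":<old value>" only if the old value is non-empty.
prepend_path() {
    existing=$2
    printf '%s\n' "$1${existing:+:${existing}}"
}

prepend_path /usr/local/cuda-12.4/bin "/usr/bin:/bin"
prepend_path /usr/local/cuda-12.4/lib64 ""
```

The first call prints the new directory followed by the old list; the second prints only the new directory, with no trailing colon.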
Verify that CUDA was installed successfully:

    nvcc -V

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2024 NVIDIA Corporation
    Built on Tue_Feb_27_16:19:38_PST_2024
    Cuda compilation tools, release 12.4, V12.4.99
    Build cuda_12.4.r12.4/compiler.33961263_0
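If the toolkit version is needed in a script, the release number can be pulled out of the nvcc -V output. This is a sketch only; nvcc_release is a hypothetical helper, and the sample line is the one shown in the output above.

```shell
# Extract the "release X.Y" number from `nvcc -V` output on stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# On the server: nvcc -V | nvcc_release
printf '%s\n' 'Cuda compilation tools, release 12.4, V12.4.99' | nvcc_release
```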
2.4 Install cuDNN

    # wget https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb
    sudo dpkg -i cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb
    sudo cp /var/cudnn-local-repo-ubuntu2204-9.0.0/cudnn-*-keyring.gpg /usr/share/keyrings/
    sudo apt-get update
    sudo apt-get -y install cudnn

The cudnn package installs the latest build by default. To install the build for a specific CUDA major version instead, use the CUDA-specific package:

    sudo apt-get -y install cudnn-cuda-12
To verify that cuDNN is installed and runs correctly, compile the mnistCUDNN sample, which the Debian package places under /usr/src/cudnn_samples_v9:

    sudo apt-get -y install libcudnn9-samples libfreeimage-dev
    whereis mnistCUDNN

    mnistCUDNN: /usr/src/cudnn_samples_v9/mnistCUDNN

Build and run the sample in place:

    cd /usr/src/cudnn_samples_v9/mnistCUDNN
    sudo make clean && sudo make
    ./mnistCUDNN
    Resulting weights from Softmax:
    0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000
    Loading image data/five_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006

    Result of classification: 1 3 5

    Test passed!
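In an automated setup, the final "Test passed!" line can serve as the success signal. A small sketch, assuming a helper named cudnn_sample_passed and using the tail of the output above as sample input:

```shell
# Succeeds if the sample output on stdin contains the final "Test passed!" line.
cudnn_sample_passed() {
    grep -q '^Test passed!$'
}

# On the server: ./mnistCUDNN | cudnn_sample_passed && echo "cuDNN OK"
tail_of_output='Result of classification: 1 3 5

Test passed!'
printf '%s\n' "$tail_of_output" | cudnn_sample_passed && echo "cuDNN OK"
```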
2.5 Install Miniconda

    sudo -s
    mkdir -p /opt/miniconda3
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /opt/miniconda3/miniconda.sh
    bash /opt/miniconda3/miniconda.sh -b -u -p /opt/miniconda3
    rm -rf /opt/miniconda3/miniconda.sh
    # Initialize Miniconda for bash and zsh
    /opt/miniconda3/bin/conda init bash
    /opt/miniconda3/bin/conda init zsh

Verify that Miniconda was installed successfully (open a new shell first so that the conda init changes take effect):

    conda --version
    conda config --set auto_activate_base false   # do not activate the base environment automatically
Configure the Tsinghua (TUNA) mirror for conda:

    vim ~/.condarc

    channels:
      - defaults
    show_channel_urls: true
    default_channels:
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
    custom_channels:
      conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      deepmodeling: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/

Clear the index cache so that the mirror is used:

    conda clean -i

Install command completion for conda:

    conda install -c conda-forge conda-bash-completion
    exec bash
2.6 Install NGINX

    # Install the prerequisites
    sudo apt install curl gnupg2 ca-certificates lsb-release ubuntu-keyring

    # Import the official nginx signing key so that apt can verify package authenticity
    curl https://nginx.org/keys/nginx_signing.key | gpg --dearmor \
        | sudo tee /usr/share/keyrings/nginx-archive-keyring.gpg >/dev/null

    # Verify that the downloaded file contains the correct key
    gpg --dry-run --quiet --no-keyring --import --import-options import-show /usr/share/keyrings/nginx-archive-keyring.gpg
The output should contain the full fingerprint 573BFD6B3D8FBC641079A6ABABF5BD827BD9BF62, as shown below. If the fingerprint is different, delete the file.

    pub   rsa2048 2011-08-19 [SC] [expires: 2024-06-14]
          573BFD6B3D8FBC641079A6ABABF5BD827BD9BF62
    uid                      nginx signing key <signing-key@nginx.com>
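The fingerprint comparison can also be done mechanically rather than by eye. The following is a sketch under assumptions: has_fingerprint is a hypothetical helper, and the sample text is the gpg output shown above; on the server, the output of the gpg import-show command would be piped in instead.

```shell
expected_fpr="573BFD6B3D8FBC641079A6ABABF5BD827BD9BF62"

# Succeeds if the 40-hex-digit fingerprint "$1" appears in the text on stdin.
has_fingerprint() {
    grep -Eo '[0-9A-F]{40}' | grep -qx "$1"
}

sample='pub   rsa2048 2011-08-19 [SC] [expires: 2024-06-14]
      573BFD6B3D8FBC641079A6ABABF5BD827BD9BF62
uid                      nginx signing key <signing-key@nginx.com>'

if printf '%s\n' "$sample" | has_fingerprint "$expected_fpr"; then
    echo "fingerprint matches"
else
    echo "fingerprint mismatch: delete the keyring file"
fi
```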
Set up the apt repository for stable nginx packages, then install nginx:

    # Stable packages (active entry)
    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/nginx-archive-keyring.gpg] http://nginx.org/packages/ubuntu `lsb_release -cs` nginx" \
        | sudo tee /etc/apt/sources.list.d/nginx.list

    # Mainline packages (kept commented out); tee -a appends so the stable entry is not overwritten
    echo "# deb [arch=amd64 signed-by=/usr/share/keyrings/nginx-archive-keyring.gpg] http://nginx.org/packages/mainline/ubuntu `lsb_release -cs` nginx" \
        | sudo tee -a /etc/apt/sources.list.d/nginx.list

    # Pin packages from nginx.org above the distribution-provided ones
    echo -e "Package: *\nPin: origin nginx.org\nPin: release o=nginx\nPin-Priority: 900\n" \
        | sudo tee /etc/apt/preferences.d/99nginx

    sudo apt update && sudo apt install nginx -y && nginx -v
2.7 Install Terraform

    sudo apt-get update && sudo apt-get install -y gnupg software-properties-common

    # Install the HashiCorp GPG key
    wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | \
        sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null

    # Display the key's fingerprint for verification
    gpg --no-default-keyring \
        --keyring /usr/share/keyrings/hashicorp-archive-keyring.gpg \
        --fingerprint

    # Add the HashiCorp repository
    echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | \
        sudo tee /etc/apt/sources.list.d/hashicorp.list

    sudo apt update && sudo apt-get install terraform && terraform -v

Install command completion for Terraform:

    terraform -install-autocomplete
2.8 Install Docker and the NVIDIA Container Toolkit (nvidia-container-toolkit)

Uninstall all conflicting packages:

    for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
    sudo apt-get remove docker docker-engine docker.io containerd runc

Add the official Docker repository and key, plus the NVIDIA Container Toolkit repository:

    # Add Docker's official GPG key:
    sudo apt-get update
    sudo apt-get install ca-certificates curl
    sudo install -m 0755 -d /etc/apt/keyrings
    sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
    sudo chmod a+r /etc/apt/keyrings/docker.asc

    # Add the Docker repository to Apt sources:
    echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
      $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
      sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

    # Add the NVIDIA Container Toolkit production repository to Apt sources:
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

    sudo apt-get update

Install Docker, the NVIDIA Container Toolkit, and the command completion package:

    cat /proc/driver/nvidia/version
    sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin nvidia-container-toolkit nvidia-docker2 bash-completion
3. Docker Configuration
3.1 Recommended Configuration

The daemon.json below sets registry mirrors, download and upload concurrency limits, log size limits, and the NVIDIA container runtime as the default runtime. Remote access on port 2375 is configured separately in docker.service. The file is written through sudo tee because a plain shell redirection would not run with root privileges.

    sudo tee /etc/docker/daemon.json > /dev/null <<EOF
    {
      "iptables": true,
      "bip": "172.17.0.1/24",
      "data-root": "/var/lib/docker",
      "storage-driver": "overlay2",
      "insecure-registries": ["http://Harbor_HostName:8082"],
      "exec-opts": ["native.cgroupdriver=systemd"],
      "registry-mirrors": [
        "https://docker.nju.edu.cn",
        "https://hub-mirror.c.163.com",
        "https://registry.cn-hangzhou.aliyuncs.com"
      ],
      "max-concurrent-downloads": 10,
      "max-concurrent-uploads": 20,
      "live-restore": true,
      "log-driver": "json-file",
      "log-opts": {
        "max-size": "500m",
        "max-file": "3"
      },
      "default-runtime": "nvidia",
      "runtimes": {
        "nvidia": {
          "path": "nvidia-container-runtime",
          "runtimeArgs": []
        }
      }
    }
    EOF
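Enable remote access on port 2375 (optional). This exposes the Docker API over plain TCP without TLS or authentication, so use it only on a trusted network.

    sudo vim /usr/lib/systemd/system/docker.service

    #ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
    ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock -H tcp://0.0.0.0:2375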
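3.2 Optional Configuration

Add yourself to the docker group:

    sudo usermod -aG docker $USER && newgrp docker

Add another user to the docker group (the account is named user in this example):

    # Add user to the docker group
    sudo usermod -aG docker user
    # Switch to user
    su - user
    # List all users and groups
    sudo cat /etc/passwd && cat /etc/group
    # Give user ownership of the application directory
    sudo chown -R user /opt/docker-app

Verify that Docker Engine is installed correctly by running the hello-world image:

    sudo docker run --rm hello-world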
3.3 Apply the Configuration and Enable Start on Boot

    # Reload systemd units, restart Docker, and enable it at boot
    sudo systemctl daemon-reload && sudo systemctl restart docker && sudo systemctl enable docker
3.4 Command Completion

Install bash-completion:

    sudo apt-get install -y bash-completion

After the installation, reboot or log in to the shell again. If the installation succeeded, typing docker p and then pressing Tab shows:

    pause   plugin  port    ps      pull    push
Docker command completion:

    sudo curl -L https://raw.githubusercontent.com/docker/cli/25.0.0/contrib/completion/bash/docker -o /etc/bash_completion.d/docker
    source /etc/bash_completion.d/docker

Docker Compose command completion:

    sudo curl -L https://raw.githubusercontent.com/docker/compose/1.29.2/contrib/completion/bash/docker-compose -o /etc/bash_completion.d/docker-compose
    source /etc/bash_completion.d/docker-compose

containerd ctr command completion:

    sudo curl -L https://raw.githubusercontent.com/containerd/containerd/main/contrib/autocomplete/ctr -o /etc/bash_completion.d/ctr

Kubernetes master node (kubectl) command completion:

    source /usr/share/bash-completion/bash_completion
    source <(kubectl completion bash)
    echo "source <(kubectl completion bash)" >> ~/.bashrc

Helm command completion:

    helm completion bash > ~/.helmrc && echo "source ~/.helmrc" >> ~/.bashrc
References
NGINX installation guide: https://docs.nginx.com/nginx/admin-guide/installing-nginx/installing-nginx-open-source/#installing-prebuilt-ubuntu-packages
Terraform installation guide: https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli
Docker installation guide: https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository
https://blog.csdn.net/qq_49323609/article/details/130310522
https://blog.csdn.net/qq_28356373/article/details/136746520
https://docs.nvidia.com/deeplearning/cudnn/installation/linux.html
https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/