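For reasons I can't work out, TensorFlow recently dropped the conda instructions from its installation guide. Only pip and Docker are left, and the pip route takes extra fiddling before it runs properly on Ubuntu 22.04. So here is a record of the conda-based installation.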

Install Miniconda

Other tutorials cover the details; in practice you can simply confirm every prompt.

curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Create a conda environment

conda create --name tf python=3.9 # environment named tf, Python 3.9
conda deactivate # leave the current (base) conda environment
conda activate tf # enter the tf environment just created
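Every later command installs into whichever environment happens to be active, so it can be worth confirming that tf really is the active one before going further. A minimal sketch, assuming conda activate has exported CONDA_DEFAULT_ENV; require_conda_env is a hypothetical helper name, not part of conda:

```shell
# Hypothetical helper: succeeds only when the named conda environment is active.
# Relies on CONDA_DEFAULT_ENV, which `conda activate` exports.
require_conda_env() {
    expected="$1"
    if [ "${CONDA_DEFAULT_ENV:-}" = "$expected" ]; then
        echo "conda environment '$expected' is active"
    else
        echo "expected conda environment '$expected', found '${CONDA_DEFAULT_ENV:-none}'" >&2
        return 1
    fi
}
```

Calling require_conda_env tf right after the activate step returns a non-zero status if the shell is still in base or in no environment at all.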

Set up the GPU environment

nvidia-smi # check that the NVIDIA driver is installed
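If this check is part of a setup script, it helps to fail early with a readable message rather than a bare "command not found". A rough sketch; check_nvidia_driver is a hypothetical helper name, while --query-gpu and --format are regular nvidia-smi options:

```shell
# Hypothetical helper: report GPU name and driver version, or explain what is missing.
check_nvidia_driver() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: install the NVIDIA driver first" >&2
        return 1
    fi
    # One CSV line per GPU, e.g. "<gpu name>, <driver version>"
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}
```

The function returns non-zero both when nvidia-smi is absent and when nvidia-smi itself fails, for example because the driver is installed but not loaded.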

Install CUDA and cuDNN

conda install -c conda-forge cudatoolkit=11.2.2 cudnn=8.1.0

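Add an activation script so the library path is set automatically every time the environment is activated; otherwise the setting silently disappears in a new shell and you won't notice.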

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
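The script in activate.d only runs on activation, so the current shell does not pick up the new path until you run conda deactivate followed by conda activate tf again. A small sketch for confirming the result afterwards; path_contains is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds if the colon-separated list $1 contains the entry $2.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

After re-activating the environment, path_contains "$LD_LIBRARY_PATH" "$CONDA_PREFIX/lib/" should succeed. Note the trailing slash, which matches the entry exactly as the export line above writes it.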

Install TensorFlow

pip install --upgrade pip
pip install tensorflow==2.11.* # use pip here, not conda; the conda package comes with a pile of odd bugs

Check that the GPU works. Read the Ubuntu 22.04 section below before running this command.

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

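If everything works, the output looks like the following, with one physical device listed per GPU in the machine (four in this example).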

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]

Problems on Ubuntu 22.04

On Ubuntu 22.04, running the GPU check above as-is either fails to find the GPU or produces errors such as:

Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice.
Couldn't invoke ptxas --version
InternalError: libdevice not found at ./libdevice.10.bc [Op:__some_op]

The fix is to install nvcc and configure it as follows:

conda install -c nvidia cuda-nvcc=11.3.58 # install nvcc
mkdir -p $CONDA_PREFIX/etc/conda/activate.d # the activation script below points XLA at the CUDA directory
printf 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/\nexport XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib/\n' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice # copy the required file into place
cp $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/
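Before re-running the GPU check, it may be worth confirming that both halves of this fix are in place: the copied libdevice file and the XLA_FLAGS setting. A minimal sketch that assumes the exact paths used in the commands above; check_xla_setup is a hypothetical helper name:

```shell
# Hypothetical helper: verify the libdevice copy and the XLA_FLAGS setting
# for the conda environment rooted at $1.
check_xla_setup() {
    prefix="$1"
    status=0
    if [ ! -f "$prefix/lib/nvvm/libdevice/libdevice.10.bc" ]; then
        echo "missing: $prefix/lib/nvvm/libdevice/libdevice.10.bc" >&2
        status=1
    fi
    case "${XLA_FLAGS:-}" in
        *"--xla_gpu_cuda_data_dir=$prefix/lib/"*) ;;
        *) echo "XLA_FLAGS does not point at $prefix/lib/" >&2; status=1 ;;
    esac
    return "$status"
}
```

Run it as check_xla_setup "$CONDA_PREFIX" inside the tf environment. It prints nothing and returns zero when both conditions hold.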

Tags: tensorflow, conda, ubuntu
