Large Language Model (LLM) Environment Setup and Basic Tests
Install Ubuntu 22.04 and switch to the Tsinghua mirror
sudo nano /etc/apt/sources.list
Replace the file contents with the following:
# Source-code mirrors are commented out by default to speed up apt update; uncomment them if needed
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-updates main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-backports main restricted universe multiverse
# deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-security main restricted universe multiverse
# # deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-security main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu/ jammy-security main restricted universe multiverse
# deb-src http://security.ubuntu.com/ubuntu/ jammy-security main restricted universe multiverse
# Pre-release repository; enabling it is not recommended
# deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-proposed main restricted universe multiverse
# # deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-proposed main restricted universe multiverse
Switch the input method to Sogou Pinyin
Install the fcitx input method framework
sudo apt install fcitx
Download the Sogou Pinyin installer package and install it along with its dependencies:
cd ~/Downloads
sudo dpkg -i sogoupinyin_4.2.1.145_amd64.deb
sudo apt install libqt5qml5 libqt5quick5 libqt5quickwidgets5 qml-module-qtquick2
sudo apt install libgsettings-qt1
Uninstall the ibus input method framework
sudo apt purge ibus
Miniconda
Download the installer script from the reference link and run it:
wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.1.0-1-Linux-x86_64.sh
bash Miniconda3-py310_23.1.0-1-Linux-x86_64.sh
source ~/.bashrc
Create environments later as needed, for example:
conda create -n 31009text python=3.10.9
conda create -n 31006sd python=3.10.6
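After conda activate, a quick check that the intended interpreter is actually the one in use (a minimal sketch; 3.10.9 matches the 31009text example above):
# confirm which Python the activated environment provides
import sys
print(sys.version)      # expect 3.10.9 for the 31009text environment
print(sys.executable)   # should point inside the miniconda envs directory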
Git
sudo apt install git
If you do not have a proxy for foreign sites, install a Tampermonkey userscript that accelerates GitHub in the browser, or use some other workaround.
NVIDIA driver
Reference article
# remove the driver that ships with the system
sudo apt purge nvidia*
sudo vi /etc/modprobe.d/blacklist.conf
Append the following two lines at the end:
blacklist nouveau
options nouveau modeset=0
sudo update-initramfs -u
sudo reboot
After rebooting, check whether nouveau has been blacklisted; if the following command produces no output, it worked:
lsmod | grep nouveau
Download the driver matching your GPU model and install it:
sudo chmod a+x NVIDIA-Linux-x86_64-525.89.02.run
sudo ./NVIDIA-Linux-x86_64-525.89.02.run -no-x-check -no-nouveau-check -no-opengl-files
- -no-x-check: skip the check for a running X server during installation;
- -no-nouveau-check: skip the nouveau check during installation;
- -no-opengl-files: install only the driver files without the OpenGL files; this one is optional
After installation, the command below shows the highest CUDA version the driver supports; for the 2080 Ti this is 12.0:
nvidia-smi
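The same information can also be read from Python through the NVIDIA management library; a minimal sketch, assuming the bindings are installed with pip install nvidia-ml-py:
# query the driver version and GPU name through NVML
import pynvml
pynvml.nvmlInit()
print("driver:", pynvml.nvmlSystemGetDriverVersion())
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("gpu 0:", pynvml.nvmlDeviceGetName(handle))
pynvml.nvmlShutdown()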
Install CUDA
Visit https://developer.nvidia.com/cuda-toolkit-archive
select your platform, then run the generated commands to download and install:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.0.0/local_installers/cuda-repo-ubuntu2204-12-0-local_12.0.0-525.60.13-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-0-local_12.0.0-525.60.13-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-0-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
Verify the installation is complete (note that the V is uppercase):
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Mon_Oct_24_19:12:58_PDT_2022
Cuda compilation tools, release 12.0, V12.0.76
Build cuda_12.0.r12.0/compiler.31968024_0
!!! Because the GPTQ quantized models are used later, CUDA 11.7.1 was actually installed instead, as follows:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda-repo-ubuntu2204-11-7-local_11.7.1-515.65.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-11-7-local_11.7.1-515.65.01-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
Install cuDNN
Register an NVIDIA developer account
https://developer.nvidia.com/rdp/cudnn-download
Download the matching version, choose the TAR archive, and follow the instructions to install it; this is essentially just extracting the archive and copying the files into place.
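Once PyTorch is installed (see the Text-generation-webui section below), it is easy to confirm from Python that the CUDA runtime and cuDNN are actually being picked up; a minimal sketch:
# check that PyTorch sees the CUDA toolkit and cuDNN that were just installed
import torch
print(torch.version.cuda)                   # CUDA version the PyTorch build targets
print(torch.backends.cudnn.is_available())  # True if cuDNN can be loaded
print(torch.backends.cudnn.version())       # e.g. 8xxx for cuDNN 8.x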
Text-generation-webui
https://github.com/oobabooga/text-generation-webui
sudo apt install build-essential
# conda create -n 31009text python=3.10.9
conda env list
conda activate 31009text
# install pytorch; the exact command depends on your GPU
pip3 install torch torchvision torchaudio
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
python server.py
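If the web UI does not appear to use the GPU, first confirm that the PyTorch build installed above can see it; a minimal sketch:
# quick GPU visibility check for the freshly installed PyTorch
import torch
print(torch.cuda.is_available())           # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. NVIDIA GeForce RTX 2080 Ti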
GPTQ installation
Activate the conda 31009text environment.
Install following this branch; the install location is the repositories directory inside text-generation-webui. If it still reports errors, refer to the hf configuration below.
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt
The hf configuration approach mainly pins GPTQ to a specific commit:
git clone -n https://github.com/qwopqwop200/GPTQ-for-LLaMa gptq-safe
cd gptq-safe && git checkout 58c8ab4c7aaccc50f507fd08cce941976affe5e0
Because this approach renames the folder to gptq-safe, GPTQ_loader.py has to be adjusted accordingly: add the extra path entry so that modelutils and find_layers can be found at runtime:
sys.path.insert(0, str(Path("repositories/GPTQ-for-LLaMa")))
sys.path.insert(0, str(Path("repositories/GPTQ-for-LLaMa/utils")))
sys.path.insert(0, str(Path("repositories/gptq-safe")))
import llama_inference_offload
import modelutils
from modelutils import find_layers
Testing model loading
llama-13b-hf: failed; the model is about 40 GB and loading runs out of memory, so a 4-bit quantized model is required
GPT4-x-alpaca-30b-4bit: failed; there is a bug in text-generation-webui
vicuna-13B-1.1-GPTQ-4bit-128g: succeeded
chatglm-6b: succeeded when tested standalone, but errors inside text-generation-webui because of the trust_remote_code issue (see the sketch after this block)
Load the 4-bit vicuna model as follows (it can also be selected in the web UI):
python server.py --model vicuna-13B-1.1-GPTQ-4bit-128g --wbits 4 --groupsize 128 --model_type Llama
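For the chatglm-6b case above, the standalone test follows the model card's example usage: the model ships custom code, so it must be loaded with trust_remote_code=True, which is what the web UI error refers to. A minimal sketch:
# load chatglm-6b directly with transformers; trust_remote_code runs the model's own code
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)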
Loading the vicuna model
# first apply the delta weights to the original llama to produce a directly usable model
python -m fastchat.model.apply_delta --base-model-path ./llama-13b-hf --target-model-path ~/TDowload/vicuna-13b --delta-path ~/TDowload/vicuna-13b-delta-v1.1/
# this step needs more than 64 GB of RAM, so it could not be completed here; to be revisited later
Convert the original LLaMA model to HF format and quantize it (a load check for the converted model is sketched after the command block below)
#convert LLaMA to hf
python convert_llama_weights_to_hf.py --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir ./llama-hf
# Benchmark language generation with 4-bit LLaMA-7B:
# Save compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors llama7b-4bit-128g.safetensors
# Benchmark generating a 2048 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --benchmark 2048 --check
# model inference with the saved model
CUDA_VISIBLE_DEVICES=7 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.safetensors --text "this is llama" --device=0
# model inference with the saved model with offload (this is very slow; it is a simple implementation and could be improved with technologies like flexgen, https://github.com/FMInference/FlexGen)
CUDA_VISIBLE_DEVICES=0 python llama_inference_offload.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" --pre_layer 16
It takes about 180 seconds to generate 45 tokens(5->50 tokens) on single RTX3090 based on LLaMa-65B. pre_layer is set to 50.
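Before quantizing, the HF-format checkpoint produced by the convert step can be given a quick load test; a minimal sketch, assuming accelerate is installed and that ./llama-hf is where the converted model ended up (depending on the script version it may be a subdirectory):
# sanity-check the converted HF-format LLaMA checkpoint
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("./llama-hf")
model = LlamaForCausalLM.from_pretrained("./llama-hf", torch_dtype=torch.float16, device_map="auto")
inputs = tokenizer("this is llama", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))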
Summary: it is recommended to simply download models that have already been quantized.
Testing the MOSS model
Note that there is a separate plugin version.
Install Thunder (Xunlei) or another download manager
Used to download models from huggingface.co.
Install v2raya or set up another proxy solution
sudo snap install v2raya
# then configure it
Langchain
Install langchain in the conda 31009text environment:
conda env list
conda activate 31009text
pip install langchain
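A minimal sketch of driving a local Hugging Face model through langchain, using the 2023-era import paths; the tiny gpt2 test model is only a placeholder to verify the wiring:
# wire a local transformers pipeline into a LangChain LLMChain
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain
pipe = pipeline("text-generation", model="gpt2", max_new_tokens=64)
llm = HuggingFacePipeline(pipeline=pipe)
prompt = PromptTemplate(input_variables=["question"], template="Q: {question}\nA:")
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("What is a large language model?"))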
Data Independent
References:
https://www.youtube.com/watch?v=v6sF8Ed3nTE
https://colab.research.google.com/drive/115ba3EFCT0PvyXzFNv9E18QnKiyyjsm5?usp=sharing
pip -q install git+https://github.com/huggingface/transformers
pip install -q datasets loralib sentencepiece
pip -q install bitsandbytes accelerate
pip -q install langchain
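bitsandbytes and accelerate are mainly needed so that large checkpoints can be loaded in 8-bit; a minimal sketch of that pattern (the model id is only a placeholder example):
# load a causal LM in 8-bit via bitsandbytes + accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "facebook/opt-1.3b"   # placeholder; substitute the model you actually downloaded
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")
print(model.get_memory_footprint())   # rough check that the 8-bit load worked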
Troubleshooting
Common issue 1
Hi @candowu, thanks for raising this issue. This is arising because the tokenizer in the config on the hub points to LLaMATokenizer. However, the tokenizer in the library is LlamaTokenizer. This is likely due to the configuration files being created before the final PR was merged in.
Change the LLaMATokenizer in tokenizer_config.json into lowercase LlamaTokenizer and it works like a charm.
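The rename can also be scripted instead of edited by hand; a minimal sketch (the path is only an example, point it at the model that fails to load):
# patch tokenizer_config.json: LLaMATokenizer -> LlamaTokenizer
import json
from pathlib import Path
cfg_path = Path("models/llama-13b-hf/tokenizer_config.json")  # example path
cfg = json.loads(cfg_path.read_text())
if cfg.get("tokenizer_class") == "LLaMATokenizer":
    cfg["tokenizer_class"] = "LlamaTokenizer"
    cfg_path.write_text(json.dumps(cfg, indent=2))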
Common issue 2
ValueError: The current `device_map` had weights offloaded to the disk. Please provide an
`offload_folder` for them.
Adding the offload_folder argument in text-generation-webui/modules/models.py (around line 56) should resolve the error, as in this example:
import torch
from transformers import AutoModelForCausalLM
# Will go out of RAM on Colab
checkpoint = "facebook/opt-13b"
model = AutoModelForCausalLM.from_pretrained(
checkpoint, device_map="auto", offload_folder="offload", torch_dtype=torch.float16
)
Issue 3: while running MOSS
TypeError: '<' not supported between instances of 'tuple' and 'float'
This happens because OutOfResources makes the function return a float, which then cannot be compared.
Following the solution below, rebuild and reinstall triton from source.
The fix is really a one-line change: the float returned when resources run out becomes a tuple:
| Change | Code |
|---|---|
| Remove | return float('inf') |
| Add | return (float('inf'), float('inf'), float('inf')) |
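Why the tuple matters: the autotuner picks the best configuration by comparing timings with <, and a bare float cannot be compared against tuples of floats. A minimal illustration in plain Python (not the actual triton source):
# mixing a float with tuples breaks the comparison used to pick the fastest config
timings = {"cfg_a": (1.2, 1.3, 1.4), "cfg_b": float("inf")}
# min(timings, key=timings.get)   # TypeError: '<' not supported between 'tuple' and 'float'
timings = {"cfg_a": (1.2, 1.3, 1.4), "cfg_b": (float("inf"),) * 3}
print(min(timings, key=timings.get))   # cfg_a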
Issue 4 (to be verified)
In the current version, we have model_class not being an Auto class for (TF) ImageClassificationPipelineTests, and we get test failure TypeError: ('Keyword argument not understood:', 'trust_remote_code')
https://github.com/huggingface/transformers/runs/7421505300?check_suite_focus=true
Adding TFAutoModelForImageClassification in src/transformers/pipelines/__init__.py will fix the issue.