Last updated on March 29, 2024
llama.cpp can run inference with a model split between CPU and GPU. These notes record the process of running Qwen-1.8B on Windows 10 with a GTX 1660 Ti.
Prepare the model
```powershell
conda create -n llamaConvert python=3.10 git -c conda-forge
conda activate llamaConvert
cd D:\llama
git clone --depth=1 https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
python -m pip install -r requirements.txt
pip install tiktoken
```
```powershell
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='Qwen/Qwen-1_8B-Chat', local_dir=r'D:\qwen', ignore_patterns=['*.h5', '*.ot', '*.msgpack', '*.safetensors'])"
cd D:\qwen
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'model-00001-of-00002.safetensors' "https://huggingface.co/Qwen/Qwen-1_8B-Chat/resolve/main/model-00001-of-00002.safetensors?download=true"
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'model-00002-of-00002.safetensors' "https://huggingface.co/Qwen/Qwen-1_8B-Chat/resolve/main/model-00002-of-00002.safetensors?download=true"
```
```powershell
cd D:\llama\llama.cpp
python convert-hf-to-gguf.py D:\qwen
```
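Before quantizing, it does not hurt to sanity-check that the conversion produced a readable GGUF. A minimal sketch, assuming the `gguf` package installed by llama.cpp's requirements and the default output name ggml-model-f16.gguf:

```python
# Inspect the converted GGUF's metadata and tensor count (sketch; assumes the
# `gguf` package from llama.cpp's requirements.txt is available).
from gguf import GGUFReader

reader = GGUFReader(r"D:\qwen\ggml-model-f16.gguf")
print(f"{len(reader.fields)} metadata fields, {len(reader.tensors)} tensors")
for key in list(reader.fields)[:8]:   # e.g. general.architecture, general.name, ...
    print(" ", key)
```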
Run the model
```powershell
conda create -n llamaCpp libcublas cuda-toolkit git -c nvidia -c conda-forge
conda activate llamaCpp
cd D:\llama ; .\main.exe
cd D:\llama ; .\quantize.exe --help
.\quantize.exe D:\qwen\ggml-model-f16.gguf .\qwen-1_8-f16.gguf COPY
.\server.exe -m .\qwen-1_8-f16.gguf -c 4096 --n-gpu-layers 50
```
- Open http://127.0.0.1:8080 in a browser, switch to the Completion tab, and test the model.
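Besides the web UI, server.exe also exposes an HTTP API, so the model can be called from scripts. A minimal sketch with Python requests against the server started above:

```python
# Send a completion request to the llama.cpp server (assumes it is
# listening on 127.0.0.1:8080 as started above).
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Briefly explain what GGUF is:", "n_predict": 128, "temperature": 0.7},
)
print(resp.json()["content"])
```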
Fine-tune the model
Extra: Yi-6B-Chat
Yi-6B is a bilingual language model open-sourced by 01.AI (零一万物). Trained on a 3T-token multilingual corpus, it shows solid potential in language understanding, common-sense reasoning, and reading comprehension.
```powershell
cd D:\models\01yi
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'model-00001-of-00003.safetensors' "https://huggingface.co/01-ai/Yi-6B-Chat/resolve/main/model-00001-of-00003.safetensors?download=true"
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'model-00002-of-00003.safetensors' "https://huggingface.co/01-ai/Yi-6B-Chat/resolve/main/model-00002-of-00003.safetensors?download=true"
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'model-00003-of-00003.safetensors' "https://huggingface.co/01-ai/Yi-6B-Chat/resolve/main/model-00003-of-00003.safetensors?download=true"
conda activate llamaConvert
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='01-ai/Yi-6B-Chat', local_dir=r'D:\models\01yi', ignore_patterns=['*.h5', '*.ot', '*.msgpack', '*.safetensors'])"
```
```powershell
conda activate llamaConvert
cd D:\llama\llama.cpp
python convert.py D:\models\01yi

conda activate llamaCpp
cd D:\llama ; .\quantize.exe --help
.\quantize.exe D:\models\01yi\ggml-model-f16.gguf .\01yi-6b-Q4_K_M.gguf Q4_K_M
.\server.exe -m .\01yi-6b-Q4_K_M.gguf -c 4096 --n-gpu-layers 50
```
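Yi-6B-Chat expects a chat template rather than a raw prompt. Newer server builds expose an OpenAI-compatible route that applies a chat template server-side; a sketch, assuming the build you are running has /v1/chat/completions (pass --chat-template to server.exe if the default template does not match the model):

```python
# Chat with the running server through the OpenAI-compatible endpoint
# (assumes this server build exposes /v1/chat/completions on 127.0.0.1:8080).
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Introduce yourself in one sentence."},
        ],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```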
Extra: Baichuan2
```powershell
cd D:\models\baichuan
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'pytorch_model.bin' "https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat/resolve/main/pytorch_model.bin?download=true"
conda activate llamaConvert
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='baichuan-inc/Baichuan2-7B-Chat', local_dir=r'D:\models\baichuan', ignore_patterns=['*.h5', '*.bin', '*.ot', '*.msgpack', '*.safetensors'])"
cd D:\llama\llama.cpp
python convert.py D:\models\baichuan

conda activate llamaCpp
cd D:\llama ; .\quantize.exe --help
.\quantize.exe D:\models\baichuan\ggml-model-f16.gguf .\baichuan-7b-Q3_K_M.gguf Q3_K_M
.\server.exe -m .\baichuan-7b-Q3_K_M.gguf -c 2048 --n-gpu-layers 30
```
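The -c 2048 --n-gpu-layers 30 combination is dictated by the 6 GB card: the 7B model plus KV cache and buffers no longer fits comfortably, so only part of the layers are offloaded. A back-of-the-envelope way to pick the layer count (every figure here is an assumption, not a measurement):

```python
# Rough estimate of how many layers fit on the GPU; all numbers are guesses.
model_file_gb = 3.7   # approx. size of the Baichuan2-7B Q3_K_M GGUF
n_layers = 32         # transformer layers in Baichuan2-7B
vram_gb = 6.0         # GTX 1660 Ti
overhead_gb = 2.5     # KV cache at -c 2048, compute buffers, desktop use (guess)

per_layer_gb = model_file_gb / n_layers
print(int((vram_gb - overhead_gb) / per_layer_gb))   # ~30; lower it if the server runs out of VRAM
```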
Extra: tigerbot-13b
tigerbot-13b ranks near the top of the chinese-llm-benchmark leaderboard.
```powershell
cd D:\models\tigerbot
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'pytorch_model-00001-of-00003.bin' --max-download-limit=6M "https://huggingface.co/TigerResearch/tigerbot-13b-chat-v5/resolve/main/pytorch_model-00001-of-00003.bin?download=true"
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'pytorch_model-00002-of-00003.bin' --max-download-limit=6M "https://huggingface.co/TigerResearch/tigerbot-13b-chat-v5/resolve/main/pytorch_model-00002-of-00003.bin?download=true"
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'pytorch_model-00003-of-00003.bin' --max-download-limit=6M "https://huggingface.co/TigerResearch/tigerbot-13b-chat-v5/resolve/main/pytorch_model-00003-of-00003.bin?download=true"
conda activate llamaConvert
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='TigerResearch/tigerbot-13b-chat-v5', local_dir=r'D:\models\tigerbot', ignore_patterns=['*.h5', '*.bin', '*.ot', '*.msgpack', '*.safetensors'])"
cd D:\llama\llama.cpp
python convert.py D:\models\tigerbot --padvocab
cd D:\llama ; .\quantize.exe --help
.\quantize.exe D:\models\tigerbot\ggml-model-f16.gguf D:\models\tigerbot-13B-Chat-Q4_K_M.gguf Q4_K_M
.\server.exe -m D:\models\tigerbot-13B-Chat-Q4_K_M.gguf -c 4096
```
With 6 GB of VRAM, the most usable of these models in practice is Yi-6B-Chat at Q4_K_M.
tigerbot-13b runs at about 4.6 tokens/s on a Ryzen 5 5600H, with CPU usage around 60% at 3.5 GHz, so the bottleneck is most likely memory bandwidth rather than compute; a rough sanity check follows.
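During decoding, essentially the whole quantized model has to be streamed from RAM for every generated token, so the observed speed translates directly into required memory bandwidth. A sketch of the arithmetic (the bits-per-weight and bandwidth figures are approximations):

```python
# Estimate the memory traffic implied by 4.6 tokens/s on a 13B Q4_K_M model.
params = 13e9
bits_per_weight = 4.85                          # approx. average for Q4_K_M
model_gb = params * bits_per_weight / 8 / 1e9   # ~7.9 GB read per token
tokens_per_s = 4.6

print(model_gb * tokens_per_s)                  # ~36 GB/s of sustained reads
# Dual-channel DDR4-3200 is ~51 GB/s theoretical, so the memory bus, not the
# CPU cores, is the plausible limit.
```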
Extra: quantizing on Colab
Install llama.cpp
```python
!git clone --depth=1 https://github.com/ggerganov/llama.cpp.git
%cd /content/llama.cpp
!LLAMA_CUDA=1 make -j
```
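Building with LLAMA_CUDA=1 only makes sense on a GPU runtime; a quick check before (or after) building:

```python
# Confirm the Colab runtime actually has a GPU attached.
!nvidia-smi
```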
Compute the imatrix
The importance matrix is computed from a calibration text and guides the low-bit IQ quantizations used below; by default the tool writes imatrix.dat to the current directory.
```python
%cd /content
!wget -O transient.txt.gz https://huggingface.co/datasets/Limour/b-corpus/resolve/main/00-preview/00-transient.txt.gz?download=true
!gunzip transient.txt.gz
!mkdir -p /content/CausalLM-14B-GGUF
!wget -O /content/CausalLM-14B-GGUF/causallm_14b.Q8_0.gguf https://huggingface.co/TheBloke/CausalLM-14B-GGUF/resolve/main/causallm_14b.Q8_0.gguf?download=true
!/content/llama.cpp/imatrix -m /content/CausalLM-14B-GGUF/causallm_14b.Q8_0.gguf -f /content/transient.txt -ngl 36
```
Log in to Hugging Face
```python
from google.colab import userdata
from huggingface_hub import login

login(token=userdata.get('HF_TOKEN'), write_permission=True)
```
(Skipped) Convert the model
```python
%cd llama.cpp
!python -m pip install -r requirements.txt
!pip install tiktoken
from huggingface_hub import snapshot_download
!mkdir -p ~/CausalLM
snapshot_download(repo_id='CausalLM/7B', local_dir=r'/content/CausalLM', ignore_patterns=['*.h5', '*.ot', '*.msgpack', '*.safetensors'])
!python convert.py --vocab-type bpe --pad-vocab --outtype f16 /content/CausalLM
```
Quantize the model
```python
!/content/llama.cpp/quantize --allow-requantize --imatrix /content/imatrix.dat /content/CausalLM-14B-GGUF/causallm_14b.Q8_0.gguf /content/CausalLM-14B-GGUF/causallm_14b.IQ3_XS.gguf IQ3_XS
```
Upload the model
```python
from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
    path_or_fileobj="/content/CausalLM-14B-GGUF/causallm_14b.IQ3_XS.gguf",
    path_in_repo="causallm_14b.IQ3_XS.gguf",
    repo_id="Limour/CausalLM-14B-GGUF"
)
```
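upload_file assumes the target repo already exists; if it does not, create it first (exist_ok=True makes the call safe to re-run):

```python
from huggingface_hub import HfApi

# Create the target repo before uploading; no-op if it already exists.
HfApi().create_repo(repo_id="Limour/CausalLM-14B-GGUF", repo_type="model", exist_ok=True)
```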