---
title: 【Notes】Running Qwen-1.8B on Windows 10 with 6GB of VRAM
urlname: Running-Qwen-on-the-Win10-platform-with-6GB-of-video-memory
index_img: https://api.limour.top/randomImg?d=2024-01-01 03:11:36
date: 2024-01-01 11:11:36
tags: llama
---

[Llama.cpp](https://github.com/ggerganov/llama.cpp) supports mixed CPU & GPU inference. This post records the process of running [Qwen-1.8B](https://huggingface.co/Qwen/Qwen-1_8B) on Windows 10 with a GTX 1660 Ti.

## Preparing the Model

+ [Install conda](/-ji-lu--an-zhuang-conda-bing-geng-huan-qing-hua-yuan)
+ [Tun mode](/Use-Tunnel-to-speed-up-the-connection-of-VPS) (requires administrator privileges)

```powershell
conda create -n llamaConvert python=3.10 git -c conda-forge
conda activate llamaConvert
cd D:\llama
git clone --depth=1 https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
python -m pip install -r requirements.txt
pip install tiktoken
```

```powershell
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='Qwen/Qwen-1_8B-Chat', local_dir=r'D:\qwen', ignore_patterns=['*.h5', '*.ot', '*.msgpack', '*.safetensors'])"
cd D:\qwen
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'model-00001-of-00002.safetensors' "https://huggingface.co/Qwen/Qwen-1_8B-Chat/resolve/main/model-00001-of-00002.safetensors?download=true"
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'model-00002-of-00002.safetensors' "https://huggingface.co/Qwen/Qwen-1_8B-Chat/resolve/main/model-00002-of-00002.safetensors?download=true"
```

```powershell
cd D:\llama\llama.cpp
python convert-hf-to-gguf.py D:\qwen
# Model successfully exported to 'D:\qwen\ggml-model-f16.gguf'
```

## Running the Model

+ Download [llama-b1732-bin-win-cublas-cu12.2.0-x64.zip](https://github.com/ggerganov/llama.cpp/releases)
+ Extract the files to `D:\llama`

```powershell
conda create -n llamaCpp libcublas cuda-toolkit git -c nvidia -c conda-forge
conda activate llamaCpp
cd D:\llama ; .\main.exe ## check that it runs correctly
cd D:\llama ; .\quantize.exe --help ## choose a quantization method yourself
.\quantize.exe D:\qwen\ggml-model-f16.gguf .\qwen-1_8-f16.gguf COPY
.\server.exe -m .\qwen-1_8-f16.gguf -c 4096 --n-gpu-layers 50 ## tune n-gpu-layers to balance CPU & GPU
```

+ Visit `http://127.0.0.1:8080` and choose `Completion` to test the model.

## Fine-tuning the Model

+ [h-corpus dataset](https://huggingface.co/datasets/a686d380/h-corpus-2023)
+ [Official fine-tuning tutorial](https://github.com/QwenLM/Qwen/blob/main/README_CN.md#微调)

## Extra: Yi-6B-Chat

[Yi-6B](https://huggingface.co/01-ai/Yi-6B-Chat) is a bilingual language model open-sourced by 01.AI. Trained on a 3T-token multilingual corpus, it shows promise in language understanding, commonsense reasoning, and reading comprehension.

```powershell
cd D:\models\01yi
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'model-00001-of-00003.safetensors' "https://huggingface.co/01-ai/Yi-6B-Chat/resolve/main/model-00001-of-00003.safetensors?download=true"
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'model-00002-of-00003.safetensors' "https://huggingface.co/01-ai/Yi-6B-Chat/resolve/main/model-00002-of-00003.safetensors?download=true"
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'model-00003-of-00003.safetensors' "https://huggingface.co/01-ai/Yi-6B-Chat/resolve/main/model-00003-of-00003.safetensors?download=true"
conda activate llamaConvert
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='01-ai/Yi-6B-Chat', local_dir=r'D:\models\01yi', ignore_patterns=['*.h5', '*.ot', '*.msgpack', '*.safetensors'])"
```

```powershell
conda activate llamaConvert
cd D:\llama\llama.cpp
python convert.py D:\models\01yi
# Wrote D:\models\01yi\ggml-model-f16.gguf
conda activate llamaCpp
cd D:\llama ; .\quantize.exe --help
.\quantize.exe D:\models\01yi\ggml-model-f16.gguf .\01yi-6b-Q4_K_M.gguf Q4_K_M
.\server.exe -m .\01yi-6b-Q4_K_M.gguf -c 4096 --n-gpu-layers 50
```
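Any of the servers above can also be tested outside the browser UI by posting to the `/completion` endpoint. Below is a minimal sketch, assuming the default listen address of `http://127.0.0.1:8080` and using only the Python standard library; the prompt and sampling parameters are illustrative, not values from this setup.

```python
# Minimal sketch: query the llama.cpp server's /completion endpoint.
# Assumes server.exe is listening on the default 127.0.0.1:8080.
import json
import urllib.request

payload = {
    "prompt": "User: Briefly introduce yourself.\nAssistant:",
    "n_predict": 128,    # cap on the number of generated tokens
    "temperature": 0.8,  # illustrative sampling setting
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result["content"])  # the generated continuation
```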
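The `--n-gpu-layers` values used here came from trial and error. A rough first guess can be derived from the GGUF file size and the model's layer count (printed in `server.exe`'s load log). The sketch below encodes that rule of thumb; every number in it is an assumption to adjust for your own model and machine.

```python
# Rough starting point for --n-gpu-layers: treat the VRAM cost per layer
# as (gguf file size / layer count), and leave headroom for the KV cache
# and CUDA scratch buffers. All numbers are illustrative guesses.
import os

GGUF_PATH = r"D:\llama\01yi-6b-Q4_K_M.gguf"  # the quant produced above
N_LAYERS = 32          # layer count from server.exe's load log (32 for Yi-6B)
VRAM_BUDGET_GIB = 6.0  # total VRAM on the 1660 Ti
HEADROOM_GIB = 1.5     # guess: KV cache, CUDA context, display, etc.

size_gib = os.path.getsize(GGUF_PATH) / 2**30
per_layer_gib = size_gib / N_LAYERS
fit = int((VRAM_BUDGET_GIB - HEADROOM_GIB) / per_layer_gib)
print(f"try --n-gpu-layers {min(fit, N_LAYERS)}")
```

For the Yi-6B quant above this works out to all 32 layers fitting, which matches the `--n-gpu-layers 50` used earlier; llama.cpp simply offloads every layer when the value exceeds the layer count.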
## Extra: Baichuan2

```powershell
cd D:\models\baichuan
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'pytorch_model.bin' "https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat/resolve/main/pytorch_model.bin?download=true"
conda activate llamaConvert
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='baichuan-inc/Baichuan2-7B-Chat', local_dir=r'D:\models\baichuan', ignore_patterns=['*.h5', '*.bin', '*.ot', '*.msgpack', '*.safetensors'])"
cd D:\llama\llama.cpp
python convert.py D:\models\baichuan
# Wrote D:\models\baichuan\ggml-model-f16.gguf
conda activate llamaCpp
cd D:\llama ; .\quantize.exe --help
.\quantize.exe D:\models\baichuan\ggml-model-f16.gguf .\baichuan-7b-Q3_K_M.gguf Q3_K_M
.\server.exe -m .\baichuan-7b-Q3_K_M.gguf -c 2048 --n-gpu-layers 30
```

## Extra: tigerbot-13b

[tigerbot-13b](https://huggingface.co/TigerResearch/tigerbot-13b-chat-v5) ranks near the top of the [chinese-llm-benchmark](https://github.com/jeinlee1991/chinese-llm-benchmark) leaderboard.

```powershell
cd D:\models\tigerbot
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'pytorch_model-00001-of-00003.bin' --max-download-limit=6M "https://huggingface.co/TigerResearch/tigerbot-13b-chat-v5/resolve/main/pytorch_model-00001-of-00003.bin?download=true"
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'pytorch_model-00002-of-00003.bin' --max-download-limit=6M "https://huggingface.co/TigerResearch/tigerbot-13b-chat-v5/resolve/main/pytorch_model-00002-of-00003.bin?download=true"
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'pytorch_model-00003-of-00003.bin' --max-download-limit=6M "https://huggingface.co/TigerResearch/tigerbot-13b-chat-v5/resolve/main/pytorch_model-00003-of-00003.bin?download=true"
conda activate llamaConvert
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='TigerResearch/tigerbot-13b-chat-v5', local_dir=r'D:\models\tigerbot', ignore_patterns=['*.h5', '*.bin', '*.ot', '*.msgpack', '*.safetensors'])"
cd D:\llama\llama.cpp
python convert.py D:\models\tigerbot --padvocab
conda activate llamaCpp
cd D:\llama ; .\quantize.exe --help
.\quantize.exe D:\models\tigerbot\ggml-model-f16.gguf D:\models\tigerbot-13B-Chat-Q4_K_M.gguf Q4_K_M
.\server.exe -m D:\models\tigerbot-13B-Chat-Q4_K_M.gguf -c 4096
```

{% note info %}
With 6GB of VRAM, Yi-6B-Chat-Q4_K_M feels like the most usable of these models.

tigerbot-13b runs at 4.6 tokens/s on an R5 5600H, with CPU usage around 60% at 3.5GHz, which points to a memory-bandwidth bottleneck.
{% endnote %}
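That bandwidth hypothesis is easy to sanity-check: generating one token streams roughly the entire quantized model through RAM, so memory bandwidth divided by model size gives an upper bound on tokens/s. A quick sketch, where both inputs are assumptions rather than measurements (a 13B Q4_K_M file of roughly 8.5GB, and dual-channel DDR4-3200 at its theoretical 51.2GB/s):

```python
# Sanity check for the memory-bandwidth hypothesis: token generation
# reads roughly the whole quantized model from RAM once per token, so
# bandwidth / model size is an upper bound on tokens/s.
# Both inputs below are assumptions, not measurements.
model_gb = 8.5        # rough size of a 13B Q4_K_M gguf, in GB
bandwidth_gbs = 51.2  # dual-channel DDR4-3200, theoretical peak

ceiling = bandwidth_gbs / model_gb
print(f"theoretical ceiling: {ceiling:.1f} tokens/s")        # ~6.0
print(f"observed 4.6 tokens/s = {4.6 / ceiling:.0%} of it")  # ~76%
```

Hitting about three quarters of the theoretical ceiling while the CPU sits at 60% is consistent with the bottleneck being RAM rather than compute.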