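[Domestic Heterogeneous Accelerator Card] GGUF Format Conversion, 4-bit Quantization, and Accelerated Inference for Llama3 with llama.cpp

Important note: This article was compiled from material found online and only records the blogger's process of learning the relevant topics. It will be removed on request if it infringes any rights.

Preface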

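This article uses the llama.cpp framework to convert the Llama3-8B-Instruct model to GGUF format, quantize it to 4 bits, and run inference with the 4-bit model on both CPU and GPU.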

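Test platform: SCNet (Supercomputing Internet Platform)

GPU: Heterogeneous Accelerator Card AI, 64 GB VRAM, PCIe (a ROCm-based GPGPU)

For the detailed configuration of the test server, see: SCNet Supercomputing Internet Platform: Domestic Heterogeneous Accelerator Cards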

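I. References

llama.cpp repository: https://github.com/ggerganov/llama.cpp

Tutorial: How to convert HuggingFace model to GGUF format #2948

llamacpp_zh

Format conversion, quantization, inference, and deployment of LLMs with llama.cpp

[Study notes] Deploying large models on Ubuntu 22 with the llama.cpp quantization tool, CPU+GPU

Llama3 fine-tuning tutorial: installing and deploying LLaMA Factory, fine-tuning the model, quantization, and GGUF conversion.

II. Introduction to llama.cpp

1. `llama.cpp` overview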

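llama.cpp is a C++ library for running large language model (LLM) inference efficiently, either locally or in the cloud. It is a plain C/C++ implementation with no external dependencies, and on x86 it is accelerated with AVX, AVX2, and AVX512. It also provides 2-, 3-, 4-, 5-, 6-, and 8-bit quantization to speed up inference and reduce memory use. For models larger than the total VRAM capacity, it supports hybrid CPU+GPU inference, so that part of the model is still accelerated.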
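As a rough illustration of the hybrid CPU+GPU mode, the sketch below estimates how many layers could be placed on the GPU for a given VRAM budget. It assumes all layers are the same size, and the function name, the reserve, and every figure in it are hypothetical. The result is the kind of number one would pass to llama.cpp's `-ngl` / `--n-gpu-layers` option.

```python
def layers_that_fit(vram_bytes: int, layer_bytes: int, n_layers: int,
                    reserve_bytes: int = 0) -> int:
    """Return how many equally sized layers fit into the VRAM budget.

    reserve_bytes is VRAM held back for other uses (hypothetical allowance,
    for example the KV cache and scratch buffers).
    """
    if layer_bytes <= 0:
        raise ValueError("layer_bytes must be positive")
    usable = max(vram_bytes - reserve_bytes, 0)
    return min(n_layers, usable // layer_bytes)


if __name__ == "__main__":
    GiB = 2**30
    # Hypothetical figures: 40 layers of 1 GiB each, a 24 GiB card,
    # and 2 GiB held back as a reserve.
    n = layers_that_fit(24 * GiB, 1 * GiB, n_layers=40, reserve_bytes=2 * GiB)
    print(f"offload {n} of 40 layers to the GPU")
```

When the whole model fits, the function simply returns the full layer count; the remaining layers, if any, would run on the CPU.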

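Compared with traditional Python-based implementations, llama.cpp runs as compiled C/C++ with no interpreter in the loop, which can improve performance and lower resource consumption. llama.cpp is also cross-platform: it can be built and run on many operating systems, including but not limited to macOS, Linux, and Windows, and it can be deployed in containers with Docker.

2. `llama.cpp` advantages

Choosing llama.cpp as an LLM inference platform has several notable advantages: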

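No-dependency implementation: llama.cpp does not depend on Python, PyTorch, TensorFlow, or similar frameworks. It runs directly as C/C++, which reduces complexity and removes potential performance bottlenecks.

Cross-platform support: from Apple silicon to a wide range of GPUs and CPUs, llama.cpp is optimized for many kinds of hardware, so it performs well across different systems.

Flexible performance configuration: users can quantize a model at different bit widths (1.5-bit to 8-bit), which reduces memory use while keeping inference fast.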
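To make the effect of bit width concrete, here is a minimal sketch that estimates weight storage as parameters times bits per weight. The round figure of 8.0e9 parameters for an 8B model is an assumption, and the helper name is hypothetical. The bits-per-weight values follow from the block layout of llama.cpp's `Q4_0` and `Q8_0` types (32 weights plus one 2-byte scale per block). Real GGUF files differ somewhat because some tensors are kept at a different precision.

```python
# Rough size estimate for a quantized model: parameters x bits per weight.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,  # 32 int8 values + 2-byte scale = 34 bytes per 32 weights
    "Q4_0": 4.5,  # 32 4-bit values + 2-byte scale = 18 bytes per 32 weights
}


def estimate_size_gib(n_params: float, quant: str) -> float:
    """Return the approximate weight storage in GiB for a quantization type."""
    bits = n_params * BITS_PER_WEIGHT[quant]
    return bits / 8 / 2**30


if __name__ == "__main__":
    n_params = 8.0e9  # assumed round figure for an 8B model
    for quant in BITS_PER_WEIGHT:
        print(f"{quant:5s} ~ {estimate_size_gib(n_params, quant):.1f} GiB")
```

Under these assumptions, 4-bit storage is a little over a quarter of the FP16 size, which is what makes an 8B model practical on memory-constrained hardware.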

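3. llama.cpp goals

After its release, llama.cpp racked up 63.2k stars on GitHub (as of August 8, 2024), a rise even steeper than Stable Diffusion's, making it a true "star rocket". The reason is that llama.cpp squarely targets "AI at the edge". Here "edge" can be understood as the counterpart of "cloud": a personal laptop, a gaming PC, a phone, or even a Raspberry Pi all count as the edge.

4. GGUF format

Exploring GGUF: Running Large Language Models Efficiently with llama.cpp

4.1 Introduction

In artificial intelligence, large language models are advancing rapidly and have shown unprecedented capability in natural language processing, machine translation, intelligent assistants, and many other areas. As models keep growing, however, these huge neural networks become harder to store, transfer, and load. Traditional file formats struggle with data at this scale: they are inefficient, and their compatibility and extensibility cannot keep up with growing demands.

Against this backdrop, developer Georgi Gerganov proposed the GGUF format. It can store models in a compact form, reducing model size and memory footprint and thereby improving inference speed and efficiency.

4.2 GGUF overview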

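GGUF (Georgi Gerganov's Universal Format) is a model file format introduced by the llama.cpp project. It is a binary file format designed specifically for large language models, and it aims to address the storage efficiency, loading speed, compatibility, and extensibility problems that large models face in practice. By optimizing its data structures and encoding, GGUF improves the storage efficiency of model files while keeping loading fast. Its design also takes cross-platform and cross-framework compatibility into account, so that a model can run in different hardware and software environments, which has helped large models spread and develop further. Today GGUF is widely used for deploying and sharing large models, and it is especially popular in open-source communities such as Hugging Face.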
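As a small illustration of the binary layout, the sketch below parses the fixed-size header at the start of a GGUF file: the 4-byte magic `GGUF`, a little-endian `uint32` version, then `uint64` tensor and metadata key-value counts (the layout used by GGUF versions 2 and 3). The function names and the synthetic values are hypothetical; a fake header is built in memory so that the example runs without a model file.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(stream):
    """Parse the fixed-size GGUF header (version 2 and 3 layout)."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    # Little-endian: uint32 version, uint64 tensor count, uint64 KV count.
    version, n_tensors, n_kv = struct.unpack("<IQQ", stream.read(20))
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }


def make_fake_header(version=3, n_tensors=2, n_kv=3):
    """Build a synthetic header (illustrative values, not from a real model)."""
    return GGUF_MAGIC + struct.pack("<IQQ", version, n_tensors, n_kv)


if __name__ == "__main__":
    print(read_gguf_header(io.BytesIO(make_fake_header())))
```

To inspect a real file, one would pass `open(path, "rb")` instead of the in-memory stream. The metadata key-value pairs and tensor descriptions follow this header and are not parsed here.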

For more information about GGUF, see: 2398#issuecomment-1682837610.

4.3 GGUF advantages

In practice, the main characteristics and advantages of GGUF models include:

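Efficient storage: GGUF optimizes how data is stored and reduces the space it takes up, which matters most for large models.

Fast loading: GGUF supports loading model data quickly, which is useful for applications that must respond immediately, such as online chatbots or real-time translation systems.

Efficient inference: GGUF organizes model data for faster loading and inference, which is critical for applications that need quick responses.

Memory optimization: through carefully designed data structures and storage schemes, GGUF reduces a model's runtime memory footprint, making it possible to deploy large language models on resource-constrained devices.

Complex tokenization support: GGUF supports complex tokenization, including recognizing and handling special tokens, which lets the model understand and generate text more accurately.

Cross-platform compatibility: as a unified format, GGUF model files can be used on many kinds of hardware and operating systems, which makes models broadly applicable.

Flexibility and extensibility: GGUF is designed with future extension in mind and can adapt to the needs of different language models, including custom vocabularies and special operations.

Quantization support: GGUF supports multiple quantization techniques, allowing a model to run at different precision levels and trade off performance against model size.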
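The quantization support mentioned above can be illustrated with a simplified sketch of block-wise absmax quantization, the idea behind llama.cpp's `Q8_0` type: each block of 32 weights shares one scale, and each weight is stored as a signed 8-bit code. This is a pure-Python illustration with hypothetical function names, not the library's implementation. In particular, the real format stores the scale as fp16, while this sketch keeps it as a Python float.

```python
BLOCK_SIZE = 32  # weights per block, as in llama.cpp's Q8_0 type


def quantize_block(values):
    """Quantize one block of floats to int8 codes plus a single scale."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude to 127; use 1.0 for an all-zero block.
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the codes and the shared scale."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    block = [(-1) ** i * i / 31.0 for i in range(BLOCK_SIZE)]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    max_err = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.6f} max_abs_error={max_err:.6f}")
```

The round-trip error per weight is bounded by half the scale, so blocks with a small dynamic range are reproduced more accurately than blocks containing an outlier.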

Through these innovations, GGUF has become a key factor in llama.cpp's ability to run large language models efficiently, giving developers a powerful tool for deploying and using advanced natural language processing in a wide range of environments.

5. GGML

ggml.ai website: http://ggml.ai/

ggml repository: https://github.com/ggerganov/ggml

llama.cpp repository: https://github.com/ggerganov/llama.cpp

whisper.cpp repository: https://github.com/ggerganov/whisper.cpp

Breaking the Seal! Doubling LLM Inference Throughput: ggml.ai and llama.cpp

5.1 ggml overview

5.2 ggml goals

6. llama-cpp-python

llama-cpp-python repository: https://github.com/abetlen/llama-cpp-python

llama-cpp-python documentation: https://llama-cpp-python.readthedocs.io/en/latest/

Installing llama-cpp-python with GPU Support

llama-cpp-python provides high-level Python bindings for llama.cpp.

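III. Quick start with llama.cpp

Tag b3045 has been tested and confirmed to work.

llama.cpp at tag b3045: https://github.com/ggerganov/llama.cpp/tree/b3045

1. Prepare the environment

The test environment below is for reference only.

`requirements.txt`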

    accelerate==0.32.1
    addict==2.4.0
    aiohttp==3.9.5
    aiosignal==1.3.1
    annotated-types==0.7.0
    anyio==4.4.0
    apex @ https://cancon.hpccube.com:65024/directlink/4/apex/DAS1.0/apex-1.1.0+das1.0+0dd7f68.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=fdeb7c8a0b354a6a2faa61ae2055b2c2e7deb07bfa4aa7811068c5e02455ee1e
    argon2-cffi==23.1.0
    argon2-cffi-bindings==21.2.0
    arrow==1.3.0
    asttokens==2.4.1
    async-lru==2.0.4
    async-timeout==4.0.3
    attrs==23.2.0
    Babel==2.15.0
    beautifulsoup4==4.12.3
    bitsandbytes @ https://cancon.hpccube.com:65024/directlink/4/bitsandbytes/DAS1.0/bitsandbytes-0.37.0+das1.0+gitd3d888f.abi0.dtk2404.torch2.1-py3-none-any.whl#sha256=c46eb3f1555f2153424c3c0297e6645c0881cb76965cf5f3d11f77b52d80c19c
    bleach==6.1.0
    boltons @ file:///croot/boltons_1677628692245/work
    brotlipy==0.7.0
    certifi @ file:///croot/certifi_1707229174982/work/certifi
    cffi @ file:///tmp/abs_98z5h56wf8/croots/recipe/cffi_1659598650955/work
    charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
    click==8.1.7
    coloredlogs==15.0.1
    comm==0.2.2
    conda-content-trust @ file:///tmp/abs_5952f1c8-355c-4855-ad2e-538535021ba5h26t22e5/croots/recipe/conda-content-trust_1658126371814/work
    conda-package-handling @ file:///croot/conda-package-handling_1666940373510/work
    contourpy==1.2.1
    cryptography @ file:///croot/cryptography_1665612644927/work
    cycler==0.12.1
    datasets==2.19.2
    debugpy==1.8.1
    decorator==5.1.1
    deepspeed @ https://cancon.hpccube.com:65024/directlink/4/deepspeed/DAS1.0/deepspeed-0.12.3+das1.0+gita724046.abi0.dtk2404.torch2.1.0-cp310-cp310-manylinux2014_x86_64.whl#sha256=726d64f73ab2ed7bcd716dcb2af53bb3c790ab4a24180b1b9319e7a7ab2cc569
    defusedxml==0.7.1
    diffusers==0.29.2
    dill==0.3.8
    dnspython==2.6.1
    einops==0.8.0
    email_validator==2.1.1
    exceptiongroup==1.2.1
    executing==2.0.1
    fastapi==0.111.0
    fastapi-cli==0.0.4
    fastjsonschema==2.19.1
    filelock==3.14.0
    fire==0.6.0
    flash-attn @ https://cancon.hpccube.com:65024/directlink/4/flash_attn/DAS1.0/flash_attn-2.0.4+das1.0+82379d7.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=2facc1831d95b55bf1bca88c7f23163751f4c749e4f7fc9256d8311ddbb5d399
    flatbuffers==24.3.25
    fonttools==4.52.4
    fqdn==1.5.1
    frozenlist==1.4.1
    fsspec==2024.3.1
    h11==0.14.0
    hf_transfer==0.1.8
    hjson==3.1.0
    httpcore==1.0.5
    httptools==0.6.1
    httpx==0.27.0
    huggingface==0.0.1
    huggingface-hub==0.24.5
    humanfriendly==10.0
    hypothesis==5.35.1
    idna @ file:///croot/idna_1666125576474/work
    importlib_metadata==7.1.0
    invisible-watermark==0.2.0
    ipykernel==6.29.4
    ipython==8.24.0
    ipywidgets==8.1.3
    isoduration==20.11.0
    jedi==0.19.1
    Jinja2==3.1.4
    json5==0.9.25
    jsonpatch @ file:///croot/jsonpatch_1714483231291/work
    jsonpointer==2.1
    jsonschema==4.22.0
    jsonschema-specifications==2023.12.1
    jupyter-events==0.10.0
    jupyter-lsp==2.2.5
    jupyter_client==8.6.2
    jupyter_core==5.7.2
    jupyter_ext_dataset==0.1.0
    jupyter_ext_logo==0.1.0
    jupyter_server==2.14.0
    jupyter_server_terminals==0.5.3
    jupyterlab==4.2.1
    jupyterlab-language-pack-zh-CN==4.0.post6
    jupyterlab_pygments==0.3.0
    jupyterlab_server==2.27.2
    jupyterlab_widgets==3.0.11
    kiwisolver==1.4.5
    lightop @ https://cancon.hpccube.com:65024/directlink/4/lightop/DAS1.0/lightop-0.3+das1.0+837dbb7.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=7f4eb1190a570c05a63a4aade326c87367c4e5ccf6ff82ad5e92220790817e5c
    lmdeploy @ https://cancon.hpccube.com:65024/directlink/4/lmdeploy/DAS1.0/lmdeploy-0.1.0_das1.0+git782048c.abi0.dtk2404.torch2.1.-cp310-cp310-manylinux2014_x86_64.whl#sha256=499940e022de16b3f1211a52c2daa3a603b109a015487499c9e11a53c6d5ad2c
    markdown-it-py==3.0.0
    MarkupSafe==2.1.5
    matplotlib==3.9.0
    matplotlib-inline==0.1.7
    mdurl==0.1.2
    mistune==3.0.2
    mmcv @ https://cancon.hpccube.com:65024/directlink/4/mmcv/DAS1.0/mmcv-2.0.1_das1.0+gitc0ccf15.abi0.dtk2404.torch2.1.-cp310-cp310-manylinux2014_x86_64.whl#sha256=4fc5ff39d232e5ca1efebf7cfdfcf9bc0675308cf40e5f17237c4f2eec66f210
    mmengine==0.10.4
    mmengine-lite==0.10.4
    mpmath==1.3.0
    msgpack==1.0.8
    multidict==6.0.5
    multiprocess==0.70.16
    nbclient==0.10.0
    nbconvert==7.16.4
    nbformat==5.10.4
    nest-asyncio==1.6.0
    networkx==3.3
    ninja==1.11.1.1
    notebook_shim==0.2.4
    numpy==1.24.3
    onnxruntime @ https://cancon.hpccube.com:65024/directlink/4/onnxruntime/DAS1.0/onnxruntime-1.15.0+das1.0+gita9ca438.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl#sha256=509446b41adb89e7507700482cb99e2c399ab3164bc9ea6d9a50e11f84a2406e
    opencv-python==4.9.0.80
    orjson==3.10.3
    overrides==7.7.0
    packaging @ file:///croot/packaging_1710807400464/work
    pandas==2.2.2
    pandocfilters==1.5.1
    parso==0.8.4
    pexpect==4.9.0
    pillow==10.3.0
    platformdirs==4.2.2
    pluggy @ file:///tmp/build/80754af9/pluggy_1648024709248/work
    prometheus_client==0.20.0
    prompt_toolkit==3.0.45
    protobuf==5.27.0
    psutil==5.9.8
    ptyprocess==0.7.0
    pure-eval==0.2.2
    py-cpuinfo==9.0.0
    pyarrow==16.1.0
    pyarrow-hotfix==0.6
    pycosat @ file:///croot/pycosat_1666805502580/work
    pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
    pydantic==2.7.2
    pydantic_core==2.18.3
    Pygments==2.18.0
    pynvml==11.5.0
    pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work
    pyparsing==3.1.2
    PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
    python-dateutil==2.9.0.post0
    python-dotenv==1.0.1
    python-json-logger==2.0.7
    python-multipart==0.0.9
    pytz==2024.1
    PyWavelets==1.6.0
    PyYAML==6.0.1
    pyzmq==26.0.3
    ray==2.9.3
    referencing==0.35.1
    regex==2024.5.15
    requests==2.32.3
    rfc3339-validator==0.1.4
    rfc3986-validator==0.1.1
    rich==13.7.1
    rpds-py==0.18.1
    ruamel.yaml @ file:///croot/ruamel.yaml_1666304550667/work
    ruamel.yaml.clib @ file:///croot/ruamel.yaml.clib_1666302247304/work
    safetensors==0.4.3
    Send2Trash==1.8.3
    sentencepiece==0.2.0
    shellingham==1.5.4
    six @ file:///tmp/build/80754af9/six_1644875935023/work
    sniffio==1.3.1
    sortedcontainers==2.4.0
    soupsieve==2.5
    stack-data==0.6.3
    starlette==0.37.2
    sympy==1.12.1
    termcolor==2.4.0
    terminado==0.18.1
    tiktoken==0.7.0
    tinycss2==1.3.0
    tokenizers==0.15.0
    tomli==2.0.1
    toolz @ file:///croot/toolz_1667464077321/work
    torch @ https://cancon.hpccube.com:65024/directlink/4/pytorch/DAS1.0/torch-2.1.0+das1.0+git00661e0.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl#sha256=0b5f4be74ffdd6fe7540a844bf4f02e432b7d267b5e9fdd7f9448192d93bf3b6
    torchaudio @ https://cancon.hpccube.com:65024/directlink/4/torchaudio/DAS1.0/torchaudio-2.1.2+das1.0+253903e.abi0.dtk2404.torch2.1.0-cp310-cp310-manylinux2014_x86_64.whl#sha256=2a7b3bbe8b558f48784f302900fd1dff3ff9d10a3c139e00f2b136a76d6d7f1c
    torchvision @ https://cancon.hpccube.com:65024/directlink/4/vision/DAS1.0/torchvision-0.16.0+das1.0+gitc9e7141.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=4d5e5071e89892cccb24c3ee0216cd79b3c22bc5cf1eb0eb49c2792d9f49fb62
    tornado==6.4
    tqdm @ file:///opt/conda/conda-bld/tqdm_1664392687731/work
    traitlets==5.14.3
    transformers==4.38.0
    triton @ https://cancon.hpccube.com:65024/directlink/4/triton/DAS1.0/triton-2.1.0+das1.0+git3841f975.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl#sha256=0dda810eb171af0b3f5cf90a1a4b2f41c9ef0ef08453762a798c86dd01fe976f
    typer==0.12.3
    types-python-dateutil==2.9.0.20240316
    typing_extensions==4.12.0
    tzdata==2024.1
    ujson==5.10.0
    uri-template==1.3.0
    urllib3 @ file:///croot/urllib3_1670526988650/work
    uvicorn==0.30.0
    uvloop==0.19.0
    vllm @ https://cancon.hpccube.com:65024/directlink/4/vllm/DAS1.0/vllm-0.3.3+das1.0+git3380931.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=23bcdb8a6eb0382770dc7460ea3f7c85cd0c885913b28759eb8a9894731cdb87
    watchfiles==0.22.0
    wcwidth==0.2.13
    webcolors==1.13
    webencodings==0.5.1
    websocket-client==1.8.0
    websockets==12.0
    widgetsnbextension==4.0.11
    xformers @ https://cancon.hpccube.com:65024/directlink/4/xformers/DAS1.0/xformers-0.0.25+das1.0+gitd11e899.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=b086d1bd50bd19c82ca44c424fe193dfcdd48bdd6695d3e6a58f53764c64f428
    xxhash==3.4.1
    yapf==0.40.2
    yarl==1.9.4
    zipp==3.19.0

`envs.yaml`

    name: llama.cpp
    channels:
      - https://repo.anaconda.com/pkgs/main
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
      - defaults
    dependencies:
      - _libgcc_mutex=0.1=main
      - _openmp_mutex=5.1=1_gnu
      - boltons=23.0.0=py310h06a4308_0
      - brotlipy=0.7.0=py310h7f8727e_1002
      - bzip2=1.0.8=h7b6447c_0
      - ca-certificates=2024.3.11=h06a4308_0
      - certifi=2024.2.2=py310h06a4308_0
      - cffi=1.15.1=py310h74dc2b5_0
      - charset-normalizer=2.0.4=pyhd3eb1b0_0
      - conda-content-trust=0.1.3=py310h06a4308_0
      - conda-package-handling=1.9.0=py310h5eee18b_1
      - cryptography=38.0.1=py310h9ce1e76_0
      - idna=3.4=py310h06a4308_0
      - jsonpatch=1.33=py310h06a4308_1
      - ld_impl_linux-64=2.38=h1181459_1
      - libffi=3.3=he6710b0_2
      - libgcc-ng=11.2.0=h1234567_1
      - libgomp=11.2.0=h1234567_1
      - libstdcxx-ng=11.2.0=h1234567_1
      - libuuid=1.41.5=h5eee18b_0
      - ncurses=6.3=h5eee18b_3
      - openssl=1.1.1w=h7f8727e_0
      - pluggy=1.0.0=py310h06a4308_1
      - pycosat=0.6.4=py310h5eee18b_0
      - pycparser=2.21=pyhd3eb1b0_0
      - pyopenssl=22.0.0=pyhd3eb1b0_0
      - pysocks=1.7.1=py310h06a4308_0
      - python=3.10.8=haa1d7c7_0
      - readline=8.2=h5eee18b_0
      - ruamel.yaml=0.17.21=py310h5eee18b_0
      - ruamel.yaml.clib=0.2.6=py310h5eee18b_1
      - setuptools=65.5.0=py310h06a4308_0
      - six=1.16.0=pyhd3eb1b0_1
      - sqlite=3.40.0=h5082296_0
      - tk=8.6.12=h1ccaba5_0
      - toolz=0.12.0=py310h06a4308_0
      - tqdm=4.64.1=py310h06a4308_0
      - urllib3=1.26.13=py310h06a4308_0
      - wheel=0.37.1=pyhd3eb1b0_0
      - xz=5.2.8=h5eee18b_0
      - zlib=1.2.13=h5eee18b_0
      - pip:
          - accelerate==0.32.1
          - addict==2.4.0
          - aiohttp==3.9.5
          - aiosignal==1.3.1
          - annotated-types==0.7.0
          - anyio==4.4.0
          - apex==1.1.0+0dd7f68.abi0.dtk2404.torch2.1
          - argon2-cffi==23.1.0
          - argon2-cffi-bindings==21.2.0
          - arrow==1.3.0
          - asttokens==2.4.1
          - async-lru==2.0.4
          - async-timeout==4.0.3
          - attrs==23.2.0
          - babel==2.15.0
          - beautifulsoup4==4.12.3
          - bitsandbytes==0.37.0+gitd3d888f.abi0.dtk2404.torch2.1
          - bleach==6.1.0
          - click==8.1.7
          - coloredlogs==15.0.1
          - comm==0.2.2
          - contourpy==1.2.1
          - cycler==0.12.1
          - datasets==2.19.2
          - debugpy==1.8.1
          - decorator==5.1.1
          - deepspeed==0.12.3+gita724046.abi0.dtk2404.torch2.1.0
          - defusedxml==0.7.1
          - diffusers==0.29.2
          - dill==0.3.8
          - dnspython==2.6.1
          - einops==0.8.0
          - email-validator==2.1.1
          - exceptiongroup==1.2.1
          - executing==2.0.1
          - fastapi==0.111.0
          - fastapi-cli==0.0.4
          - fastjsonschema==2.19.1
          - filelock==3.14.0
          - fire==0.6.0
          - flash-attn==2.0.4+82379d7.abi0.dtk2404.torch2.1
          - flatbuffers==24.3.25
          - fonttools==4.52.4
          - fqdn==1.5.1
          - frozenlist==1.4.1
          - fsspec==2024.3.1
          - h11==0.14.0
          - hf-transfer==0.1.8
          - hjson==3.1.0
          - httpcore==1.0.5
          - httptools==0.6.1
          - httpx==0.27.0
          - huggingface==0.0.1
          - huggingface-hub==0.24.5
          - humanfriendly==10.0
          - hypothesis==5.35.1
          - importlib-metadata==7.1.0
          - invisible-watermark==0.2.0
          - ipykernel==6.29.4
          - ipython==8.24.0
          - ipywidgets==8.1.3
          - isoduration==20.11.0
          - jedi==0.19.1
          - jinja2==3.1.4
          - json5==0.9.25
          - jsonpointer==2.4
          - jsonschema==4.22.0
          - jsonschema-specifications==2023.12.1
          - jupyter-client==8.6.2
          - jupyter-core==5.7.2
          - jupyter-events==0.10.0
          - jupyter-ext-dataset==0.1.0
          - jupyter-ext-logo==0.1.0
          - jupyter-lsp==2.2.5
          - jupyter-server==2.14.0
          - jupyter-server-terminals==0.5.3
          - jupyterlab==4.2.1
          - jupyterlab-language-pack-zh-cn==4.0.post6
          - jupyterlab-pygments==0.3.0
          - jupyterlab-server==2.27.2
          - jupyterlab-widgets==3.0.11
          - kiwisolver==1.4.5
          - lightop==0.3+837dbb7.abi0.dtk2404.torch2.1
          - lmdeploy==0.1.0-git782048c.abi0.dtk2404.torch2.1.
          - markdown-it-py==3.0.0
          - markupsafe==2.1.5
          - matplotlib==3.9.0
          - matplotlib-inline==0.1.7
          - mdurl==0.1.2
          - mistune==3.0.2
          - mmcv==2.0.1-gitc0ccf15.abi0.dtk2404.torch2.1.
          - mmengine==0.10.4
          - mmengine-lite==0.10.4
          - mpmath==1.3.0
          - msgpack==1.0.8
          - multidict==6.0.5
          - multiprocess==0.70.16
          - nbclient==0.10.0
          - nbconvert==7.16.4
          - nbformat==5.10.4
          - nest-asyncio==1.6.0
          - networkx==3.3
          - ninja==1.11.1.1
          - notebook-shim==0.2.4
          - numpy==1.24.3
          - onnxruntime==1.15.0+gita9ca438.abi0.dtk2404
          - opencv-python==4.9.0.80
          - orjson==3.10.3
          - overrides==7.7.0
          - packaging==24.0
          - pandas==2.2.2
          - pandocfilters==1.5.1
          - parso==0.8.4
          - pexpect==4.9.0
          - pillow==10.3.0
          - pip==24.0
          - platformdirs==4.2.2
          - prometheus-client==0.20.0
          - prompt-toolkit==3.0.45
          - protobuf==5.27.0
          - psutil==5.9.8
          - ptyprocess==0.7.0
          - pure-eval==0.2.2
          - py-cpuinfo==9.0.0
          - pyarrow==16.1.0
          - pyarrow-hotfix==0.6
          - pydantic==2.7.2
          - pydantic-core==2.18.3
          - pygments==2.18.0
          - pynvml==11.5.0
          - pyparsing==3.1.2
          - python-dateutil==2.9.0.post0
          - python-dotenv==1.0.1
          - python-json-logger==2.0.7
          - python-multipart==0.0.9
          - pytz==2024.1
          - pywavelets==1.6.0
          - pyyaml==6.0.1
          - pyzmq==26.0.3
          - ray==2.9.3
          - referencing==0.35.1
          - regex==2024.5.15
          - requests==2.32.3
          - rfc3339-validator==0.1.4
          - rfc3986-validator==0.1.1
          - rich==13.7.1
          - rpds-py==0.18.1
          - safetensors==0.4.3
          - send2trash==1.8.3
          - sentencepiece==0.2.0
          - shellingham==1.5.4
          - sniffio==1.3.1
          - sortedcontainers==2.4.0
          - soupsieve==2.5
          - stack-data==0.6.3
          - starlette==0.37.2
          - sympy==1.12.1
          - termcolor==2.4.0
          - terminado==0.18.1
          - tiktoken==0.7.0
          - tinycss2==1.3.0
          - tokenizers==0.15.0
          - tomli==2.0.1
          - torch==2.1.0+git00661e0.abi0.dtk2404
          - torchaudio==2.1.2+253903e.abi0.dtk2404.torch2.1.0
          - torchvision==0.16.0+gitc9e7141.abi0.dtk2404.torch2.1
          - tornado==6.4
          - traitlets==5.14.3
          - transformers==4.38.0
          - triton==2.1.0+git3841f975.abi0.dtk2404
          - typer==0.12.3
          - types-python-dateutil==2.9.0.20240316
          - typing-extensions==4.12.0
          - tzdata==2024.1
          - ujson==5.10.0
          - uri-template==1.3.0
          - uvicorn==0.30.0
          - uvloop==0.19.0
          - vllm==0.3.3+git3380931.abi0.dtk2404.torch2.1
          - watchfiles==0.22.0
          - wcwidth==0.2.13
          - webcolors==1.13
          - webencodings==0.5.1
          - websocket-client==1.8.0
          - websockets==12.0
          - widgetsnbextension==4.0.11
          - xformers==0.0.25+gitd11e899.abi0.dtk2404.torch2.1
          - xxhash==3.4.1
          - yapf==0.40.2
          - yarl==1.9.4
          - zipp==3.19.0
    prefix: /opt/conda/envs/llama.cpp

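2. Download llama.cpp

    # Download llama.cpp
    # If the download fails, download it manually and upload it to the server
    git clone https://github.com/ggerganov/llama.cpp.git
    # Enter the repository before checking out the tag
    cd llama.cpp
    # Check out the b3045 tag and create a b3045 branch from it
    git checkout -b b3045 b3045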

Directory listing before the build:

    root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# tree -L 1
    .
    |-- AUTHORS
    |-- CMakeLists.txt
    |-- CMakePresets.json
    |-- LICENSE
    |-- Makefile
    |-- Package.swift
    |-- README-sycl.md
    |-- README.md
    |-- SECURITY.md
    |-- ci
    |-- cmake
    |-- codecov.yml
    |-- common
    |-- convert-hf-to-gguf-update.py
    |-- convert-hf-to-gguf.py
    |-- convert-llama-ggml-to-gguf.py
    |-- convert.py
    |-- docs
    |-- examples
    |-- flake.lock
    |-- flake.nix
    |-- ggml-alloc.c
    |-- ggml-alloc.h
    |-- ggml-backend-impl.h
    |-- ggml-backend.c
    |-- ggml-backend.h
    |-- ggml-common.h
    |-- ggml-cuda
    |-- ggml-cuda.cu
    |-- ggml-cuda.h
    |-- ggml-impl.h
    |-- ggml-kompute.cpp
    |-- ggml-kompute.h
    |-- ggml-metal.h
    |-- ggml-metal.m
    |-- ggml-metal.metal
    |-- ggml-opencl.cpp
    |-- ggml-opencl.h
    |-- ggml-quants.c
    |-- ggml-quants.h
    |-- ggml-rpc.cpp
    |-- ggml-rpc.h
    |-- ggml-sycl.cpp
    |-- ggml-sycl.h
    |-- ggml-vulkan-shaders.hpp
    |-- ggml-vulkan.cpp
    |-- ggml-vulkan.h
    |-- ggml.c
    |-- ggml.h
    |-- ggml_vk_generate_shaders.py
    |-- gguf-py
    |-- grammars
    |-- kompute
    |-- kompute-shaders
    |-- llama.cpp
    |-- llama.h
    |-- media
    |-- models
    |-- mypy.ini
    |-- pocs
    |-- prompts
    |-- pyrightconfig.json
    |-- requirements
    |-- requirements.txt
    |-- scripts
    |-- sgemm.cpp
    |-- sgemm.h
    |-- spm-headers
    |-- tests
    |-- unicode-data.cpp
    |-- unicode-data.h
    |-- unicode.cpp
    `-- unicode.h

3. Build llama.cpp

Build llama.cpp locally

3.1 Build the CPU version

    # If this is not the first build, clean the previous build artifacts first
    make clean
    make -j32

root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# make -j32 I ccache not found. Consider installing it for faster compilation. I llama.cpp build info: I UNAME_S: Linux I UNAME_P: x86_64 I UNAME_M: x86_64 I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion I CXXFLAGS: -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE I NVCCFLAGS: -std=c++11 -O3 I LDFLAGS: I CC: cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 I CXX: c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion -c ggml.c -o ggml.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c llama.cpp -o llama.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c common/common.cpp -o common.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c common/sampling.cpp -o sampling.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c common/grammar-parser.cpp -o grammar-parser.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c common/json-schema-to-grammar.cpp -o json-schema-to-grammar.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c common/console.cpp -o console.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c sgemm.cpp -o sgemm.o cc -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion -c ggml-alloc.c -o ggml-alloc.o cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion -c ggml-backend.c -o ggml-backend.o cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion -c ggml-quants.c -o ggml-quants.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c unicode.cpp -o unicode.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c unicode-data.cpp -o unicode-data.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c common/train.cpp -o train.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c common/ngram-cache.cpp -o ngram-cache.o cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion -c tests/test-c.c -o tests/test-c.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c common/build-info.cpp -o build-info.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c pocs/vdot/vdot.cpp -o pocs/vdot/vdot.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c pocs/vdot/q8dot.cpp -o pocs/vdot/q8dot.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/gguf/gguf.cpp -o examples/gguf/gguf.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/benchmark/benchmark-matmult.cpp -o examples/benchmark/benchmark-matmult.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/export-lora/export-lora.cpp -o examples/export-lora/export-lora.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf/gguf.o -o gguf c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/q8dot.o -o q8dot c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/vdot.o -o vdot c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE build-info.o ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/benchmark/benchmark-matmult.o -o benchmark-matmult c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/export-lora/export-lora.o -o export-lora c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/main/main.cpp -o examples/main/main.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/quantize/quantize.cpp -o examples/quantize/quantize.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/quantize-stats/quantize-stats.cpp -o examples/quantize-stats/quantize-stats.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/perplexity/perplexity.cpp -o examples/perplexity/perplexity.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/imatrix/imatrix.cpp -o examples/imatrix/imatrix.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/embedding/embedding.cpp -o examples/embedding/embedding.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/train-text-from-scratch/train-text-from-scratch.cpp -o examples/train-text-from-scratch/train-text-from-scratch.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp -o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/simple/simple.cpp -o examples/simple/simple.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/batched/batched.cpp -o examples/batched/batched.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/batched-bench/batched-bench.cpp -o examples/batched-bench/batched-bench.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/save-load-state/save-load-state.cpp -o examples/save-load-state/save-load-state.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/server/server.cpp -o examples/server/server.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/gguf-split/gguf-split.cpp -o examples/gguf-split/gguf-split.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/eval-callback/eval-callback.cpp -o examples/eval-callback/eval-callback.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/llama-bench/llama-bench.cpp -o examples/llama-bench/llama-bench.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/llava/llava-cli.cpp -o examples/llava/llava-cli.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/baby-llama/baby-llama.cpp -o examples/baby-llama/baby-llama.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/beam-search/beam-search.cpp -o examples/beam-search/beam-search.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/retrieval/retrieval.cpp -o examples/retrieval/retrieval.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/speculative/speculative.cpp -o examples/speculative/speculative.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/infill/infill.cpp -o examples/infill/infill.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/tokenize/tokenize.cpp -o examples/tokenize/tokenize.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/parallel/parallel.cpp -o examples/parallel/parallel.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/finetune/finetune.cpp -o examples/finetune/finetune.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/lookahead/lookahead.cpp -o examples/lookahead/lookahead.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/lookup/lookup.cpp -o examples/lookup/lookup.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/passkey/passkey.cpp -o examples/passkey/passkey.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/gritlm/gritlm.cpp -o examples/gritlm/gritlm.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/baby-llama/baby-llama.o -o baby-llama c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/tokenize/tokenize.o -o tokenize c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/eval-callback/eval-callback.o -o eval-callback c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/save-load-state/save-load-state.o -o save-load-state c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/beam-search/beam-search.o -o beam-search c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf-split/gguf-split.o -o gguf-split c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/simple/simple.o -o simple c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gritlm/gritlm.o -o gritlm c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/embedding/embedding.o -o embedding c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE build-info.o ggml.o llama.o common.o sampling.o grammar-parser.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched-bench/batched-bench.o -o batched-bench c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/passkey/passkey.o -o passkey c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched/batched.o -o batched c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize/quantize.o -o quantize c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup.o -o lookup c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookahead/lookahead.o -o lookahead c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/llava/clip.cpp -o examples/llava/clip.o -Wno-cast-qual c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/parallel/parallel.o -o parallel c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/train-text-from-scratch/train-text-from-scratch.o -o train-text-from-scratch c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/lookup/lookup-create.cpp -o examples/lookup/lookup-create.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/retrieval/retrieval.o -o retrieval c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o -o convert-llama2c-to-ggml c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/infill/infill.o -o infill c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/imatrix/imatrix.o -o imatrix c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/speculative/speculative.o -o speculative c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/finetune/finetune.o -o finetune c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/main/main.o -o main c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-create.o -o lookup-create ==== Run ./main -h for help. ==== c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/lookup/lookup-merge.cpp -o examples/lookup/lookup-merge.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-merge.o -o lookup-merge c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/perplexity/perplexity.o -o perplexity c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/lookup/lookup-stats.cpp -o examples/lookup/lookup-stats.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE build-info.o ggml.o llama.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize-stats/quantize-stats.o -o quantize-stats c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-stats.o -o lookup-stats c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llama-bench/llama-bench.o -o llama-bench c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -c examples/llava/llava.cpp -o examples/llava/llava.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llava/llava-cli.o examples/llava/clip.o examples/llava/llava.o -o llava-cli c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o -Iexamples/server examples/server/server.o -o server

Directory contents after the build:

root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# tree -L 1
.
|-- AUTHORS
|-- CMakeLists.txt
|-- CMakePresets.json
|-- LICENSE
|-- Makefile
|-- Package.swift
|-- README-sycl.md
|-- README.md
|-- SECURITY.md
|-- baby-llama
|-- batched
|-- batched-bench
|-- beam-search
|-- benchmark-matmult
|-- build-info.o
|-- ci
|-- cmake
|-- codecov.yml
|-- common
|-- common.o
|-- console.o
|-- convert-hf-to-gguf-update.py
|-- convert-hf-to-gguf.py
|-- convert-llama-ggml-to-gguf.py
|-- convert-llama2c-to-ggml
|-- convert.py
|-- docs
|-- embedding
|-- eval-callback
|-- examples
|-- export-lora
|-- finetune
|-- flake.lock
|-- flake.nix
|-- ggml-alloc.c
|-- ggml-alloc.h
|-- ggml-alloc.o
|-- ggml-backend-impl.h
|-- ggml-backend.c
|-- ggml-backend.h
|-- ggml-backend.o
|-- ggml-common.h
|-- ggml-cuda
|-- ggml-cuda.cu
|-- ggml-cuda.h
|-- ggml-impl.h
|-- ggml-kompute.cpp
|-- ggml-kompute.h
|-- ggml-metal.h
|-- ggml-metal.m
|-- ggml-metal.metal
|-- ggml-opencl.cpp
|-- ggml-opencl.h
|-- ggml-quants.c
|-- ggml-quants.h
|-- ggml-quants.o
|-- ggml-rpc.cpp
|-- ggml-rpc.h
|-- ggml-sycl.cpp
|-- ggml-sycl.h
|-- ggml-vulkan-shaders.hpp
|-- ggml-vulkan.cpp
|-- ggml-vulkan.h
|-- ggml.c
|-- ggml.h
|-- ggml.o
|-- ggml_vk_generate_shaders.py
|-- gguf
|-- gguf-py
|-- gguf-split
|-- grammar-parser.o
|-- grammars
|-- gritlm
|-- imatrix
|-- infill
|-- json-schema-to-grammar.o
|-- kompute
|-- kompute-shaders
|-- libllava.a
|-- llama-bench
|-- llama.cpp
|-- llama.h
|-- llama.o
|-- llava-cli
|-- lookahead
|-- lookup
|-- lookup-create
|-- lookup-merge
|-- lookup-stats
|-- main
|-- media
|-- models
|-- mypy.ini
|-- ngram-cache.o
|-- parallel
|-- passkey
|-- perplexity
|-- pocs
|-- prompts
|-- pyrightconfig.json
|-- q8dot
|-- quantize
|-- quantize-stats
|-- requirements
|-- requirements.txt
|-- retrieval
|-- sampling.o
|-- save-load-state
|-- scripts
|-- server
|-- sgemm.cpp
|-- sgemm.h
|-- sgemm.o
|-- simple
|-- speculative
|-- spm-headers
|-- tests
|-- tokenize
|-- train-text-from-scratch
|-- train.o
|-- unicode-data.cpp
|-- unicode-data.h
|-- unicode-data.o
|-- unicode.cpp
|-- unicode.h
|-- unicode.o
`-- vdot

Notes

main: runs model inference.
quantize: quantizes a model.
server: serves the model through an API.
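As a rough sketch of how these three binaries are typically invoked, the script below prints (or, with DRY_RUN=0, runs) one command for each. The model file names are hypothetical placeholders, not files produced so far in this post; the flags shown (-m, -p, -n, --host, --port) and the Q8_0 type name are the commonly used ones for this generation of llama.cpp, but check `./main -h`, `./quantize -h` and `./server -h` for your build.

```shell
#!/usr/bin/env bash
# Hypothetical model paths (assumptions for illustration only).
MODEL_F16="models/llama3-8b-instruct-f16.gguf"
MODEL_Q8="models/llama3-8b-instruct-q8_0.gguf"

# run: print the command when DRY_RUN is 1 (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Quantize an f16 GGUF file to 8-bit (Q8_0).
run ./quantize "$MODEL_F16" "$MODEL_Q8" Q8_0

# Generate up to 64 tokens from a short prompt.
run ./main -m "$MODEL_Q8" -p "Hello" -n 64

# Serve the model over HTTP on localhost.
run ./server -m "$MODEL_Q8" --host 127.0.0.1 --port 8080
```

Keeping the dry-run default makes it easy to review the exact command lines before launching a long quantization or inference job.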

3.2 Building the GPU version (hipBLAS)

speedup ROCm AMD Unified Memory Architecture #7399

Install and run llama.cpp with ROCm 5.7 on Ubuntu 22.04

HIP_VISIBLE_DEVICES

User Guide for AMDGPU Backend

Running Llama 2 with llama.cpp, using an AMD Radeon RX 6900 for GPU acceleration

Note:
The domestic heterogeneous accelerator card is a GPGPU built on the ROCm platform, so the build steps can follow the hipBLAS instructions.

# Check the GPU architecture
rocminfo | grep gfx
# or
rocminfo | grep gfx | head -1 | awk '{print $2}'

# Build
make -j32 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx928

(ollama) root@notebook-1823641624653922306-scnlbe5oi5-42808:~# rocminfo | grep gfx
Name: gfx928
Name: amdgcn-amd-amdhsa--gfx928:sramecc+:xnack-
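Rather than hard-coding the architecture, the target can be pulled out of the rocminfo output and passed to make. The following is a minimal sketch: the helper name first_gfx_target is made up for illustration, and it assumes architecture names look like gfx followed by hexadecimal digits (gfx928, gfx90a, and so on).

```shell
#!/usr/bin/env bash
# first_gfx_target (hypothetical helper): read rocminfo-style text on stdin
# and print the first gfx architecture name found.
first_gfx_target() {
    grep -o 'gfx[0-9a-f]*' | head -n 1
}

# Try it on the two lines of output shown above.
sample='Name: gfx928
Name: amdgcn-amd-amdhsa--gfx928:sramecc+:xnack-'
printf '%s\n' "$sample" | first_gfx_target   # prints gfx928

# On a real machine (assumes rocminfo is on PATH), uncomment:
# target="$(rocminfo | first_gfx_target)"
# make -j"$(nproc)" LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS="$target"
```

On a machine with several different accelerators this picks only the first one reported, so check the full rocminfo output in that case.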

root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# make -j32 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx928
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion
I CXXFLAGS: -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA
I NVCCFLAGS: -std=c++11 -O3
I LDFLAGS: -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
I CC: cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
I CXX: c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
...
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/main/main.cpp -o examples/main/main.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I.
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/quantize/quantize.cpp -o examples/quantize/quantize.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/quantize-stats/quantize-stats.cpp -o examples/quantize-stats/quantize-stats.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/perplexity/perplexity.cpp -o examples/perplexity/perplexity.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/imatrix/imatrix.cpp -o examples/imatrix/imatrix.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/embedding/embedding.cpp -o examples/embedding/embedding.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c pocs/vdot/vdot.cpp -o pocs/vdot/vdot.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c pocs/vdot/q8dot.cpp -o pocs/vdot/q8dot.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/train-text-from-scratch/train-text-from-scratch.cpp -o examples/train-text-from-scratch/train-text-from-scratch.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp -o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/simple/simple.cpp -o examples/simple/simple.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/batched/batched.cpp -o examples/batched/batched.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/batched-bench/batched-bench.cpp -o examples/batched-bench/batched-bench.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/save-load-state/save-load-state.cpp -o examples/save-load-state/save-load-state.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/server/server.cpp -o examples/server/server.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/gguf/gguf.cpp -o examples/gguf/gguf.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/gguf-split/gguf-split.cpp -o examples/gguf-split/gguf-split.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/eval-callback/eval-callback.cpp -o examples/eval-callback/eval-callback.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/llama-bench/llama-bench.cpp -o examples/llama-bench/llama-bench.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/llava/llava-cli.cpp -o examples/llava/llava-cli.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/baby-llama/baby-llama.cpp -o examples/baby-llama/baby-llama.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/beam-search/beam-search.cpp -o examples/beam-search/beam-search.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/retrieval/retrieval.cpp -o examples/retrieval/retrieval.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/speculative/speculative.cpp -o examples/speculative/speculative.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/infill/infill.cpp -o examples/infill/infill.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/tokenize/tokenize.cpp -o examples/tokenize/tokenize.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/benchmark/benchmark-matmult.cpp -o examples/benchmark/benchmark-matmult.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/parallel/parallel.cpp -o examples/parallel/parallel.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/finetune/finetune.cpp -o examples/finetune/finetune.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/export-lora/export-lora.cpp -o examples/export-lora/export-lora.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/lookahead/lookahead.cpp -o examples/lookahead/lookahead.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf/gguf.o -o gguf -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/q8dot.o -o q8dot -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/vdot.o -o vdot -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/baby-llama/baby-llama.o -o baby-llama -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA build-info.o ggml.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/benchmark/benchmark-matmult.o -o benchmark-matmult -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/tokenize/tokenize.o -o tokenize -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/eval-callback/eval-callback.o -o eval-callback -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/beam-search/beam-search.o -o beam-search -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/save-load-state/save-load-state.o -o save-load-state -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/lookup/lookup.cpp -o examples/lookup/lookup.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/passkey/passkey.cpp -o examples/passkey/passkey.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/gritlm/gritlm.cpp -o examples/gritlm/gritlm.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf-split/gguf-split.o -o gguf-split -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA build-info.o ggml.o llama.o common.o sampling.o grammar-parser.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched-bench/batched-bench.o -o batched-bench -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/simple/simple.o -o simple -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched/batched.o -o batched -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/embedding/embedding.o -o embedding -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/export-lora/export-lora.o -o export-lora -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize/quantize.o -o quantize -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookahead/lookahead.o -o lookahead -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/parallel/parallel.o -o parallel -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/llava/clip.cpp -o examples/llava/clip.o -Wno-cast-qual
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o -o convert-llama2c-to-ggml -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/train-text-from-scratch/train-text-from-scratch.o -o train-text-from-scratch -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/retrieval/retrieval.o -o retrieval -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/imatrix/imatrix.o -o imatrix -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/speculative/speculative.o -o speculative -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/infill/infill.o -o infill -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/finetune/finetune.o -o finetune -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gritlm/gritlm.o -o gritlm -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/main/main.o -o main -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/passkey/passkey.o -o passkey -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup.o -o lookup -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas ==== Run ./main -h for help. ==== c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/lookup/lookup-create.cpp -o examples/lookup/lookup-create.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/perplexity/perplexity.o -o perplexity -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-create.o -o lookup-create -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/lookup/lookup-merge.cpp -o examples/lookup/lookup-merge.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA build-info.o ggml.o llama.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize-stats/quantize-stats.o -o quantize-stats -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-merge.o -o lookup-merge -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/lookup/lookup-stats.cpp -o examples/lookup/lookup-stats.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-stats.o -o lookup-stats -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llama-bench/llama-bench.o -o llama-bench -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA -c examples/llava/llava.cpp -o examples/llava/llava.o c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llava/llava-cli.o examples/llava/clip.o examples/llava/llava.o -o llava-cli -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o -Iexamples/server examples/server/server.o -o server -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas

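4. Prepare the model

Find a model in a suitable format on Hugging Face and download it into the models directory of llama.cpp. Alternatively, upload a model that you have already downloaded locally to the models directory.

4.1 Download the original LLaMA model

If you downloaded Meta's original LLaMA model, it first has to be converted to HF format.

Use the convert_llama_weights_to_hf.py script provided by transformers to convert the original LLaMA model to Hugging Face format.

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir path_to_original_llama_root_dir \
    --model_size 7B \
    --output_dir path_to_original_llama_hf_dir

Note that the tokenizer.model of the original LLaMA must be placed in the directory given by --input_dir, and the remaining files under ${input_dir}/${model_size}. After running the command above, --output_dir will contain the converted HF weights.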
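The directory layout described above is easy to get wrong, so it can help to check it before starting the conversion. The following is a minimal sketch, not part of transformers or llama.cpp; the function name check_llama_layout is hypothetical, and it only checks that tokenizer.model sits in the input directory and that the ${model_size} subdirectory exists and is not empty.

```python
from pathlib import Path
from typing import List


def check_llama_layout(input_dir: str, model_size: str = "7B") -> List[str]:
    """Return a list of problems with the layout expected by the conversion script.

    Hypothetical helper: it checks only the two rules stated in the text,
    not the contents of the weight files.
    """
    root = Path(input_dir)
    problems = []

    # tokenizer.model must sit directly in --input_dir
    if not (root / "tokenizer.model").is_file():
        problems.append(f"missing {root / 'tokenizer.model'}")

    # all other files go under ${input_dir}/${model_size}
    size_dir = root / model_size
    if not size_dir.is_dir():
        problems.append(f"missing directory {size_dir}")
    elif not any(size_dir.iterdir()):
        problems.append(f"{size_dir} is empty")

    return problems


if __name__ == "__main__":
    for problem in check_llama_layout("path_to_original_llama_root_dir", "7B"):
        print(problem)
```

An empty list means the layout matches the two rules; anything else is printed as a readable hint.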

4.2 Download a gguf model

You can download a gguf model directly and skip the gguf conversion step.

QuantFactory/Meta-Llama-3-8B-Instruct-GGUF

./main -m $(./scripts/hf.sh --repo QuantFactory/Meta-Llama-3-8B-Instruct-GGUF --file Meta-Llama-3-8B-Instruct.Q4_0.gguf --outdir ./models)
./main -m $(./scripts/hf.sh --url https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q4_0.gguf --outdir ./models)
./main -m $(./scripts/hf.sh https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_0.gguf --outdir ./models)
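After a download, a quick sanity check is to read the file header and confirm that it really is a GGUF file. The sketch below is an illustration rather than a llama.cpp tool: the function name read_gguf_header is hypothetical, and it assumes a little-endian file of GGUF version 2 or later, where the 4-byte magic "GGUF" is followed by a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (kv count)


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size part of a GGUF header.

    Assumption: little-endian file, GGUF version >= 2 (64-bit counts).
    """
    with open(path, "rb") as f:
        head = f.read(HEADER_SIZE)

    if len(head) < HEADER_SIZE or head[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")

    version, n_tensors, n_kv = struct.unpack("<IQQ", head[4:HEADER_SIZE])
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}


if __name__ == "__main__":
    print(read_gguf_header("./models/Meta-Llama-3-8B-Instruct.Q4_0.gguf"))
```

A truncated or HTML error page saved under a .gguf name fails the magic check immediately, which is cheaper than waiting for ./main to reject the file.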

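4.3 Download a Hugging Face model

Take the LLM-Research/Meta-Llama-3-8B-Instruct model as an example. Because the license request on Hugging Face was not approved, the model was downloaded from the ModelScope community instead.

For how to download models, see: Hugging Face和ModelScope大模型/数据集的下载加速方法 (download acceleration methods for Hugging Face and ModelScope models and datasets)

5. (Optional) Merge LoRA weights

The original LLaMA model has little Chinese comprehension ability, whereas Chinese-LLaMA-Alpaca handles Chinese well. The original LLaMA model (HF format) is therefore extended with a Chinese vocabulary and merged with the LoRA weights to produce full model weights.

For the detailed steps of merging LoRA weights, see: llama.cpp一种在本地CPU上部署的量化模型（超低配推理llama） (llama.cpp, a quantized model deployed on a local CPU)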
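As background, merging LoRA weights in the general LoRA formulation means folding the low-rank update back into each adapted weight matrix. The formula below is that general formulation, not a description of the specific merge script referenced above; the symbols (base weight W_0, low-rank factors A and B, rank r, scaling factor alpha) are the usual LoRA notation and do not appear in the original text. The vocabulary extension additionally resizes the embedding matrices, which is not shown here.

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

After the merge, the model has the same shape as a regular full-weight checkpoint, which is why it can then go through the same gguf conversion as any other HF model.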

6. Convert to gguf format

Converting HuggingFace Models to GGUF/GGML

The convert-hf-to-gguf-update.py seems doesn’t work. #7088

Tutorial: How to convert HuggingFace model to GGUF format #2948

The model formats that llama.cpp can convert from include PyTorch .pth, Hugging Face .safetensors, and the ggmlv3 format that llama.cpp used previously.

6.1 convert scripts

The convert scripts include:

root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ll | grep convert
-rwxr-xr-x 1 root root 13029 Aug 8 10:57 convert-hf-to-gguf-update.py*
-rwxr-xr-x 1 root root 127129 Aug 8 10:57 convert-hf-to-gguf.py*
-rwxr-xr-x 1 root root 18993 Aug 8 10:57 convert-llama-ggml-to-gguf.py*
-rwxr-xr-x 1 root root 2218136 Aug 8 11:02 convert-llama2c-to-ggml*
-rwxr-xr-x 1 root root 69417 Aug 8 10:57 convert.py*

Explanation

convert-hf-to-gguf-update.py: Downloads the tokenizer models of the specified models from Hugging Face and generates the get_vocab_base_pre() function for convert-hf-to-gguf.py.
convert-hf-to-gguf.py: Converts from Hugging Face format to gguf.
convert-llama-ggml-to-gguf.py: Converts from ggml format to gguf.
convert-llama2c-to-ggml: Converts from llama2.c model format to ggml.
convert.py: Converts a LLaMA model to a GGML compatible file (see 6.2).

6.2 convert.py

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# python convert.py -h
usage: convert.py [-h] [--dump] [--dump-single] [--vocab-only] [--no-vocab] [--outtype {f32,f16,q8_0}] [--vocab-dir VOCAB_DIR] [--vocab-type VOCAB_TYPE] [--outfile OUTFILE] [--ctx CTX] [--concurrency CONCURRENCY] [--big-endian] [--pad-vocab] [--skip-unknown] [--verbose] [--metadata METADATA] [--get-outfile] model

Convert a LLaMA model to a GGML compatible file

positional arguments:
  model                     directory containing model file, or model file itself (*.pth, *.pt, *.bin)

options:
  -h, --help                show this help message and exit
  --dump                    don't convert, just show what's in the model
  --dump-single             don't convert, just show what's in a single model file
  --vocab-only              extract only the vocab
  --no-vocab                store model without the vocab
  --outtype {f32,f16,q8_0}  output format - note: q8_0 may be very slow (default: f16 or f32 based on input)
  --vocab-dir VOCAB_DIR     directory containing tokenizer.model, if separate from model file
  --vocab-type VOCAB_TYPE   vocab types to try in order, choose from 'spm', 'bpe', 'hfft' (default: spm,hfft)
  --outfile OUTFILE         path to write to; default: based on input
  --ctx CTX                 model training context (default: based on input)
  --concurrency CONCURRENCY concurrency used for conversion (default: 8)
  --big-endian              model is executed on big endian machine
  --pad-vocab               add pad tokens when model vocab expects more than tokenizer metadata provides
  --skip-unknown            skip unknown tensor names instead of failing
  --verbose                 increase output verbosity
  --metadata METADATA       Specify the path for a metadata file
  --get-outfile             get calculated default outfile name

Explanation

--outtype accepts one of {f32, f16, q8_0}.
--vocab-type accepts 'spm', 'bpe', or 'hfft'.
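To get a feel for what the three --outtype values mean for file size, the arithmetic can be sketched as follows. This is a rough estimate, not something convert.py reports: it assumes every weight is stored in the chosen type (in practice some small tensors such as the norm weights stay in F32, and metadata adds a little), and it uses the Q8_0 block layout of 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 weights. The parameter count 8030261248 is the one printed in the conversion log of this model; the function name estimate_size_bytes is hypothetical.

```python
# Approximate bytes per weight for each --outtype value.
BYTES_PER_WEIGHT = {
    "f32": 4.0,
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
}


def estimate_size_bytes(n_params: int, outtype: str) -> int:
    """Rough size of the weights if every tensor used the given type."""
    return int(n_params * BYTES_PER_WEIGHT[outtype])


if __name__ == "__main__":
    n_params = 8030261248  # "model parameters count" from the conversion log
    for outtype in ("f32", "f16", "q8_0"):
        size_gb = estimate_size_bytes(n_params, outtype) / 1e9
        print(f"{outtype}: about {size_gb:.1f} GB")
```

For this model the estimate comes out at roughly 32 GB for f32, 16 GB for f16, and 8.5 GB for q8_0, which is useful when checking free disk space before converting.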

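6.3 Run the conversion

Convert the downloaded Hugging Face format model to gguf format, with FP16 as the output type.

Compared with the previous two generations, Llama-3 significantly enlarged its vocabulary, from 32K to 128K, and switched to a BPE vocabulary. The tokenizer type therefore has to be given with the --vocab-type parameter: the default is spm, so bpe must be specified explicitly.

Note: the official documentation says that convert.py does not support LLaMA 3 and recommends convert-hf-to-gguf.py instead. However, convert-hf-to-gguf.py does not accept --vocab-type and fails with: error: unrecognized arguments: --vocab-type bpe. In this test, convert.py with --vocab-type bpe ran without errors, so convert.py is used here.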
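The choice of --vocab-type can also be sketched in code. The helper below is an illustration under a simplifying assumption, not how convert.py itself decides: it treats a model directory containing tokenizer.model as spm and one containing only tokenizer.json as bpe (a tokenizer.json can also belong to the hfft type, which this heuristic ignores). The function names guess_vocab_type and build_convert_command are hypothetical; the flags are the ones listed in the convert.py help output.

```python
from pathlib import Path
from typing import List


def guess_vocab_type(model_dir: str) -> str:
    """Heuristic (assumption): tokenizer.model -> spm, tokenizer.json only -> bpe."""
    root = Path(model_dir)
    if (root / "tokenizer.model").is_file():
        return "spm"
    if (root / "tokenizer.json").is_file():
        return "bpe"
    raise FileNotFoundError(f"no tokenizer.model or tokenizer.json in {root}")


def build_convert_command(model_dir: str, outfile: str, outtype: str = "f16") -> List[str]:
    """Build the argument list for convert.py without running it."""
    return [
        "python", "convert.py", str(model_dir),
        "--outfile", str(outfile),
        "--outtype", outtype,
        "--vocab-type", guess_vocab_type(model_dir),
    ]


if __name__ == "__main__":
    cmd = build_convert_command(
        "models/Meta-Llama-3-8B-Instruct/",
        "models/ggml-vocab-llama3-8B-instruct-f16.gguf",
    )
    print(" ".join(cmd))
```

Returning the argument list instead of running it keeps the sketch side-effect free; it can be passed to subprocess.run once the printed command looks right.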

python convert.py models/Meta-Llama-3-8B-Instruct/ --outfile models/ggml-vocab-llama3-8B-instruct-f16.gguf --outtype f16 --vocab-type bpe

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# python convert.py models/Meta-Llama-3-8B-Instruct/ --outfile models/ggml-vocab-llama3-8B-instruct-f16.gguf --outtype f16 --vocab-type bpe
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00002-of-00004.safetensors
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00003-of-00004.safetensors
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00004-of-00004.safetensors
INFO:convert:model parameters count : 8030261248 (8B)
INFO:convert:params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('models/Meta-Llama-3-8B-Instruct'))
INFO:convert:Loaded vocab file PosixPath('models/Meta-Llama-3-8B-Instruct/tokenizer.json'), type 'bpe'
INFO:convert:Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
INFO:convert:Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128009}, add special tokens unset>
INFO:convert:Writing models/ggml-vocab-llama3-8B-instruct-f16.gguf, format 1
WARNING:convert:Ignoring added_tokens.json since model matches vocab size without it.
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128009
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|> '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|> ' }}{% endif %}
INFO:convert:[ 1/291] Writing tensor token_embd.weight | size 128256 x 4096 | type F16 | T+ 4
INFO:convert:[ 2/291] Writing tensor blk.0.attn_norm.weight | size 4096 | type F32 | T+ 5
INFO:convert:[ 3/291] Writing tensor blk.0.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 5
INFO:convert:[ 4/291] Writing tensor blk.0.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 5
INFO:convert:[ 5/291] Writing tensor blk.0.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 5
INFO:convert:[ 6/291] Writing tensor blk.0.ffn_norm.weight | size 4096 | type F32 | T+ 5
INFO:convert:[ 7/291] Writing tensor blk.0.attn_k.weight | size 1024 x 4096 | type F16 | T+ 5
INFO:convert:[ 8/291] Writing tensor blk.0.attn_output.weight | size 4096 x 4096 | type F16 | T+ 5
INFO:convert:[ 9/291] Writing tensor blk.0.attn_q.weight | size 4096 x 4096 | type F16 | T+ 5
INFO:convert:[ 10/291] Writing tensor blk.0.attn_v.weight | size 1024 x 4096 | type F16 | T+ 6
INFO:convert:[ 11/291] Writing tensor blk.1.attn_norm.weight | size 4096 | type F32 | T+ 6
INFO:convert:[ 12/291] Writing tensor blk.1.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 6
INFO:convert:[ 13/291] Writing tensor blk.1.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 7
INFO:convert:[ 14/291] Writing tensor blk.1.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 7
INFO:convert:[ 15/291] Writing tensor blk.1.ffn_norm.weight | size 4096 | type F32 | T+ 7
INFO:convert:[ 16/291] Writing tensor blk.1.attn_k.weight | size 1024 x 4096 | type F16 | T+ 7
INFO:convert:[ 17/291] Writing tensor blk.1.attn_output.weight | size 4096 x 4096 | type F16 | T+ 7
INFO:convert:[ 18/291] Writing tensor blk.1.attn_q.weight | size 4096 x 4096 | type F16 | T+ 7
INFO:convert:[ 19/291] Writing tensor blk.1.attn_v.weight | size 1024 x 4096 | type F16 | T+ 7
INFO:convert:[ 20/291] Writing tensor blk.2.attn_norm.weight | size 4096 | type F32 | T+ 7
INFO:convert:[ 21/291] Writing tensor blk.2.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 8
INFO:convert:[ 22/291] Writing tensor blk.2.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 8
INFO:convert:[ 23/291] Writing tensor blk.2.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 8
INFO:convert:[ 24/291] Writing tensor blk.2.ffn_norm.weight | size 4096 | type F32 | T+ 8
INFO:convert:[ 25/291] Writing tensor blk.2.attn_k.weight | size 1024 x 4096 | type F16 | T+ 8
INFO:convert:[ 26/291] Writing tensor blk.2.attn_output.weight | size 4096 x 4096 | type F16 | T+ 8
INFO:convert:[ 27/291] Writing tensor blk.2.attn_q.weight | size 4096 x 4096 | type F16 | T+ 8
INFO:convert:[ 28/291] Writing tensor blk.2.attn_v.weight | size 1024 x 4096 | type F16 | T+ 8
INFO:convert:[ 29/291] Writing tensor blk.3.attn_norm.weight | size 4096 | type F32 | T+ 8
INFO:convert:[ 30/291] Writing tensor blk.3.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 9
INFO:convert:[ 31/291] Writing tensor blk.3.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 9
INFO:convert:[ 32/291] Writing tensor blk.3.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 9
INFO:convert:[ 33/291] Writing tensor blk.3.ffn_norm.weight | size 4096 | type F32 | T+ 9
INFO:convert:[ 34/291] Writing tensor blk.3.attn_k.weight | size 1024 x 4096 | type F16 | T+ 9
INFO:convert:[ 35/291] Writing tensor blk.3.attn_output.weight | size 4096 x 4096 | type F16 | T+ 9
INFO:convert:[ 36/291] Writing tensor blk.3.attn_q.weight | size 4096 x 4096 | type F16 | T+ 9
INFO:convert:[ 37/291] Writing 
tensor blk.3.attn_v.weight | size 1024 x 4096 | type F16 | T+ 9 INFO:convert:[ 38/291] Writing tensor blk.4.attn_norm.weight | size 4096 | type F32 | T+ 9 INFO:convert:[ 39/291] Writing tensor blk.4.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 10 INFO:convert:[ 40/291] Writing tensor blk.4.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 10 INFO:convert:[ 41/291] Writing tensor blk.4.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 11 INFO:convert:[ 42/291] Writing tensor blk.4.ffn_norm.weight | size 4096 | type F32 | T+ 12 INFO:convert:[ 43/291] Writing tensor blk.4.attn_k.weight | size 1024 x 4096 | type F16 | T+ 12 INFO:convert:[ 44/291] Writing tensor blk.4.attn_output.weight | size 4096 x 4096 | type F16 | T+ 12 INFO:convert:[ 45/291] Writing tensor blk.4.attn_q.weight | size 4096 x 4096 | type F16 | T+ 12 INFO:convert:[ 46/291] Writing tensor blk.4.attn_v.weight | size 1024 x 4096 | type F16 | T+ 12 INFO:convert:[ 47/291] Writing tensor blk.5.attn_norm.weight | size 4096 | type F32 | T+ 12 INFO:convert:[ 48/291] Writing tensor blk.5.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 12 INFO:convert:[ 49/291] Writing tensor blk.5.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 12 INFO:convert:[ 50/291] Writing tensor blk.5.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 12 INFO:convert:[ 51/291] Writing tensor blk.5.ffn_norm.weight | size 4096 | type F32 | T+ 13 INFO:convert:[ 52/291] Writing tensor blk.5.attn_k.weight | size 1024 x 4096 | type F16 | T+ 13 INFO:convert:[ 53/291] Writing tensor blk.5.attn_output.weight | size 4096 x 4096 | type F16 | T+ 13 INFO:convert:[ 54/291] Writing tensor blk.5.attn_q.weight | size 4096 x 4096 | type F16 | T+ 13 INFO:convert:[ 55/291] Writing tensor blk.5.attn_v.weight | size 1024 x 4096 | type F16 | T+ 13 INFO:convert:[ 56/291] Writing tensor blk.6.attn_norm.weight | size 4096 | type F32 | T+ 13 INFO:convert:[ 57/291] Writing tensor blk.6.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 13 
INFO:convert:[ 58/291] Writing tensor blk.6.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 13 INFO:convert:[ 59/291] Writing tensor blk.6.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 14 INFO:convert:[ 60/291] Writing tensor blk.6.ffn_norm.weight | size 4096 | type F32 | T+ 14 INFO:convert:[ 61/291] Writing tensor blk.6.attn_k.weight | size 1024 x 4096 | type F16 | T+ 14 INFO:convert:[ 62/291] Writing tensor blk.6.attn_output.weight | size 4096 x 4096 | type F16 | T+ 14 INFO:convert:[ 63/291] Writing tensor blk.6.attn_q.weight | size 4096 x 4096 | type F16 | T+ 14 INFO:convert:[ 64/291] Writing tensor blk.6.attn_v.weight | size 1024 x 4096 | type F16 | T+ 14 INFO:convert:[ 65/291] Writing tensor blk.7.attn_norm.weight | size 4096 | type F32 | T+ 14 INFO:convert:[ 66/291] Writing tensor blk.7.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 14 INFO:convert:[ 67/291] Writing tensor blk.7.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 15 INFO:convert:[ 68/291] Writing tensor blk.7.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 15 INFO:convert:[ 69/291] Writing tensor blk.7.ffn_norm.weight | size 4096 | type F32 | T+ 15 INFO:convert:[ 70/291] Writing tensor blk.7.attn_k.weight | size 1024 x 4096 | type F16 | T+ 15 INFO:convert:[ 71/291] Writing tensor blk.7.attn_output.weight | size 4096 x 4096 | type F16 | T+ 15 INFO:convert:[ 72/291] Writing tensor blk.7.attn_q.weight | size 4096 x 4096 | type F16 | T+ 15 INFO:convert:[ 73/291] Writing tensor blk.7.attn_v.weight | size 1024 x 4096 | type F16 | T+ 15 INFO:convert:[ 74/291] Writing tensor blk.8.attn_norm.weight | size 4096 | type F32 | T+ 15 INFO:convert:[ 75/291] Writing tensor blk.8.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 16 INFO:convert:[ 76/291] Writing tensor blk.8.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 16 INFO:convert:[ 77/291] Writing tensor blk.8.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 16 INFO:convert:[ 78/291] Writing tensor blk.8.ffn_norm.weight | size 
4096 | type F32 | T+ 16 INFO:convert:[ 79/291] Writing tensor blk.8.attn_k.weight | size 1024 x 4096 | type F16 | T+ 16 INFO:convert:[ 80/291] Writing tensor blk.8.attn_output.weight | size 4096 x 4096 | type F16 | T+ 16 INFO:convert:[ 81/291] Writing tensor blk.8.attn_q.weight | size 4096 x 4096 | type F16 | T+ 16 INFO:convert:[ 82/291] Writing tensor blk.8.attn_v.weight | size 1024 x 4096 | type F16 | T+ 17 INFO:convert:[ 83/291] Writing tensor blk.10.attn_norm.weight | size 4096 | type F32 | T+ 17 INFO:convert:[ 84/291] Writing tensor blk.10.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 17 INFO:convert:[ 85/291] Writing tensor blk.10.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 17 INFO:convert:[ 86/291] Writing tensor blk.10.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 17 INFO:convert:[ 87/291] Writing tensor blk.10.ffn_norm.weight | size 4096 | type F32 | T+ 17 INFO:convert:[ 88/291] Writing tensor blk.10.attn_k.weight | size 1024 x 4096 | type F16 | T+ 17 INFO:convert:[ 89/291] Writing tensor blk.10.attn_output.weight | size 4096 x 4096 | type F16 | T+ 17 INFO:convert:[ 90/291] Writing tensor blk.10.attn_q.weight | size 4096 x 4096 | type F16 | T+ 18 INFO:convert:[ 91/291] Writing tensor blk.10.attn_v.weight | size 1024 x 4096 | type F16 | T+ 18 INFO:convert:[ 92/291] Writing tensor blk.11.attn_norm.weight | size 4096 | type F32 | T+ 18 INFO:convert:[ 93/291] Writing tensor blk.11.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 18 INFO:convert:[ 94/291] Writing tensor blk.11.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 18 INFO:convert:[ 95/291] Writing tensor blk.11.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 18 INFO:convert:[ 96/291] Writing tensor blk.11.ffn_norm.weight | size 4096 | type F32 | T+ 18 INFO:convert:[ 97/291] Writing tensor blk.11.attn_k.weight | size 1024 x 4096 | type F16 | T+ 18 INFO:convert:[ 98/291] Writing tensor blk.11.attn_output.weight | size 4096 x 4096 | type F16 | T+ 18 INFO:convert:[ 99/291] 
Writing tensor blk.11.attn_q.weight | size 4096 x 4096 | type F16 | T+ 19 INFO:convert:[100/291] Writing tensor blk.11.attn_v.weight | size 1024 x 4096 | type F16 | T+ 19 INFO:convert:[101/291] Writing tensor blk.12.attn_norm.weight | size 4096 | type F32 | T+ 19 INFO:convert:[102/291] Writing tensor blk.12.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 20 INFO:convert:[103/291] Writing tensor blk.12.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 20 INFO:convert:[104/291] Writing tensor blk.12.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 20 INFO:convert:[105/291] Writing tensor blk.12.ffn_norm.weight | size 4096 | type F32 | T+ 20 INFO:convert:[106/291] Writing tensor blk.12.attn_k.weight | size 1024 x 4096 | type F16 | T+ 20 INFO:convert:[107/291] Writing tensor blk.12.attn_output.weight | size 4096 x 4096 | type F16 | T+ 20 INFO:convert:[108/291] Writing tensor blk.12.attn_q.weight | size 4096 x 4096 | type F16 | T+ 20 INFO:convert:[109/291] Writing tensor blk.12.attn_v.weight | size 1024 x 4096 | type F16 | T+ 20 INFO:convert:[110/291] Writing tensor blk.13.attn_norm.weight | size 4096 | type F32 | T+ 20 INFO:convert:[111/291] Writing tensor blk.13.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 21 INFO:convert:[112/291] Writing tensor blk.13.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 21 INFO:convert:[113/291] Writing tensor blk.13.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 21 INFO:convert:[114/291] Writing tensor blk.13.ffn_norm.weight | size 4096 | type F32 | T+ 21 INFO:convert:[115/291] Writing tensor blk.13.attn_k.weight | size 1024 x 4096 | type F16 | T+ 22 INFO:convert:[116/291] Writing tensor blk.13.attn_output.weight | size 4096 x 4096 | type F16 | T+ 22 INFO:convert:[117/291] Writing tensor blk.13.attn_q.weight | size 4096 x 4096 | type F16 | T+ 22 INFO:convert:[118/291] Writing tensor blk.13.attn_v.weight | size 1024 x 4096 | type F16 | T+ 22 INFO:convert:[119/291] Writing tensor blk.14.attn_norm.weight | size 4096 | 
type F32 | T+ 22 INFO:convert:[120/291] Writing tensor blk.14.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 22 INFO:convert:[121/291] Writing tensor blk.14.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 22 INFO:convert:[122/291] Writing tensor blk.14.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 22 INFO:convert:[123/291] Writing tensor blk.14.ffn_norm.weight | size 4096 | type F32 | T+ 22 INFO:convert:[124/291] Writing tensor blk.14.attn_k.weight | size 1024 x 4096 | type F16 | T+ 22 INFO:convert:[125/291] Writing tensor blk.14.attn_output.weight | size 4096 x 4096 | type F16 | T+ 22 INFO:convert:[126/291] Writing tensor blk.14.attn_q.weight | size 4096 x 4096 | type F16 | T+ 22 INFO:convert:[127/291] Writing tensor blk.14.attn_v.weight | size 1024 x 4096 | type F16 | T+ 23 INFO:convert:[128/291] Writing tensor blk.15.attn_norm.weight | size 4096 | type F32 | T+ 23 INFO:convert:[129/291] Writing tensor blk.15.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 23 INFO:convert:[130/291] Writing tensor blk.15.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 23 INFO:convert:[131/291] Writing tensor blk.15.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 23 INFO:convert:[132/291] Writing tensor blk.15.ffn_norm.weight | size 4096 | type F32 | T+ 24 INFO:convert:[133/291] Writing tensor blk.15.attn_k.weight | size 1024 x 4096 | type F16 | T+ 24 INFO:convert:[134/291] Writing tensor blk.15.attn_output.weight | size 4096 x 4096 | type F16 | T+ 24 INFO:convert:[135/291] Writing tensor blk.15.attn_q.weight | size 4096 x 4096 | type F16 | T+ 24 INFO:convert:[136/291] Writing tensor blk.15.attn_v.weight | size 1024 x 4096 | type F16 | T+ 24 INFO:convert:[137/291] Writing tensor blk.16.attn_norm.weight | size 4096 | type F32 | T+ 24 INFO:convert:[138/291] Writing tensor blk.16.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 24 INFO:convert:[139/291] Writing tensor blk.16.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 24 INFO:convert:[140/291] Writing 
tensor blk.16.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 25 INFO:convert:[141/291] Writing tensor blk.16.ffn_norm.weight | size 4096 | type F32 | T+ 25 INFO:convert:[142/291] Writing tensor blk.16.attn_k.weight | size 1024 x 4096 | type F16 | T+ 25 INFO:convert:[143/291] Writing tensor blk.16.attn_output.weight | size 4096 x 4096 | type F16 | T+ 25 INFO:convert:[144/291] Writing tensor blk.16.attn_q.weight | size 4096 x 4096 | type F16 | T+ 25 INFO:convert:[145/291] Writing tensor blk.16.attn_v.weight | size 1024 x 4096 | type F16 | T+ 25 INFO:convert:[146/291] Writing tensor blk.17.attn_norm.weight | size 4096 | type F32 | T+ 25 INFO:convert:[147/291] Writing tensor blk.17.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 25 INFO:convert:[148/291] Writing tensor blk.17.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 26 INFO:convert:[149/291] Writing tensor blk.17.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 26 INFO:convert:[150/291] Writing tensor blk.17.ffn_norm.weight | size 4096 | type F32 | T+ 26 INFO:convert:[151/291] Writing tensor blk.17.attn_k.weight | size 1024 x 4096 | type F16 | T+ 26 INFO:convert:[152/291] Writing tensor blk.17.attn_output.weight | size 4096 x 4096 | type F16 | T+ 26 INFO:convert:[153/291] Writing tensor blk.17.attn_q.weight | size 4096 x 4096 | type F16 | T+ 26 INFO:convert:[154/291] Writing tensor blk.17.attn_v.weight | size 1024 x 4096 | type F16 | T+ 26 INFO:convert:[155/291] Writing tensor blk.18.attn_norm.weight | size 4096 | type F32 | T+ 26 INFO:convert:[156/291] Writing tensor blk.18.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 26 INFO:convert:[157/291] Writing tensor blk.18.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 27 INFO:convert:[158/291] Writing tensor blk.18.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 27 INFO:convert:[159/291] Writing tensor blk.18.ffn_norm.weight | size 4096 | type F32 | T+ 27 INFO:convert:[160/291] Writing tensor blk.18.attn_k.weight | size 1024 x 4096 | type F16 
| T+ 27 INFO:convert:[161/291] Writing tensor blk.18.attn_output.weight | size 4096 x 4096 | type F16 | T+ 27 INFO:convert:[162/291] Writing tensor blk.18.attn_q.weight | size 4096 x 4096 | type F16 | T+ 27 INFO:convert:[163/291] Writing tensor blk.18.attn_v.weight | size 1024 x 4096 | type F16 | T+ 27 INFO:convert:[164/291] Writing tensor blk.19.attn_norm.weight | size 4096 | type F32 | T+ 27 INFO:convert:[165/291] Writing tensor blk.19.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 28 INFO:convert:[166/291] Writing tensor blk.19.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 28 INFO:convert:[167/291] Writing tensor blk.19.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 28 INFO:convert:[168/291] Writing tensor blk.19.ffn_norm.weight | size 4096 | type F32 | T+ 28 INFO:convert:[169/291] Writing tensor blk.19.attn_k.weight | size 1024 x 4096 | type F16 | T+ 28 INFO:convert:[170/291] Writing tensor blk.19.attn_output.weight | size 4096 x 4096 | type F16 | T+ 28 INFO:convert:[171/291] Writing tensor blk.19.attn_q.weight | size 4096 x 4096 | type F16 | T+ 29 INFO:convert:[172/291] Writing tensor blk.19.attn_v.weight | size 1024 x 4096 | type F16 | T+ 29 INFO:convert:[173/291] Writing tensor blk.20.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 29 INFO:convert:[174/291] Writing tensor blk.20.attn_k.weight | size 1024 x 4096 | type F16 | T+ 29 INFO:convert:[175/291] Writing tensor blk.20.attn_output.weight | size 4096 x 4096 | type F16 | T+ 29 INFO:convert:[176/291] Writing tensor blk.20.attn_q.weight | size 4096 x 4096 | type F16 | T+ 29 INFO:convert:[177/291] Writing tensor blk.20.attn_v.weight | size 1024 x 4096 | type F16 | T+ 29 INFO:convert:[178/291] Writing tensor blk.9.attn_norm.weight | size 4096 | type F32 | T+ 29 INFO:convert:[179/291] Writing tensor blk.9.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 30 INFO:convert:[180/291] Writing tensor blk.9.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 30 INFO:convert:[181/291] Writing 
tensor blk.9.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 30 INFO:convert:[182/291] Writing tensor blk.9.ffn_norm.weight | size 4096 | type F32 | T+ 30 INFO:convert:[183/291] Writing tensor blk.9.attn_k.weight | size 1024 x 4096 | type F16 | T+ 30 INFO:convert:[184/291] Writing tensor blk.9.attn_output.weight | size 4096 x 4096 | type F16 | T+ 30 INFO:convert:[185/291] Writing tensor blk.9.attn_q.weight | size 4096 x 4096 | type F16 | T+ 30 INFO:convert:[186/291] Writing tensor blk.9.attn_v.weight | size 1024 x 4096 | type F16 | T+ 30 INFO:convert:[187/291] Writing tensor blk.20.attn_norm.weight | size 4096 | type F32 | T+ 30 INFO:convert:[188/291] Writing tensor blk.20.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 31 INFO:convert:[189/291] Writing tensor blk.20.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 31 INFO:convert:[190/291] Writing tensor blk.20.ffn_norm.weight | size 4096 | type F32 | T+ 31 INFO:convert:[191/291] Writing tensor blk.21.attn_norm.weight | size 4096 | type F32 | T+ 31 INFO:convert:[192/291] Writing tensor blk.21.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 31 INFO:convert:[193/291] Writing tensor blk.21.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 31 INFO:convert:[194/291] Writing tensor blk.21.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 32 INFO:convert:[195/291] Writing tensor blk.21.ffn_norm.weight | size 4096 | type F32 | T+ 32 INFO:convert:[196/291] Writing tensor blk.21.attn_k.weight | size 1024 x 4096 | type F16 | T+ 32 INFO:convert:[197/291] Writing tensor blk.21.attn_output.weight | size 4096 x 4096 | type F16 | T+ 32 INFO:convert:[198/291] Writing tensor blk.21.attn_q.weight | size 4096 x 4096 | type F16 | T+ 32 INFO:convert:[199/291] Writing tensor blk.21.attn_v.weight | size 1024 x 4096 | type F16 | T+ 32 INFO:convert:[200/291] Writing tensor blk.22.attn_norm.weight | size 4096 | type F32 | T+ 32 INFO:convert:[201/291] Writing tensor blk.22.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 33 
INFO:convert:[202/291] Writing tensor blk.22.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 33 INFO:convert:[203/291] Writing tensor blk.22.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 33 INFO:convert:[204/291] Writing tensor blk.22.ffn_norm.weight | size 4096 | type F32 | T+ 33 INFO:convert:[205/291] Writing tensor blk.22.attn_k.weight | size 1024 x 4096 | type F16 | T+ 33 INFO:convert:[206/291] Writing tensor blk.22.attn_output.weight | size 4096 x 4096 | type F16 | T+ 33 INFO:convert:[207/291] Writing tensor blk.22.attn_q.weight | size 4096 x 4096 | type F16 | T+ 33 INFO:convert:[208/291] Writing tensor blk.22.attn_v.weight | size 1024 x 4096 | type F16 | T+ 33 INFO:convert:[209/291] Writing tensor blk.23.attn_norm.weight | size 4096 | type F32 | T+ 33 INFO:convert:[210/291] Writing tensor blk.23.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 33 INFO:convert:[211/291] Writing tensor blk.23.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 34 INFO:convert:[212/291] Writing tensor blk.23.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 34 INFO:convert:[213/291] Writing tensor blk.23.ffn_norm.weight | size 4096 | type F32 | T+ 34 INFO:convert:[214/291] Writing tensor blk.23.attn_k.weight | size 1024 x 4096 | type F16 | T+ 34 INFO:convert:[215/291] Writing tensor blk.23.attn_output.weight | size 4096 x 4096 | type F16 | T+ 34 INFO:convert:[216/291] Writing tensor blk.23.attn_q.weight | size 4096 x 4096 | type F16 | T+ 34 INFO:convert:[217/291] Writing tensor blk.23.attn_v.weight | size 1024 x 4096 | type F16 | T+ 34 INFO:convert:[218/291] Writing tensor blk.24.attn_norm.weight | size 4096 | type F32 | T+ 34 INFO:convert:[219/291] Writing tensor blk.24.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 35 INFO:convert:[220/291] Writing tensor blk.24.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 35 INFO:convert:[221/291] Writing tensor blk.24.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 35 INFO:convert:[222/291] Writing tensor 
blk.24.ffn_norm.weight | size 4096 | type F32 | T+ 35 INFO:convert:[223/291] Writing tensor blk.24.attn_k.weight | size 1024 x 4096 | type F16 | T+ 35 INFO:convert:[224/291] Writing tensor blk.24.attn_output.weight | size 4096 x 4096 | type F16 | T+ 35 INFO:convert:[225/291] Writing tensor blk.24.attn_q.weight | size 4096 x 4096 | type F16 | T+ 35 INFO:convert:[226/291] Writing tensor blk.24.attn_v.weight | size 1024 x 4096 | type F16 | T+ 35 INFO:convert:[227/291] Writing tensor blk.25.attn_norm.weight | size 4096 | type F32 | T+ 35 INFO:convert:[228/291] Writing tensor blk.25.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 36 INFO:convert:[229/291] Writing tensor blk.25.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 36 INFO:convert:[230/291] Writing tensor blk.25.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 36 INFO:convert:[231/291] Writing tensor blk.25.ffn_norm.weight | size 4096 | type F32 | T+ 37 INFO:convert:[232/291] Writing tensor blk.25.attn_k.weight | size 1024 x 4096 | type F16 | T+ 37 INFO:convert:[233/291] Writing tensor blk.25.attn_output.weight | size 4096 x 4096 | type F16 | T+ 37 INFO:convert:[234/291] Writing tensor blk.25.attn_q.weight | size 4096 x 4096 | type F16 | T+ 37 INFO:convert:[235/291] Writing tensor blk.25.attn_v.weight | size 1024 x 4096 | type F16 | T+ 37 INFO:convert:[236/291] Writing tensor blk.26.attn_norm.weight | size 4096 | type F32 | T+ 37 INFO:convert:[237/291] Writing tensor blk.26.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 37 INFO:convert:[238/291] Writing tensor blk.26.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 37 INFO:convert:[239/291] Writing tensor blk.26.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 37 INFO:convert:[240/291] Writing tensor blk.26.ffn_norm.weight | size 4096 | type F32 | T+ 38 INFO:convert:[241/291] Writing tensor blk.26.attn_k.weight | size 1024 x 4096 | type F16 | T+ 38 INFO:convert:[242/291] Writing tensor blk.26.attn_output.weight | size 4096 x 4096 | type F16 | 
T+ 38 INFO:convert:[243/291] Writing tensor blk.26.attn_q.weight | size 4096 x 4096 | type F16 | T+ 38 INFO:convert:[244/291] Writing tensor blk.26.attn_v.weight | size 1024 x 4096 | type F16 | T+ 38 INFO:convert:[245/291] Writing tensor blk.27.attn_norm.weight | size 4096 | type F32 | T+ 38 INFO:convert:[246/291] Writing tensor blk.27.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 39 INFO:convert:[247/291] Writing tensor blk.27.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 39 INFO:convert:[248/291] Writing tensor blk.27.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 39 INFO:convert:[249/291] Writing tensor blk.27.ffn_norm.weight | size 4096 | type F32 | T+ 39 INFO:convert:[250/291] Writing tensor blk.27.attn_k.weight | size 1024 x 4096 | type F16 | T+ 39 INFO:convert:[251/291] Writing tensor blk.27.attn_output.weight | size 4096 x 4096 | type F16 | T+ 39 INFO:convert:[252/291] Writing tensor blk.27.attn_q.weight | size 4096 x 4096 | type F16 | T+ 39 INFO:convert:[253/291] Writing tensor blk.27.attn_v.weight | size 1024 x 4096 | type F16 | T+ 39 INFO:convert:[254/291] Writing tensor blk.28.attn_norm.weight | size 4096 | type F32 | T+ 39 INFO:convert:[255/291] Writing tensor blk.28.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 40 INFO:convert:[256/291] Writing tensor blk.28.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 40 INFO:convert:[257/291] Writing tensor blk.28.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 40 INFO:convert:[258/291] Writing tensor blk.28.ffn_norm.weight | size 4096 | type F32 | T+ 40 INFO:convert:[259/291] Writing tensor blk.28.attn_k.weight | size 1024 x 4096 | type F16 | T+ 40 INFO:convert:[260/291] Writing tensor blk.28.attn_output.weight | size 4096 x 4096 | type F16 | T+ 40 INFO:convert:[261/291] Writing tensor blk.28.attn_q.weight | size 4096 x 4096 | type F16 | T+ 40 INFO:convert:[262/291] Writing tensor blk.28.attn_v.weight | size 1024 x 4096 | type F16 | T+ 40 INFO:convert:[263/291] Writing tensor 
blk.29.attn_norm.weight | size 4096 | type F32 | T+ 40 INFO:convert:[264/291] Writing tensor blk.29.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 41 INFO:convert:[265/291] Writing tensor blk.29.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 41 INFO:convert:[266/291] Writing tensor blk.29.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 41 INFO:convert:[267/291] Writing tensor blk.29.ffn_norm.weight | size 4096 | type F32 | T+ 41 INFO:convert:[268/291] Writing tensor blk.29.attn_k.weight | size 1024 x 4096 | type F16 | T+ 41 INFO:convert:[269/291] Writing tensor blk.29.attn_output.weight | size 4096 x 4096 | type F16 | T+ 41 INFO:convert:[270/291] Writing tensor blk.29.attn_q.weight | size 4096 x 4096 | type F16 | T+ 41 INFO:convert:[271/291] Writing tensor blk.29.attn_v.weight | size 1024 x 4096 | type F16 | T+ 41 INFO:convert:[272/291] Writing tensor blk.30.attn_norm.weight | size 4096 | type F32 | T+ 41 INFO:convert:[273/291] Writing tensor blk.30.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 42 INFO:convert:[274/291] Writing tensor blk.30.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 42 INFO:convert:[275/291] Writing tensor blk.30.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 43 INFO:convert:[276/291] Writing tensor blk.30.ffn_norm.weight | size 4096 | type F32 | T+ 43 INFO:convert:[277/291] Writing tensor blk.30.attn_k.weight | size 1024 x 4096 | type F16 | T+ 43 INFO:convert:[278/291] Writing tensor blk.30.attn_output.weight | size 4096 x 4096 | type F16 | T+ 43 INFO:convert:[279/291] Writing tensor blk.30.attn_q.weight | size 4096 x 4096 | type F16 | T+ 43 INFO:convert:[280/291] Writing tensor blk.30.attn_v.weight | size 1024 x 4096 | type F16 | T+ 43 INFO:convert:[281/291] Writing tensor blk.31.ffn_gate.weight | size 14336 x 4096 | type F16 | T+ 43 INFO:convert:[282/291] Writing tensor blk.31.ffn_up.weight | size 14336 x 4096 | type F16 | T+ 43 INFO:convert:[283/291] Writing tensor blk.31.attn_k.weight | size 1024 x 4096 | type 
F16 | T+ 43 INFO:convert:[284/291] Writing tensor blk.31.attn_output.weight | size 4096 x 4096 | type F16 | T+ 43 INFO:convert:[285/291] Writing tensor blk.31.attn_q.weight | size 4096 x 4096 | type F16 | T+ 43 INFO:convert:[286/291] Writing tensor blk.31.attn_v.weight | size 1024 x 4096 | type F16 | T+ 44 INFO:convert:[287/291] Writing tensor output.weight | size 128256 x 4096 | type F16 | T+ 48 INFO:convert:[288/291] Writing tensor blk.31.attn_norm.weight | size 4096 | type F32 | T+ 49 INFO:convert:[289/291] Writing tensor blk.31.ffn_down.weight | size 4096 x 14336 | type F16 | T+ 49 INFO:convert:[290/291] Writing tensor blk.31.ffn_norm.weight | size 4096 | type F32 | T+ 49 INFO:convert:[291/291] Writing tensor output_norm.weight | size 4096 | type F32 | T+ 49 INFO:convert:Wrote models/ggml-vocab-llama3-8B-instruct-f16.gguf
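The conversion log above can be sanity-checked with a little arithmetic. The sketch below rebuilds the tensor list from the shapes printed in the log and the `Params(...)` line, then recomputes the tensor count and the parameter count. The variable and function names are illustrative only and are not part of llama.cpp.

```python
import math

# Hyperparameters as printed in the Params(...) line of the conversion log.
N_VOCAB, N_EMBD, N_LAYER, N_FF = 128256, 4096, 32, 14336
N_HEAD, N_HEAD_KV = 32, 8

# Width of the K/V projections under grouped-query attention (1024 in the log).
KV_DIM = N_EMBD // N_HEAD * N_HEAD_KV

# Tensors written once per transformer block (blk.N.*), with the shapes from the log.
per_block = {
    "attn_norm": (N_EMBD,),
    "ffn_down": (N_EMBD, N_FF),
    "ffn_gate": (N_FF, N_EMBD),
    "ffn_up": (N_FF, N_EMBD),
    "ffn_norm": (N_EMBD,),
    "attn_k": (KV_DIM, N_EMBD),
    "attn_output": (N_EMBD, N_EMBD),
    "attn_q": (N_EMBD, N_EMBD),
    "attn_v": (KV_DIM, N_EMBD),
}

# Tensors that appear only once in the file.
top_level = {
    "token_embd": (N_VOCAB, N_EMBD),
    "output": (N_VOCAB, N_EMBD),
    "output_norm": (N_EMBD,),
}

tensor_count = N_LAYER * len(per_block) + len(top_level)
param_count = (
    N_LAYER * sum(math.prod(s) for s in per_block.values())
    + sum(math.prod(s) for s in top_level.values())
)
# In the log, the 1-D norm tensors are the ones written as F32.
f32_count = (
    N_LAYER * sum(1 for s in per_block.values() if len(s) == 1)
    + sum(1 for s in top_level.values() if len(s) == 1)
)

print(tensor_count)  # 291, matching the [291/291] counter
print(param_count)   # 8030261248, matching "model parameters count"
print(f32_count)     # 65 norm tensors kept in F32
```

If the counts do not match for another model, the most likely cause is a different per-block tensor layout, so the `per_block` table would need to be adjusted.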

7. Quantizing the model

quantize

Quantization of LLMs with llama.cpp

Llama.cpp量化简明手册 (A Concise Guide to llama.cpp Quantization)

7.1 Listing the quantization types

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ./quantize -h
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
  --keep-split: will generate quatized model in the same shards as input
  --override-kv KEY=TYPE:VALUE
      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
   2  or  Q4_0    :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1    :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0    :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1    :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5 bpw quantization
  29  or  IQ2_M   :  2.7 bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  10  or  Q2_K    :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  21  or  Q2_K_S  :  2.16G, +9.0634 ppl @ LLaMA-v1-7B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    :  alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M  :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L  :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    :  alias for Q4_K_M
  14  or  Q4_K_S  :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M  :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K    :  alias for Q5_K_M
  16  or  Q5_K_S  :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M  :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K    :  5.15G, +0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0    :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16     : 14.00G, -0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G @ 7B
          COPY    : only copy tensors, no quantizing
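One way to read the listing above is as a size versus quality trade-off table. The following sketch copies the entries that report both a size and a perplexity increase, then picks the type with the smallest perplexity increase that fits a given size budget. The helper `pick_quant_type` is hypothetical and is not part of llama.cpp. The figures were measured on LLaMA-v1-7B, so for Llama3-8B they are only a rough guide.

```python
# (size in G, perplexity increase) copied from the `./quantize -h` listing.
# Only types that report both numbers for LLaMA-v1-7B are included.
QUANT_TABLE = {
    "Q4_0": (3.56, 0.2166),
    "Q4_1": (3.90, 0.1585),
    "Q5_0": (4.33, 0.0683),
    "Q5_1": (4.70, 0.0349),
    "Q2_K": (2.63, 0.6717),
    "Q2_K_S": (2.16, 9.0634),
    "Q3_K_S": (2.75, 0.5551),
    "Q3_K_M": (3.07, 0.2496),
    "Q3_K_L": (3.35, 0.1764),
    "Q4_K_S": (3.59, 0.0992),
    "Q4_K_M": (3.80, 0.0532),
    "Q5_K_S": (4.33, 0.0400),
    "Q5_K_M": (4.45, 0.0122),
    "Q6_K": (5.15, 0.0008),
    "Q8_0": (6.70, 0.0004),
}


def pick_quant_type(budget_g):
    """Return the listed type with the lowest ppl increase whose size fits the budget.

    Hypothetical helper for illustration; returns None if nothing fits.
    """
    candidates = [
        (ppl, name)
        for name, (size, ppl) in QUANT_TABLE.items()
        if size <= budget_g
    ]
    return min(candidates)[1] if candidates else None


print(pick_quant_type(4.0))  # Q4_K_M
print(pick_quant_type(2.0))  # None: even Q2_K_S needs 2.16G
```

In this table Q4_K_M is both smaller than Q4_1 and closer to the original perplexity, which is why the K-quant variants are often preferred over the legacy `_0`/`_1` types at the same bit width.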

Explanation

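The quantize tool converts a model to any of several bit widths: Q2, Q3, Q4, Q5, Q6, Q8 and F16. Quantized model names follow the pattern Q + bit width + variant; for example, Q4_K_M is the 4-bit K-quant in its medium (M) variant. The fewer bits used, the lower the hardware requirements, but the lower the model's accuracy as well. In the listing above, Q4_0 takes 3.56G with a perplexity increase of +0.2166 on LLaMA-v1-7B, while Q8_0 takes 6.70G with only +0.0004.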
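The naming pattern described above (Q + bit width + variant) can be sketched as a small parser. This is an illustrative helper, not a llama.cpp API, and the bit width it extracts is only the nominal one in the name: the listing shows, for instance, that IQ1_S actually uses 1.56 bits per weight.

```python
import re

# Q or IQ prefix, nominal bit width, then the variant suffix (0, 1, K, K_M, XXS, ...).
_QUANT_NAME = re.compile(r"(I?Q)(\d+)_([0-9A-Z_]+)")


def parse_quant_name(name):
    """Split a quantization type name into family, nominal bits and variant.

    Returns None for names that do not follow the pattern (F16, BF16, F32, COPY).
    """
    match = _QUANT_NAME.fullmatch(name)
    if match is None:
        return None
    family, bits, variant = match.groups()
    return {"family": family, "bits": int(bits), "variant": variant}


for name in ["Q4_0", "Q4_K_M", "IQ2_XXS", "Q8_0", "F16"]:
    print(name, parse_quant_name(name))
```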

7.2 Running the quantization

Quantize the FP16 model to 4 bits.

./quantize models/ggml-vocab-llama3-8B-instruct-f16.gguf models/ggml-vocab-llama3-8B-instruct-q4_0.gguf Q4_0
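The log produced by this command reports a size change for every converted tensor, for example `112.00 MiB -> 31.50 MiB`. These numbers can be reproduced by hand. The sketch below assumes the Q4_0 layout used by ggml, which the article itself does not spell out: weights are grouped into blocks of 32, and each block stores one 2-byte FP16 scale plus 32 four-bit values (16 bytes), for 18 bytes per block or 4.5 bits per weight. The function names are illustrative.

```python
import math

MIB = 1024 * 1024

# Assumed Q4_0 block layout: 32 weights share one FP16 scale.
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 2 + Q4_0_BLOCK_WEIGHTS // 2  # 2-byte scale + 16 bytes of nibbles


def f16_size_mib(shape):
    """Size of a tensor stored as FP16 (2 bytes per weight)."""
    return math.prod(shape) * 2 / MIB


def q4_0_size_mib(shape):
    """Size of the same tensor stored as Q4_0 blocks."""
    n = math.prod(shape)
    if n % Q4_0_BLOCK_WEIGHTS != 0:
        raise ValueError("element count must be a multiple of the block size")
    return n // Q4_0_BLOCK_WEIGHTS * Q4_0_BLOCK_BYTES / MIB


# Shapes taken from the quantization log.
tensors = {
    "token_embd.weight": (4096, 128256),
    "blk.0.ffn_down.weight": (14336, 4096),
    "blk.0.attn_q.weight": (4096, 4096),
    "blk.0.attn_k.weight": (4096, 1024),
}

for name, shape in tensors.items():
    print(f"{name}: {f16_size_mib(shape):.2f} MiB -> {q4_0_size_mib(shape):.2f} MiB")
```

Each quantized tensor shrinks to 18/64 (about 28%) of its FP16 size, while the small norm tensors stay in F32 and are left untouched.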

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ./quantize models/ggml-vocab-llama3-8B-instruct-f16.gguf models/ggml-vocab-llama3-8B-instruct-q4_0.gguf Q4_0
main: build = 3045 (59b0d077)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: quantizing 'models/ggml-vocab-llama3-8B-instruct-f16.gguf' to 'models/ggml-vocab-llama3-8B-instruct-q4_0.gguf' as Q4_0
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from models/ggml-vocab-llama3-8B-instruct-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.vocab_size u32 = 128256
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 1
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - type f32: 65 tensors llama_model_loader: - type f16: 226 tensors [ 1/ 291] token_embd.weight - [ 4096, 128256, 1, 1], type = f16, converting to q4_0 .. size = 1002.00 MiB -> 281.81 MiB [ 2/ 291] blk.0.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 3/ 291] blk.0.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 4/ 291] blk.0.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 5/ 291] blk.0.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 6/ 291] blk.0.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 7/ 291] blk.0.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 8/ 291] blk.0.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 9/ 291] blk.0.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 10/ 291] blk.0.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 11/ 291] blk.1.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 12/ 291] blk.1.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 13/ 291] blk.1.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 14/ 291] blk.1.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. 
size = 112.00 MiB -> 31.50 MiB [ 15/ 291] blk.1.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 16/ 291] blk.1.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 17/ 291] blk.1.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 18/ 291] blk.1.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 19/ 291] blk.1.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 20/ 291] blk.2.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 21/ 291] blk.2.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 22/ 291] blk.2.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 23/ 291] blk.2.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 24/ 291] blk.2.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 25/ 291] blk.2.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 26/ 291] blk.2.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 27/ 291] blk.2.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 28/ 291] blk.2.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 29/ 291] blk.3.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 30/ 291] blk.3.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 31/ 291] blk.3.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 32/ 291] blk.3.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. 
size = 112.00 MiB -> 31.50 MiB [ 33/ 291] blk.3.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 34/ 291] blk.3.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 35/ 291] blk.3.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 36/ 291] blk.3.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 37/ 291] blk.3.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 38/ 291] blk.4.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 39/ 291] blk.4.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 40/ 291] blk.4.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 41/ 291] blk.4.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 42/ 291] blk.4.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 43/ 291] blk.4.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 44/ 291] blk.4.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 45/ 291] blk.4.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 46/ 291] blk.4.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 47/ 291] blk.5.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 48/ 291] blk.5.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 49/ 291] blk.5.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 50/ 291] blk.5.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. 
size = 112.00 MiB -> 31.50 MiB [ 51/ 291] blk.5.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 52/ 291] blk.5.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 53/ 291] blk.5.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 54/ 291] blk.5.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 55/ 291] blk.5.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 56/ 291] blk.6.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 57/ 291] blk.6.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 58/ 291] blk.6.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 59/ 291] blk.6.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 60/ 291] blk.6.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 61/ 291] blk.6.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 62/ 291] blk.6.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 63/ 291] blk.6.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 64/ 291] blk.6.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 65/ 291] blk.7.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 66/ 291] blk.7.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 67/ 291] blk.7.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 68/ 291] blk.7.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. 
size = 112.00 MiB -> 31.50 MiB [ 69/ 291] blk.7.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 70/ 291] blk.7.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 71/ 291] blk.7.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 72/ 291] blk.7.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 73/ 291] blk.7.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 74/ 291] blk.8.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 75/ 291] blk.8.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 76/ 291] blk.8.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 77/ 291] blk.8.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 78/ 291] blk.8.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 79/ 291] blk.8.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 80/ 291] blk.8.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 81/ 291] blk.8.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 82/ 291] blk.8.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 83/ 291] blk.10.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 84/ 291] blk.10.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 85/ 291] blk.10.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 86/ 291] blk.10.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. 
size = 112.00 MiB -> 31.50 MiB [ 87/ 291] blk.10.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 88/ 291] blk.10.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 89/ 291] blk.10.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 90/ 291] blk.10.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 91/ 291] blk.10.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 92/ 291] blk.11.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 93/ 291] blk.11.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 94/ 291] blk.11.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 95/ 291] blk.11.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 96/ 291] blk.11.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 97/ 291] blk.11.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 98/ 291] blk.11.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 99/ 291] blk.11.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 100/ 291] blk.11.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 101/ 291] blk.12.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 102/ 291] blk.12.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 103/ 291] blk.12.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. 
size = 112.00 MiB -> 31.50 MiB [ 104/ 291] blk.12.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 105/ 291] blk.12.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 106/ 291] blk.12.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 107/ 291] blk.12.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 108/ 291] blk.12.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 109/ 291] blk.12.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 110/ 291] blk.13.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 111/ 291] blk.13.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 112/ 291] blk.13.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 113/ 291] blk.13.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 114/ 291] blk.13.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 115/ 291] blk.13.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 116/ 291] blk.13.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 117/ 291] blk.13.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 118/ 291] blk.13.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 119/ 291] blk.14.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 120/ 291] blk.14.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. 
size = 112.00 MiB -> 31.50 MiB [ 121/ 291] blk.14.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 122/ 291] blk.14.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 123/ 291] blk.14.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 124/ 291] blk.14.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 125/ 291] blk.14.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 126/ 291] blk.14.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 127/ 291] blk.14.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 128/ 291] blk.15.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 129/ 291] blk.15.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 130/ 291] blk.15.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 131/ 291] blk.15.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 132/ 291] blk.15.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 133/ 291] blk.15.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 134/ 291] blk.15.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 135/ 291] blk.15.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 136/ 291] blk.15.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. 
size = 8.00 MiB -> 2.25 MiB [ 137/ 291] blk.16.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 138/ 291] blk.16.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 139/ 291] blk.16.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 140/ 291] blk.16.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 141/ 291] blk.16.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 142/ 291] blk.16.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 143/ 291] blk.16.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 144/ 291] blk.16.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 145/ 291] blk.16.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 146/ 291] blk.17.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 147/ 291] blk.17.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 148/ 291] blk.17.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 149/ 291] blk.17.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 150/ 291] blk.17.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 151/ 291] blk.17.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 152/ 291] blk.17.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 153/ 291] blk.17.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. 
size = 32.00 MiB -> 9.00 MiB [ 154/ 291] blk.17.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 155/ 291] blk.18.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 156/ 291] blk.18.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 157/ 291] blk.18.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 158/ 291] blk.18.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 159/ 291] blk.18.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 160/ 291] blk.18.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 161/ 291] blk.18.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 162/ 291] blk.18.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 163/ 291] blk.18.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 164/ 291] blk.19.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 165/ 291] blk.19.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 166/ 291] blk.19.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 167/ 291] blk.19.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 168/ 291] blk.19.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 169/ 291] blk.19.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 170/ 291] blk.19.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. 
size = 32.00 MiB -> 9.00 MiB [ 171/ 291] blk.19.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 172/ 291] blk.19.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 173/ 291] blk.20.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 174/ 291] blk.20.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 175/ 291] blk.20.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 176/ 291] blk.20.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 177/ 291] blk.20.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 178/ 291] blk.9.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 179/ 291] blk.9.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 180/ 291] blk.9.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 181/ 291] blk.9.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 182/ 291] blk.9.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 183/ 291] blk.9.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 184/ 291] blk.9.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 185/ 291] blk.9.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 186/ 291] blk.9.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. 
size = 8.00 MiB -> 2.25 MiB [ 187/ 291] blk.20.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 188/ 291] blk.20.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 189/ 291] blk.20.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 190/ 291] blk.20.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 191/ 291] blk.21.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 192/ 291] blk.21.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 193/ 291] blk.21.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 194/ 291] blk.21.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 195/ 291] blk.21.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 196/ 291] blk.21.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 197/ 291] blk.21.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 198/ 291] blk.21.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 199/ 291] blk.21.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 200/ 291] blk.22.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 201/ 291] blk.22.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 202/ 291] blk.22.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 203/ 291] blk.22.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. 
size = 112.00 MiB -> 31.50 MiB [ 204/ 291] blk.22.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 205/ 291] blk.22.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 206/ 291] blk.22.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 207/ 291] blk.22.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 208/ 291] blk.22.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 209/ 291] blk.23.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 210/ 291] blk.23.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 211/ 291] blk.23.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 212/ 291] blk.23.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 213/ 291] blk.23.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 214/ 291] blk.23.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 215/ 291] blk.23.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 216/ 291] blk.23.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 217/ 291] blk.23.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 218/ 291] blk.24.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 219/ 291] blk.24.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 220/ 291] blk.24.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. 
size = 112.00 MiB -> 31.50 MiB [ 221/ 291] blk.24.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 222/ 291] blk.24.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 223/ 291] blk.24.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 224/ 291] blk.24.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 225/ 291] blk.24.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 226/ 291] blk.24.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 227/ 291] blk.25.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 228/ 291] blk.25.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 229/ 291] blk.25.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 230/ 291] blk.25.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 231/ 291] blk.25.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 232/ 291] blk.25.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 233/ 291] blk.25.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 234/ 291] blk.25.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 235/ 291] blk.25.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 236/ 291] blk.26.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 237/ 291] blk.26.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. 
size = 112.00 MiB -> 31.50 MiB [ 238/ 291] blk.26.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 239/ 291] blk.26.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 240/ 291] blk.26.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 241/ 291] blk.26.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 242/ 291] blk.26.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 243/ 291] blk.26.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 244/ 291] blk.26.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 245/ 291] blk.27.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 246/ 291] blk.27.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 247/ 291] blk.27.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 248/ 291] blk.27.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 249/ 291] blk.27.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 250/ 291] blk.27.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 251/ 291] blk.27.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 252/ 291] blk.27.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 253/ 291] blk.27.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. 
size = 8.00 MiB -> 2.25 MiB [ 254/ 291] blk.28.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 255/ 291] blk.28.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 256/ 291] blk.28.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 257/ 291] blk.28.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 258/ 291] blk.28.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 259/ 291] blk.28.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 260/ 291] blk.28.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 261/ 291] blk.28.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 262/ 291] blk.28.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 263/ 291] blk.29.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 264/ 291] blk.29.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 265/ 291] blk.29.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 266/ 291] blk.29.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 267/ 291] blk.29.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 268/ 291] blk.29.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 269/ 291] blk.29.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 270/ 291] blk.29.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. 
size = 32.00 MiB -> 9.00 MiB [ 271/ 291] blk.29.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 272/ 291] blk.30.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 273/ 291] blk.30.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 274/ 291] blk.30.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 275/ 291] blk.30.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 276/ 291] blk.30.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 277/ 291] blk.30.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 278/ 291] blk.30.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 279/ 291] blk.30.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 280/ 291] blk.30.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 281/ 291] blk.31.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 282/ 291] blk.31.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 283/ 291] blk.31.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 284/ 291] blk.31.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 285/ 291] blk.31.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q4_0 .. size = 32.00 MiB -> 9.00 MiB [ 286/ 291] blk.31.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q4_0 .. size = 8.00 MiB -> 2.25 MiB [ 287/ 291] output.weight - [ 4096, 128256, 1, 1], type = f16, converting to q6_K .. 
size = 1002.00 MiB -> 410.98 MiB [ 288/ 291] blk.31.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 289/ 291] blk.31.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q4_0 .. size = 112.00 MiB -> 31.50 MiB [ 290/ 291] blk.31.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB [ 291/ 291] output_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB llama_model_quantize_internal: model size = 15317.02 MB llama_model_quantize_internal: quant size = 4437.80 MB main: quantize time = 61476.12 ms main: total time = 61476.12 ms

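After Q4_0 quantization, the model size drops from 15317.02 MB to 4437.80 MB, at the cost of weight precision: the large weight matrices go from 16-bit floating point to 4-bit integers. As the log shows, not every tensor is treated the same way: output.weight is converted to q6_K, and the norm tensors stay in f32.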
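The totals in the log can be reproduced from the tensor shapes alone. The sketch below is a rough check, not part of llama.cpp. It assumes the block layouts as I understand them from ggml: Q4_0 packs 32 weights into 18 bytes (4.5 bits per weight) and Q6_K packs 256 weights into 210 bytes (6.5625 bits per weight). The helper names `tensor_mib` and `model_mib` are made up for this example. The numbers only match when sizes are computed in units of 1024 × 1024 bytes, so the "MB" in the log summary appears to mean MiB.

```python
# (weights per block, bytes per block) for the tensor types seen in the log.
# Assumption: these match ggml's Q4_0 and Q6_K block layouts.
BLOCK_LAYOUT = {
    "f32": (1, 4),
    "f16": (1, 2),
    "q4_0": (32, 18),    # one 2-byte f16 scale + 16 bytes of packed 4-bit codes
    "q6_K": (256, 210),  # 128 + 64 bytes of codes, 16 scale bytes, 2-byte f16 scale
}

MIB = 1024 * 1024


def tensor_mib(shape, fmt):
    """Size in MiB of a tensor with the given shape stored as `fmt`."""
    n_weights = 1
    for dim in shape:
        n_weights *= dim
    per_block, block_bytes = BLOCK_LAYOUT[fmt]
    return n_weights // per_block * block_bytes / MIB


def model_mib(n_layers, matrix_fmt, output_fmt):
    """Total size in MiB of the 291 tensors listed in the quantize log."""
    per_layer = (
        3 * tensor_mib((4096, 14336), matrix_fmt)   # ffn_gate, ffn_up, ffn_down
        + 2 * tensor_mib((4096, 4096), matrix_fmt)  # attn_q, attn_output
        + 2 * tensor_mib((4096, 1024), matrix_fmt)  # attn_k, attn_v
        + 2 * tensor_mib((4096,), "f32")            # attn_norm, ffn_norm
    )
    return (
        n_layers * per_layer
        + tensor_mib((4096, 128256), matrix_fmt)    # token_embd.weight
        + tensor_mib((4096, 128256), output_fmt)    # output.weight
        + tensor_mib((4096,), "f32")                # output_norm.weight
    )


f16_total = model_mib(32, "f16", "f16")
q4_total = model_mib(32, "q4_0", "q6_K")

print(f"f16 model:  {f16_total:.2f} MiB")   # log: model size = 15317.02 MB
print(f"Q4_0 model: {q4_total:.2f} MiB")    # log: quant size = 4437.80 MB
print(f"compression ratio: {f16_total / q4_total:.2f}x")
```

The same arithmetic explains the per-tensor lines: a 4096 × 14336 matrix takes 112.00 MiB in f16 and 112.00 × 4.5 / 16 = 31.50 MiB in Q4_0.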

For a more detailed tutorial, see: https://github.com/ggerganov/llama.cpp#quantization
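For a rough intuition of what the 16-bit to 4-bit step does to the weights, here is a simplified pure-Python sketch of block-wise 4-bit quantization. It is modeled on my reading of ggml's reference Q4_0 routine and is not the actual implementation: the function names are hypothetical, and the scale is kept as a Python float, whereas ggml stores it as f16. Each block of 32 weights shares one scale, and every weight becomes a 4-bit code in the range 0 to 15.

```python
import random

QK = 32  # weights per block (assumed Q4_0 block size)


def quantize_block(values):
    """Quantize 32 floats to one shared scale plus 16 bytes of 4-bit codes."""
    assert len(values) == QK
    # The value with the largest magnitude (sign kept) is mapped to code 0.
    extreme = max(values, key=abs)
    scale = extreme / -8.0
    inv = 1.0 / scale if scale != 0.0 else 0.0
    # v * inv lies in [-8, 8]; adding 8.5 and truncating rounds to a code,
    # and the single overflow case (16) is clamped to 15.
    codes = [min(15, int(v * inv + 8.5)) for v in values]
    # Pack element j into the low nibble and element j + 16 into the high nibble.
    half = QK // 2
    packed = bytes(codes[j] | (codes[j + half] << 4) for j in range(half))
    return scale, packed


def dequantize_block(scale, packed):
    """Recover approximate floats from the scale and the packed codes."""
    low = [(b & 0x0F) - 8 for b in packed]
    high = [(b >> 4) - 8 for b in packed]
    return [q * scale for q in low + high]


random.seed(0)
block = [random.uniform(-1.0, 1.0) for _ in range(QK)]
scale, packed = quantize_block(block)
restored = dequantize_block(scale, packed)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(f"bytes for codes: {len(packed)} (plus 2 bytes for an f16 scale)")
print(f"scale: {scale:.6f}, max round-trip error: {max_error:.6f}")
```

In this sketch the round-trip error of any weight is bounded by the magnitude of the block's scale, which is why blocks containing one large outlier lose more precision on their small weights.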

8. Model inference

8.1 The main command

llama.cpp/examples/main

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ./main -h
usage: ./main [options]

options:
  -h, --help
        show this help message and exit
  --version
        show version and build info
  -i, --interactive
        run in interactive mode
  --special
        special tokens output enabled
  --interactive-specials
        allow special tokens in user text, in interactive mode
  --interactive-first
        run in interactive mode and wait for input right away
  -cnv, --conversation
        run in conversation mode (does not print special tokens and suffix/prefix)
  -ins, --instruct
        run in instruction mode (use with Alpaca models)
  -cml, --chatml
        run in chatml mode (use with ChatML-compatible models)
  --multiline-input
        allows you to write or paste multiple lines without ending each in '\'
  -r PROMPT, --reverse-prompt PROMPT
        halt generation at PROMPT, return control in interactive mode
        (can be specified more than once for multiple prompts).
  --color
        colorise output to distinguish prompt and user input from generations
  -s SEED, --seed SEED
        RNG seed (default: -1, use random seed for < 0)
  -t N, --threads N
        number of threads to use during generation (default: 128)
  -tb N, --threads-batch N
        number of threads to use during batch and prompt processing (default: same as --threads)
  -td N, --threads-draft N
        number of threads to use during generation (default: same as --threads)
  -tbd N, --threads-batch-draft N
        number of threads to use during batch and prompt processing (default: same as --threads-draft)
  -p PROMPT, --prompt PROMPT
        prompt to start generation with (default: empty)
  -e, --escape
        process prompt escapes sequences (\n, \r, \t, \', \", \\)
  --prompt-cache FNAME
        file to cache prompt state for faster startup (default: none)
  --prompt-cache-all
        if specified, saves user input and generations to cache as well.
        not supported with --interactive or other interactive options
  --prompt-cache-ro
        if specified, uses the prompt cache but does not update it.
  --random-prompt
        start with a randomized prompt.
  --in-prefix-bos
        prefix BOS to user inputs, preceding the `--in-prefix` string
  --in-prefix STRING
        string to prefix user inputs with (default: empty)
  --in-suffix STRING
        string to suffix after user inputs with (default: empty)
  -f FNAME, --file FNAME
        prompt file to start generation.
  -bf FNAME, --binary-file FNAME
        binary file containing multiple choice tasks.
  -n N, --n-predict N
        number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
  -c N, --ctx-size N
        size of the prompt context (default: 512, 0 = loaded from model)
  -b N, --batch-size N
        logical maximum batch size (default: 2048)
  -ub N, --ubatch-size N
        physical maximum batch size (default: 512)
  --samplers
        samplers that will be used for generation in the order, separated by ';'
        (default: top_k;tfs_z;typical_p;top_p;min_p;temperature)
  --sampling-seq
        simplified sequence for samplers that will be used (default: kfypmt)
  --top-k N
        top-k sampling (default: 40, 0 = disabled)
  --top-p N
        top-p sampling (default: 0.9, 1.0 = disabled)
  --min-p N
        min-p sampling (default: 0.1, 0.0 = disabled)
  --tfs N
        tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
  --typical N
        locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
  --repeat-last-n N
        last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
  --repeat-penalty N
        penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
  --presence-penalty N
        repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
  --frequency-penalty N
        repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
  --dynatemp-range N
        dynamic temperature range (default: 0.0, 0.0 = disabled)
  --dynatemp-exp N
        dynamic temperature exponent (default: 1.0)
  --mirostat N
        use Mirostat sampling.
        Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.
        (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
  --mirostat-lr N
        Mirostat learning rate, parameter eta (default: 0.1)
  --mirostat-ent N
        Mirostat target entropy, parameter tau (default: 5.0)
  -l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS
        modifies the likelihood of token appearing in the completion,
        i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
        or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
  --grammar GRAMMAR
        BNF-like grammar to constrain generations (see samples in grammars/ dir)
  --grammar-file FNAME
        file to read grammar from
  -j SCHEMA, --json-schema SCHEMA
        JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object.
        For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead
  --cfg-negative-prompt PROMPT
        negative prompt to use for guidance. (default: empty)
  --cfg-negative-prompt-file FNAME
        negative prompt file to use for guidance. (default: empty)
  --cfg-scale N
        strength of guidance (default: 1.000000, 1.0 = disable)
  --rope-scaling {none,linear,yarn}
        RoPE frequency scaling method, defaults to linear unless specified by the model
  --rope-scale N
        RoPE context scaling factor, expands context by a factor of N
  --rope-freq-base N
        RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
  --rope-freq-scale N
        RoPE frequency scaling factor, expands context by a factor of 1/N
  --yarn-orig-ctx N
        YaRN: original context size of model (default: 0 = model training context size)
  --yarn-ext-factor N
        YaRN: extrapolation mix factor (default: 1.0, 0.0 = full interpolation)
  --yarn-attn-factor N
        YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
  --yarn-beta-slow N
        YaRN: high correction dim or alpha (default: 1.0)
  --yarn-beta-fast N
        YaRN: low correction dim or beta (default: 32.0)
  --pooling {none,mean,cls}
        pooling type for embeddings, use model default if unspecified
  -dt N, --defrag-thold N
        KV cache defragmentation threshold (default: -1.0, < 0 - disabled)
  --ignore-eos
        ignore end of stream token and continue generating (implies --logit-bias 2-inf)
  --penalize-nl
        penalize newline tokens
  --temp N
        temperature (default: 0.8)
  --all-logits
        return logits for all tokens in the batch (default: disabled)
  --hellaswag
        compute HellaSwag score over random tasks from datafile supplied with -f
  --hellaswag-tasks N
        number of tasks to use when computing the HellaSwag score (default: 400)
  --winogrande
        compute Winogrande score over random tasks from datafile supplied with -f
  --winogrande-tasks N
        number of tasks to use when computing the Winogrande score (default: 0)
  --multiple-choice
        compute multiple choice score over random tasks from datafile supplied with -f
  --multiple-choice-tasks N
        number of tasks to use when computing the multiple choice score (default: 0)
  --kl-divergence
        computes KL-divergence to logits provided via --kl-divergence-base
  --keep N
        number of tokens to keep from the initial prompt (default: 0, -1 = all)
  --draft N
        number of tokens to draft for speculative decoding (default: 5)
  --chunks N
        max number of chunks to process (default: -1, -1 = all)
  -np N, --parallel N
        number of parallel sequences to decode (default: 1)
  -ns N, --sequences N
        number of sequences to decode (default: 1)
  -ps N, --p-split N
        speculative decoding split probability (default: 0.1)
  -cb, --cont-batching
        enable continuous batching (a.k.a dynamic batching) (default: disabled)
  -fa, --flash-attn
        enable Flash Attention (default: disabled)
  --mmproj MMPROJ_FILE
        path to a multimodal projector file for LLaVA. see examples/llava/README.md
  --image IMAGE_FILE
        path to an image file. use with multimodal models. Specify multiple times for batching
  --mlock
        force system to keep model in RAM rather than swapping or compressing
  --no-mmap
        do not memory-map model (slower load but may reduce pageouts if not using mlock)
  --numa TYPE
        attempt optimizations that help on some NUMA systems
          - distribute: spread execution evenly over all nodes
          - isolate: only spawn threads on CPUs on the node that execution started on
          - numactl: use the CPU map provided by numactl
        if run without this previously, it is recommended to drop the system page cache before using this
        see https://github.com/ggerganov/llama.cpp/issues/1437
  --rpc SERVERS
        comma separated list of RPC servers
  --verbose-prompt
        print a verbose prompt before generation (default: false)
  --no-display-prompt
        don't print prompt at generation (default: false)
  -gan N, --grp-attn-n N
        group-attention factor (default: 1)
  -gaw N, --grp-attn-w N
        group-attention width (default: 512.0)
  -dkvc, --dump-kv-cache
        verbose print of the KV cache
  -nkvo, --no-kv-offload
        disable KV offload
  -ctk TYPE, --cache-type-k TYPE
        KV cache data type for K (default: f16)
  -ctv TYPE, --cache-type-v TYPE
        KV cache data type for V (default: f16)
  --simple-io
        use basic IO for better compatibility in subprocesses and limited consoles
  --lora FNAME
        apply LoRA adapter (implies --no-mmap)
  --lora-scaled FNAME S
        apply LoRA adapter with user defined scaling S (implies --no-mmap)
  --lora-base FNAME
        optional model to use as a base for the layers modified by the LoRA adapter
  --control-vector FNAME
        add a control vector
  --control-vector-scaled FNAME S
        add a control vector with user defined scaling S
  --control-vector-layer-range START END
        layer range to apply the control vector(s) to, start and end inclusive
  -m FNAME, --model FNAME
        model path (default: models/$filename with filename from --hf-file or --model-url if set,
        otherwise models/7B/ggml-model-f16.gguf)
  -md FNAME, --model-draft FNAME
        draft model for speculative decoding (default: unused)
  -mu MODEL_URL, --model-url MODEL_URL
        model download url (default: unused)
  -hfr REPO, --hf-repo REPO
        Hugging Face model repository (default: unused)
  -hff FILE, --hf-file FILE
        Hugging Face model file (default: unused)
  -ld LOGDIR, --logdir LOGDIR
        path under which to save YAML logs (no logging if unset)
  -lcs FNAME, --lookup-cache-static FNAME
        path to static lookup cache to use for lookup decoding (not updated by generation)
  -lcd FNAME, --lookup-cache-dynamic FNAME
        path to dynamic lookup cache to use for lookup decoding (updated by generation)
  --override-kv KEY=TYPE:VALUE
        advanced option to override model metadata by key. may be specified multiple times.
        types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false
  -ptc N, --print-token-count N
        print token count every N tokens (default: -1)
  --check-tensors
        check model tensor data for invalid values

log options:
  --log-test
        Run simple logging test
  --log-disable
        Disable trace logs
  --log-enable
        Enable trace logs
  --log-file
        Specify a log filename (without extension)
  --log-new
        Create a separate new log file on start. Each log file will have unique name: "<name>.<ID>.log"

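Parameter descriptions

| Option | Description |
| --- | --- |
| -m | Path to the LLaMA model file |
| -mu | Remote HTTP URL from which to download the model file |
| -i | Run in interactive mode |
| -ins | Run in instruction mode, a ChatGPT-like conversational mode |
| -f | Prompt template file; for Alpaca models, load the prompts/alpaca.txt instruction template |
| -n | Maximum length of the generated reply (default: -1, meaning unlimited) |
| -c | Size of the prompt context; larger values let the model refer to a longer conversation history (default: 512) |
| -b | Batch size (default: 2048) |
| -t | Number of threads (default: 128) |
| --repeat-penalty | Strength of the penalty applied to repeated text in the generated reply |
| --temp | Temperature; lower values make the reply less random |
| --top-p, --top-k | Sampling parameters used during decoding |
| --color | Distinguish user input from generated text |

The help output spells these options with hyphens (--repeat-penalty, --top-p, --top-k). The commands in 8.2 and 8.3 pass --repeat_penalty and --n_gpu_layers with underscores, and the logs there show both values taking effect.

For the more detailed official documentation, see: https://github.com/ggerganov/llama.cpp/tree/master/examples/main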
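As a rough sketch of how the options described above fit together, the helper below assembles a ./main command line from a model path, context size, reply length, and temperature, and prints it without running it. The function name build_main_cmd is hypothetical and not part of llama.cpp; the flags themselves are taken from the help output above.

```shell
# Hypothetical helper (not part of llama.cpp): print a ./main command line
# built from a few of the options listed above. It does not run the command.
build_main_cmd() {
    model=$1        # -m: path to the GGUF model file
    ctx=$2          # -c: size of the prompt context
    n_predict=$3    # -n: maximum number of tokens to generate
    temp=$4         # --temp: sampling temperature
    printf './main -m %s --color -c %s -n %s --temp %s --repeat-penalty 1.1\n' \
        "$model" "$ctx" "$n_predict" "$temp"
}

# Dry run with the values used later in this section
build_main_cmd models/ggml-vocab-llama3-8B-instruct-q4_0.gguf 2048 256 0.2
```

Printing the command first makes it easy to inspect the arguments before starting a long-running inference job.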

8.2 CPU inference

After CPU inference starts, the program appears to hang while CPU utilization stays above 90%.

# Run inference in instruction mode
./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1
Log start
main: build = 3045 (59b0d077)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed = 1723114741
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/ggml-vocab-llama3-8B-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.vocab_size u32 = 128256
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens cache size = 256.
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: CPU buffer size = 4437.80 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 128 / 255 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
Reverse prompt: '### Instruction: '
sampling:
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.200
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 2048, n_predict = 256, n_keep = 19

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

Below is an instruction that describes a task. Write a response that appropriately completes the request.

> hi
Hello! I'm happy to help. Please provide more context or clarify what you would like me to assist you with, and I'll do my best to respond accordingly.

>

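Type your prompt after the > prompt. Press Ctrl+C to interrupt generation, and end a line with \ to continue the input on the next line. For help and option descriptions, run ./main -h.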
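The KV buffer size reported in the CPU log can be checked by hand. The sketch below assumes that the K cache and the V cache each hold n_layer × n_ctx × n_embd_k_gqa elements and that an f16 element takes 2 bytes. This layout is an assumption, not something the log states, although it reproduces the logged numbers. The function name kv_cache_mib is hypothetical.

```shell
# kv_cache_mib N_LAYER N_CTX N_EMBD_KV BYTES_PER_ELEM
# Hypothetical helper: estimates the combined K + V cache size in MiB,
# assuming each cache stores N_LAYER * N_CTX * N_EMBD_KV elements.
kv_cache_mib() {
    k_bytes=$(( $1 * $2 * $3 * $4 ))       # size of the K cache alone, in bytes
    echo $(( 2 * k_bytes / 1024 / 1024 ))  # K and V together, converted to MiB
}

# Values from the log: n_layer = 32, n_ctx = 2048, n_embd_k_gqa = 1024, f16 = 2 bytes
kv_cache_mib 32 2048 1024 2   # prints 256
```

The result matches the "KV self size = 256.00 MiB" line, with 128 MiB each for K and V. Under the same assumption, doubling -c would double the cache.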

8.3 Inference on the domestic heterogeneous accelerator card

Use -ngl N or --n-gpu-layers N to set the number of model layers to load onto the GPU.

# Select the GPU
export HIP_VISIBLE_DEVICES="0"
# Set the GFX version
export HSA_OVERRIDE_GFX_VERSION=9.2.8
# Run inference in instruction mode
./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1 --n_gpu_layers 40 --no-mmap

# Or, as a single command
export HSA_OVERRIDE_GFX_VERSION=9.2.8 && export HIP_VISIBLE_DEVICES=0 && ./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1 --n_gpu_layers 40 --no-mmap

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# export HSA_OVERRIDE_GFX_VERSION=9.2.8 && export HIP_VISIBLE_DEVICES=0 && ./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1 --n_gpu_layers 40 --no-mmap
Log start
main: build = 3045 (59b0d077)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed = 1723178798
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/ggml-vocab-llama3-8B-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.vocab_size u32 = 128256
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens cache size = 256.
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: DCU K100_AI, compute capability 9.2, VMM: no
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4155.99 MiB
llm_load_tensors: ROCm_Host buffer size = 281.81 MiB
.............................
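Two details of the GPU log can be checked with a little arithmetic. The command asked for 40 GPU layers, but the log reports 33/33 offloaded: 32 repeating layers plus the non-repeating layers. The log also shows the ROCm0 and ROCm_Host buffers, which together add up to the 4437.80 MiB CPU buffer from section 8.2. The sketch below assumes that the requested layer count is capped at the number of repeating layers plus one, which is inferred from the log lines rather than from documented behaviour. Both function names are hypothetical.

```shell
# offloaded_layers REQUESTED N_REPEATING
# Hypothetical helper: caps the requested -ngl value at N_REPEATING + 1
# (an assumption inferred from the "offloaded 33/33 layers" log line).
offloaded_layers() {
    total=$(( $2 + 1 ))   # repeating layers plus one for the non-repeating layers
    if [ "$1" -gt "$total" ]; then
        echo "$total"
    else
        echo "$1"
    fi
}

# buffer_total_mib DEVICE_MIB HOST_MIB
# Hypothetical helper: adds the device and host buffer sizes from the log.
buffer_total_mib() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a + b }'
}

offloaded_layers 40 32            # prints 33
buffer_total_mib 4155.99 281.81   # prints 4437.80
```

With all 33 layers offloaded, almost all of the weights sit in the ROCm0 buffer, and only 281.81 MiB stays in host memory.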

Summary
