Relevant:
Since LLaMA leaked via torrent, it has been converted to the Hugging Face weight format and quantized to 8-bit to lower the VRAM requirements.
A few days ago it was also quantized to 4-bit, and 3-bit is coming. The quantization method used is from the GPTQ paper ( https://arxiv.org/abs/2210.17323 ), which leads to almost no quality degradation compared to the 16-bit weights.
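To give a feel for what storing a weight in 4 bits means, here is a minimal sketch of plain round-to-nearest quantization. This is not GPTQ: GPTQ additionally adjusts the remaining weights to compensate for rounding error, which is why it loses so little quality. The function names are made up for illustration.

```python
def quantize_rtn(weights, bits=4):
    """Round-to-nearest quantization of a list of floats (NOT GPTQ)."""
    levels = 2 ** bits - 1                # 15 steps for 4-bit codes 0..15
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0     # fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]

weights = [-1.0, -0.3, 0.0, 0.4, 1.0]
codes, scale, lo = quantize_rtn(weights)
restored = dequantize(codes, scale, lo)
print(codes)
print(restored)
```

Each weight is stored as a small integer plus a shared scale and offset, so the per-weight rounding error is at most about half a step.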
4-bit weights:
Model, weight size, VRAM required
LLaMA-7B, 3.5GB, 6GB
LLaMA-13B, 6.5GB, 10GB
LLaMA-30B, 15.8GB, 20GB
LLaMA-65B, 31.2GB, 40GB
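As a rough sanity check on the weight sizes, you can estimate them as parameters times bits divided by 8. The sketch below assumes the parameter counts reported in the LLaMA paper (6.7B, 13.0B, 32.5B, 65.2B) and ignores quantization metadata such as scales and any layers left unquantized, so it will only roughly match the figures in the table.

```python
# Assumed parameter counts in billions, as reported in the LLaMA paper.
PARAMS_BILLION = {
    "LLaMA-7B": 6.7,
    "LLaMA-13B": 13.0,
    "LLaMA-30B": 32.5,
    "LLaMA-65B": 65.2,
}

def weight_size_gb(params_billion, bits):
    """Approximate weight size in GB (10^9 bytes): params * bits / 8."""
    return params_billion * bits / 8

for name, params in PARAMS_BILLION.items():
    size16 = weight_size_gb(params, 16)
    size4 = weight_size_gb(params, 4)
    print(f"{name}: 16-bit ~{size16:.1f} GB, 4-bit ~{size4:.1f} GB")
```

The VRAM requirement is higher than the weight size because activations and the attention cache also need memory at inference time.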
Here is a good overall guide for Linux and Windows:
https://rentry.org/llama-tard-v2#bonus-4-4bit-llama-basic-se...
I also wrote a guide on how to get the bitsandbytes library working on Windows:
https://github.com/oobabooga/text-generation-webui/issues/14...