Relevant: Since LLaMA leaked via torrent, it has been converted to Hugging Face weights and quantized to 8-bit to lower the VRAM requirements.

A few days ago it was also quantized to 4-bit, and 3-bit is coming. The quantization method used is from the GPTQ paper ( https://arxiv.org/abs/2210.17323 ), which leads to almost no quality degradation compared to the 16-bit weights.
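To give a feel for what storing weights in 4 bits means, here is a minimal sketch of plain round-to-nearest quantization. This is not the GPTQ algorithm, which picks the rounding far more carefully to limit the error; the function names and sample weights are made up for illustration.

```python
def quantize_4bit(weights):
    """Map floats to integer codes 0..15 using one scale and offset for the group."""
    lo, hi = min(weights), max(weights)
    # 4 bits give 16 levels, so 15 steps; fall back to 1.0 if all weights are equal.
    scale = (hi - lo) / 15 or 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale + lo for c in codes]


# Hypothetical weights, chosen only to show the round trip.
weights = [-0.42, -0.10, 0.03, 0.27, 0.58]
codes, scale, lo = quantize_4bit(weights)
restored = dequantize(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, max_error)
```

Each weight ends up at most half a step (scale / 2) away from its original value, which is the error that smarter methods like GPTQ try to make less harmful.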

4-bit weights:

Model, weight size, VRAM required

LLaMA-7B, 3.5GB, 6GB

LLaMA-13B, 6.5GB, 10GB

LLaMA-30B, 15.8GB, 20GB

LLaMA-65B, 31.2GB, 40GB
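The weight sizes can be roughly sanity-checked with parameters × bits / 8. A minimal sketch, assuming the nominal parameter counts implied by the model names (the real counts differ somewhat, and quantized files carry some extra metadata, so the listed sizes will not match exactly):

```python
# Nominal parameter counts in billions, read off the model names (an assumption).
NOMINAL_PARAMS_BILLIONS = {
    "LLaMA-7B": 7,
    "LLaMA-13B": 13,
    "LLaMA-30B": 30,
    "LLaMA-65B": 65,
}


def weight_size_gb(params_billions, bits):
    """Approximate weight size in GB: parameters * bits per weight / 8 bits per byte."""
    return params_billions * bits / 8


for name, params in NOMINAL_PARAMS_BILLIONS.items():
    fp16 = weight_size_gb(params, 16)
    int4 = weight_size_gb(params, 4)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

The VRAM column is higher than the weight size because inference also needs memory for activations and the context cache.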

Here is a good overall guide for Linux and Windows:

https://rentry.org/llama-tard-v2#bonus-4-4bit-llama-basic-se...

I also wrote a guide on how to get the bitsandbytes library working on Windows:

https://github.com/oobabooga/text-generation-webui/issues/14...