Loading the model: 26 GB?
I was trying to load the model to integrate it with LlamaIndex, but does running it really use 26 GB of VRAM? Is there a way to reduce that?
Thanks!
The model would likely need to be quantized to use less memory. You can probably load it as-is with the --load-in-8bit flag in text-generation-webui. (The 8-bit feature is provided by the bitsandbytes Python package.)
To take it down further, it can be quantized to 4-bit; there's another discussion thread here that covers that.
For 8-bit you can run the model in its current form. For 4-bit you'll have to run a quantization step yourself, which takes a while but is entirely doable on a local machine.
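To give a feel for what 8-bit loading does under the hood, here is a toy sketch of absmax quantization, the core idea behind bitsandbytes' int8 support. This is a simplification for illustration only: the real LLM.int8() method quantizes vector-wise and keeps outlier features in FP16, which this sketch does not attempt.

```python
# Toy absmax 8-bit quantization round-trip (illustrative sketch only;
# bitsandbytes' LLM.int8() is vector-wise with outlier handling).

def quantize_absmax(weights):
    """Map floats onto int8 codes in [-127, 127], scaled by the absolute max."""
    scale = max(abs(w) for w in weights) / 127.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)     # int8 codes, using 1 byte each instead of 4 (FP32)
print(max_err)   # bounded by half the scale step
```

Each weight now costs one byte instead of four, at the price of a small rounding error per value; that is the memory/accuracy trade the --load-in-8bit flag makes for you.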
I bet this model was released as FP32 instead of FP16.
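Rough arithmetic supports that guess. A "7B" LLaMA model actually has about 6.7 billion parameters, so the weights alone land near 25 GB in FP32, which matches the reported 26 GB once runtime overhead is added; FP16 would halve that, and 8-bit or 4-bit quantization shrink it further:

```python
# Approximate VRAM needed just for the weights of a "7B" LLaMA model
# (about 6.7 billion parameters), at different precisions.
PARAMS = 6.7e9

bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{fmt}: ~{gib:.1f} GB")
```

These figures exclude activations, the KV cache, and framework overhead, so real usage runs a few GB higher than the weight totals.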