Model Card for Model ID
Llama3-8B-1.58 Models
The Llama3-8B-1.58 models are large language models fine-tuned on the BitNet 1.58b architecture, starting from the base model Llama-3-8B-Instruct.
For a deeper dive into the methods and results, check out our blog post.
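As a rough illustration of what "1.58 bits" means, the sketch below applies absmean ternary quantization in the style described in the BitNet b1.58 paper: weights are divided by their mean absolute value, rounded, and clipped to {-1, 0, 1}. The function name `absmean_ternary_quantize` and the `eps` default are assumptions made for this example; this is plain Python for readability, not the kernels used by the model.

```python
# Illustrative sketch only: absmean ternary quantization on a flat list of weights.
# The function name and eps value are hypothetical, chosen for this example.
def absmean_ternary_quantize(weights, eps=1e-5):
    """Map float weights to values in {-1, 0, 1} and return the scale used."""
    # The scale is the mean absolute weight; eps guards against division by zero.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


weights = [0.9, -0.05, 0.4, -1.2, 0.02]
ternary, scale = absmean_ternary_quantize(weights)
print(ternary)  # [1, 0, 1, -1, 0]
```

Each weight then takes one of three values, which is where the log2(3) ≈ 1.58 bits per weight figure comes from.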
Model Details
Model Sources
- Repository: Model
- Paper: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
How to Get Started with the Model
You can load and test the model in Transformers by following the steps below.
Start by installing the Transformers version that includes the configuration needed to load BitNet models:
pip install git+https://github.com/huggingface/transformers.git@refs/pull/33410/head
Then load the model and generate a short answer:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 1.58-bit model on the GPU; the tokenizer comes from the base Llama 3 model.
model = AutoModelForCausalLM.from_pretrained("HF1BitLLM/Llama3-8B-1.58-100B-tokens", device_map="cuda", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
input_text = "Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:"
input_ids = tokenizer.encode(input_text, return_tensors="pt").cuda()
# Greedy decoding; max_new_tokens limits only the generated continuation, not the prompt.
output = model.generate(input_ids, max_new_tokens=10, do_sample=False)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
Training Details
Training Data
The model was trained on a subset of FineWeb-edu.
Training Process
Starting Point
- Best-performing checkpoint from the 10 billion token runs with a linear lambda scheduler
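The card does not define the linear lambda scheduler. One plausible reading, sketched below, is a blend factor that ramps linearly from 0 to 1 so that quantization is introduced gradually during fine-tuning. The names `linear_lambda`, `blend`, and `ramp_steps` are hypothetical and the ramp length is an assumption, not a value from the training run.

```python
# Minimal sketch, assuming lambda is a blend factor between full-precision and
# quantized values. All names and the ramp length are hypothetical.
def linear_lambda(step, ramp_steps):
    """Ramp linearly from 0 to 1 over ramp_steps, then stay at 1."""
    return min(step / ramp_steps, 1.0)


def blend(full_precision, quantized, lam):
    """lam = 0 returns the full-precision value, lam = 1 the quantized value."""
    return full_precision + lam * (quantized - full_precision)


# Halfway through a 1,000-step ramp the two values are mixed equally.
lam = linear_lambda(500, 1000)
mixed = blend(0.4, 1.0, lam)
print(lam)
```

After the ramp completes, `linear_lambda` returns 1.0 and `blend` returns the quantized value unchanged.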
Training Duration
- Fine-tuned for an additional 45,000 steps
- Reached a total of 100 billion tokens
Dataset
- FineWeb-edu dataset
Batch Size
- 2 million tokens per step
- Total per run: 45,000 steps * 2 million tokens = 90 billion tokens
- Combined with initial 10 billion tokens to reach 100 billion
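The token budget above can be checked with a few lines of arithmetic. The variable names are chosen for this example; the figures are the ones stated in the list.

```python
# Worked check of the token budget, using the figures stated above.
steps = 45_000                    # additional fine-tuning steps
tokens_per_step = 2_000_000       # batch size in tokens
initial_tokens = 10_000_000_000   # tokens seen in the earlier 10B run

additional_tokens = steps * tokens_per_step
total_tokens = initial_tokens + additional_tokens

print(f"additional: {additional_tokens:,}")  # additional: 90,000,000,000
print(f"total: {total_tokens:,}")            # total: 100,000,000,000
```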
Learning Rate Experiments
- Tested various learning rates to find the optimal setting; according to the experiments, the best-performing peak learning rate is 1e-5
Performance
- Close to Llama3 8B on some metrics
- Behind Llama3 8B in overall average performance
Evaluation
- Metrics included perplexity, MMLU scores, and other standard benchmarks
These extended training runs on 100 billion tokens pushed the boundaries of highly quantized models, bringing performance closer to half-precision models like Llama3.
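For readers unfamiliar with perplexity, the sketch below shows the standard definition: the exponential of the negative mean log-probability the model assigns to each token. This is a generic illustration with a hypothetical `perplexity` helper, not the LightEval implementation, and the input values are made up for the example.

```python
import math


# Generic definition of perplexity; the helper name is hypothetical.
def perplexity(token_log_probs):
    """Return exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every token has perplexity of about 4.
example_log_probs = [math.log(0.25)] * 4
ppl = perplexity(example_log_probs)
print(round(ppl, 6))
```

Lower values mean the model is less "surprised" by the evaluation text.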
Evaluation
The models were evaluated on the nanotron checkpoints using LightEval.
Citation
@misc{,
title={1.58-Bit LLM: A New Era of Extreme Quantization},
author={Mohamed Mekkouri and Marc Sun and Leandro von Werra and Thomas Wolf},
year={2024},
}