Instructions to use SandLogicTechnologies/llama-2-13b-chat-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="SandLogicTechnologies/llama-2-13b-chat-GGUF", filename="llama-2-13b-chat_Q4_K_M.gguf", )
output = llm( "Once upon a time,", max_tokens=512, echo=True ) print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
Use Docker
docker model run hf.co/SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "SandLogicTechnologies/llama-2-13b-chat-GGUF" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SandLogicTechnologies/llama-2-13b-chat-GGUF", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
- Ollama
How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with Ollama:
ollama run hf.co/SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
- Unsloth Studio new
How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for SandLogicTechnologies/llama-2-13b-chat-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for SandLogicTechnologies/llama-2-13b-chat-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for SandLogicTechnologies/llama-2-13b-chat-GGUF to start chatting
- Docker Model Runner
How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with Docker Model Runner:
docker model run hf.co/SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
- Lemonade
How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.llama-2-13b-chat-GGUF-Q4_K_M
List all available models
lemonade list
Llama-2-13b-chat-hf
Llama-2-13b-chat-hf is a conversational large language model developed by Meta and optimized for dialogue-oriented applications. It is part of the Llama 2 family of generative language models and is specifically fine-tuned to behave as a helpful assistant in multi-turn interactions.
This model builds on the pretrained Llama 2 13B foundation and incorporates supervised fine-tuning and human feedback alignment to improve response quality, helpfulness, and safety in conversational settings.
The Hugging Face version is converted for compatibility with the Transformers ecosystem, enabling straightforward integration into research and production pipelines.
Model Overview
- Model Name: Llama-2-13b-chat-hf
- Base Model: meta-llama/Llama-2-13b
- Architecture: Decoder-only Transformer
- Parameter Count: 13 Billion
- Context Window: Up to ~4096 tokens
- Modalities: Text
- Primary Language: English
- Developer: Meta
- License: Llama 2 Community License
Quantization Details
Q4_K_M
- Approx. ~71% size reduction (7.33 GB)
- Lower memory footprint for local inference
- Suitable for CPU execution and limited VRAM GPUs
- Faster token generation speeds
- Slight precision trade-offs in complex reasoning tasks
Q5_K_M
- Approx. ~66% size reduction (8.60 GB)
- Higher numerical precision compared to lower-bit variants
- Improved response stability and coherence
- Better performance for reasoning-heavy prompts
- Recommended when additional memory is available
Training Overview
Pretraining
The base Llama 2 models are trained on very large-scale text datasets consisting of publicly available, licensed, and proprietary sources. Training emphasizes language understanding, reasoning, and contextual coherence across diverse domains.
Chat Alignment
The chat variant is further refined through supervised fine-tuning and human feedback to improve:
- conversational quality
- instruction adherence
- safety and helpfulness
- response consistency
Llama-2-13b-chat-hf is designed to deliver strong conversational performance while maintaining efficient inference for a model of its scale.
Key design priorities include:
- Natural and coherent dialogue generation
- Reliable instruction following
- Improved safety and helpfulness
- Consistent multi-turn conversation handling
- Balanced reasoning and knowledge responses
Core Capabilities
Conversational interaction
Maintains coherent multi-turn dialogue.Instruction following
Executes structured prompts and complex tasks.Reasoning and explanation
Handles analytical questions and structured thinking.Contextual understanding
Processes extended conversations within its token window.Assistant-style communication
Produces helpful and informative responses.
Example Usage
llama.cpp
./llama-cli
-m SandlogicTechnologies\Llama-2-13b-chat_Q4_K_M.gguf
-p "Explain how attention mechanisms work in transformers."
Recommended Use Cases
- Conversational AI assistants
- Knowledge and question answering
- Technical explanation and tutoring
- Content generation and summarization
- Prompt-driven automation workflows
- Research and evaluation of chat models
Acknowledgments
These quantized models are based on the original work by meta-llama development team.
Special thanks to:
The meta-llama team for developing and releasing the meta-llama/Llama-2-13b-chat-hf model.
Georgi Gerganov and the entire
llama.cppopen-source community for enabling efficient model quantization and inference via the GGUF format.
Contact
For any inquiries or support, please contact us at support@sandlogic.com or visit our Website.
- Downloads last month
- 10
4-bit
5-bit
Model tree for SandLogicTechnologies/llama-2-13b-chat-GGUF
Base model
meta-llama/Llama-2-13b