Instructions to use SandLogicTechnologies/llama-2-13b-chat-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="SandLogicTechnologies/llama-2-13b-chat-GGUF",
	filename="llama-2-13b-chat_Q4_K_M.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M

Use Docker

docker model run hf.co/SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M

LM Studio
Jan

vLLM

How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SandLogicTechnologies/llama-2-13b-chat-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SandLogicTechnologies/llama-2-13b-chat-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M

Ollama
How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with Ollama:
```
ollama run hf.co/SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
```

Unsloth Studio new

How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SandLogicTechnologies/llama-2-13b-chat-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SandLogicTechnologies/llama-2-13b-chat-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for SandLogicTechnologies/llama-2-13b-chat-GGUF to start chatting

Docker Model Runner
How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with Docker Model Runner:
```
docker model run hf.co/SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M
```

Lemonade

How to use SandLogicTechnologies/llama-2-13b-chat-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull SandLogicTechnologies/llama-2-13b-chat-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.llama-2-13b-chat-GGUF-Q4_K_M

List all available models

lemonade list

Llama-2-13b-chat-hf

Llama-2-13b-chat-hf is a conversational large language model developed by Meta and optimized for dialogue-oriented applications. It is part of the Llama 2 family of generative language models and is specifically fine-tuned to behave as a helpful assistant in multi-turn interactions.

This model builds on the pretrained Llama 2 13B foundation and incorporates supervised fine-tuning and human feedback alignment to improve response quality, helpfulness, and safety in conversational settings.

The Hugging Face version is converted for compatibility with the Transformers ecosystem, enabling straightforward integration into research and production pipelines.

Model Overview

Model Name: Llama-2-13b-chat-hf
Base Model: meta-llama/Llama-2-13b
Architecture: Decoder-only Transformer
Parameter Count: 13 Billion
Context Window: Up to ~4096 tokens
Modalities: Text
Primary Language: English
Developer: Meta
License: Llama 2 Community License

Quantization Details

Q4_K_M

Approx. ~71% size reduction (7.33 GB)
Lower memory footprint for local inference
Suitable for CPU execution and limited VRAM GPUs
Faster token generation speeds
Slight precision trade-offs in complex reasoning tasks

Q5_K_M

Approx. ~66% size reduction (8.60 GB)
Higher numerical precision compared to lower-bit variants
Improved response stability and coherence
Better performance for reasoning-heavy prompts
Recommended when additional memory is available

Training Overview

Pretraining

The base Llama 2 models are trained on very large-scale text datasets consisting of publicly available, licensed, and proprietary sources. Training emphasizes language understanding, reasoning, and contextual coherence across diverse domains.

Chat Alignment

The chat variant is further refined through supervised fine-tuning and human feedback to improve:

conversational quality
instruction adherence
safety and helpfulness
response consistency

Llama-2-13b-chat-hf is designed to deliver strong conversational performance while maintaining efficient inference for a model of its scale.

Key design priorities include:

Natural and coherent dialogue generation
Reliable instruction following
Improved safety and helpfulness
Consistent multi-turn conversation handling
Balanced reasoning and knowledge responses

Core Capabilities

Conversational interaction
Maintains coherent multi-turn dialogue.
Instruction following
Executes structured prompts and complex tasks.
Reasoning and explanation
Handles analytical questions and structured thinking.
Contextual understanding
Processes extended conversations within its token window.
Assistant-style communication
Produces helpful and informative responses.

Example Usage

llama.cpp


./llama-cli 
-m SandlogicTechnologies\Llama-2-13b-chat_Q4_K_M.gguf 
-p "Explain how attention mechanisms work in transformers."

Recommended Use Cases

Conversational AI assistants
Knowledge and question answering
Technical explanation and tutoring
Content generation and summarization
Prompt-driven automation workflows
Research and evaluation of chat models

Acknowledgments

These quantized models are based on the original work by meta-llama development team.

Special thanks to:

The meta-llama team for developing and releasing the meta-llama/Llama-2-13b-chat-hf model.
Georgi Gerganov and the entire llama.cpp open-source community for enabling efficient model quantization and inference via the GGUF format.

Contact

For any inquiries or support, please contact us at support@sandlogic.com or visit our Website.

Downloads last month: 10

GGUF

Model size

13B params

Architecture

llama

Hardware compatibility

4-bit

5-bit

Model tree for SandLogicTechnologies/llama-2-13b-chat-GGUF

Base model

meta-llama/Llama-2-13b

Quantized

(1)

this model