GigaChat3.1-702B-A36B — MLX 4-bit

MLX 4-bit quantization of ai-sage/GigaChat3.1-702B-A36B-bf16 for Apple Silicon.

Model Details

  • Architecture: DeepseekV3 MoE + MLA (Multi-head Latent Attention)
  • Total Parameters: 702B
  • Active Parameters: 36B (8 of 256 experts per token)
  • Quantization: 4-bit (group size 64, 4.502 bits/weight avg)
  • Model Size: 368 GB
  • Context: 262K tokens
  • License: MIT (same as original)
  • Converted by: RockTalk using mlx_lm v0.31.1
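The size figure can be roughly cross-checked from the quantization parameters listed above. The sketch below assumes that each group of 64 weights carries one 16-bit scale and one 16-bit bias (an assumption about the storage layout, not something stated here) and treats 702B as an exact parameter count.

```python
# Back-of-the-envelope size check for the figures listed above.
# Assumption: each group of 64 weights stores a 16-bit scale and a 16-bit bias.

def bits_per_weight(bits: int, group_size: int, overhead_bits: int = 32) -> float:
    """Effective bits per quantized weight, including per-group scale and bias."""
    return bits + overhead_bits / group_size

def size_bytes(n_params: float, avg_bits: float) -> float:
    """Total weight storage in bytes for n_params at avg_bits per weight."""
    return n_params * avg_bits / 8

nominal = bits_per_weight(4, 64)    # nominal rate for 4-bit, group size 64
total = size_bytes(702e9, 4.502)    # uses the reported average bits/weight

print(f"nominal bits/weight: {nominal}")
print(f"size: {total / 1e9:.0f} GB (decimal), {total / 2**30:.0f} GiB (binary)")
```

Under these assumptions the total comes to about 395 GB in decimal units, or about 368 GiB, which suggests the 368 GB figure is expressed in binary units. The small gap between the nominal 4.5 and the reported 4.502 bits/weight would be consistent with a few tensors being kept at higher precision, though the card does not say which.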

Requirements

  • Apple Silicon Mac with 384GB+ unified memory (512GB recommended)
  • macOS 14+
  • mlx_lm >= 0.31.0
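A pre-flight check for the memory and mlx_lm requirements above might look like the following sketch. The function names and thresholds are illustrative, "GB" is read as GiB, and the macOS version is not checked.

```python
# Minimal pre-flight check against the requirements listed above.
# Assumptions: "384GB" is read as 384 GiB; versions are dotted integers.
import os
import re

MIN_MEMORY_GIB = 384
MIN_MLX_LM = (0, 31, 0)

def version_tuple(version: str) -> tuple:
    """Turn '0.31.1' into (0, 31, 1); pads short versions with zeros."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (3 - len(parts))
    return tuple(parts)

def total_memory_gib() -> float:
    """Physical memory in GiB via POSIX sysconf (should work on macOS and Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30

def meets_requirements(memory_gib: float, mlx_lm_version: str) -> bool:
    """True if both the memory and the mlx_lm version thresholds are met."""
    return (memory_gib >= MIN_MEMORY_GIB
            and version_tuple(mlx_lm_version) >= MIN_MLX_LM)
```

On a target machine this could be called as `meets_requirements(total_memory_gib(), importlib.metadata.version("mlx-lm"))`, after importing `importlib.metadata`.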

Performance (M3 Ultra, 512GB)

Metric             Value
Load time          ~18 min
Generation speed   ~12 tok/s
RAM usage          ~368 GB

Usage

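from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Loads the model and tokenizer; the weights are fetched from the Hugging Face Hub on first use.
model, tokenizer = load("RockTalk/GigaChat3.1-702B-A36B-MLX-4bit")
# Temperature sampling at 0.7.
sampler = make_sampler(temp=0.7)

# Format the conversation with the model's chat template, returned as a string.
messages = [{"role": "user", "content": "Hello!"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# Generate up to 500 new tokens.
response = generate(model, tokenizer, prompt=prompt, max_tokens=500, sampler=sampler)
print(response)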

Conversion Notes

Converted from the bf16 source weights. The conversion required patching the sanitize method in mlx_lm's deepseek_v3.py, which hardcodes removal of the MTP (Multi-Token Prediction) layer at layer index 61. That index is correct for DeepSeek-V3, whose 61 transformer layers occupy indices 0 to 60, but GigaChat has 64 transformer layers (indices 0 to 63), so its MTP layer sits at index 64. The fix:

# In mlx_lm/models/deepseek_v3.py, sanitize() method:
# Change hardcoded "model.layers.61" to dynamic:
mtp_prefix = f"model.layers.{self.args.num_hidden_layers}"
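To illustrate the idea outside mlx_lm, here is a self-contained sketch of filtering a weight dictionary by a prefix derived from the layer count. The function name and the toy keys are hypothetical, and the trailing dot in the prefix is an extra precaution in this sketch rather than part of the patch shown above.

```python
# Sketch of dropping MTP weights using a prefix derived from the layer count.
# The function name and the toy weight dict are illustrative, not mlx_lm code.

def drop_mtp_weights(weights: dict, num_hidden_layers: int) -> dict:
    """Return weights without entries for the layer after the last transformer layer."""
    # Transformer layers are indexed 0..num_hidden_layers-1, so the MTP layer
    # sits at index num_hidden_layers. The trailing dot keeps the prefix from
    # matching longer indices that merely start with the same digits.
    mtp_prefix = f"model.layers.{num_hidden_layers}."
    return {k: v for k, v in weights.items() if not k.startswith(mtp_prefix)}

# Toy example with made-up keys: only the index-64 entry should be removed.
toy = {
    "model.layers.63.mlp.gate.weight": 1,
    "model.layers.64.embed_tokens.weight": 2,
    "model.embed_tokens.weight": 3,
}
kept = drop_mtp_weights(toy, num_hidden_layers=64)
print(sorted(kept))
```

With a hardcoded index of 61, the same filter would leave the index-64 entry in place for a 64-layer model, which is the mismatch described above.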

The tokenizer_class was also patched from TokenizersBackend (transformers v5) to PreTrainedTokenizerFast.
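A tokenizer patch of this kind can be sketched as a small JSON edit. This assumes the field lives in tokenizer_config.json, as is conventional for Hugging Face tokenizers; the helper name and the temporary demo file are hypothetical.

```python
# Sketch of rewriting tokenizer_class in a tokenizer_config.json file.
# Assumption: the field is stored in tokenizer_config.json; the helper is illustrative.
import json
import tempfile
from pathlib import Path

def patch_tokenizer_class(config_path, old="TokenizersBackend",
                          new="PreTrainedTokenizerFast") -> bool:
    """Replace tokenizer_class if it equals `old`; return True if the file changed."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if config.get("tokenizer_class") != old:
        return False
    config["tokenizer_class"] = new
    path.write_text(json.dumps(config, indent=2, ensure_ascii=False) + "\n",
                    encoding="utf-8")
    return True

# Demo on a throwaway file rather than a real model directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "tokenizer_config.json"
    demo.write_text(json.dumps({"tokenizer_class": "TokenizersBackend"}),
                    encoding="utf-8")
    changed = patch_tokenizer_class(demo)
    patched = json.loads(demo.read_text(encoding="utf-8"))
    unchanged_on_rerun = not patch_tokenizer_class(demo)

print(changed, patched["tokenizer_class"])
```

The helper only rewrites the file when the old class name is present, so running it twice is harmless.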

About GigaChat

GigaChat is the AI model family from Sber, Russia's largest bank. The 702B "Ultra" model is its flagship: it uses the same DeepseekV3 architecture (MoE + MLA + MTP) and was trained on multilingual data (~5.5T synthetic tokens, 10 languages). It is MIT licensed.
