How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf thinktecture/gemma3-4b-ft-nextera-f16:F16
# Run inference directly in the terminal:
llama-cli -hf thinktecture/gemma3-4b-ft-nextera-f16:F16
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf thinktecture/gemma3-4b-ft-nextera-f16:F16
# Run inference directly in the terminal:
llama-cli -hf thinktecture/gemma3-4b-ft-nextera-f16:F16
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf thinktecture/gemma3-4b-ft-nextera-f16:F16
# Run inference directly in the terminal:
./llama-cli -hf thinktecture/gemma3-4b-ft-nextera-f16:F16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf thinktecture/gemma3-4b-ft-nextera-f16:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf thinktecture/gemma3-4b-ft-nextera-f16:F16
Use Docker
docker model run hf.co/thinktecture/gemma3-4b-ft-nextera-f16:F16
Quick Links

⚠️ Conference talk demo β€” not production weights.

This model accompanies a conference keynote on local on-device AI. Published as a reference for the fine-tuning patterns shown on stage β€” not a deployable artefact. No security audit, no SLA, pinned to the talk's state.


Gemma3-4B FT (f16) β€” RAG Synthesis (+ Vision)

Base model google/gemma-3-4b-it (4.3B params, multimodal: text + vision via mmproj)
License Gemma Terms of Use
Training script finetune/train_gemma3_4b.py
Method LoRA r=16, Ξ±=32, 3 epochs, lr=5e-5
Training data data/training-data/gemma3_4b_synthesis_{scenario}.jsonl (RAG passages + grounded answers)
Hardware tested RTX PRO 6000 (CUDA). MPS works but slow; QLoRA via --qlora for ≀24GB VRAM
Intended use RAG response synthesis β€” given retrieved passages and a user question, produce a grounded, source-faithful answer. The vision channel (mmproj) remains base-only.
Out of scope Tool calling (delegated to Qwen3.5-4B FT). Free-form chat without retrieved context.
Reference eval (Nextera) RAG keyword grounding: 96% on 25-query holdout. See docs/benchmarks/EVAL_RESULTS_*.md.
Known failure modes Will occasionally synthesise across documents that share lexical overlap but different domains β€” mitigated by the rewrite-query step that pre-filters retrieval.
Downloads last month
35
GGUF
Model size
4B params
Architecture
gemma3
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for thinktecture/gemma3-4b-ft-nextera-f16

Quantized
(221)
this model