VLSI-SLM V1: CodeLlama Full Model

The first open-source, edge-trained, laptop-deployable Small Language Model specialized for VLSI design.

A 7B-parameter CodeLlama model fine-tuned on 30,354 curated VLSI examples, trained entirely on an NVIDIA Jetson Orin edge device with no cloud compute. It generates Verilog (60% iverilog syntax pass rate on the standard benchmark), explains VLSI concepts, and runs offline on a laptop once quantized to a roughly 4GB GGUF file.


Model Details

| Property | Value |
|---|---|
| Base Model | CodeLlama-7B-Instruct |
| Fine-tuning Method | LoRA (r=32, α=64) |
| Trainable Parameters | 82,265,088 (1.21% of 6.82B) |
| Training Hardware | NVIDIA Jetson Orin 64GB (edge device) |
| Training Time | ~84 hours wall time |
| Dataset Size | 30,354 examples (train) / 1,681 (val) |
| Training Epochs | 3 |
| Final Train Loss | 0.0122 |
| Best Val Loss | 0.3892 (step 4000) |
| Precision | bfloat16 (no quantization during training) |
| License | MIT |

LoRA Configuration

LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # Attention
        "gate_proj", "up_proj", "down_proj",       # MLP/FFN
        "embed_tokens", "lm_head",                 # Embeddings
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
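
The trainable-parameter figure in Model Details can be reproduced from this configuration. The sketch below is a back-of-the-envelope check, not the author's script: the layer shapes (hidden size 4096, MLP size 11008, 32 layers, vocabulary 32016) are assumed CodeLlama-7B dimensions that this card does not state.

```python
# Assumed CodeLlama-7B shapes (not stated in this card).
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32
VOCAB = 32016
RANK = 32  # r from the LoraConfig above


def lora_params(in_features: int, out_features: int, rank: int = RANK) -> int:
    """A LoRA pair adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


attention = 4 * lora_params(HIDDEN, HIDDEN)                    # q, k, v, o
mlp = 2 * lora_params(HIDDEN, INTERMEDIATE) \
    + lora_params(INTERMEDIATE, HIDDEN)                        # gate, up, down
embeddings = lora_params(VOCAB, HIDDEN) \
    + lora_params(HIDDEN, VOCAB)                               # embed_tokens, lm_head

total = LAYERS * (attention + mlp) + embeddings
share = 100 * total / 6.82e9  # against the 6.82B quoted in Model Details

print(f"{total:,}")      # 82,265,088
print(f"{share:.2f}%")   # 1.21%
```

Under these assumed shapes, the count matches the 82,265,088 reported above.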

Repository Contents

VLSI-SLM-V1-CodeLlama-Full/
├── final_model/          ← Merged full model (~14GB, bf16 safetensors)
├── final_adapter/        ← LoRA adapter only (~200MB)
├── checkpoint-5000/      ← Training checkpoint
├── checkpoint-5250/      ← Training checkpoint
├── checkpoint-5500/      ← Training checkpoint
├── checkpoint-5691/      ← Final training checkpoint
├── evaluation/           ← Benchmark results and logs
├── logs/                 ← Full training logs
├── baseline_pre_ft.json  ← Base model responses (pre fine-tuning)
├── best_checkpoint.txt   ← Best validation checkpoint info
├── heartbeat.json        ← Last training state
└── m4_config_v31.json    ← Exact training hyperparameters

Evaluation Results

Evaluated using a semantic scoring system (not rigid keyword matching) with max_new_tokens=1024.

Standard 50-Question VLSI Benchmark

| Metric | Score | Target | Status |
|---|---|---|---|
| Code Syntax Pass (iverilog) | 60.0% | 40–60% | ✅ PASS |
| Concept Accuracy | 65.0% | 85–90% | 🟡 CLOSE |
| Hallucination Rate | 0.0% | <5% | ✅ PERFECT |
| Code Block Formatting | 95.0% | n/a | ✅ |
| Debug Accuracy | 60.0% | n/a | 🟡 |
| Overall | 72.0% | n/a | ✅ |

Coding Stress Test (50 Progressive Questions)

| Difficulty | Questions | Pass Rate | Examples |
|---|---|---|---|
| Easy | 10 | 100% | AND gate, DFF, counter, decoder |
| Medium | 15 | 87% | FIFO, ALU, FSM, synchronizer |
| Hard | 13 | 62% | Async FIFO, AXI-Lite, SPI master |
| Expert | 12 | 42% | FP adder, MBIST, JTAG TAP controller |

The model handles standard VLSI building blocks reliably at Easy and Medium difficulty. Expert-level modules (1000+ tokens) show truncation artifacts, a known training data issue being addressed in V2.
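
For a single headline number, the per-tier rates in the stress test table can be folded into one aggregate. This is a rough sketch: the per-tier pass counts are inferred by rounding the reported percentages and are not stated in this card.

```python
# (questions, reported pass rate) per difficulty tier, from the stress test table
tiers = {
    "Easy": (10, 1.00),
    "Medium": (15, 0.87),
    "Hard": (13, 0.62),
    "Expert": (12, 0.42),
}

# Inferred pass counts: the nearest whole number of questions for each rate.
passed = {name: round(count * rate) for name, (count, rate) in tiers.items()}

total_questions = sum(count for count, _ in tiers.values())
total_passed = sum(passed.values())

print(passed)  # {'Easy': 10, 'Medium': 13, 'Hard': 8, 'Expert': 5}
print(f"{total_passed}/{total_questions} = {total_passed / total_questions:.0%}")  # 36/50 = 72%
```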


Quick Start

Load and Run Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Rajasrl/VLSI-SLM-V1-CodeLlama-Full"

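# The merged weights live in the final_model/ subfolder of the repository,
# so pass it via `subfolder` rather than appending it to the repo id.
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="final_model")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    subfolder="final_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)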
model.eval()

def ask_vlsi(question: str, code_mode: bool = False) -> str:
    if code_mode:
        system = """You are a Senior VLSI RTL Engineer.
Rules:
1. Always wrap code in ```verilog blocks
2. Use non-blocking assignments (<=) in sequential always blocks
3. Use blocking assignments (=) in combinational always blocks
4. Always include complete module with endmodule
5. Never use reserved keywords as signal names"""
    else:
        system = "You are an expert VLSI engineer. Give accurate, technical answers."

    prompt = f"### System:\n{system}\n\n### Instruction:\n{question}\n\n### Response:\n"
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
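    # Greedy decoding for code; low-temperature sampling for explanations.
    # temperature is passed only when sampling, because generate() ignores it otherwise.
    if code_mode:
        sampling = {"do_sample": False}
    else:
        sampling = {"do_sample": True, "temperature": 0.1}

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=1024,       # Important: use 1024+ for complete modules
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
            **sampling,
        )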
    
    response = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    )
    return response.strip()

# Code generation (deterministic)
print(ask_vlsi(
    "Write a parameterizable 8-bit synchronous counter with reset.",
    code_mode=True
))

# Concept explanation
print(ask_vlsi(
    "Explain clock domain crossing and how to handle it safely.",
    code_mode=False
))

Run with Ollama (Recommended for Laptop Deployment)

First quantize to GGUF:

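# Install llama.cpp (recent versions build with CMake instead of make)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j4
pip install -r requirements.txt   # dependencies of the conversion script

# Convert and quantize (./final_model is the downloaded merged model directory)
python convert_hf_to_gguf.py ./final_model --outtype f16 \
    --outfile vlsi_slm_v1_f16.gguf

./build/bin/llama-quantize vlsi_slm_v1_f16.gguf vlsi_slm_v1_Q4_K_M.gguf Q4_K_M
# Output: ~4GB file, small enough for most laptops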

Create Modelfile:

FROM ./vlsi_slm_v1_Q4_K_M.gguf

SYSTEM """You are an expert VLSI and Verilog engineer.
For code: output only syntactically correct, synthesizable Verilog.
Use non-blocking assignments (<=) in sequential always blocks.
Always wrap code in ```verilog blocks.
Always include endmodule.
For concepts: give accurate, technical explanations."""

PARAMETER temperature 0.1
PARAMETER num_ctx 2048

Build and run the model:

ollama create vlsi-slm-v1 -f Modelfile
ollama run vlsi-slm-v1

What This Model Can Do ✅

Strong Capabilities (Easy–Medium complexity)

Verilog Code Generation:

  • Flip-flops (D, T, JK) with synchronous/asynchronous reset
  • Counters (binary, Gray code, Johnson, LFSR)
  • Multiplexers, encoders, decoders
  • Shift registers (parameterizable width/depth)
  • State machines (Moore and Mealy FSM)
  • Synchronous SRAM and FIFO
  • Clock dividers and pulse generators
  • Debounce circuits
  • Two-flop CDC synchronizers
  • Basic AXI-Lite and handshake protocols
  • Simple UART, SPI, I2C controllers
  • Testbench templates

VLSI Concept Explanations:

  • Clock Domain Crossing (CDC) and metastability
  • Setup time and hold time analysis
  • Power reduction: clock gating and power gating
  • Static Timing Analysis (STA) concepts
  • Scan chains and Design for Testability (DFT)
  • SRAM vs DRAM differences
  • Electromigration and IR drop
  • AXI, APB, AHB protocol rules
  • Blocking vs non-blocking assignments
  • Latch inference and how to avoid it

Partial Capabilities (Hard complexity)

  • Asynchronous FIFO with Gray code pointers (architecture correct, may miss endmodule)
  • Round-robin arbiters
  • Pipeline structures
  • SPI master/slave controllers
  • Branch predictors
  • Memory BIST controllers

Known Limitations ⚠️

1. Truncation Artifact (Primary Known Issue)

Complex modules exceeding ~800 tokens of output may be cut off before endmodule. This is a training data artifact: the dataset was generated using free APIs with 1800-token output limits, and truncated examples leaked through. The model learned this truncation pattern as a behavior.

Workaround: Always set max_new_tokens=1024 or higher. If output is still truncated, append \nendmodule manually; the logic inside is typically correct.
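
The manual workaround can be automated as a post-processing step. The helper below is a minimal sketch (the name `ensure_endmodule` is hypothetical, not part of this repository). It only restores a missing final `endmodule`; it cannot repair a statement that was cut off mid-line, so re-check the result with iverilog.

```python
import re


def ensure_endmodule(verilog: str) -> str:
    """Append `endmodule` when the last module was opened but never closed."""
    opens = len(re.findall(r"^[ \t]*module\b", verilog, flags=re.MULTILINE))
    closes = len(re.findall(r"\bendmodule\b", verilog))
    if opens > closes:
        # Truncation affects the tail of the output, so one closer is enough.
        return verilog.rstrip() + "\nendmodule\n"
    return verilog


truncated = "module buf1(input a, output y);\n  assign y = a;"
print(ensure_endmodule(truncated))
```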

Fix in progress: V2 training uses strict endmodule validation gates in the data pipeline.

2. Concept Accuracy Gap

Concept accuracy is 65% vs the 85-90% target. Root cause: PDF textbooks were extracted page-by-page (not paragraph-by-paragraph), causing "semantic blur" where opposing concepts (e.g., Setup vs Hold timing) were mixed in the same training example.

3. Submodule Hallucination

Occasionally instantiates undefined submodules (fa fa0(...) style) when asked for gate-level designs. Best avoided by explicitly requesting "behavioral RTL" in your prompt.
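
Generated code can also be screened for this pattern before use. The following is a rough heuristic sketch, not a Verilog parser: it looks for `type name (` at the start of a line and reports types that are never defined as modules in the same text. The function name and the keyword list are illustrative assumptions, and instantiations with parameter overrides or split across lines are not detected.

```python
import re

# Words that can start a line and be followed by "name (" without being a
# user-defined module: declaration keywords, control flow, gate primitives.
NOT_A_MODULE = {
    "module", "macromodule", "function", "task", "primitive",
    "if", "else", "for", "while", "case", "casex", "casez", "repeat",
    "and", "or", "not", "nand", "nor", "xor", "xnor", "buf",
}

DEFINITION = re.compile(r"^[ \t]*module[ \t]+([A-Za-z_]\w*)", re.MULTILINE)
# "<type> <instance_name> (" on a single line.
INSTANCE = re.compile(
    r"^[ \t]*([A-Za-z_]\w*)[ \t]+([A-Za-z_]\w*)[ \t]*\(", re.MULTILINE
)


def undefined_submodules(verilog: str) -> set[str]:
    """Return instantiated module types with no definition in `verilog`."""
    defined = set(DEFINITION.findall(verilog))
    used = {match.group(1) for match in INSTANCE.finditer(verilog)}
    return used - NOT_A_MODULE - defined


example = """
module adder1(input a, b, output s, c);
  wire t;
  fa fa0 (a, b, 1'b0, s, t);
  and g1 (c, t, t);
endmodule
"""
print(undefined_submodules(example))  # {'fa'}
```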

4. Not Trained for SoC-Level Design

This model is optimized for block-level RTL (FIFOs, arbiters, FSMs, protocol controllers). It is not intended for full SoC or chip-level architecture. Expert-level questions (5-stage RISC pipeline, NoC routers, IEEE 754 FP units) are attempted but may be incomplete.

5. Hardware Requirements

Training ran on a 64GB Jetson Orin. The merged bf16 model needs about 15GB of RAM to load. Use the GGUF Q4_K_M quantized version (about 4GB) for laptop deployment.
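
These sizes follow from the parameter count and the bits stored per weight. The estimate below is a rough sketch: the 4.85 bits per weight used for Q4_K_M is an approximate average assumed here, not a figure from this card, and runtime overhead (KV cache, activations) comes on top.

```python
PARAMS = 6.82e9  # parameter count quoted in Model Details


def weights_gb(bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return PARAMS * bits_per_weight / 8 / 1e9


bf16_gb = weights_gb(16)      # merged bf16 model
q4_gb = weights_gb(4.85)      # assumed average for Q4_K_M (approximate)

print(f"bf16:   {bf16_gb:.1f} GB")
print(f"Q4_K_M: {q4_gb:.1f} GB")
```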


Training Details

Hardware

This model was trained entirely on an NVIDIA Jetson Orin 64GB, an edge computing device, with no cloud GPUs used.

Device      : NVIDIA Jetson Orin (64GB unified RAM)
CUDA        : 12.6 (ARM64)
OS          : Ubuntu 22.04
PyTorch     : 2.5.0a0 nv24.8
Transformers: 4.44.0
PEFT        : 0.18.1
TRL         : 0.8.6

Important hardware note: bitsandbytes is not compatible with CUDA 12.6 on Jetson Orin ARM64. Training used pure bfloat16 with adamw_torch optimizer. If you attempt to run this model on similar ARM64 Jetson hardware, do not use bitsandbytes or NEFTune.

Training Configuration

TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,     # Effective batch = 16
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    fp16=False,
    gradient_checkpointing=True,
    optim="adamw_torch",
    max_grad_norm=1.0,
    save_steps=500,
    eval_steps=500,
    save_total_limit=4,
    group_by_length=True,
)
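
The step counts implied by this configuration can be checked with a little arithmetic. The sketch below assumes a single device and the usual Trainer convention of dropping the partial accumulation window at the end of each epoch. The result matches the final checkpoint-5691 listed in the repository contents.

```python
import math

train_examples = 30_354
per_device_batch = 1
grad_accum = 16
epochs = 3
warmup_ratio = 0.03
wall_hours = 84

batches_per_epoch = math.ceil(train_examples / per_device_batch)
steps_per_epoch = batches_per_epoch // grad_accum   # partial window dropped
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)
seconds_per_step = wall_hours * 3600 / total_steps  # includes cooldown pauses

print(steps_per_epoch)              # 1897
print(total_steps)                  # 5691
print(warmup_steps)                 # 171
print(round(seconds_per_step, 1))   # 53.1
```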

Thermal Management Innovation

A custom thermal batching system was implemented:

  • Every 250 training steps: save checkpoint → 5-minute cooldown → resume
  • Table fan added for additional airflow
  • Result: GPU temperature maintained at 44–61°C throughout the 84-hour run
  • 6 power outages during training, all recovered via atomic heartbeat checkpointing
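
The card does not include the thermal batching code itself. As a hedged illustration of the two ideas, the sketch below writes a heartbeat file atomically (temporary file, then rename) and decides when a cooldown is due. The function names and the heartbeat fields are hypothetical, not taken from the actual training script.

```python
import json
import os
import tempfile


def write_heartbeat(path: str, state: dict) -> None:
    """Write `state` so that a power cut never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())   # force the bytes to disk first
        os.replace(tmp_path, path)      # atomic swap on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def cooldown_due(step: int, every: int = 250) -> bool:
    """True on the steps where training pauses for a cooldown."""
    return step > 0 and step % every == 0
```

In a training loop, a step where `cooldown_due(step)` is true would save a checkpoint, call `write_heartbeat`, and sleep for five minutes before resuming. On restart, the last heartbeat tells the script which checkpoint to load.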

Dataset

Source          : Curated VLSI examples (code + concept + QA)
Format          : Alpaca instruction tuning
Train           : 30,354 examples
Validation      : 1,681 examples  
Test            : 1,681 examples
Categories      : 75.8% code_generation, 23.0% concept, 1.2% QA
Max seq length  : 2048 tokens
Decontamination : ✅ Zero benchmark leaks verified
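
To show how an Alpaca-format record relates to the prompt used in Quick Start, here is a minimal sketch. It assumes training used the same "### System / ### Instruction / ### Response" template as inference, which this card does not state, and the handling of a non-empty `input` field is also an assumption.

```python
DEFAULT_SYSTEM = "You are an expert VLSI engineer. Give accurate, technical answers."


def build_prompt(record: dict, system: str = DEFAULT_SYSTEM) -> str:
    """Render an Alpaca record into the Quick Start prompt template."""
    instruction = record["instruction"]
    if record.get("input"):
        # Assumption: extra context is appended to the instruction text.
        instruction += "\n\n" + record["input"]
    return (
        f"### System:\n{system}\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n"
    )


record = {"instruction": "Write a D flip-flop.", "input": "", "output": "..."}
print(build_prompt(record))
```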

Comparison: Base vs Fine-tuned

| Metric | Base CodeLlama-7B | VLSI-SLM V1 |
|---|---|---|
| Verilog syntax knowledge | General | VLSI-specialized |
| VLSI concept depth | Surface-level | Detailed and accurate |
| Hallucination rate | ~10% | 0.0% |
| Code syntax pass (iverilog) | ~0% | 60% |
| Runs offline | ✅ | ✅ |
| Deployable on laptop | ✅ (4GB Q4) | ✅ (4GB Q4) |
| Cost | Free | Free |

Roadmap: What V2 Will Fix

VLSI-SLM V2 is currently in development with the following improvements:

| Issue | V1 Status | V2 Fix |
|---|---|---|
| Truncated endmodule | Present in complex modules | Strict validation gate in data pipeline |
| Concept accuracy 65% | Below target | Layout-aware PDF chunking (paragraph-level) |
| Submodule hallucination | Occasional | Anti-submodule prompt in data generation |
| Dataset quality | Quantity-focused (30K) | Quality-focused (12K clean) |
| JSON data corruption | Silent patching | Strict drop-on-failure |
| EOS alignment | Not enforced | EOS token after endmodule |
| Concept/code ratio | 23%/76% | 50%/50% balanced |

Target V2 metrics:

  • Code Syntax Pass: 65–75%
  • Concept Accuracy: 85–90%
  • Hallucination Rate: <2%

How to Contribute / Develop Further

1. Improve the Dataset

The biggest gains come from data quality, not model size.

# The most impactful contribution: add validated Verilog examples
# Requirements:
# - Must compile with iverilog
# - Must end with endmodule/endinterface/endpackage
# - Must be self-contained (no undefined submodules)
# - Alpaca format: {"instruction": ..., "input": "", "output": ...}

# Validate before contributing:
import subprocess
result = subprocess.run(["iverilog", "-tnull", "your_file.v"],
                       capture_output=True, text=True)
assert result.returncode == 0, f"Syntax error: {result.stderr}"
assert "endmodule" in open("your_file.v").read()
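
Before running iverilog, the dataset record itself can be checked against the requirements listed above. The function below is a minimal sketch for Verilog code examples only (the name `check_record` is hypothetical). It covers the JSON structure and the closing keyword, and it does not replace the compile check.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}
CLOSERS = ("endmodule", "endinterface", "endpackage")


def check_record(line: str) -> list[str]:
    """Return a list of problems found in one JSON-encoded Alpaca record."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    output = str(record.get("output", "")).strip()
    if output.endswith("```"):          # tolerate a closing code fence
        output = output[:-3].rstrip()
    if not output.endswith(CLOSERS):
        problems.append("output does not end with endmodule/endinterface/endpackage")
    return problems


good = json.dumps({
    "instruction": "Write a 2-input AND gate.",
    "input": "",
    "output": "module and2(input a, b, output y);\n  assign y = a & b;\nendmodule",
})
print(check_record(good))  # []
```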

2. Fine-tune Further on Your Domain

Use LoRA to specialize for your specific VLSI area:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load V1 as base for V2 fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    "Rajasrl/VLSI-SLM-V1-CodeLlama-Full",
    subfolder="final_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Add new LoRA adapters for your domain
# (FPGA-specific, ASIC timing, formal verification, etc.)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    # ... remaining settings for your domain
)
model = get_peft_model(model, lora_config)

3. Extend to SystemVerilog / UVM

The model has basic SV knowledge but was primarily trained on Verilog-2001. Adding UVM testbench examples and SystemVerilog assertions (SVA) would significantly improve verification use cases.

4. Add Image Recognition

A compelling future direction: a multi-modal VLSI assistant that can:

  • Read handwritten schematic photos → generate Verilog
  • Analyze timing diagrams → identify violations
  • Recognize circuit board components → explain connections

5. Build a Retrieval-Augmented Generation (RAG) Layer

Connect the model to a vector database of VLSI standards (IEEE 1800, AMBA AXI spec, IEEE 1149.1 JTAG) for factually grounded answers.

6. Evaluation Contributions

Add more benchmark questions to the evaluation/ folder, especially:

  • Formal verification questions (SVA, PSL)
  • Physical design (placement, routing, DRC)
  • Analog/mixed-signal interfaces
  • RISC-V specific RTL patterns

Citation

If you use this model in your research, please cite:

@misc{vlsi-slm-v1-2026,
  title        = {VLSI-SLM V1: An Edge-Trained Small Language Model for VLSI Design},
  author       = {Rajasrl},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Rajasrl/VLSI-SLM-V1-CodeLlama-Full}},
  note         = {Fine-tuned CodeLlama-7B on NVIDIA Jetson Orin edge hardware.
                  30,354 curated VLSI examples. Zero cloud compute.}
}

The Story

This model was trained by a final-year engineering student on borrowed edge hardware, with no cloud budget, no research lab, and no team. The training ran through 6 power outages, lightning storms, and thermal shutdowns, all recovered automatically.

The goal was simple: build a VLSI assistant that works offline, costs nothing to run, and belongs to the community, not behind an API paywall.

"I built an AI to teach me VLSI."


License

MIT License: free to use, modify, and distribute. See LICENSE for details.


Model trained: March 29 – April 3, 2026
Uploaded to Hugging Face: May 2026
Hardware: NVIDIA Jetson Orin 64GB (edge device, no cloud)
