# SigLIP2 Base 256: Food/Not-Food Classifier v2
Binary image classifier: `food_or_drink` vs `not_food_or_drink`. Part of the Nutrify pipeline. Role: highest accuracy.
## v2 Improvement
v2 adds 560,836 human-labeled FoodVision images to the 2,952,644-image DataComp training set. FoodVision samples are trained with hard cross-entropy loss; DataComp samples are distilled with KL loss from SigLIP2-so400m soft labels.
| Version | FoodVision Acc | FoodVision F1 | Training Data |
|---|---|---|---|
| v2 | 98.21% | 0.9883 | DataComp 2,952,644 + FoodVision 560,836 |
| v1 | 0.00% | 0.0000 | DataComp only |
| Δ | +98.21% | +0.9883 | |
## Cross-Model Comparison (v2, FoodVision Test, 153K images)
| Model | Params | FV Accuracy | FV F1 | Role |
|---|---|---|---|---|
| **SigLIP2 Base 256** | 92.9M | 98.21% | 0.9883 | Highest accuracy |
| CSATv2 11M | 10.7M | 97.99% | 0.9869 | Fastest throughput |
| NextViT Small 384 | 30.7M | 97.84% | 0.9859 | CoreML deployable |
## Quick Start
```python
import timm
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Create the model architecture (weights are loaded separately below)
model = timm.create_model("vit_base_patch16_siglip_256.v2_webli", pretrained=False, num_classes=2)

# Download and load the fine-tuned weights
weights_path = hf_hub_download("mrdbourke/food-not-food-classifier-siglip2-v2", "model.safetensors")
model.load_state_dict(load_file(weights_path))
model.eval()

# Build evaluation transforms from the model's pretrained config
data_cfg = timm.data.resolve_data_config(model.pretrained_cfg)
transform = timm.data.create_transform(**data_cfg, is_training=False)

# Predict
img = Image.open("your_image.jpg").convert("RGB")
x = transform(img).unsqueeze(0)
with torch.no_grad():
    logits = model(x)
pred = logits.argmax(dim=1).item()

labels = {0: "food_or_drink", 1: "not_food_or_drink"}
print(f"Prediction: {labels[pred]}")
```
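If you want a confidence score rather than just the argmax class, you can softmax the logits. A minimal sketch (the example logits tensor is made up for illustration; in practice it comes from `model(x)`):

```python
import torch

# Hypothetical logits for one image; in practice this is the model(x) output
logits = torch.tensor([[2.5, -1.0]])

# Softmax over the class dimension turns logits into per-class probabilities
probs = torch.softmax(logits, dim=1)
conf, pred = probs.max(dim=1)

labels = {0: "food_or_drink", 1: "not_food_or_drink"}
print(f"Prediction: {labels[pred.item()]} (confidence: {conf.item():.1%})")
```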
## Training Details
- Architecture: `vit_base_patch16_siglip_256.v2_webli` (92.9M parameters)
- Input size: 256px
- Training data: DataComp 2,952,644 (soft KL labels) + FoodVision 560,836 (hard CE labels)
- Epochs: 5 (best blended checkpoint at epoch 3)
- Peak inference throughput: 2096.3 img/s
- Optimizer: AdamW (head LR=1e-4, backbone LR=1e-5 after 0.5-epoch warmup)
- Loss: DataComp: 0.7×KL(T=3) + 0.3×CE | FoodVision: CE
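The per-source loss above can be sketched in PyTorch as follows. This is a hedged illustration rather than the actual training code: the `blended_loss` helper name is made up, and the `T**2` scaling on the KL term is a common distillation convention assumed here, not stated in the card.

```python
import torch
import torch.nn.functional as F

def blended_loss(student_logits, targets, teacher_logits=None, T=3.0):
    """Sketch: FoodVision samples use hard CE; DataComp samples use 0.7*KL(T=3) + 0.3*CE."""
    ce = F.cross_entropy(student_logits, targets)
    if teacher_logits is None:
        return ce  # human-labeled (FoodVision) batch: plain cross-entropy
    # DataComp batch: KL between temperature-softened student and teacher distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)  # T^2 scaling is an assumption (standard distillation practice)
    return 0.7 * kl + 0.3 * ce

# Tiny demo with random binary-class logits
student = torch.randn(4, 2)
teacher = torch.randn(4, 2)
targets = torch.tensor([0, 1, 0, 1])
print(blended_loss(student, targets))           # FoodVision-style (CE only)
print(blended_loss(student, targets, teacher))  # DataComp-style (0.7*KL + 0.3*CE)
```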
## Weight Variants
Three weight files are included, each optimized for a different metric:
| File | Selects by | FV Acc | DC Acc | Blended | Epoch | Use case |
|---|---|---|---|---|---|---|
| `model.safetensors` (default) | Best blended (50/50) | 98.21% | 92.42% | 95.31% | 3 | Balanced: good at everything |
| `model_best_fv.safetensors` | Best FoodVision test | 98.34% | 92.28% | 95.31% | 5 | On-device Nutrify deployment |
| `model_best_dc.safetensors` | Best DataComp val | 98.21% | 92.42% | 95.31% | 3 | Scale-up filtering (menus, panels, recipes) |
To load a specific variant:
```python
from huggingface_hub import hf_hub_download

# Default (blended)
weights_path = hf_hub_download("mrdbourke/food-not-food-classifier-siglip2-v2", "model.safetensors")

# Best for Nutrify on-device
weights_path = hf_hub_download("mrdbourke/food-not-food-classifier-siglip2-v2", "model_best_fv.safetensors")

# Best for scale-up filtering
weights_path = hf_hub_download("mrdbourke/food-not-food-classifier-siglip2-v2", "model_best_dc.safetensors")
```
## Related Models
| Version | Repo |
|---|---|
| v2 (this) | mrdbourke/food-not-food-classifier-siglip2-v2 |
| v1 | mrdbourke/food-not-food-classifier-siglip2-v1 |
## Dataset
Training images from DataComp-1B-food-and-drink-3M and the Nutrify FoodVision dataset (714K human-labeled images).
## License
Apache 2.0