SigLIP2 Base 256: Food/Not-Food Classifier v2

Binary image classifier: food_or_drink vs not_food_or_drink.

Part of the Nutrify pipeline, where its role is highest accuracy.

v2 Improvement

v2 adds 560,836 human-labeled FoodVision images to the 2,952,644-image DataComp training set. FoodVision samples use hard cross-entropy loss; DataComp samples use KL distillation from SigLIP2-so400m soft labels.

| Version | FoodVision Acc | FoodVision F1 | Training Data |
|---|---|---|---|
| v2 | 98.21% | 0.9883 | DataComp 2,952,644 + FoodVision 560,836 |
| v1 | 0.00% | 0.0000 | DataComp only |
| Δ | +98.21% | +0.9883 | |

Cross-Model Comparison (v2, FoodVision Test, 153K images)

| Model | Params | FV Accuracy | FV F1 | Role |
|---|---|---|---|---|
| **SigLIP2 Base 256** | 92.9M | 98.21% | 0.9883 | Highest accuracy |
| CSATv2 11M | 10.7M | 97.99% | 0.9869 | Fastest throughput |
| NextViT Small 384 | 30.7M | 97.84% | 0.9859 | CoreML deployable |

Quick Start

```python
import timm
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from safetensors.torch import load_file

# Create the architecture without pretrained weights (fine-tuned weights are loaded below)
model = timm.create_model("vit_base_patch16_siglip_256.v2_webli", pretrained=False, num_classes=2)

# Download and load the fine-tuned weights
weights_path = hf_hub_download("mrdbourke/food-not-food-classifier-siglip2-v2", "model.safetensors")
model.load_state_dict(load_file(weights_path))
model.eval()

# Get the eval transforms matching the model's data config
data_cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_cfg, is_training=False)

# Predict
img = Image.open("your_image.jpg").convert("RGB")
x = transform(img).unsqueeze(0)
with torch.no_grad():
    logits = model(x)
    pred = logits.argmax(dim=1).item()

labels = {0: "food_or_drink", 1: "not_food_or_drink"}
print(f"Prediction: {labels[pred]}")
```
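The argmax above discards how confident the model is. If you want a confidence score to threshold on (e.g. only accept predictions above 90%), apply a softmax to the logits first. The logits tensor below is a hard-coded stand-in for `model(x)` from the snippet above; the threshold value is an illustrative choice, not something the model card prescribes.

```python
import torch

# Stand-in logits; in practice use `logits = model(x)` from the Quick Start above
logits = torch.tensor([[2.5, -1.0]])

# Softmax turns raw logits into class probabilities that sum to 1
probs = torch.softmax(logits, dim=1)
conf, pred = probs.max(dim=1)

labels = {0: "food_or_drink", 1: "not_food_or_drink"}
print(f"{labels[pred.item()]} ({conf.item():.2%})")

# Example gating logic: only trust confident predictions
is_confident = conf.item() > 0.90
```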

Training Details

  • Architecture: vit_base_patch16_siglip_256.v2_webli (92.9M parameters)
  • Input size: 256px
  • Training data: DataComp 2,952,644 (soft KL labels) + FoodVision 560,836 (hard CE labels)
  • Epochs: 5 (best blended at epoch 3)
  • Peak inference throughput: 2096.3 img/s
  • Optimizer: AdamW (head LR=1e-4, backbone LR=1e-5 after 0.5 epoch warmup)
  • Loss: DataComp: 0.7Γ—KL(T=3) + 0.3Γ—CE | FoodVision: CE
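The per-source loss routing above can be sketched as follows. This is an illustrative reconstruction, not the training code: the function name, the per-sample `is_foodvision` mask, and the `T*T` rescaling of the KL term (the common distillation convention) are all assumptions; only the 0.7/0.3 weighting, T=3, and the CE-only path for FoodVision come from the card.

```python
import torch
import torch.nn.functional as F

def mixed_loss(student_logits, teacher_logits, hard_labels, is_foodvision, T=3.0):
    """Sketch of the mixed objective: CE for FoodVision samples,
    0.7*KL(T=3) + 0.3*CE for DataComp samples (weights from the card)."""
    # Temperature-scaled KL divergence between student and teacher distributions.
    # The T*T factor is the usual distillation convention (an assumption here).
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="none",
    ).sum(dim=1) * (T * T)
    ce = F.cross_entropy(student_logits, hard_labels, reduction="none")
    # Route each sample to its loss depending on which dataset it came from
    per_sample = torch.where(is_foodvision, ce, 0.7 * kl + 0.3 * ce)
    return per_sample.mean()
```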

Weight Variants

Three weight files are included, each optimized for a different metric:

| File | Selects by | FV Acc | DC Acc | Blended | Epoch | Use case |
|---|---|---|---|---|---|---|
| model.safetensors (default) | Best blended (50/50) | 98.21% | 92.42% | 95.31% | 3 | Balanced, good at everything |
| model_best_fv.safetensors | Best FoodVision test | 98.34% | 92.28% | 95.31% | 5 | On-device Nutrify deployment |
| model_best_dc.safetensors | Best DataComp val | 98.21% | 92.42% | 95.31% | 3 | Scale-up filtering (menus, panels, recipes) |

To load a specific variant:

```python
# Default (blended)
weights_path = hf_hub_download("mrdbourke/food-not-food-classifier-siglip2-v2", "model.safetensors")

# Best for Nutrify on-device
weights_path = hf_hub_download("mrdbourke/food-not-food-classifier-siglip2-v2", "model_best_fv.safetensors")

# Best for scale-up filtering
weights_path = hf_hub_download("mrdbourke/food-not-food-classifier-siglip2-v2", "model_best_dc.safetensors")
```

Dataset

Training images from DataComp-1B-food-and-drink-3M and the Nutrify FoodVision dataset (714K human-labeled images).

License

Apache 2.0
