---
license: apache-2.0
tags:
- age-estimation
- gender-classification
- face-analysis
- vision-transformer
- dinov3
- coral-ordinal-regression
pipeline_tag: image-classification
---

# FaceAge ClientScan

> **Face-only age estimation on LAGENDA 84k: MAE 3.555. For comparison, the task-specific state-of-the-art model, MiVOLO v2, reports MAE 3.65 using face + body.**

Age and gender estimation from face crops using a **DINOv3-ViT-L** backbone with CORAL ordinal regression.

## Performance (LAGENDA 84k benchmark)

| Model | Input | MAE ↓ | CS@5 ↑ | Gender Acc ↑ |
|-------|-------|--------|--------|-------------|
| **FaceAge ClientScan (ours)** | **face-only** | **3.555** | **75.5%** | **97.75%** |
| MiVOLO v2 (paper) | face + body | 3.650 | 74.48% | 97.99% |
| MiVOLO v1 (paper) | face + body | 3.990 | 71.27% | 97.36% |
| MiVOLO v2 (measured, face+body) | face + body | 3.859 | 76.5% | 96.96% |
| MiVOLO v2 (measured, face-only) | face-only | 4.224 | 69.7% | 96.05% |

**Key result**: FaceAge ClientScan achieves **MAE=3.555** using only the face crop, with no body information needed. This outperforms MiVOLO v2's paper claim of 3.650, which requires both face and body bounding boxes.

### Per age-group MAE (FaceAge ClientScan vs MiVOLO v2 best)

| Age Group | n | MiVOLO v2 best | **FaceAge ClientScan** | Delta |
|-----------|--:|---------------:|-------------------:|------:|
| 0–12 | 15,369 | 1.677 | **1.548** | ✅ −0.129 |
| 13–17 | 3,930 | 3.365 | **2.845** | ✅ −0.520 |
| 18–25 | 9,975 | 2.989 | **2.877** | ✅ −0.112 |
| 26–35 | 10,303 | **3.348** | 3.775 | ❌ +0.427 |
| 36–50 | 19,234 | 4.484 | **4.195** | ✅ −0.289 |
| 51–65 | 16,350 | 4.794 | **4.329** | ✅ −0.465 |
| 66+ | 9,031 | 6.310 | **5.013** | ✅ −1.297 |
| **Overall** | **84,192** | 3.859 | **3.555** | ✅ −0.304 |

FaceAge ClientScan wins **6/7 age groups**. The only group where MiVOLO v2 leads is 26–35, where body context likely helps.
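The MAE and CS@5 columns above can be computed from per-image predictions with a few lines of NumPy. The sketch below assumes CS@5 is the fraction of images whose absolute age error is at most 5 years (the usual cumulative-score definition; treating the boundary as inclusive is an assumption here). The function names `mae_and_cs` and `weighted_overall` are illustrative and not part of this repository. It also checks that the Overall row of the per-age-group table is the sample-weighted mean of the group MAEs.

```python
import numpy as np


def mae_and_cs(pred_age, true_age, threshold=5.0):
    """Return (MAE, CS@threshold) for two equal-length sequences of ages."""
    err = np.abs(np.asarray(pred_age, dtype=float) - np.asarray(true_age, dtype=float))
    mae = float(err.mean())
    # Assumed definition: share of samples with absolute error <= threshold years.
    cs = float((err <= threshold).mean())
    return mae, cs


def weighted_overall(counts, group_mae):
    """Combine per-group MAEs into one MAE, weighting each group by its size."""
    counts = np.asarray(counts, dtype=float)
    group_mae = np.asarray(group_mae, dtype=float)
    return float((counts * group_mae).sum() / counts.sum())


# Toy predictions (made-up numbers, only to show the call).
pred = [24.0, 31.5, 8.0, 70.0]
true = [22, 38, 9, 66]
mae, cs5 = mae_and_cs(pred, true)
print(f"MAE={mae:.3f}  CS@5={cs5:.0%}")

# Group sizes and FaceAge ClientScan MAEs copied from the table above.
counts = [15369, 3930, 9975, 10303, 19234, 16350, 9031]
ours = [1.548, 2.845, 2.877, 3.775, 4.195, 4.329, 5.013]
overall = weighted_overall(counts, ours)
print(f"Overall MAE={overall:.3f}")
```

With the table's group sizes, the weighted mean comes out at 3.555, matching the Overall row.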
### Results on other age datasets

The evaluation protocol follows this paper: [Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures](https://arxiv.org/pdf/2602.07815)

| Dataset | MAE (ours, ONNX) |
|---------|-----------|
| UTK | 5.225 |
| IMDB | 5.119 |
| MORPH | 4.235 |
| AFAD | 3.520 |
| CACD | 5.314 |
| FG-NET | 4.550 |
| APPA | 5.172 |
| AgeDB | 5.933 |
| **Avg** | **4.884** |

## Architecture

```
Face [B, 3, 224, 224]  (+ 10% proportional bbox padding)
    ↓
DINOv3-ViT-L/16  (307M params, pretrained on LVD-1.68B)
    ↓
pooler_output [B, 1024]
    ↓
LayerNorm → Linear(1024→512) → GELU → Dropout(0.1)   [B, 512]
    ├── age_head:    Linear(512, 100) → CORAL → age ∈ [0, 100]
    └── gender_head: Linear(512, 2)   → softmax → {female, male}
```

**CORAL ordinal regression**: age = Σ σ(logit_k) for k=0..99. This exploits the ordinal structure of ages for better calibration than standard cross-entropy.

**Important**: use 10% proportional padding when cropping the face bbox before inference. This matches the training setup and is required to reproduce MAE=3.555.

## Face crop helper (required for MAE=3.555)

Apply **10% proportional padding** to the face bbox before passing the crop to the model. This is critical: without it, MAE degrades to ~3.758.

```python
import numpy as np


def crop_face(image_rgb: np.ndarray, x0: float, y0: float,
              x1: float, y1: float, pad: float = 0.10) -> np.ndarray:
    """Crop face bbox with proportional padding.

    pad=0.10 → 10% each side.
    """
    h, w = image_rgb.shape[:2]
    # Padding is proportional to the bbox width and height
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    # Expand the bbox and clamp it to the image bounds
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]
```

## Batched inference (PyTorch, recommended for benchmarks)

```python
import torch
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import AutoImageProcessor, AutoModel

# Limit threads: on big servers PyTorch over-subscribes cores
torch.set_num_threads(8)

BATCH_SIZE = 32   # increase if you have enough RAM
NUM_WORKERS = 8   # parallel image loading

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()


def crop_face(image_rgb, x0, y0, x1, y1, pad=0.10):
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]


class FaceDataset(Dataset):
    def __init__(self, df, root, processor):
        self.df = df.reset_index(drop=True)
        self.root = root  # plain string prefix; keep the trailing "/"
        self.processor = processor

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_rgb = np.array(Image.open(self.root + row.img_name).convert("RGB"))
        face = crop_face(img_rgb, row.face_x0, row.face_y0, row.face_x1, row.face_y1)
        pixel_values = self.processor(images=Image.fromarray(face),
                                      return_tensors="pt")["pixel_values"][0]
        return pixel_values, row.img_name


df = pd.read_csv("lagenda_annotation.csv")
df = df[df.age != -1].reset_index(drop=True)  # drop rows without an age label
ROOT = "/path/to/lagenda/"

dataset = FaceDataset(df, ROOT, processor)
loader = DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, pin_memory=False)

results = {}
with torch.no_grad():
    for pixel_values, img_names in tqdm(loader, desc="Inference"):
        outputs = model(pixel_values=pixel_values)
        ages = outputs.age_output.tolist()
        genders = outputs.gender_class_idx.tolist()
        for name, age, g in zip(img_names, ages, genders):
            results[name] = {"age": age, "gender": "male" if g == 1 else "female"}
```

> **Tip**: for even faster CPU inference, use the ONNX version (`infer_onnx.py`), which is ~3× faster than PyTorch on CPU.

For a single image, reuse the `crop_face` helper defined above:

```python
import torch
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()

# 1. Load full image and apply 10% padded crop (crop_face: see the helper above)
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run model
inputs = processor(images=Image.fromarray(face), return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

age = outputs.age_output.item()
gender = "male" if outputs.gender_class_idx.item() == 1 else "female"
conf = outputs.gender_probs[0, outputs.gender_class_idx.item()].item()
print(f"Age: {age:.1f}  Gender: {gender} ({conf:.0%})")
```

## Usage (ONNX, no PyTorch needed)

> **Standalone inference script**: [github.com/TrungThanhTran/faceage-ClientScan](https://github.com/TrungThanhTran/faceage-ClientScan), which includes `infer_onnx.py` with auto-download and both single-image and LAGENDA benchmark modes.
```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

# Single image
python infer_onnx.py --image photo.jpg --bbox 120 80 300 320

# LAGENDA MAE benchmark
python infer_onnx.py \
    --lagenda_dir /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size 256
```

Or use the Python API directly:

```python
from infer_onnx import FaceAgeModel, crop_face
import numpy as np
from PIL import Image

model = FaceAgeModel()  # auto-downloads ONNX from HuggingFace

img = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img, x0=120, y0=80, x1=300, y1=320)
out = model.predict(face)
print(out)  # {'age': 34.2, 'gender': 'male', 'gender_conf': 0.981}
```

Or raw ONNX (manual):

```python
import numpy as np
import onnxruntime as ort
from PIL import Image

sess = ort.InferenceSession("faceage_dino_fp32.onnx", providers=["CPUExecutionProvider"])
in_name = sess.get_inputs()[0].name

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def crop_face(image_rgb, x0, y0, x1, y1, pad=0.10):
    """Same 10% padded crop as in the face crop helper section."""
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]


def preprocess(img_rgb: np.ndarray) -> np.ndarray:
    """HxWx3 uint8 RGB → [1,3,224,224] float32, ImageNet normalised."""
    pil = Image.fromarray(img_rgb).resize((224, 224), Image.BICUBIC)
    arr = np.asarray(pil, dtype=np.float32) / 255.0
    arr = (arr - MEAN) / STD
    return arr.transpose(2, 0, 1)[np.newaxis]


# 1. Load image, apply 10% padded crop
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run ONNX
age_logits, gender_logits = sess.run(None, {in_name: preprocess(face)})

age = float((1 / (1 + np.exp(-age_logits[0]))).sum())  # CORAL decode: sum of sigmoids
gender = "male" if gender_logits[0].argmax() == 1 else "female"
print(f"Age: {age:.1f}  Gender: {gender}")
```

## Reproducing MAE=3.555

```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

python infer_onnx.py \
    --lagenda_dir /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size 256
```

## Training

Multi-phase fine-tuning on DINOv3-ViT-L:

| Phase | Backbone | LR | Key change |
|-------|----------|----|-----------|
| 1 | Frozen (all 24 blocks) | 1e-3 | Head training only |
| 2 | Top 4 blocks unfrozen | 1e-4 | Partial fine-tuning |
| 3 | All blocks unfrozen | 3e-5 | Full fine-tuning |
| 4 | All blocks unfrozen | 3e-6 | Age-group reweighting, best epoch MAE=3.555 |

Training data: our own collection (4M images).

## Citation

```bibtex
@misc{faceage-clientscan-2026,
  title  = {FaceAge ClientScan: Face-Only Age \& Gender Estimation},
  author = {Trung Thanh Tran},
  year   = {2026},
  url    = {https://huggingface.co/TrungTran/faceage_ClientScan}
}
```

Related work:
- DINOv3: Meta AI, "DINOv3: Scaling Up Vision Foundation Models", 2025
- MiVOLO: Kuprashevich & Tolstykh, arXiv 2307.04616
- LAGENDA: introduced with MiVOLO (Kuprashevich & Tolstykh, arXiv 2307.04616), 2023
- CORAL: Cao et al., Pattern Recognition Letters 2020