---
license: apache-2.0
tags:
  - age-estimation
  - gender-classification
  - face-analysis
  - vision-transformer
  - dinov3
  - coral-ordinal-regression
pipeline_tag: image-classification
---

# FaceAge ClientScan

> **Face-only age estimation on LAGENDA 84k: MAE 3.555. The state-of-the-art task-specific model, MiVOLO v2, reports MAE 3.65 with face + body input.**

Age and gender estimation from face crops using a **DINOv3-ViT-L** backbone with CORAL ordinal regression.

## Performance (LAGENDA 84k benchmark)

| Model | Input | MAE ↓ | CS@5 ↑ | Gender Acc ↑ |
|-------|-------|--------|--------|-------------|
| **FaceAge ClientScan (ours)** | **face-only** | **3.555** | **75.5%** | **97.75%** |
| MiVOLO v2 (paper) | face + body | 3.650 | 74.48% | 97.99% |
| MiVOLO v1 (paper) | face + body | 3.990 | 71.27% | 97.36% |
| MiVOLO v2 (measured, face+body) | face + body | 3.859 | 76.5% | 96.96% |
| MiVOLO v2 (measured, face-only) | face-only | 4.224 | 69.7% | 96.05% |

**Key result**: FaceAge ClientScan achieves **MAE=3.555** using only the face crop, with no body information, and outperforms the MAE of 3.650 reported in the MiVOLO v2 paper, which requires both face and body bounding boxes.
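
For readers unfamiliar with the metrics in the table, the sketch below computes MAE and CS@5 with NumPy. CS@5 is taken here to be the cumulative score, i.e. the share of predictions whose absolute error is at most 5 years; the `<=` convention is the usual one but is an assumption about this benchmark. The function name `age_metrics` and the toy arrays are illustrative and not part of the released code.

```python
import numpy as np

def age_metrics(pred, true, k=5.0):
    """Return (MAE, CS@k) for predicted and ground-truth ages."""
    err = np.abs(np.asarray(pred, dtype=np.float64) - np.asarray(true, dtype=np.float64))
    mae = float(err.mean())            # mean absolute error in years
    cs_k = float((err <= k).mean())    # fraction of samples within k years
    return mae, cs_k

# Toy data, made up for illustration (not LAGENDA results)
pred = [23.0, 41.5, 8.0, 67.0]
true = [25.0, 35.0, 7.0, 60.0]
mae, cs5 = age_metrics(pred, true)
print(f"MAE={mae:.3f}  CS@5={cs5:.1%}")
```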

### Per age-group MAE (FaceAge ClientScan vs MiVOLO v2 best)

| Age Group | n | MiVOLO v2 best | **FaceAge ClientScan** | Delta |
|-----------|--:|---------------:|-------------------:|------:|
| 0–12      | 15,369 | 1.677 | **1.548** | ✅ −0.129 |
| 13–17     | 3,930  | 3.365 | **2.845** | ✅ −0.520 |
| 18–25     | 9,975  | 2.989 | **2.877** | ✅ −0.112 |
| 26–35     | 10,303 | **3.348** | 3.775 | ❌ +0.427 |
| 36–50     | 19,234 | 4.484 | **4.195** | ✅ −0.289 |
| 51–65     | 16,350 | 4.794 | **4.329** | ✅ −0.465 |
| 66+       | 9,031  | 6.310 | **5.013** | ✅ −1.297 |
| **Overall** | **84,192** | 3.859 | **3.555** | ✅ −0.304 |

FaceAge ClientScan wins **6/7 age groups**. The only group where MiVOLO v2 leads is 26–35, where body context likely helps.
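
As a quick consistency check, the overall MAE should equal the sample-weighted mean of the per-group MAEs. The sketch below recomputes it from the numbers in the table above; the dictionary layout and variable names are illustrative.

```python
# (n, MiVOLO v2 best MAE, FaceAge ClientScan MAE), copied from the per-group table
groups = {
    "0-12":  (15369, 1.677, 1.548),
    "13-17": (3930,  3.365, 2.845),
    "18-25": (9975,  2.989, 2.877),
    "26-35": (10303, 3.348, 3.775),
    "36-50": (19234, 4.484, 4.195),
    "51-65": (16350, 4.794, 4.329),
    "66+":   (9031,  6.310, 5.013),
}

total_n = sum(n for n, _, _ in groups.values())
# Weight each group's MAE by its sample count
mivolo_overall = sum(n * m for n, m, _ in groups.values()) / total_n
ours_overall = sum(n * o for n, _, o in groups.values()) / total_n
# Count groups where FaceAge ClientScan has the lower MAE
wins = sum(o < m for _, m, o in groups.values())

print(total_n, round(mivolo_overall, 3), round(ours_overall, 3), wins)
```

The recomputed values match the **Overall** row (84,192 samples, 3.859 vs 3.555) and the 6/7 group count.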


### Results on other age datasets
The benchmark protocol follows this paper: [Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures](https://arxiv.org/pdf/2602.07815)

| Dataset | Ours (ONNX) MAE |
|---------|-----------|
| UTK | 5.225 |
| IMDB | 5.119 |
| MORPH | 4.235 |
| AFAD | 3.520 |
| CACD | 5.314 |
| FG-NET | 4.550 |
| APPA | 5.172 |
| AgeDB | 5.933 |
| **Avg** | **4.884** |


## Architecture

```
Face [B, 3, 224, 224]  (+ 10% proportional bbox padding)
    ↓
DINOv3-ViT-L/16  (307M params, pretrained on LVD-1.68B)
    ↓ pooler_output
[B, 1024]
    ↓ LayerNorm → Linear(1024→512) → GELU → Dropout(0.1)
[B, 512]
    ├── age_head:    Linear(512, 100) → CORAL → age ∈ [0, 100]
    └── gender_head: Linear(512, 2)  → softmax → {female, male}
```

**CORAL ordinal regression**: age = Σ σ(logit_k) for k=0..99. Exploits the ordinal structure of ages for better calibration than standard cross-entropy.
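
The decoding formula above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the age head emits 100 raw logits per face and covers only the decoding step, not the training loss; `coral_decode` is a hypothetical name.

```python
import numpy as np

def coral_decode(age_logits: np.ndarray) -> np.ndarray:
    """Decode [B, K] ordinal logits to ages: age = sum_k sigmoid(logit_k)."""
    # Each sigmoid is read as the probability that the age exceeds threshold k
    probs = 1.0 / (1.0 + np.exp(-age_logits.astype(np.float64)))
    return probs.sum(axis=-1)

# Synthetic logits: confident "yes" for the first 30 thresholds, confident "no" after
logits = np.full((1, 100), -20.0)
logits[0, :30] = 20.0
print(coral_decode(logits))  # approximately 30
```

Because every term lies in (0, 1), the decoded age is bounded by the number of thresholds, which is why 100 logits give an output range of [0, 100].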

**Important**: use 10% proportional padding when cropping the face bbox before inference — this matches the training setup and is required to reproduce MAE=3.555.

## Face crop helper (required for MAE=3.555)

Apply **10% proportional padding** before passing to the model. This is critical — without it MAE degrades to ~3.758.

```python
import numpy as np
from PIL import Image

def crop_face(image_rgb: np.ndarray,
              x0: float, y0: float, x1: float, y1: float,
              pad: float = 0.10) -> np.ndarray:
    """Crop face bbox with proportional padding. pad=0.10 → 10% each side."""
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw));  y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw));  y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]
```

## Batched inference (PyTorch — recommended for benchmarks)

```python
import torch
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import AutoImageProcessor, AutoModel

# Limit threads — on big servers PyTorch over-subscribes cores
torch.set_num_threads(8)

BATCH_SIZE = 32   # increase if you have enough RAM
NUM_WORKERS = 8   # parallel image loading

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()


def crop_face(image_rgb, x0, y0, x1, y1, pad=0.10):
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]


class FaceDataset(Dataset):
    def __init__(self, df, root, processor):
        self.df = df.reset_index(drop=True)
        self.root = root
        self.processor = processor

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_rgb = np.array(Image.open(self.root + row.img_name).convert("RGB"))
        face = crop_face(img_rgb, row.face_x0, row.face_y0, row.face_x1, row.face_y1)
        pixel_values = self.processor(images=Image.fromarray(face),
                                      return_tensors="pt")["pixel_values"][0]
        return pixel_values, row.img_name


df = pd.read_csv("lagenda_annotation.csv")
df = df[df.age != -1].reset_index(drop=True)
ROOT = "/path/to/lagenda/"

dataset = FaceDataset(df, ROOT, processor)
loader  = DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS,
                     pin_memory=False)

results = {}
with torch.no_grad():
    for pixel_values, img_names in tqdm(loader, desc="Inference"):
        outputs = model(pixel_values=pixel_values)
        ages    = outputs.age_output.tolist()
        genders = outputs.gender_class_idx.tolist()
        for name, age, g in zip(img_names, ages, genders):
            results[name] = {"age": age, "gender": "male" if g == 1 else "female"}
```

> **Tip**: for even faster CPU inference use the ONNX version (`infer_onnx.py`) which is ~3× faster than PyTorch on CPU.

## Usage (PyTorch, single image)

This snippet reuses the `crop_face` helper defined in the face crop section above.

```python
import torch
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()

# 1. Load full image and apply 10% padded crop
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face    = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run model
inputs = processor(images=Image.fromarray(face), return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

age    = outputs.age_output.item()
gender = "male" if outputs.gender_class_idx.item() == 1 else "female"
conf   = outputs.gender_probs[0, outputs.gender_class_idx.item()].item()
print(f"Age: {age:.1f}  Gender: {gender} ({conf:.0%})")
```

## Usage (ONNX — no PyTorch needed)

> **Standalone inference script**: [github.com/TrungThanhTran/faceage-ClientScan](https://github.com/TrungThanhTran/faceage-ClientScan)
> — includes `infer_onnx.py` with auto-download, single image + LAGENDA benchmark modes.

```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

# Single image
python infer_onnx.py --image photo.jpg --bbox 120 80 300 320

# LAGENDA MAE benchmark
python infer_onnx.py \
    --lagenda_dir   /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size    256
```

Or use the Python API directly:

```python
from infer_onnx import FaceAgeModel, crop_face
import numpy as np
from PIL import Image

model = FaceAgeModel()   # auto-downloads ONNX from HuggingFace

img  = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img, x0=120, y0=80, x1=300, y1=320)
out  = model.predict(face)
print(out)  # {'age': 34.2, 'gender': 'male', 'gender_conf': 0.981}
```

Or raw ONNX (manual):

```python
import numpy as np
import onnxruntime as ort
from PIL import Image

sess    = ort.InferenceSession("faceage_dino_fp32.onnx",
                               providers=["CPUExecutionProvider"])
in_name = sess.get_inputs()[0].name

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img_rgb: np.ndarray) -> np.ndarray:
    """HxWx3 uint8 RGB → [1,3,224,224] float32, ImageNet normalised."""
    pil = Image.fromarray(img_rgb).resize((224, 224), Image.BICUBIC)
    arr = np.asarray(pil, dtype=np.float32) / 255.0
    arr = (arr - MEAN) / STD
    # HWC → NCHW with a batch axis, copied into a contiguous buffer
    return np.ascontiguousarray(arr.transpose(2, 0, 1)[np.newaxis])

# 1. Load image, apply 10% padded crop
#    (crop_face is the helper from the "Face crop helper" section)
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face    = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run ONNX
age_logits, gender_logits = sess.run(None, {in_name: preprocess(face)})
age    = float((1 / (1 + np.exp(-age_logits[0]))).sum())   # CORAL decode
gender = "male" if gender_logits[0].argmax() == 1 else "female"
print(f"Age: {age:.1f}  Gender: {gender}")
```
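
The raw ONNX snippet reports only the argmax class. To obtain a confidence value similar to `gender_conf` in the Python API, one option is to apply a softmax to the gender logits. The sketch below assumes the second ONNX output holds raw logits with index 1 meaning male, as in the snippet above; `gender_from_logits` is a hypothetical helper and the example logits are made up.

```python
import numpy as np

def gender_from_logits(gender_logits: np.ndarray):
    """Map a [2] logit vector to (label, confidence) via a stable softmax."""
    z = gender_logits.astype(np.float64)
    z = z - z.max()                          # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    idx = int(probs.argmax())
    label = "male" if idx == 1 else "female"  # index order assumed: 0=female, 1=male
    return label, float(probs[idx])

# Made-up logits standing in for gender_logits[0] from sess.run(...)
label, conf = gender_from_logits(np.array([-1.2, 2.7], dtype=np.float32))
print(f"Gender: {label} ({conf:.0%})")
```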

## Reproducing MAE=3.555

```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

python infer_onnx.py \
    --lagenda_dir   /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size    256
```

## Training

Multi-phase fine-tuning on DINOv3-ViT-L:

| Phase | Backbone | LR | Key change |
|-------|----------|----|-----------|
| 1 | Frozen (all 24 blocks) | 1e-3 | Head training only |
| 2 | Top 4 blocks unfrozen | 1e-4 | Partial fine-tuning |
| 3 | All blocks unfrozen | 3e-5 | Full fine-tuning |
| 4 | All blocks | 3e-6 | Age-group reweighting, best epoch MAE=3.555 |

Training data: our own collection (4M images).
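
The freezing schedule in the table can be sketched in PyTorch as follows. This is not the actual training script: `ToyBackbone` is a small stand-in for the 24-block ViT, `unfreeze_top` is a hypothetical helper, the AdamW optimizer is an assumption, and the age-group reweighting of phase 4 is not shown.

```python
import torch
from torch import nn

class ToyBackbone(nn.Module):
    """Stand-in for a ViT: a stack of 24 blocks (plain Linear layers here)."""
    def __init__(self, dim=16, depth=24):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        for blk in self.blocks:
            x = torch.relu(blk(x))
        return x

def unfreeze_top(backbone: ToyBackbone, n_top: int) -> None:
    """Freeze every block, then re-enable gradients for the last n_top blocks."""
    for p in backbone.parameters():
        p.requires_grad = False
    if n_top > 0:  # guard: blocks[-0:] would select every block
        for blk in backbone.blocks[-n_top:]:
            for p in blk.parameters():
                p.requires_grad = True

# (phase, number of unfrozen top blocks, learning rate) taken from the table above
PHASES = [(1, 0, 1e-3), (2, 4, 1e-4), (3, 24, 3e-5), (4, 24, 3e-6)]

backbone, head = ToyBackbone(), nn.Linear(16, 100)
for phase, n_top, lr in PHASES:
    unfreeze_top(backbone, n_top)
    params = [p for p in list(backbone.parameters()) + list(head.parameters())
              if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=lr)  # optimizer choice is an assumption
    n_trainable = sum(p.numel() for p in params)
    print(f"phase {phase}: lr={lr:g}, trainable params={n_trainable}")
    # ... run this phase's training epochs here ...
```

Rebuilding the optimizer at each phase keeps its parameter list in sync with the newly unfrozen blocks.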

## Citation

```bibtex
@misc{faceage-clientscan-2026,
  title  = {FaceAge ClientScan: Face-Only Age \& Gender Estimation},
  author = {Trung Thanh Tran},
  year   = {2026},
  url    = {https://huggingface.co/TrungTran/faceage_ClientScan}
}
```

Related work:
- DINOv3: Meta AI, "DINOv3: Scaling Up Vision Foundation Models", 2025
- MiVOLO: Kuprashevich & Tolstykh, arXiv 2307.04616
- LAGENDA: Kuprashevich & Tolstykh, benchmark introduced in the MiVOLO paper (arXiv 2307.04616), 2023
- CORAL: Cao et al., Pattern Recognition Letters 2020