# EgoNormia-Cosmos-Reason2-2B-v5-shortcot
Multi-task SFT fine-tune of nvidia/Cosmos-Reason2-2B on the EgoNormia social norm benchmark. This v5 variant keeps the same 3-task setup as v4, but compresses the reasoning traces into short 1-sentence CoT supervision.
## Training
| Parameter | Value |
|---|---|
| Base model | nvidia/Cosmos-Reason2-2B (Qwen3-VL-2B) |
| Tasks | Action + Justification + Sensibility (multi-task) |
| Train samples | 4959 |
| Training file | data/egonormia_llava_shortcot_train.json |
| CoT style | Short CoT, 1-sentence distilled traces |
| CoT length | median ~25 words (compressed from ~64 words) |
| Epochs | 3 |
| Global batch | 64 (8 replicas x 8 per replica) |
| Learning rate | 1e-5 (cosine decay, 3% warmup) |
| Context length | 8192 |
| Video input | video_prev.mp4, 8 frames |
| Hardware | 8x A100-SXM4-80GB |
| Seed 1 run dir | outputs/egonormia_sft/20260228141559/ |
| Seed 2 run dir | outputs/egonormia_sft/20260301002022/ |
| Uploaded checkpoint | seed2 step_150 |
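The training file follows the LLaVA-style conversation format referenced by its name. A hypothetical single record is sketched below; the field names follow the common LLaVA convention and the prompt/answer wording is purely illustrative, not taken from the actual file:

```python
import json

# Hypothetical multi-task sample in LLaVA-style conversation format.
# Field names and the short-CoT answer text are illustrative only.
sample = {
    "video": "video_prev.mp4",
    "conversations": [
        {
            "from": "human",
            "value": "<video>\nWhich action is most socially appropriate next, and why?",
        },
        {
            "from": "gpt",
            "value": (
                "The person is handing over an item, so accepting it politely "
                "respects the exchange norm. Action: (B). Justification: (3)."
            ),
        },
    ],
}

# Round-trip through JSON to confirm the record is serializable.
decoded = json.loads(json.dumps(sample))
print(decoded["video"])
```

The short 1-sentence CoT lives inline in the assistant turn, ahead of the final answer labels, which is what distinguishes this v5 data from the longer v4 traces.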
## Evaluation (200 verified test samples)

| Model | Action | Justification | Both | S-IoU |
|---|---|---|---|---|
| Zero-shot | 58.5% | 81.5% | 51.0% | 0.516 |
| v3 best (step_175) | 78.0% | 97.0% | 77.0% | 0.664 |
| v5 seed1 (step_155) | 80.5% | 95.5% | 78.5% | 0.618 |
| v5 seed2 (step_150) | 82.0% | 95.5% | 78.5% | 0.634 |
Average over the two seed-wise best checkpoints:
- Action: 81.25%
- Justification: 95.5%
- Both: 78.5%
- S-IoU: 0.626
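These figures are plain means over the two seed-wise best checkpoints, which can be reproduced directly from the table above:

```python
# Per-seed best checkpoints: (action %, justification %, both %, S-IoU).
seed1 = (80.5, 95.5, 78.5, 0.618)  # seed1 step_155
seed2 = (82.0, 95.5, 78.5, 0.634)  # seed2 step_150

mean = [round((a + b) / 2, 4) for a, b in zip(seed1, seed2)]
print(mean)  # [81.25, 95.5, 78.5, 0.626]
```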
## Robustness (option shuffle)

| Checkpoint | Delta Action | Delta S-IoU | Sign test p | Verdict |
|---|---|---|---|---|
| seed1 step_155 | -2.0pt | -0.035 | 0.585 | pass |
| seed2 step_150 | -5.0pt | -0.027 | 0.076 | pass |
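The shuffle check compares per-sample action correctness before and after permuting the answer options; a two-sided exact sign test on the discordant flips can be sketched as follows (the flip counts in the example are hypothetical, not this run's actual values):

```python
from math import comb

def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided exact sign test: under H0 each discordant flip is a fair coin."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: 4 samples flipped correct -> wrong, 2 flipped the other way.
print(sign_test_p(2, 4))  # 0.6875
```

A large p-value means the before/after differences are consistent with chance, which is the "pass" criterion used in the table.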
## Notes

- v5 recovers the robustness lost in v4 while keeping stronger action accuracy than v3.
- Best S-IoU still trails v3 (0.634 vs 0.664), so the gain is mainly in action / joint accuracy rather than sensibility quality.
- On this run family, explicit think-mode inference hurts performance: for seed2, no-think `step_150` reaches 82.0% action / 78.5% both, while think mode peaks lower at 78.0% action / 72.5% both.
## Usage

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "robertzty/EgoNormia-Cosmos-Reason2-2B-v5-shortcot",
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("robertzty/EgoNormia-Cosmos-Reason2-2B-v5-shortcot")
```
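The processor then expects a chat-style message list with the prior-context video clip attached. A minimal sketch of building that input follows; the prompt wording is illustrative, and the message dict would be passed to `processor.apply_chat_template` and `model.generate` for actual inference:

```python
# Qwen3-VL-style chat message with the prior-context clip attached.
# The prompt text is illustrative only.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "video_prev.mp4"},
            {
                "type": "text",
                "text": "Choose the most norm-compliant next action and justify it in one sentence.",
            },
        ],
    }
]

video_parts = [c for c in messages[0]["content"] if c["type"] == "video"]
print(len(video_parts))  # 1
```

The model was trained on 8 uniformly sampled frames per clip, so matching that frame count at inference time is the safest choice.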