Add --hub-resume and --step-offset to allow resuming directly from HF Hub PEFT checkpoints 8bd8552 shank committed 8 days ago
Revert speed optimizations to prioritize real-world model quality cf25957 shank committed 9 days ago
fix: use GitHub raw URLs for images so README renders on HF Space 3eb8edc shank committed about 1 month ago
chore: clean up local dev files and temporary virtual environments 59986c5 shank committed about 1 month ago
chore: clean up local dev files and temporary virtual environments 374c6cc shank committed about 1 month ago
fix: serialize bug_metadata as JSON to fix pyarrow mixed-type error 4668456 shank committed about 1 month ago
fix: upgrade bitsandbytes>=0.49.0 (triton.ops), switch to Qwen2.5-Coder-3B a2fa47a shank committed on Apr 26
fix: torch at build time, remove mergekit (conflicts accelerate/peft/trl) 2bfaf77 shank committed on Apr 26
fix: remove wandb - click conflict with gradio causes resolution-too-deep 2005cd2 shank committed on Apr 26
chore: normalize dataset inputs and fix mergekit dependency for TRL 0.14.0 e67270e shank committed on Apr 25
Add HANDOVER.md: full project state, deps, training instructions, known fixes 97aad17 shank committed on Apr 25
Auto-detect GPU: bfloat16+batch2+gen8 on A100, float16+batch1+gen4 on T4; same script works on both ea6fe4e shank committed on Apr 25
Reduce max_completion_length to 160 for T4 speed: target 1000 steps in <8hrs 9487853 shank committed on Apr 25
Optimize for Kaggle P100: float16, batch=1, grad_accum=8, num_gen=4, max_completion=256, lora_r=8 73f957d shank committed on Apr 25
Fix GRPOConfig: rename max_new_tokens to max_completion_length for trl==0.14.0 8b16369 shank committed on Apr 25
Stabilize Space runtime: pin ML deps and disable runtime package drift 663b8db shank committed on Apr 25
Pin torch to cu121 build + use model.device instead of hardcoded cuda string 8f291e0 shank committed on Apr 25
Replace unsloth with bitsandbytes+peft: fixes CUDA driver incompatibility on HF A100 c325ad7 shank committed on Apr 25
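
Commit ea6fe4e describes choosing training settings from the detected GPU (bfloat16, batch 2, 8 generations on A100; float16, batch 1, 4 generations on T4). The repository's actual detection code is not shown in this log, so the following is only a minimal sketch of how such a mapping could look. The names `TrainSettings` and `pick_settings`, the substring match on "A100", and the use of the T4 values as a fallback for any other card are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSettings:
    """Hypothetical container for the three values named in commit ea6fe4e."""
    dtype: str            # name of the floating-point dtype to request
    batch_size: int       # per-device batch size
    num_generations: int  # completions sampled per prompt


def pick_settings(gpu_name: str) -> TrainSettings:
    """Map a GPU name string to the settings quoted in the commit message."""
    # Assumption: detection is a simple substring match on the device name;
    # the real script may use a different check.
    if "A100" in gpu_name.upper():
        return TrainSettings(dtype="bfloat16", batch_size=2, num_generations=8)
    # Assumption: the T4 values double as the fallback for any other card.
    return TrainSettings(dtype="float16", batch_size=1, num_generations=4)


if __name__ == "__main__":
    for name in ("NVIDIA A100-SXM4-40GB", "Tesla T4"):
        print(name, pick_settings(name))
```

In a real run the name string would typically come from `torch.cuda.get_device_name(0)`; keeping the mapping as a pure function means it can be checked without a GPU present.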