Spaces: agentDebugger / AgentDebugger-training-v3
Commit History

shank committed 8 days ago

shank committed 9 days ago

shank committed 9 days ago

shank committed 9 days ago

shank committed 9 days ago

shank committed 9 days ago

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 26

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25

shank committed on Apr 25


Add --hub-resume and --step-offset to allow resuming directly from HF Hub PEFT checkpoints 8bd8552

Revert speed optimizations to prioritize real-world model quality cf25957

Optimize training speed for T4 GPUs 2c50d8a

Fix HF_TOKEN scope issue for CheckpointPushCallback ca28017

Made changes to train_grpo a07dc7f

Fix checkpoint persistence, add leaderboard and update HF links e160aa1

Update: Added final improvements for hackathon 713f336

Fix: batch%num_generations math 2b499e7

Cuda returns false fixed b8172c5

COMPUTE_DRIVE fix 77156dd

Fix: Removed BitsandBytes bdec91d

Fix: Fixed again again accb271

Fix: Fixed again 9864e61

Fix: Fixing Again 6747185

Fix: Fixing 18b4e8a

Fix: Trying to fix dependency issues 024f3c7

Fix: Fixed file cb09ef1

fix: serialize bug_metadata as JSON to fix pyarrow mixed-type error 4668456

fix: upgrade bitsandbytes>=0.49.0 (triton.ops), switch to Qwen2.5-Coder-3B a2fa47a

fix: torch at build time, remove mergekit (conflicts accelerate/peft/trl) 2bfaf77

fix: empty requirements.txt, install training deps at runtime 5d0b2d4

fix: remove wandb - click conflict with gradio causes resolution-too-deep 2005cd2

chore: normalize dataset inputs and fix mergekit dependency for TRL 0.14.0 e67270e

Auto-detect GPU: bfloat16+batch2+gen8 on A100, float16+batch1+gen4 on T4 — same script works on both ea6fe4e

Reduce max_completion_length to 160 for T4 speed: target 1000 steps in <8hrs 9487853

Optimize for Kaggle P100: float16, batch=1, grad_accum=8, num_gen=4, max_completion=256, lora_r=8 73f957d

Fix GRPOConfig: rename max_new_tokens to max_completion_length for trl==0.14.0 8b16369

Align gradio version with Hugging Face Space builder2 633a3b7

Stabilize Space runtime: pin ML deps and disable runtime package drift 663b8db

Pin torch to cu121 build + use model.device instead of hardcoded cuda string 8f291e0

Replace unsloth with bitsandbytes+peft: fixes CUDA driver incompatibility on HF A100 c325ad7

Reduce training to 500 steps with tightened curriculum for A10G budget ba8df98

Fix eval device selection with CUDA-safe fallback dc8001b

Optimize for A100 80GB: 8 generations, batch 4, lr 2e-5, dense logging 2b1fbf3

Restore full 1000-step training with original curriculum 1128de1

Reduce training to 500 steps with tightened curriculum for A10G budget 3152fa9

Add Gradio training monitor and fix subprocess python path b92ad01

Update: Started making changes for the hackathon a55c81d
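
Commit ea6fe4e describes choosing the dtype, batch size, and number of generations from the detected GPU. The commit log does not show how the script does this, so what follows is only a minimal sketch of the idea. The names `GpuProfile` and `pick_profile` are hypothetical, and treating any unrecognised GPU as a T4 is an assumption rather than something the history states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuProfile:
    """Training settings for one class of GPU (hypothetical container)."""
    dtype: str                  # name of the torch dtype to train in
    per_device_batch_size: int  # prompts per device per step
    num_generations: int        # completions sampled per prompt


# Values copied from the commit message for ea6fe4e.
A100_PROFILE = GpuProfile(dtype="bfloat16", per_device_batch_size=2, num_generations=8)
T4_PROFILE = GpuProfile(dtype="float16", per_device_batch_size=1, num_generations=4)


def pick_profile(gpu_name: str) -> GpuProfile:
    """Return the A100 profile for A100 cards, otherwise the T4 profile.

    Falling back to the T4 settings for every other GPU is an assumption
    of this sketch; the commit message only mentions A100 and T4.
    """
    if "A100" in gpu_name.upper():
        return A100_PROFILE
    return T4_PROFILE
```

In a training script the name would typically come from `torch.cuda.get_device_name(0)`, after first checking `torch.cuda.is_available()`. Keeping the selection in a pure function like this makes it testable on a machine with no GPU.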
