Does batch_size=128 during training refer to the global or per-GPU batch size, and was the model trained with DeepSpeed ZeRO-3?

#13

by Hipanda - opened Mar 29, 2025

Hi, thank you for your awesome work. I have a question about the training batch size in the paper: does batch_size=128 during training refer to the global batch size or the per-GPU batch size?
Looking forward to your reply!

vivian345

May 16, 2025

They used gradient checkpointing to conserve GPU memory, so they could set batch_size=128.
I would also like to ask how to enable gradient checkpointing in ms-swift. Looking forward to replies.
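
For anyone comparing the two readings of batch_size=128, here is a minimal sketch of how the global (effective) batch size usually relates to the per-GPU batch size in data-parallel training. The GPU count and accumulation steps below are hypothetical values chosen for illustration, not numbers from the paper; the helper name `global_batch_size` is also made up for this example.

```python
def global_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int = 1) -> int:
    """Effective number of samples per optimizer step in data-parallel training."""
    return per_gpu_batch * num_gpus * grad_accum_steps


# Hypothetical setup: 8 GPUs (an assumption, not stated in the paper).
NUM_GPUS = 8

# Reading 1: 128 is the global batch size, e.g. 16 samples per GPU, no accumulation.
print(global_batch_size(per_gpu_batch=16, num_gpus=NUM_GPUS))

# Same global batch of 128 with a smaller per-GPU batch and gradient accumulation.
print(global_batch_size(per_gpu_batch=4, num_gpus=NUM_GPUS, grad_accum_steps=4))

# Reading 2: 128 is the per-GPU batch size, so the global batch is much larger.
print(global_batch_size(per_gpu_batch=128, num_gpus=NUM_GPUS))
```

DeepSpeed configs follow the same relation: `train_batch_size` equals `train_micro_batch_size_per_gpu` times `gradient_accumulation_steps` times the number of GPUs. Gradient checkpointing only lowers activation memory, so it can allow a larger per-GPU batch, but it does not by itself tell us which of the two readings the paper means.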
