Experiment 2, Latin Square 1: CCT5 & COME on MCMD (Redesigned)
This repository contains the artifacts for Latin Square 1 of Experiment 2, the redesigned and reimplemented experiment that evaluates the DNN-based commit message generation baselines CCT5 and COME on the MCMD dataset. The experiment follows a more explicit and controlled evaluation protocol than the original study by Wu et al. (2025).
Models
CCT5
CCT5 is a code-change-oriented pre-trained model built on top of the T5 architecture, initialized from CodeT5 weights and further pre-trained on CodeChangeNet (~40GB, 1.5M diff/commit pairs). Released at ESEC/FSE 2023.
- Architecture: Encoder-decoder Transformer (T5-base → CodeT5 → CCT5)
- Pre-training corpus: CodeChangeNet (code diffs paired with commit messages)
- For MCMD: fine-tuned checkpoint from original CCT5 authors, trained on MCMD training set
COME
COME (Commit Message Generation with Modification Embedding) is a hybrid DNN system built on top of CodeT5 with three core components:
- Modification embedding: converts code changes into numerical vectors capturing code evolution
- Fine-tuned CodeT5: generates candidate commit messages from the embedded representation
- SVM-based decision algorithm: selects between the generated and retrieved candidate messages
COME was released at ISSTA 2023. It does not include additional large-scale pre-training beyond CodeT5.
- For MCMD: language-specific checkpoints released by original COME authors (one per language)
Dataset
MCMD: Multilingual Commit Message Dataset
| Property | Details |
|---|---|
| Languages | Java, C++, C#, Python, JavaScript |
| Repositories | Top 100 most-starred GitHub repos per language (500 total) |
| Total commits | ~1,094,115 |
| Date range | Up to January 1st, 2022 |
| Split | 80% train / 10% validation / 10% test |
| Authors | Liu et al. (2020) |
Repository Structure
Each run folder corresponds to a programming language evaluated in this Latin Square. Unlike Experiment 1, this experiment follows a more controlled protocol with explicit random seed documentation and multiple evaluation runs where applicable.
experiment2_ls1/
├── run_java/
│   ├── checkpoint/     # CCT5 and COME checkpoints for MCMD (Java)
│   ├── predictions/    # Generated commit messages on MCMD Java test set
│   └── metrics/        # BLEU, METEOR, ROUGE-L, CIDEr scores
├── run_cpp/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_csharp/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_python/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
└── run_javascript/
    ├── checkpoint/
    ├── predictions/
    └── metrics/
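A quick way to confirm that a local copy matches this layout is to check for every expected subfolder. The sketch below assumes only the folder names shown in the tree; the helper name `missing_folders` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder names taken from the tree above.
LANGUAGE_RUNS = ["run_java", "run_cpp", "run_csharp", "run_python", "run_javascript"]
SUBFOLDERS = ["checkpoint", "predictions", "metrics"]


def missing_folders(root):
    """Return the expected run subfolders that do not exist under root."""
    root = Path(root)
    missing = []
    for run in LANGUAGE_RUNS:
        for sub in SUBFOLDERS:
            folder = root / run / sub
            if not folder.is_dir():
                missing.append(folder.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    # "experiment2_ls1" is the root folder name shown in the tree.
    for path in missing_folders("experiment2_ls1"):
        print("missing:", path)
```

An empty result means all 15 subfolders (5 languages × 3 folders) are present.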
checkpoint/
Contains the model checkpoint files used for evaluation. For MCMD, these are the fine-tuned checkpoints released by the original authors of CCT5 and COME, trained on the MCMD training set for the corresponding language.
predictions/
Contains the generated commit messages produced by each model on the MCMD test set for the corresponding language. Files are stored as .txt with one prediction per line, aligned to the reference messages in the test set.
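Because predictions are aligned to references by line position, a loader should keep blank lines (an empty prediction still occupies its row) and fail loudly when the two files differ in length. The following is a minimal sketch under those assumptions; the function names and any file names passed to them are hypothetical, not the repository's actual script.

```python
from pathlib import Path


def read_lines(path):
    """Read a text file as a list of lines, keeping blank lines in place."""
    text = Path(path).read_text(encoding="utf-8")
    if not text:
        return []
    # Drop only the final newline so it does not create a spurious extra row.
    if text.endswith("\n"):
        text = text[:-1]
    # Split on "\n" only; commit messages may contain other separator characters.
    return text.split("\n")


def load_aligned(pred_path, ref_path):
    """Pair each prediction with the reference on the same line number."""
    preds = read_lines(pred_path)
    refs = read_lines(ref_path)
    if len(preds) != len(refs):
        raise ValueError(
            f"misaligned files: {len(preds)} predictions vs {len(refs)} references"
        )
    return list(zip(preds, refs))
```

Splitting on `"\n"` rather than using `str.splitlines()` is a deliberate choice here: `splitlines()` also breaks on characters such as form feeds, which could silently shift the alignment.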
metrics/
Contains the evaluation metric scores computed by comparing the predictions against the MCMD test set reference messages. Each file records BLEU, METEOR, ROUGE-L, and CIDEr scores per model and language under the redesigned evaluation protocol.
Evaluation Metrics
| Metric | Description |
|---|---|
| BLEU | Bilingual Evaluation Understudy: measures n-gram precision between generated and reference messages |
| METEOR | Metric for Evaluation of Translation with Explicit Ordering: extends BLEU with recall, stemming, and synonym matching via WordNet |
| ROUGE-L | Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap |
| CIDEr | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages |
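To make the ROUGE-L row concrete, here is a minimal sentence-level sketch based on the longest common subsequence. It assumes whitespace tokenization and an F-measure with a configurable `beta` (default 1.0); the evaluation scripts used in this experiment may tokenize, weight, or aggregate differently, so treat it as an illustration and not as the scoring code behind the reported numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score (beta is an assumed weighting)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

For the candidate `"fix null pointer in parser"` and the reference `"fix null pointer exception in parser"`, the LCS has 5 tokens, so precision is 5/5, recall is 5/6, and the F-score with `beta=1.0` is 10/11.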
Reference Results (Original Paper: Wu et al., 2025)
| Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|
| Java | CCT5 | 17.19 | 14.95 | 26.08 | 1.06 |
| Java | COME | 27.17 | 23.36 | 34.59 | 1.90 |
| C++ | CCT5 | 15.65 | 14.11 | 24.15 | 0.90 |
| C++ | COME | 27.29 | 23.29 | 33.33 | 1.91 |
| C# | CCT5 | 12.06 | 11.05 | 18.92 | 0.61 |
| C# | COME | 20.80 | 17.72 | 27.01 | 1.25 |
| Python | CCT5 | 15.12 | 13.70 | 23.79 | 0.85 |
| Python | COME | 23.17 | 19.99 | 30.48 | 1.50 |
| JavaScript | CCT5 | 19.76 | 17.51 | 28.73 | 1.33 |
| JavaScript | COME | 26.91 | 23.02 | 34.44 | 1.92 |
| Average | CCT5 | 15.96 | 14.26 | 24.33 | 0.95 |
| Average | COME | 25.07 | 21.48 | 31.97 | 1.70 |
These values serve as the baseline reference for comparison with the results produced under the redesigned protocol.
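The Average rows can be re-derived from the per-language rows as a sanity check. The sketch below copies the values from the table and assumes the averages are unweighted means over the five languages, rounded to two decimals; the names `REFERENCE` and `average_scores` are illustrative.

```python
# Per-language reference scores copied from the table above,
# in the order (BLEU, METEOR, ROUGE-L, CIDEr).
METRICS = ("BLEU", "METEOR", "ROUGE-L", "CIDEr")
REFERENCE = {
    "CCT5": {
        "Java": (17.19, 14.95, 26.08, 1.06),
        "C++": (15.65, 14.11, 24.15, 0.90),
        "C#": (12.06, 11.05, 18.92, 0.61),
        "Python": (15.12, 13.70, 23.79, 0.85),
        "JavaScript": (19.76, 17.51, 28.73, 1.33),
    },
    "COME": {
        "Java": (27.17, 23.36, 34.59, 1.90),
        "C++": (27.29, 23.29, 33.33, 1.91),
        "C#": (20.80, 17.72, 27.01, 1.25),
        "Python": (23.17, 19.99, 30.48, 1.50),
        "JavaScript": (26.91, 23.02, 34.44, 1.92),
    },
}


def average_scores(model):
    """Unweighted mean of each metric over the five languages."""
    rows = list(REFERENCE[model].values())
    return {
        name: round(sum(row[i] for row in rows) / len(rows), 2)
        for i, name in enumerate(METRICS)
    }


if __name__ == "__main__":
    for model in REFERENCE:
        print(model, average_scores(model))
```

Under that assumption the computed means match the Average rows of the table for both models.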
Methodological Differences from Experiment 1
This experiment was redesigned to address the validity and reproducibility concerns identified during the Experiment 1 reproduction phase:
- Explicit random seed documentation for all runs
- Controlled evaluation procedure with clearly specified script versions
- Full documentation of execution conditions (hardware, software versions, environment)
- Explicit treatment of validity threats at each stage of the evaluation
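One way to put the first and third points into practice is to fix the seed at the start of each run and write it, together with basic environment details, to a small manifest file. This is a sketch using only the Python standard library; the function name `start_run`, the manifest fields, and the file name are assumptions, not the format used in this repository.

```python
import json
import platform
import random
import sys


def start_run(seed, manifest_path):
    """Seed Python's RNG and record the seed and environment in a JSON file."""
    random.seed(seed)
    manifest = {
        "seed": seed,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "argv": sys.argv,
    }
    with open(manifest_path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2)
    return manifest


if __name__ == "__main__":
    # "run_manifest.json" is a hypothetical output file name.
    print(start_run(42, "run_manifest.json"))
```

This seeds only Python's built-in `random` module. A run that uses NumPy or PyTorch would also need to seed those libraries (for example with `numpy.random.seed` and `torch.manual_seed`) and record their versions and the hardware used.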
References
- Wu et al. (2025). An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning. arXiv:2502.18904.
- Lin et al. (2023). CCT5: A Code-Change-Oriented Pre-Trained Model. ESEC/FSE 2023.
- He et al. (2023). COME: Commit Message Generation with Modification Embedding. ISSTA 2023.
- Liu et al. (2020). MCMD dataset.
- Vegas & Elbaum (2023). Pitfalls in Experiments with DNN4SE. ESEC/FSE 2023.