	---
	license: mit
	tags:
	- codellama
	- linux
	- bugfix
	- lora
	- qlora
	- git-diff
	base_model: codellama/CodeLlama-7b-Instruct-hf
	model_type: LlamaForCausalLM
	library_name: peft
	pipeline_tag: text-generation
	---

	# CodeLLaMA-Linux-BugFix

	`CodeLLaMA-7B-Instruct` fine-tuned with QLoRA (Quantized Low-Rank Adaptation) for Linux kernel bug fixing. Given buggy C code and a commit message, the model generates a Git diff patch.

	---

	## 🎯 Overview

	This project targets automated Linux kernel bug fixing by:

	- Mining real commit data from the kernel Git history
	- Training a specialized QLoRA model on diff-style fixes
	- Generating Git patches in response to bug-prone code
	- Evaluating results using BLEU, ROUGE, and human inspection

	On the project's test samples the model reaches a BLEU score of 33.87 (full results under Performance Results), which makes it a candidate aid for automated code review and bug fixing.

	---

	## 📊 Performance Results

	### Evaluation Metrics

	✅ BLEU Score: 33.87

	✅ ROUGE Scores:
	- ROUGE-1: P=0.3775, R=0.7306, F1=0.4355
	- ROUGE-2: P=0.2898, R=0.6096, F1=0.3457
	- ROUGE-L: P=0.3023, R=0.6333, F1=0.3612

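	These scores measure token overlap with the reference fixes. Recall is consistently higher than precision, which indicates that the generated patches cover most of the reference content but also contain extra tokens. Together with human inspection, the results suggest that the model can:
	- Generate output in Git diff format
	- Stay close to the reference fixes at the token level
	- Produce code changes that target the underlying bugs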
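
For readers unfamiliar with the P/R/F1 triples above, here is a minimal sketch of how ROUGE-N precision, recall, and F1 can be computed for a single candidate/reference pair. It assumes plain whitespace tokenization and is not the project's evaluation script; `rouge_n` is a hypothetical helper name. If scores are averaged per sample, the reported mean F1 need not equal the F1 computed from the mean precision and mean recall.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1):
    """Return (precision, recall, f1) of n-gram overlap for one pair."""

    def ngrams(text: str) -> Counter:
        tokens = text.split()  # assumption: whitespace tokenization
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# A candidate that contains the whole reference plus extra tokens:
# recall stays at its maximum while precision drops.
p, r, f1 = rouge_n("- return ; + if ( x ) return ;", "+ if ( x ) return ;", n=1)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

The toy pair reproduces the pattern seen in the reported scores: a longer candidate keeps recall high and lowers precision.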

	---

	## 🧠 Model Configuration

	- Base model: `CodeLLaMA-7B-Instruct`
	- Fine-tuning method: QLoRA with 4-bit quantization
	- Training setup:
	  - LoRA r=64, alpha=16, dropout=0.1
	  - Batch size: 64, LR: 2e-4, Epochs: 3
	  - Mixed precision (bfloat16), gradient checkpointing
	- Hardware: Optimized for NVIDIA H200 GPUs
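
To make the LoRA settings above concrete, the sketch below works out the adapter scaling factor (alpha / r) and the number of trainable parameters LoRA adds next to one frozen weight matrix. The 4096 x 4096 projection shape is an assumption based on the hidden size of the 7B base model; which modules the training script actually adapts is not stated here, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


r, alpha = 64, 16          # values from the training setup above
hidden = 4096              # assumption: hidden size of the 7B base model

scaling = alpha / r        # factor applied to the low-rank update B @ A
full = hidden * hidden     # parameters in one frozen projection matrix
added = lora_param_count(hidden, hidden, r)

print(f"scaling = {scaling}")
print(f"trainable per matrix = {added:,} ({added / full:.1%} of {full:,})")
```

With r=64 the adapter adds only a few percent of the parameters of each matrix it wraps, which is why the optimizer state stays small.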

	---

	## 📊 Dataset

	Custom dataset extracted from Linux kernel Git history.

	### Filtering Criteria
	Bug-fix commits containing:
	`fix`, `bug`, `crash`, `memory`, `null`, `panic`, `overflow`, `race`, `corruption`, etc.
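
A keyword filter of this kind could look roughly like the sketch below. It uses only the keywords listed above (the original list ends in "etc.", so it is not exhaustive), and `is_bugfix_commit` is a hypothetical helper, not the project's extraction code.

```python
import re

# Keywords from the list above; the full list used by the project is not shown.
BUGFIX_KEYWORDS = ["fix", "bug", "crash", "memory", "null", "panic",
                   "overflow", "race", "corruption"]

# \b anchors each keyword at a word start, so "fixes" matches
# but "tracepoint" does not count as a hit for "race".
_PATTERN = re.compile(r"\b(?:" + "|".join(BUGFIX_KEYWORDS) + r")", re.IGNORECASE)


def is_bugfix_commit(message: str) -> bool:
    """Return True if the commit message mentions any bug-fix keyword."""
    return _PATTERN.search(message) is not None


print(is_bugfix_commit("net: fix NULL pointer dereference in foo_probe()"))
print(is_bugfix_commit("docs: add tracepoint overview"))
```

Keyword matching is a coarse heuristic: it will still admit commits that only mention a keyword in passing, so a later manual or rule-based pass may be needed.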

	### Structure
	- Language: C (`.c`, `.h`)
	- Context: 10 lines before/after the change
	- Format:

	```json
	{
	  "input": {
	    "original code": "C code snippet with bug",
	    "instruction": "Commit message or fix description"
	  },
	  "output": {
	    "diff codes": "Git diff showing the fix"
	  }
	}
	```

	* File: `training_data_100k.jsonl` (100,000 samples)
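
As an illustration of how a record in this format can be turned into a prompt/completion pair, here is a minimal sketch. The prompt template mirrors the example under "Use the Model"; `to_prompt_completion` is a hypothetical helper, and the actual `format_for_training.py` may do this differently.

```python
import json

FENCE = "`" * 3  # built at runtime to avoid nesting a literal code fence


def to_prompt_completion(record: dict) -> dict:
    """Map one raw record to a prompt/completion pair (hypothetical helper)."""
    code = record["input"]["original code"]
    instruction = record["input"]["instruction"]
    prompt = (
        "Given the following original C code:\n"
        f"{FENCE}c\n{code}\n{FENCE}\n\n"
        f"Instruction: {instruction}\n\n"
        "Return the diff that fixes it:\n"
    )
    return {"prompt": prompt, "completion": record["output"]["diff codes"]}


# One JSONL line in the documented schema (contents are made up for illustration).
line = json.dumps({
    "input": {"original code": "kfree(p);\nkfree(p);", "instruction": "Fix double free"},
    "output": {"diff codes": "-kfree(p);"},
})
pair = to_prompt_completion(json.loads(line))
print(pair["prompt"])
```

Reading the real file would apply the same function to each line of `training_data_100k.jsonl`.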

	---

	## 🚀 Quick Start

	### Prerequisites

	- Python 3.8+
	- CUDA-compatible GPU (recommended)
	- 16GB+ RAM
	- 50GB+ disk space

	### Install dependencies

	```bash
	pip install -r requirements.txt
	```

	### 1. Build the Dataset

	```bash
	# Run from the repository root
	cd dataset_builder
	python extract_linux_bugfixes_parallel.py
	python format_for_training.py
	```

	### 2. Fine-tune the Model

	```bash
	# Run from the repository root
	cd train
	python train_codellama_qlora_linux_bugfix.py
	```

	### 3. Run Evaluation

	```bash
	# Run from the repository root
	cd evaluate
	python evaluate_linux_bugfix_model.py
	```

	### 4. Use the Model

	```python
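	from transformers import AutoTokenizer, AutoModelForCausalLM
	from peft import PeftModel

	BASE_MODEL = "codellama/CodeLlama-7b-Instruct-hf"
	ADAPTER_DIR = "train/output/qlora-codellama-bugfix"

	# Load the base model, then attach the fine-tuned LoRA adapter
	tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
	model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
	model = PeftModel.from_pretrained(model, ADAPTER_DIR)
	model.eval()

	# Generate a bug fix
	fence = "`" * 3  # built at runtime so this example does not nest a literal code fence
	prompt = f"""
	Given the following original C code:
	{fence}c
	if (!file->filter)
	    return;
	{fence}

	Instruction: Fix the null pointer dereference

	Return the diff that fixes it:
	"""

	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	# temperature only takes effect when sampling is enabled
	outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
	# Decode only the newly generated tokens, not the echoed prompt
	new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
	fix = tokenizer.decode(new_tokens, skip_special_tokens=True)
	print(fix)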
	```

	---

	## 📁 Project Structure

	```
	CodeLLaMA-Linux-BugFix/
	├── dataset_builder/
	│   ├── extract_linux_bugfixes_parallel.py  # Parallel extraction of bug fixes
	│   ├── format_for_training.py  # Format data for training
	│   └── build_dataset.py  # Main dataset builder
	├── dataset/
	│   ├── training_data_100k.jsonl  # 100K training samples
	│   └── training_data_prompt_completion.jsonl  # Formatted training data
	├── train/
	│   ├── train_codellama_qlora_linux_bugfix.py  # Main training script
	│   ├── train_codellama_qlora_simple.py  # Simplified training
	│   ├── download_codellama_model.py  # Model download utility
	│   └── output/
	│       └── qlora-codellama-bugfix/  # Trained model checkpoints
	├── evaluate/
	│   ├── evaluate_linux_bugfix_model.py  # Evaluation script
	│   ├── test_samples.jsonl  # Test dataset
	│   └── output/  # Evaluation results
	│       ├── eval_results.csv  # Detailed results
	│       └── eval_results.json  # JSON format results
	├── requirements.txt  # Python dependencies
	├── README.md  # This file
	└── PROJECT_STRUCTURE.md  # Detailed project overview
	```

	---

	## 🧩 Features

	* 🔧 Efficient fine-tuning: QLoRA with 4-bit quantization reduces memory usage by ~75%
	* 🧠 Real-world commits: Mined from actual Linux kernel development history
	* 💡 Context-aware: Extracts the code context around the changed lines
	* 💻 Output-ready: Generates Git-style diffs
	* 📈 Measured performance: BLEU 33.87, ROUGE-1 F1 0.4355
	* 🚀 Deployable: Ships as a LoRA adapter that loads on top of the base model via PEFT
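
Context extraction around the changed lines can be sketched as a simple window over the source file. The 10-line default follows the Dataset section; `context_window` and its 0-based inclusive indices are assumptions made for illustration, not the project's extraction code.

```python
def context_window(lines, first_changed, last_changed, context=10):
    """Return the changed lines plus `context` lines on each side.

    `first_changed` and `last_changed` are 0-based, inclusive indices.
    The window is clamped at the start and end of the file.
    """
    start = max(0, first_changed - context)
    end = min(len(lines), last_changed + context + 1)
    return lines[start:end]


source = [f"line {i}" for i in range(100)]
snippet = context_window(source, first_changed=50, last_changed=52)
print(snippet[0], "...", snippet[-1], f"({len(snippet)} lines)")
```

Clamping matters for changes near the top or bottom of a file, where fewer than 10 context lines exist on one side.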

	---

	## 📈 Evaluation Metrics

	* BLEU: N-gram precision of the generated diff against the reference diff
	* ROUGE: N-gram and longest-common-subsequence overlap with the reference fix
	* Human Evaluation: Subjective patch quality assessment

	### Current Performance
	- BLEU Score: 33.87
	- ROUGE-1 F1: 0.4355 (unigram overlap)
	- ROUGE-2 F1: 0.3457 (bigram overlap)
	- ROUGE-L F1: 0.3612 (longest common subsequence)
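
ROUGE-L is based on the longest common subsequence (LCS) between candidate and reference. The sketch below shows one way to compute it for a single pair, assuming whitespace tokenization and a plain F1 (beta = 1); the evaluation script may use a library with different tokenization or weighting, and both function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str):
    """Return (precision, recall, f1) based on LCS length for one pair."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


p, r, f1 = rouge_l("a b c d", "a c d")
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Unlike ROUGE-2, the LCS does not require matched tokens to be adjacent, only to appear in the same order.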

	---

	## 🧪 Use Cases

	* Automated kernel bug fixing: Generate fixes for common kernel bugs
	* Code review assistance: Help reviewers identify potential issues
	* Teaching/debugging kernel code: Educational tool for kernel development
	* Research in automated program repair (APR): Academic research applications
	* CI/CD integration: Automated testing and fixing in development pipelines
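
For code review or CI use, generated output would typically be sanity-checked before anyone tries to apply it. The following is a minimal sketch of a structural check for unified diff format; `looks_like_unified_diff` is a hypothetical helper, and the sample patch (including the `drivers/foo.c` path) is made up for illustration.

```python
import re

# Hunk header such as "@@ -10,2 +10,4 @@"; the line counts are optional.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)


def looks_like_unified_diff(text: str) -> bool:
    """Cheap structural check: both file headers plus at least one hunk header."""
    has_old = re.search(r"^--- ", text, re.MULTILINE) is not None
    has_new = re.search(r"^\+\+\+ ", text, re.MULTILINE) is not None
    return has_old and has_new and HUNK_HEADER.search(text) is not None


patch = """\
--- a/drivers/foo.c
+++ b/drivers/foo.c
@@ -10,2 +10,4 @@
 struct foo *f = get_foo(dev);
+if (!f)
+\treturn -ENODEV;
 f->count++;
"""

print(looks_like_unified_diff(patch))
print(looks_like_unified_diff("The bug is a missing NULL check."))
```

This only checks the shape of the text. Whether a patch actually applies to a tree is a separate question, which `git apply --check` can answer inside a checkout.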

	---

	## 🔬 Technical Highlights

	### Memory & Speed Optimizations

	* 4-bit quantization (NF4)
	* Gradient checkpointing
	* Mixed precision (bfloat16)
	* Gradient accumulation
	* LoRA parameter efficiency

	### Training Efficiency

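	* QLoRA: Trains only the low-rank adapter weights on top of a frozen, quantized base model, reducing memory usage by ~75%
	* 4-bit quantization: Stores the frozen base weights in NF4
	* Gradient checkpointing: Recomputes activations during the backward pass, trading compute for memory
	* Mixed precision: Runs computation in bfloat16 for faster training with maintained accuracy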
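
The ~75% figure is consistent with a back-of-the-envelope estimate for the base weights alone, sketched below. It assumes a nominal 7 billion parameters and a 16-bit baseline, and it ignores quantization constants, activations, adapter weights, and optimizer state, so real savings will differ.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 7e9                            # assumption: nominal 7B parameter count
fp16 = weight_memory_gb(params, 16)     # 16-bit baseline
nf4 = weight_memory_gb(params, 4)       # 4-bit quantized base weights
saving = 1 - nf4 / fp16

print(f"16-bit: {fp16:.1f} GB, 4-bit: {nf4:.1f} GB, saving: {saving:.0%}")
```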

	---

	## 🛠️ Advanced Usage

	### Custom Training

	```bash
	# Train with custom parameters
	python train_codellama_qlora_linux_bugfix.py \
	  --learning_rate 1e-4 \
	  --num_epochs 5 \
	  --batch_size 32 \
	  --lora_r 32 \
	  --lora_alpha 16
	```
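
How a script might expose these flags can be sketched with `argparse`. The flag names come from the command above and the defaults from the Model Configuration section; whether the real training script parses them this way is an assumption, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="QLoRA fine-tuning (illustrative sketch)")
    # Defaults mirror the Model Configuration section; the real script may differ.
    parser.add_argument("--learning_rate", type=float, default=2e-4)
    parser.add_argument("--num_epochs", type=int, default=3)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--lora_r", type=int, default=64)
    parser.add_argument("--lora_alpha", type=int, default=16)
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(
    ["--learning_rate", "1e-4", "--num_epochs", "5", "--batch_size", "32", "--lora_r", "32"]
)
print(vars(args))
```

Flags that are not passed fall back to their defaults, so `--lora_alpha` stays at 16 here.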

	### Evaluation on Custom Data

	```bash
	# Evaluate on your own test set
	python evaluate_linux_bugfix_model.py \
	  --test_file your_test_data.jsonl \
	  --output_dir custom_eval_results
	```

	---

	## 🤝 Contributing

	1. Fork this repo
	2. Create a feature branch (`git checkout -b feature/amazing-feature`)
	3. Commit your changes (`git commit -m 'Add amazing feature'`)
	4. Push to the branch (`git push origin feature/amazing-feature`)
	5. Open a Pull Request 🙌

	### Development Guidelines

	- Follow PEP 8 style guidelines
	- Add tests for new features
	- Update documentation for API changes
	- Ensure all tests pass before submitting PR

	---

	## 📄 License

	MIT License – see `LICENSE` file for details.

	---

	## 🙏 Acknowledgments

	* Meta for CodeLLaMA base model
	* Hugging Face for Transformers + PEFT libraries
	* The Linux kernel community for open access to commit data
	* Microsoft for introducing LoRA technique
	* University of Washington for QLoRA research

	---

	## 📚 References

	* [CodeLLaMA (Meta, 2023)](https://arxiv.org/abs/2308.12950)
	* [QLoRA (Dettmers et al., 2023)](https://arxiv.org/abs/2305.14314)
	* [LoRA (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)
	* [Automated Program Repair: A Survey](https://ieeexplore.ieee.org/document/8449519)

	---

	## 📞 Support

	For questions, issues, or contributions:
	- Open an issue on GitHub
	- Check the project documentation
	- Review the evaluation results in `evaluate/output/`

	---

	## 🔄 Version History

	- v1.0.0: Initial release with QLoRA training
	- v1.1.0: Added parallel dataset extraction
	- v1.2.0: Improved evaluation metrics and documentation