	---
	language: "code"
	license: "mit"
	tags:
	- dockerfile
	- hadolint
	- binary-classification
	- codebert
	model-index:
	- name: Binary Dockerfile Classifier
	  results: []
	---


	# 🧱 Dockerfile Quality Classifier – Binary Model

	This model predicts whether a given Dockerfile is:

	- ✅ GOOD – clean and follows best practices (violates none of the top Hadolint rules)
	- ❌ BAD – violates at least one of the top Hadolint rules

	It is the first step in a full ML-based Dockerfile linter.

	---

	## 🧠 Model Overview

	- Architecture: Fine-tuned `microsoft/codebert-base`
	- Task: Binary classification (`good` vs `bad`)
	- Input: Full Dockerfile content as plain text
	- Output: `[prob_good, prob_bad]` — softmax scores
	- Max input length: 512 tokens
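To make the output format concrete, here is a minimal sketch of how two raw logits become `[prob_good, prob_bad]` and then a label. It uses only the standard library; the logit values are made up for illustration, and the index order (0 = good, 1 = bad) is taken from the Quick Start script.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Index 0 = good, index 1 = bad, as in the Quick Start script.
LABELS = ("GOOD", "BAD")

logits = [2.0, -1.0]  # made-up values, not real model output
probs = softmax(logits)
label = LABELS[probs.index(max(probs))]
print(label, [round(p, 3) for p in probs])  # GOOD [0.953, 0.047]
```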

	---

	## 📚 Training Details

	- Data source: Real-world and synthetic Dockerfiles
	- Labels: Based on [Hadolint](https://github.com/hadolint/hadolint) top 30 rules
	- Bad examples: At least one rule violated
	- Good examples: Fully clean files
	- Class distribution: 15,000 BAD / 1,500 GOOD (clean), a 10:1 imbalance
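The labeling rule above can be sketched as a small function. This is an illustration only: the actual top-30 rule list is not given in this card, so `TRACKED_RULES` is a hypothetical subset of Hadolint rule codes, and `label_dockerfile` is a made-up helper name.

```python
# Hypothetical subset of Hadolint rule codes; the real top-30 list is not stated here.
TRACKED_RULES = {"DL3007", "DL3008", "DL3013", "DL3020", "DL3025"}

def label_dockerfile(violated_codes):
    """Return "bad" if any tracked rule is violated, otherwise "good"."""
    return "bad" if TRACKED_RULES & set(violated_codes) else "good"

# A file that trips a tracked rule is labeled bad.
print(label_dockerfile(["DL3007", "DL3059"]))
# A file whose only violations fall outside the tracked set counts as good.
print(label_dockerfile(["DL3059"]))
```

Under this assumption, a "good" file is one with no violations among the tracked rules, which matches the class definitions at the top of this card.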

	---

	## 🧪 Evaluation Results

	Evaluation on a held-out test set of 1,650 Dockerfiles:

	| Class    | Precision | Recall | F1-score | Support |
	|----------|-----------|--------|----------|---------|
	| good     | 0.96      | 0.91   | 0.93     | 150     |
	| bad      | 0.99      | 1.00   | 0.99     | 1500    |
	| Accuracy |           |        | 0.99     | 1650    |
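As a sanity check on the table, the F1 scores can be recomputed from the reported precision and recall, and the accuracy compared against a trivial baseline. The sketch below uses only the numbers from the table; nothing else is assumed.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) per class, copied from the table above.
report = {"good": (0.96, 0.91, 150), "bad": (0.99, 1.00, 1500)}

for name, (p, r, support) in report.items():
    print(f"{name}: F1 = {f1(p, r):.2f}")

# Accuracy of a classifier that always predicts the majority class ("bad").
total = sum(s for _, _, s in report.values())
majority_baseline = max(s for _, _, s in report.values()) / total
print(f"always-'bad' baseline accuracy: {majority_baseline:.3f}")
```

Because 1,500 of the 1,650 test files are bad, always predicting `bad` already reaches about 0.91 accuracy, so the precision and recall for the `good` class say more about the model than the overall accuracy does.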

	---

	## 🚀 Quick Start

	### 🧪 Step 1 — Create test script

	Save this as `test_binary_predict.py`:

	```python
	import sys
	from pathlib import Path

	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	# Read the Dockerfile passed as the first command-line argument.
	path = Path(sys.argv[1])
	text = path.read_text(encoding="utf-8")

	# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
	tokenizer = AutoTokenizer.from_pretrained("LeeSek/binary-dockerfile-model")
	model = AutoModelForSequenceClassification.from_pretrained("LeeSek/binary-dockerfile-model")
	model.eval()

	# Inputs longer than 512 tokens are truncated to the model's maximum length.
	inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=512)

	# Run inference without tracking gradients, then convert logits to probabilities.
	with torch.no_grad():
	    logits = model(**inputs).logits
	probs = torch.nn.functional.softmax(logits, dim=1).squeeze()

	# Index 0 is GOOD, index 1 is BAD.
	label = "GOOD" if torch.argmax(probs).item() == 0 else "BAD"
	print(f"Prediction: {label} — Probabilities: good={probs[0]:.3f}, bad={probs[1]:.3f}")
	```
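To check several Dockerfiles in one run, the same decision rule can be wrapped in a loop. The sketch below is a hypothetical helper, not part of this repository: `classify_files` is a made-up name, and `score` stands in for any callable that returns `(prob_good, prob_bad)` for a Dockerfile's text, for example the tokenizer and model calls from the script above.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple

Scores = Tuple[float, float]  # (prob_good, prob_bad), same order as the model output

def classify_files(paths: Iterable[Path], score: Callable[[str], Scores]) -> Dict[str, str]:
    """Map each path to "GOOD" or "BAD" by picking the larger probability."""
    results = {}
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        prob_good, prob_bad = score(text)
        # Assumption: an exact tie is resolved in favor of GOOD (index 0).
        results[str(path)] = "GOOD" if prob_good >= prob_bad else "BAD"
    return results
```

Passing the scorer in as an argument keeps the model loaded once and makes the loop easy to try out with a stub before downloading any weights.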

	---

	### 📄 Step 2 — Create a good and a bad Dockerfile

	Good:

	```docker
	FROM node:18
	WORKDIR /app
	COPY . .
	RUN npm install
	CMD ["node", "index.js"]
	```

	Bad:

	```docker
	FROM ubuntu:latest
	RUN apt-get install python3
	ADD . /app
	WORKDIR /app
	RUN pip install flask
	CMD python3 app.py
	```

	---

	### ▶️ Step 3 — Run the prediction

	```bash
	python test_binary_predict.py Dockerfile
	```

	Expected output for the good Dockerfile (exact probabilities may vary):

	```
	Prediction: GOOD — Probabilities: good=0.998, bad=0.002
	```

	---

	## 🗂 Extras

	The full training and evaluation pipeline (data preparation, training, validation, and prediction) is available in the `scripts/` folder.

	> 💬 Note: Scripts are written with Polish comments and variable names for clarity during local development. Logic is fully portable.

	---

	## 📘 License

	MIT

	---

	## 🙌 Credits

	- Model powered by [Hugging Face Transformers](https://huggingface.co/transformers)
	- Tokenizer: CodeBERT
	- Rule definitions: [Hadolint](https://github.com/hadolint/hadolint)