File size: 4,984 Bytes

c19331a

---
license: mit
language:
- ar
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text2text-generation
library_name: transformers
tags:
- Text-To-SQL
- Arabic
- Spider
- SQL
---

# Model Card for Arabic Text-To-SQL (OsamaMo)

## Model Details

### Model Description

This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on **Qwen/Qwen2.5-1.5B-Instruct** and trained using LoRA on Kaggle for 15 hours on a **P100 8GB GPU**.

- **Developed by:** Osama Mohamed ([OsamaMo](https://huggingface.co/OsamaMo))
- **Funded by:** Self-funded
- **Shared by:** Osama Mohamed
- **Model type:** Text-to-SQL fine-tuned model
- **Language(s):** Arabic (ar)
- **License:** MIT
- **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct

### Model Sources

- **Repository:** [Hugging Face Model Hub](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL)
- **Dataset:** Spider (translated to Arabic)
- **Training Script:** [LLaMA-Factory](https://github.com/huggingface/transformers/tree/main/src/transformers/models/llama_factory)

## Uses

### Direct Use

This model is intended for converting **Arabic natural language questions** into SQL queries. It can be used for database querying in Arabic-speaking applications.

### Downstream Use

Can be fine-tuned further for specific databases or Arabic dialect adaptations.

### Out-of-Scope Use

- The model is **not** intended for direct execution of SQL queries.
- Not recommended for non-database-related NLP tasks.

## Bias, Risks, and Limitations

- The model might generate incorrect or non-optimized SQL queries.
- Bias may exist due to dataset translations and model pretraining data.

### Recommendations

- Validate generated SQL queries before execution.
- Ensure compatibility with specific database schemas.

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B"

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

model.load_adapter(finetuned_model_id)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

def generate_resp(messages):
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=1024,
        do_sample=False, top_k=None, temperature=None, top_p=None,
    )

    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return response
```

## Training Details

### Training Data

- Dataset: **Spider (translated into Arabic)**
- Preprocessing: Questions converted to Arabic while keeping SQL queries unchanged.
- Training format:
  - System instruction guiding Arabic-to-SQL conversion.
  - Database schema provided for context.
  - Arabic user queries mapped to correct SQL output.
  - Output is strictly formatted SQL queries enclosed in markdown code blocks.

### Training Procedure

#### Training Hyperparameters

- **Batch size:** 1 (per device)
- **Gradient accumulation:** 4 steps
- **Learning rate:** 1.0e-4
- **Epochs:** 3
- **Scheduler:** Cosine
- **Warmup ratio:** 0.1
- **Precision:** bf16

#### Speeds, Sizes, Times

- **Training time:** 15 hours on **NVIDIA P100 8GB**
- **Checkpointing every:** 500 steps

## Evaluation

### Testing Data

- Validation dataset: Spider validation set (translated to Arabic)

### Metrics

- Exact Match (EM) for SQL correctness
- Execution Accuracy (EX) on databases

### Results

- Model achieved **competitive SQL generation accuracy** for Arabic queries.
- Further testing required for robustness.

## Environmental Impact

- **Hardware Type:** NVIDIA Tesla P100 8GB
- **Hours used:** 15
- **Cloud Provider:** Kaggle
- **Carbon Emitted:** Estimated using [ML Impact Calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Model Architecture and Objective

- Transformer-based **Qwen2.5-1.5B** architecture.
- Fine-tuned for Text-to-SQL task using LoRA.

### Compute Infrastructure

- **Hardware:** Kaggle P100 GPU (8GB VRAM)
- **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub

## Citation

If you use this model, please cite:

```bibtex
@misc{OsamaMo_ArabicSQL,
  author = {Osama Mohamed},
  title = {Arabic Text-To-SQL Model},
  year = {2024},
  howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}}
}
```

## Model Card Contact

For questions, contact **Osama Mohamed** via Hugging Face ([OsamaMo](https://huggingface.co/OsamaMo)).