| --- |
| license: mit |
| language: |
| - ar |
| base_model: |
| - Qwen/Qwen2.5-1.5B-Instruct |
| pipeline_tag: text2text-generation |
| library_name: transformers |
| tags: |
| - Text-To-SQL |
| - Arabic |
| - Spider |
| - SQL |
| --- |
| |
| # Model Card for Arabic Text-To-SQL (OsamaMo) |
|
|
| ## Model Details |
|
|
| ### Model Description |
|
|
| This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on **Qwen/Qwen2.5-1.5B-Instruct** and trained using LoRA on Kaggle for 15 hours on a **P100 8GB GPU**. |
|
|
| - **Developed by:** Osama Mohamed ([OsamaMo](https://huggingface.co/OsamaMo)) |
| - **Funded by:** Self-funded |
| - **Shared by:** Osama Mohamed |
| - **Model type:** Text-to-SQL fine-tuned model |
| - **Language(s):** Arabic (ar) |
| - **License:** MIT |
| - **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct |
|
|
| ### Model Sources |
|
|
| - **Repository:** [Hugging Face Model Hub](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL) |
| - **Dataset:** Spider (translated to Arabic) |
| - **Training Script:** [LLaMA-Factory](https://github.com/huggingface/transformers/tree/main/src/transformers/models/llama_factory) |
|
|
| ## Uses |
|
|
| ### Direct Use |
|
|
| This model is intended for converting **Arabic natural language questions** into SQL queries. It can be used for database querying in Arabic-speaking applications. |
|
|
| ### Downstream Use |
|
|
| Can be fine-tuned further for specific databases or Arabic dialect adaptations. |
|
|
| ### Out-of-Scope Use |
|
|
| - The model is **not** intended for direct execution of SQL queries. |
| - Not recommended for non-database-related NLP tasks. |
|
|
| ## Bias, Risks, and Limitations |
|
|
| - The model might generate incorrect or non-optimized SQL queries. |
| - Bias may exist due to dataset translations and model pretraining data. |
|
|
| ### Recommendations |
|
|
| - Validate generated SQL queries before execution. |
| - Ensure compatibility with specific database schemas. |
|
|
| ## How to Get Started with the Model |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| import torch |
| |
| device = "cuda" if torch.cuda.is_available() else "cpu" |
| base_model_id = "Qwen/Qwen2.5-1.5B-Instruct" |
| finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B" |
| |
| model = AutoModelForCausalLM.from_pretrained( |
| base_model_id, |
| device_map="auto", |
| torch_dtype=torch.bfloat16 |
| ) |
| |
| model.load_adapter(finetuned_model_id) |
| |
| tokenizer = AutoTokenizer.from_pretrained(base_model_id) |
| |
| def generate_resp(messages): |
| text = tokenizer.apply_chat_template( |
| messages, |
| tokenize=False, |
| add_generation_prompt=True |
| ) |
| |
| model_inputs = tokenizer([text], return_tensors="pt").to(device) |
| |
| generated_ids = model.generate( |
| model_inputs.input_ids, |
| max_new_tokens=1024, |
| do_sample=False, top_k=None, temperature=None, top_p=None, |
| ) |
| |
| generated_ids = [ |
| output_ids[len(input_ids):] |
| for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) |
| ] |
| |
| response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] |
| |
| return response |
| ``` |
|
|
| ## Training Details |
|
|
| ### Training Data |
|
|
| - Dataset: **Spider (translated into Arabic)** |
| - Preprocessing: Questions converted to Arabic while keeping SQL queries unchanged. |
| - Training format: |
| - System instruction guiding Arabic-to-SQL conversion. |
| - Database schema provided for context. |
| - Arabic user queries mapped to correct SQL output. |
| - Output is strictly formatted SQL queries enclosed in markdown code blocks. |
|
|
| ### Training Procedure |
|
|
| #### Training Hyperparameters |
|
|
| - **Batch size:** 1 (per device) |
| - **Gradient accumulation:** 4 steps |
| - **Learning rate:** 1.0e-4 |
| - **Epochs:** 3 |
| - **Scheduler:** Cosine |
| - **Warmup ratio:** 0.1 |
| - **Precision:** bf16 |
|
|
| #### Speeds, Sizes, Times |
|
|
| - **Training time:** 15 hours on **NVIDIA P100 8GB** |
| - **Checkpointing every:** 500 steps |
|
|
| ## Evaluation |
|
|
| ### Testing Data |
|
|
| - Validation dataset: Spider validation set (translated to Arabic) |
|
|
| ### Metrics |
|
|
| - Exact Match (EM) for SQL correctness |
| - Execution Accuracy (EX) on databases |
|
|
| ### Results |
|
|
| - Model achieved **competitive SQL generation accuracy** for Arabic queries. |
| - Further testing required for robustness. |
|
|
| ## Environmental Impact |
|
|
| - **Hardware Type:** NVIDIA Tesla P100 8GB |
| - **Hours used:** 15 |
| - **Cloud Provider:** Kaggle |
| - **Carbon Emitted:** Estimated using [ML Impact Calculator](https://mlco2.github.io/impact#compute) |
|
|
| ## Technical Specifications |
|
|
| ### Model Architecture and Objective |
|
|
| - Transformer-based **Qwen2.5-1.5B** architecture. |
| - Fine-tuned for Text-to-SQL task using LoRA. |
|
|
| ### Compute Infrastructure |
|
|
| - **Hardware:** Kaggle P100 GPU (8GB VRAM) |
| - **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| @misc{OsamaMo_ArabicSQL, |
| author = {Osama Mohamed}, |
| title = {Arabic Text-To-SQL Model}, |
| year = {2024}, |
| howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}} |
| } |
| ``` |
|
|
| ## Model Card Contact |
|
|
| For questions, contact **Osama Mohamed** via Hugging Face ([OsamaMo](https://huggingface.co/OsamaMo)). |
|
|
|
|