| --- |
| license: mit |
| language: |
| - ar |
| base_model: |
| - Qwen/Qwen2.5-1.5B-Instruct |
| pipeline_tag: text2text-generation |
| library_name: transformers |
| tags: |
| - Text-To-SQL |
| - Arabic |
| - Spider |
| - SQL |
| --- |
| |
| # Model Card for Arabic Text-To-SQL (OsamaMo) |
|
|
| ## Model Details |
|
|
| ### Model Description |
|
|
| This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on **Qwen/Qwen2.5-1.5B-Instruct** and trained using LoRA on Kaggle for 15 hours on a **P100 8GB GPU**. |
|
|
| - **Developed by:** Osama Mohamed ([OsamaMo](https://huggingface.co/OsamaMo)) |
| - **Funded by:** Self-funded |
| - **Shared by:** Osama Mohamed |
| - **Model type:** Text-to-SQL fine-tuned model |
| - **Language(s):** Arabic (ar) |
| - **License:** MIT |
| - **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct |
|
|
| ### Model Sources |
|
|
| - **Repository:** [Hugging Face Model Hub](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL) |
| - **Dataset:** Spider (translated to Arabic) |
| - **Training Script:** [LLaMA-Factory](https://github.com/huggingface/transformers/tree/main/src/transformers/models/llama_factory) |
|
|
| ## Uses |
|
|
| ### Direct Use |
|
|
| This model is intended for converting **Arabic natural language questions** into SQL queries. It can be used for database querying in Arabic-speaking applications. |
|
|
| ### Downstream Use |
|
|
| Can be fine-tuned further for specific databases or Arabic dialect adaptations. |
|
|
| ### Out-of-Scope Use |
|
|
| - The model is **not** intended for direct execution of SQL queries. |
| - Not recommended for non-database-related NLP tasks. |
|
|
| ## Bias, Risks, and Limitations |
|
|
| - The model might generate incorrect or non-optimized SQL queries. |
| - Bias may exist due to dataset translations and model pretraining data. |
|
|
| ### Recommendations |
|
|
| - Validate generated SQL queries before execution. |
| - Ensure compatibility with specific database schemas. |
|
|
| ## How to Get Started with the Model |
| ### Load Model |
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| import torch |
| import re |
| |
| device = "cuda" if torch.cuda.is_available() else "cpu" |
| base_model_id = "Qwen/Qwen2.5-1.5B-Instruct" |
| finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B" |
| |
| # Load the base model and adapter for fine-tuning |
| model = AutoModelForCausalLM.from_pretrained( |
| base_model_id, |
| device_map="auto", |
| torch_dtype=torch.bfloat16 |
| ) |
| model.load_adapter(finetuned_model_id) |
| |
| tokenizer = AutoTokenizer.from_pretrained(base_model_id) |
| |
| def generate_resp(messages): |
| text = tokenizer.apply_chat_template( |
| messages, |
| tokenize=False, |
| add_generation_prompt=True |
| ) |
| model_inputs = tokenizer([text], return_tensors="pt").to(device) |
| generated_ids = model.generate( |
| model_inputs.input_ids, |
| max_new_tokens=1024, |
| do_sample=False, temperature= False, |
| ) |
| generated_ids = [ |
| output_ids[len(input_ids):] |
| for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) |
| ] |
| response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] |
| return response |
| ``` |
|
|
|
|
|
|
| ### Example Usage |
| ```python |
| |
| # Production-ready system message for SQL generation |
| system_message = ( |
| "You are a highly advanced Arabic text-to-SQL converter. Your mission is to Understand first the db schema and reltions between it and then accurately transform Arabic " |
| "natural language queries into SQL queries with precision and clarity.\n" |
| ) |
| |
| def get_sql_query(db_schema, arabic_query): |
| # Construct the instruction message including the DB schema and the Arabic query |
| instruction_message = "\n".join([ |
| "## DB-Schema:", |
| db_schema, |
| "", |
| "## User-Prompt:", |
| arabic_query, |
| "# Output SQL:", |
| "```SQL" |
| ]) |
| |
| messages = [ |
| {"role": "system", "content": system_message}, |
| {"role": "user", "content": instruction_message} |
| ] |
| |
| response = generate_resp(messages) |
| |
| # Extract the SQL query from the response using a regex to capture text within the ```sql markdown block |
| match = re.search(r"```sql\s*(.*?)\s*```", response, re.DOTALL | re.IGNORECASE) |
| if match: |
| sql_query = match.group(1).strip() |
| return sql_query |
| else: |
| return response.strip() |
| |
| # Example usage: |
| example_db_schema = r'''{ |
| 'Pharmcy': |
| CREATE TABLE `purchase` ( |
| `BARCODE` varchar(20) NOT NULL, |
| `NAME` varchar(50) NOT NULL, |
| `TYPE` varchar(20) NOT NULL, |
| `COMPANY_NAME` varchar(20) NOT NULL, |
| `QUANTITY` int NOT NULL, |
| `PRICE` double NOT NULL, |
| `AMOUNT` double NOT NULL, |
| PRIMARY KEY (`BARCODE`), |
| KEY `fkr3` (`COMPANY_NAME`), |
| CONSTRAINT `fkr3` FOREIGN KEY (`COMPANY_NAME`) REFERENCES `company` (`NAME`) ON DELETE CASCADE ON UPDATE CASCADE |
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
| |
| CREATE TABLE `sales` ( |
| `BARCODE` varchar(20) NOT NULL, |
| `NAME` varchar(50) NOT NULL, |
| `TYPE` varchar(10) NOT NULL, |
| `DOSE` varchar(10) NOT NULL, |
| `QUANTITY` int NOT NULL, |
| `PRICE` double NOT NULL, |
| `AMOUNT` double NOT NULL, |
| `DATE` varchar(15) NOT NULL |
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
| |
| CREATE TABLE `users` ( |
| `ID` int NOT NULL, |
| `NAME` varchar(50) NOT NULL, |
| `DOB` varchar(20) NOT NULL, |
| `ADDRESS` varchar(100) NOT NULL, |
| `PHONE` varchar(20) NOT NULL, |
| `SALARY` double NOT NULL, |
| `PASSWORD` varchar(20) NOT NULL, |
| PRIMARY KEY (`ID`) |
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
| |
| CREATE TABLE `history_sales` ( |
| `USER_NAME` varchar(20) NOT NULL, |
| `BARCODE` varchar(20) NOT NULL, |
| `NAME` varchar(50) NOT NULL, |
| `TYPE` varchar(10) NOT NULL, |
| `DOSE` varchar(10) NOT NULL, |
| `QUANTITY` int NOT NULL, |
| `PRICE` double NOT NULL, |
| `AMOUNT` double NOT NULL, |
| `DATE` varchar(15) NOT NULL, |
| `TIME` varchar(20) NOT NULL |
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
| |
| CREATE TABLE `expiry` ( |
| `PRODUCT_NAME` varchar(50) NOT NULL, |
| `PRODUCT_CODE` varchar(20) NOT NULL, |
| `DATE_OF_EXPIRY` varchar(10) NOT NULL, |
| `QUANTITY_REMAIN` int NOT NULL |
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
| |
| CREATE TABLE `drugs` ( |
| `NAME` varchar(50) NOT NULL, |
| `TYPE` varchar(20) NOT NULL, |
| `BARCODE` varchar(20) NOT NULL, |
| `DOSE` varchar(10) NOT NULL, |
| `CODE` varchar(10) NOT NULL, |
| `COST_PRICE` double NOT NULL, |
| `SELLING_PRICE` double NOT NULL, |
| `EXPIRY` varchar(20) NOT NULL, |
| `COMPANY_NAME` varchar(50) NOT NULL, |
| `PRODUCTION_DATE` date NOT NULL, |
| `EXPIRATION_DATE` date NOT NULL, |
| `PLACE` varchar(20) NOT NULL, |
| `QUANTITY` int NOT NULL, |
| PRIMARY KEY (`BARCODE`) |
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
| |
| CREATE TABLE `company` ( |
| `NAME` varchar(50) NOT NULL, |
| `ADDRESS` varchar(50) NOT NULL, |
| `PHONE` varchar(20) NOT NULL, |
| PRIMARY KEY (`NAME`) |
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
| |
| Answer the following questions about this schema: |
| }''' |
| |
| example_arabic_query = "اريد الباركود الخاص بدواء يبداء اسمه بحرف 's'" |
|
|
| sql_result = get_sql_query(example_db_schema, example_arabic_query) |
| print("استعلام SQL الناتج:") |
| print(sql_result) |
| ``` |
| |
| ## Training Details |
| |
| ### Training Data |
| |
| - Dataset: **Spider (translated into Arabic)** |
| - Preprocessing: Questions converted to Arabic while keeping SQL queries unchanged. |
| - Training format: |
| - System instruction guiding Arabic-to-SQL conversion. |
| - Database schema provided for context. |
| - Arabic user queries mapped to correct SQL output. |
| - Output is strictly formatted SQL queries enclosed in markdown code blocks. |
| |
| ### Training Procedure |
| |
| #### Training Hyperparameters |
| |
| - **Batch size:** 1 (per device) |
| - **Gradient accumulation:** 4 steps |
| - **Learning rate:** 1.0e-4 |
| - **Epochs:** 3 |
| - **Scheduler:** Cosine |
| - **Warmup ratio:** 0.1 |
| - **Precision:** bf16 |
| |
| #### Speeds, Sizes, Times |
| |
| - **Training time:** 15 hours on **NVIDIA P100 8GB** |
| - **Checkpointing every:** 500 steps |
| |
| ## Evaluation |
| |
| ### Testing Data |
| |
| - Validation dataset: Spider validation set (translated to Arabic) |
| |
| ### Metrics |
| |
| - Exact Match (EM) for SQL correctness |
| - Execution Accuracy (EX) on databases |
| |
| ### Results |
| |
| - Model achieved **competitive SQL generation accuracy** for Arabic queries. |
| - Further testing required for robustness. |
| |
| ## Environmental Impact |
| |
| - **Hardware Type:** NVIDIA Tesla P100 8GB |
| - **Hours used:** 15 |
| - **Cloud Provider:** Kaggle |
| - **Carbon Emitted:** Estimated using [ML Impact Calculator](https://mlco2.github.io/impact#compute) |
| |
| ## Technical Specifications |
| |
| ### Model Architecture and Objective |
| |
| - Transformer-based **Qwen2.5-1.5B** architecture. |
| - Fine-tuned for Text-to-SQL task using LoRA. |
| |
| ### Compute Infrastructure |
| |
| - **Hardware:** Kaggle P100 GPU (8GB VRAM) |
| - **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub |
| |
| ## Citation |
| |
| If you use this model, please cite: |
| |
| ```bibtex |
| @misc{OsamaMo_ArabicSQL, |
| author = {Osama Mohamed}, |
| title = {Arabic Text-To-SQL Model}, |
| year = {2024}, |
| howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}} |
| } |
| ``` |
| |
| ## Model Card Contact |
| |
| For questions, contact **Osama Mohamed** via Hugging Face ([OsamaMo](https://huggingface.co/OsamaMo)). |
| |
| |