| ---
|
| license: mit
|
| language:
|
| - zho
|
| - eng
|
| - fra
|
| - spa
|
| - por
|
| - deu
|
| - ita
|
| - rus
|
| - jpn
|
| - kor
|
| - vie
|
| - tha
|
| - ara
|
| base_model:
|
| - Qwen/Qwen2.5-1.5B-Instruct
|
| pipeline_tag: text2text-generation
|
| library_name: transformers
|
| tags:
|
| - Text-To-SQL
|
| - Arabic
|
| - Spider
|
| - SQL
|
| ---
|
|
|
| # Model Card for Arabic Text-To-SQL (OsamaMo)
|
|
|
| ## Model Details
|
|
|
| ### Model Description
|
|
|
| This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on **Qwen/Qwen2.5-1.5B-Instruct** and trained using LoRA on Kaggle for 15 hours on a **P100 8GB GPU**.
|
|
|
| - **Developed by:** Osama Mohamed ([OsamaMo](https://huggingface.co/OsamaMo))
|
| - **Funded by:** Self-funded
|
| - **Shared by:** Osama Mohamed
|
| - **Model type:** Text-to-SQL fine-tuned model
|
| - **Language(s):** Arabic (ar)
|
| - **License:** MIT
|
| - **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct
|
|
|
| ### Model Sources
|
|
|
| - **Repository:** [Hugging Face Model Hub](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL)
|
| - **Dataset:** Spider (translated to Arabic)
|
| - **Training Script:** [LLaMA-Factory](https://github.com/huggingface/transformers/tree/main/src/transformers/models/llama_factory)
|
|
|
| ## Uses
|
|
|
| ### Direct Use
|
|
|
| This model is intended for converting **Arabic natural language questions** into SQL queries. It can be used for database querying in Arabic-speaking applications.
|
|
|
| ### Downstream Use
|
|
|
| Can be fine-tuned further for specific databases or Arabic dialect adaptations.
|
|
|
| ### Out-of-Scope Use
|
|
|
| - The model is **not** intended for direct execution of SQL queries.
|
| - Not recommended for non-database-related NLP tasks.
|
|
|
| ## Bias, Risks, and Limitations
|
|
|
| - The model might generate incorrect or non-optimized SQL queries.
|
| - Bias may exist due to dataset translations and model pretraining data.
|
|
|
| ### Recommendations
|
|
|
| - Validate generated SQL queries before execution.
|
| - Ensure compatibility with specific database schemas.
|
|
|
| ## How to Get Started with the Model
|
| ### Load Model
|
| ```python
|
| from transformers import AutoModelForCausalLM, AutoTokenizer
|
| import torch
|
| import re
|
|
|
| device = "cuda" if torch.cuda.is_available() else "cpu"
|
| base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
|
| finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B"
|
|
|
| # Load the base model and adapter for fine-tuning
|
| model = AutoModelForCausalLM.from_pretrained(
|
| base_model_id,
|
| device_map="auto",
|
| torch_dtype=torch.bfloat16
|
| )
|
| model.load_adapter(finetuned_model_id)
|
|
|
| tokenizer = AutoTokenizer.from_pretrained(base_model_id)
|
|
|
| def generate_resp(messages):
|
| text = tokenizer.apply_chat_template(
|
| messages,
|
| tokenize=False,
|
| add_generation_prompt=True
|
| )
|
| model_inputs = tokenizer([text], return_tensors="pt").to(device)
|
| generated_ids = model.generate(
|
| model_inputs.input_ids,
|
| max_new_tokens=1024,
|
| do_sample=False, temperature= False,
|
| )
|
| generated_ids = [
|
| output_ids[len(input_ids):]
|
| for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
|
| ]
|
| response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
|
| return response
|
| ```
|
|
|
|
|
|
|
| ### Example Usage
|
| ```python
|
|
|
| # Production-ready system message for SQL generation
|
| system_message = (
|
| "You are a highly advanced Arabic text-to-SQL converter. Your mission is to Understand first the db schema and reltions between it and then accurately transform Arabic "
|
| "natural language queries into SQL queries with precision and clarity.\n"
|
| )
|
|
|
| def get_sql_query(db_schema, arabic_query):
|
| # Construct the instruction message including the DB schema and the Arabic query
|
| instruction_message = "\n".join([
|
| "## DB-Schema:",
|
| db_schema,
|
| "",
|
| "## User-Prompt:",
|
| arabic_query,
|
| "# Output SQL:",
|
| "```SQL"
|
| ])
|
|
|
| messages = [
|
| {"role": "system", "content": system_message},
|
| {"role": "user", "content": instruction_message}
|
| ]
|
|
|
| response = generate_resp(messages)
|
|
|
| # Extract the SQL query from the response using a regex to capture text within the ```sql markdown block
|
| match = re.search(r"```sql\s*(.*?)\s*```", response, re.DOTALL | re.IGNORECASE)
|
| if match:
|
| sql_query = match.group(1).strip()
|
| return sql_query
|
| else:
|
| return response.strip()
|
|
|
| # Example usage:
|
| example_db_schema = r'''{
|
| 'Pharmcy':
|
| CREATE TABLE `purchase` (
|
| `BARCODE` varchar(20) NOT NULL,
|
| `NAME` varchar(50) NOT NULL,
|
| `TYPE` varchar(20) NOT NULL,
|
| `COMPANY_NAME` varchar(20) NOT NULL,
|
| `QUANTITY` int NOT NULL,
|
| `PRICE` double NOT NULL,
|
| `AMOUNT` double NOT NULL,
|
| PRIMARY KEY (`BARCODE`),
|
| KEY `fkr3` (`COMPANY_NAME`),
|
| CONSTRAINT `fkr3` FOREIGN KEY (`COMPANY_NAME`) REFERENCES `company` (`NAME`) ON DELETE CASCADE ON UPDATE CASCADE
|
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
| CREATE TABLE `sales` (
|
| `BARCODE` varchar(20) NOT NULL,
|
| `NAME` varchar(50) NOT NULL,
|
| `TYPE` varchar(10) NOT NULL,
|
| `DOSE` varchar(10) NOT NULL,
|
| `QUANTITY` int NOT NULL,
|
| `PRICE` double NOT NULL,
|
| `AMOUNT` double NOT NULL,
|
| `DATE` varchar(15) NOT NULL
|
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
| CREATE TABLE `users` (
|
| `ID` int NOT NULL,
|
| `NAME` varchar(50) NOT NULL,
|
| `DOB` varchar(20) NOT NULL,
|
| `ADDRESS` varchar(100) NOT NULL,
|
| `PHONE` varchar(20) NOT NULL,
|
| `SALARY` double NOT NULL,
|
| `PASSWORD` varchar(20) NOT NULL,
|
| PRIMARY KEY (`ID`)
|
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
| CREATE TABLE `history_sales` (
|
| `USER_NAME` varchar(20) NOT NULL,
|
| `BARCODE` varchar(20) NOT NULL,
|
| `NAME` varchar(50) NOT NULL,
|
| `TYPE` varchar(10) NOT NULL,
|
| `DOSE` varchar(10) NOT NULL,
|
| `QUANTITY` int NOT NULL,
|
| `PRICE` double NOT NULL,
|
| `AMOUNT` double NOT NULL,
|
| `DATE` varchar(15) NOT NULL,
|
| `TIME` varchar(20) NOT NULL
|
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
| CREATE TABLE `expiry` (
|
| `PRODUCT_NAME` varchar(50) NOT NULL,
|
| `PRODUCT_CODE` varchar(20) NOT NULL,
|
| `DATE_OF_EXPIRY` varchar(10) NOT NULL,
|
| `QUANTITY_REMAIN` int NOT NULL
|
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
| CREATE TABLE `drugs` (
|
| `NAME` varchar(50) NOT NULL,
|
| `TYPE` varchar(20) NOT NULL,
|
| `BARCODE` varchar(20) NOT NULL,
|
| `DOSE` varchar(10) NOT NULL,
|
| `CODE` varchar(10) NOT NULL,
|
| `COST_PRICE` double NOT NULL,
|
| `SELLING_PRICE` double NOT NULL,
|
| `EXPIRY` varchar(20) NOT NULL,
|
| `COMPANY_NAME` varchar(50) NOT NULL,
|
| `PRODUCTION_DATE` date NOT NULL,
|
| `EXPIRATION_DATE` date NOT NULL,
|
| `PLACE` varchar(20) NOT NULL,
|
| `QUANTITY` int NOT NULL,
|
| PRIMARY KEY (`BARCODE`)
|
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
| CREATE TABLE `company` (
|
| `NAME` varchar(50) NOT NULL,
|
| `ADDRESS` varchar(50) NOT NULL,
|
| `PHONE` varchar(20) NOT NULL,
|
| PRIMARY KEY (`NAME`)
|
| ) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
| Answer the following questions about this schema:
|
| }'''
|
|
|
| example_arabic_query = "اريد الباركود الخاص بدواء يبداء اسمه بحرف 's'"
|
|
|
| sql_result = get_sql_query(example_db_schema, example_arabic_query)
|
| print("استعلام SQL الناتج:")
|
| print(sql_result)
|
| ```
|
|
|
| ## Training Details
|
|
|
| ### Training Data
|
|
|
| - Dataset: **Spider (translated into Arabic)**
|
| - Preprocessing: Questions converted to Arabic while keeping SQL queries unchanged.
|
| - Training format:
|
| - System instruction guiding Arabic-to-SQL conversion.
|
| - Database schema provided for context.
|
| - Arabic user queries mapped to correct SQL output.
|
| - Output is strictly formatted SQL queries enclosed in markdown code blocks.
|
|
|
| ### Training Procedure
|
|
|
| #### Training Hyperparameters
|
|
|
| - **Batch size:** 1 (per device)
|
| - **Gradient accumulation:** 4 steps
|
| - **Learning rate:** 1.0e-4
|
| - **Epochs:** 3
|
| - **Scheduler:** Cosine
|
| - **Warmup ratio:** 0.1
|
| - **Precision:** bf16
|
|
|
| #### Speeds, Sizes, Times
|
|
|
| - **Training time:** 15 hours on **NVIDIA P100 8GB**
|
| - **Checkpointing every:** 500 steps
|
|
|
| ## Evaluation
|
|
|
| ### Testing Data
|
|
|
| - Validation dataset: Spider validation set (translated to Arabic)
|
|
|
| ### Metrics
|
|
|
| - Exact Match (EM) for SQL correctness
|
| - Execution Accuracy (EX) on databases
|
|
|
| ### Results
|
|
|
| - Model achieved **competitive SQL generation accuracy** for Arabic queries.
|
| - Further testing required for robustness.
|
|
|
| ## Environmental Impact
|
|
|
| - **Hardware Type:** NVIDIA Tesla P100 8GB
|
| - **Hours used:** 15
|
| - **Cloud Provider:** Kaggle
|
| - **Carbon Emitted:** Estimated using [ML Impact Calculator](https://mlco2.github.io/impact#compute)
|
|
|
| ## Technical Specifications
|
|
|
| ### Model Architecture and Objective
|
|
|
| - Transformer-based **Qwen2.5-1.5B** architecture.
|
| - Fine-tuned for Text-to-SQL task using LoRA.
|
|
|
| ### Compute Infrastructure
|
|
|
| - **Hardware:** Kaggle P100 GPU (8GB VRAM)
|
| - **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub
|
|
|
| ## Citation
|
|
|
| If you use this model, please cite:
|
|
|
| ```bibtex
|
| @misc{OsamaMo_ArabicSQL,
|
| author = {Osama Mohamed},
|
| title = {Arabic Text-To-SQL Model},
|
| year = {2024},
|
| howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}}
|
| }
|
| ```
|
|
|
| ## Model Card Contact
|
|
|
| For questions, contact **Osama Mohamed** via Hugging Face ([OsamaMo](https://huggingface.co/OsamaMo)).
|
|
|
|
|