Update README.md

193a126 verified about 1 year ago

9.65 kB

	---
	license: mit
	language:
	- ar
	base_model:
	- Qwen/Qwen2.5-1.5B-Instruct
	pipeline_tag: text2text-generation
	library_name: transformers
	tags:
	- Text-To-SQL
	- Arabic
	- Spider
	- SQL
	---

	# Model Card for Arabic Text-To-SQL (OsamaMo)

	## Model Details

	### Model Description

	This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on Qwen/Qwen2.5-1.5B-Instruct and trained using LoRA on Kaggle for 15 hours on a P100 8GB GPU.

	- Developed by: Osama Mohamed ([OsamaMo](https://huggingface.co/OsamaMo))
	- Funded by: Self-funded
	- Shared by: Osama Mohamed
	- Model type: Text-to-SQL fine-tuned model
	- Language(s): Arabic (ar)
	- License: MIT
	- Finetuned from: Qwen/Qwen2.5-1.5B-Instruct

	### Model Sources

	- Repository: [Hugging Face Model Hub](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL)
	- Dataset: Spider (translated to Arabic)
	- Training Script: [LLaMA-Factory](https://github.com/huggingface/transformers/tree/main/src/transformers/models/llama_factory)

	## Uses

	### Direct Use

	This model is intended for converting Arabic natural language questions into SQL queries. It can be used for database querying in Arabic-speaking applications.

	### Downstream Use

	Can be fine-tuned further for specific databases or Arabic dialect adaptations.

	### Out-of-Scope Use

	- The model is not intended for direct execution of SQL queries.
	- Not recommended for non-database-related NLP tasks.

	## Bias, Risks, and Limitations

	- The model might generate incorrect or non-optimized SQL queries.
	- Bias may exist due to dataset translations and model pretraining data.

	### Recommendations

	- Validate generated SQL queries before execution.
	- Ensure compatibility with specific database schemas.

	## How to Get Started with the Model
	### Load Model
	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch
	import re

	device = "cuda" if torch.cuda.is_available() else "cpu"
	base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
	finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B"

	# Load the base model and adapter for fine-tuning
	model = AutoModelForCausalLM.from_pretrained(
	base_model_id,
	device_map="auto",
	torch_dtype=torch.bfloat16
	)
	model.load_adapter(finetuned_model_id)

	tokenizer = AutoTokenizer.from_pretrained(base_model_id)

	def generate_resp(messages):
	text = tokenizer.apply_chat_template(
	messages,
	tokenize=False,
	add_generation_prompt=True
	)
	model_inputs = tokenizer([text], return_tensors="pt").to(device)
	generated_ids = model.generate(
	model_inputs.input_ids,
	max_new_tokens=1024,
	do_sample=False, temperature= False,
	)
	generated_ids = [
	output_ids[len(input_ids):]
	for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
	]
	response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
	return response
	```



	### Example Usage
	```python

	# Production-ready system message for SQL generation
	system_message = (
	"You are a highly advanced Arabic text-to-SQL converter. Your mission is to Understand first the db schema and reltions between it and then accurately transform Arabic "
	"natural language queries into SQL queries with precision and clarity.\n"
	)

	def get_sql_query(db_schema, arabic_query):
	# Construct the instruction message including the DB schema and the Arabic query
	instruction_message = "\n".join([
	"## DB-Schema:",
	db_schema,
	"",
	"## User-Prompt:",
	arabic_query,
	"# Output SQL:",
	"```SQL"
	])

	messages = [
	{"role": "system", "content": system_message},
	{"role": "user", "content": instruction_message}
	]

	response = generate_resp(messages)

	# Extract the SQL query from the response using a regex to capture text within the ```sql markdown block
	match = re.search(r"```sql\s(.?)\s*```", response, re.DOTALL \| re.IGNORECASE)
	if match:
	sql_query = match.group(1).strip()
	return sql_query
	else:
	return response.strip()

	# Example usage:
	example_db_schema = r'''{
	'Pharmcy':
	CREATE TABLE `purchase` (
	`BARCODE` varchar(20) NOT NULL,
	`NAME` varchar(50) NOT NULL,
	`TYPE` varchar(20) NOT NULL,
	`COMPANY_NAME` varchar(20) NOT NULL,
	`QUANTITY` int NOT NULL,
	`PRICE` double NOT NULL,
	`AMOUNT` double NOT NULL,
	PRIMARY KEY (`BARCODE`),
	KEY `fkr3` (`COMPANY_NAME`),
	CONSTRAINT `fkr3` FOREIGN KEY (`COMPANY_NAME`) REFERENCES `company` (`NAME`) ON DELETE CASCADE ON UPDATE CASCADE
	) ENGINE=InnoDB DEFAULT CHARSET=latin1

	CREATE TABLE `sales` (
	`BARCODE` varchar(20) NOT NULL,
	`NAME` varchar(50) NOT NULL,
	`TYPE` varchar(10) NOT NULL,
	`DOSE` varchar(10) NOT NULL,
	`QUANTITY` int NOT NULL,
	`PRICE` double NOT NULL,
	`AMOUNT` double NOT NULL,
	`DATE` varchar(15) NOT NULL
	) ENGINE=InnoDB DEFAULT CHARSET=latin1

	CREATE TABLE `users` (
	`ID` int NOT NULL,
	`NAME` varchar(50) NOT NULL,
	`DOB` varchar(20) NOT NULL,
	`ADDRESS` varchar(100) NOT NULL,
	`PHONE` varchar(20) NOT NULL,
	`SALARY` double NOT NULL,
	`PASSWORD` varchar(20) NOT NULL,
	PRIMARY KEY (`ID`)
	) ENGINE=InnoDB DEFAULT CHARSET=latin1

	CREATE TABLE `history_sales` (
	`USER_NAME` varchar(20) NOT NULL,
	`BARCODE` varchar(20) NOT NULL,
	`NAME` varchar(50) NOT NULL,
	`TYPE` varchar(10) NOT NULL,
	`DOSE` varchar(10) NOT NULL,
	`QUANTITY` int NOT NULL,
	`PRICE` double NOT NULL,
	`AMOUNT` double NOT NULL,
	`DATE` varchar(15) NOT NULL,
	`TIME` varchar(20) NOT NULL
	) ENGINE=InnoDB DEFAULT CHARSET=latin1

	CREATE TABLE `expiry` (
	`PRODUCT_NAME` varchar(50) NOT NULL,
	`PRODUCT_CODE` varchar(20) NOT NULL,
	`DATE_OF_EXPIRY` varchar(10) NOT NULL,
	`QUANTITY_REMAIN` int NOT NULL
	) ENGINE=InnoDB DEFAULT CHARSET=latin1

	CREATE TABLE `drugs` (
	`NAME` varchar(50) NOT NULL,
	`TYPE` varchar(20) NOT NULL,
	`BARCODE` varchar(20) NOT NULL,
	`DOSE` varchar(10) NOT NULL,
	`CODE` varchar(10) NOT NULL,
	`COST_PRICE` double NOT NULL,
	`SELLING_PRICE` double NOT NULL,
	`EXPIRY` varchar(20) NOT NULL,
	`COMPANY_NAME` varchar(50) NOT NULL,
	`PRODUCTION_DATE` date NOT NULL,
	`EXPIRATION_DATE` date NOT NULL,
	`PLACE` varchar(20) NOT NULL,
	`QUANTITY` int NOT NULL,
	PRIMARY KEY (`BARCODE`)
	) ENGINE=InnoDB DEFAULT CHARSET=latin1

	CREATE TABLE `company` (
	`NAME` varchar(50) NOT NULL,
	`ADDRESS` varchar(50) NOT NULL,
	`PHONE` varchar(20) NOT NULL,
	PRIMARY KEY (`NAME`)
	) ENGINE=InnoDB DEFAULT CHARSET=latin1

	Answer the following questions about this schema:
	}'''

	example_arabic_query = "اريد الباركود الخاص بدواء يبداء اسمه بحرف 's'"

	sql_result = get_sql_query(example_db_schema, example_arabic_query)
	print("استعلام SQL الناتج:")
	print(sql_result)
	```

	## Training Details

	### Training Data

	- Dataset: Spider (translated into Arabic)
	- Preprocessing: Questions converted to Arabic while keeping SQL queries unchanged.
	- Training format:
	- System instruction guiding Arabic-to-SQL conversion.
	- Database schema provided for context.
	- Arabic user queries mapped to correct SQL output.
	- Output is strictly formatted SQL queries enclosed in markdown code blocks.

	### Training Procedure

	#### Training Hyperparameters

	- Batch size: 1 (per device)
	- Gradient accumulation: 4 steps
	- Learning rate: 1.0e-4
	- Epochs: 3
	- Scheduler: Cosine
	- Warmup ratio: 0.1
	- Precision: bf16

	#### Speeds, Sizes, Times

	- Training time: 15 hours on NVIDIA P100 8GB
	- Checkpointing every: 500 steps

	## Evaluation

	### Testing Data

	- Validation dataset: Spider validation set (translated to Arabic)

	### Metrics

	- Exact Match (EM) for SQL correctness
	- Execution Accuracy (EX) on databases

	### Results

	- Model achieved competitive SQL generation accuracy for Arabic queries.
	- Further testing required for robustness.

	## Environmental Impact

	- Hardware Type: NVIDIA Tesla P100 8GB
	- Hours used: 15
	- Cloud Provider: Kaggle
	- Carbon Emitted: Estimated using [ML Impact Calculator](https://mlco2.github.io/impact#compute)

	## Technical Specifications

	### Model Architecture and Objective

	- Transformer-based Qwen2.5-1.5B architecture.
	- Fine-tuned for Text-to-SQL task using LoRA.

	### Compute Infrastructure

	- Hardware: Kaggle P100 GPU (8GB VRAM)
	- Software: Python, Transformers, LLaMA-Factory, Hugging Face Hub

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{OsamaMo_ArabicSQL,
	author = {Osama Mohamed},
	title = {Arabic Text-To-SQL Model},
	year = {2024},
	howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}}
	}
	```

	## Model Card Contact

	For questions, contact Osama Mohamed via Hugging Face ([OsamaMo](https://huggingface.co/OsamaMo)).