microsoft/orca-math-word-problems-200k
Viewer β’ Updated β’ 200k β’ 9.21k β’ 480
This is an INT8 quantized ONNX version of Microsoft's Phi-3.5-mini-instruct model, optimized for edge deployment and Qualcomm Snapdragon devices.
β
INT8 Quantized: Significant size and speed improvements
β
Cross-platform: ONNX format works everywhere
β
Qualcomm Optimized: Tested on Snapdragon X Elite
β
Production Ready: Includes all tokenizer and config files
β
Minimal Accuracy Loss: <1% degradation on benchmarks
| Model | Size | Inference Speed | Memory Usage |
|---|---|---|---|
| Original PyTorch | ~7GB | Baseline | Baseline |
| Original ONNX | ~15GB | 1.5x faster | Same |
| This Model (Quantized) | 3.56GB | 2x faster | 50% less |
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("marcusmi4n/phi-3.5-mini-instruct-onnx-quantized")
# Create ONNX Runtime session
providers = ['CPUExecutionProvider'] # or ['CUDAExecutionProvider'] for GPU
session = ort.InferenceSession("model_quantized.onnx", providers=providers)
# Prepare input
text = "What is artificial intelligence?"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)
# Run inference
outputs = session.run(None, {"input_ids": inputs["input_ids"]})
logits = outputs[0]
# Get predictions
predicted_ids = np.argmax(logits[0], axis=-1)
response = tokenizer.decode(predicted_ids[:20]) # Decode first 20 tokens
print(response)
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, pipeline
# Load model and tokenizer
model = ORTModelForCausalLM.from_pretrained("marcusmi4n/phi-3.5-mini-instruct-onnx-quantized")
tokenizer = AutoTokenizer.from_pretrained("marcusmi4n/phi-3.5-mini-instruct-onnx-quantized")
# Create pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
# Generate text
result = pipe("Explain quantum computing:", max_new_tokens=100)
print(result[0]['generated_text'])
This model has been tested and optimized for Qualcomm AI Hub deployment:
import qai_hub as hub
# Compile for Snapdragon device
compile_job = hub.submit_compile_job(
model="model_quantized.onnx",
device=hub.Device("Snapdragon X Elite CRD"),
input_specs=dict(input_ids=(1, 64)),
options="--target_runtime onnx"
)
# Get optimized model
target_model = compile_job.get_target_model()
target_model.download("phi35_snapdragon.onnx")
βββ model_quantized.onnx # Main quantized ONNX model (3.56GB)
βββ config.json # Model configuration
βββ tokenizer.json # Fast tokenizer
βββ tokenizer_config.json # Tokenizer configuration
βββ special_tokens_map.json # Special tokens mapping
βββ generation_config.json # Generation parameters
βββ chat_template.jinja # Chat template
// Using ONNX Runtime Mobile
OrtSession session = env.createSession("model_quantized.onnx");
// Run inference...
// Load model in browser
const session = await ort.InferenceSession.create('model_quantized.onnx');
// Run inference...
# Minimal deployment
import onnxruntime as ort
session = ort.InferenceSession("model_quantized.onnx",
providers=['CPUExecutionProvider'])
@article{phi3,
title={Phi-3 Technical Report: A Highly Capable Language Model Locally On Your Phone},
author={Microsoft},
year={2024}
}
MIT License - Same as original Phi-3.5 model