Text Generation
Transformers
Safetensors
English
smallthinker
feature-extraction
conversational
custom_code
Instructions to use Tiiny/SmallThinker-4BA0.6B-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Tiiny/SmallThinker-4BA0.6B-Instruct with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="Tiiny/SmallThinker-4BA0.6B-Instruct", trust_remote_code=True) messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Tiiny/SmallThinker-4BA0.6B-Instruct", trust_remote_code=True, dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Tiiny/SmallThinker-4BA0.6B-Instruct with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "Tiiny/SmallThinker-4BA0.6B-Instruct" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Tiiny/SmallThinker-4BA0.6B-Instruct", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/Tiiny/SmallThinker-4BA0.6B-Instruct
- SGLang
How to use Tiiny/SmallThinker-4BA0.6B-Instruct with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "Tiiny/SmallThinker-4BA0.6B-Instruct" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Tiiny/SmallThinker-4BA0.6B-Instruct", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "Tiiny/SmallThinker-4BA0.6B-Instruct" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Tiiny/SmallThinker-4BA0.6B-Instruct", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use Tiiny/SmallThinker-4BA0.6B-Instruct with Docker Model Runner:
docker model run hf.co/Tiiny/SmallThinker-4BA0.6B-Instruct
| # coding=utf-8 | |
| from transformers.configuration_utils import PretrainedConfig | |
| class SmallThinkerConfig(PretrainedConfig): | |
| """ | |
| This is the configuration class to store the configuration of a [`SmallThinkerModel`]. | |
| It is used to instantiate a SmallThinker model according to the specified arguments, defining the model architecture. | |
| The default values for each of the parameters are the same as the ones used in the original SmallThinker 4B model. | |
| General configs: | |
| - model_type: "smallthinker" | |
| - model_name | |
| - num_hidden_layers | |
| - hidden_size | |
| Tokenizer configs: | |
| - pad_token_id | |
| - bos_token_id | |
| - eos_token_id | |
| Embedding configs: | |
| - vocab_size | |
| RMSNorm configs: | |
| - rms_norm_eps | |
| Attention configs: | |
| - num_attention_heads | |
| - num_key_value_heads | |
| - head_dim | |
| - use_cache | |
| - rope_layout: array of 0 or 1s, 0 for nope, 1 for rope | |
| - rope_theta | |
| - max_position_embeddings | |
| - sliding_window_layout: array of 0 or 1s, 0 for normal attention, 1 for SWA | |
| - sliding_window_size | |
| MoE FFN configs: | |
| - moe_num_primary_experts | |
| - moe_ffn_hidden_size | |
| - moe_primary_router_apply_softmax: Use topk-softmax in routing instead of topk-sigmoid-normalize | |
| - moe_num_active_primary_experts | |
| LM Head configs: | |
| - tie_word_embeddings | |
| Other configs: | |
| - initializer_range | |
| """ | |
| def __init__(self, | |
| model_type = "smallthinker", | |
| model_name="smallthinker_4b_base", | |
| num_hidden_layers=32, | |
| hidden_size=1536, | |
| pad_token_id=None, | |
| bos_token_id=151643, | |
| eos_token_id=[151643,151645], | |
| vocab_size=151936, | |
| rms_norm_eps=1e-6, | |
| num_attention_heads=12, | |
| num_key_value_heads=2, | |
| head_dim=128, | |
| use_cache=True, | |
| rope_layout=[1]*32, | |
| rope_theta=1e6, | |
| max_position_embeddings=4096 * 32, | |
| sliding_window_layout=[0]*32, | |
| sliding_window_size=4096, | |
| moe_num_primary_experts=32, | |
| moe_ffn_hidden_size=768, | |
| moe_primary_router_apply_softmax=False, | |
| moe_num_active_primary_experts=4, | |
| tie_word_embeddings=True, | |
| initializer_range=0.02, | |
| **kwargs, | |
| ): | |
| # Configuration sanitizers | |
| assert num_attention_heads % num_key_value_heads == 0, "[SmallThinker config sanitizer] num_attention_heads must be divisible by num_key_value_heads" | |
| assert len(rope_layout) == num_hidden_layers, "[SmallThinker config sanitizer] rope_layout must have the same length as num_hidden_layers" | |
| assert len(sliding_window_layout) == num_hidden_layers, "[SmallThinker config sanitizer] sliding_window_layout must have the same length as num_hidden_layers" | |
| # General configs | |
| self.model_type = model_type | |
| self.model_name = model_name | |
| self.num_hidden_layers = num_hidden_layers | |
| self.hidden_size = hidden_size | |
| # Tokenizer configs | |
| self.pad_token_id = pad_token_id | |
| self.bos_token_id = bos_token_id | |
| self.eos_token_id = eos_token_id | |
| # Embedding configs | |
| self.vocab_size = vocab_size | |
| # RMSNorm configs | |
| self.rms_norm_eps = rms_norm_eps | |
| # Attention configs | |
| self.num_attention_heads = num_attention_heads | |
| self.num_key_value_heads = num_key_value_heads | |
| self.head_dim = head_dim | |
| self.use_cache = use_cache | |
| self.rope_layout = rope_layout | |
| self.rope_theta = rope_theta | |
| self.max_position_embeddings = max_position_embeddings | |
| self.sliding_window_layout = sliding_window_layout | |
| self.sliding_window_size = sliding_window_size | |
| # MoE FFN configs | |
| self.moe_num_primary_experts = moe_num_primary_experts | |
| self.moe_ffn_hidden_size = moe_ffn_hidden_size | |
| self.moe_primary_router_apply_softmax = moe_primary_router_apply_softmax | |
| self.moe_num_active_primary_experts = moe_num_active_primary_experts | |
| # Other configs | |
| self.initializer_range = initializer_range | |
| super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, tie_word_embeddings=tie_word_embeddings, **kwargs) | |
| # VLLM config, not used in transformers, but VLLM requires these args to run correctly. DO NOT DELETE! | |
| self.sliding_window = sliding_window_size | |
| self.sliding_window_pattern = sliding_window_layout | |
| self._attn_implementation = "sdpa" | |
| __all__ = ["SmallThinkerConfig"] | |