KoboldCpp Transcribe API not working
Hi @concedo, thank you for providing this GGUF file; you are the only one providing a GGUF build of GLM ASR Nano.
The model card says this GGUF file can be used with KoboldCpp 1.104 and above. I tried it, but ran into difficulties with the transcribe API.
Here are more details:
I run KoboldCpp with Docker Compose:
services:
  koboldcpp:
    container_name: koboldcpp
    image: koboldai/koboldcpp:latest
    volumes:
      - /path_to_model_files/:/workspace/:rw
    deploy: # You can remove this section if you do not wish to use an Nvidia GPU
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
    environment:
      - KCPP_DONT_UPDATE=true
      - KCPP_DONT_TUNNEL=true
      - KCPP_ARGS=--model GLM-ASR-Nano-1.6B-2512-Q8_0.gguf --mmproj mmproj-GLM-ASR-Nano-2512-Q8_0.gguf --gpulayers 99
    ports:
      - "5001:5001"
    restart: unless-stopped
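As a sanity check on the compose file, here is a minimal sketch that splits the KCPP_ARGS value and reports which of the referenced model files are missing from a host directory. The helper names `parse_kcpp_args` and `missing_model_files` are hypothetical, and the directory passed in should be whatever host path is mounted to /workspace/.

```python
import shlex
from pathlib import Path

# Value copied from the KCPP_ARGS entry in the compose file.
KCPP_ARGS = (
    "--model GLM-ASR-Nano-1.6B-2512-Q8_0.gguf "
    "--mmproj mmproj-GLM-ASR-Nano-2512-Q8_0.gguf --gpulayers 99"
)


def parse_kcpp_args(arg_string):
    """Split an argument string into a {flag: value} dict (hypothetical helper)."""
    tokens = shlex.split(arg_string)
    parsed = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            # A flag followed by a non-flag token takes that token as its value.
            if i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                parsed[token[2:]] = tokens[i + 1]
                i += 2
                continue
            # Otherwise treat it as a boolean switch.
            parsed[token[2:]] = True
        i += 1
    return parsed


def missing_model_files(arg_string, model_dir):
    """Return the --model / --mmproj filenames that do not exist in model_dir."""
    args = parse_kcpp_args(arg_string)
    names = [args[key] for key in ("model", "mmproj") if isinstance(args.get(key), str)]
    return [name for name in names if not (Path(model_dir) / name).is_file()]


if __name__ == "__main__":
    # Replace with the host directory that is mounted to /workspace/.
    print(missing_model_files(KCPP_ARGS, "/path_to_model_files/"))
```

An empty list means both files were found in that directory. In this case the logs show both files loading, so the mount itself looks fine.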
KoboldCpp starts with no obvious issues. Here are the logs:
Update check skipped
***
Welcome to KoboldCpp - Version 1.104
Loading Chat Completions Adapter: /tmp/_MEIz6Ojhs/kcpp_adapters/AutoGuess.json
Chat Completions Adapter Loaded
No GPU or CPU backend was selected. Trying to assign one for you automatically...
Auto Selected CUDA Backend (flag=0)
System: Linux #1 SMP PREEMPT_DYNAMIC Fri, 19 Dec 2025 01:23:45 +0000 x86_64 x86_64
Detected Available GPU Memory: 24576 MB
Detected Available RAM: 104038 MB
Initializing dynamic library: koboldcpp_cublas.so
==========
Namespace(admin=False, admindir='', adminpassword=None, analyze='', autofit=False, batchsize=512, benchmark=None, blasthreads=0, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=8192, debugmode=0, defaultgenamt=896, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel='', embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=False, forceversion=False, foreground=False, genlimit=0, gpulayers=99, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, jinja=False, jinja_tools=False, launch=False, lora=None, loramult=1.0, lowvram=False, maingpu=-1, maxrequestsize=32, mmproj='mmproj-GLM-ASR-Nano-2512-Q8_0.gguf', mmprojcpu=False, model=['GLM-ASR-Nano-1.6B-2512-Q8_0.gguf'], model_param='GLM-ASR-Nano-1.6B-2512-Q8_0.gguf', moecpu=0, moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', overridekv='', overridenativecontext=0, overridetensors='', password=None, pipelineparallel=False, port=5001, port_param=5001, preloadstory='', prompt='', quantkv=0, quiet=True, ratelimit=0, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile='', sdclamped=0, sdclampedsoft=0, sdclip1='', sdclip2='', sdclipgpu=False, sdconfig=None, sdconvdirect='off', sdflashattention=False, sdgendefaults='', sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdoffloadcpu=False, sdphotomaker='', sdquant=0, sdt5xxl='', sdthreads=0, sdtiledvae=768, sdvae='', sdvaeauto=False, sdvaecpu=False, showgui=False, singleinstance=False, skiplauncher=False, smartcache=False, smartcontext=False, ssl=None, tensor_split=None, testmemory=False, threads=15, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, 
usecpu=False, usecuda=['normal', 'mmq'], usemlock=False, usemmap=False, useswa=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
==========
Loading Text Model: /workspace/GLM-ASR-Nano-1.6B-2512-Q8_0.gguf
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:2b:00.0) - 23862 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 255 tensors from /workspace/GLM-ASR-Nano-1.6B-2512-Q8_0.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size = 1.58 GiB (8.50 BPW)
init_tokenizer: initializing tokenizer for type 2
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 59246 ('<|endoftext|>')
load: - 59253 ('<|user|>')
load: special tokens cache size = 17
load: token to piece cache size = 0.3378 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 8192
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 28
print_info: n_head = 16
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 6144
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 8192
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 3B
print_info: model params = 1.59 B
print_info: general.name = GLM ASR Nano 2512
print_info: vocab type = BPE
print_info: n_vocab = 59264
print_info: n_merges = 106026
print_info: BOS token = 59246 '<|endoftext|>'
print_info: EOS token = 59246 '<|endoftext|>'
print_info: EOT token = 59253 '<|user|>'
print_info: UNK token = 59246 '<|endoftext|>'
print_info: PAD token = 59246 '<|endoftext|>'
print_info: LF token = 10 'Ċ'
print_info: EOG token = 59246 '<|endoftext|>'
print_info: EOG token = 59253 '<|user|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: relocated tensors: 0 of 255
load_tensors: offloading output layer to GPU
load_tensors: offloading 27 repeating layers to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CUDA0 model buffer size = 1491.93 MiB
load_tensors: CUDA_Host model buffer size = 122.98 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8448
llama_context: n_ctx_seq = 8448
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = disabled
llama_context: kv_unified = true
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (8448) > n_ctx_train (8192) -- possible training context overflow
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 0.23 MiB
llama_kv_cache: CUDA0 KV buffer size = 462.00 MiB
llama_kv_cache: size = 462.00 MiB ( 8448 cells, 28 layers, 1/1 seqs), K (f16): 231.00 MiB, V (f16): 231.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2040
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
llama_context: CUDA0 compute buffer size = 298.51 MiB
llama_context: CUDA_Host compute buffer size = 22.51 MiB
llama_context: graph nodes = 1014
llama_context: graph splits = 2
attach_threadpool: call
clip_model_loader: model name: GLM ASR Nano 2512
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 493
clip_model_loader: n_kv: 22
clip_model_loader: has audio encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector: glma
load_hparams: n_embd: 1280
load_hparams: n_head: 20
load_hparams: n_ff: 5120
load_hparams: n_layer: 32
load_hparams: ffn_op: gelu_erf
load_hparams: projection_dim: 2048
--- audio hparams ---
load_hparams: n_mel_bins: 128
load_hparams: proj_stack_factor: 4
load_hparams: audio_chunk_len: 30
load_hparams: audio_sample_rate: 16000
load_hparams: audio_n_fft: 400
load_hparams: audio_window_len: 400
load_hparams: audio_hop_len: 160
load_hparams: model size: 686.82 MiB
load_hparams: metadata size: 0.20 MiB
load_tensors: loaded 493 tensors from /workspace/mmproj-GLM-ASR-Nano-2512-Q8_0.gguf
Load Text Model OK: True
Chat completion heuristic: Phi 3.5
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Llama.cpp UI loaded.
======
Active Modules: TextGeneration MultimodalAudio
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Starting llama.cpp secondary WebUI at http://localhost:5001/lcpp/
======
Please connect to custom endpoint at http://localhost:5001
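The module summary at the end of the log lists MultimodalAudio as active but VoiceRecognition as inactive, and the Namespace dump shows whispermodel=''. Whether that explains the behaviour described next is an assumption; the log does not confirm it. A minimal sketch for pulling the two module lists out of a saved startup log (the function name `parse_modules` is hypothetical):

```python
# The two summary lines copied from the startup log.
STARTUP_LOG = """\
Active Modules: TextGeneration MultimodalAudio
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl
"""


def parse_modules(log_text):
    """Collect the module names from the 'Active Modules' / 'Inactive Modules' lines."""
    modules = {"active": [], "inactive": []}
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("Active Modules:"):
            modules["active"] = line.split(":", 1)[1].split()
        elif line.startswith("Inactive Modules:"):
            modules["inactive"] = line.split(":", 1)[1].split()
    return modules


if __name__ == "__main__":
    modules = parse_modules(STARTUP_LOG)
    for name in ("MultimodalAudio", "VoiceRecognition"):
        state = "active" if name in modules["active"] else "inactive"
        print(f"{name}: {state}")
```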
However, when I open the web UI, the voice input button is disabled. When I call the /api/extra/transcribe API, it always returns HTTP 200 with this response body:
{
  "text": ""
}
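For anyone trying to reproduce this, a minimal sketch of the request follows. It assumes the endpoint accepts a JSON body with a base64-encoded WAV in an `audio_data` field; that field name is an assumption here and should be checked against the embedded API docs the server reports loading. The helper names and the `sample.wav` path are hypothetical.

```python
import base64
import json
import urllib.request

# Address taken from the startup log.
BASE_URL = "http://localhost:5001"


def build_transcribe_request(wav_bytes, base_url=BASE_URL):
    """Build a POST request for /api/extra/transcribe.

    Assumption: the server expects {"audio_data": "<base64 WAV>"}.
    """
    payload = {"audio_data": base64.b64encode(wav_bytes).decode("ascii")}
    return urllib.request.Request(
        base_url + "/api/extra/transcribe",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def transcription_is_empty(body_text):
    """True when the response JSON has a missing or blank 'text' field."""
    return json.loads(body_text).get("text", "").strip() == ""


def transcribe_file(path, base_url=BASE_URL):
    """Send a WAV file to the server and return the raw response body."""
    with open(path, "rb") as f:
        request = build_transcribe_request(f.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    body = transcribe_file("sample.wav")  # hypothetical test recording
    print("empty transcription" if transcription_is_empty(body) else body)
```

Checking the body rather than the status code matters here, since the server answers HTTP 200 even when the text is empty.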
May I ask whether there is any plan to let the GLM-ASR-Nano-2512 model work the way a Whisper model does? Or is there some fundamental difference between the two, so that GLM-ASR-Nano-2512 cannot be used directly as an STT model?
Hi, can you try again with v1.105.2?