IMPORTANT! THE INFORMATION BELOW IS ABOUT 'OPTIQ_MIXED' VERSION, USE IT, THE OTHER TWO VERSIONS ARE SUBSTANTIALLY WORSE AND ADDED FOR COMPARATIVE RESEARCH

Model Card

Model Description

A 4.5-bits quantized version of mlx-community/JOSIE-1.1-4B-Instruct-bfloat16, substantially reduced size (from 8GB - bf16 to 2GB with 3GB memory peak) with not-nonsense results!

Results

Here are some of the results (i also add the custom patches for optiq eval for the mentioned tests given the limited default options for evaluations, generally copying the initial eval/gsm8k.py logic for more benchmarks):
ARC-Easy eval: 100%|████████████████████████████████████████████| 400/400
ARC-Easy Results
=========================
Accuracy: 384/400 (96.0%)

I also add a patched cli.py that allows you run several benchmarks sequentally with a kwarg 'task_samples_tuple' --tst <task_1{n}>,...,<task_k{m}> where

1-k are names of task that you can look up with optiq eval --task --help,
n and m – integers that stand for n_samples
NOTE that in this case, the syntax ↑↑↑↑ is precise: no spaces and parentheses, e.g., mmlu{50},sciq{100}).

(Ideally, i intend smuggling the whole lm_eval tasks there, but afraid i am as lazy for that as i am stupid with my Ph.D. in Philosophy and absolute lack of any relevant professional skills with only Coursera course in Python by Charles Severance, pure enthusiasm, and very sharp general intelligence that in sum allow me to make at least something like what you see here...)

batch_run_1:

=================================================================
OPTiQ COMPREHENSIVE REPORT
=================================================================
Task......................| Accuracy | Raw Score
--------------------------+--------------+------------------------
winogrande................|.....58.00% | 29/50
arc_easy..................|.....98.67% | 74/75
arc_challenge.............|.....86.67% | 65/75
mmlu......................|.....65.00% | 65/100
openbookqa................|.....87.00% | 87/100
truthfulqa................|.....62.67% | 47/75

I read about capability score here but did not found it in my package (although pip index said i am on the latest version), as well as the package has no git repo, so for batch_run_2 i.... IMAGINED how it could have been, and implemented it (see the added .py files, again):

Task......................| Accuracy | Raw Score
--------------------------+--------------+-----
gsm8k | 92.50% | 37/40
mmlu | 65.00% | 26/40
hellaswag | 67.50% | 27/40
arc_challenge | 87.50% | 35/40
truthfulqa | 65.00% | 26/40 \

============================================== CAPABILITY SCORE | 75.50% | (Mean of 5 core tasks)

Uses

I used it for agentic tasks of NOT the easiest level, where reasoning models bog down, and it worked... well, NOT BAD for a 2GB, i think here the true genius who must be credited is Gökdeniz Gülmez, the dev of JOSIE. Used it for:

'webchat' (see mlx_webchat on pypi, basic addition i made for chat with internet search and fetch tools);
augmented batch translation of long documents (pipeline: chunk to specified length, 12k chars (not tokens!) in my case => translate sequentially => write in specified format); since i work in academic institution, sometimes i need to draft bilingual official boring shit or rewrite syllabus billion times per semester, and this waste of virtual space requires precise terms consistently used in translations (you know, the rule if term_t == translated_t is x, then no not-x should appear for term_t until the end of translated document), so i've added "exact_terms" that are passed to model additionally, two layers of instructions (system prompt => translation session instructions => prompt +, IF it reads from file, [DO NOT TRANSLATE, READ AND FOLLOW] components can be inserted as well); i enumerated this to give you some undersdtanding that the model did not ignore any of the aforementioned and used those/considered everything when something of this was passed; translation sample below gives an understanding that the text it translated from Ukrainian sometimes were not just official papers as well, but, like, 12-font-sized 90 pages of text like this (source language is Ukrainian):

"...The aim of these two thought experiments was to demonstrate one simple thesis: functional homogeneity of species representation of the Real can easily close access to it through an epistemologically-ontological distinction that is not justified when something is thought of as two distinct entities that are actually one (of course, this is not the only negative consequence, but rather the one that interests us in this particular study). The characteristic difference between species representation of reality and object-oriented representation (and the dominant human representation is not only species-based but species-gender-based) is that ontological differences lead to an ontological privilege of one being over another when there is actually no such privilege epistemologically. The ultimate consequence of this false positing of privileges, which is a result of false assumption, is that our primary understanding in the conditions of cognitive genesis within species-oriented culture is, as ..."
honestly, not bad for an on-device-sized model, humans would have done worse, that is why i usually prefer translating everything by myself: once i tried to give it to translator freelancer but while crafting exact terms and instruction realized that this is basically 60% of the document or so, fuck human translators i thought then, i just need a 'warmup draft' of translated text to eliminate blank page paralysis...;
modernization of archaic Wardour Street English (if you do not know what ws-en is, check this shit out: https://en.wikipedia.org/wiki/Wardour_Street_English), not a page or two, but the whole BOOKS, and not JUST books, but... kgm,
'The Night Land' (1912) by William Hope Hodgson;
'The Worm Ouroboros' (1922) by E. R. Eddison;
'Odyssey' t...ranslation? interpretation by T.E. Lawrence 'Of Arabia' (1932)... (let's say just that it is nice that Nolan has not used this version of text as a canonic source);
'The White Company' (1891) by sir Arthur Conan Doyle;
'The Wood Beyond the World' (1894) and 'The Well at the World's End' (1896) by William Morris, arguably, the most nauseous of all these... Absolutely unreadable!
I enumerated them for an illustration of the length and complexity of the source. The model batch adapted some ≈6300 chunks-aka-samples for my upcoming dataset Wardour Forge which you (and i) will joyfully use to teach models destroy any English text converting it into Wardour Street English (with awe and irritation, i found myself on the point of observation that there is NO such dataset so far, and i do not like ideas of this kind pile in the head distracting from something more important; so far, i am only fucking with more granular than i have chunking because i do not want a model to learn 'whath doth m'lady Mirdath the Beautiful hath uttert in eareth of mine own afterwards kisst and... AND LO!' -- i want the model to generalize grammar and syntax, morphology and transition rules without learning the semantic content... but because of stupid 'AND LO!' sentences in Hodgson, for instance – he was a son of reverend, so this AND LO-eing is from King James BYble... – do not let me sentence-to-sentence map properly because the model righteously skipped most of the LO's, generally fairly decently adapting this english for english natives as well, better than two manmade ah-dah-ptah-tions of Hodgson published so far – instead of lexicon and syntax they introduced DIALOGUES (the original book is entirely 550 pages of internal monologue/stream of consciousness), gave nameless characters names, so it in fact poisoned the original; not the case here...).

TEST YOURSELF, SHARE RESULTS!

Downloads last month: -; Downloads are not tracked for this model. How to track

Model tree for mlx-community/josie-1.1-4b-mlx-OptiQ-4bit

Base model

Qwen/Qwen3-4B-Instruct-2507

Finetuned

Goekdeniz-Guelmez/JOSIE-1.1-4B-Instruct

Finetuned

mlx-community/JOSIE-1.1-4B-Instruct-bfloat16

Finetuned

(1)

this model