Question about TeichAI JSON gen

#2
by CryptoAIM - opened

So TeichAI has made a JSON generator, right? Wasn't it tool gen or something? Anyway, can it also just create the prompts and topics? I'm still debating what kind of LLM monster I'm buying or building and don't have the money yet, but I'll probably spend a lot of time creating datasets from open-source models for distillation. Also, you used a unique approach to distilling. Is there anything else you'd recommend? Style imitation is good and all, but I'd still like some actual new knowledge in these models. Or should I just continue pre-training of instruct models and post-train using distillation, or what?

TeichAI org

If you want your model to learn more, continue pretraining.
If you want your model to predict better tokens, i.e. to make better use of what it already knows, then do proper distillation.
What I mean by proper distillation:
Use a local model, run prompts through it, get the top 50 tokens it considered at every position of its output, and train on all of that data combined. This will result in a much smarter model. Currently TeichAI does not provide these types of datasets.
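The extraction step described above can be sketched in a few lines. This is a minimal illustration using random tensors in place of real teacher logits; in practice the logits would come from a forward pass of the local teacher model (e.g. `model(input_ids).logits` with Hugging Face `transformers`), and the vocabulary would be the teacher's real vocab.

```python
# Sketch: extract the top-k candidate tokens per output position.
# `logits` is a stand-in for a teacher model's output logits.
import torch

torch.manual_seed(0)
vocab_size, seq_len, k = 1000, 4, 50

logits = torch.randn(seq_len, vocab_size)          # [positions, vocab]
probs = torch.softmax(logits, dim=-1)              # per-position distribution
top_probs, top_ids = torch.topk(probs, k, dim=-1)  # top-k per position

# Each training record stores, for every output position, the k most
# likely token ids and their probabilities.
records = [
    {"position": i,
     "token_ids": top_ids[i].tolist(),
     "probs": top_probs[i].tolist()}
    for i in range(seq_len)
]
print(len(records), len(records[0]["token_ids"]))  # 4 50
```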

TeichAI org

What proper distillation data looks like ->
Existing output: "The capital of France is "
Token: "Paris"
Probability: 0.85
Top 50 Alternatives:

  1. London (0.05)
  2. Berlin (0.03)
  3. Rome (0.02)
    ...

This teaches the model what could plausibly be said at every position, and how to recover if it emits a wrong token :)
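Training on data like the example above usually means minimizing a KL-divergence term between the teacher's stored top-k distribution and the student's predictions at the same position. Here is a minimal sketch for a single position, with random tensors standing in for real student logits and stored teacher top-k data:

```python
# Sketch: KL loss between a stored top-k teacher distribution and the
# student's predictions at one output position. Tensors are random
# stand-ins for real model outputs.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, k = 1000, 50

student_logits = torch.randn(vocab_size)
teacher_probs, teacher_ids = torch.topk(
    torch.softmax(torch.randn(vocab_size), dim=-1), k)

# Renormalize the truncated teacher distribution so it sums to 1.
teacher_probs = teacher_probs / teacher_probs.sum()

# KL(teacher || student), restricted to the teacher's top-k token ids.
student_logprobs = F.log_softmax(student_logits, dim=-1)[teacher_ids]
kl = torch.sum(teacher_probs * (torch.log(teacher_probs) - student_logprobs))
```

In a real training loop `kl` would be averaged over all positions and backpropagated through the student; restricting to the top k is what makes storing the teacher's distribution feasible.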

TeichAI org

Yeah, if you're just doing open-source models, the best and most thorough method of distillation is logits distillation. I suggest you read more about that; it can also transfer knowledge, though I think you will be limited to distilling large models into smaller models from the same family (e.g. Qwen3.5 27B -> Qwen3.5 4B).

Seriously? What if I did mixed RL? 2/3 distillation and 1/3 RLVR, so it still learns self-correction? I'll probably do a lot with models of around the same size but different architectures, things like that. That would kill logits distillation for me.

Unless the two models share the same vocabulary, but not many models do, so yeah.

Can you slowly change a model's vocab? Pre-phase (no training): benchmark the model to see which vocab tokens it uses and how often. First phase: RLVR while slowly swapping out the vocab (least-used tokens first). The problem is the major tokens may be important, so you'd also do the same for the teacher (less aggressively), until both have the same vocab?

Or are there other workarounds?

Maybe, instead of using the other logits, sample one token per sentence and let the teacher generate multiple candidate continuations, then use the teacher or a reward model to rank the candidates?

Depends on your definition of slowly; it would basically be slower than starting from scratch, since the model would still carry representations for tokens that no longer exist.

As for your second question: it could be kind of interesting, but you would have to generate the actual probabilities, and that could be kind of unreliable.

Fast Vocabulary Transfer (FVT) is Gemini's recommendation.

Seems promising
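For context, the core FVT trick is to initialize each new-vocabulary token's embedding as the mean of the embeddings of the old-tokenizer pieces it decomposes into. A minimal sketch, with toy vocabularies and a hypothetical `old_tokenize` standing in for a real tokenizer:

```python
# Sketch of the Fast Vocabulary Transfer initialization: a new token's
# embedding is the mean of the old-vocab pieces that compose it.
# Vocabularies and the tokenize function are toy stand-ins.
import torch

torch.manual_seed(0)
dim = 8
old_vocab = {"par": 0, "is": 1, "lon": 2, "don": 3}
old_emb = torch.randn(len(old_vocab), dim)  # old embedding matrix

def old_tokenize(token: str) -> list[int]:
    # Stand-in: split a new token into known old-vocab pieces.
    pieces = {"paris": ["par", "is"], "london": ["lon", "don"]}
    return [old_vocab[p] for p in pieces[token]]

new_vocab = ["paris", "london"]
new_emb = torch.stack([
    old_emb[old_tokenize(t)].mean(dim=0) for t in new_vocab
])
print(new_emb.shape)  # torch.Size([2, 8])
```

The resulting matrix would replace the model's input embeddings before fine-tuning on the new vocabulary.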

Hugging Face built General On-Policy Logit Distillation (GOLD), where you do not even have to change the vocab.

You have to use TRL though.

TeichAI org

That's pretty cool, will need to check it out.

Yeah, you still need logits though…
