Question about TeichAI JSON gen

#2
by CryptoAIM - opened

So TeichAI has made a JSON generator, right? Wasn't it tool gen or something? Anyway, can it also just create the prompts and topics? I'm still debating what kind of LLM monster I'm buying or building and don't have the money yet, but I'll probably spend a lot of time creating datasets from open-source models for distillation. Also, you used a unique approach to distilling. Is there anything else you'd recommend? Style imitation is good and all, but I'd still like some actual new knowledge in these models. Or should I just continue pre-training of instruct models and post-train using distillation, or what?

TeichAI org

If you want your model to learn more, continue pretraining.
If you want your model to predict better tokens, i.e. to make better use of what it already knows, then do proper distillation.
What I mean by proper distillation:
Use a local model, run prompts through it, get the top 50 tokens it considered at every position of its output, and train on all of that data combined. This will result in a much smarter model. Currently TeichAI does not provide these types of datasets.
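The extraction step described above can be sketched in a few lines. This is a minimal illustration using random tensors in place of real teacher logits; in practice the logits would come from a forward pass of the local teacher model (e.g. `model(input_ids).logits` with Hugging Face `transformers`), and the vocabulary would be the teacher's real vocab.

```python
# Sketch: extract the top-k candidate tokens per output position.
# `logits` is a stand-in for a teacher model's output logits.
import torch

torch.manual_seed(0)
vocab_size, seq_len, k = 1000, 4, 50

logits = torch.randn(seq_len, vocab_size)          # [positions, vocab]
probs = torch.softmax(logits, dim=-1)              # per-position distribution
top_probs, top_ids = torch.topk(probs, k, dim=-1)  # top-k per position

# Each training record stores, for every output position, the k most
# likely token ids and their probabilities.
records = [
    {"position": i,
     "token_ids": top_ids[i].tolist(),
     "probs": top_probs[i].tolist()}
    for i in range(seq_len)
]
print(len(records), len(records[0]["token_ids"]))  # 4 50
```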

TeichAI org

What proper distillation data looks like ->
Existing output: "The capital of France is "
Token: "Paris"
Probability: 0.85
Top 50 Alternatives:

  1. London (0.05)
  2. Berlin (0.03)
  3. Rome (0.02)
    ...

This teaches the model what could plausibly be said at every position, and how to recover if it emits a wrong token :)
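Training on data like the example above usually means minimizing a KL-divergence term between the teacher's stored top-k distribution and the student's predictions at the same position. Here is a minimal sketch for a single position, with random tensors standing in for real student logits and stored teacher top-k data:

```python
# Sketch: KL loss between a stored top-k teacher distribution and the
# student's predictions at one output position. Tensors are random
# stand-ins for real model outputs.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, k = 1000, 50

student_logits = torch.randn(vocab_size)
teacher_probs, teacher_ids = torch.topk(
    torch.softmax(torch.randn(vocab_size), dim=-1), k)

# Renormalize the truncated teacher distribution so it sums to 1.
teacher_probs = teacher_probs / teacher_probs.sum()

# KL(teacher || student), restricted to the teacher's top-k token ids.
student_logprobs = F.log_softmax(student_logits, dim=-1)[teacher_ids]
kl = torch.sum(teacher_probs * (torch.log(teacher_probs) - student_logprobs))
```

In a real training loop `kl` would be averaged over all positions and backpropagated through the student; restricting to the top k is what makes storing the teacher's distribution feasible.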

TeichAI org

Yeah, if you're just doing open-source models, the best and most thorough method of distillation is logits distillation. I suggest you read more about that; it can also transfer knowledge, though I think you will be limited to distilling large models into smaller models from the same family (e.g. Qwen3.5 27B -> Qwen3.5 4B).

Seriously? What if I did mixed RL? 2/3 distillation and 1/3 RLVR, so it still learns self-correction? I'll probably do a lot with models of around the same size but different architectures, things like that. That would kill logits distillation for me.

Unless the two models share the same vocabulary, but not many models do, so yeah.

Can you slowly change a model's vocab? Pre-phase (no training): benchmark the model to see which vocab tokens it uses and how often. First phase: RLVR while slowly swapping out the vocab (least-used tokens first). The problem is the major tokens may be important, so you'd also do the same for the teacher (less aggressively), until both have the same vocab?

Or are there other workarounds?

Maybe, instead of using the other logits, sample one token per sentence and let the teacher generate multiple candidate continuations, then use the teacher or a reward model to rank the candidates?

Depends on your definition of slowly; it would basically be slower than starting from scratch, since the model would still carry representations for tokens that no longer exist.

As for your second question: it could be kind of interesting, but you would have to generate the actual probabilities, and that could be kind of unreliable.

Fast Vocabulary Transfer (FVT) is Gemini's recommendation.

Seems promising
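For context, the core FVT trick is to initialize each new-vocabulary token's embedding as the mean of the embeddings of the old-tokenizer pieces it decomposes into. A minimal sketch, with toy vocabularies and a hypothetical `old_tokenize` standing in for a real tokenizer:

```python
# Sketch of the Fast Vocabulary Transfer initialization: a new token's
# embedding is the mean of the old-vocab pieces that compose it.
# Vocabularies and the tokenize function are toy stand-ins.
import torch

torch.manual_seed(0)
dim = 8
old_vocab = {"par": 0, "is": 1, "lon": 2, "don": 3}
old_emb = torch.randn(len(old_vocab), dim)  # old embedding matrix

def old_tokenize(token: str) -> list[int]:
    # Stand-in: split a new token into known old-vocab pieces.
    pieces = {"paris": ["par", "is"], "london": ["lon", "don"]}
    return [old_vocab[p] for p in pieces[token]]

new_vocab = ["paris", "london"]
new_emb = torch.stack([
    old_emb[old_tokenize(t)].mean(dim=0) for t in new_vocab
])
print(new_emb.shape)  # torch.Size([2, 8])
```

The resulting matrix would replace the model's input embeddings before fine-tuning on the new vocabulary.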

Hugging Face built General On-Policy Logit Distillation (GOLD), where you do not even have to change the vocab.

You have to use TRL though.

TeichAI org

That's pretty cool, will need to check it out.

Yeah, you still need logits though…
