Update tokenizer_config.json

#101
by Akshay47 - opened

Your current tokenizer config:

"unk_token": null

This means no unknown token is defined, which can be risky: when the tokenizer meets an out-of-vocabulary (OOV) token, it has no fallback to map it to.

This update defines the unk_token, enabling the tokenizer to:

  1. Prevent crashes or undefined behavior when unknown tokens are encountered.
  2. Ensure compatibility with libraries that expect a defined unk_token.
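
As a minimal sketch of what the change amounts to, assuming the config is a plain JSON object: the helper name `ensure_unk_token` and the `"<unk>"` string are hypothetical placeholders. Whatever token is actually chosen would need to exist in the model's vocabulary, which this sketch does not check.

```python
import json


def ensure_unk_token(config: dict, unk_token: str) -> dict:
    """Return a copy of a tokenizer config with unk_token set if it was null or missing."""
    updated = dict(config)  # shallow copy so the caller's dict is left untouched
    if updated.get("unk_token") is None:
        updated["unk_token"] = unk_token
    return updated


# In-memory example; "<unk>" is a placeholder, not the model's real token.
current = json.loads('{"unk_token": null}')
patched = ensure_unk_token(current, "<unk>")
print(json.dumps(patched))  # {"unk_token": "<unk>"}
```

A config that already defines an unk_token is returned unchanged, so the helper is safe to run more than once.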


Thanks for the clarification.

You're right that for modern subword tokenizers, unk_token = null can be intentional and not necessarily an issue if the tokenizer is designed to avoid true OOV cases.
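
To illustrate why a null unk_token can be safe, here is a sketch that assumes a byte-level base vocabulary (an assumption for illustration; this thread does not describe the tokenizer's internals). Every string decomposes into UTF-8 bytes, so 256 base symbols cover any input and a fallback token is never needed. `to_byte_ids` is a hypothetical helper, not a library function.

```python
def to_byte_ids(text: str) -> list[int]:
    """Map text to its UTF-8 byte values, each in range(256)."""
    return list(text.encode("utf-8"))


# Characters a word-level vocabulary would likely miss still map to known base symbols.
ids = to_byte_ids("naïve 🙂")
assert all(0 <= i < 256 for i in ids)

# The mapping is lossless, so the original text can be recovered.
assert bytes(ids).decode("utf-8") == "naïve 🙂"
```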

My intention with this PR was mainly to improve downstream compatibility for tooling that expects unk_token to be defined, not to suggest the current tokenizer is inherently broken.

That said, if null is intentional for DeepSeek-V3 and aligns with the tokenizer/vocab design, then I completely understand. In that case, I'm happy to close or adjust the PR accordingly. I mainly wanted to flag the potential interoperability concern for some integrations.
