nvidia-oliver-holworthy committed 27 days ago

Commit 382fc3a (unverified) · 1 Parent(s): 5cebd3b

Fix MBartDecoderLayer forward pass for transformers 5.x compatibility

Detect the renamed `past_key_values` parameter (introduced in ~4.57) and
route through a separate call path that passes the Cache object and handles
both true 5.x (single-Tensor return) and intermediate versions (tuple return)
via an isinstance guard. Backward compatibility with 4.51.x is preserved
through the original singular-param branch.

Signed-off-by: Oliver Holworthy <nvidia-oliver-holworthy@users.noreply.huggingface.co>
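
The detection strategy described in the commit message can be illustrated in isolation. The following is a minimal sketch, not the code from this commit: `OldLayer`, `NewLayer`, `FakeTensor`, `takes_plural_past_kv`, and `call_layer` are hypothetical stand-ins chosen so the example runs without `transformers` or `torch` installed. It shows the two ideas the message names: checking a `forward` signature for the plural `past_key_values` parameter, and using an `isinstance` guard to accept either a bare tensor or a tuple as the return value.

```python
import inspect


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor."""

    def __init__(self, name):
        self.name = name


class OldLayer:
    """Stand-in for the older API: singular parameter, tuple return."""

    def forward(self, hidden_states, past_key_value=None, use_cache=False):
        return (FakeTensor("old"), None)


class NewLayer:
    """Stand-in for the newer API: plural parameter, bare tensor return."""

    def forward(self, hidden_states, past_key_values=None, use_cache=False):
        return FakeTensor("new")


def takes_plural_past_kv(layer_cls):
    """Check whether forward() accepts the renamed `past_key_values` parameter."""
    return "past_key_values" in inspect.signature(layer_cls.forward).parameters


def call_layer(layer, hidden_states, cache=None):
    """Dispatch on the detected signature and normalise the return value."""
    if takes_plural_past_kv(type(layer)):
        out = layer.forward(hidden_states, past_key_values=cache, use_cache=True)
    else:
        out = layer.forward(hidden_states, past_key_value=cache, use_cache=True)
    # A bare tensor is used directly; a tuple carries the tensor first.
    return out if isinstance(out, FakeTensor) else out[0]


old_result = call_layer(OldLayer(), FakeTensor("input"))
new_result = call_layer(NewLayer(), FakeTensor("input"))
print(old_result.name, new_result.name)
```

Inspecting the signature once and storing the result, as the commit does with a module-level flag, avoids repeating the check on every forward pass.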

Files changed (1)

hf_nemotron_parse_modeling.py +132 -34

@@ -23,6 +23,38 @@ from transformers.modeling_attn_mask_utils import (
     _prepare_4d_causal_attention_mask_for_sdpa,
 )
+# ---------------------------------------------------------------------------
+# Cache compatibility (transformers 5.x introduced Cache objects;
+# 4.x used plain tuple-of-tuples for past_key_values)
+# ---------------------------------------------------------------------------
+import inspect
+try:
+    from transformers.cache_utils import Cache as _CacheBase
+    def _is_cache_object(obj) -> bool:
+        return isinstance(obj, _CacheBase)
+except ImportError:
+    def _is_cache_object(obj) -> bool:
+        return False
+def _past_key_values_length(past_key_values) -> int:
+    """Return the number of already-decoded tokens regardless of cache format."""
+    if past_key_values is None:
+        return 0
+    if _is_cache_object(past_key_values):
+        return past_key_values.get_seq_length()
+    return past_key_values[0][0].shape[2]
+# ---------------------------------------------------------------------------
+# MBartDecoderLayer API detection
+#
+# transformers <~4.57: forward() takes `past_key_value` (singular), returns a
+#   tuple (hidden_states, [attentions], [present_key_value])
+# transformers >=~4.57: forward() takes `past_key_values` (plural, Cache).
+#   True 5.x returns a single torch.Tensor (cache updated in-place);
+#   intermediate versions (e.g. 4.57.x) still return a tuple.
+# ---------------------------------------------------------------------------
+_layer_takes_plural_past_kv = 'past_key_values' in inspect.signature(MBartDecoderLayer.forward).parameters
 class NemotronParseDecoder(MBartPreTrainedModel):
     """
@@ -47,7 +79,11 @@ class NemotronParseDecoder(MBartPreTrainedModel):
         if embed_tokens is not None:
             self.embed_tokens.weight = embed_tokens.weight
-        self.layers = nn.ModuleList([MBartDecoderLayer(config) for _ in range(config.decoder_layers)])
+        _layer_supports_idx = 'layer_idx' in inspect.signature(MBartDecoderLayer.__init__).parameters
+        self.layers = nn.ModuleList([
+            MBartDecoderLayer(config, layer_idx=i) if _layer_supports_idx else MBartDecoderLayer(config)
+            for i in range(config.decoder_layers)
+        ])
         self.config = config
         self.layernorm_embedding = nn.LayerNorm(config.d_model)
@@ -163,8 +199,8 @@ class NemotronParseDecoder(MBartPreTrainedModel):
         else:
             raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
-        # past_key_values_length
-        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
+        # past_key_values_length — works with both tuple-of-tuples (4.x) and Cache objects (5.x)
+        past_key_values_length = _past_key_values_length(past_key_values)
         if inputs_embeds is None:
             inputs_embeds = self.embed_tokens(input_ids)
@@ -221,7 +257,22 @@ class NemotronParseDecoder(MBartPreTrainedModel):
         all_hidden_states = () if output_hidden_states else None
         all_self_attns = () if output_attentions else None
         all_cross_attentions = () if (output_attentions and encoder_hidden_states is not None) else None
-        next_decoder_cache = () if use_cache else None
+        # In 5.x the Cache object is updated in-place by each layer, so we just
+        # carry the same object through.  In 4.x we collect per-layer tuples.
+        _using_cache_obj = _is_cache_object(past_key_values)
+        next_decoder_cache = past_key_values if (_using_cache_obj and use_cache) else (() if use_cache else None)
+        # 5.x: on the first call (past_key_values=None), create an EncoderDecoderCache
+        # so each MBartAttention layer can populate cross-/self-attention KV states
+        # in-place.  This enables proper KV caching during multi-step generation.
+        if _layer_takes_plural_past_kv and use_cache and past_key_values is None:
+            try:
+                from transformers.cache_utils import EncoderDecoderCache, DynamicCache
+                past_key_values = EncoderDecoderCache(DynamicCache(), DynamicCache())
+                _using_cache_obj = True
+                next_decoder_cache = past_key_values
+            except (ImportError, AttributeError, TypeError):
+                pass  # fallback: layers recompute KV each step (correct but slower)
         # check if head_mask/cross_attn_head_mask has a correct number of layers specified if desired
         for attn_mask, mask_name in zip([head_mask, cross_attn_head_mask], ["head_mask", "cross_attn_head_mask"]):
@@ -240,45 +291,68 @@ class NemotronParseDecoder(MBartPreTrainedModel):
                 if dropout_probability < self.layerdrop:
                     continue
-            past_key_value = past_key_values[idx] if past_key_values is not None else None
-            if self.gradient_checkpointing and self.training:
-                layer_outputs = self._gradient_checkpointing_func(
-                    decoder_layer.__call__,
-                    hidden_states,
-                    attention_mask,
-                    encoder_hidden_states,
-                    encoder_attention_mask,
-                    head_mask[idx] if head_mask is not None else None,
-                    cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None,
-                    None,
-                    output_attentions,
-                    use_cache,
-                )
-            else:
+            if _layer_takes_plural_past_kv:
+                # Plural-param API: cache updated in-place, nothing to collect.
                 layer_outputs = decoder_layer(
                     hidden_states,
                     attention_mask=attention_mask,
                     encoder_hidden_states=encoder_hidden_states,
                     encoder_attention_mask=encoder_attention_mask,
-                    layer_head_mask=(head_mask[idx] if head_mask is not None else None),
-                    cross_attn_layer_head_mask=(
-                        cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None
-                    ),
-                    past_key_value=past_key_value,
-                    output_attentions=output_attentions,
+                    past_key_values=past_key_values if use_cache else None,
                     use_cache=use_cache,
                 )
-            hidden_states = layer_outputs[0]
-            if use_cache:
-                next_decoder_cache += (layer_outputs[3 if output_attentions else 1],)
-            if output_attentions:
-                all_self_attns += (layer_outputs[1],)
-                if encoder_hidden_states is not None:
-                    all_cross_attentions += (layer_outputs[2],)
+                # True 5.x returns a single Tensor; intermediate versions
+                # (e.g. 4.57.x) have the renamed parameter but still return
+                # a tuple — handle both.
+                hidden_states = layer_outputs if isinstance(layer_outputs, torch.Tensor) else layer_outputs[0]
+            else:
+                # Singular-param API: returns a tuple, collect cache per-layer.
+                if past_key_values is None:
+                    past_key_value = None
+                elif _using_cache_obj:
+                    past_key_value = past_key_values  # full Cache object
+                else:
+                    past_key_value = past_key_values[idx]  # per-layer tuple
+                if self.gradient_checkpointing and self.training:
+                    layer_outputs = self._gradient_checkpointing_func(
+                        decoder_layer.__call__,
+                        hidden_states,
+                        attention_mask,
+                        encoder_hidden_states,
+                        encoder_attention_mask,
+                        head_mask[idx] if head_mask is not None else None,
+                        cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None,
+                        None,
+                        output_attentions,
+                        use_cache,
+                    )
+                else:
+                    layer_outputs = decoder_layer(
+                        hidden_states,
+                        attention_mask=attention_mask,
+                        encoder_hidden_states=encoder_hidden_states,
+                        encoder_attention_mask=encoder_attention_mask,
+                        layer_head_mask=(head_mask[idx] if head_mask is not None else None),
+                        cross_attn_layer_head_mask=(
+                            cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None
+                        ),
+                        past_key_value=past_key_value,
+                        output_attentions=output_attentions,
+                        use_cache=use_cache,
+                    )
+                hidden_states = layer_outputs[0]
+                if use_cache and not _using_cache_obj:
+                    # 4.x: cache is the last element of layer_outputs.
+                    cache_idx = 3 if output_attentions else 1
+                    if len(layer_outputs) > cache_idx:
+                        next_decoder_cache += (layer_outputs[cache_idx],)
+                if output_attentions:
+                    all_self_attns += (layer_outputs[1],)
+                    if encoder_hidden_states is not None:
+                        all_cross_attentions += (layer_outputs[2],)
         hidden_states = self.layer_norm(hidden_states)
@@ -533,6 +607,30 @@ class NemotronParseForConditionalGeneration(NemotronParsePreTrainedModel, Genera
             encoder_attentions=encoder_outputs.attentions,
         )
+    def prepare_inputs_for_generation(
+        self,
+        input_ids,
+        past_key_values=None,
+        attention_mask=None,
+        use_cache=None,
+        encoder_outputs=None,
+        **kwargs,
+    ):
+        if past_key_values is not None:
+            past_length = _past_key_values_length(past_key_values)
+            if input_ids.shape[1] > past_length:
+                input_ids = input_ids[:, past_length:]
+            else:
+                input_ids = input_ids[:, -1:]
+        return {
+            "pixel_values": None,          # encoder_outputs carries the image features
+            "encoder_outputs": encoder_outputs,
+            "past_key_values": past_key_values,
+            "decoder_input_ids": input_ids,
+            "decoder_attention_mask": attention_mask,
+            "use_cache": use_cache,
+        }
     def prepare_decoder_input_ids_from_labels(self, labels: torch.Tensor):
         return shift_tokens_right(labels, self.config.pad_token_id, self.config.decoder_start_token_id)
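
The interplay between the cache-length helper and the input trimming in `prepare_inputs_for_generation` may be easier to follow on plain Python values. This is a hedged sketch rather than the commit's code: `FakeKV`, `FakeCache`, `cached_length`, and `trim_input_ids` are hypothetical names, nested lists stand in for the `input_ids` tensor, and duck typing on `get_seq_length` is an assumption that replaces the `isinstance` check against `Cache`.

```python
class FakeKV:
    """Hypothetical stand-in for a key/value tensor of shape (batch, heads, seq, dim)."""

    def __init__(self, seq_len):
        self.shape = (1, 2, seq_len, 4)


class FakeCache:
    """Hypothetical stand-in for a Cache object exposing get_seq_length()."""

    def __init__(self, seq_len):
        self._seq_len = seq_len

    def get_seq_length(self):
        return self._seq_len


def cached_length(past_key_values):
    """Number of already-decoded tokens for either cache format."""
    if past_key_values is None:
        return 0
    # Assumption: duck typing replaces the isinstance check against Cache.
    if hasattr(past_key_values, "get_seq_length"):
        return past_key_values.get_seq_length()
    # Legacy format: tuple of per-layer (key, value) tuples.
    return past_key_values[0][0].shape[2]


def trim_input_ids(input_ids, past_key_values):
    """Keep only the tokens the cache has not seen, per batch row."""
    if past_key_values is None:
        return input_ids
    past_length = cached_length(past_key_values)
    if len(input_ids[0]) > past_length:
        return [row[past_length:] for row in input_ids]
    # Cache already covers every token: feed only the last one.
    return [row[-1:] for row in input_ids]


legacy_cache = ((FakeKV(3), FakeKV(3)),)
ids = [[10, 11, 12, 13]]
print(trim_input_ids(ids, None))
print(trim_input_ids(ids, legacy_cache))
print(trim_input_ids(ids, FakeCache(2)))
```

With no cache the full sequence is passed through; once a cache exists, only the suffix it has not yet seen is fed to the decoder, which is what keeps each generation step to a small number of new tokens.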