# ExecuTorch

[ExecuTorch](https://docs.pytorch.org/executorch/stable/index.html) is a lightweight runtime for model inference on edge devices. It exports a PyTorch model into a portable, ahead-of-time format, and a small C++ runtime plans memory and dispatches operations to hardware-specific backends. Execution and memory behavior are known before the model runs on device, so inference overhead stays low.

Export a Transformers model with the [optimum-executorch](https://huggingface.co/docs/optimum-executorch/en/index) library.

```bash
optimum-cli export executorch \
    --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
    --task "text-generation" \
    --recipe "xnnpack" \
    --output_dir="./smollm2_exported"
```
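
The same export runs from Python: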

```py
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

# Export the model to an ExecuTorch program targeting the XNNPACK backend
model = ExecuTorchModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M-Instruct",
    recipe="xnnpack",
)

# Write the exported program (model.pte) to the output directory
model.save_pretrained("./smollm2_exported")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
```
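
Once exported, the model runs inference on the ExecuTorch runtime directly from Python. A minimal sketch, assuming the `./smollm2_exported` directory created above can be reloaded with `from_pretrained()` and using optimum-executorch's `text_generation()` helper:

```py
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

# Reload the exported program from the local directory
model = ExecuTorchModelForCausalLM.from_pretrained("./smollm2_exported")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

# Decode up to max_seq_len tokens on the ExecuTorch runtime
generated = model.text_generation(
    tokenizer=tokenizer,
    prompt="The capital of France is",
    max_seq_len=64,
)
print(generated)
```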

## Transformers integration

The export and inference flow uses several Transformers components along with Optimum and the ExecuTorch runtime.

1. [from_pretrained()](/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) loads the model weights in safetensors format.
2. Optimum applies graph optimizations, traces the model with [torch.export](https://docs.pytorch.org/docs/stable/export.html), and lowers it to a `model.pte` file targeting your hardware backend.
3. [AutoTokenizer](/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer) or [AutoProcessor](/docs/transformers/main/en/model_doc/auto#transformers.AutoProcessor) loads the tokenizer or processor files and runs during inference.
4. At runtime, a C++ runner class executes the `.pte` file on the ExecuTorch runtime, as shown in the example below.

```c++
#include <iostream>

#include <executorch/extension/llm/runner/text_llm_runner.h>

using namespace executorch::extension::llm;

int main() {
  // Load tokenizer and create runner
  auto tokenizer = load_tokenizer("path/to/tokenizer.json", nullptr, std::nullopt, 0, 0);
  auto runner = create_text_llm_runner("path/to/model.pte", std::move(tokenizer));

  // Load the model
  runner->load();

  // Configure generation
  GenerationConfig config;
  config.max_new_tokens = 100;
  config.temperature = 0.8f;

  // Generate text with streaming output
  runner->generate("The capital of France is", config,
    [](const std::string& token) { std::cout << token << std::flush; },
    nullptr);

  return 0;
}
```
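
The lambda passed to `generate()` streams each decoded token to stdout as it arrives; the trailing `nullptr` leaves the runner's optional statistics callback unset.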

## Resources

- [ExecuTorch](https://docs.pytorch.org/executorch/stable/index.html) docs
- [torch.export](https://docs.pytorch.org/docs/stable/export.html) docs
- [Exporting to production](../serialization#executorch) guide

