---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- pytorch
- masked-autoencoder
library_name: mdiffae
---

# mdiffae_v1

> **DEPRECATED:** This model is superseded by [**SemDisDiffAE**](https://huggingface.co/data-archetype/semdisdiffae), which offers better reconstruction quality, better downstream diffusion convergence, and slightly faster inference.

> **[mDiffAE v2](https://huggingface.co/data-archetype/mdiffae-v2) is also available but likewise superseded by SemDisDiffAE.** Compared with v1, it offers substantially better reconstruction (+1.7 dB mean PSNR) with the same or better downstream convergence.
>
> | Version | Mean PSNR (2k images) | Bottleneck | Decoder |
> |---|---|---|---|
> | [**mDiffAE v2**](https://huggingface.co/data-archetype/mdiffae-v2) (recommended over v1) | **35.81 dB** | 96ch (8x) | 8 blocks (skip-concat) |
> | mDiffAE v1 (this repo) | 34.15 dB | 64ch (12x) | 4 blocks (flat) |

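To put the PSNR gap in the table above in perspective, a difference in decibels can be converted to a ratio of mean squared errors using PSNR = 10 · log10(MAX² / MSE). The sketch below assumes both versions are scored with the same peak value; the helper name `mse_ratio_from_psnr_gap` is illustrative and not part of the `m_diffae` package. Because the table reports PSNR averaged over images, the result is indicative only.

```python
def mse_ratio_from_psnr_gap(psnr_better_db: float, psnr_worse_db: float) -> float:
    """Return MSE_worse / MSE_better for two PSNR values measured with the same peak.

    Follows from PSNR = 10 * log10(MAX^2 / MSE): the peak term cancels in the difference.
    """
    gap_db = psnr_better_db - psnr_worse_db
    return 10.0 ** (gap_db / 10.0)


# Values taken from the version table (mean PSNR over 2k images).
gap = 35.81 - 34.15
ratio = mse_ratio_from_psnr_gap(35.81, 34.15)
print(f"gap = {gap:.2f} dB, v1 MSE is about {ratio:.2f}x the v2 MSE")
```

Under these assumptions, the 1.66 dB gap corresponds to v1 having roughly 1.47 times the squared error of v2.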
**mDiffAE**: **M**asked **Diff**usion **A**uto**E**ncoder.
A fast, single-GPU-trainable diffusion autoencoder with a **64-channel**
spatial bottleneck. It uses decoder token masking as an implicit regularizer
instead of REPA alignment.

This variant (mdiffae_v1) has 81.4M parameters and a 310.6 MB weight file.
Bottleneck: **64 channels** at patch size 16 (compression ratio 12x).

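The headline numbers can be checked with a few lines of arithmetic. This is a minimal sketch under stated assumptions: the compression ratio is taken to be the count of RGB input values per patch divided by the number of bottleneck channels, the weights are assumed to be stored as 32-bit floats, and "MB" is assumed to mean MiB (2^20 bytes). The function names are illustrative; the parameter count comes from the Architecture table.

```python
PATCH_SIZE = 16      # from the model card
RGB_CHANNELS = 3     # input images are [B, 3, H, W]


def compression_ratio(bottleneck_channels: int, patch_size: int = PATCH_SIZE) -> float:
    """Input values per patch divided by latent values per patch (assumed definition)."""
    return (patch_size * patch_size * RGB_CHANNELS) / bottleneck_channels


def fp32_size_mib(num_params: int) -> float:
    """Approximate weight size in MiB, assuming 4 bytes per parameter."""
    return num_params * 4 / 2**20


for channels in (64, 96, 128):
    print(channels, "channels ->", compression_ratio(channels), "x")

print(round(fp32_size_mib(81_410_624), 1), "MiB")
```

With these definitions, 64, 96, and 128 channels give 12x, 8x, and 6x, matching the ratios quoted for mDiffAE v1, mDiffAE v2, and iRDiffAE, and the parameter count reproduces the 310.6 figure.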
## Documentation

- [Technical Report](technical_report_mdiffae.md): architecture, masking strategy, and results
- [iRDiffAE Technical Report](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md): full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
- [Results (interactive viewer)](https://huggingface.co/spaces/data-archetype/mdiffae-results): full-resolution side-by-side comparison

## Quick Start

```python
import torch
from m_diffae import MDiffAE

# Load from HuggingFace Hub (or a local path)
model = MDiffAE.from_pretrained("data-archetype/mdiffae_v1", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
H, W = images.shape[-2:]  # spatial size, needed again when decoding
latents = model.encode(images)

# Decode (1 step by default, PSNR-optimal)
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 1-step decode)
recon = model.reconstruct(images)
```

> **Note:** Hub downloads require `pip install huggingface_hub safetensors`.
> You can also pass a local directory path to `from_pretrained()`.

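Since `encode` expects a height and width divisible by 16, arbitrary image sizes need to be cropped (or padded) first. The helpers below are a hypothetical sketch, not part of `m_diffae`: they compute a center-crop box with valid dimensions and the latent shape one would expect, assuming latents are laid out as `[B, 64, H/16, W/16]` (inferred from the 64-channel bottleneck and patch size 16, not confirmed by the package).

```python
PATCH_SIZE = 16           # from the model card
BOTTLENECK_CHANNELS = 64  # from the model card


def snap_down(size: int, multiple: int = PATCH_SIZE) -> int:
    """Largest multiple of `multiple` that is <= size."""
    if size < multiple:
        raise ValueError(f"size {size} is smaller than one patch ({multiple})")
    return (size // multiple) * multiple


def center_crop_box(height: int, width: int) -> tuple:
    """Return (top, left, new_height, new_width) for a center crop to valid dimensions."""
    new_h, new_w = snap_down(height), snap_down(width)
    return (height - new_h) // 2, (width - new_w) // 2, new_h, new_w


def expected_latent_shape(batch: int, height: int, width: int) -> tuple:
    """Assumed latent layout [B, C, H/16, W/16]; requires valid input dimensions."""
    if height % PATCH_SIZE or width % PATCH_SIZE:
        raise ValueError("height and width must be divisible by the patch size")
    return batch, BOTTLENECK_CHANNELS, height // PATCH_SIZE, width // PATCH_SIZE


top, left, new_h, new_w = center_crop_box(500, 333)
print(top, left, new_h, new_w)
print(expected_latent_shape(2, new_h, new_w))
```

With a tensor batch, the box can be applied as `images[..., top:top + new_h, left:left + new_w]` before calling `model.encode`.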
## Architecture

| Property | Value |
|---|---|
| Parameters | 81,410,624 |
| File size | 310.6 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 4 |
| Decoder topology | Flat sequential (no skip connections) |
| Bottleneck dim | 64 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG mechanism | Token-level masking (ratio 0.75) |
| Training regularizer | Decoder token masking (75% ratio, 50% apply prob) |

**Encoder**: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by
DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with
learned residual gates.

**Decoder**: VP diffusion conditioned on encoder latents and timestep via
shared-base + per-layer low-rank AdaLN-Zero. 4 flat sequential blocks
(no skip connections).

**Compared to iRDiffAE**: iRDiffAE uses an 8-block decoder (2 start + 4 middle
+ 2 end) with skip connections and 128 bottleneck channels (needed partly because
REPA occupies half the channels). mDiffAE uses 4 flat blocks with no skip
connections and 64 bottleneck channels (12x compression vs. iRDiffAE's 6x),
which gives better channel utilisation.

### Key Differences from iRDiffAE

| Aspect | iRDiffAE v1 | mDiffAE v1 |
|---|---|---|
| Bottleneck dim | 128 | **64** |
| Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
| PDG mechanism | Block dropping | **Token masking** |
| Training regularizer | REPA + covariance reg | **Decoder token masking** |

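To make the "75% ratio, 50% apply prob" setting concrete, here is one way such a decoder token mask could be sampled. This is a sketch under assumptions, not the actual training code: it assumes the apply decision is made once per sample, that masked positions are chosen uniformly at random, and it says nothing about what replaces a masked token (see the technical report for the real strategy). The function name is hypothetical.

```python
import random


def sample_decoder_token_mask(num_tokens, mask_ratio=0.75, apply_prob=0.5, rng=None):
    """Return a list of booleans where True marks a masked decoder token.

    With probability `apply_prob` a fraction `mask_ratio` of tokens is masked;
    otherwise no token is masked for this sample.
    """
    rng = rng or random.Random()
    mask = [False] * num_tokens
    if rng.random() >= apply_prob:
        return mask  # masking skipped for this sample
    num_masked = round(num_tokens * mask_ratio)
    for idx in rng.sample(range(num_tokens), num_masked):
        mask[idx] = True
    return mask


# A 256x256 image at patch size 16 gives a 16x16 grid, i.e. 256 tokens.
mask = sample_decoder_token_mask(256, apply_prob=1.0, rng=random.Random(0))
print(sum(mask), "of", len(mask), "tokens masked")
```

Setting `apply_prob=1.0` in the usage line forces masking so the count is easy to inspect; with the default of 0.5, about half of the calls return an all-False mask.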
## Recommended Settings

Best quality is achieved with **1 DDIM step** and PDG disabled.
PDG can sharpen images, but its strength should be kept very low (1.01–1.05).

| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |
| PDG strength (if enabled) | 1.05 |

```python
from m_diffae import MDiffAEInferenceConfig

# Reuses `model`, `latents`, `H`, and `W` from the Quick Start example.
# PSNR-optimal (fast, 1 step)
cfg = MDiffAEInferenceConfig(num_steps=1, sampler="ddim")
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```

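For intuition on why strengths just above 1.0 count as "very low", the sketch below assumes PDG follows the common guidance form, in which the prediction from the full decoder is extrapolated away from a weaker prediction (here, one made with masked tokens). That form is an assumption and is not confirmed by this model card; `guidance_extrapolate` is an illustrative helper, not an `m_diffae` function.

```python
def guidance_extrapolate(full_pred, weak_pred, strength):
    """Assumed guidance rule: weak + strength * (full - weak), applied elementwise.

    strength == 1.0 returns the full prediction unchanged; values above 1.0
    push the result further away from the weak prediction.
    """
    return [w + strength * (f - w) for f, w in zip(full_pred, weak_pred)]


full = [0.5, -0.25]   # toy values standing in for the full decoder output
weak = [0.25, 0.0]    # toy values standing in for the token-masked output

print(guidance_extrapolate(full, weak, 1.0))   # same as `full`
print(guidance_extrapolate(full, weak, 1.05))  # 5% beyond `full`, away from `weak`
```

Under this assumed rule, a strength of 1.05 adds only 5% of the difference between the two predictions on top of the full one, which is consistent with the advice to keep the value close to 1.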
## Citation

```bibtex
@misc{m_diffae,
  title = {mDiffAE: A Fast Masked Diffusion Autoencoder},
  author = {data-archetype},
  year = {2026},
  month = mar,
  url = {https://huggingface.co/data-archetype/mdiffae_v1},
}
```

## Dependencies

- PyTorch >= 2.0
- safetensors (for loading weights)

## License

Apache 2.0