	---
	license: apache-2.0
	tags:
	- diffusion
	- autoencoder
	- image-reconstruction
	- pytorch
	- masked-autoencoder
	library_name: mdiffae
	---

	# mdiffae_v1

	> DEPRECATED — This model is superseded by [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae), which offers better reconstruction quality, better downstream diffusion convergence, and slightly faster inference.

	> [mDiffAE v2](https://huggingface.co/data-archetype/mdiffae-v2) is also available but is likewise superseded by SemDisDiffAE. Compared with v1 (this repo), v2 offers substantially better reconstruction (+1.7 dB mean PSNR) with the same or better downstream convergence.
	>
	> | Version | Mean PSNR (2k images) | Bottleneck | Decoder |
	> |---|---|---|---|
	> | [mDiffAE v2](https://huggingface.co/data-archetype/mdiffae-v2) (recommended) | 35.81 dB | 96ch (8x) | 8 blocks (skip-concat) |
	> | mDiffAE v1 (this repo) | 34.15 dB | 64ch (12x) | 4 blocks (flat) |

	mDiffAE — Masked Diffusion AutoEncoder.
	A fast, single-GPU-trainable diffusion autoencoder with a 64-channel
	spatial bottleneck. Uses decoder token masking as an implicit regularizer
	instead of REPA alignment.

	This variant (mdiffae_v1): 81.4M parameters, 310.6 MB.
	Bottleneck: 64 channels at patch size 16 (compression ratio 12x).
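
The figures above can be sanity-checked with a little arithmetic. The sketch below is not part of the `m_diffae` package; the helper names are hypothetical, and it assumes RGB input, a latent laid out as `[C, H/16, W/16]`, and weights stored as float32 (4 bytes per parameter). The parameter count is the one listed in the Architecture table.

```python
def compression_ratio(patch_size: int, bottleneck_channels: int, image_channels: int = 3) -> float:
    """Input values per patch divided by latent values per patch."""
    values_per_patch = image_channels * patch_size * patch_size
    return values_per_patch / bottleneck_channels


def latent_shape(height: int, width: int, patch_size: int = 16, bottleneck_channels: int = 64):
    """Assumed latent layout (C, H/p, W/p) for an image of size height x width."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by the patch size")
    return (bottleneck_channels, height // patch_size, width // patch_size)


def fp32_size_mib(num_parameters: int) -> float:
    """Size in MiB if every parameter is stored as float32 (an assumption)."""
    return num_parameters * 4 / 2**20


print(compression_ratio(16, 64))            # 12.0 (mDiffAE v1)
print(compression_ratio(16, 96))            # 8.0  (mDiffAE v2)
print(compression_ratio(16, 128))           # 6.0  (iRDiffAE v1)
print(latent_shape(256, 256))               # (64, 16, 16)
print(f"{fp32_size_mib(81_410_624):.1f}")   # 310.6
```

Under the float32 assumption, the reported 310.6 MB matches the parameter count when a megabyte is taken as 2^20 bytes.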

	## Documentation

	- [Technical Report](technical_report_mdiffae.md) — architecture, masking strategy, and results
	- [iRDiffAE Technical Report](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) — full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
	- [Results — interactive viewer](https://huggingface.co/spaces/data-archetype/mdiffae-results) — full-resolution side-by-side comparison

	## Quick Start

	```python
	import torch
	from m_diffae import MDiffAE

	# Load from HuggingFace Hub (or a local path)
	model = MDiffAE.from_pretrained("data-archetype/mdiffae_v1", device="cuda")

	# Encode
	# Placeholder batch: [B, 3, H, W] in [-1, 1], H and W divisible by 16
	images = torch.rand(1, 3, 256, 256, device="cuda") * 2 - 1
	H, W = images.shape[-2:]
	latents = model.encode(images)

	# Decode (1 step by default, PSNR-optimal)
	recon = model.decode(latents, height=H, width=W)

	# Reconstruct (encode + 1-step decode)
	recon = model.reconstruct(images)
	```

	> Note: Requires `pip install huggingface_hub safetensors` for Hub downloads.
	> You can also pass a local directory path to `from_pretrained()`.
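
The Quick Start requires `H` and `W` to be divisible by 16. For images of arbitrary size, one option is to pad up to the next multiple before encoding and crop the reconstruction afterwards. The helper below is a hypothetical sketch, not part of `m_diffae`; it only computes the padded size and the padding amounts.

```python
def padded_size(height: int, width: int, multiple: int = 16):
    """Return (padded_height, padded_width, pad_bottom, pad_right)."""
    pad_bottom = (-height) % multiple  # rows to add so height is a multiple
    pad_right = (-width) % multiple    # columns to add so width is a multiple
    return height + pad_bottom, width + pad_right, pad_bottom, pad_right


print(padded_size(1080, 1920))  # (1088, 1920, 8, 0)
print(padded_size(256, 256))    # (256, 256, 0, 0)
```

With PyTorch, the padding itself could be applied with `torch.nn.functional.pad(images, (0, pad_right, 0, pad_bottom), mode="reflect")`, and the output cropped back with `recon[..., :height, :width]`. Whether padding affects reconstruction quality near the borders is not covered by this card.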

	## Architecture

	| Property | Value |
	|---|---|
	| Parameters | 81,410,624 |
	| File size | 310.6 MB |
	| Patch size | 16 |
	| Model dim | 896 |
	| Encoder depth | 4 |
	| Decoder depth | 4 |
	| Decoder topology | Flat sequential (no skip connections) |
	| Bottleneck dim | 64 |
	| MLP ratio | 4.0 |
	| Depthwise kernel | 7 |
	| AdaLN rank | 128 |
	| PDG mechanism | Token-level masking (ratio 0.75) |
	| Training regularizer | Decoder token masking (75% ratio, 50% apply prob) |
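
To make the "75% ratio, 50% apply prob" entry concrete, here is a minimal sketch of how such a mask could be sampled for one training example. It is an illustration under assumptions, not the training code: the function name is hypothetical, tokens are assumed to be chosen uniformly at random, and what happens to a masked token inside the decoder is left out. The technical report describes the actual scheme.

```python
import random


def sample_token_mask(num_tokens, mask_ratio=0.75, apply_prob=0.5, rng=None):
    """Return a list of booleans, True where a decoder token is masked."""
    rng = rng or random.Random()
    # With probability (1 - apply_prob) the example is left unmasked.
    if rng.random() >= apply_prob:
        return [False] * num_tokens
    # Otherwise mask a fixed fraction of tokens, chosen uniformly (assumption).
    num_masked = round(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


# A 256x256 image at patch size 16 gives 16 * 16 = 256 tokens.
mask = sample_token_mask(256, rng=random.Random(0))
print(sum(mask))  # either 0 (masking skipped) or 192 (75% of 256)
```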

	Encoder: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by
	DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with
	learned residual gates.

	Decoder: VP diffusion conditioned on encoder latents and timestep via
	shared-base + per-layer low-rank AdaLN-Zero. It consists of 4 flat
	sequential blocks (no skip connections).

	Compared to iRDiffAE: iRDiffAE uses an 8-block decoder (2 start, 4 middle,
	2 end) with skip connections and 128 bottleneck channels (needed partly
	because REPA occupies half the channels). mDiffAE uses 4 flat blocks with no
	skip connections and 64 bottleneck channels (12x compression vs iRDiffAE's
	6x), which gives better channel utilisation.

	### Key Differences from iRDiffAE

	| Aspect | iRDiffAE v1 | mDiffAE v1 |
	|---|---|---|
	| Bottleneck dim | 128 | 64 |
	| Decoder depth | 8 (2+4+2 skip-concat) | 4 (flat sequential) |
	| PDG mechanism | Block dropping | Token masking |
	| Training regularizer | REPA + covariance reg | Decoder token masking |

	## Recommended Settings

	Best quality (highest PSNR) is achieved with 1 DDIM step and PDG disabled.
	Enabling PDG can sharpen images, but its strength should be kept very low (1.01–1.05).

	| Setting | Default |
	|---|---|
	| Sampler | DDIM |
	| Steps | 1 |
	| PDG | Disabled |
	| PDG strength (if enabled) | 1.05 |

	```python
	from m_diffae import MDiffAEInferenceConfig

	# PSNR-optimal (fast, 1 step)
	cfg = MDiffAEInferenceConfig(num_steps=1, sampler="ddim")
	recon = model.decode(latents, height=H, width=W, inference_config=cfg)
	```

	## Citation

	```bibtex
	@misc{m_diffae,
	title = {mDiffAE: A Fast Masked Diffusion Autoencoder},
	author = {data-archetype},
	year = {2026},
	month = mar,
	url = {https://huggingface.co/data-archetype/mdiffae_v1},
	}
	```

	## Dependencies

	- PyTorch >= 2.0
	- safetensors (for loading weights)

	## License

	Apache 2.0