---
license: apache-2.0
language:
- en
tags:
- diffusion
- text-to-image
- latent-diffusion
- pytorch
pipeline_tag: text-to-image
---

# SykoDiffusion V1.0

The first version of my latent diffusion model. It generates images from text using a CLIP text encoder and a VAE.

## Model Details

| Property | Value |
|---|---|
| Parameters | ~100M |
| Architecture | Latent Diffusion (U-Net) |
| Training Data | CC3M (~100k images) |
| Training Steps | 20,000 |
| Resolution | 256×256 |
| Hardware | 2× NVIDIA T4 |
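
The 256×256 resolution in the table maps to the `(1, 4, 32, 32)` latent tensor sampled in the usage code. The sketch below shows that arithmetic, assuming the usual 8× spatial downsampling and 4 latent channels of the Stable Diffusion VAE family; `latent_shape` is a hypothetical helper written for illustration, not part of this repository.

```python
def latent_shape(height=256, width=256, batch_size=1,
                 latent_channels=4, vae_scale_factor=8):
    """Return the latent tensor shape for a given output resolution.

    Assumes the VAE downsamples each spatial dimension by `vae_scale_factor`
    (8 for Stable Diffusion style VAEs) and uses `latent_channels` channels.
    """
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError("height and width must be multiples of the VAE scale factor")
    return (batch_size, latent_channels,
            height // vae_scale_factor, width // vae_scale_factor)


print(latent_shape())  # shape of the noise tensor for a 256x256 image
```

The model was trained at 256×256, so other resolutions are outside what the table documents and may not produce useful images.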

## Usage

```python
import torch
from diffusers import UNet2DConditionModel, AutoencoderKL, DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 is poorly supported on CPU, so fall back to float32 there.
dtype  = torch.float16 if device == "cuda" else torch.float32

unet      = UNet2DConditionModel.from_pretrained("SykoSLM/SykoDiffusion-V1.0").to(device, dtype)
vae       = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(device, dtype)
clip      = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device, dtype)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
scheduler = DDIMScheduler(num_train_timesteps=1000, beta_start=0.00085,
                          beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False)

@torch.no_grad()
def generate(prompt, steps=30, cfg=7.5, seed=42):
    torch.manual_seed(seed)
    # Encode the prompt and an empty (unconditional) prompt for classifier-free guidance.
    tokens     = tokenizer(prompt, padding="max_length", truncation=True, max_length=77, return_tensors="pt").to(device)
    text_emb   = clip(**tokens).last_hidden_state
    neg_tokens = tokenizer("", padding="max_length", truncation=True, max_length=77, return_tensors="pt").to(device)
    neg_emb    = clip(**neg_tokens).last_hidden_state
    emb        = torch.cat([neg_emb, text_emb])
    # 256×256 output corresponds to 32×32 latents (the VAE downsamples by 8).
    latents    = torch.randn(1, 4, 32, 32, device=device, dtype=dtype)
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        # One batched U-Net pass yields both the unconditional and the text-conditioned prediction.
        pred          = unet(torch.cat([latents]*2), t, encoder_hidden_states=emb).sample
        neg_p, text_p = pred.chunk(2)
        pred          = neg_p + cfg * (text_p - neg_p)
        latents       = scheduler.step(pred, t, latents).prev_sample
    # Undo the latent scaling, decode with the VAE, and map [-1, 1] to 8-bit RGB.
    image = vae.decode(latents / vae.config.scaling_factor).sample
    image = (image.clamp(-1,1)+1)/2
    image = (image[0].permute(1,2,0).cpu().float().numpy()*255).astype("uint8")
    return Image.fromarray(image)

img = generate("a cat sitting on a chair")
img.save("output.png")
```
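
The scheduler above is configured with `beta_schedule="scaled_linear"`. As a rough illustration of what that setting means, the sketch below recomputes the noise schedule in plain Python: the betas are spaced linearly in square-root space between `beta_start` and `beta_end`, then squared. `scaled_linear_betas` is a hypothetical helper for illustration only; the real values come from `DDIMScheduler` itself.

```python
def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_train_timesteps=1000):
    """Return betas spaced linearly between sqrt(beta_start) and sqrt(beta_end), then squared."""
    lo, hi = beta_start ** 0.5, beta_end ** 0.5
    step = (hi - lo) / (num_train_timesteps - 1)
    return [(lo + i * step) ** 2 for i in range(num_train_timesteps)]


betas = scaled_linear_betas()
# The first and last entries land on beta_start and beta_end (up to float rounding).
print(len(betas), betas[0], betas[-1])
```

Compared with a plain linear schedule, this spacing keeps the betas smaller over most of the range, so noise is added more gradually in the early timesteps.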

## Notes

- This model is an experimental first version; output quality may be limited.
- For best results, try `cfg` values between 5 and 10.
- English prompts are recommended.
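
To make the `cfg` recommendation more concrete, here is a minimal sketch of the guidance step from `generate`, reduced to plain Python lists so it runs without any model. `apply_cfg` is a hypothetical name used only for this illustration; the numbers are toy values, not real U-Net outputs.

```python
def apply_cfg(neg_pred, text_pred, cfg):
    """Combine unconditional and text-conditioned predictions.

    Mirrors `neg_p + cfg * (text_p - neg_p)` from the usage code, element by element.
    """
    return [n + cfg * (t - n) for n, t in zip(neg_pred, text_pred)]


neg  = [0.0, 1.0]   # toy unconditional prediction
text = [1.0, 3.0]   # toy text-conditioned prediction

print(apply_cfg(neg, text, 0.0))  # ignores the prompt entirely
print(apply_cfg(neg, text, 1.0))  # plain conditional prediction
print(apply_cfg(neg, text, 2.0))  # pushes past the conditional prediction
```

Values above 1 extrapolate away from the unconditional prediction, which is why higher `cfg` tends to follow the prompt more closely, usually at some cost in variety. With the models loaded as in the usage section, something like `generate(prompt, cfg=c)` for a few values of `c` between 5 and 10 is a simple way to compare.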

## Developer

[@SykoSLM](https://huggingface.co/SykoSLM)