What is the difference from the pre-merge model?

by Austin9 - opened Nov 11, 2024

Discussion

Austin9

Nov 11, 2024

What is the difference between beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B and
beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B-Merged?
Does the LoRA-merged version perform better?

beomi

Owner Nov 11, 2024

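beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B is a repo that contains only the LoRA adapter, while beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B-Merged contains that LoRA merged into the original base model, Solar-Ko-Recovery-11B!
Tools such as vLLM need a repo with the full weights rather than the LoRA format, so I made this one as a supplementary repo :)
In theory, performance should be 100% identical!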
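Why merging should not change the outputs can be sketched with a toy example. This is a minimal illustration in plain Python, assuming the standard LoRA formulation h = W x + (alpha / r) B A x; the matrices, `alpha`, and `r` below are made-up values, not taken from this model.

```python
# Toy illustration (plain Python, no ML libraries): folding a LoRA update into
# the base weight gives the same output as running the adapter on the side.
# All numbers are hypothetical and chosen only for readability.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]

# Base weight W is 3x4; LoRA rank r = 2, so B is 3x2 and A is 2x4.
W = [[0.5, -1.0, 0.25, 2.0],
     [1.5, 0.0, -0.75, 1.0],
     [-2.0, 0.5, 1.0, 0.25]]
B = [[1.0, 0.5], [0.0, -1.0], [2.0, 0.25]]
A = [[0.5, 0.0, -1.0, 1.0], [1.0, 2.0, 0.0, -0.5]]
alpha, r = 4.0, 2
x = [[1.0], [2.0], [-1.0], [0.5]]  # input as a 4x1 column vector

# Adapter kept separate: h = W x + (alpha / r) * B (A x)
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Adapter merged: W' = W + (alpha / r) * B A, then h = W' x
W_merged = add(W, scale(matmul(B, A), alpha / r))
y_merged = matmul(W_merged, x)

max_diff = max(abs(p[0] - q[0]) for p, q in zip(y_adapter, y_merged))
print("max difference between the two paths:", max_diff)
```

The equality holds when both paths use the same precision. It says nothing about what happens once either side is quantized.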

beomi changed discussion status to closed Nov 11, 2024

Austin9

Nov 11, 2024 (edited Nov 11, 2024)

When I ran the inference Colab code from the beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B model card with Unsloth, it seemed to load the base model as well as the LoRA, and it worked fine. Is that a different case?

beomi

Owner Nov 11, 2024

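Ah, that can differ!
The reason is that you are comparing a 4-bit base model + 16-bit LoRA against a 4-bit merged model, so the two end up slightly different.
(There shouldn't be a big difference, though...)

Which case worked better for you?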
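The reason the two setups are not identical can be sketched with a toy quantizer. The example below uses a simple uniform 16-level rounding function as a stand-in; it is not the 4-bit scheme that bitsandbytes or Unsloth actually use, and the weights and updates are hypothetical. It only shows that "quantize the base, then add the full-precision update" and "add the update, then quantize" are different computations.

```python
def quantize(w, levels=16, lo=-1.0, hi=1.0):
    """Round w to the nearest of `levels` evenly spaced values in [lo, hi].

    A deliberately simple uniform quantizer, used only for illustration.
    """
    step = (hi - lo) / (levels - 1)
    clipped = min(max(w, lo), hi)
    return lo + round((clipped - lo) / step) * step

base = [0.31, -0.42, 0.07, 0.88]    # hypothetical base-model weights
delta = [0.02, -0.03, 0.05, -0.01]  # hypothetical LoRA update (full precision)

# (a) 4-bit base + 16-bit adapter: only the base is rounded, the update survives.
base_q_plus_lora = [quantize(w) + d for w, d in zip(base, delta)]

# (b) merged model loaded in 4-bit: base and update are rounded together.
merged_q = [quantize(w + d) for w, d in zip(base, delta)]

# Reference: the quantized base model without any fine-tuning update.
base_q = [quantize(w) for w in base]

print("(a) base_q + delta :", base_q_plus_lora)
print("(b) quantize(merged):", merged_q)
print("    quantized base  :", base_q)
```

With these particular numbers every merged weight rounds to the same level as the corresponding base weight, so setup (b) loses the update while setup (a) keeps it. Real quantizers are finer-grained and block-wise, so the effect on an actual model is an open question rather than something this sketch proves.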

Austin9

Nov 11, 2024

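So the inference code on the beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B model card runs a 4-bit model with a 16-bit LoRA, while the merged model is 4-bit throughout?

That may be why hallucination was severe when everything was 4-bit, whereas performance was very good earlier with the 4-bit model plus 16-bit LoRA. Personally, I think it is SOTA among models of the same size class.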

Austin9

Nov 11, 2024

For both models, I ran inference with the code below.

    import torch
    from unsloth import FastLanguageModel
    from transformers import TextStreamer

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B",
        max_seq_length = 4096,
        dtype = torch.bfloat16,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model) # roughly 2x faster inference

    # The "###" markers are assumed: they appear to have been rendered away as headings in the post.
    alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

    ### Instruction:
    {}

    ### Response:
    {}"""

    def gen(x): # takes an instruction string and streams the model's response
        inputs = tokenizer(
            [
                alpaca_prompt.format(
                    x.strip(), # instruction
                    "", # output - leave this blank for generation!
                )
            ], return_tensors = "pt").to("cuda")

        text_streamer = TextStreamer(tokenizer)
        _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512)

beomi

Owner Nov 11, 2024

The beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B LoRA is BF16 :D
beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B-Merged is BF16 throughout. --> So this model performs best when you load it in BF16!

In other words, the expected performance ordering is: beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B-Merged loaded fully in BF16 ~= 4-bit base model + beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B LoRA > beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B-Merged loaded fully in 4-bit.
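Loading the merged repo in BF16 could look roughly like the following. This is a sketch using the Transformers library rather than Unsloth; the helper name `load_merged_bf16` is made up for illustration, `device_map="auto"` assumes the `accelerate` package is installed, and the hardware needs enough memory for roughly 2 bytes per parameter.

```python
MODEL_ID = "beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B-Merged"

def load_merged_bf16(model_id: str = MODEL_ID):
    """Load the merged checkpoint with BF16 weights, without 4-bit quantization.

    Hypothetical helper for illustration; it is not part of any library.
    """
    # Imported lazily so that defining the helper does not require torch.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # keep the weights in BF16, as recommended above
        device_map="auto",           # assumes accelerate is installed
    )
    return model, tokenizer

# Usage (downloads the full weights, so it is left commented out):
# model, tokenizer = load_merged_bf16()
# print(model.dtype)
```

The difference from the earlier Unsloth snippet is the absence of `load_in_4bit = True`, so the merged weights are not re-quantized at load time.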

Austin9

Nov 11, 2024

Aha, I'll give it a try. Thank you for the kind answer.

beomi

Owner Nov 11, 2024

@Austin9 If you are running it locally, I also recommend trying the 8-bit quantized model from https://huggingface.co/mradermacher/KoAlpaca-RealQA-Solar-Ko-Recovery-11B-Merged-GGUF with Jan (https://jan.ai/).
You can use it as shown below.
