KorPatElectra / README.md
metadata
language:
  - ko

Introduction

KorPatELECTRA is a pre-trained language model based on Google's ELECTRA architecture, trained on a corpus of 4.6 million Korean patent documents and 0.5 billion sentences. It targets patent-domain tasks, including Named Entity Recognition (NER), Machine Reading Comprehension (MRC), and Patent Classification. In the comparison below it records the highest NER and MRC scores, while KorPatBERT scores higher on Classification.

| Model | Vocab Size | NER | Classification | MRC |
|---|---|---|---|---|
| Google BERT | 110,000 | 87.98 | 72.33 | 87.79 |
| KorPatBERT | 21,400 | 87.89 | 76.32 | 85.61 |
| KoELECTRA | 35,000 | 87.47 | 72.98 | 88.09 |
| KorPatELECTRA | 35,000 | 91.01 | 73.90 | 89.85 |
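The comparison above can be read more easily with a little arithmetic. The sketch below copies the scores from the table and reports the top model per task and the gap between KorPatELECTRA and KoELECTRA, which share the same vocabulary size. The helper names `best_per_task` and `gain_over` are hypothetical, and the table does not state which metric each column uses, so the differences are plain score points.

```python
# Scores copied from the table above (metric per column is not stated in the source).
scores = {
    "Google BERT":   {"NER": 87.98, "Classification": 72.33, "MRC": 87.79},
    "KorPatBERT":    {"NER": 87.89, "Classification": 76.32, "MRC": 85.61},
    "KoELECTRA":     {"NER": 87.47, "Classification": 72.98, "MRC": 88.09},
    "KorPatELECTRA": {"NER": 91.01, "Classification": 73.90, "MRC": 89.85},
}


def best_per_task(table):
    """Return the highest-scoring model name for each task."""
    tasks = next(iter(table.values())).keys()
    return {task: max(table, key=lambda name: table[name][task]) for task in tasks}


def gain_over(table, model, baseline):
    """Return per-task score differences (model minus baseline), rounded to 2 decimals."""
    return {
        task: round(table[model][task] - table[baseline][task], 2)
        for task in table[model]
    }


best = best_per_task(scores)
gains = gain_over(scores, "KorPatELECTRA", "KoELECTRA")
print(best)
print(gains)
```

Comparing against KoELECTRA isolates the effect of patent-domain pre-training more cleanly than comparing against models with a different vocabulary, although other training differences may also contribute.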

For more details, please refer to our GitHub.

How to use

from transformers import BertTokenizer, BertModel
import torch

# Load model and tokenizer 
model_name = "KIPI-ai/KorPatElectra"

# Access token (replace with your actual token from Hugging Face)
access_token = "hf_paste_your_copied_token_here"

# `token` replaces the deprecated `use_auth_token` argument in recent transformers releases
model = BertModel.from_pretrained(model_name, token=access_token)
tokenizer = BertTokenizer.from_pretrained(model_name, token=access_token)

# Sample sentence
sentence_org = "본 고안은 주로 일회용 합성세제액을 집어넣어 밀봉하는 세제액포의 내부를 원호상으로 열중착하되 세제액이 배출되는 절단부 쪽으로 내벽을 협소하게 형성하여서 내부에 들어있는 세제액을 잘짜질 수 있도록 하는 합성세제 액포에 관한 것이다."

# Tokenization
inputs = tokenizer(sentence_org, return_tensors="pt")

# Forward pass (inference only, so gradient tracking is disabled)
with torch.no_grad():
    outputs = model(**inputs)

# Extract the last hidden states
last_hidden_states = outputs.last_hidden_state
cls_vector = last_hidden_states[:, 0, :]  # (batch_size, hidden_size)

print(f"1. Length of vocab : {tokenizer.vocab_size}")
print(f"2. Input example : {sentence_org}")
print(f"3. Tokenized example : {inputs}")
print(f"4. vector shape : {cls_vector.shape}")

# Output
1. Length of vocab : 35000
2. Input example : 본 고안은 주로 일회용 합성세제액을 집어넣어 밀봉하는 세제액포의 내부를 원호상으로 열중착하되 세제액이 배출되는 절단부 쪽으로 내벽을 협소하게 형성하여서 내부에 들어있는 세제액을 잘짜질 수 있도록 하는 합성세제 액포에 관한 것이다.
3. Tokenized example : {'input_ids': tensor([[    2,  2326,  7419,  5091,  9295, 13078,  7590, 26872,  4885,  5216,
         31417,  4749,  8328,  4706,  4805, 11492,  4885,  5042,  5076,  7300,
          5383, 12265,  4802, 27067,  3045,  4737,  4818,  4706,  4999, 11492,
          4885,  4732,  7487,  4999,  4805, 16407,  3500, 27067, 10196,  5216,
         18219,  4706,  5036,  7220, 15064,  4739,  7300,  4963,  9062,  5708,
          4805, 11492,  4885,  5216,  3257,  5534,  4967,  2748,  3244,  4839,
          4848,  4343,  4805,  7590, 26872,  2965,  5042,  4963,  7372,   867,
          4732,  5101,   216,     3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]])}
4. vector shape : torch.Size([1, 768])
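
The example above uses the first-token ([CLS]) vector as the sentence representation. A common alternative, not described by the authors, is to average the token vectors while ignoring padding, which matters once several sentences of different lengths are batched together. The sketch below shows this masked mean on dummy tensors; `mean_pool` is a hypothetical helper, and the random `hidden` tensor stands in for `outputs.last_hidden_state`.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, skipping positions where the mask is 0."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden_size)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Dummy stand-ins: 2 sentences, 5 token positions, hidden size 768.
hidden = torch.randn(2, 5, 768)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]])  # second sentence has 2 padding positions

sentence_vectors = mean_pool(hidden, mask)
print(sentence_vectors.shape)
```

With the real model, the same call would take `outputs.last_hidden_state` and `inputs["attention_mask"]`. Whether mean pooling or the [CLS] vector works better for a given patent task is an empirical question.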

Note: This model requires access approval. Please log in to Hugging Face, request access to the model, and use your personal access token (get yours here).
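
Because the token is personal, it is usually safer to keep it out of source files. The sketch below reads it from an environment variable instead of a hard-coded string. The helper name `resolve_hf_token` is hypothetical, and the choice of the `HF_TOKEN` variable name is an assumption made for this example.

```python
import os


def resolve_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in an environment variable.

    Both this helper and the default variable name are illustrative
    choices, not part of the model's documented usage.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is empty; "
            "set it to your Hugging Face access token first."
        )
    return token
```

The returned string can then be passed wherever the usage example uses `access_token`, for instance after running `export HF_TOKEN=...` in the shell that launches the script.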

Reference

License

Any commercial exploitation of this model requires a separate commercial license agreement with the Licensor.

Contact

KIPI AI Support : ai_support@kipi.or.kr