language:
  - ko

Introduction

KorPatELECTRA is a pre-trained language model based on Google's ELECTRA architecture, trained on a corpus of 4.6 million Korean patent documents containing 0.5 billion sentences. It targets patent domain-specific tasks, including Named Entity Recognition (NER), Machine Reading Comprehension (MRC), and Patent Classification. In the comparison below, it achieves the highest NER and MRC scores among the listed models and the second-highest classification score, behind KorPatBERT.

Model	Vocab Size	NER	Classification	MRC
Google BERT	110,000	87.98	72.33	87.79
KorPatBERT	21,400	87.89	76.32	85.61
KoELECTRA	35,000	87.47	72.98	88.09
KorPatELECTRA	35,000	91.01	73.90	89.85

For more details, please refer to our GitHub.
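As a quick way to read the comparison table, the following minimal sketch finds the best-scoring model per task and its margin over the runner-up. The scores are copied from the table above; the helper name `best_per_task` is hypothetical and used only for illustration.

```python
# Scores copied from the comparison table in this README.
scores = {
    "Google BERT": {"NER": 87.98, "Classification": 72.33, "MRC": 87.79},
    "KorPatBERT": {"NER": 87.89, "Classification": 76.32, "MRC": 85.61},
    "KoELECTRA": {"NER": 87.47, "Classification": 72.98, "MRC": 88.09},
    "KorPatELECTRA": {"NER": 91.01, "Classification": 73.90, "MRC": 89.85},
}


def best_per_task(table):
    """Return {task: (best_model, margin_over_runner_up)} for each task column."""
    tasks = list(next(iter(table.values())).keys())
    result = {}
    for task in tasks:
        # Rank model names by their score on this task, highest first.
        ranked = sorted(table, key=lambda name: table[name][task], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        margin = round(table[best][task] - table[runner_up][task], 2)
        result[task] = (best, margin)
    return result


summary = best_per_task(scores)
for task, (model, margin) in summary.items():
    print(f"{task}: {model} (+{margin:.2f} over the runner-up)")
```

With the table's numbers, KorPatELECTRA leads on NER and MRC, while KorPatBERT leads on Classification.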

How to use

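from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "KIPI-ai/KorPatElectra"

# Access token (replace with your actual token from Hugging Face)
access_token = "hf_paste_your_copied_token_here"

# The Auto* classes resolve the model and tokenizer classes from the checkpoint's
# config, so BERT classes are not hard-coded for an ELECTRA-based checkpoint.
# `token` replaces the deprecated `use_auth_token` argument in recent transformers releases.
model = AutoModel.from_pretrained(model_name, token=access_token)
tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)
model.eval()  # inference mode: disables dropout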

# Sample sentence
sentence_org = "본 고안은 주로 일회용 합성세제액을 집어넣어 밀봉하는 세제액포의 내부를 원호상으로 열중착하되 세제액이 배출되는 절단부 쪽으로 내벽을 협소하게 형성하여서 내부에 들어있는 세제액을 잘짜질 수 있도록 하는 합성세제 액포에 관한 것이다."

# Tokenization
inputs = tokenizer(sentence_org, return_tensors="pt")

# Forward pass (gradients are not needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# Extract the last hidden states
last_hidden_states = outputs.last_hidden_state
cls_vector = last_hidden_states[:, 0, :]  # (batch_size, hidden_size)

print(f"1. Length of vocab : {tokenizer.vocab_size}")
print(f"2. Input example : {sentence_org}")
print(f"3. Tokenized example : {inputs}")
print(f"4. vector shape : {cls_vector.shape}")

# Output
1. Length of vocab : 35000
2. Input example : 본 고안은 주로 일회용 합성세제액을 집어넣어 밀봉하는 세제액포의 내부를 원호상으로 열중착하되 세제액이 배출되는 절단부 쪽으로 내벽을 협소하게 형성하여서 내부에 들어있는 세제액을 잘짜질 수 있도록 하는 합성세제 액포에 관한 것이다.
3. Tokenized example : {'input_ids': tensor([[    2,  2326,  7419,  5091,  9295, 13078,  7590, 26872,  4885,  5216,
         31417,  4749,  8328,  4706,  4805, 11492,  4885,  5042,  5076,  7300,
          5383, 12265,  4802, 27067,  3045,  4737,  4818,  4706,  4999, 11492,
          4885,  4732,  7487,  4999,  4805, 16407,  3500, 27067, 10196,  5216,
         18219,  4706,  5036,  7220, 15064,  4739,  7300,  4963,  9062,  5708,
          4805, 11492,  4885,  5216,  3257,  5534,  4967,  2748,  3244,  4839,
          4848,  4343,  4805,  7590, 26872,  2965,  5042,  4963,  7372,   867,
          4732,  5101,   216,     3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]])}
4. vector shape : torch.Size([1, 768])
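
The example above uses the vector at position 0 as the sentence representation. A common alternative, which this model card does not describe, is to average the token vectors while ignoring padding positions. The sketch below illustrates that idea on dummy tensors standing in for `outputs.last_hidden_state` and `inputs["attention_mask"]`; the function name `masked_mean_pool` is hypothetical, and whether mean pooling works better than the first-position vector for this model is not established here.

```python
import torch


def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over real (non-padding) positions only."""
    # (batch, seq_len) -> (batch, seq_len, 1), cast to the hidden-state dtype
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden_size)
    counts = mask.sum(dim=1).clamp(min=1.0)          # avoid division by zero
    return summed / counts


# Dummy data: one sequence of three positions, the last one being padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # (batch_size, hidden_size)
```

With real model output, the same call would be `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`.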

🔐 Note: This model requires access approval. Log in to Hugging Face, request access on the model page, and then load the model with your personal access token, which you can create under Access Tokens in your Hugging Face account settings.
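
To avoid hard-coding the token in a script, one option is to read it from an environment variable. The sketch below assumes the token has been exported as `HF_TOKEN`, the variable name that the Hugging Face libraries also recognize; the helper name `resolve_hf_token` is hypothetical.

```python
import os


def resolve_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in the given environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set. "
            "Create an access token in your Hugging Face account settings "
            "and export it before running this script."
        )
    return token


# Usage (after `export HF_TOKEN=hf_...` in the shell):
#   access_token = resolve_hf_token()
# The returned string can then be used in place of the hard-coded token
# when loading the model and tokenizer.
```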

License

Any commercial exploitation of this model requires a separate commercial license agreement with the Licensor.

Contact

KIPI AI Support: ai_support@kipi.or.kr