How to use OpenMed/OpenMed-PII-Hindi-BioClinicalModern-Large-395M-v1 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("token-classification", model="OpenMed/OpenMed-PII-Hindi-BioClinicalModern-Large-395M-v1")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("OpenMed/OpenMed-PII-Hindi-BioClinicalModern-Large-395M-v1")
model = AutoModelForTokenClassification.from_pretrained("OpenMed/OpenMed-PII-Hindi-BioClinicalModern-Large-395M-v1")

Hindi PII Detection Model | 395M Parameters | Open Source
OpenMed-PII-Hindi-BioClinicalModern-Large-395M-v1 is a transformer-based token classification model fine-tuned for Personally Identifiable Information (PII) detection in Hindi text. This model identifies and classifies 54 types of sensitive information including names, addresses, social security numbers, medical record numbers, and more.
Evaluated on the Hindi subset of the AI4Privacy dataset:
| Metric | Score |
|---|---|
| Micro F1 | 0.9539 |
| Precision | 0.9534 |
| Recall | 0.9544 |
| Macro F1 | 0.9609 |
| Weighted F1 | 0.9534 |
| Accuracy | 0.9915 |

Ranking among OpenMed Hindi PII models:

| Rank | Model | F1 | Precision | Recall |
|---|---|---|---|---|
| 1 | OpenMed-PII-Hindi-SuperClinical-Large-434M-v1 | 0.9660 | 0.9650 | 0.9669 |
| 2 | OpenMed-PII-Hindi-SnowflakeMed-Large-568M-v1 | 0.9648 | 0.9635 | 0.9661 |
| 3 | OpenMed-PII-Hindi-BigMed-Large-560M-v1 | 0.9641 | 0.9629 | 0.9653 |
| 4 | OpenMed-PII-Hindi-ClinicalBGE-568M-v1 | 0.9635 | 0.9625 | 0.9644 |
| 5 | OpenMed-PII-Hindi-mClinicalE5-Large-560M-v1 | 0.9615 | 0.9601 | 0.9630 |
| 6 | OpenMed-PII-Hindi-mSuperClinical-Base-279M-v1 | 0.9593 | 0.9577 | 0.9610 |
| 7 | OpenMed-PII-Hindi-SuperMedical-Large-355M-v1 | 0.9588 | 0.9580 | 0.9596 |
| 8 | OpenMed-PII-Hindi-NomicMed-Large-395M-v1 | 0.9554 | 0.9542 | 0.9565 |
| 9 | OpenMed-PII-Hindi-BioClinicalModern-Large-395M-v1 | 0.9539 | 0.9534 | 0.9544 |
| 10 | OpenMed-PII-Hindi-EuroMed-210M-v1 | 0.9538 | 0.9530 | 0.9547 |
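
As a quick sanity check on the ranking table above, the sketch below recomputes F1 as the harmonic mean of precision and recall for each row and compares it with the reported value. It assumes the F1 column is derived from the precision and recall shown in the same row (the table does not say so explicitly); `f1_from` is a helper name introduced here for illustration.

```python
# Rows copied from the ranking table: (model, reported F1, precision, recall).
rows = [
    ("OpenMed-PII-Hindi-SuperClinical-Large-434M-v1", 0.9660, 0.9650, 0.9669),
    ("OpenMed-PII-Hindi-SnowflakeMed-Large-568M-v1", 0.9648, 0.9635, 0.9661),
    ("OpenMed-PII-Hindi-BigMed-Large-560M-v1", 0.9641, 0.9629, 0.9653),
    ("OpenMed-PII-Hindi-ClinicalBGE-568M-v1", 0.9635, 0.9625, 0.9644),
    ("OpenMed-PII-Hindi-mClinicalE5-Large-560M-v1", 0.9615, 0.9601, 0.9630),
    ("OpenMed-PII-Hindi-mSuperClinical-Base-279M-v1", 0.9593, 0.9577, 0.9610),
    ("OpenMed-PII-Hindi-SuperMedical-Large-355M-v1", 0.9588, 0.9580, 0.9596),
    ("OpenMed-PII-Hindi-NomicMed-Large-395M-v1", 0.9554, 0.9542, 0.9565),
    ("OpenMed-PII-Hindi-BioClinicalModern-Large-395M-v1", 0.9539, 0.9534, 0.9544),
    ("OpenMed-PII-Hindi-EuroMed-210M-v1", 0.9538, 0.9530, 0.9547),
]


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    return 2 * precision * recall / (precision + recall)


# Largest gap between the reported F1 and the value recomputed from P and R.
# The inputs are rounded to four decimals, so small gaps are expected.
max_gap = max(abs(f1 - f1_from(p, r)) for _, f1, p, r in rows)
print(f"max |reported - recomputed| = {max_gap:.5f}")
```

A gap well below the fourth decimal place suggests the three columns are mutually consistent up to rounding.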
This model detects 54 PII entity types organized into categories:
| Entity | Description |
|---|---|
| ACCOUNTNAME | Accountname |
| BANKACCOUNT | Bankaccount |
| BIC | Bic |
| BITCOINADDRESS | Bitcoinaddress |
| CREDITCARD | Creditcard |
| CREDITCARDISSUER | Creditcardissuer |
| CVV | Cvv |
| ETHEREUMADDRESS | Ethereumaddress |
| IBAN | Iban |
| IMEI | Imei |
| ... | and 12 more |

| Entity | Description |
|---|---|
| AGE | Age |
| DATEOFBIRTH | Dateofbirth |
| EYECOLOR | Eyecolor |
| FIRSTNAME | Firstname |
| GENDER | Gender |
| HEIGHT | Height |
| LASTNAME | Lastname |
| MIDDLENAME | Middlename |
| OCCUPATION | Occupation |
| PREFIX | Prefix |
| ... | and 1 more |

| Entity | Description |
|---|---|
| EMAIL | Email |
| PHONE | Phone |

| Entity | Description |
|---|---|
| BUILDINGNUMBER | Buildingnumber |
| CITY | City |
| COUNTY | County |
| GPSCOORDINATES | Gpscoordinates |
| ORDINALDIRECTION | Ordinaldirection |
| SECONDARYADDRESS | Secondaryaddress |
| STATE | State |
| STREET | Street |
| ZIPCODE | Zipcode |

| Entity | Description |
|---|---|
| JOBDEPARTMENT | Jobdepartment |
| JOBTITLE | Jobtitle |
| ORGANIZATION | Organization |

| Entity | Description |
|---|---|
| AMOUNT | Amount |
| CURRENCY | Currency |
| CURRENCYCODE | Currencycode |
| CURRENCYNAME | Currencyname |
| CURRENCYSYMBOL | Currencysymbol |

| Entity | Description |
|---|---|
| DATE | Date |
| TIME | Time |
from transformers import pipeline
# Load the PII detection pipeline
ner = pipeline("ner", model="OpenMed/OpenMed-PII-Hindi-BioClinicalModern-Large-395M-v1", aggregation_strategy="simple")
text = """
रोगी राजेश कुमार (जन्म तिथि: 15/03/1985, आधार: 9876 5432 1098) की आज जांच हुई।
संपर्क: rajesh.kumar@email.in, फ़ोन: +91 98765 43210।
पता: 123 विकास मार्ग, 110092 नई दिल्ली।
"""
entities = ner(text)
for entity in entities:
    print(f"{entity['entity_group']}: {entity['word']} (score: {entity['score']:.3f})")
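
Each aggregated entity carries a `score`, so one possible follow-up is to separate confident detections from ones that may deserve a second look. The sketch below is only an illustration: `split_by_score`, the 0.80 threshold, and the hand-written `sample` list (including its offsets) are assumptions introduced here, not part of the model card or of the `transformers` API.

```python
from typing import Dict, List, Tuple


def split_by_score(
    entities: List[Dict], threshold: float = 0.80
) -> Tuple[List[Dict], List[Dict]]:
    """Split entities into (confident, uncertain) by their score.

    `threshold` is an arbitrary illustrative value; tune it on your own data.
    """
    confident, uncertain = [], []
    for ent in entities:
        # float() also accepts the NumPy scalar scores returned by the pipeline.
        if float(ent["score"]) >= threshold:
            confident.append(ent)
        else:
            uncertain.append(ent)
    return confident, uncertain


# Hand-written stand-in for the output of `ner(text)`; values are illustrative.
sample = [
    {"entity_group": "FIRSTNAME", "word": "राजेश", "score": 0.991, "start": 6, "end": 11},
    {"entity_group": "EMAIL", "word": "rajesh.kumar@email.in", "score": 0.62, "start": 90, "end": 111},
]

confident, uncertain = split_by_score(sample)
for ent in uncertain:
    print(f"needs review: {ent['entity_group']} -> {ent['word']}")
```

With real pipeline output, `split_by_score(entities)` could be called directly on the `entities` list from the example above.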
def redact_pii(text, entities, placeholder=None):
    """Replace detected PII spans with the entity label, or with a fixed placeholder if given."""
    # Process spans from the end of the text so earlier offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    redacted = text
    for ent in sorted_entities:
        replacement = placeholder if placeholder is not None else f"[{ent['entity_group']}]"
        redacted = redacted[:ent['start']] + replacement + redacted[ent['end']:]
    return redacted
# Apply de-identification
redacted_text = redact_pii(text, entities)
print(redacted_text)
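
Offset-based replacement like the one above assumes the spans do not overlap. Aggregated pipeline output should not normally contain overlaps, but they can appear if detections are combined with other sources such as regex rules. The following sketch merges overlapping spans before redacting; `merge_spans` and `redact_spans` are hypothetical helpers, and the entity list is written by hand so the example runs without downloading the model.

```python
def merge_spans(entities):
    """Merge overlapping spans, keeping the label of the span that starts
    first (the longest one when several start at the same offset)."""
    merged = []
    for ent in sorted(entities, key=lambda e: (e["start"], -e["end"])):
        if merged and ent["start"] < merged[-1]["end"]:
            # Overlap: extend the previous span instead of adding a new one.
            merged[-1]["end"] = max(merged[-1]["end"], ent["end"])
        else:
            merged.append(dict(ent))  # copy so the input list is not mutated
    return merged


def redact_spans(text, entities):
    """Replace each merged span with its label, working right to left."""
    out = text
    for ent in sorted(merge_spans(entities), key=lambda e: e["start"], reverse=True):
        out = out[:ent["start"]] + f"[{ent['entity_group']}]" + out[ent["end"]:]
    return out


demo_text = "Contact: rajesh.kumar@email.in"
begin = demo_text.find("rajesh")
demo_entities = [
    {"entity_group": "FIRSTNAME", "start": begin, "end": begin + 6},
    {"entity_group": "EMAIL", "start": begin, "end": len(demo_text)},
]

print(redact_spans(demo_text, demo_entities))
```

Keeping the label of the longest span at a shared start offset is a design choice made for this sketch; other policies (for example, preferring the highest score) are equally plausible.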
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model_name = "OpenMed/OpenMed-PII-Hindi-BioClinicalModern-Large-395M-v1"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
texts = [
    "रोगी राजेश कुमार (जन्म तिथि: 15/03/1985, आधार: 9876 5432 1098) की आज जांच हुई।",
    "संपर्क: rajesh.kumar@email.in, फ़ोन: +91 98765 43210।",
]
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
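
The `predictions` tensor holds one label id per token, including padding positions. A minimal sketch of turning those ids into label strings is shown below. It works on plain lists so it runs without the model; with the real objects one would pass `predictions.tolist()`, `inputs["attention_mask"].tolist()`, and `model.config.id2label`. The helper name `decode_batch` and the toy label map are assumptions for illustration, and the model's actual tagging scheme may differ.

```python
def decode_batch(pred_ids, attention_mask, id2label):
    """Map label ids to label strings, skipping padded positions.

    pred_ids and attention_mask are nested lists of equal shape
    (batch x sequence length); id2label maps int ids to label names.
    """
    decoded = []
    for ids, mask in zip(pred_ids, attention_mask):
        # attention_mask is 1 for real tokens and 0 for padding.
        decoded.append([id2label[i] for i, m in zip(ids, mask) if m == 1])
    return decoded


# Toy stand-ins; the label names here are hypothetical.
toy_id2label = {0: "O", 1: "B-FIRSTNAME", 2: "I-FIRSTNAME"}
toy_pred_ids = [[0, 1, 2, 0], [0, 1, 0, 0]]
toy_attention_mask = [[1, 1, 1, 1], [1, 1, 0, 0]]

labels = decode_batch(toy_pred_ids, toy_attention_mask, toy_id2label)
for sequence in labels:
    print(sequence)
```

Special tokens added by the tokenizer still appear in the decoded sequences; the pipeline with `aggregation_strategy` handles that bookkeeping, which is why it is usually the simpler choice.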
Important: This model is intended as an assistive tool, not a replacement for human review.
@misc{openmed-pii-2026,
  title = {OpenMed-PII-Hindi-BioClinicalModern-Large-395M-v1: Hindi PII Detection Model},
  author = {OpenMed Science},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/OpenMed/OpenMed-PII-Hindi-BioClinicalModern-Large-395M-v1}
}
Base model: answerdotai/ModernBERT-large