How to use bigcode/starpii with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="bigcode/starpii")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bigcode/starpii")
model = AutoModelForTokenClassification.from_pretrained("bigcode/starpii")
```
datasets:
- bigcode/pii-annotated-toloka-donwsample-emails
- bigcode/pseudo-labeled-python-data-pii-detection-filtered
metrics:
- f1
pipeline_tag: token-classification
language:
- code
extra_gated_prompt: >-
  ## Terms of Use for the model

  This is an NER model trained to detect Personal Identifiable Information (PII)
  in code datasets. We ask that you read and agree to the following Terms of Use
  before using the model:

  1. You agree that you will not use the model for any purpose other than PII
  detection for the purpose of removing PII from datasets.

  2. You agree that you will not share the model or any modified versions for
  whatever purpose.

  3. Unless required by applicable law or agreed to in writing, the model is
  provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
  either express or implied, including, without limitation, any warranties or
  conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
  PARTICULAR PURPOSE. You are solely responsible for determining the
  appropriateness of using the model, and assume any risks associated with your
  exercise of permissions under these Terms of Use.

  4. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
  DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
  OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE MODEL OR THE USE OR
  OTHER DEALINGS IN THE MODEL.
extra_gated_fields:
  Email: text
  I have read the License and agree with its terms: checkbox
# StarPII

## Model description

This is an NER model trained to detect Personally Identifiable Information (PII) in code datasets. We fine-tuned [bigcode-encoder](https://huggingface.co/bigcode/bigcode-encoder) on a PII dataset that we annotated, available with gated access at [bigcode-pii-dataset](https://huggingface.co/datasets/bigcode/pii-annotated-toloka-donwsample-emails) (see [bigcode-pii-dataset-training](https://huggingface.co/datasets/bigcode/bigcode-pii-dataset-training) for the exact data splits). We added a linear layer as a token classification head on top of the encoder model, with 6 target classes: Names, Emails, Keys, Passwords, IP addresses and Usernames.
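
Because the model is a token classifier, its predictions are labeled character spans. The sketch below shows one way such spans could be turned into redacted text. It assumes predictions shaped like the output of the Transformers token-classification pipeline with `aggregation_strategy="simple"` (dicts with `entity_group`, `start` and `end`). The label strings, the `<LABEL>` placeholder format and the `mask_pii` helper are illustrative assumptions, not part of the model card.

```python
from typing import Dict, List


def mask_pii(text: str, entities: List[Dict]) -> str:
    """Replace each predicted span with a <LABEL> placeholder (hypothetical helper)."""
    # Process spans from right to left so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{ent['entity_group']}>"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written predictions standing in for real model output.
sample = 'EMAIL = "jane@example.com"  # set by Jane Doe'
email, name = "jane@example.com", "Jane Doe"
predictions = [
    {"entity_group": "EMAIL", "start": sample.index(email),
     "end": sample.index(email) + len(email)},
    {"entity_group": "NAME", "start": sample.index(name),
     "end": sample.index(name) + len(name)},
]
masked = mask_pii(sample, predictions)
print(masked)
```

Replacing from the end of the string backwards avoids having to recompute offsets after each substitution.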
## Dataset

### Fine-tuning on the annotated dataset

The fine-tuning dataset contains 20,961 secrets across 31 programming languages, whereas the base encoder model was pre-trained on 88 programming languages from [The Stack](https://huggingface.co/datasets/bigcode/the-stack) dataset.
### Initial training on a pseudo-labeled dataset

To improve performance on rare PII entities such as keys, we first trained on a pseudo-labeled dataset before fine-tuning on the annotated dataset. Pseudo-labeling involves training a model on a small set of labeled data and then using it to generate predictions for a larger set of unlabeled data.

Specifically, we pseudo-labeled 18,000 files, available at [bigcode-pii-pseudo-labeled](https://huggingface.co/datasets/bigcode/pseudo-labeled-python-data-pii-detection-filtered), using an ensemble of two encoder models, [Deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and [stanford-deidentifier-base](https://huggingface.co/StanfordAIMI/stanford-deidentifier-base), which were fine-tuned on an internal, previously labeled PII [dataset](https://huggingface.co/datasets/bigcode/pii-for-code) for code containing 400 files from this [work](https://arxiv.org/abs/2301.03988).

To select good-quality pseudo-labels, we averaged the probability scores (logits) of the two models and filtered on a minimum score. On inspection, we observed a high rate of false positives for Keys and Passwords, so we retained only the entities that had a trigger word such as `key`, `auth` or `pwd` in the surrounding context. Training on this synthetic dataset before fine-tuning on the annotated one yielded a higher F1 score for all PII categories, as shown in the tables in the following section.
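
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the threshold `MIN_SCORE`, the context window `CONTEXT_CHARS`, the label strings and the `keep_pseudo_label` helper are all assumptions. Only the trigger words come from the text.

```python
from typing import Dict, List

TRIGGER_WORDS = ("key", "auth", "pwd")  # trigger words named in the text
MIN_SCORE = 0.8       # hypothetical minimum averaged score
CONTEXT_CHARS = 50    # hypothetical size of the surrounding context


def keep_pseudo_label(text: str, span: Dict, probs: List[float]) -> bool:
    """Decide whether one ensemble prediction is kept as a pseudo-label."""
    # Average the per-model probabilities for this span.
    score = sum(probs) / len(probs)
    if score < MIN_SCORE:
        return False
    # Keys and passwords additionally need a trigger word nearby.
    if span["label"] in ("KEY", "PASSWORD"):
        lo = max(0, span["start"] - CONTEXT_CHARS)
        hi = min(len(text), span["end"] + CONTEXT_CHARS)
        # Look only at the text around the span, not at the secret itself.
        context = (text[lo:span["start"]] + text[span["end"]:hi]).lower()
        return any(word in context for word in TRIGGER_WORDS)
    return True


secret = "sk9f8a7b6c5d4e3f2a1b"
with_trigger = f'api_key = "{secret}"'
without_trigger = f'checksum = "{secret}"'


def key_span(text: str) -> Dict:
    start = text.index(secret)
    return {"label": "KEY", "start": start, "end": start + len(secret)}


kept = keep_pseudo_label(with_trigger, key_span(with_trigger), [0.9, 0.85])
dropped = keep_pseudo_label(without_trigger, key_span(without_trigger), [0.9, 0.85])
```

Here the first candidate is kept because `key` appears next to it, while the second is dropped despite the same scores.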
### Performance

This model is represented in the last row (NER + pseudo labels) of each table.

- Emails, IP addresses and Keys

| Method | Email address | | | IP address | | | Key | | |
| ------------------ | -------------- | ---- | ---- | ---------- | ---- | ---- | ----- | ---- | ---- |
| | Prec. | Recall | F1 | Prec. | Recall | F1 | Prec. | Recall | F1 |
| Regex | 69.8% | 98.8% | 81.8% | 65.9% | 78% | 71.7% | 2.8% | 46.9% | 5.3% |
| NER | 94.01% | 98.10% | 96.01% | 88.95% | *94.43%* | 91.61% | 60.37% | 53.38% | 56.66% |
| + pseudo labels | **97.73%** | **98.94%** | **98.15%** | **90.10%** | 93.86% | **91.94%** | **62.38%** | **80.81%** | **70.41%** |

- Names, Usernames and Passwords

| Method | Name | | | Username | | | Password | | |
| ------------------ | -------- | ---- | ---- | -------- | ---- | ---- | -------- | ---- | ---- |
| | Prec. | Recall | F1 | Prec. | Recall | F1 | Prec. | Recall | F1 |
| NER | 83.66% | 95.52% | 89.19% | 48.93% | *75.55%* | 59.39% | 59.16% | *96.62%* | 73.39% |
| + pseudo labels | **86.45%** | **97.38%** | **91.59%** | **52.20%** | 74.81% | **61.49%** | **70.94%** | 95.96% | **81.57%** |
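
As a reading aid for the tables, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it for the Key columns. The `f1_score` helper is illustrative, and other cells may differ slightly depending on how the reported scores were aggregated.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Key detection, values in percent taken from the first table.
key_ner = f1_score(60.37, 53.38)
key_pseudo = f1_score(62.38, 80.81)
print(f"{key_ner:.2f} {key_pseudo:.2f}")  # prints "56.66 70.41"
```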
We used this model to mask PII in the training data of the BigCode large model. We dropped usernames since they resulted in many false positives and negatives. For the other PII types, we added the following post-processing, which we recommend for future uses of the model (the code is also available on GitHub):

- Ignore secrets with fewer than 4 characters.
- Detect full names only.
- Ignore detected keys that have fewer than 9 characters or that are not gibberish, using a [gibberish-detector](https://github.com/domanchi/gibberish-detector).
- Ignore IP addresses that are invalid or private (not internet-facing), using the `ipaddress` Python package. We also ignore IP addresses of popular DNS servers, using the same list as in this [paper](https://huggingface.co/bigcode/santacoder).
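
The post-processing rules above could be sketched as a single filter like the one below. This is an illustration under stated assumptions, not the code published on GitHub: the label strings, the "at least two words" test for full names, the small `POPULAR_DNS_SERVERS` set and the `toy_is_gibberish` stand-in for the gibberish detector are all hypothetical.

```python
import ipaddress
from typing import Callable

# Illustrative subset only; the list actually used is not reproduced here.
POPULAR_DNS_SERVERS = {"8.8.8.8", "8.8.4.4", "1.1.1.1", "1.0.0.1"}


def toy_is_gibberish(value: str) -> bool:
    """Toy stand-in for a real gibberish detector: mixes letters and digits."""
    return any(c.isdigit() for c in value) and any(c.isalpha() for c in value)


def keep_entity(label: str, value: str,
                is_gibberish: Callable[[str], bool] = toy_is_gibberish) -> bool:
    """Return True if a detected entity survives post-processing."""
    if label == "USERNAME":
        return False  # usernames were dropped entirely
    if len(value) < 4:
        return False  # ignore secrets with fewer than 4 characters
    if label == "NAME":
        # Assumed heuristic for "full names only": at least two words.
        return len(value.split()) >= 2
    if label == "KEY":
        return len(value) >= 9 and is_gibberish(value)
    if label == "IP_ADDRESS":
        try:
            ip = ipaddress.ip_address(value)
        except ValueError:
            return False  # not a valid IP address
        return not ip.is_private and str(ip) not in POPULAR_DNS_SERVERS
    return True
```

For example, `keep_entity("IP_ADDRESS", "10.0.0.1")` returns `False` because the address is private, while a public address outside the DNS list is kept.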
# Considerations for Using the Model

Please be aware of the potential risks of applying this model. It can produce false positives and false negatives, which could lead to unintended consequences when processing sensitive data. Moreover, its performance may vary across data types and programming languages, so validation and fine-tuning may be needed for specific use cases. Researchers and developers are expected to uphold ethical standards and data protection measures when using the model. By making it openly accessible, our aim is to encourage the development of privacy-preserving AI technologies while remaining vigilant about the potential risks associated with PII.