PreDA-base (Prefix-Based Dream Reports Annotation)
This model is a fine-tuned version of google-t5/t5-base on the annotated Dreambank.net dataset. Its results on the evaluation set are reported in the Evaluation section.
Intended uses & limitations
This model is designed for research purposes. See the disclaimer for more details.
Training procedure
The overall idea of our approach is to disentangle each dream report from its full annotation and to create an augmented set of (dream report; single-feature annotation) pairs. To make sure that, given the same report, the model produces the annotation for a specific Hall and Van De Castle (HVDC) feature, we prepend to each report a string of the form "HVDC-Feature:", in a manner that closely mimics T5 task-specific prefix fine-tuning.
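The augmentation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the function name `explode_annotations`, the record layout, and the placeholder annotation strings are hypothetical, and the `"{feature} : {report}"` prefix format is borrowed from the Usage section of this card.

```python
from typing import Dict, List, Tuple


def explode_annotations(report: str, annotations: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn one (report, full annotation) record into one (input, target) pair per feature."""
    pairs = []
    for feature, target in annotations.items():
        # Prepend the feature name as a task prefix (format assumed from the Usage section).
        source = "{} : {}".format(feature, report)
        pairs.append((source, target))
    return pairs


# Hypothetical record: the annotation values are placeholders, not real HVDC codes.
record = {
    "report": "I was talking with my brother about my birthday dinner.",
    "annotations": {
        "Characters": "<characters annotation>",
        "Emotion": "<emotion annotation>",
    },
}

pairs = explode_annotations(record["report"], record["annotations"])
for source, target in pairs:
    print(source, "->", target)
```

Each report therefore appears once per annotated feature, which is why the augmented set is larger than the original one.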
Applying this procedure to the original dataset (~1.8K) yields approximately 6.6K items. In the present study, we focused on a subset of six HVDC features: Characters, Activities, Emotion, Friendliness, Misfortune, and Good Fortune. This selection was made to exclude features that represented less than 10% of the total instances. Notably, Good Fortune would have been excluded under this criterion, but we intentionally retained it to control against potential memorisation effects and to provide a counterbalance to the Misfortune feature. After filtering out instances whose annotation feature is not one of the six selected features, we are left with ~5.3K dream reports. We then generate a random 80%-20% split into training (i.e., 4,311 reports) and testing (i.e., 1,078 reports) sets.
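The filtering and splitting steps can be illustrated with a short standard-library sketch. The item layout, the function name `filter_and_split`, the `"Other"` stand-in for excluded features, and the use of seed 42 for the split are assumptions made for illustration; the card does not state how the split was actually generated.

```python
import random

# The six features retained in this study.
SELECTED_FEATURES = {
    "Characters", "Activities", "Emotion",
    "Friendliness", "Misfortune", "Good Fortune",
}


def filter_and_split(items, train_fraction=0.8, seed=42):
    """Keep items annotated with a selected feature, then shuffle and split them."""
    kept = [item for item in items if item["feature"] in SELECTED_FEATURES]
    rng = random.Random(seed)  # seed value is an assumption, not taken from the split code
    rng.shuffle(kept)
    cut = round(len(kept) * train_fraction)
    return kept[:cut], kept[cut:]


# Toy data: "Other" stands in for any feature outside the selected six.
features = ["Emotion", "Other", "Characters", "Misfortune", "Other"] * 4
items = [{"feature": f, "report": "report {}".format(i)} for i, f in enumerate(features)]

train, test = filter_and_split(items)
print(len(train), len(test))
```

With 5,389 filtered items (4,311 + 1,078), the same rounding gives a cut at 4,311, which is consistent with the split sizes reported above.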
Training
Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.001
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 10
- num_epochs: 20
- mixed_precision_training: Native AMP
- label_smoothing_factor: 0.1
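To make the schedule implied by these hyperparameters concrete, the sketch below computes the number of optimizer steps and the learning rate at a few points. It is a plain-Python approximation intended to mirror a linear schedule with warmup, not the training code itself; it assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. The function name `linear_lr` is hypothetical.

```python
import math


def linear_lr(step, base_lr=1e-3, warmup_steps=10, total_steps=2700):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


# 4,311 training items at batch size 32 (assuming the last partial batch is kept).
steps_per_epoch = math.ceil(4311 / 32)
total_steps = steps_per_epoch * 20  # num_epochs = 20

print(steps_per_epoch, total_steps)
for step in (0, 5, 10, 1355, 2700):
    print(step, linear_lr(step, total_steps=total_steps))
```

Under these assumptions the step counts agree with the Step column of the metrics table (135 steps per epoch, 2,700 in total).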
Metrics
| Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|---|---|---|---|---|---|---|---|
| 1.9319 | 1.0 | 135 | 1.8550 | 0.5264 | 0.3732 | 0.5090 | 0.5092 |
| 1.8158 | 2.0 | 270 | 1.7623 | 0.6482 | 0.5020 | 0.6238 | 0.6242 |
| 1.7508 | 3.0 | 405 | 1.7072 | 0.7429 | 0.6136 | 0.7142 | 0.7141 |
| 1.6998 | 4.0 | 540 | 1.6532 | 0.7753 | 0.6726 | 0.7517 | 0.7518 |
| 2.6261 | 5.0 | 675 | 2.7877 | 0.6512 | 0.5178 | 0.6319 | 0.6327 |
| 2.6021 | 6.0 | 810 | 2.4693 | 0.7583 | 0.6397 | 0.7384 | 0.7383 |
| 2.4839 | 7.0 | 945 | 2.3952 | 0.7914 | 0.6872 | 0.7684 | 0.7682 |
| 2.4471 | 8.0 | 1080 | 2.3693 | 0.7851 | 0.6768 | 0.7622 | 0.7621 |
| 2.4298 | 9.0 | 1215 | 2.3654 | 0.7881 | 0.6811 | 0.7649 | 0.7647 |
| 2.3975 | 10.0 | 1350 | 2.4163 | 0.7884 | 0.6785 | 0.7662 | 0.7660 |
| 2.4715 | 11.0 | 1485 | 2.4415 | 0.7861 | 0.6745 | 0.7629 | 0.7628 |
| 2.4541 | 12.0 | 1620 | 2.4401 | 0.7851 | 0.6730 | 0.7621 | 0.7620 |
| 2.4057 | 13.0 | 1755 | 2.4395 | 0.7856 | 0.6735 | 0.7621 | 0.7619 |
| 2.4491 | 14.0 | 1890 | 2.4387 | 0.7863 | 0.6739 | 0.7627 | 0.7624 |
| 2.4713 | 15.0 | 2025 | 2.4379 | 0.7870 | 0.6755 | 0.7635 | 0.7634 |
| 2.4635 | 16.0 | 2160 | 2.4375 | 0.7865 | 0.6742 | 0.7630 | 0.7628 |
| 2.4693 | 17.0 | 2295 | 2.4364 | 0.7869 | 0.6749 | 0.7630 | 0.7627 |
| 2.2091 | 18.0 | 2430 | 2.4366 | 0.7859 | 0.6743 | 0.7629 | 0.7626 |
| 2.4518 | 19.0 | 2565 | 2.4362 | 0.7866 | 0.6753 | 0.7635 | 0.7635 |
| 2.4663 | 20.0 | 2700 | 2.4362 | 0.7866 | 0.6744 | 0.7633 | 0.7631 |
Evaluation
We selected the best model via validation loss. The table below reports overall and feature-specific scores.
| Feature | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| Activities | 62.6 | 51 | 58.4 | 58.3 |
| Characters | 84.7 | 79.5 | 82.9 | 82.9 |
| Emotion | 79.9 | 72.1 | 78.7 | 78.9 |
| Friendliness | 70.3 | 57 | 66.3 | 66.3 |
| Good Fortune | 71.4 | 22.8 | 71.7 | 71.5 |
| Misfortune | 65.6 | 48.2 | 65.2 | 65.1 |
| Overall | 78.7 | 67.4 | 76.3 | 76.3 |
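For readers unfamiliar with the metric, the sketch below shows a simplified ROUGE-N F-measure based on n-gram overlap. It is an illustration only: it uses lowercase whitespace tokenization, the function name `rouge_n_f1` is hypothetical, and the card does not state which ROUGE implementation or variant produced the scores above, so the numbers it returns are not guaranteed to match them. The example strings are placeholders, not model outputs.

```python
from collections import Counter


def rouge_n_f1(prediction, reference, n=1):
    """Simplified ROUGE-N F-measure from clipped n-gram overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    pred_ngrams = Counter(tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred_ngrams & ref_ngrams).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_ngrams.values())
    recall = overlap / sum(ref_ngrams.values())
    return 2 * precision * recall / (precision + recall)


# Placeholder strings standing in for a generated and a gold annotation.
predicted = "alpha beta"
gold = "alpha beta gamma"

rouge1 = rouge_n_f1(predicted, gold, n=1)
rouge2 = rouge_n_f1(predicted, gold, n=2)
print(round(100 * rouge1, 1), round(100 * rouge2, 1))
```

The final line multiplies by 100 to match the scale used in the table above.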
Disclaimer
Dream reports and their annotations have been used in clinical settings and applied for diagnostic purposes. This does not apply in any way to our experimental results and output. Our work aims to provide experimental evidence of the feasibility of using pre-trained language models (PLMs) to support humans in annotating dream reports for research purposes, and to detail their strengths and limitations when approaching such a task.
Framework versions
- Transformers 4.44.2
- Pytorch 2.1.0+cu118
- Datasets 3.0.1
- Tokenizers 0.19.1
Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "jrc-ai/PreDA-base"
device = "cpu"  # e.g. "cuda" if a GPU is available
encoder_max_length = 100
decoder_max_length = 50

# Load the tokenizer and the model, and place the model on the chosen device
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).to(device)

dream = "I was talking with my brother about my birthday dinner. I was feeling sad."
prefixes = ["Emotion", "Activities", "Characters"]

# Build one input per feature by prepending the feature name to the report
text_inputs = ["{} : {}".format(p, dream) for p in prefixes]

inputs = tokenizer(
    text_inputs,
    max_length=encoder_max_length,
    truncation=True,
    padding=True,
    return_tensors="pt",
)

# Greedy decoding, one generated annotation per prefixed input
output = model.generate(
    **inputs.to(device),
    do_sample=False,
    max_length=decoder_max_length,
)

for decode_dream in output:
    print(tokenizer.decode(decode_dream, skip_special_tokens=True))
Dual-Use Implication
Upon evaluation, we identified no dual-use implications for the present model. The model parameters, including the weights, are available under the CC0 1.0 Public Domain Dedication.
Cite
If you use our models in your research, please cite us as:
@InProceedings{10.1007/978-3-032-21477-5_13,
author="Bertolini, Lorenzo
and Comte, Valentin
and Ceresa, Mario
and Consoli, Sergio",
editor="Nicosia, Giuseppe
and Ojha, Varun
and Giesselbach, Sven
and Pardalos, Panos M.
and Umeton, Renato
and La Malfa, Emanuele
and La Malfa, Gabriele",
title="PreDA: Prefix-Based Dream Reports Annotation with Generative Language Models",
booktitle="Machine Learning, Optimization, and Data Science",
year="2026",
publisher="Springer Nature Switzerland",
address="Cham",
pages="191--206",
abstract="Dream reports are recollections of our experiences while asleep, and have strong research and clinical value. Since their analysis can be extremely time-consuming, researchers have adopted multiple types of automatised approaches, including, in more recent years, pre-trained language models (PLMs). However, most work has focused on limited aspects of the report content, such as characters or emotions. In this work, we introduce PreDA (prefix-based dream reports annotation), a framework to build language models able to annotate a dream report for multiple relevant aspects, using generative PLMs. We provide experimental evidence showing how a single PLM of small dimension can efficiently annotate a report on multiple features of the Hall and Van De Castle (HVDC) framework, give a detailed analysis of the model's performance, and explain how the training data impact learning and generalisation ability of the model.",
isbn="978-3-032-21477-5"
}