Financial RoBERTa-Large Macro Sentiment

A fine-tuned version of soleimanian/financial-roberta-large-sentiment (355M params, RoBERTa-Large) for 3-class sentiment analysis on financial, macroeconomic, and climate/ESG text. Trained on about 20K samples (20,034) drawn from 5 public financial and climate sentiment datasets.

This is the best-performing of the three fine-tuned heads in the ensemble: it leads on every benchmark reported below (in-domain and out-of-domain), reaching 91.3% accuracy / 90.2% macro F1 in-domain and 94.1% accuracy on an out-of-domain Financial PhraseBank and news benchmark that is not in the training mix.

This model serves as the policy/macro head in the macro-sentiment-finbert ensemble pipeline. The topic router selects it for central bank communications, monetary policy text, and formal macroeconomic analysis. It also works well standalone as a general-purpose financial sentiment classifier.

Why RoBERTa-Large?

The base model (soleimanian/financial-roberta-large-sentiment) was pre-trained on a broad financial corpus — financial statements, earnings call transcripts, CSR/ESG reports, and financial news — making it the most versatile foundation among the ensemble's heads. At 355M parameters with 24 transformer layers (vs. 12 for BERT-base), it has substantially more capacity to capture nuanced sentiment in complex, formal financial prose.

The 3.2× parameter increase over FinBERT (109M) yields consistent gains:

Metric	FinBERT (109M)	RoBERTa-Large (355M)	Δ
In-domain F1 (macro)	0.8813	0.9023	+2.1pp
OOD PhraseBank F1	0.9134	0.9357	+2.2pp
OOD Stock News F1	0.6765	0.7265	+5.0pp

The largest improvement (+5pp) is on the hardest benchmark (noisy stock news headlines with 5→3 class mapping), suggesting that the extra capacity pays off most on challenging, out-of-distribution text.

Quick Start

Standalone Usage

from transformers import pipeline

pipe = pipeline("text-classification", model="peyterho/financial-roberta-large-macro-sentiment", top_k=None)

# Policy/macro text — the model's specialty
pipe("The Federal Reserve raised rates by 75 basis points, citing persistent inflation pressures.")
# [[{'label': 'neutral', 'score': 0.82}, {'label': 'negative', 'score': 0.15}, {'label': 'positive', 'score': 0.03}]]

pipe("The ECB signaled an extended pause in rate hikes, offering relief to debt-laden eurozone economies.")
# [[{'label': 'positive', 'score': 0.89}, {'label': 'neutral', 'score': 0.09}, {'label': 'negative', 'score': 0.02}]]

# General financial text — also excels here
pipe("Markets crashed amid recession fears and massive layoffs across the tech sector.")
# [[{'label': 'negative', 'score': 0.98}, {'label': 'neutral', 'score': 0.01}, {'label': 'positive', 'score': 0.01}]]

pipe("Q3 revenue surged 24% year-over-year, driven by strong cloud adoption.")
# [[{'label': 'positive', 'score': 0.97}, ...]]

As Part of the Full Pipeline

Within the macro-sentiment-finbert pipeline, this model is automatically selected when policy/macro keywords are detected (e.g., Federal Reserve, ECB, rate hike, monetary policy, inflation, FOMC, basis points, quantitative easing):

from macro_sentiment import MacroSentimentPipeline

pipe = MacroSentimentPipeline(device="cpu")

result = pipe("The Fed held rates steady but signaled two more cuts before year-end, a dovish surprise.")
print(result.summary())
# Sentiment: Positive (+0.421) | Policy: Dovish (-0.600) | Crisis: Normal (0.000) | Domain: policy
print(result.head_used)
# "policy"

result = pipe("Inflation expectations became unanchored as the central bank lost credibility.")
print(result.summary())
# Sentiment: Negative (-0.584) | Policy: Neutral (+0.000) | Crisis: Elevated (0.350) | Domain: policy

Label Mapping

This model uses the standard 3-class label ordering — no remapping needed:

Label ID	Label	Score Mapping
0	`negative`	-1.0
1	`neutral`	0.0
2	`positive`	+1.0

Unlike FinBERT (positive=0, negative=1) and ClimateBERT (opportunity=0, risk=2), this model's label IDs follow the natural negative→neutral→positive ordering, making standalone usage straightforward.
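
The Score Mapping column can be read as a way to collapse the three class probabilities into one signed value. The following is a minimal sketch, assuming the signed score is the probability-weighted average of the mapped values. The helper name `signed_score` is hypothetical, and the full pipeline's own aggregation may differ.

```python
# Score mapping from the table above: negative=-1.0, neutral=0.0, positive=+1.0.
SCORE_MAP = {"negative": -1.0, "neutral": 0.0, "positive": 1.0}


def signed_score(predictions):
    """Collapse per-class scores into one value in [-1, 1].

    `predictions` is one inner list as returned by the text-classification
    pipeline with top_k=None: [{'label': ..., 'score': ...}, ...].
    Assumption: a probability-weighted average, not a confirmed pipeline method.
    """
    return sum(SCORE_MAP[p["label"]] * p["score"] for p in predictions)


# Scores copied from the ECB example in the Quick Start section.
example = [
    {"label": "positive", "score": 0.89},
    {"label": "neutral", "score": 0.09},
    {"label": "negative", "score": 0.02},
]
print(round(signed_score(example), 2))  # 0.87
```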

Evaluation Results

In-Domain (Combined Test Set — 4,333 samples)

All three fine-tuned heads compared on the same held-out test split:

Model	Params	Accuracy	F1 (macro)	F1 (weighted)
RoBERTa-Large (this model)	355M	0.9130	0.9023	0.9137
FinBERT	109M	0.8973	0.8813	0.8984
ClimateBERT	82M	0.8885	0.8716	0.8898

Out-of-Domain: Financial Phrasebank (785 samples)

Evaluated on Jean-Baptiste/financial_news_sentiment_mixte_with_phrasebank_75 — PhraseBank (75% agreement) + financial news articles. Not in the training mix.

Model	Params	Accuracy	F1 (macro)
RoBERTa-Large (this model)	355M	0.9414	0.9357
ClimateBERT	82M	0.9248	0.9213
FinBERT	109M	0.9236	0.9134

OOD accuracy is higher than in-domain accuracy (94.1% vs. 91.3%), most likely because this PhraseBank-style data is closer to the base model's pre-training distribution: formal financial sentences with clearer sentiment signals than the noisier tweet and QA data in the training mix.

Out-of-Domain: Stock News Headlines (30,150 samples)

Evaluated on ic-fspml/stock_news_sentiment — 5-class labels mapped to 3-class. Not in the training mix.

Model	Params	Accuracy	F1 (macro)
RoBERTa-Large (this model)	355M	0.7211	0.7265
FinBERT	109M	0.6781	0.6765
ClimateBERT	82M	0.6472	0.6441

RoBERTa-Large's advantage is largest here (+5.0pp F1 over FinBERT), which is consistent with extra model capacity helping most on challenging, noisy, out-of-distribution text.
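
The exact 5-class to 3-class mapping used for this benchmark is not spelled out in this card. As an illustration only, one common way to collapse an ordinal 5-class scheme is sketched below. The label IDs (0 = strongly negative through 4 = strongly positive) are an assumption and may not match the dataset's actual encoding.

```python
# Hypothetical 5-class ordinal IDs: 0=strongly negative, 1=negative, 2=neutral,
# 3=positive, 4=strongly positive. The real dataset encoding may differ.
# Target IDs follow this model's schema: 0=negative, 1=neutral, 2=positive.
FIVE_TO_THREE = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}


def collapse_labels(five_class_ids):
    """Map a sequence of 5-class IDs onto the 3-class schema."""
    return [FIVE_TO_THREE[i] for i in five_class_ids]


print(collapse_labels([0, 1, 2, 3, 4]))  # [0, 0, 1, 2, 2]
```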

Comparison Against Baselines

Method	Accuracy	F1 (macro)	Notes
RoBERTa-Large (this model)	0.9130	0.9023	355M params, fine-tuned
FinBERT (fine-tuned)	0.8973	0.8813	109M params
ClimateBERT (fine-tuned)	0.8885	0.8716	82M params
Dict-only meta-classifier (GBT)	0.6693	0.5781	GradientBoosting on 24 dictionary features
Dict-only rules (LM + Henry)	0.5684	0.5277	Threshold-based, no learned parameters

Training Data

Fine-tuned on 20,034 training samples combined from 5 public financial/climate sentiment datasets:

Dataset	Domain	Train	Test	Label Mapping
nickmuchi/financial-classification	Financial PhraseBank	~4,800	~1,200	negative / neutral / positive
zeroshot/twitter-financial-news-sentiment	Financial tweets	~9,900	~2,500	bearish→neg, bullish→pos, neutral
FinanceInc/auditor_sentiment	Auditor reports	~3,600	~900	negative / neutral / positive
pauri32/fiqa-2018	Financial QA + microblog	~938	~235	Continuous score thresholded at ±0.15
climatebert/climate_sentiment	Climate disclosures	~1,000	~500	risk→neg, neutral, opportunity→pos

All datasets unified to 3-class: 0=negative, 1=neutral, 2=positive. This model's native label ordering already matches the unified schema — no remapping needed during training.
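
The label unification described in the table can be sketched as follows. This is an illustrative reconstruction from the Label Mapping column, not the author's preprocessing code: the function names are hypothetical, and treating FiQA scores that sit exactly on the threshold as neutral is an assumption.

```python
# Unified schema from the text: 0=negative, 1=neutral, 2=positive.
UNIFIED = {"negative": 0, "neutral": 1, "positive": 2}

# String aliases taken from the Label Mapping column of the table above.
ALIASES = {
    "bearish": "negative",
    "bullish": "positive",
    "risk": "negative",
    "opportunity": "positive",
}


def unify_label(label):
    """Map a dataset-specific string label to the unified 3-class ID."""
    name = label.strip().lower()
    return UNIFIED[ALIASES.get(name, name)]


def unify_fiqa_score(score, threshold=0.15):
    """Threshold a continuous FiQA score at +/- threshold.

    Assumption: scores exactly at the threshold are treated as neutral.
    """
    if score > threshold:
        return UNIFIED["positive"]
    if score < -threshold:
        return UNIFIED["negative"]
    return UNIFIED["neutral"]


print(unify_label("Bullish"), unify_label("risk"))      # 2 0
print(unify_fiqa_score(0.4), unify_fiqa_score(-0.1))    # 2 1
```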

Training Details

Hyperparameter	Value
Base model	soleimanian/financial-roberta-large-sentiment (355M, RoBERTa-Large, 24 layers, 1024 hidden, 16 heads)
Learning rate	1e-5 (lower than FinBERT/ClimateBERT's 2e-5 to stabilize large model training)
Batch size	16 × 4 gradient accumulation = 64 effective
Epochs	6 (best checkpoint at epoch 3 by val loss)
Scheduler	Linear decay with 20% warmup
Optimizer	AdamW (weight decay 0.01)
Max length	128 tokens
Precision	FP16
Class weighting	√(inverse frequency) — handles ~58% neutral class imbalance
Seed	42
Best model selection	Lowest validation loss (epoch 3)
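
The class weighting row can be illustrated with a short calculation. The sketch below assumes the weights are the square root of a balanced inverse frequency, N / (K × n_c). The exact normalization used in training is not stated here, and the class counts are illustrative only; they are chosen so that neutral makes up 58% of 20,000 samples.

```python
import math


def sqrt_inverse_frequency_weights(class_counts):
    """Return sqrt(N / (K * n_c)) per class.

    Assumption: "inverse frequency" means the balanced form N / (K * n_c);
    the square root dampens the weights relative to plain inverse frequency.
    """
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {
        label: math.sqrt(total / (num_classes * count))
        for label, count in class_counts.items()
    }


# Illustrative counts only (not the real split): neutral is 58% of 20,000.
counts = {"negative": 3000, "neutral": 11600, "positive": 5400}
weights = sqrt_inverse_frequency_weights(counts)
print({label: round(w, 2) for label, w in weights.items()})
# {'negative': 1.49, 'neutral': 0.76, 'positive': 1.11}
```

Under these assumptions the majority neutral class is down-weighted to about 0.76, while the rarest class is up-weighted to about 1.49 rather than the 2.22 that plain inverse frequency would give.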

Why Lower Learning Rate?

Large models (>300M params) tend to be more sensitive to the learning rate: if it is too high, fine-tuning can overwrite the pre-trained representations (often described as catastrophic forgetting). The 1e-5 rate (vs. 2e-5 for the 82–109M heads) follows the common heuristic of using a smaller learning rate for larger models during fine-tuning.
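
The schedule listed under Training Details (peak 1e-5, linear decay with 20% warmup) has the shape sketched below in plain Python. The function name and the `total_steps=1000` value are illustrative assumptions; the actual number of optimizer steps depends on the dataset size and the effective batch size.

```python
def linear_warmup_decay(step, total_steps, peak_lr=1e-5, warmup_ratio=0.2):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr at the end of warmup to 0 at total_steps.
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining


# Illustrative run length of 1,000 optimizer steps (hypothetical).
for step in (0, 100, 200, 600, 1000):
    print(f"{step:>5}  {linear_warmup_decay(step, 1000):.1e}")
```

With these settings the rate reaches its 1e-5 peak at step 200 (20% of the run) and falls back to zero at the final step.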

Training Curve

Epoch	Train Loss	Val Loss	Accuracy	F1 (macro)	F1 (weighted)
1	1.511	0.352	0.852	0.841	0.854
2	1.196	0.319	0.866	0.857	0.868
3	0.766	0.273	0.908	0.897	0.909
4	0.462	0.295	0.907	0.897	0.908
5	0.345	0.313	0.911	0.901	0.912
6	0.283	0.326	0.913	0.902	0.914

Validation loss reaches its minimum at epoch 3 (0.273), then gradually increases while accuracy and F1 continue improving through epoch 6. The best checkpoint (epoch 3, lowest val loss) is loaded for the final model — this checkpoint prioritizes calibration over raw accuracy. The higher training losses compared to FinBERT/ClimateBERT are expected given the lower learning rate and larger model capacity.
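
The difference between selecting on validation loss and selecting on F1 can be checked directly against the table. The snippet below only re-reads the numbers above; it is not the training code.

```python
# Rows copied from the Training Curve table: (epoch, val_loss, accuracy, f1_macro).
CURVE = [
    (1, 0.352, 0.852, 0.841),
    (2, 0.319, 0.866, 0.857),
    (3, 0.273, 0.908, 0.897),
    (4, 0.295, 0.907, 0.897),
    (5, 0.313, 0.911, 0.901),
    (6, 0.326, 0.913, 0.902),
]

# Selection rule stated in the text: lowest validation loss.
best_by_loss = min(CURVE, key=lambda row: row[1])
# Alternative rule, for comparison: highest macro F1.
best_by_f1 = max(CURVE, key=lambda row: row[3])

print(best_by_loss[0])  # 3
print(best_by_f1[0])    # 6
```

With the Hugging Face Trainer, this kind of selection is typically configured through `load_best_model_at_end` and `metric_for_best_model`; whether those exact settings were used here is an assumption.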

Role in the Ensemble Pipeline

This model serves as the policy/macro specialist in the macro-sentiment-finbert pipeline. The topic router activates it when policy-related keywords are detected in the input:

Input Text → Topic Router → policy/macro keywords? → RoBERTa-Large (this model) ★
                          → climate keywords?      → ClimateBERT
                          → non-English text?      → XLM-RoBERTa
                          → default                → FinBERT

Policy routing keywords include: Federal Reserve, Fed, FOMC, ECB, Bank of England, central bank, interest rate, rate hike, rate cut, monetary policy, quantitative easing/tightening, hawkish, dovish, inflation, CPI, GDP, recession, fiscal policy, yield curve, basis points, forward guidance, and many more (~60 keywords total).
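
A keyword router of this kind might look roughly like the sketch below. It is not the pipeline's implementation: only a subset of the policy keywords is included, the climate keywords are hypothetical, the priority order (policy before climate) is an assumption, and non-English detection is omitted.

```python
import re

# Subset of the policy keywords listed above; the full list (~60) is not reproduced.
POLICY_KEYWORDS = [
    "federal reserve", "fed", "fomc", "ecb", "central bank",
    "rate hike", "rate cut", "monetary policy", "inflation", "basis points",
]
# Hypothetical climate keywords, for illustration only.
CLIMATE_KEYWORDS = ["climate", "emissions", "carbon", "net zero"]


def _matches(text, keywords):
    """True if any keyword appears as a whole word or phrase (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(keyword) + r"\b", text, re.IGNORECASE)
        for keyword in keywords
    )


def route(text):
    """Return a head name for `text`. The priority order is an assumption."""
    if _matches(text, POLICY_KEYWORDS):
        return "policy"    # RoBERTa-Large (this model)
    if _matches(text, CLIMATE_KEYWORDS):
        return "climate"   # ClimateBERT
    return "default"       # FinBERT


print(route("The FOMC left rates unchanged."))          # policy
print(route("The firm cut carbon emissions by 12%."))   # climate
print(route("Q3 revenue surged 24% year-over-year."))   # default
```

Whole-word matching keeps a short keyword such as "Fed" from firing inside unrelated words, although it will still match the ordinary verb "fed".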

The pipeline then fuses this model's output with four dictionary-based signals (Loughran-McDonald, Henry earnings tone, Sautner-style climate exposure, macro policy/crisis dictionaries) using crisis-adaptive weighting to produce a structured MacroSentimentResult with financial sentiment, policy stance, crisis signal, and confidence scores.

Why Route Policy Text to RoBERTa-Large?

Central bank communications and macroeconomic analysis tend to be long, formal, and nuanced — a rate hike can be positive (inflation under control) or negative (economic tightening) depending on context. The 24-layer RoBERTa-Large architecture captures these contextual subtleties better than the shallower heads, and the base model's pre-training on earnings call transcripts and financial statements aligns well with the register of policy text.

Ensemble Siblings

Head	Model	Params	Role
RoBERTa-Large ★	peyterho/financial-roberta-large-macro-sentiment	355M	Policy/macro communications
FinBERT	peyterho/finbert-macro-sentiment	109M	Default — financial news, tweets
ClimateBERT	peyterho/climatebert-macro-sentiment	82M	Climate/ESG text
XLM-RoBERTa	cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual	278M	Non-English text (pre-trained, not fine-tuned)

Custom Fine-Tuning

Fine-tune this model further on your own labelled data:

pip install transformers datasets accelerate evaluate torch

python -m macro_sentiment.finetune \
    --data my_labels.csv \
    --text-column text \
    --label-column sentiment \
    --base-model peyterho/financial-roberta-large-macro-sentiment \
    --output my-org/my-policy-model \
    --push-to-hub \
    --epochs 4 \
    --lr 5e-6 \
    --batch-size 16

The fine-tuning script (from the parent repo) accepts CSV, TSV, JSON, or JSONL. Labels can be strings ("positive", "negative", "neutral", "bullish", "bearish") or integers (0/1/2). Class-weighted loss is applied by default. Note: use a lower learning rate (5e-6 to 1e-5) for this 355M model to avoid destabilizing the pre-trained weights.

Latency & Resource Considerations

At 355M parameters (1.3GB safetensors), this model requires more compute than the smaller heads:

Model	Params	Size	Throughput*
ClimateBERT	82M	314MB	~1,300 samples/s
FinBERT	109M	418MB	~670 samples/s
RoBERTa-Large (this model)	355M	1.3GB	~275 samples/s

*Measured on the eval set (batch size 64, CPU). Expect higher absolute throughput on GPU.

For latency-sensitive applications (real-time trading signals, high-volume social media monitoring), the FinBERT head may be preferable. For batch analysis of policy documents, earnings calls, and macroeconomic reports where accuracy matters most, this model is the best choice.
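
The table figures translate into rough batch-time and size estimates as follows. This is back-of-the-envelope arithmetic from the numbers above, assuming throughput stays constant across the run and that weights are stored as float32 (4 bytes per parameter).

```python
# Throughput copied from the table above (samples per second, CPU, batch size 64).
THROUGHPUT = {"ClimateBERT": 1300, "FinBERT": 670, "RoBERTa-Large": 275}


def seconds_for(n_samples, model):
    """Estimated wall-clock seconds to score n_samples at the listed throughput."""
    return n_samples / THROUGHPUT[model]


def weights_size_gib(n_params, bytes_per_param=4):
    """Size of the raw weights in GiB; assumes float32 (4 bytes per parameter)."""
    return n_params * bytes_per_param / 2**30


# Example workload: the 30,150-headline stock news benchmark.
print(round(seconds_for(30_150, "RoBERTa-Large")))  # 110
print(round(seconds_for(30_150, "FinBERT")))        # 45
print(round(weights_size_gib(355e6), 2))            # 1.32
```

The 1.32 GiB estimate is close to the 1.3GB file size quoted above.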

Limitations

English only — the base model was trained on English financial text. Non-English text should use the multilingual head in the full pipeline.
128-token max length — fine-tuned with 128-token truncation. Longer documents (FOMC minutes, earnings calls, 10-K filings) should be chunked at paragraph level. Consider overlapping windows for documents where context spans paragraph boundaries.
Higher latency — 3.2× more parameters than FinBERT means ~2.4× slower inference. The pipeline's topic routing mitigates this by only invoking RoBERTa-Large for policy-classified text.
Memory footprint — requires ~1.5GB GPU memory for inference (vs. ~500MB for FinBERT). May not fit on memory-constrained edge devices.
Label noise — the FiQA dataset uses continuous sentiment scores thresholded at ±0.15, and the Twitter dataset contains informal language that differs from this model's pre-training distribution (formal financial text).
No temporal or entity awareness — treats each text independently. Cannot reason about whether "rates rising" is positive (for banks) or negative (for growth stocks) without additional context.
Policy routing heuristic — the topic router uses keyword matching, not a learned classifier. Ambiguous text (e.g., a climate policy announcement that is both policy and climate) may be routed suboptimally. Use mode="all" in the pipeline to ensemble across all heads for important decisions.
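
For the 128-token limit, overlapping windows can be sketched as below. Whitespace-separated words stand in for tokens here, which is an assumption: the real limit is 128 subword tokens, so the window is kept smaller as a rough margin. Averaging per-class scores across chunks is one simple aggregation choice, not necessarily what the pipeline does.

```python
def overlapping_windows(words, window=100, stride=50):
    """Split a list of words into overlapping chunks of at most `window` words.

    Consecutive chunks share `window - stride` words when stride < window.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not words:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # the last chunk already reaches the end of the document
        start += stride
    return chunks


def average_scores(per_chunk_scores):
    """Average per-class scores across chunks (one simple aggregation choice)."""
    labels = per_chunk_scores[0].keys()
    n = len(per_chunk_scores)
    return {label: sum(s[label] for s in per_chunk_scores) / n for label in labels}


# A 230-word stand-in document yields windows starting at words 0, 50, 100, 150.
words = ("rates " * 230).split()
chunks = overlapping_windows(words)
print(len(chunks))  # 4
```

Each chunk can then be passed to the classifier separately, with `average_scores` combining the per-chunk class scores into a single document-level result.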

Citation

@article{araci2019finbert,
    title={FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models},
    author={Araci, Dogu},
    journal={arXiv preprint arXiv:1908.10063},
    year={2019}
}

@article{loughran2011liability,
    title={When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks},
    author={Loughran, Tim and McDonald, Bill},
    journal={The Journal of Finance},
    volume={66},
    number={1},
    pages={35--65},
    year={2011}
}

@article{henry2008earnings,
    title={Are investors influenced by how earnings press releases are written?},
    author={Henry, Elaine},
    journal={Journal of Business Communication},
    volume={45},
    number={4},
    pages={363--407},
    year={2008}
}

Framework Versions

Transformers 5.6.2
PyTorch 2.11.0+cu130
Datasets 4.8.4
Tokenizers 0.22.2

License

Apache 2.0
