prometheus-eval

university

Recent Activity

amphora authored a paper about 4 hours ago

Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?

amphora authored a paper about 4 hours ago

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

amphora authored a paper about 4 hours ago

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

authored 14 papers about 4 hours ago

Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?

Paper • 2402.11597 • Published Feb 18, 2024

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Paper • 2406.05761 • Published Jun 9, 2024 • 3

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

Paper • 2506.00482 • Published May 31, 2025 • 8

From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation

Paper • 2507.08924 • Published Jul 11, 2025 • 18

Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought

Paper • 2510.04230 • Published Oct 5, 2025 • 27

Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces

Paper • 2510.06953 • Published Oct 8, 2025 • 9

Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context

Paper • 2509.11303 • Published Sep 14, 2025

What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

Paper • 2601.06165 • Published Jan 7 • 16

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Paper • 2510.24081 • Published Oct 28, 2025 • 24

Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

Paper • 2602.06291 • Published Feb 6 • 24

KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

Paper • 2604.13058 • Published Mar 18 • 2

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Paper • 2605.09063 • Published 20 days ago • 79

Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback

Paper • 2605.17448 • Published 12 days ago • 18

ResearchMath-14K: Scaling Research-Level Mathematics via Agents

Paper • 2605.28003 • Published 2 days ago • 38

submitted 2 papers to Daily Papers about 17 hours ago

ResearchMath-14K: Scaling Research-Level Mathematics via Agents

Paper • 2605.28003 • Published 2 days ago • 38

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Paper • 2605.27311 • Published 3 days ago • 3

updated a dataset about 22 hours ago

prometheus-eval/peerreview-bench

Viewer • Updated about 22 hours ago • 27.4k • 167 • 1

authored a paper 7 days ago

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

Paper • 2605.20668 • Published 9 days ago • 11

submitted a paper to Daily Papers 8 days ago

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

Paper • 2605.20668 • Published 9 days ago • 11

authored a paper 15 days ago

Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

Paper • 2603.18886 • Published Mar 19 • 6