Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once? Paper • 2402.11597 • Published Feb 18, 2024
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models Paper • 2406.05761 • Published Jun 9, 2024 • 3
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation Paper • 2506.00482 • Published May 31, 2025 • 8
From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation Paper • 2507.08924 • Published Jul 11, 2025 • 18
Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought Paper • 2510.04230 • Published Oct 5, 2025 • 27
Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces Paper • 2510.06953 • Published Oct 8, 2025 • 9
Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context Paper • 2509.11303 • Published Sep 14, 2025
What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models Paper • 2601.06165 • Published Jan 7 • 16
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures Paper • 2510.24081 • Published Oct 28, 2025 • 24
Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math Paper • 2602.06291 • Published Feb 6 • 24
KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context Paper • 2604.13058 • Published Mar 18 • 2
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs Paper • 2605.09063 • Published 20 days ago • 79
Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback Paper • 2605.17448 • Published 12 days ago • 18
ResearchMath-14K: Scaling Research-Level Mathematics via Agents Paper • 2605.28003 • Published 2 days ago • 38
Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models Paper • 2605.27311 • Published 3 days ago • 3
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists Paper • 2605.20668 • Published 9 days ago • 11
Reasoning over mathematical objects: on-policy reward modeling and test time aggregation Paper • 2603.18886 • Published Mar 19 • 6