Collections
Collections including paper arxiv:2603.15619

- FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use (arXiv:2603.08262)
- On-Policy Context Distillation for Language Models (arXiv:2602.12275)
- Online Experiential Learning for Language Models (arXiv:2603.16856)
- Mixture-of-Depths Attention (arXiv:2603.15619)

- Attention Is All You Need (arXiv:1706.03762)
- Attention Residuals (arXiv:2603.15031)
- Mixture-of-Depths Attention (arXiv:2603.15619)
- Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models (arXiv:2603.15557)

- OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation (arXiv:2601.15369)
- Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model (arXiv:2601.15892)
- Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders (arXiv:2601.16208)
- NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems (arXiv:2601.11004)

- MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models (arXiv:2511.18373)
- Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO (arXiv:2511.13288)
- Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens (arXiv:2511.19418)
- SAM 3: Segment Anything with Concepts (arXiv:2511.16719)

- Selective Attention Improves Transformer (arXiv:2410.02703)
- Differential Transformer (arXiv:2410.05258)
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention (arXiv:2410.05076)
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs (arXiv:2410.13276)

- Post-LayerNorm Is Back: Stable, ExpressivE, and Deep (arXiv:2601.19895)
- Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers (arXiv:2601.17367)
- Small-scale proxies for large-scale Transformer training instabilities (arXiv:2309.14322)
- Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training (arXiv:2602.00747)

- YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection (arXiv:2512.23273)
- A 58-Addition, Rank-23 Scheme for General 3x3 Matrix Multiplication (arXiv:2512.21980)
- Step-DeepResearch Technical Report (arXiv:2512.20491)
- SAM Audio: Segment Anything in Audio (arXiv:2512.18099)

- Test-Time Scaling with Reflective Generative Model (arXiv:2507.01951)
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (arXiv:2502.05171)
- Autoregressive Diffusion Models (arXiv:2110.02037)
- EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling (arXiv:2502.09509)

- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data (arXiv:2404.15653)
- MoDE: CLIP Data Experts via Clustering (arXiv:2404.16030)
- MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning (arXiv:2405.12130)
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (arXiv:2405.12981)