Scalably Enhancing the Clinical Validity of a Task Benchmark with Physician Oversight
Proposes a physician-in-the-loop framework for the continuous improvement and validation of a clinical risk-scoring benchmark, highlighting the impact of label noise on model training accuracy.
From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs
Presents a large-scale dataset for assessing the spatial reasoning capabilities of Multimodal Large Language Models (MLLMs), demonstrating that performance drops when moving from indoor scenes to open-world settings.
GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators
Introduces GenEnv, a framework in which an environment simulator co-evolves with LLM agents to keep task difficulty aligned with their ability, yielding performance improvements of up to 40.3% across benchmarks.
LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
Evaluates the coding capabilities of large language models on expert-curated Olympiad-level problems, revealing that current LLMs still lag behind top human contestants.
Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)
Demonstrates the efficacy of multimodal LLMs in constructing a large dataset of historical German patents from archival image scans, at greater speed and lower cost than human researchers.
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
Presents a novel RL paradigm that decomposes a language model's policy into internal, layer-level policies and optimizes them, achieving better performance on reasoning benchmarks.
Bridging the Gap Between Scientific Laws Derived by AI Systems and Canonical Knowledge via Abductive Inference with AI-Noether
Proposes AI-Noether, an algebraic-geometry-based system that uses abductive inference to derive candidate axioms extending existing scientific theories, with demonstrated applications across a range of scientific laws.
Exploring Zero-Shot ACSA with Unified Meaning Representation in Chain-of-Thought Prompting
Investigates chain-of-thought prompting with a unified meaning representation for zero-shot aspect-category sentiment analysis (ACSA), finding that results depend on model scale.
Empowering LLMs with Structural Role Inference for Zero-Shot Graph Learning
Introduces DuoGLM, a framework for structure-aware graph reasoning that infers structural roles, emphasizing the importance of role-based understanding for zero-shot graph learning with LLMs.
