Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review
In July 2025, 18 academic manuscripts on the preprint website arXiv were found to contain hidden instructions known as prompts designed to manipulate AI-assisted peer review. The incident exposes systematic vulnerabilities extending beyond peer review to any automated system processing scholarly texts, underscoring the need for coordinated technical screening.
Bias, Accuracy, and Trust: Gender-Diverse Perspectives on Large Language Models
This study examines how gender-diverse populations perceive bias, accuracy, and trustworthiness in LLMs, revealing significant insights on gendered responses and the necessity for more inclusive AI systems.
OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety
OpenAgentSafety introduces a modular framework for evaluating agent behavior across essential risk categories, highlighting safety vulnerabilities in AI agents deployed in real-world tasks.
Dynamic Context-Aware Prompt Recommendation for Domain-Specific AI Applications
This paper presents a novel prompt recommendation system for domain-specific AI applications, significantly enhancing the effectiveness of user interactions by automatically generating high-quality prompts.
UQLM: A Python Package for Uncertainty Quantification in Large Language Models
UQLM provides an off-the-shelf solution for detecting hallucinations in LLMs using uncertainty quantification techniques, aiding reliability in AI outputs.
SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads
SQLBarber enables the generation of customized SQL queries, significantly enhancing benchmarking in database research, and addressing challenges related to data privacy.
Few-shot text-based emotion detection
The Unibuc – NLP team details their approach to the SemEval 2025 Workshop, leveraging large language models for emotion detection with effective results across multiple languages.
A Survey on Latent Reasoning
This comprehensive survey provides insights into latent reasoning methods for LLMs, aiming to improve their computational reasoning capabilities without token-level supervision.
MedGemma Technical Report
MedGemma introduces a collection of medical foundation models demonstrating advanced capabilities, significantly contributing to AI applications in healthcare.
Large Language Models Predict Human Well-being — But Not Equally Everywhere
This study evaluates the predictive accuracy of LLMs regarding human well-being across different nations, revealing systemic biases and the limitations inherent to training data.