LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication
This paper presents a performance study of multi-node distributed inference with large language models (LLMs) on GPU-based supercomputers. The authors identify inter-node all-reduce communication as a key bottleneck and develop NVRAR, a hierarchical all-reduce algorithm that substantially reduces communication latency and, in turn, end-to-end batch latency for multi-node inference.
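To make the hierarchical idea concrete, here is a minimal NumPy simulation of a two-level all-reduce of this general shape: ranks reduce inside each node over fast intra-node links, only one partial sum per node crosses the slower network, and the result is broadcast back. The rank/node layout is an illustrative assumption; NVRAR's actual GPU-resident implementation is not shown here.

```python
# Minimal NumPy simulation of a hierarchical all-reduce:
# (1) intra-node reduce, (2) reduce across node "leaders",
# (3) intra-node broadcast. Illustrative only; a real implementation
# runs on the GPUs themselves, not in host NumPy.
import numpy as np

def hierarchical_allreduce(rank_buffers, ranks_per_node):
    """rank_buffers: list of equal-length arrays, one per global rank."""
    n = len(rank_buffers)
    assert n % ranks_per_node == 0
    nodes = [rank_buffers[i:i + ranks_per_node]
             for i in range(0, n, ranks_per_node)]

    # Step 1: reduce inside each node (fast NVLink-class links).
    node_sums = [np.sum(node, axis=0) for node in nodes]

    # Step 2: reduce across node leaders (slow network links);
    # only one buffer per node crosses the network.
    global_sum = np.sum(node_sums, axis=0)

    # Step 3: broadcast the result back to every rank in every node.
    return [global_sum.copy() for _ in range(n)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bufs = [rng.standard_normal(8) for _ in range(8)]   # 8 ranks
    out = hierarchical_allreduce(bufs, ranks_per_node=4)  # 2 nodes x 4 GPUs
    assert np.allclose(out[0], np.sum(bufs, axis=0))
    print("all ranks hold the global sum:", out[0][:3])
```

The point of the hierarchy is that the expensive inter-node step moves one buffer per node instead of one per rank.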
ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
This paper proposes Pairwise Rotation Quantization (ParoQuant), a weight-only post-training quantization method that reduces the memory footprint of large language models (LLMs) while preserving accuracy on reasoning tasks, targeting the weight-outlier problem that hampers prior quantization methods.
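The toy below illustrates why rotating channel pairs helps group quantization: mixing an outlier channel with a mild one via a 2x2 Givens rotation shrinks the group's maximum magnitude, and hence its shared quantization scale. The pairing, angle, and group layout here are illustrative assumptions, not ParoQuant's actual optimized rotations.

```python
# Toy pairwise-rotation-before-quantization demo: one shared int4 scale
# per group is dominated by the largest weight, so spreading an outlier
# across a pair of channels reduces everyone's rounding error.
import numpy as np

def group_quantize(M):
    """Symmetric int4 quantize/dequantize with one shared scale per group."""
    scale = np.abs(M).max() / 7.0
    q = np.clip(np.round(M / scale), -8, 7)
    return q * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 2))
W[:, 0] *= 20.0                        # channel 0 carries outliers

c = s = np.sqrt(0.5)                   # 45-degree Givens rotation
R = np.array([[c, -s], [s, c]])

err_plain = np.abs(group_quantize(W) - W).mean()
W_hat = group_quantize(W @ R) @ R.T    # rotate, quantize, undo rotation
err_rot = np.abs(W_hat - W).mean()
print(f"group-quant error without rotation: {err_plain:.3f}")
print(f"group-quant error with rotation:    {err_rot:.3f}")
```

Because the rotation is orthogonal, it can be folded into inference exactly; only the quantization error changes.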
Black-Box On-Policy Distillation of Large Language Models
The authors introduce Generative Adversarial Distillation (GAD) for on-policy, black-box distillation of LLMs: the student acts as a generator trained against a discriminator that distinguishes its outputs from the teacher's. Experiments show that GAD outperforms existing knowledge-distillation methods and approaches the teacher's performance on numerous benchmarks.
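The adversarial loop is easiest to see in a toy setting. Below, a student distribution over single-token "responses" never sees the teacher's probabilities, only a discriminator's teacher-vs-student judgment used as an on-policy reward. The tabular policy, the analytically optimal discriminator, and the exact policy gradient are simplifying assumptions; GAD itself trains LLM policies with sequence-level rewards.

```python
# Toy generative adversarial distillation over a categorical "response"
# distribution: the student matches a black-box teacher using only a
# discriminator's score as reward.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

V = 4
p_teacher = np.array([0.6, 0.2, 0.15, 0.05])   # black-box output statistics
theta = np.zeros(V)                            # student logits

for step in range(2000):
    pi = softmax(theta)
    # Optimal discriminator for the current student:
    # D(t) = P(teacher | t) with balanced teacher/student samples.
    D = p_teacher / (p_teacher + pi)
    # Policy gradient with the discriminator score as reward and the
    # average reward as baseline (expectation computed exactly, since
    # the "response space" is tiny).
    baseline = pi @ D
    theta += 1.0 * pi * (D - baseline)

print("teacher:", p_teacher)
print("student:", np.round(softmax(theta), 3))
```

At equilibrium the discriminator outputs 0.5 everywhere, which happens exactly when the student's distribution matches the teacher's.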
SSR: Socratic Self-Refine for Large Language Model Reasoning
This paper presents Socratic Self-Refine (SSR), a framework that improves the reasoning capabilities of LLMs by iteratively verifying and refining model outputs, yielding consistent gains across a range of reasoning benchmarks.
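A structural sketch of such a verify-then-refine loop is shown below: decompose a solution into steps, pose a targeted check question about each step, and re-solve from the first unreliable step until the checks pass or the budget runs out. The `llm` callable is a hypothetical stand-in for a real model call (stubbed so the sketch runs); SSR's actual decomposition, confidence estimation, and prompts differ.

```python
# Sketch of a Socratic self-refine loop with a pluggable (hypothetical)
# `llm` function. The control flow, not the prompts, is the point.
from typing import Callable, List, Tuple

def socratic_self_refine(
    question: str,
    llm: Callable[[str], str],
    max_rounds: int = 3,
) -> str:
    answer = llm(f"Solve step by step: {question}")
    for _ in range(max_rounds):
        steps = [s for s in answer.split("\n") if s.strip()]
        # Ask a check question about each step and collect verdicts.
        verdicts: List[Tuple[int, str]] = [
            (i, llm(f"Question: {question}\nStep: {step}\n"
                    f"Is this step correct? Answer yes or no."))
            for i, step in enumerate(steps)
        ]
        failing = [i for i, v in verdicts if v.strip().lower().startswith("no")]
        if not failing:
            return answer                  # all steps verified
        # Re-solve from the first unreliable step onward.
        prefix = "\n".join(steps[:failing[0]])
        answer = llm(f"Solve step by step: {question}\n"
                     f"Keep these verified steps:\n{prefix}\n"
                     f"and redo the rest.")
    return answer

if __name__ == "__main__":
    # Trivial stub that "verifies" everything, just to exercise the loop.
    def stub_llm(prompt: str) -> str:
        return "yes" if "correct?" in prompt else "step 1: ...\nstep 2: ..."
    print(socratic_self_refine("2 + 2?", stub_llm))
```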
Towards an Agentic Workflow for Internet Measurement Research
ArachNet is presented, an AI-driven tool that allows rapid development of measurement workflows for internet resilience scenarios, enhancing operators’ capabilities during network disruptions and democratizing access to sophisticated measurement tools.
Teaching LLMs to See and Guide: Context-Aware Real-Time Assistance in Augmented Reality
This work introduces a context-aware LLM assistant that fuses multimodal data in AR and VR environments, improving task guidance and user assistance, particularly in industrial settings.
Towards Emotionally Intelligent and Responsible Reinforcement Learning
A Responsible Reinforcement Learning framework is proposed that integrates emotional understanding and ethical considerations into sequential decision-making in sensitive contexts such as mental health.
Andrew Yang on Modern Slavery Statement Monitoring
This research discusses the development of AI tools to monitor compliance with modern slavery regulations, emphasizing the importance of enforceable verification mechanisms built with large language models.
Probability-Only Approach to Uncertainty Estimation in Large Language Models
The authors present a probability-only method for estimating uncertainty in LLMs: it derives uncertainty directly from the model's output probabilities, simplifying the estimation pipeline, outperforming traditional methods, and helping flag hallucinated outputs.
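As a minimal sketch of what "probability-only" can mean in practice, the snippet below scores a generation by the length-normalized log-likelihood of its tokens, with no extra sampling or auxiliary models. This is one common such score; the paper's exact estimator and threshold may differ, and the per-token probabilities here are hypothetical.

```python
# Probability-only uncertainty scoring from per-token probabilities.
import math
from typing import List

def sequence_confidence(token_probs: List[float]) -> float:
    """Geometric mean of token probabilities (length-normalized likelihood)."""
    avg_logp = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_logp)

def flag_low_confidence(token_probs: List[float], threshold: float = 0.5) -> bool:
    """Flag generations whose average token confidence falls below a threshold."""
    return sequence_confidence(token_probs) < threshold

confident = [0.95, 0.90, 0.97, 0.92]   # hypothetical per-token probabilities
uncertain = [0.95, 0.20, 0.90, 0.30]
print(sequence_confidence(confident))  # ~0.93
print(sequence_confidence(uncertain))  # ~0.48
print(flag_low_confidence(uncertain))  # True
```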
Can LLM-Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study
An automated framework for generating textual explanations using multiple LLMs is evaluated. The study finds that LLM-generated explanations can significantly improve the classification performance of pre-trained models.
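One simple way such explanations can reach a classifier is to append each generated rationale to its input before vectorizing, so the model sees both the text and the explanation. The sketch below shows this with a standard scikit-learn pipeline; the toy data and the concatenation scheme are illustrative assumptions, not the paper's exact framework.

```python
# Augmenting classifier inputs with (hypothetical) LLM-generated
# explanations before a standard TF-IDF + logistic-regression pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["the movie was a delight", "utterly boring and slow",
         "a joyful, moving story", "i regret watching this"]
labels = [1, 0, 1, 0]
# In the real framework these would come from one or more LLMs.
explanations = ["expresses enjoyment", "expresses disinterest",
                "expresses positive emotion", "expresses regret"]

augmented = [f"{t} [EXPLANATION] {e}" for t, e in zip(texts, explanations)]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(augmented, labels)
print(clf.predict(["a joyful delight [EXPLANATION] expresses enjoyment"]))
```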
