Thursday, April 23, 2026
Today's Newspaper
AMUSE: An Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding
Published: February 24, 2026, 12:00 AM

Recent advances in multimodal large language models (MLLMs) have significantly improved performance on a wide range of natural language understanding tasks. However, these models still struggle in dialogue-centric scenarios involving multiple speakers, where agentic reasoning is essential: tracking who is speaking, maintaining speaker roles, and grounding events across time.

To address these challenges, researchers at Apple have introduced AMUSE, an Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding. AMUSE provides a benchmark focused on tasks that require agentic reasoning in multimodal audio-video understanding. By jointly reasoning over audio and visual streams, models evaluated and aligned with AMUSE can improve in applications such as conversational video assistants and meeting analytics.

AMUSE is designed to test whether models can decompose complex audio-visual scenes, identify speakers, track speaker roles, and follow the context of a conversation over time. The benchmark comprises tasks that assess a model's capability to reason about the agentic aspects of multi-speaker interaction, giving researchers insight into the strengths and weaknesses of current multimodal models and guiding the development of more robust audio-visual understanding systems.

In summary, AMUSE offers a standardized benchmark and alignment framework for evaluating multimodal models in dialogue-centric environments. By focusing on tasks that require agentic reasoning, it contributes to the development of models that better handle the complexity of real-world multi-speaker interaction.
Researchers and developers can leverage the AMUSE benchmark to drive innovation in audio-visual understanding and to enhance applications that rely on multi-speaker dialogue analysis.
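To illustrate how a benchmark task like "who spoke at time t?" might be scored, here is a minimal sketch in Python. AMUSE's actual data format, task names, and metrics are not described in this summary, so the function name, field layout, and example data below are all assumptions for illustration only.

```python
# Hypothetical sketch: scoring a model's answers on a multi-speaker
# attribution task (e.g., "who is speaking at this moment?").
# The metric here is plain exact-match accuracy; a real benchmark
# may use more nuanced scoring (temporal tolerance, partial credit).

def speaker_attribution_accuracy(predictions, references):
    """Fraction of questions where the predicted speaker matches the reference."""
    if not references:
        return 0.0
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return correct / len(references)

# Illustrative data: five "who is speaking?" questions over one clip.
preds = ["Alice", "Bob", "Alice", "Carol", "Bob"]
refs  = ["Alice", "Bob", "Bob",   "Carol", "Bob"]
print(speaker_attribution_accuracy(preds, refs))  # → 0.8
```

Exact-match accuracy is the simplest choice for such tasks; role tracking and temporal grounding would need richer references (e.g., labeled time spans) rather than single answers per question.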


Source: Apple
Summary and translation: reporter Seo Hyun-jin, Miju Today