Papers
arxiv:2512.09299

VABench: A Comprehensive Benchmark for Audio-Video Generation

Published on Dec 10
ยท Submitted by
bohan zeng
on Dec 18
Authors:
,
,
,
,
,
,
,
,

Abstract

VABench is a benchmark framework for evaluating audio-video generation models, covering text-to-audio-video, image-to-audio-video, and stereo audio-video tasks with 15 evaluation dimensions.

AI-generated summary

Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.

Community

Paper submitter

arXiv lens breakdown of this paper ๐Ÿ‘‰ https://arxivlens.com/PaperView/Details/vabench-a-comprehensive-benchmark-for-audio-video-generation-5469-093c5d48

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2512.09299 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2512.09299 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2512.09299 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.