TVQA is a large-scale video QA dataset based on six popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey's Anatomy, Castle). It consists of 152.5K QA pairs from 21.8K video clips, spanning over 460 hours of video. The questions are designed to be compositional, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts.
Consisting of 152.5K QA pairs from 21.8K clips, TVQA is one of the largest datasets of its kind.
Questions are designed to be compositional, requiring both visual and textual cues.
QA pairs are temporally localized with additional timestamp annotations.
TVQA videos are made from popular TV shows, the ones you'd love!
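To make the timestamp annotation more concrete, here is a minimal Python sketch of how a temporally localized QA pair might be represented and used to select the subtitle lines that overlap the annotated moment. The `QAExample` class, its field names, the `subtitles_in_span` helper, and the sample values are all hypothetical and chosen for illustration; they are not the official TVQA file format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QAExample:
    """Hypothetical record layout (illustrative, not the official schema)."""
    clip_id: str
    question: str
    answers: List[str]          # candidate answers
    answer_idx: int             # index of the correct candidate
    span: Tuple[float, float]   # annotated (start_sec, end_sec) within the clip


def subtitles_in_span(example: QAExample,
                      subtitles: List[Tuple[float, float, str]]) -> List[str]:
    """Return the text of subtitle lines that overlap the annotated span.

    Each subtitle is a (start_sec, end_sec, text) tuple.
    """
    start, end = example.span
    return [text for (s, e, text) in subtitles if s < end and e > start]


# Made-up example values, for illustration only.
example = QAExample(
    clip_id="clip_0001",
    question="What is she looking for when she walks into the kitchen?",
    answers=["Her cup", "Her keys"],
    answer_idx=0,
    span=(2.5, 6.0),
)
subtitles = [
    (0.0, 2.0, "Hi."),
    (3.0, 5.0, "Where is my cup?"),
    (9.0, 11.0, "Bye."),
]
relevant = subtitles_in_span(example, subtitles)
```

Restricting the subtitles (and, analogously, the video frames) to the annotated span is one simple way a system could use the localization signal; a model could also learn to predict the span itself.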
TVQA: Localized, Compositional Video Question Answering
Empirical Methods in Natural Language Processing (EMNLP) 2018