TVQA Dataset


A Localized, Compositional Video Question Answering Dataset

What is TVQA?


TVQA is a large-scale video QA dataset based on 6 popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey's Anatomy, Castle). It consists of 152.5K QA pairs from 21.8K video clips, spanning over 460 hours of video. The questions are designed to be compositional, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts.

Large-Scale

With 152.5K QA pairs from 21.8K clips, TVQA is one of the largest video QA datasets of its kind.

Compositional

Questions are designed to be compositional, requiring both visual and textual cues.

Localized

QA pairs are temporally localized with additional timestamp annotations.

It's Fun!

TVQA videos come from popular TV shows, the ones you already love!

Paper


TVQA: Localized, Compositional Video Question Answering

Jie Lei, Licheng Yu, Mohit Bansal, Tamara L. Berg.

Empirical Methods in Natural Language Processing (EMNLP) 2018

Team


Jie Lei

University of North Carolina at Chapel Hill

Licheng Yu

University of North Carolina at Chapel Hill

Mohit Bansal

University of North Carolina at Chapel Hill

Tamara L. Berg

University of North Carolina at Chapel Hill

Download


TVQA Dataset

The dataset is for non-commercial academic use only. Get download links as well as data descriptions here.

Code

The code accompanies the TVQA paper. It includes benchmark models and data loading tools for the dataset. Coming soon; follow this repository for updates.
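
As a rough illustration of how the QA annotations could be read once the data is released, here is a minimal Python sketch. The file name and field names (question text, clip name, timestamp span) are assumptions made for this example only, not the official release format described on the download page.

    import json
    from pathlib import Path

    # Hypothetical annotation file; the real file names and fields are
    # defined by the released dataset, not by this sketch.
    ANNOTATION_FILE = Path("tvqa_train.jsonl")

    def load_qa_pairs(path):
        """Load one QA pair per line from a JSON Lines annotation file."""
        qa_pairs = []
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    qa_pairs.append(json.loads(line))
        return qa_pairs

    if __name__ == "__main__":
        qa_pairs = load_qa_pairs(ANNOTATION_FILE)
        print(f"Loaded {len(qa_pairs)} QA pairs")
        # Assumed example fields: the question, the source clip name,
        # and the localized timestamp span.
        example = qa_pairs[0]
        print(example.get("q"), example.get("vid_name"), example.get("ts"))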