InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Wang, Yi; He, Yinan; Li, Yizhuo; Li, Kunchang; Yu, Jiashuo; Ma, Xin; Li, Xinhao; Chen, Guo; Chen, Xinyuan; Wang, Yaohui; He, Conghui; Luo, Ping; Liu, Ziwei; Wang, Yali; Wang, Limin; Qiao, Yu

arXiv:2307.06942 (cs)

[Submitted on 13 Jul 2023 (v1), last revised 4 Jan 2024 (this version, v2)]

Abstract: This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words. Our core contribution is a scalable approach to autonomously building a high-quality video-text dataset with large language models (LLMs), and a demonstration of its efficacy in learning video-language representations at scale. Specifically, we use a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Trained on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks such as recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, and for advancing video-to-text and text-to-video generation research. These resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.
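
To give a sense of scale, the headline figures in the abstract can be turned into rough per-clip and per-video averages. The sketch below is only back-of-the-envelope arithmetic: the inputs are the rounded values quoted above ("over 7 million" videos, "nearly 760K" hours, 234M clips, 4.1B words), and the function name `dataset_averages` is a hypothetical helper, not part of the InternVid release. The results are therefore approximations and may differ from statistics reported in the paper itself.

```python
def dataset_averages(num_videos, total_hours, num_clips, total_words):
    """Derive rough averages from rounded dataset totals (illustrative only)."""
    total_seconds = total_hours * 3600
    return {
        "seconds_per_clip": total_seconds / num_clips,
        "words_per_clip": total_words / num_clips,
        "clips_per_video": num_clips / num_videos,
        "minutes_per_video": total_hours * 60 / num_videos,
    }


# Rounded figures taken from the abstract; exact counts are not given there.
stats = dataset_averages(
    num_videos=7_000_000,
    total_hours=760_000,
    num_clips=234_000_000,
    total_words=4_100_000_000,
)

for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these rounded inputs, each clip lasts on the order of ten seconds and carries a description under twenty words long, which matches the abstract's framing of short clips paired with text descriptions.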

Comments:	Data and Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2307.06942 [cs.CV]
	(or arXiv:2307.06942v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2307.06942

Submission history

From: Yi Wang
[v1] Thu, 13 Jul 2023 17:58:32 UTC (17,177 KB)
[v2] Thu, 4 Jan 2024 05:00:34 UTC (5,866 KB)
