Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

Xu, Zhenlin; Zhu, Yi; Deng, Tiffany; Mittal, Abhay; Chen, Yanbei; Wang, Manchen; Favaro, Paolo; Tighe, Joseph; Modolo, Davide

arXiv:2306.16048 (cs)

[Submitted on 28 Jun 2023 (v1), last revised 29 Jan 2024 (this version, v2)]

Abstract: This paper introduces innovative benchmarks to evaluate Vision-Language Models (VLMs) in real-world zero-shot recognition tasks, focusing on the granularity and specificity of prompting text. We propose a unique evaluation protocol using adapted ImageNet and MS-COCO datasets to assess models' consistency in recognizing concepts at varying granularity levels and their sensitivity to the specificity of language inputs. Our extensive evaluation reveals that state-of-the-art VLMs, including contrastive models like CLIP, struggle with granularity and are sensitive to text specificity, impacting their effectiveness in open-world settings. This comprehensive study, a first in evaluating VLMs from these perspectives, provides valuable insights and tools for the community, highlighting the limitations and paving the way for enhanced models with better generalization in zero-shot recognition.

Comments:	Additional experiments
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2306.16048 [cs.CV]
	(or arXiv:2306.16048v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.16048

Submission history

From: Zhenlin Xu
[v1] Wed, 28 Jun 2023 09:29:06 UTC (4,446 KB)
[v2] Mon, 29 Jan 2024 10:45:58 UTC (4,564 KB)
