Zelda: Video Analytics using Vision-Language Models

Abstract

Advances in ML have motivated the design of video analytics systems that allow for structured queries over video datasets. However, existing systems limit query expressivity, require users to specify an ML model per predicate, rely on complex optimizations that trade off accuracy for performance, and return large amounts of redundant and low-quality results. This paper focuses on recently developed Vision-Language Models (VLMs), which allow users to query images using natural language such as "cars during daytime at traffic intersections." Through an in-depth analysis, we show that VLMs address three limitations of current video analytics systems: they offer general expressivity, they serve many predicates with a single general-purpose model, and they are both simple and fast. However, VLMs still return large numbers of redundant and low-quality results, which can overwhelm and burden users. We present Zelda: a video analytics system that uses VLMs to return both relevant and semantically diverse results for top-K queries on large video datasets. Zelda prompts the VLM with the user's query in natural language and additional terms to improve accuracy and identify low-quality frames. Zelda improves result diversity by leveraging the rich semantic information encoded in VLM embeddings. We evaluate Zelda across five datasets and 19 queries and quantitatively show it achieves higher mean average precision (up to 1.15×) and improves average pairwise similarity (up to 1.16×) compared to using VLMs out-of-the-box. We also compare Zelda to a state-of-the-art video analytics engine and show that Zelda retrieves results 7.5× (up to 10.4×) faster for the same accuracy and frame diversity.
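The abstract does not spell out how Zelda turns embeddings into diverse top-K results, so the following is only an illustrative sketch of the general idea, not the paper's algorithm. It assumes each frame and the text query have already been embedded by some VLM (here replaced by hypothetical two-dimensional toy vectors) and uses a greedy, maximal-marginal-relevance style rule as a stand-in technique: each pick balances similarity to the query against similarity to frames already selected. The function names, the `trade_off` parameter, and the toy data are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def diverse_top_k(query_emb, frame_embs, k, trade_off=0.5):
    """Greedily pick up to k frame indices (hypothetical helper, not Zelda's method).

    Each step scores a candidate as
        trade_off * relevance - (1 - trade_off) * redundancy
    where relevance is similarity to the query and redundancy is the
    highest similarity to any frame that has already been selected.
    """
    relevance = [cosine(query_emb, f) for f in frame_embs]
    selected = []
    candidates = set(range(len(frame_embs)))

    def score(i):
        redundancy = max(
            (cosine(frame_embs[i], frame_embs[j]) for j in selected),
            default=0.0,
        )
        return trade_off * relevance[i] - (1 - trade_off) * redundancy

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def average_pairwise_similarity(frame_embs, indices):
    """Mean cosine similarity over all pairs of the chosen frames."""
    pairs = list(combinations(indices, 2))
    if not pairs:
        return 0.0
    return sum(cosine(frame_embs[i], frame_embs[j]) for i, j in pairs) / len(pairs)


# Toy stand-ins for VLM embeddings: frames 0 and 1 are near-duplicates.
query = [1.0, 0.0]
frames = [[1.0, 0.5], [1.0, 0.52], [1.0, -0.6], [0.0, 1.0]]

picked = diverse_top_k(query, frames, k=2)
print(picked, average_pairwise_similarity(frames, picked))
```

On this toy data, ranking by relevance alone would return the two near-duplicate frames, whereas the greedy rule skips the duplicate in favour of a relevant but visually different frame. Under this sketch, a lower average pairwise similarity among the returned frames indicates less redundancy.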

Author and article information

Publication date (created): 05 May 2023

Article

arXiv ID: 2305.03785

SO-VID: 874e1277-53ab-4f12-b224-6faed9957c45

License: http://creativecommons.org/licenses/by/4.0/

Custom metadata

Categories cs.DB

ScienceOpen disciplines: Databases
