Towards Zero-Shot Annotation of the Built Environment with Vision-Language Models (Vision Paper)

Abstract

Equitable urban transportation applications require high-fidelity digital representations of the built environment: not just streets and sidewalks, but bike lanes, marked and unmarked crossings, curb ramps and cuts, obstructions, traffic signals, signage, street markings, potholes, and more. Direct inspections and manual annotations are prohibitively expensive at scale, and conventional machine learning methods require substantial annotated training data to perform adequately. In this paper, we consider vision-language models as a mechanism for annotating diverse urban features from satellite images, reducing the dependence on human annotation to produce large training sets. While these models have achieved impressive results in describing common objects in images captured from a human perspective, their training sets are less likely to include strong signals for esoteric features in the built environment, so their performance in these settings is unclear. We demonstrate a proof of concept that combines a state-of-the-art vision-language model with variants of a prompting strategy that asks the model to consider segmented elements independently of the original image. Experiments on two urban features, stop lines and raised tables, show that while direct zero-shot prompting correctly annotates almost no images, the pre-segmentation strategies can annotate images with nearly 40% intersection-over-union accuracy. We describe how these results inform a new research agenda in automatic annotation of the built environment to improve equity, accessibility, and safety at broad scale and in diverse environments.
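For readers unfamiliar with the intersection-over-union metric cited above, the following is a minimal sketch of how it is commonly computed for a pair of binary masks. It is not the authors' evaluation code: the function name `mask_iou`, the nested-list mask representation, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (an illustrative representation).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Return |pred AND truth| / |pred OR truth| for two same-shaped masks."""
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1  # pixel marked in both masks
            if p or t:
                union += 1  # pixel marked in at least one mask

    if union == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return intersection / union


# Hypothetical 3x4 masks: 4 predicted pixels, 4 true pixels, 2 shared.
pred = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
truth = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]
print(f"{mask_iou(pred, truth):.3f}")  # prints 0.333 (2 shared / 6 in union)
```

Under this definition, a score near 0.4 means that the overlap between the predicted and reference regions covers roughly 40% of their combined area.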

Author and article information

Publication date (created): 01 August 2024

Article

arXiv ID: 2408.00932

SO-VID: f6ae3ab3-537f-42bf-9fde-9768aff197e9

License:

http://creativecommons.org/licenses/by/4.0/

Custom metadata

Categories cs.CV cs.CL

ScienceOpen disciplines: Computer vision & Pattern recognition, Theoretical computer science
