
      SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation

      Preprint

          Abstract

          Large vision-language models show strong performance in tasks like image captioning, visual question answering, and retrieval. However, challenges remain in integrating speech, text, and vision into a unified model, especially for spoken tasks. Speech generation methods vary: some produce speech directly, while others generate it through intermediate text, and the impact of this choice on output quality remains unclear. Evaluation often relies on automatic speech recognition, which may introduce bias. We propose SVLA, a unified speech-vision-language model based on a transformer architecture that handles multimodal inputs and outputs. We train it on 38.2 million speech-text-image examples, including 64.1 hours of synthetic speech. We also introduce Speech VQA Accuracy, a new metric for evaluating spoken responses. SVLA improves multimodal understanding and generation by better combining speech, vision, and language.
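
          The abstract does not define how Speech VQA Accuracy is computed. A plausible reading, sketched below in Python, is that the model's spoken answer is first transcribed with an automatic speech recognizer and the transcript is then scored with exact-match, VQA-style accuracy against the reference answers. The function names, normalization rules, and matching scheme here are illustrative assumptions, not the paper's definition.

              import re
              import unicodedata

              def normalize(text: str) -> str:
                  """Lowercase, strip accents and punctuation, collapse whitespace.
                  These normalization rules are an assumption, not the paper's."""
                  text = unicodedata.normalize("NFKD", text)
                  text = "".join(c for c in text if not unicodedata.combining(c))
                  text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
                  return " ".join(text.split())

              def speech_vqa_accuracy(transcripts, references) -> float:
                  """Score ASR transcripts of spoken answers against reference answers.

                  transcripts: one ASR transcript per question (the ASR step itself,
                               e.g. with an off-the-shelf recognizer, happens upstream).
                  references:  list of acceptable answers for each question.
                  Returns the fraction of questions whose transcript exactly matches
                  (after normalization) any acceptable answer.
                  """
                  assert len(transcripts) == len(references)
                  hits = 0
                  for hyp, refs in zip(transcripts, references):
                      if normalize(hyp) in {normalize(r) for r in refs}:
                          hits += 1
                  return hits / len(transcripts) if transcripts else 0.0

              # Hypothetical usage: two spoken answers, already transcribed by ASR.
              print(speech_vqa_accuracy(
                  ["a red fire hydrant.", "two"],
                  [["red fire hydrant", "a red fire hydrant"], ["3", "three"]],
              ))  # -> 0.5

          Scoring the transcript rather than the raw audio keeps the metric simple, but it also means ASR errors count against the model, which is presumably the bias the abstract cautions about.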

          Author and article information

          Published: 31 March 2025
          arXiv ID: 2503.24164
          License: http://creativecommons.org/licenses/by/4.0/ (CC BY 4.0, open access)
          Custom metadata: 21 pages
          Subject: cs.MM
          Discipline: Graphics & Multimedia design
