Details MIVPG experiments across single- and multi-image scenarios. The model uses a frozen LLM and Visual Encoder, updating only the MIVPG for efficiency.

Evaluating Visual Adapters: MIVPG Performance on Single and Multi-Image Inputs


Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

4. Experiments

To assess the effectiveness of our proposed approach, we conduct evaluations across three scenarios (a minimal sketch of the corresponding input shapes follows this list):

  1. where each sample comprises a single image, and its patches are naturally considered as instances;

  2. where each sample includes multiple images, but we use a single general embedding for each image;

  3. where each sample contains multiple images, with each image containing multiple patches.
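To make the three settings concrete, the sketch below illustrates how the visual inputs could be batched in each scenario. The tensor names, the 257-token patch count (a ViT-G-style encoder at 224 × 224), and the embedding width of 1408 are illustrative assumptions rather than values taken from the authors' code.

```python
import torch

# Illustrative shapes only; patch count (257) and width (1408) assume a
# ViT-G-style encoder at 224x224 resolution and are not from the paper's code.
B, N, P, D = 4, 5, 257, 1408  # batch, images per sample, patches per image, embed dim

# Scenario 1: one image per sample; its patches are the instances.
single_image = torch.randn(B, P, D)            # (B, P, D)

# Scenario 2: several images per sample, each reduced to one general embedding
# (e.g., a pooled or CLS vector), so the images themselves are the instances.
multi_image_pooled = torch.randn(B, N, D)      # (B, N, D)

# Scenario 3: several images per sample, each keeping all of its patches,
# giving instances (images) that each contain sub-instances (patches).
multi_image_patches = torch.randn(B, N, P, D)  # (B, N, P, D)

# A visual prompt generator such as MIVPG maps any of these bags of
# instances to a fixed number of prompt tokens for the frozen LLM.
print(single_image.shape, multi_image_pooled.shape, multi_image_patches.shape)
```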

4.1. General Setup

We initialize our model using BLIP2 [22] with FLAN-T5-XL, and MIVPG is initialized with weights from the QFormer. The model consists of a frozen language model and a frozen visual encoder; during training, we update only the MIVPG. The visual encoder, ViT-G, encodes each image into patch embeddings, with images resized to 224 × 224. In our experiments, we observed that unfreezing the visual encoder does not yield additional improvements on small datasets. Further details can be found in Supplementary Material C.1.
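As a rough illustration of this training setup, the sketch below freezes a language model and a vision encoder and leaves only an adapter module trainable. The class attributes (`vit`, `llm`, `mivpg`) are hypothetical placeholders, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Disable gradients for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = False

def prepare_for_training(model: nn.Module) -> list:
    """Freeze the vision encoder and the LLM; train only the MIVPG adapter.

    Assumes the model exposes hypothetical attributes `vit`, `llm`, and
    `mivpg`; only the adapter's trainable parameters are returned.
    """
    freeze(model.vit)   # frozen ViT-G visual encoder
    freeze(model.llm)   # frozen FLAN-T5-XL language model
    model.vit.eval()
    model.llm.eval()
    return [p for p in model.mivpg.parameters() if p.requires_grad]

# Usage with a hypothetical model object:
# trainable_params = prepare_for_training(model)
# optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
```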


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

