benty-fields - Search paper

Visual appearance is considered the most important cue for understanding images in cross-modal retrieval, but scene text appearing in images can sometimes provide valuable information about the visual semantics. Most existing cross-modal retrieval approaches ignore scene text information, and adding it directly may degrade performance in scene-text-free scenarios. To address this issue, we propose a full transformer architecture that unifies these cross-modal retrieval scenarios in a single $\textbf{Vi}$sion and $\textbf{S}$cene $\textbf{T}$ext $\textbf{A}$ggregation framework (ViSTA). Specifically, ViSTA uses transformer blocks to directly encode image patches and fuse scene text embeddings, learning an aggregated visual representation for cross-modal retrieval. To tackle the missing-modality problem for scene text, we propose a novel fusion-token-based transformer aggregation approach that exchanges the necessary scene text information only through the fusion token and concentrates on the most important features in each modality. To further strengthen the visual modality, we develop dual contrastive learning losses that embed both image-text pairs and fusion-text pairs into a common cross-modal space. Compared to existing methods, ViSTA can aggregate relevant scene text semantics with visual appearance, and hence improves results in both scene-text-free and scene-text-aware scenarios. Experimental results show that ViSTA outperforms other methods by at least $\bf{8.4}\%$ at Recall@1 on the scene-text-aware retrieval task. Compared with state-of-the-art scene-text-free retrieval methods, ViSTA achieves better accuracy on Flickr30K and MSCOCO while running at least three times faster at inference, which validates the effectiveness of the proposed framework.
Authors' comments: Accepted by CVPR 2022
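
The abstract mentions dual contrastive losses over image-text and fusion-text pairs, and reports results at Recall@1, without giving formulas. The following is a minimal sketch of what such a loss and metric could look like, not the authors' implementation. It assumes a symmetric InfoNCE form, equal weighting of the two terms, cosine similarity, and a temperature of 0.07; the function names are hypothetical, and the inputs are plain lists of non-zero embedding vectors where row i of each list belongs to the same sample.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # L2-normalize so that dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    q = [normalize(x) for x in queries]
    k = [normalize(x) for x in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(q)


def symmetric_contrastive(a, b, temperature=0.07):
    # Average of the a-to-b and b-to-a directions.
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def dual_contrastive_loss(image_emb, fusion_emb, text_emb, temperature=0.07):
    # Assumption: the image-text and fusion-text terms are simply summed.
    return (symmetric_contrastive(image_emb, text_emb, temperature)
            + symmetric_contrastive(fusion_emb, text_emb, temperature))


def recall_at_1(query_emb, gallery_emb):
    """Fraction of queries whose most similar gallery item shares their index."""
    gallery = [normalize(g) for g in gallery_emb]
    hits = 0
    for i, q in enumerate(query_emb):
        qn = normalize(q)
        scores = [dot(qn, g) for g in gallery]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(query_emb)
```

With matched embeddings the loss is close to zero and Recall@1 is 1.0; swapping the text rows raises the loss and drops Recall@1 to 0.0, which is the behavior a contrastive objective is meant to discourage.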

6499. AxIoU: An Axiomatically Justified Measure for Video Moment Retrieval

Riku Togashi, Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkila, Tetsuya Sakai

arXiv:2203.16062v1

6500. On Metric Learning for Audio-Text Cross-Modal Retrieval

Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

arXiv:2203.15537v3

Benty-search

6481. Composite Code Sparse Autoencoders for first stage retrieval

arXiv:2204.07023v1

6482. Reuse your features: unifying retrieval and feature-metric alignment

arXiv:2204.06292v2

6483. Retrieval of Scientific and Technological Resources for Experts and Scholars

arXiv:2204.06142v1

6484. Research on Cross-media Science and Technology Information Data Retrieval

arXiv:2204.04887v2

6485. Deep Conditional Representation Learning for Drum Sample Retrieval by Vocalisation

arXiv:2204.04651v1

6486. Nuclear phase retrieval spectroscopy using resonant x-ray scattering

arXiv:2204.06096v1

6487. Phase Retrieval: From Computational Imaging to Machine Learning

arXiv:2204.03554v2

6488. OSCARS: An Outlier-Sensitive Content-Based Radiography Retrieval System

arXiv:2204.03074v1

6489. Retrieval-based Spatially Adaptive Normalization for Semantic Image Synthesis

arXiv:2204.02854v1

6490. ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound

arXiv:2204.02874v3

6491. KNN-Diffusion: Image Generation via Large-Scale Retrieval

arXiv:2204.02849v2

6492. Towards Best Practices for Training Multilingual Dense Retrieval Models

arXiv:2204.02363v1

6493. Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval

arXiv:2204.02292v2

6494. Retrieval Study of Brown Dwarfs Across the L-T Sequence

arXiv:2204.01330v1

6495. Learning the Proximity Operator in Unfolded ADMM for Phase Retrieval

arXiv:2204.01360v1

6496. Implicit Feedback for Dense Passage Retrieval: A Counterfactual Approach

arXiv:2204.00718v2

6497. End-to-End Table Question Answering via Retrieval-Augmented Generation

arXiv:2203.16714v1

6498. ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval

arXiv:2203.16778v1

6499. AxIoU: An Axiomatically Justified Measure for Video Moment Retrieval

arXiv:2203.16062v1

6500. On Metric Learning for Audio-Text Cross-Modal Retrieval

arXiv:2203.15537v3
