benty-fields - Search paper

Research on text-to-image generation has witnessed significant progress in generating diverse and photo-realistic images, driven by diffusion and auto-regressive models trained on large-scale image-text data. Although state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities, such as 'Chortai (dog)' or 'Picarones (food)'. To tackle this issue, we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. With this retrieval step, Re-Imagen is augmented with knowledge of the high-level semantics and low-level visual details of the mentioned entities, and thus improves its accuracy in generating the entities' visual appearances. We train Re-Imagen on a constructed dataset containing (image, text, retrieval) triples to teach the model to ground its output on both the text prompt and the retrieval. Furthermore, we develop a new sampling strategy that interleaves classifier-free guidance for the text and retrieval conditions to balance text and retrieval alignment. Re-Imagen achieves a significant gain in FID score on COCO and WikiImage. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple object categories including dogs, foods, landmarks, birds, and characters. Human evaluation on EntityDrawBench shows that Re-Imagen can significantly improve the fidelity of generated images, especially on less frequent entities.
Authors' comments: 9 pages

6353. Social Search: retrieving information in Online Social Platforms -- A Survey

Maddalena Amendola, Andrea Passarella, Raffaele Perego

arXiv:2209.14369v3

6354. Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Chengzhi Lin, Ancong Wu, Junwei Liang, Jun Zhang, Wenhang Ge, Wei-Shi Zheng, Chunhua Shen

arXiv:2209.13307v1

6355. Mr. Right: Multimodal Retrieval on Representation of ImaGe witH Text

Cheng-An Hsieh, Cheng-Ping Hsieh, Pu-Jen Cheng

arXiv:2209.13764v1

6356. Information-Theoretic Hashing for Zero-Shot Cross-Modal Retrieval

Yufeng Shi, Shujian Yu, Duanquan Xu, Xinge You

arXiv:2209.12491v1

6357. Promptagator: Few-shot Dense Retrieval From 8 Examples

Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu et al.

arXiv:2209.11755v1

6358. Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval

Xiang Fang, Daizong Liu, Pan Zhou, Yuchong Hu

arXiv:2209.11572v2

As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims to localize the target moment in an untrimmed video according to a given language query. Most previous methods depend heavily on numerous manual annotations (i.e., moment boundaries), which are extremely expensive to acquire in practice. In addition, due to the domain gap between different datasets, directly applying these pre-trained models to an unseen domain leads to a significant performance drop. In this paper, we focus on a novel task: cross-domain VMR, where fully annotated datasets are available in one domain ("source domain"), but the domain of interest ("target domain") contains only unannotated datasets. As far as we know, we present the first study on cross-domain VMR. To address this new task, we propose a novel Multi-Modal Cross-Domain Alignment (MMCDA) network to transfer the annotation knowledge from the source domain to the target domain. However, due to the domain discrepancy between the source and target domains and the semantic gap between videos and queries, directly applying trained models to the target domain generally leads to a performance drop. To solve this problem, we develop three novel modules: (i) a domain alignment module, designed to align the feature distributions between different domains of each modality; (ii) a cross-modal alignment module, which aims to map both video and query features into a joint embedding space and to align the feature distributions between different modalities in the target domain; (iii) a specific alignment module, which tries to obtain the fine-grained similarity between a specific frame and the given query for optimal localization. By jointly training these three modules, our MMCDA can learn domain-invariant and semantic-aligned cross-modal representations.
Authors' comments: Accepted by IEEE Transactions on Multimedia

6359. Language-based Audio Retrieval Task in DCASE 2022 Challenge

Huang Xie, Samuel Lipping, Tuomas Virtanen

arXiv:2209.09967v3

6360. Towards 3D VR-Sketch to 3D Shape Retrieval

Ling Luo, Yulia Gryaditskaya, Yongxin Yang, Tao Xiang, Yi-Zhe Song

2020 International Conference on 3D Vision (3DV), pp. 81-90. IEEE, 2020

arXiv:2209.10020v2


6341. Retrieval Augmented Visual Question Answering with Outside Knowledge

arXiv:2210.03809v2

6342. Ab Initio Spatial Phase Retrieval via Intensity Triple Correlations

arXiv:2210.03793v3

6343. Granularity-aware Adaptation for Image Retrieval over Multiple Tasks

arXiv:2210.02254v1

6344. A Framework for Web Services Retrieval Using Bio Inspired Clustering

arXiv:2210.01761v1

6345. SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation

arXiv:2209.15323v2

6346. Zero-Shot Retrieval with Search Agents and Hybrid Environments

arXiv:2209.15469v2

6347. REST: REtrieve & Self-Train for generative action recognition

arXiv:2209.15000v1

6348. Learning Deep Representations via Contrastive Learning for Instance Retrieval

arXiv:2209.13832v1

6349. FiD-Light: Efficient and Effective Retrieval-Augmented Text Generation

arXiv:2209.14290v1

6350. Multi-stage Information Retrieval for Vietnamese Legal Texts

arXiv:2209.14494v2

6351. Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval

arXiv:2209.13869v2

6352. Re-Imagen: Retrieval-Augmented Text-to-Image Generator

arXiv:2209.14491v3

