Multimodal Retrieval With Contrastive Pretraining

Date

2021

Authors

Alsan, H.F.
Yildiz, E.
Safdil, E.B.
Arslan, F.
Arsan, T.

Publisher

Institute of Electrical and Electronics Engineers Inc.

Abstract

In this paper, we present multimodal data retrieval aided by contrastive pretraining. Our approach is to pretrain a contrastive network to assist in multimodal retrieval tasks. We work with multimodal data consisting of image and caption (text) pairs. We present a dual-encoder deep neural network in which an image encoder and a text encoder map multimodal data (images and text) to representation vectors, which are then used for similarity-based retrieval. The image encoder is a 2D convolutional network, and the text encoder is a recurrent neural network (Long Short-Term Memory). The MS-COCO 2014 dataset contains both images and captions, and it is used for multimodal training with a triplet loss. Before the dual-encoder training, we used a convolutional Siamese network to compute similarities between images (contrastive pretraining). The advantage is that Siamese networks can aid retrieval, and we seek to show whether they can be used in practice. Finally, we investigated the performance of Siamese-assisted retrieval with the BLEU score metric. We conclude that Siamese networks can help with image-to-text retrieval tasks. © 2021 IEEE.
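To make the dual-encoder setup concrete, below is a minimal PyTorch sketch of a CNN image encoder and an LSTM text encoder trained with a triplet loss, as the abstract describes. All layer sizes, the vocabulary size, the margin, and the toy batch are illustrative assumptions, not the paper's actual configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    # 2D convolutional image encoder (small stand-in for the paper's CNN).
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, images):                    # images: (B, 3, H, W)
        h = self.conv(images).flatten(1)          # (B, 64)
        return F.normalize(self.fc(h), dim=-1)    # unit-norm representation

class TextEncoder(nn.Module):
    # LSTM caption encoder; the final hidden state becomes the caption vector.
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, embed_dim)

    def forward(self, tokens):                    # tokens: (B, T) word ids
        _, (h, _) = self.lstm(self.embed(tokens))
        return F.normalize(self.fc(h[-1]), dim=-1)

img_enc, txt_enc = ImageEncoder(), TextEncoder()
triplet = nn.TripletMarginLoss(margin=0.2)

images = torch.randn(8, 3, 64, 64)                # toy image batch
pos = torch.randint(0, 10000, (8, 12))            # matching caption ids
neg = torch.randint(0, 10000, (8, 12))            # mismatched caption ids

loss = triplet(img_enc(images), txt_enc(pos), txt_enc(neg))
loss.backward()

# Retrieval: rank caption vectors by similarity to a query image vector.
with torch.no_grad():
    scores = img_enc(images[:1]) @ txt_enc(pos).T    # (1, 8) cosine scores
    best_caption = scores.argmax(dim=-1)

Because both encoders emit unit-norm vectors in a shared space, retrieval reduces to a nearest-neighbor search over the caption vectors.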
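The abstract also mentions a convolutional Siamese network that computes image-image similarities before the dual-encoder training. The sketch below shows one common way to pretrain such a network with a pairwise contrastive loss; the trunk architecture, margin, and pair labels are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    # One convolutional trunk shared by both images of a pair.
    def __init__(self, embed_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, a, b):
        return self.trunk(a), self.trunk(b)

def contrastive_loss(za, zb, label, margin=1.0):
    # Pull similar pairs (label=1) together; push dissimilar pairs
    # (label=0) at least `margin` apart.
    d = F.pairwise_distance(za, zb)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()

net = SiameseNet()
a = torch.randn(8, 3, 64, 64)                    # first images of pairs
b = torch.randn(8, 3, 64, 64)                    # second images of pairs
label = torch.randint(0, 2, (8,)).float()        # 1 = similar image pair
loss = contrastive_loss(*net(a, b), label)
loss.backward()

The pretrained trunk's similarity scores can then assist the retrieval stage, which the paper evaluates with the BLEU metric on retrieved captions.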

Description

Kocaeli University; Kocaeli University Technopark
2021 International Conference on INnovations in Intelligent SysTems and Applications (INISTA 2021), 25 August 2021 through 27 August 2021 -- 172175

Keywords

Convolutional Networks, Deep Learning, Long Short-Term Memory (LSTM), Multimodal Data, Pretraining, Siamese Networks, Triplet Loss, Brain, Computer Vision, Convolution, Convolutional Neural Networks, Deep Neural Networks, Network Coding, Data Retrieval, Image Texts

Citation

1

Source

2021 International Conference on INnovations in Intelligent SysTems and Applications, INISTA 2021 - Proceedings
