Learning Transferable Visual Models From Natural Language Supervision
Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al.
Year: 2021
Journal: ICML
DOI: 10.48550/arXiv.2103.00020
Publisher: https://arxiv.org/abs/2103.00020
Keywords: clip, vision-language pretraining
Abstract
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations.
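The caption-matching task described above can be illustrated with a small sketch. This is not the authors' code: it assumes the image and text encoders have already produced embedding vectors, scores every image against every caption in a batch by cosine similarity, and applies a symmetric cross-entropy loss in which the i-th image belongs with the i-th caption. The function names and the fixed `temperature=0.07` are illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs, text_embs):
    """Entry [i][j] is the cosine similarity between image i and caption j."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_caption_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: match images to captions and captions to images.

    The temperature value is an assumption made for this sketch.
    """
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def predict_captions(image_embs, text_embs):
    """For each image, return the index of the most similar caption."""
    sims = similarity_matrix(image_embs, text_embs)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]
```

Calling `contrastive_caption_loss` on a batch whose captions are aligned with their images gives a loss near zero, while shuffling the captions raises it; `predict_captions` reuses the same similarity scores to pick a caption for each image.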
Cite this paper
@misc{clip2020,
title = {Learning Transferable Visual Models From Natural Language Supervision},
author = {Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and others},
year = {2021},
journal = {ICML},
doi = {10.48550/arXiv.2103.00020},
url = {https://doi.org/10.48550/arXiv.2103.00020},
}