Learning Transferable Visual Models From Natural Language Supervision

Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al.
Year: 2021
Journal: ICML
DOI: 10.48550/arXiv.2103.00020
Publisher: https://arxiv.org/abs/2103.00020

Keywords: clip, vision-language pretraining

Abstract

We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations.
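The caption-matching task can be illustrated with a small, dependency-free sketch. It assumes a batch of already-computed image and caption embeddings, scores every image against every caption by cosine similarity, and applies a symmetric cross-entropy loss so that the i-th image prefers the i-th caption and vice versa. The function names (`l2_normalize`, `cross_entropy`, `matching_loss`) and the `temperature` value are illustrative assumptions; this page does not specify the loss or any hyperparameters, and this is not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def matching_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-caption and caption-to-image cross-entropy.

    image_embs[i] and text_embs[i] are assumed to come from the same pair.
    The temperature is a hypothetical default chosen for illustration.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j]: scaled cosine similarity between image i and caption j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Each image should pick its own caption (rows) ...
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # ... and each caption should pick its own image (columns).
    loss_t2i = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batch: two pairs whose embeddings agree, then the same captions swapped.
aligned = matching_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = matching_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each image sits next to its own caption and becomes large when the captions are swapped, which is the signal a model trained on this task would follow.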

Cite this paper

bibtex
@inproceedings{clip2021,
  title     = {Learning Transferable Visual Models From Natural Language Supervision},
  author    = {Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and others},
  year      = {2021},
  booktitle = {ICML},
  doi       = {10.48550/arXiv.2103.00020},
  url       = {https://doi.org/10.48550/arXiv.2103.00020},
}

Released under the MIT License.