Red Teaming Language Models with Language Models


Authors: Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, et al.
Year: 2022
Journal: arXiv
DOI: 10.48550/arXiv.2202.03286
Publisher: https://arxiv.org/abs/2202.03286

Keywords: red-teaming, alignment

Abstract

We automatically discover cases where a language model is not safe to deploy.

Cite this paper


@misc{redllm2022,
  title  = {Red Teaming Language Models with Language Models},
  author = {Deep Ganguli and Liane Lovitt and Jackson Kernion and Amanda Askell and Yuntao Bai and others},
  year   = {2022},
  journal = {arXiv},
  doi    = {10.48550/arXiv.2202.03286},
  url    = {https://doi.org/10.48550/arXiv.2202.03286},
}

Source files