“Plagiarism comes in different flavours,” said Dongwon Lee, professor of information sciences at Penn State University. “We wanted to see if language models not only copy and paste but resort to more sophisticated forms of plagiarism without realising it.”

The researchers identified three forms of plagiarism: verbatim (directly copying and pasting content), paraphrase (rewording and restructuring content without citing the original source) and idea (using the main idea of a text without proper attribution).

They constructed a pipeline for automated plagiarism detection and tested it on OpenAI’s GPT-2. They chose GPT-2 because its training data is available online, which allowed them to compare generated texts against the eight million documents used to pre-train the model.
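To make the simplest of these checks concrete, the sketch below flags possible verbatim overlap by looking for word sequences that a generated text shares with documents in a training corpus. It is an illustration only and not the researchers' pipeline: the function names `word_ngrams` and `verbatim_overlap`, the five-word window and the toy corpus are all assumptions made for this example, and it does nothing to catch paraphrase or idea plagiarism.

```python
import re


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in a text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(generated, corpus, n=5):
    """Count n-grams shared between a generated text and each corpus document.

    `corpus` maps a document id to its text. The window size n=5 is an
    arbitrary choice for illustration, not a value taken from the study.
    """
    generated_ngrams = word_ngrams(generated, n)
    hits = {}
    for doc_id, doc_text in corpus.items():
        shared = generated_ngrams & word_ngrams(doc_text, n)
        if shared:
            hits[doc_id] = len(shared)
    return hits


# Hypothetical toy data standing in for training documents.
corpus = {
    "doc_a": "The quick brown fox jumps over the lazy dog",
    "doc_b": "Completely unrelated sentence about weather patterns today",
}
generated = "Yesterday the quick brown fox jumps over a fence"
print(verbatim_overlap(generated, corpus))
```

A real comparison against millions of documents would need an index rather than a loop over every document, and a shared word sequence is only a signal for closer inspection, not proof of copying.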

The scientists used 210,000 generated texts to test for plagiarism in pre-trained language models and fine-tuned three language models to focus on scientific documents, scholarly...