A recent Cornell University study has drawn attention to the ability of AI chatbots such as ChatGPT to memorize and reproduce poems, including poems under copyright. The research examines the ethical and copyright concerns surrounding the data used to train AI, a discussion made more urgent by recent legal actions, including the New York Times lawsuit, and by controversies at Midjourney.
David Mimno, a study author and associate professor of information science, said poems were chosen because they fit within language models' context size. The study, which covered models including ChatGPT, Google AI's PaLM, EleutherAI's Pythia, and OpenAI's GPT-2, tested how well each model could recall and reproduce poems when given specific prompts requesting them.
ChatGPT outperformed the other models, successfully retrieving 72 of the 240 poems. Notably, the study found that a poem's inclusion in the Norton Anthology of Poetry, particularly the 1983 edition, was a reliable indicator that it had been memorized and could be reproduced verbatim.
The researchers, led by Lyra D'Souza, expressed concerns about the privacy and copyright implications of large language models memorizing extensive texts. While the study focused primarily on American poetry, the researchers aim to expand its scope to other languages and to assess how specific poetic features influence memorization.
The study, titled "The Chatbot and the Canon: Poetry Memorization in LLMs," was presented at the Computational Humanities Research Conference and followed a structured methodology. First, the researchers compiled a dataset of 240 poems by 60 American poets, spanning diverse time periods, ethnicities, genders, and levels of fame. Next, they crafted specific prompts to request poems from the AI models, varying the information supplied from titles to authors and starting lines. The responses from the models, including ChatGPT, were then analyzed for accuracy in reproducing the requested poems. Finally, the study scrutinized factors affecting a model's ability to memorize poems, including a poem's presence in well-known anthologies, the poet's race and gender, and the length of the poet's Wikipedia page.
The study's findings not only showcased the capabilities of AI in processing poetry but also raised concerns about the potential reinforcement of existing literary biases. As AI models become more integral to how information is represented, the study questions whether they can fairly represent diverse works, highlighting the challenge of ensuring fair and unbiased representation within AI training data.
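For readers curious how the accuracy analysis in the methodology might work in practice, the following is a minimal sketch, not the researchers' actual method, which this article does not describe. It assumes memorization can be approximated by word-level overlap between a poem and a model's response, measured with Python's standard difflib module. The function names and the 0.9 threshold are hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation and line breaks."""
    return re.findall(r"[a-z0-9']+", text.lower())


def overlap_ratio(reference: str, response: str) -> float:
    """Return the fraction of the reference poem's words that appear, in order, in the response."""
    ref, out = normalize(reference), normalize(response)
    if not ref:
        return 0.0
    # autojunk=False keeps frequent words from being treated as noise in long texts.
    matcher = SequenceMatcher(None, ref, out, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


def is_memorized(reference: str, response: str, threshold: float = 0.9) -> bool:
    """Flag a response as a near-verbatim reproduction.

    The threshold is an illustrative assumption, not a value taken from the study.
    """
    return overlap_ratio(reference, response) >= threshold
```

Comparing normalized words rather than raw characters means differences in punctuation, capitalization, or line breaks do not count against a response, which seems a reasonable reading of "verbatim" for poems, though a stricter or looser measure could be substituted.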