Your Model _Probably_ Memorized the Training Data
Katharine Jarmul
You may not want to hear it, but your deep learning model has probably memorized some of its training data. In this talk, we'll review active research on deep learning and memorization, particularly for large models such as large language models and multi-modal models.
We'll also explore when this memorization is actually desirable (and why), as well as the threat vectors and legal risks of using models that have memorized training data. Finally, we'll look at privacy protections that could address some of these issues, and at how to embrace memorization deliberately by thinking through different types of models and their uses.
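To make the core idea concrete: memorization means a model can reproduce training examples verbatim when prompted with a matching prefix. The following is a deliberately tiny sketch (not from the talk, and far simpler than any neural model) using a character-level trigram model; large neural networks can show an analogous effect on rare or unique training strings.

```python
from collections import defaultdict

def train_trigram(text):
    """Map each two-character context to the characters that followed it in training."""
    model = defaultdict(list)
    for i in range(len(text) - 2):
        model[text[i:i + 2]].append(text[i + 2])
    return model

def generate(model, prefix, length):
    """Deterministically extend the prefix with the first next-character seen in training."""
    out = prefix
    for _ in range(length):
        candidates = model.get(out[-2:])
        if not candidates:
            break
        out += candidates[0]
    return out

# A toy "training set" containing a secret alongside an innocuous sentence.
corpus = "my password is hunter2. the weather is nice."
model = train_trigram(corpus)

# Prompting with a prefix of the secret regurgitates it verbatim:
print(generate(model, "my pass", 16))  # → "my password is hunter2."
```

The same intuition underlies training-data extraction attacks on large language models: a well-chosen prefix can coax the model into completing a memorized sequence it was never meant to reveal.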
Katharine Jarmul
Katharine Jarmul is a privacy activist and data scientist whose work and research focus on privacy and security in data science workflows. She is a Principal Data Scientist at Thoughtworks and the author of _Practical Data Privacy_. She is a passionate and internationally recognized data scientist, programmer, and lecturer.