Beyond Deployment: Exploring Machine Learning Inference Architectures and Patterns
Tim Elfrink
This talk is about setting up robust and scalable machine learning systems for high-throughput, real-time predictions serving large numbers of users. It is aimed at ML engineers and data practitioners who want to learn more about MLOps, with a focus on cloud-based platforms. The talk covers the main ways to serve predictions: real-time, asynchronous, and batch processing. It discusses the advantages and disadvantages of each pattern and highlights the importance of choosing the right one for a specific use case, including generative large language models.
I will use examples from StepStone's production systems to show how to build systems that scale to thousands of simultaneous requests while delivering robust, low-latency predictions.
I will cover the technical details, how to manage operations efficiently, and real-life examples in a way that is easy to understand and informative. You will learn about different ML serving setups and how to make them work, which will help you make your ML inference faster, more cost-efficient, and more reliable.
Tim Elfrink
Affiliation: The Stepstone Group
Tim is a Staff Machine Learning Engineer at Stepstone, where he works on deploying various machine learning projects.