
Introduction to Automated Machine Learning for Data Science, Modeling, and Benchmarking

Description

Automated Machine Learning (AutoML) has emerged over the last decade as a subfield of artificial intelligence (AI) and machine learning (ML), focused on the automation of machine learning modeling and other key elements of a data science analysis pipeline in order to “relax the need for a user in the loop”. The primary goal of AutoML methods and software packages has been to make the application of ML easier, more accessible (to those with and without programming or ML experience), and more capable of optimizing ML model performance across a wide variety of ML algorithms, hyperparameters, and data processing options.

Notably, a number of available AutoML tools utilize evolutionary optimization strategies to drive search, and/or include evolutionary machine learning approaches in their repertoire of available ML modeling algorithms. Thus, apart from making ML modeling and data analytics applications easier, these frameworks can (and arguably should) be leveraged to conduct fairer, better standardized/reproducible, and more rigorous algorithm performance comparisons and benchmarking.

This tutorial will begin by broadly introducing participants to AutoML tools, discussing their scope, capabilities, and tradeoffs. Next, it will dive more deeply into the function and use of a recently expanded open source AutoML package (STREAMLINE), illustrating how it can be used for data science analytics, machine learning modeling, and algorithmic benchmarking. It will cover how these AutoML frameworks work, what they automate, their unique capabilities, installation and use, and how they can be applied to utilize and benchmark existing or novel evolutionary computation algorithms for feature learning, feature selection, and/or modeling. Lastly, this tutorial will offer a practical demonstration of STREAMLINE AutoML: (1) to model and evaluate real-world data as part of a comprehensive automated data science pipeline, and (2) to easily, fairly, rigorously, and reproducibly compare and benchmark the performance of new ML modeling approaches (evolutionary or otherwise) against other established algorithms. An outline of this tutorial is detailed further below.

1. Provide an overview of the typical elements of a machine learning data science analysis pipeline.
2. Define and introduce AutoML in contrast with traditional approaches to data science and ML.
3. Briefly review 20+ currently available AutoML tools and packages, contrasting their scope and capabilities.
4. Take a closer look at the newly expanded STREAMLINE AutoML package, including its capabilities and how to use it.
5. Walk through an example of applying STREAMLINE to a real-world analysis of biomedical data with the goal of optimizing ML model predictive performance, conducting model interpretation/explanation, and evaluating the reproducibility of model performance on new replication data.
6. Walk through an example of adding a new scikit-learn compatible algorithm (for feature learning, feature selection, or modeling) to the STREAMLINE algorithm repertoire, and using the AutoML framework to benchmark and compare its performance to other established ML algorithms across a diversity of benchmark datasets, in a rigorous and reproducible manner.
7. Provide a hands-on demo for participants to try out STREAMLINE for themselves on their laptops or smartphones.
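As background for item 6, the sketch below shows what a minimal scikit-learn compatible classifier looks like. The class name and its trivial majority-class logic are purely illustrative (not part of STREAMLINE itself); only the `fit`/`predict` interface follows the actual scikit-learn estimator contract, which is the compatibility requirement for plugging a new algorithm into frameworks like STREAMLINE.

```python
# Illustrative sketch: a minimal scikit-learn compatible classifier.
# The class and its logic are hypothetical; the fit/predict interface
# is the part that matters for AutoML framework compatibility.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array


class MajorityClassClassifier(BaseEstimator, ClassifierMixin):
    """Trivial baseline: always predicts the most frequent training class."""

    def fit(self, X, y):
        # Validate inputs and record the majority class seen in training.
        X, y = check_X_y(X, y)
        self.classes_, counts = np.unique(y, return_counts=True)
        self.majority_ = self.classes_[np.argmax(counts)]
        return self

    def predict(self, X):
        # Predict the stored majority class for every input row.
        X = check_array(X)
        return np.full(X.shape[0], self.majority_)
```

Because the class inherits from `BaseEstimator`, it also gets `get_params`/`set_params` for free, which is what allows an AutoML pipeline to clone it and tune any hyperparameters it exposes.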


Organizers

Ryan Urbanowicz

Cedars Sinai Medical Center, Los Angeles, California, USA

Dr. Ryan Urbanowicz is an Assistant Professor of Computational Biomedicine at the Cedars Sinai Medical Center, an Adjunct Assistant Professor at the University of Pennsylvania, and Director of the Cedars AI-Campus Training Program. His lab research focuses on the development and application of machine learning, artificial intelligence automation, evolutionary algorithms, data mining, and informatics methodologies. His research group has developed a number of software packages including original machine learning algorithms, automated machine learning tools, and data simulators such as ExSTraCS, STREAMLINE, ReBATE, FIBERS, and GAMETES. Ryan has been an active contributor to GECCO since 2008, presenting several tutorials, serving as a workshop co-chair for 4 years, co-chairing the Evolutionary Machine Learning track, and receiving 3 GECCO best-paper awards. He is also an invested educator, with dozens of educational videos and lectures available on his YouTube channel, and is co-author of the textbook, 'Introduction to Learning Classifier Systems'.