DAT 320 - Building Scalable Machine Learning Pipelines
Course Description
Handle big data without exhausting resources such as memory. Gain practical knowledge of how to scale machine learning solutions with Apache Spark on the Databricks platform.
Solve data-parallel problems with Spark and model-parallel problems with Spark ML. Learn how to deploy Spark models for batch prediction and for real-time prediction through endpoints hosted in Amazon SageMaker.
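To show what this looks like in practice, here is a minimal, hypothetical Spark ML pipeline sketch; the column names and toy data are illustrative and not drawn from the course materials:

```python
# Minimal Spark ML sketch: assemble features and fit a logistic regression.
# All column names and rows below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml-demo").getOrCreate()

# Toy DataFrame with two numeric features and a binary label.
df = spark.createDataFrame(
    [(1.0, 0.5, 1), (0.2, 1.3, 0), (0.9, 0.1, 1), (0.1, 1.1, 0)],
    ["feature_a", "feature_b", "label"],
)

# Spark ML estimators expect a single vector column of features.
assembler = VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Chain the stages into a pipeline, fit it, and score the same toy data.
model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("label", "prediction").show()
```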
This course is delivered in collaboration with WeCloudData.
Course Details
Learning Outcomes
Upon completion of this course, successful students will be able to:
- Implement machine learning models using Spark ML and GraphFrames to solve large-scale ML challenges in the retail and airline industries
- Create and deploy machine learning models in the cloud using tools such as Amazon SageMaker (a minimal deployment sketch follows this list)
- Construct machine learning pipelines using orchestration tools such as SageMaker Pipelines
- Evaluate three popular use cases of scalable machine learning pipelines in real-time advertising, retail, and recommender systems
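To make the deployment outcome above concrete, the sketch below shows one way a trained model might be served as a real-time SageMaker endpoint; the S3 path, IAM role, and inference script are hypothetical placeholders, not course materials:

```python
# Hypothetical sketch: serve a trained scikit-learn model behind a real-time
# SageMaker endpoint. Artifact path, role ARN, and script name are placeholders.
import sagemaker
from sagemaker.sklearn import SKLearnModel

session = sagemaker.Session()

model = SKLearnModel(
    model_data="s3://my-bucket/model.tar.gz",             # placeholder artifact
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    entry_point="inference.py",                           # placeholder script
    framework_version="1.2-1",
    sagemaker_session=session,
)

# Provision a real-time HTTPS endpoint behind the model.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# Invoke the endpoint with one toy record, then delete it to stop billing.
print(predictor.predict([[0.1, 0.9]]))
predictor.delete_endpoint()
```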
Topics
- Mining Massive Data with Spark ML
- Docker Containers 101
- Building Data Pipelines with Airflow (a minimal DAG sketch follows this list)
- ML Model Deployment with Amazon SageMaker
- Use Case: Building Recommender Systems
- Use Case: Sentiment Analysis at Scale
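To illustrate the Airflow topic above, a minimal, hypothetical DAG might chain a feature-extraction task to a training task; both task bodies below are placeholders, not course code:

```python
# Hypothetical two-task Airflow DAG: extract features, then train a model.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    print("extracting features")  # placeholder for real Spark/ETL logic

def train_model():
    print("training model")  # placeholder for e.g. a SageMaker training job

with DAG(
    dag_id="ml_pipeline_demo",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    extract >> train  # training runs only after extraction succeeds
```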
Who is this course for?
This course is designed for:
- Data and business analytics professionals who wish to learn how to train large-scale models and deploy them in the cloud
- IT and engineering professionals looking for hands-on training in ML and AI
- Recent graduates and academics in computer science
Notes
Software Requirements
This course is built around AWS solutions and services. To complete the lab activities and the mini-project, you will need:
- An AWS account
- Access to the Databricks Community Edition (free)
- Python 3.5 or later
Prerequisites
There are no mandatory prerequisites for this course. However, you are required to complete a self-assessment to ensure you meet the requirements to enrol.
Self-assessment for enrolment
- A minimum of 1.5 years of working experience with the following skill sets: Python programming, big data tools, distributed systems and MapReduce, relational databases, Linux commands, cloud platforms (e.g. AWS), classical machine learning algorithms, and the scikit-learn library or equivalent.
Recommended Prerequisites:
- DAT 210 Cloud Computing for Data Scientists
- DAT 220 Big Data for Data Scientists
Applies Towards the Following Program(s)
- Deep Learning and Scalable Machine Learning: Required