DAT 220 - Big Data for Data Scientists
Course Description
Tackle Big Data head-on! Make sense of the massive data volumes that connect the world. Navigate through structured and unstructured data to find relevant information. Explore distributed systems and big data through both theory and practical skills.
Learn how to overcome the limitations of Python and single-node architecture. Apply data analysis to massive datasets using the most popular Big Data tools and frameworks, such as Hadoop, Hive, TrinoDB, and Apache Spark.
Gain valuable insights into how large Canadian companies in Media, Retail, and Banking apply Big Data in practice.
This course is delivered in collaboration with WeCloudData.
Course Details
Learning Outcomes
By the completion of this course, successful students will be able to:
- Explain the principles of big data and modern distributed computing technologies such as distributed file systems, MapReduce, and columnar databases
- Apply big data frameworks such as Apache Spark and TrinoDB to process massive datasets for data analytics and data science
- Explain enterprise data lake concepts and their connection to the data warehouse
- Construct big data solutions for at least one large dataset (billions of records) to discover insights with Spark on the Databricks platform
Topics
- Introduction to distributed systems
- Introduction to Databricks and EMR on AWS
- Hadoop and MapReduce (Data Lake)
- SQL for massive data mining (Hive, Presto, AWS Athena)
- Big Data Processing with PySpark
- Building your first big data application with Hadoop and Spark on AWS
Who is this course for?
This course is designed for:
- Data and business analytics professionals who want to know how to handle big data workloads
- Individuals who wish to pursue a career in Big Data
- Recent graduates and academics in Computer Science
Notes
Software Requirements
This course is built around AWS solutions and services. To complete the lab activities and mini-project, you will require:
- An AWS Account
- Access to Databricks Community Edition (Free)
- Python version 3.5 or later
Prerequisites
There are no mandatory prerequisites for this course. However, you are required to complete a self-assessment to ensure you meet the requirements for enrolment.
Self-assessment for enrolment:
A minimum of one year of experience with each of the following skill sets: Python programming, algorithms and data structures, relational databases, Linux commands, and a cloud platform (e.g. AWS).
Recommended Prerequisites:
- ICT 128 Relational Databases Fundamentals
- DAT 210 Cloud Computing for Data Scientists
Applies Towards the Following Program(s)
- Big Data in Cloud: Required