Data & AI fundamentals
Having an overview of the different responsibilities and tasks of a data team is crucial. When everyone on the team understands what their colleagues are working on, the team's collaboration and performance improve considerably.
The goal of this module is to establish a common understanding, introduce key concepts and give an overview of the AI and data landscape. This module sets the foundations for the following topics: machine learning, data governance, cloud, DevOps, DataOps, data platforms, and data strategy.
Data & AI Strategy
Data Strategy is an inherent part of all data projects. It is important not to lose track of the bigger picture and to keep in mind the vision and the target objectives the team wants to reach. As such, being aware of the team's data strategy and all of its different aspects is very important.
The goal of this module is for all participants to get acquainted with the Data Strategy side of data projects. This training focuses on how data and a strategic data approach can help businesses thrive, and explains the role of data within an organization. It is intended for all types of data-related profiles.
Data exploration and storytelling
Which questions should you ask of your data? Fully understanding the data is not always straightforward, yet it is a very important skill for leveraging the full potential of data.
In this module, we explain step by step how to tackle data exploration and how to get insights from data. You will learn how to query the data and tell its story using Python, pandas and visualization libraries.
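As a flavour of what this looks like in practice, here is a minimal exploration sketch with pandas and matplotlib. The file name and column names ("sales.csv", "order_date", "revenue") are hypothetical placeholders, not part of the training material itself.

```python
# Minimal data-exploration sketch with pandas and matplotlib.
# "sales.csv" and its columns are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv", parse_dates=["order_date"])

# First questions to ask of any dataset: shape, types, missing values, summary stats.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.describe())

# A simple story to tell: how does revenue evolve per month?
monthly = df.groupby(df["order_date"].dt.to_period("M"))["revenue"].sum()
monthly.plot(kind="bar", title="Monthly revenue")
plt.tight_layout()
plt.show()
```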
Software Engineering
Nowadays, when working with data, a big part of the job requires coding. Luckily, software engineers have already come up with best practices to deliver good-quality, robust and tested code.
The goal of this module is to teach participants the Software Engineering principles they will need in their day-to-day job, such as testing, alerting and monitoring, package management, etc. This module focuses on Python as it is the most commonly used programming language within the field of data.
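To illustrate the testing part, here is a minimal pytest sketch. The clean_email helper is a hypothetical example function, used only to show how behaviour and error handling can be covered by tests.

```python
# Minimal testing sketch with pytest; clean_email is a hypothetical helper.
import pytest


def clean_email(raw: str) -> str:
    """Normalise an e-mail address: strip whitespace and lowercase it."""
    email = raw.strip().lower()
    if "@" not in email:
        raise ValueError(f"not an e-mail address: {raw!r}")
    return email


def test_clean_email_normalises_case_and_whitespace():
    assert clean_email("  Jane.Doe@Example.COM ") == "jane.doe@example.com"


def test_clean_email_rejects_invalid_input():
    with pytest.raises(ValueError):
        clean_email("not-an-email")
```

Running `pytest` in the project folder picks up and executes both tests automatically.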
Cloud
Cloud computing is used more and more in the industry, and especially in the data & AI domain. It is therefore important to have a good understanding of the main concepts of the cloud.
The goal of this module is to cover the main services from the three most popular cloud providers (AWS, GCP and Azure). Candidates will put the learnings into practice with small exercises on each of the clouds. Finally, infrastructure as code will be introduced and candidates will learn how to provision their own infrastructure with Terraform. Check out our experience with Terraform and our Terraform templates.
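Purely as an illustration of the kind of small exercise involved, here is a minimal AWS sketch using boto3 (the AWS SDK for Python). The bucket and file names are placeholders, and credentials are assumed to be configured locally.

```python
# Minimal AWS exercise sketch with boto3: list buckets and upload a file.
# Assumes credentials are already configured (e.g. via `aws configure`);
# the bucket name and file are placeholders.
import boto3

s3 = boto3.client("s3")

# List the buckets visible to the configured credentials.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# Upload a local file into a bucket under a chosen key.
s3.upload_file("report.csv", "my-example-bucket", "raw/report.csv")
```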
DevOps
DevOps is a set of practices that combines software development and IT operations. It aims to shorten the systems development life cycle and provide continuous delivery with high software quality. A machine learning, data or cloud engineer should know and apply DevOps principles in multiple parts of their job.
The goal of this module is for the participants to get a good grasp of the concepts of DevOps through presentations and a concrete end-to-end example. In this module, the participants will learn the full DevOps lifecycle using GitHub.
Data Engineering Pipelines
Building data pipelines is at the core of data & AI jobs. Extracting, transforming and loading (ETL) data is a necessary step before any intelligence can be added. Building strong, robust and scalable pipelines is an important responsibility of the data team.
Hence, in this module, candidates learn how to build large-scale distributed data processing pipelines, with an exercise centered around Apache Spark. They will learn to develop a small application, build and deploy it on the cloud (AWS or Azure), and schedule it with Airflow. Finally, they also learn how to optimise and monitor data pipelines.
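A minimal PySpark sketch of such a pipeline is shown below; the input and output paths and column names are hypothetical placeholders, not the actual exercise.

```python
# Minimal PySpark pipeline sketch: extract, transform, load.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-pipeline").getOrCreate()

# Extract: read raw CSV data from object storage.
orders = spark.read.csv("s3://raw-zone/orders/", header=True, inferSchema=True)

# Transform: keep completed orders and aggregate revenue per customer.
revenue_per_customer = (
    orders
    .filter(F.col("status") == "completed")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_revenue"))
)

# Load: write the result as Parquet for downstream consumers.
revenue_per_customer.write.mode("overwrite").parquet(
    "s3://curated-zone/revenue_per_customer/"
)

spark.stop()
```

In the training, a job like this would then be packaged, deployed to the cloud and scheduled with Airflow.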
Data platforms
Data platforms support all data & AI capabilities, and thus understanding what data platforms are and what they offer is core to any data job.
In this module, the participants will become familiar with the different types of data platforms and their key components. This module covers general information about data platforms and deep dives into the cloud-specific ones. The candidates will also acquire knowledge about data lakes, NoSQL and APIs. Finally, they will take a deep dive into dbt as a good example on the SQL database side.
Machine Learning Methodology
Machine Learning and AI are trending topics in research, industry and business processes, and there are many different types and methods of machine learning that you may apply as a data science practitioner.
In this module, our experts provide industry-wide tips and best practices, knowledge on the different methodologies and techniques, and the different steps to best execute a machine learning project. Participants will learn about all phases in the CRISP-DM methodology (Cross-Industry Standard Process for Data Mining) and will have the opportunity to apply them on a use case.
The focus of this module is on how to deliver machine learning use cases from end to end. We assume participants already have a basic understanding of machine learning algorithms and Python programming from the get-go.
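As a rough illustration of the modelling and evaluation phases of CRISP-DM, here is a minimal scikit-learn sketch on a built-in toy dataset; the model choice and dataset are illustrative only, not the actual use case of the module.

```python
# Minimal sketch of the modelling and evaluation phases of CRISP-DM,
# using scikit-learn on a built-in toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Data understanding / preparation: load features and target, hold out a test set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Modelling: fit a baseline classifier.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluation: check performance on unseen data before thinking about deployment.
print(classification_report(y_test, model.predict(X_test)))
```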
MLOps
MLOps is a set of best practices for implementing and operating Machine Learning models (testing, deploying, integrating and monitoring them) with the goal of having the models run reliably in production environments. It has become a core skill for the modern machine learning engineer and can be seen as the intersection of machine learning, DevOps and data engineering.
In this module, we alternate between best practices and concrete examples with Great Expectations, MLflow, DVC and PyCaret. You will learn how to build and manage an AI production lifecycle with continuous training, delivery and automation of ML models.
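As one small example of the MLflow side of this, here is a minimal experiment-tracking sketch; the dataset, model and run name are illustrative placeholders.

```python
# Minimal MLflow tracking sketch: log parameters, metrics and the trained model
# so experiments stay reproducible. Dataset and model choice are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="ridge-baseline"):
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))

    # Record what was run and how well it did, plus the model artifact itself.
    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("mse", mse)
    mlflow.sklearn.log_model(model, "model")
```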
Data Governance
Data is everywhere these days, yet this data proliferation brings many challenges: naming and definitions are not aligned across silos; data is scattered across the organisation in hundreds of locations, with no one knowing what data is stored where, what it is used for and whether it is reliable; reports are built in many different tools, each with their own logic; and even AI models start to get entangled, one reusing the results of another.
To govern means “to be in control of”. In the case of data and analytics, that means defining: who is responsible for a piece of data, a report or an algorithm; what is in place to prevent uncontrolled data duplication; how and where to document (meta)data (i.e. definitions, business rules and calculations); how and where data quality is defined, measured, reported and managed; and what tool is to be used for what purpose.