Fabric end-to-end use case: Data Engineering part 2 - Pipelines

2023-09-14 | Sam Debruyn

8 minutes read

Welcome to the third part of a 5-part series on an end-to-end use case for Microsoft Fabric. This post will focus on the data engineering part of the use case. In this series, we will explore how to use Microsoft Fabric to ingest, transform, and analyze data using a real-world use case. The series focuses on data engineering and analytics engineering. We will be using OneLake, Notebooks, Lakehouse, SQL Endpoints, Data Pipelines, dbt, and Power BI. All posts in this series This post is part o…

Building Data Mesh Architectures with Dataflow on Google Cloud

2023-09-10 | Dishani Sen

4 minutes read

Hi again, data enthusiasts! Welcome to an exciting blog where we will explore the world of data mesh architectures and how they help solve data-related challenges for businesses. In this blog, we will delve into the concept of data mesh, the problems it addresses, and the role of Dataflow on Google Cloud Platform (GCP) in implementing data mesh architectures. We will also demonstrate how Dataflow seamlessly integrates with data mesh to empower organisations with efficient data management and ana…

Fabric end-to-end use case: Data Engineering part 1 - Spark and Pandas in Notebooks

2023-09-07 | Sam Debruyn

11 minutes read

Welcome to the second part of a 5-part series on an end-to-end use case for Microsoft Fabric. This post will focus on the data engineering part of the use case. In this series, we will explore how to use Microsoft Fabric to ingest, transform, and analyze data using a real-world use case. The series focuses on data engineering and analytics engineering. We will be using OneLake, Notebooks, Lakehouse, SQL Endpoints, Data Pipelines, dbt, and Power BI. All posts in this series This post is part …

Data Warehousing Made Easy: Leveraging Google BigQuery for Seamless Analytics

2023-09-03 | Dishani Sen

8 minutes read

In our previous blogs about Data Engineering on Google Cloud Platforms (GCP), we touched upon how to choose the right database for your data projects. In this blog, we will look at Google BigQuery in detail. We will cover the features of BigQuery, best practices and performance optimisation techniques. Google BigQuery is a fully managed, petabyte-scale analytics data warehouse that enables businesses to analyse all their data very quickly. It is a cloud-based service that offers a number of ben…

Fabric end-to-end use case: overview & architecture

2023-08-31 | Sam Debruyn

6 minutes read

Welcome to the first part of a 5-part series on an end-to-end use case for Microsoft Fabric. This post will focus on the architecture overview of the use case. In this series, we will explore how to use Microsoft Fabric to ingest, transform, and analyze data using a real-world use case. The series focuses on data engineering and analytics engineering. We will be using OneLake, Notebooks, Lakehouse, SQL Endpoints, Data Pipelines, dbt, and Power BI. All posts in this series This post is part o…

Best Practices for Selecting a Database on Google Cloud Platform

2023-08-27 | Dishani Sen

5 minutes read

In our previous blog, we gently introduced GCP for data engineering and shared a lite edition of best practices for data storage, security and machine learning. In this one, we will dive deep into choosing a database on GCP. The importance of a database in data engineering cannot be overstated. Data engineering is the backbone of any data-driven organisation, responsible for designing, building, and maintaining the infrastructure and systems that enable efficient and effective dat…

Let Fabric teach you how to code with Data Wrangler

2023-08-24 | Sam Debruyn

6 minutes read

I'm going to be honest with you. I'm bad at writing Pandas data transformation code. Throughout the years I mostly focussed on the Spark APIs in Scala and PySpark, SQL, dbt, and some others, but I find the Pandas APIs usually just confusing and hard to read. I don't like the black box magic and lock-in of low-code solutions either. Did you know that Microsoft Fabric has the perfect middle ground for this? It's called Data Wrangler. Let's dive in! Actually, Data Wrangler is not a new thing. It w…

Best Practices for Data Engineering on Google Cloud Platforms

2023-08-20 | Dishani Sen

4 minutes read

Data engineering is a crucial process involving data collection, storage, processing, and analysis. It plays a vital role in the data science lifecycle, enabling businesses to make informed decisions. In this blog post, we will discuss practical best practices for data engineering on Google Cloud Platforms. These practices will help you build scalable, reliable, and secure data engineering solutions. Additionally, we will cover individual services like BigQuery, DataFlow, Dataplex in detail in o…

How to use service principal authentication to access Microsoft Fabric's OneLake

2023-08-17 | Sam Debruyn

4 minutes read

Microsoft recently added support to authenticate to OneLake using service principals and managed identities. This allows users to access OneLake from applications without having to use a user account. Let's see how this works. Enabling the feature 2 new settings in the Fabric / Power BI Admin Portal need to be enabled for this feature to work. Both can be found in the Tenant settings section. The first one is named Allow service principals to use Power BI APIs and can be found under Develope…

Enhancing Data Consistency in AI-powered Q&A Systems

2023-08-16 | Senne Batsleer

6 minutes read

At Dataroots, we are constantly exploring the latest advancements in the world of AI. In that regard, we're currently developing a Q&A system powered by a Large Language Model (LLM). It has access to a vast array of internal documents, which it can search through at query time to produce accurate and context-aware responses. However, as the document volume grows, so does the challenge of maintaining data quality and consistency. In this blog post, we will delve deeper into this issue and show how…

How to use Whisper, Open AI and ChatGPT to improve inclusivity with an intelligent voice assistant in the On Wheels app

2023-08-13 | Sophie De Coppel

9 minutes read

With the evolutionary wave of large language models (LLMs) come a lot of new interesting applications. And what better way to use them than to make the world a better place, one line of code at a time. In this blogpost, we'll explain how we utilised LLMs to develop a voice assistant to increase inclusivity for an accessibility application. Let's backtrack a bit and explain the context of this project, because no great invention is ever made alone. For this project Dataroots collaborated with On Whe…

A closer look at Microsoft Fabric pricing, billing, and autoscaling

2023-08-10 | Sam Debruyn

10 minutes read

If you're considering using Microsoft Fabric, you're probably thinking "How much is this going to cost me?" Continue reading to learn how Microsoft might have just created the most compelling data platform offering available today. It's still a bit early to get a complete and thorough overview of how Microsoft is planning on billing for Fabric. But if we look at some public statements and documentation, we can piece it all together to get a pretty good idea of what to expect. Keep in mind that …

Recovering the ink of Herculaneum using neural networks

2023-08-06 | Adrien Debray, Johannes Lootens

14 minutes read

Context In 79 AD, Mount Vesuvius erupted, and thousands of scrolls in a library located in a Roman villa in Herculaneum were carbonized by the heat of the volcanic debris. For centuries, as virtually every ancient text exposed to the air decayed and disappeared, these Herculaneum Papyri waited underground, intact. The scrolls were discovered a few hundred years ago and have been waiting to be read using modern techniques. In 2015, Dr. Brent Seales was the first to read some scrolls without opening…

Migrating Azure Synapse Dedicated SQL to Microsoft Fabric

2023-08-03 | Sam Debruyn

9 minutes read

If all those posts about Microsoft Fabric have made you excited, you might want to consider it as your next data platform. Since it is very new, not all features are available yet and most are still in preview. You could already adopt it, but if you want to deploy this to a production scenario, you'll want to wait a bit longer. In the meantime, you can already start preparing for the migration. Let's dive into the steps to migrate to Microsoft Fabric. Today: starting from Synapse Dedicated SQL P…

Evaluating large language model applications with LLM-augmented feedback

2023-08-02 | Maciej Piotrowski

9 minutes read

Many large language models (LLMs) today can be evaluated and selected based on benchmarks that span a wide variety of tasks. These benchmarks, for example, on the Open LLM leaderboard maintained by Hugging Face, provide performance metrics across numerous domains to select the optimal model for your use case. Unfortunately, these benchmarks often fall …

Transforming Tabular Data in Python

2023-07-30 | Pieter Fiers

7 minutes read

Comparing Pandas v. Polars v. PyArrow v. DuckDB Let me set the scene: you're brimming with excitement to explore that shiny new data. Luckily it’s manageable enough to use your own computer; so you grab the nearest Python notebook or REPL, ready to start hacking away. But which library to use? Naturally you could reach for trusty old Pandas. Or how about a fancy new dataframe library like Polars or datatable? Or you could try embedded SQL with DuckDB for a change. Let’s delve into the dazzling…

Connect to Fabric Lakehouses & Warehouses from Python code

2023-07-27 | Sam Debruyn

4 minutes read

In this post, I will show you how to connect to your Microsoft Fabric Lakehouses and Warehouses from Python. Packages & dependencies To connect to Fabric, we'll use the Microsoft ODBC Driver. This driver is available for Windows, Linux, and macOS. Click on your operating system to download and install the driver: * Windows * Linux * macOS (uses Homebrew) Next, we'll need a Python package to connect using ODBC and a Python package to authenticate with Azure Active Directory. We can instal…

PySpark, Dask or Ray, how to scale your python workloads?

2023-07-23 | Ilion Beyst, Arnaud Fombellida

7 minutes read

When it comes to scaling out Python workloads, the landscape is filled with options. Among the prominent choices available today are PySpark, Dask, and Ray. As these systems differ significantly in their design and approach, capabilities and benefits, determining the optimal fit for your specific use case can be difficult. In this blog post, we aim to provide clarity by exploring the major options for scaling out Python workloads: PySpark, Dask, and Ray. By understanding the differences and nua…

Preparing a migration to Microsoft Fabric: from Azure Synapse Serverless SQL

2023-07-20 | Sam Debruyn

4 minutes read

If all those posts about Microsoft Fabric have made you curious, you might want to consider it as your next data platform. Since it is very new, not all features are available yet and most are still in preview. You could already adopt it, but if you want to deploy this to a production scenario, you'll want to wait a bit longer. In the meantime, you can already start preparing for the migration. Let's dive into the paths to migrate to Microsoft Fabric. Today: Starting from Synapse Serverless SQL …

Comprehensive analytics engineering offering to enhance speed-to-insight

2023-07-18 | Sam Debruyn, Kevin Missoorten

1 minute read

We are thrilled to announce the expansion of our service offerings to include a broad analytics engineering function, marking a significant step in our commitment to drive customer success in the digital age. In the rapidly evolving field of data, the role of analytics engineering has come to the forefront. This function serves as a crucial bridge between engineering and analyst roles, utilizing engineering best practices to deliver clean, transformed datasets ready for analysis. The arrival of…

Data Spaces: Unlocking the Power of Connected Information

2023-07-16 | Samuel Fanoukoe

7 minutes read

In today's digital world, data has gradually become an invaluable currency that drives innovation, decision-making, and business success. The exponential growth of data has led to new challenges in managing, organizing, and extracting meaningful insights from vast and diverse datasets. To address these challenges, the concept of a data space has emerged, offering a promising solution to unlock the power of connected information. In this blog post, we will explore the notion of data spaces, their…

How to win your first Kaggle competition?

2023-07-09 | Adrien Debray, Johannes Lootens

8 minutes read

You want to get started with Kaggle competitions? You saw an interesting challenge or the big prize money but feel a bit lost about how to tackle the competition? This blog provides a broad overview of Kaggle competitions, guides you through the winning methodologies, and offers tips and tricks to help you tackle a Kaggle competition more effectively. All you need to know about Kaggle competitions 💡 Kaggle is a platform where data enthusiasts come together to explore and analyse datasets a…

Exploring OneLake with Microsoft Azure Storage Explorer

2023-07-05 | Sam Debruyn

5 minutes read

Recap: OneLake & Delta Lake One of the coolest things about Microsoft Fabric is that it nicely decouples storage and compute and it is very transparent about the storage: everything ends up in the OneLake. This is a huge advantage over other data platforms since you don't have to worry about moving data around, it is always available, wherever you need it. To store the data in the OneLake, Microsoft Fabric makes use of Delta Lake. Delta is an open-source data format built upon Apache Parquet (…

Data preparation for a Q&A application powered by LLMs

2023-07-01 | Andrea Benevenuta

10 minutes read

Large Language Models (LLMs) have recently emerged as a groundbreaking advancement in the field of natural language processing. These models are designed to comprehend and generate human-like text, exhibiting a remarkable ability to understand context, grammar, and even nuances in language. One of the applications of such models is to extract relevant information from vast document collections. By extracting pertinent information, LLMs can be used to efficiently provide fast and precise answers…

6 x 6 ways to engage with your data community

2023-06-28 | Sam Debruyn

7 minutes read

Dataroots is very active in the data community thanks to its strong culture of internal knowledge sharing which we also like to share with the outside world. We do this by organizing meetups, giving talks, writing blog posts, and more. In this short post, we list 6 times 6 ways you can engage with the community of people working with data. Meetups We noticed that a lot of meetups started to appear after the pandemic. People apparently like to go out and share knowledge. We are very happy to s…

Fine-tuning BERT for an unbalanced multi-class classification problem

2023-06-25 | Antoine Caytan

19 minutes read

Fine-tuning BERT for an unbalanced multi-class classification problem Predicting the team responsible for an incident from its description with Natural Language Processing and Machine Learning 1. Introduction 1.1 Context As a Data Engineer at Dataroots, I was sent to a team in charge of promoting a Data Driven approach in the IT department of one of our clients. The major step was to set up a data lake to centralise the data from the whole IT department. One of the first use cases of thi…

Herding the flock with MotherDuck: your next data warehouse?

2023-06-22 | Sam Debruyn

8 minutes read

At Dataroots we've been intrigued by the new path the data community is taking with DuckDB. You can listen to our podcast or read our previous blog post about this topic. To refresh your memory a bit: DuckDB is an in-process OLAP. It's similar to what SQLite is doing for OLTP workloads. DuckDB is an open-source project steered by the DuckDB Foundation, a Dutch non-profit organization. You can transform data with the tools you already love, in the languages you already know. DuckDB's SQL syntax …

Introducing a prototype LLM API Starter Kit

2023-06-21 | Tim Leers

2 minutes read

In today's tech environment, the influence of large language models (LLMs) is profound and only growing. Despite this, many developers find themselves on the outskirts, unsure of where to start or how to integrate these AI powerhouses into their projects. To help bridge this gap, there's an exciting new resource: The llm-api-starterkit. This accessible API Starter Kit enables developers to prototype an LLM-powered microservice with ease, providing a streamlined introduction to harnessing the fu…

Streamlining Data Exploration: A Comparison of Profiling Tools for Effective EDA

2023-06-18 | Niels Baptist

8 minutes read

When getting introduced to new data, there is always a learning curve in which we need time to explore the data before feeling comfortable with it. In this blogpost we’ll go over the different ways we can make this faster and easier.

Welcome to the 3rd generation: SQL in Microsoft Fabric

2023-06-15 | Sam Debruyn

9 minutes read

While typing this blog post, I'm flying back from the Data Platform Next Step conference where I gave a talk about using dbt with Microsoft Fabric. DP Next Step was the first conference focussed on Microsoft data services right after the announcement of Microsoft Fabric so a lot of speakers were Microsoft employees and most of the talks had some Fabric content. Fabric Fabric Fabric, what is it all about? In this post I'll go deeper into what it is, why you should care and focus specifically on …

Kickstart your next Python project 🚀

2023-06-11 | Murilo Cunha

7 minutes read

Starting a new project and not sure where to begin? What should the directory structure look like? What are these “best practices”? It may sound a bit silly, but I've found myself spending waaaay too much time on these "small" decisions. If that's you, it's probably because you don't have a strong opinion about how to structure your Python project. So we've added a list of tools and practices in a project template, so you don't have to. TLDR; We’ve put down a small list of things to include in you…

The Power of AI in Songwriting: Creating a Sensational Hit for Niels Destadsbader

2023-06-08 | Senne Batsleer, Dorian Van den Heede

7 minutes read

Recently, Dataroots received an intriguing request from a journalist at De Tijd, a prominent Flemish newspaper. She was working on an AI-focused podcast series called “De Aionauten”, where the audio team of De Tijd explores the current state of AI in various domains. As a challenge for their podcast, she approached Dataroots with a specific task in mind: creating a brand new hit song for the acclaimed Belgian artist, Niels Destadsbader. To tackle this ambitious task, De Tijd had already reached…

Managing Federated Learning Infrastructure with Terraform and Azure

2023-06-04 | Eya Akrimi, Dishani Sen

11 minutes read

Federated learning is a revolutionary approach to machine learning that allows data scientists to train models on decentralised data sources, without ever having to access the data directly. This approach has several advantages, including improved privacy, reduced communication costs, and increased scalability. At Dataroots, we have implemented federated learning infrastructure for three hospitals and a research center using a combination of Terraform and Docker, as well as a robust CI/CD pipeli…

Artificial General Intelligence: The Quest for an ultimate invention - Experts opinion

2023-06-01 | Martial Van den Broeck

13 minutes read

Artificial intelligence has made remarkable progress over the past few years, from self-driving cars to virtual assistants that can understand and respond to natural language. However, while AI systems are getting better at performing specific tasks, they still lack the versatility and flexibility of human intelligence. Enter Artificial General Intelligence (AGI), the quest for a thinking machine. An attempt of definition It's hard to find a consensus definition in the community, here's a pote…

How Open is Open Data?

2023-05-28 | Dishani Sen

11 minutes read

This is a follow-up blog to the article "What is data circularity and why should you care?" where we discussed data circularity and why we need it. It's a common misunderstanding that "Data Circularity" means "Open Data", or vice versa. In this blog we will look closely at what open data is and the challenges associated with it, and try to understand its relationship with the concept of data circularity. Introduction The notion of open data has drawn greater interest over the pas…

LLMOps, GenerativeOps or AgentOps? Distinguishing the challenges in contemporary LLMOps

2023-05-28 | Tim Leers

2 minutes read

The term "LLMOps" (Large Language Model Operations) is debated - while it does encapsulate the operational challenges of deploying and managing large language models, it's the powerful, generative, and interactive nature of contemporary LLMs that presents distinct challenges and opportunities. Moreover, the "large" aspect of LLMs that poses significant infrastructure challenges may be temporary as the open-source world continues to shrink model sizes. As such, the focus of "LLMOps" might shrink t…

From MLOps to LLMOps - what's the difference?

2023-05-22 | Tim Leers

3 minutes read

What is MLOps? Machine Learning Operations (MLOps) can be treated as a subset of challenges in software Development Operations (DevOps), with the latter encompassing software engineering best practices and principles used to streamline the process of delivering software in companies. MLOps concentrates on the unique challenges brought about by the development of ML-powered projects and products, particularly due to the nascent state of ML, artifact management and reproducibility issues, unique…

Applying a MLOps approach to Federated learning using ML Flow with NV Flare: A Healthcare use case

2023-05-21 | David Valdez, Ardalan Mehraram, Adrien Debray, Christoforos Trakas, Maciej Piotrowski, Johannes Lootens

14 minutes read

Introduction In healthcare, developing accurate models can be difficult due to the low amount of data and privacy constraints. Federated learning (FL) addresses this by allowing data analytics and modeling to take place without sharing data outside of the premises. As data science projects become more complex, Machine Learning Operations (MLOps) have become essential for improving the efficiency, quality, and speed of machine learning models. However, applying MLOps to FL comes with new chal…

From Pandora's Box to the Genie's Lamp: Overcoming Challenges in Enterprise Adoption of Large-Language Models

2023-05-17 | Tim Leers

3 minutes read

The race is on to harness the potential of Large Language Models (LLMs) in enterprises. Right now, there is significant risk in adopting LLMs in many use cases, without a clear path to deploying them to deliver business value. In part, that is because the broad principles that drive value creation in traditional machine learning (ML) model deployment and operations (MLOps) aren't directly transferable to LLM operations (LLMOps). In collaboration with Dataroots research & Talan research, we iden…

Was the weather beginning of May 2023 normal? Let's visualize the data

2023-05-16 | Lode Nachtergaele

3 minutes read

From casual small talk to deep philosophical discussions, conversations about the weather have permeated our daily interactions for centuries. Its omnipresence in our social exchanges raises intriguing questions about the significance we attribute to something as seemingly mundane as the weather. Renowned philosophers have delved into the subject, contemplating its connection to the more significant meaning of life. In this blog post, we demonstrate that our opinions on weather are often biased …

Cloud-native framework for federated learning, designed with privacy and security at its core

2023-05-08 | Dishani Sen, Eya Akrimi

11 minutes read

In the roots academy session of March 2023, a group of Data & Cloud engineers and ML engineers collaborated to deliver a cloud-native framework for healthcare data analysis, designed with privacy and security at its core: a federated learning framework for healthcare. In this blog, we introduce the problem, the goals of the project as well as the architecture proposed as the solution. The solution consists of a fully operational, federated learning framework for healthcare data analysis.

What is data circularity, and why should you care?

2023-05-01 | Dishani Sen

6 minutes read

Recently, I attended the Conference on New Techniques and Technologies for Statistics organised by the European Commission from March 6-10th in Brussels. I was pleased to discover several relevant data circularity sessions and conversations. If you are wondering what that is, then this blog is a gentle introduction for you to the what, why, and how of data circularity. Organisations create and consume massive volumes of data in today's data-driven environment. Data is a key asset that drives d…

From doodle to bedazzled - with generative AI

2023-04-18 | Bart Smeets

6 minutes read

Over the last months, I’ve read several posts showing how to use generative AI to generate comics. Sometimes the visuals, sometimes the story, sometimes a combo of both. It is often used as a starting point, and the comic evolves from there. Or it serves as a moodboard for the artist to quickly sketch up some ideas to work on manually as a next step. I’ve not yet seen (and this is probably simply my limited view of the world) generative AI being used to improve the aesthetics of an already exist…

GPT4 - Unlocking the power of multimodal learning: overcoming challenges and harnessing opportunities

2023-04-14 | Silke Plessers

7 minutes read

In March, OpenAI released GPT-4, the successor of the famous GPT-3 model. While it has many impressive features such as the ability to handle longer context windows, more advanced reasoning and natural language generation and understanding skills, one of the most surprising added features is the ability to process image input. Indeed, GPT-4 is multimodal, meaning it can generate content based on both images and textual inputs. What is multimodal learning? Why would we do this? How can we levera…

Using Twitter to predict Covid infections in Belgium

2023-04-02 | Dishani Sen

14 minutes read

Digital trace data refers to the electronic data that is generated as people interact with digital platforms, such as social media, online search engines, and mobile applications. These data have become valuable sources of information for businesses and researchers, particularly in the field of public health. The everyday opinions expressed on social media provide promising opportunities for measuring population-level statistics for health metrics, one such being disease prevalence. In this blo…

SOS - RuntimeError: CUDA Out of memory

2023-03-27 | Silke Plessers

9 minutes read

Training large (transformer) models is becoming increasingly challenging for machine learning engineers. With new and massive transformer models being released on a regular basis, such as DALL·E 2, Stable Diffusion, ChatGPT, and BLOOM, these models are pushing the limits of what AI can do and even going beyond imagination. One thing these transformer models have in common is that they are big. Very very very big. For instance, the famous GPT-3 model has around 175 billion parameters, BLOOM has e…

A gentle introduction to blockchain

2023-03-19 | Paolo Léonard

5 minutes read

A gentle introduction to blockchain, focusing on the proof-of-work algorithm.

Quantitatively measuring speech quality and training a text-to-speech model for Flemish Dutch

2023-03-05 | Silke Plessers

14 minutes read

Recently, Microsoft released VALL-E, a revolutionary new language model for text-to-speech (TTS) designed to significantly outperform other state-of-the-art zero-shot TTS models in terms of both speech naturalness and speaker similarity. VALL-E requires nothing more than a 3-second speech recording from a previously unseen speaker to synthesize high-quality speech. Unfortunately, VALL-E is not yet available to the public and since the model is trained solely on English data, it will no…

Is edge computing just a buzzword?

2023-02-26 | Stijn Dolphen

6 minutes read

Edge computing is one of the recent buzzwords in Artificial Intelligence and - according to Gartner - it even has the potential to reach mainstream adoption in two to five years, with transformational business benefits as a result. How are these so-called marginal calculations creating additional value at the edges of a network instead of the centralized server location - or even an infinite pool of cloud resources? Let’s find out. Setting the scene. The integration of cutting-edge technologie…

Song of the Machines (4): Digital Music Production

2023-02-05 | Arthur Chionh

5 minutes read

4 Dataroots colleagues, no professional music production experience, a heap of Artificial Intelligence (AI)-generated samples of music and lyrics. How did all these end up in the Song of The Machines? In this final instalment of our blogpost series on Beatroots and the 2022 AI Song Contest, we dive into the world of digital music production with AI. Digital Audio Workstations (DAW) Digital Audio Workstations (DAW) are software used for music production. Maybe you’ve heard of DAWs like ‘A…

Anomaly detection in images using PatchCore

2023-01-22 | Toon Van Craenendonck

8 minutes read

Anomaly detection typically refers to the task of finding unusual or rare items that deviate significantly from what is considered to be the "normal" majority. In this blogpost, we look at image anomalies using PatchCore. Next to indicating which images are anomalous, PatchCore also identifies the most anomalous pixel regions within each image. One big advantage of PatchCore is that it only requires normal images for training, making it attractive for many use …

Rootsacademy project: Fixing a slow AWS Lambda function

2023-01-15 | Nicolas Jankelevitch

7 minutes read

The rootsacademy: some believe it's as unrealistic and imaginary as Hogwarts, but nothing is further from the truth. This mythical academy really exists and turns wild partying scholars into professional consultants who are experts in the magical world of data. After the academy, most employees start working for their first client. For those who haven't found a match with a client yet (or for those whose project will only start in a couple of weeks' time, like myself) there is the rootsacademy pro…

Running Power

2023-01-10 | Thibauld Braet

9 minutes read

At dataroots there are quite a few sporty people around, and we are not averse to the occasional impulsive challenge. These are traits we definitely share with Bobby and Seppe, the hosts of the Jogclub podcast. Impulsiveness certainly has its charm, but a growing number of wearables has made sport more and more data-driven. So we thought it would be fun to team up with the guys from the Jogclub and test our gut feeling against these sensors in the run-up to the Jogclub Ultratrail. …

Create your own Christmas miracle with AI generated art

2022-12-22 | Sophie De Coppel

10 minutes read

Christmas is around the corner and you are still missing some cool Christmas cards? Well, I've got just the thing for you! Don’t let artist's block control you and start creating with the help of AI. This last year has been mind-blowing with the rise of recent AI art generators like DALL-E, Midjourney and their open-source nephew, Stable Diffusion …

Setting up AWS Infrastructure Using Terraform for Beginners

2022-12-18 | Jinfu Chen, Baudouin Martelée

5 minutes read

After one month of training at dataroots, some starters work on the internal project. The project of the Rootsacademy 2022 Q3 class consists of making an end-to-end solution for inferring information from traffic images. It goes without saying that this end-to-end solution requires infrastructure. In this post, we go through the infrastructure along with some tips and tricks to deploy AWS infrastructure using Terraform. Our aim is to explain things at a high level, such that you, the reader, can…

Song of the machines (3) : Generating lyrics with musical context.

2022-12-11 | Sander Van Grunderbeeck

5 minutes read

In this episode of “Can 4 Dataroots colleagues without music production experience write hit songs with AI?” you discover how the Beatroots team fine-tuned a transformer model to generate musical lyrics for their hit song. Lyric generation powered by Beatroots AI No hit song without meaningful lyrics to go with it, right? This statement motivated the Beatroots team to explore the use of AI for lyric generation. …

Tokyo Drift : detecting drift in images with NannyML and Whylogs

2022-12-04 | Warre Dreesen, Martial Van den Broeck

9 minutes read

Detecting drift in your data is very important when deploying models in production. It ensures that the performance of your model does not decrease due to the nature of the input data changing. There are a lot of tools out there for monitoring your data and detecting drift, such as Great Expectations, NannyML, etc. However, most of these are made for tabular data. In this blogpost we will discuss different approaches for detecting drift in images using popular tools. Creating drift using data au…

Is the role of Chief Data Officer still hot or not?

2022-11-23 | Kevin Missoorten

8 minutes read

The role of Chief Data Officer has evolved a lot since it was first introduced twenty years ago. Given the paradigm shift to the Data Mesh, some experts argue that the CDO’s responsibilities might once again drastically change, leading to the discussion of whether the role is still necessary or not. In 2002, Catherine Doss stepped up as Chief Data Officer of Capital One. Back then, the position of a CDO and the challenges it came with were very different from how they are viewed today. Since I have been a…

Real-Time Voice Cloning - tutorial

2022-11-10 | Virginie Marelli

2 minutes read

This tutorial demonstrates how a simple voice transfer app can be created using Streamlit. The code for this demo is based on the repository for Real-Time-Voice-Cloning. This app allows you to: * Record your voice * Visualize the embedding of the speaker * Synthesize speech based on the recorded voice Setup 1. Install Requirements Python 3.6 or 3.7 is needed * Create your virtual environment (e.g. pipenv …

Hyper parameter tuning with Optuna - tutorial

2022-10-27 | Hans Tierens

2 minutes read

You all know that datarootsians are excellent data athletes. Olympic athletes train with weights; we evolved past that mere display of physical strength and started training the weights. In this way, Machine Learning Engineers train our models to achieve optimal performance on any task given to us. That’s how we shine! In this tutorial, we explain how state-of-the-art hyper parameter tuning techniques work and when to apply them, using the Optuna library. The code GitHub - datarootsio/tutorial-hyperpa…

MLOps - tutorial

2022-10-13 | Vitale Sparacello, Murilo Cunha, Bram Vandendriessche

2 minutes read

"MLOps is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently." - wiki At dataroots we've been pioneers of the MLOps methodology since the very beginning. For us, MLOps means being able to identify all the business challenges, deliver the best solution quickly and efficiently, and monitor the project's evolution over time. To promote MLOps best practices, we have run a workshop to KU Leuven uni…

Songs of the machines (2) - Harmonisation

2022-10-10 | Zoë Van Noppen

4 minutes read

This blog post is part of a series of content in which we uncover how we wrote the song “Song of the Machines” with which we participated in the AI Song Contest 2022. AI songcontest As shared in the previous blog post, our vision for this year's song contest was to use AI tools as “creative partners in crime, music-making”. Rather than letting one model create the entire song, we would keep a human in the loop (for a limited part) of the songwriting model. However, in the spirit of the “AI Son…

Snowflake + Snowpark Python = machine learning?

2022-10-03 | Murilo Cunha

9 minutes read

Snowflake announced in June 2022 that they are offering Python support with Snowpark! 🎉 What does that mean, you ask? Well, that means that now we can do all sorts of things with Python in the Snowflake ecosystem, even some machine learning 🦾. "How?!?!", you ask? The short answer is: UDFs and stored procedures. The long answer is, as you could've guessed, a bit longer. What is Snowflake? If you're new to Snowf…

Face Mask Detection - tutorial

2022-09-29 | Toon Van Craenendonck

7 minutes read

Face masks are crucial in minimizing the propagation of Covid-19, and are highly recommended or even obligatory in many situations. In this project, we develop a pipeline to detect unmasked faces in images. This can, for example, be used to alert people who do not wear a mask when entering a building. We recorded a YouTube video to explain the general pipeline of this project. Our pipeline consists of three steps: 1. We detect all human faces in an image 2. W…

AI a catalyst for innovation

2022-09-26 | Virginie Marelli

4 minutes read

With the energy crisis, it is even more blatant that we need to work on reducing our carbon footprint and make better use of the planet's resources. Earlier this year, I wrote an article about the impact that AI has on energy and how much we need smart implementation of algorithms. I discussed the race of big models and the negative impact they can have on energy consumption and thus carbon emissions. I also listed a couple of solutions from the software and hardware perspective that are active area o…

The imagery revolution and how to create a logo with DALL-E

2022-09-18 | Virginie Marelli

11 minutes read

Joining the bandwagon Recently, text-to-image models have been sprouting all around the internet. It started with the release of DALL-E 2, a model created by OpenAI. > DALL·E 2 is a new AI system that can create realistic images and art from a description in natural language Many other similar models were released quickly after DALL-E. It went so fast that I'm not even sure which models were released when. The most famous ones are probably DALL-E, obviously, Midjourney, a model developed by Mid…

Great Expectations - tutorial

2022-09-15 | Paolo Léonard

3 minutes read

A brief tutorial for using Great Expectations, a Python tool providing batteries-included data validation. It includes tooling for testing, profiling and documenting your data and integrates with many backends such as pandas dataframes, Apache Spark, SQL databases, data warehousing solutions such as Snowflake, and cloud storage offerings (S3, Azure Blob Storage, GCS). This tutorial covers the main concepts you'll need to know to use Great Expectations, gently walk…

The price of healthy eating

2022-09-12 | Thibauld Braet

5 minutes read

It’s March 2020 and suddenly we’re all forced to spend time at home, and because restaurants and bars are closed, all together we rediscover our kitchen and our passion for banana bread. We’re forced to learn how to cook and are eager to make healthy lifestyle changes by cooking healthy and exercising more. Fast forward to February 2022 and our passion for banana bread is being threatened by a war causing a sudden increase in the price of flour. What do we have to do now? Are there other alterna…

Tremplin IA by Digital Wallonia

2022-09-08 | Virginie Marelli

2 minutes read

Is AI for you? You don't really know how to start? Then Tremplin IA is spot on. You have probably already heard a lot about AI and how it gives a competitive advantage to many companies. Yet, you don't really know where to start and how to test whether AI can bring value to your company. Tremplin IA is a program held by Digital Wallonia to help companies start a Proof of Concept (POC) in AI. Dataroots as a member of the AI experts of…

Our most popular posts - looking back

2022-09-05 | Virginie Marelli

2 minutes read

Last September, we started to regularly write blog posts. Everyone within Dataroots has been participating in this endeavour. People write about their job, about a conference, a new technology, or simply about their passion. Since September, we have been publishing one post a week, without fail, sometimes even 2 posts per week! Creating the content is probably the hardest: you need to find an interesting topic, have a creative angle, a pinch of opinion and mash this into a st…

Statistics Saga 2: Dimensionality Reduction

2022-08-29 | Chiel Mues

10 minutes read

Welcome back! If you remember from last time I told you we would continue with matrix factorisations, more specifically dive into some dimensionality reduction techniques! Hope you're ready! This blogpost will give you a comparison of two specific factorisation techniques that are foundational to the idea of dimensionality reduction: principal component analysis and factor analysis. Some of the techniques a…

Network analysis and community detection using Gephi

2022-08-22 | Silke Plessers

10 minutes read

Networks are everywhere around us. Just by reading this blog post, you traveled across the internet following links to land on different web pages. Your social life is defined by relationships with other people who in turn are connected to some other people. Nowadays, these friendships are also defined online by social media platforms such as Facebook, Twitter, LinkedIn, etc. Protein interactions, gene interactions, supply chain optimisation, payments and transactions, even the spread of a…

To be sentient or not to be sentient?

2022-08-08 | Romain Compagnie

7 minutes read

AI, sentience and Google LaMDA If you've followed the tech-related news recently, you probably heard about Google's latest conversational bot, called LaMDA (Language Model for Dialogue Applications). After a lengthy conversation with the bot, a Google engineer became convinced that the bot they designed was sentient, or conscious in more common language. His claim and the public release of the conversation sparked heated debate among AI experts and enthusiasts. This is the perfect opportunity …

Keep posted on our events!

2022-08-08 | Bart Smeets

0 minutes read

Just a short FYI :) As of now we will be listing all our upcoming events over at dataroots.io/events. We already have a calendar leading up to the end of next year and we will be publishing these events soon on our events page. Keep posted 🙌

Federated Learning - a tour of the problem, challenges and opportunities

2022-08-01 | Raul Jimenez Maldonado, Omar Safwat

11 minutes read

The majority of machine learning algorithms are data hungry: the more data we feed our models, the better they learn about the world’s dynamics. Luckily for us, data is everywhere in today’s world, dispersed over the different locations where it was collected. Examples of this are the user data collected on a daily basis by our cell phones, medical equipment and practitioners in medical facilities, etc. Conventionally, if we wanted to train a learning model, we would collect the d…

Next best action recommendation - part 3: recommending actions using reinforcement learning

2022-07-25 | Silke Plessers, Sandy Moens, Virginie Marelli

9 minutes read

You have heard about reinforcement learning for next best action optimization but don't really know why you would use it over other techniques or how to use it best? In this article, we try to demystify and explain how we used offline reinforcement learning to have a good baseline model for optimizing marketing campaigns. A quick recap As mentioned in the initial post of this series…

Leaning in for HuggingFace Spaces

2022-07-25 | Sophie De Coppel, Hans Tierens

6 minutes read

Deploying your Machine Learning model is often the cherry on the cake for a Machine Learning Engineer. After putting a lot of effort into building your model, it is immensely satisfying to be able to send it off on an adventure of its own, hoping it conquers the world. However, the size of the world is vast, so we better equip our model with a quick and able ride. The question is whether to build our own horse in the cloud or borrow someone else’s horse for the trip. I recently faced this q…

Terraforming Snowflake ❄️

2022-07-18 | Lidia-Ana-Maria Baciu

9 minutes read

It should go without saying that data is a critical asset for any organization. As a result, it is important that the platform handling all this data is able to do so with scalability and speed in mind. Enter... 🥁🥁🥁 Snowflake! Snowflake is a cloud platform for data x, where x = . So everything that is data-related basically. Terraform, on the other hand, is an infrastru…

Terraform with Azure became even more awesome: filling the gaps in your code with the azapi provider

2022-07-04 | Sam Debruyn

9 minutes read

The cloud is just someone else's computer and to manage that we prefer to use Infrastructure as Code (IaC). dataroots believes that IaC can benefit any team working with cloud resources and most often Terraform is our tool of choice there. As a data & cloud engineer focusing on Microsoft Azure, that is true for me as well. However, there have been a couple of hiccups along the road. We have to talk about the pro…

Song of the Machines (1): Sampling musical sections

2022-06-30 | Dorian Van den Heede

6 minutes read

Can 4 Dataroots colleagues without professional music production experience write hit songs with AI? In this blogpost series the Beatroots team members uncover how they wrote their latest song, Song of the Machines, which they submitted for the 2022 AI Song Contest. AI Song Contest The AI Song Contest is an international music competition exploring the use of AI in the songwriting process. We have participated with Beatroots since its inception in 2020. The pre…

Weather Nowcasting - Model compression

2022-06-27 | Margaux Gérard, Omar Safwat

5 minutes read

In our previous post, we explained our project of weather nowcasting in a general way. Now, we will dive deep into one of the most important steps in machine learning, which is model optimization. The need for optimizing model size and speed arises whenever the prediction model is required to run on an edge device, such as smartphones, surveillance cameras, robots, etc. Therefore, the challenge is maintaining a s…

🍔 Burgers & Drinks - Get to know Dataroots!

2022-06-21 | Silke Gerets

1 minute read

What: Get to know our dataroots team! Where: Tiensevest 132, Leuven Who: Everyone with a passion for data & AI About this event 🤖 You have a passion for data and AI? You also like a good burger as a study break or after finishing your exams? 🎯 We have the perfect event for you! 🍔 On Tuesday June 28th we’ll be organising a walking dinner at dataroots! The aim of this dinner is to give you an idea of what we do at dataroots and to meet some of our team members. ❗️Spots are limited, so …

The Great Industry - Heurisko - take aways

2022-06-20 | Richard Cosemans

4 minutes read

Industry 4.0 marks the fourth industrial revolution. What does that mean for us? Are we part of it, i.e. is the industry ready for big data and A.I.? Time to find out! Dataroots was invited to Heurisko 2022, the annual seminar hosted by Flanders Make, where the most innovative and industry-ready research results and applications are presented. Industry Zero to Industry Hero There have been multiple industrial revolutions in the last 200 years. The first industrial revolution happened around the…

Home design: how AI helps you customize your furniture

2022-06-13 | Sophie De Coppel, Hans Tierens

12 minutes read

Do you want to restyle or redesign your interior, but don't want to leave the comfort of your own home? Don't fancy reading through hundreds of interior design albums or going to your local furniture store to try imagining those couches in your own living room? Have you always dreamed of a couch with an extravagant tiger print, but you don't know if it will fit your interior? Well, I've got just the thing for you! During my internship at dataroots, I have built an AI-driven application, SofaStyler, which …

Some interesting takeaways from this year's Techorama

2022-06-09 | Sam Debruyn

5 minutes read

Last week was a busy week for fans of the Microsoft technology stack like myself. Microsoft hosted its yearly developer conference, Microsoft Build, announcing lots of exciting updates to new and existing Azure services. In the meantime, the Belgian community of Microsoft technology users gathered in Kinepolis Antwerp for this year's edition of Techorama. First off, let me start by thanking the incredible crew and partners …

Weather Nowcasting - deploying a model on edge

2022-06-06 | Margaux Gérard, Lidia-Ana-Maria Baciu, Adrian Gonzalez Carpintero, Omar Safwat

13 minutes read

The research department at Dataroots hosts its RootsAcademy twice a year, an initiation program that prepares its students for their career as consultants in data and AI. After the academy, the consultants take on their first internal project at Dataroots, with the aim of putting the concepts learned through the academy into practice. This March, we have been doing a proof of concept to automate the deployment of a weather nowcasting model on an NVIDIA Jetson Nano. Weather nowcasting is all abou…

Arty Farty - AI Song Contest 2021

2022-06-02 | Virginie Marelli

1 minute read

Since 2020, dataroots has participated in the AI Song Contest. Since the contest is around the corner and the team is working hard, we thought we would tease you with our previous entries! For the 2021 submission, Beatroots upped their game and studied the Jukebox model released by OpenAI. This model opened many opportunities to sample musical audio waves and complete musical ideas in the style of many genres and artists. Beatroots fine-tun…

How to develop a business-driven data strategy

2022-05-29 | Ben Mellaerts

8 minutes read

for companies with different operating models If you prefer the video version (with slides), it is available here. Organizations have a business strategy in place to define how they can achieve and maintain a sustainable competitive advantage. However, most organizations don’t yet have a strategy in place on how to extract the right value from data. According to a survey …

A gentle introduction to Geometric Deep Learning

2022-05-23 | Vitale Sparacello

9 minutes read

Intro AI has changed our world: intelligent systems are part of our everyday life, and they are disrupting industries in all sectors. Among all the AI disciplines, Deep Learning is the hottest right now. Machine Learning practitioners successfully implemented Deep Neural Networks (DNNs) to solve challenging problems in many scientific fields. Nowadays, cars can see how busy a crossroad is, it’s possible to have pleasant conversations with imaginary …

Recipe for a Data Burger

2022-05-16 | Kevin Missoorten

5 minutes read

At dataroots, we like to present our service portfolio by means of a burger. Like a burger, the ‘pièce de résistance’ is the Artificial Intelligence value-chain, with data pipelines transporting & providing quality data from source to model, simple or complex models mashing the data into insights and finally integration of those insights into the day-to-day business processes to put these hard-earned insights to work. Also like a burger, the way to facilitate the efficient consumption of the …

Arty Farty - AI Song Contest 2020

2022-05-12 | Virginie Marelli

1 minute read

Since 2020, dataroots has participated in the AI Song Contest. Since the contest is coming soon and the team is working hard, we thought we would tease you with our previous entries! 6 dataroots colleagues grouped together with only one mission: generating fully automated songs by clicking a button! The final algorithm generates songs by traversing the shortest distance in MIDI harmonies sampled by Variational Autoencod…

Next best action recommendation - part 2: causal inference techniques

2022-05-09 | Silke Plessers, Sandy Moens, Virginie Marelli

11 minutes read

Causal inference is used to determine whether an action on a selected population is effective and by how much. It is extremely useful to evaluate the average treatment effect of a campaign. For this, you need to compare the outcome difference between a treatment group and a control group. In this post, we explain techniques that can be used to evaluate an action even when a proper control group does not exist. We will explain how causality can still be inferred and tested and how much we can deduce fr…

Statistics Saga 1: Matrix Factorization

2022-05-02 | Chiel Mues

5 minutes read

This blogpost will give you a gentle (re)introduction to the idea of matrix factorization, an enormously useful technique in statistics and machine learning. Matrix Factorization Matrix factorization is a technique to decompose or factorize a matrix into a product of more fundamental matrices. If that sounds a bit confusing, it's analogous to factorizing a number: 48 = 4 × 12 or 48 = 6 × 8. Of course, a matrix is more complex than a number, so many kinds of factorization are possible. Perhaps the easi…

Trends in statistical visualisation

2022-04-25 | Lode Nachtergaele

4 minutes read

Machine learning engineers are at the intersection of programming (computer science), math/statistics/machine learning and domain knowledge/communication. Although a lot of progress has been made in the first two, their advances are constrained by the ability to convey their results to the business owners of a problem. Graphical representation can be of enormous help to bring across complex results. In this blogpost, we discuss the latest trends in the visualisation of statistical results. State-of-the-ar…

Next best action recommendation - part 1: measuring the effect of a campaign

2022-04-11 | Silke Plessers, Virginie Marelli, Sandy Moens

10 minutes read

Campaigns, you said? Great, but which one? Multiple ways exist to nudge customers: for instance calling, sending out emails, offering discounts, etc. The channels are various and the content of the marketing messages is even more diverse. In this article we explain how to optimize a marketing campaign and what to do when you did not implement the ideal strategy but have data that can help you derive important insights. From churn prediction to business value Not so long ago, in a previous post…

Non-existent quotes by GPT-3

2022-04-11 | Bart Smeets

3 minutes read

Over the weekend I had the pleasure of talking to Gertrude Poirot Torricelli III, long for GPT-3. Her insightful advice and hopeful views on the future of society and the world at large inspired me to share her musings with the rest of you. She was very open to this idea and I will be sharing her advice in quote form daily during the week of April 11th, 2022. All quotes will be collected down here. 👇 Monday Tuesday Wednesday Thursday Friday That wraps up this non-existent quotes series! ✅ …

What is architecture?

2022-04-04 | Wim Van Leuven

5 minutes read

As a growing data consultancy boutique, we get more and more requests to review and architect data platforms. While growing, we are also maturing the architecture practice at Dataroots. What is Architecture? We can obviously not discuss architecture without some reflection on the term itself in the context of ICT solutions in general, and data platforms specifically. A topic which immediately proves to be not that easy to grasp. When brainstorming the subject, we easily talked about the respon…

Is AI an eco disaster?

2022-03-28 | Virginie Marelli

6 minutes read

You hear more and more that technology in general is not so eco-friendly. What about AI? Is it also not so eco-friendly? What is the impact of developing AI models and how good is AI for the planet? With this article, we try to demystify and understand the impact of AI on the planet and how this could be reduced. What are the resources needed to build an AI? Building AI models requires a lot of resources, especially if you are building models like BERT, GPT, or in general, deep neural net…

Open source alert: Rootsstyle

2022-03-21 | Virginie Marelli

1 minute read

You love Matplotlib because it's easy to use and you can generate plots quickly? It's now possible to create those exact same Matplotlib plots in the dataroots theme with Rootsstyle! Rootsstyle works with any visualization tool that builds upon Matplotlib (seaborn, pandas). Check it out!

A light introduction to transformers for NLP

2022-03-21 | Murilo Cunha

5 minutes read

If you have taken a look into Natural Language Processing (NLP) in the past few years, you have probably heard of transformers. But what are these things? How did they come to be? Why are they so good? How to use them? A good place to start answering these questions is to look back at what was there before transformers, when we started using neural networks for NLP tasks. Early days One of the first uses of neural networks for NLP came with Recurrent Neural Networks (RNNs). The idea there is to mimic huma…

Marketing strategy - How to go beyond propensity models

2022-03-16 | Virginie Marelli

6 minutes read

When you start integrating data into your marketing strategy, the first questions that need to be answered are often: who's going to churn in the next couple of months? To whom should we best sell what product? Does that person need this product? To answer these types of questions one can build a model based on historical data. We look for customers that demonstrated the desired behavior in the past (churn, buying a product, etc.) and what they looked like (characteristics and behavior). The assump…


Internships

2022-03-09 | Virginie Marelli

2 minutes read

Want to discover if a career in AI is something for you? Apply for one of our cool internships or propose your own! We are already planning next year's internships; here is a sneak peek into what it entails 🤖 Looking for an internship? Internships are the perfect way for you to see if you would like to pursue a career in AI and for us to see if there's a match for a long-term collaboration! There is not enough time in a human life to develop all the cool ideas that we have in mind so it…


How to make AI fair and influence data science projects.

2022-03-08 | Tim Leers

6 minutes read

The problem. Artificial intelligence (AI) is driving the rapid transformation of industries. However, the exponential rate of that transformation is difficult to manage for legislators. Moreover, there is no industry standard to ensure AI is safe and beneficial. New applications are introduced at breakneck speed, oftentimes without sufficient consideration of their potential societal impact. AI promises to enable the scalable automation of almost any decision-making system. In doing so, we amp…


Deep learning model compression

2022-02-28 | Toon Van Craenendonck

4 minutes read

Deep neural networks offer unparalleled performance for many applications, but running inference can be resource-intensive. Model optimization comes in to help here, reducing disk storage, memory usage or compute requirements. This can be useful for deployment on the edge (to run models where it otherwise would not be possible), as well as for the cloud and on-premise (to run models faster, or allow more models to be stored in-memory simultaneously). Moreover, reduced energy requirements of opti…


Gender Equality at the Olympics

2022-02-25 | Thibauld Braet

6 minutes read

Last week, the Winter Olympics in Beijing came to an end. For Belgium, this meant a successful edition with one female (Hanne Desmet) and one male (Bart Swings) medalist. Belgian medals at the Winter Olympics are pretty rare anyway, but Hanne Desmet's medal was the first won by a Belgian woman since the 1948 Games in Sankt Moritz! At dataroots, we highly value diversity, putting the topic regularly on the agenda to see if everybody thinks we're on the right track. Over the past decades, the topic ha…


What the Duck?!

2022-02-23 | Bruno Quinart

4 minutes read

Unboxing an embeddable analytical database. DuckDB is a recent addition to the analytical database world. And it takes an interesting approach: it wants to be the SQLite for analytics. DuckDB was developed by Mark Raasveldt and Hannes Mühleisen, two database researchers at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, the Dutch National Research Institute for Mathematics and Computer Science. CWI is not just any research institute. For a few decades now, the team has been pushing the …


The explainable AI boom: Why is XAI important? And why now?

2022-02-19 | Tim Leers

4 minutes read

As we alluded to in our trends post, the number of researchers, developers and companies that focus on eXplainable AI (XAI) is growing faster each year. 💡 XAI is an umbrella term for methods, algorithms and tools that increase insight into the inner workings of AI. This is in contrast wit…


Marriage problem - a matching theory story

2022-02-14 | Virginie Marelli

4 minutes read

Matching theory (a branch of game theory) is a mathematical framework attempting to describe the formation of mutually beneficial relationships over time. What other topic could we possibly have chosen for Valentine's Day? Actually, this is a very serious and important field of research in economics. And, in 2012, Alvin Roth and Lloyd Shapley were awarded a Nobel Prize …


Data Quality for Notion Databases 🚀

2022-02-06 | Ricardo Elizondo

5 minutes read

> Notion ➕ Great Expectations = 🚀 If you've ever heard of or used Notion (especially their databases) and Great Expectations, you can already imagine what this is about 😉. If not, find a quick ELI5 below: See our GitHub for more technical details and detailed instructions. 👶 ELI5: Great Expectations > "Great Expectations is a shared, open standard for data quality. It helps data teams eliminate pipeline debt, through data testing, docu…


Trends in XAI tools & research at NeurIPS 2021

2022-02-04 | Tim Leers

10 minutes read

eXplainable AI or XAI is crucial to ensure stakeholder and public trust, as well as reliability, particularly in high-stakes contexts where AI decisions can impact lives. Open-source contributors, researchers & companies are stepping up their game by providing ever-more ambitious and inventive methods to ensure transparent, interpretable and ultimately, explainable AI. As a consequence, XAI methods are sprouting up like mushrooms, meaning that the decision on which method to use is becoming inc…


Data science and notebooks = databooks: a love story

2022-02-02 | Murilo Cunha

4 minutes read

If you're not new to Python and data science, you've probably heard of Jupyter notebooks. But if you haven't, here's the gist: it's an interactive environment, meaning you can run little bits of code and see the output, store variables in memory, etc. That makes notebooks a good tool for experimentation, reporting and visualizations. And because of that, it's a popular tool of choice for data science in general. And this is why you see a lot of notebooks in places like Kagg…


What we are excited about for 2022!

2022-01-30 | Virginie Marelli

11 minutes read

Foreword In this post, we have gathered our experts' views on new developments in AI. However, AI is a broad field and we do not pretend to have a complete understanding of the whole landscape. Our view is necessarily biased by our activities as an AI service provider and our Belgian market presence. Bearing this in mind, we examine different trends that we have spotted in AI across industries, research, tooling and much more. The goal of this article is to get an overview of the landscape and …


Publication Alert: Tim Leers

2022-01-28 | Bart Smeets

1 minute read

🙌 An article that our very own Tim Leers co-authored just got released. A snippet of the summary: > Engagement and training of community health workers (CHWs) have demonstrated their value in different conditions. Despite repeat epilepsy trainings of CHWs in Northern Rwanda, the treatment gap remained high. We hypothesized that effectiveness of CHWs on mobilization of patients living with epilepsy (PwE) could be improved using a va…


DataTrends 2022

2022-01-27 | Virginie Marelli

0 minutes read

Watch our experts share our views on:
* What type of data do we currently work with?
* How are the AI use cases evolving?
* How much time does it take to leverage value from AI/data?
* What has been the biggest evolution in infrastructure to support the AI cases?
* Where is the market in terms of AI adoption and maturity?
* What is the role of the EU citizens, how are they included in AI projects/development?


Open Source is at the heart of the way we work

2022-01-24 | Sam Debruyn

2 minutes read

> Why would the chef give away the recipes for the dishes he is famous for? What does the engineer achieve from sharing his schematics for that new technological marvel? Working open source is like sharing your secrets. These contemplations are often pronounced by people outside or not familiar with software development. Let's have a look at a couple of examples of why and where Open Source proves its value in our day-to-day business. Essential ingredient Free Open Source Software (FOSS) is an…


EU regulations for AI

2021-11-14 | Jan Yperman

2 minutes read

While the advent of advanced artificial intelligence (AI) systems in our daily lives is an absolutely thrilling prospect (and increasingly a reality), it's important to keep in mind the impact such systems could have on society if left unchecked. And that doesn't just apply to killer robots or evil corporations either. There are quite a few ways AI can do more harm than good when implemented without the proper precautions, even when deployed with the best of intentions. For example, a company mi…

