By Sam Debruyn, Dishani Sen, Paolo Léonard, Sander Van Grunderbeeck
Last week, the four of us attended both days of Big Data London, Europe's biggest yearly data conference. In this post, we're sharing our highlights from the event. A few recurring topics stood out as the most common themes throughout the conference. And since it's still 2023, you couldn't enter any session or exhibitor booth without hearing about Generative AI.
Data Fabric, the new hype term
The data platform world is filled with hype terms like modern data stack, data fabric, open, and self-serve analytics. The data fabric seems to be a recent one that sticks. A data platform can be as simple as a storage solution with an ETL/ELT tool to move or transform data. A data fabric is then a more complete solution that includes other essential features like DataOps & DevOps, data governance solutions, and visualization functionalities.
What is a Lakehouse again?
Another trend that definitely seems to last is that of the lakehouse. Databricks, Microsoft, Starburst, Dremio, and Clickhouse all offer their own version of the lakehouse, all competing with each other and their bigger brother, Snowflake. We noticed that it became a thing to position a data platform offering as an open platform, to compete with the closed offering of Snowflake. There are two things we can conclude from this: Snowflake has become the most popular platform, and openness and cross-vendor compatibility have become major selling points to compete with it.
So let's say you are in the market for a new data platform and you want to investigate the available options to build a lakehouse. Let's first look at what all vendors have in common; what defines a lakehouse?
A lakehouse is the evolution of the data lake combined with the data warehouse. Every lakehouse builds upon a data lake where the data is stored. Then, a SQL engine exposes that data for data consumption through dashboarding, machine learning, reverse ETL, and more. The main difference with the previous decade is that data doesn't have to be stored in the SQL engine itself anymore. It can live in the data lake, in any open format.
Delta Lake: winner of the data format battle
Next, you'd have to pick a format to store your data in. Here, Apache Parquet used to be the most obvious choice. Then, the transactional layers on top of Parquet came along: Delta Lake, Hudi, and Iceberg. Delta Lake seems to be the clear winner in terms of compatibility here. All data tooling (not just lakehouses) we saw at the conference supported Delta Lake, while only very few offered support for Iceberg or Hudi.
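To make the idea of a transactional layer on top of Parquet concrete, here is a minimal, purely illustrative sketch of how a Delta-style transaction log works: each commit is a numbered JSON file recording which data files were added or removed, and replaying the commits in order yields the current table state. The file naming and action format below are our own simplification, not the actual Delta Lake protocol.

```python
import json
import tempfile
from pathlib import Path

def commit(log_dir: Path, version: int, actions: list[dict]) -> None:
    """Write one commit file, mimicking Delta's _delta_log JSON commits."""
    (log_dir / f"{version:020d}.json").write_text(
        "\n".join(json.dumps(a) for a in actions)
    )

def live_files(log_dir: Path) -> set[str]:
    """Replay all commits in order to find the current set of data files."""
    files: set[str] = set()
    for commit_file in sorted(log_dir.glob("*.json")):
        for line in commit_file.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return files

# Demo: two commits — add two Parquet files, then compact them into one.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp)
    commit(log, 0, [{"add": "part-0.parquet"}, {"add": "part-1.parquet"}])
    commit(log, 1, [{"remove": "part-0.parquet"},
                    {"remove": "part-1.parquet"},
                    {"add": "part-2.parquet"}])
    print(sorted(live_files(log)))  # ['part-2.parquet']
```

Because readers only ever see the files reachable through committed log entries, this design gives atomic, versioned changes on top of otherwise immutable Parquet files.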
The Open Lakehouse
What all lakehouse vendors have in common is that they pride themselves on being open. But what does that actually mean? All of them can work with your data stored in Delta Lake on any of the common cloud providers (AWS, Azure, GCP). While data vendors used to try to lock you into their ecosystem by staying incompatible with the rest of the market, the opposite has now become true. This is an evolution we really like to see. The cloud has also become a commodity in this context: you can work with whichever cloud provider you prefer without having to miss out on any features. Even Microsoft's new platform, Fabric, can work with data stored on AWS S3 and soon with Google Cloud Storage as well.
Once your data is available in your data platform (or data fabric?), you'd usually have to start transforming that data to build a data product. We can distinguish 2 clear categories of tooling here.
The low-code tooling
Big Data London featured talks and exhibitor booths with low-code data transformation tooling like Matillion, Alteryx, Coalesce.io, Rivery, and more. Here, it seems that Alteryx is the older, more commonly known solution, while Coalesce seems to be one of the more popular and complete options for newer platforms. Other vendors like Microsoft stay neutral on this front by providing excellent support for both low-code and code-first solutions.
As dataroots, we believe that complex operations like data transformations are usually best written as code, rather than as a combination of boxes and arrows locked away in a vendor-specific format.
The code-first tooling
While Apache Spark was historically the go-to tool for any kind of data transformation, we saw once again at Big Data London that dbt has taken over this role. Almost every common data tool has some kind of integration with dbt. We didn't attend any talks focussing on dbt itself, but it was featured tens of times.
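The core idea behind dbt is that a transformation is just a named SELECT statement, materialized as a table or view, with models built on top of each other. As a hedged illustration of that "SQL as code" idea (using the stdlib sqlite3 module as a stand-in engine, not dbt's actual API), consider:

```python
import sqlite3

# A dbt-style "model" is a named SELECT that gets materialized as a table.
# These model names and the runner below are an illustrative stand-in,
# not dbt itself.
MODELS = {
    "stg_orders": "SELECT id, amount FROM raw_orders WHERE amount > 0",
    "fct_revenue": "SELECT SUM(amount) AS revenue FROM stg_orders",
}

def run_models(conn: sqlite3.Connection) -> None:
    # Materialize each model in order; later models build on earlier ones,
    # like a (trivially linear) dbt DAG.
    for name, select_sql in MODELS.items():
        conn.execute(f"CREATE TABLE {name} AS {select_sql}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [(1, 10.0), (2, -3.0), (3, 5.0)])
run_models(conn)
print(conn.execute("SELECT revenue FROM fct_revenue").fetchone()[0])  # 15.0
```

Because the models are plain text, they can be versioned, reviewed, and tested like any other code, which is exactly the appeal over box-and-arrow tooling.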
As the volume and complexity of data continue to grow, organisations are increasingly recognising the importance of effective data governance to ensure data accuracy, security, and compliance. At Big Data London, we had the opportunity to explore a multitude of stands covering various aspects of data governance, including data quality management, data privacy, regulatory compliance, and data security. From BigID to Collibra, the big data governance players had their stands to showcase their products. However, smaller players offering solutions with fewer features were nonetheless very interesting. The most interesting ones were related to data quality and data catalogs.
Data Quality and Data Contracts
Data contracts took their share of the spotlight as the newest addition to the existing data mesh ecosystem and a way to extend current data quality implementations with additional features like SLAs and stakeholders. Agilelab's Witboost showcased a compelling solution during the presentation, demonstrating the seamless integration of a standardized data contract interface into existing workflows. This innovative approach not only streamlines data management but also enhances accountability and collaboration within an organisation. Moreover, the data quality panel brought first-hand experiences from companies like HelloFresh which implemented data contracts in their applications.
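To sketch what "data quality plus SLAs and stakeholders" can look like in practice, here is a minimal, hypothetical data contract in plain Python: a schema bundled with an owner and a freshness SLA, and a validator that reports violations. The field names and structure are our own simplification, not Witboost's or any vendor's specification.

```python
from datetime import datetime, timedelta

# An illustrative data contract: a schema plus SLA and ownership metadata.
# All names here are made up for the example.
CONTRACT = {
    "owner": "orders-team@example.com",
    "freshness_sla": timedelta(hours=24),
    "schema": {"order_id": int, "amount": float},
}

def validate(rows: list[dict], last_updated: datetime, contract: dict) -> list[str]:
    """Return a list of contract violations (empty means the data conforms)."""
    violations = []
    # SLA check: the producer promised data no older than the freshness SLA.
    if datetime.now() - last_updated > contract["freshness_sla"]:
        violations.append("freshness SLA breached")
    # Schema check: every row must carry every field with the right type.
    for i, row in enumerate(rows):
        for field, field_type in contract["schema"].items():
            if not isinstance(row.get(field), field_type):
                violations.append(f"row {i}: bad or missing field '{field}'")
    return violations

rows = [{"order_id": 1, "amount": 9.99}, {"order_id": 2}]
print(validate(rows, datetime.now(), CONTRACT))
# → ["row 1: bad or missing field 'amount'"]
```

The point of the contract is that the owner and the SLA live next to the schema, so a violation immediately tells consumers both what broke and who to talk to.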
Focusing more on data quality, two tools really stood out from the crowd: Soda and Giskard. Soda focusses more on tabular datasets, while Giskard has impressive testing features for AI and ML models, with support for LLM testing on the horizon. Giskard offers an interesting scan functionality which is able to detect potential issues in your ML model. The tests resulting from this scan can be combined into a testing suite, which can be extended with your own tests and used to thoroughly test your ML models.
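The scan-then-suite workflow can be sketched in a few lines of stdlib Python. This is loosely inspired by the flow described above, with toy check logic of our own invention rather than Soda's or Giskard's actual APIs: a scan auto-generates checks per column, and the resulting suite can be extended with custom tests before running everything at once.

```python
# Toy column checks; real tools ship far richer, statistical ones.
def no_nulls(column):
    return all(v is not None for v in column)

def non_negative(column):
    return all(v >= 0 for v in column if v is not None)

def scan(dataset: dict) -> list:
    """Auto-generate a suite of (name, check) pairs from the data itself."""
    suite = []
    for name, column in dataset.items():
        suite.append((f"{name} has no nulls", lambda c=column: no_nulls(c)))
        # Only numeric columns get a range check.
        if all(isinstance(v, (int, float)) for v in column if v is not None):
            suite.append((f"{name} is non-negative",
                          lambda c=column: non_negative(c)))
    return suite

dataset = {"age": [25, 31, None], "name": ["Ann", "Bob", "Cleo"]}
suite = scan(dataset)
# Extend the generated suite with your own test, then run everything.
suite.append(("at least 3 rows", lambda: len(dataset["age"]) >= 3))
results = {name: check() for name, check in suite}
print(results)
```

Running the suite yields a pass/fail verdict per check, so a failing auto-generated check (here, the null in `age`) surfaces alongside your hand-written ones.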
At Big Data London, three compelling data catalog tools took the spotlight: Atlan, CastorDoc, and Sifflet. Sifflet stands out as it not only provides a comprehensive data catalog but goes beyond by integrating anomaly detection and team collaboration features like root cause analysis and incident management. On the other hand, Atlan and CastorDoc offer valuable integration options with other tools, enabling them to complement their capabilities and fill in any gaps within their toolset, notably Soda.
Dataroots will certainly conduct a more in-depth investigation into Sifflet and CastorDoc.
85% of data projects fail. None of the above guarantees a successful data project at the end of the day (or years, even a decade!). Technology is just one (initially smaller) part, but still an important one. The right people, with the right mindset, along with a rigorous strategy, hold the key to a successful data project in equal proportions.
How to build a Data Culture?
A successful "Data Culture" in a company looks like this: everyone understands the value of data and can use it to improve business outcomes. One of the major themes at Big Data London was that "Data Culture is for everyone, not just your data teams". Data fluency and training that enable holistic collaboration on data strategy across departments are what we must aim for in our organisations.
Why do most data projects fail?
Having a data strategy isn't enough. There needs to be a plan and its execution. A good starting point for planning is answering these questions:
Who? (How many analysts, data scientists, data engineers, or analytics engineers do you need, and what will be their specific responsibilities?)
What? (What is the business value? Just saying you want AI isn't enough. Define a clear and attainable path to value creation.)
When? (When will you generate value? Stay away from unattainable timelines.)
Where? (Where will data be processed? Data teams need to have a clear architecture.)
And finally, How? (Data teams need a clear plan that they are executing. This plan needs a singular focus, rather than work scattered in different directions.)
Data is everywhere
To no one's surprise: data is everywhere. Big Data London hosted various very interesting talks about data-related tools and the most recent advances in the world of data, of which multiple are highlighted in this blog post. Next to that, the event also offered insights and success stories on how organisations use all sorts and volumes of data to achieve data-driven performance. Whether it is to put guest speaker Tim Peake into space, enable HelloFresh to deliver millions of meal boxes across the globe, or help McLaren put Lando Norris on pole position for the Formula 1 race weekend, the trend where data really drives different operations to achieve value is here to stay, and we at dataroots are very excited to be in the midst of it!
See you next year?
With more than 15,000 attendees, there is no other conference in Europe like Big Data London. It's the place where data practitioners from all over Europe gather. We're excited to see next year's schedule, and if you're planning on going, look out for the dataroots logo! 😉