By Paolo Léonard
In the ever-evolving digital landscape, where data is the bedrock of decision-making and innovation, ensuring its quality has become paramount. As organizations navigate the intricate web of data integration, transformation, and utilization, the need for robust data quality solutions has surged to the forefront.
In this analysis, we explore different data quality tools, carefully examining their capabilities, integrations, and performance. Our goal is to offer a comprehensive yet selective overview of the present data quality arena, guiding readers and organizations in making well-informed decisions tailored to their distinct data environments and objectives.
Data quality tools
To this end, we've carefully chosen a notable group of data quality tools to examine closely:
- Great Expectations: Great Expectations stands out as one of the most prominent, if not the foremost, open-source data quality tools available today. Many companies rely on GX to ensure their data quality, and the reasons behind this trust are evident: it offers over 100 expectations and integrations with major cloud providers.
- Soda Core and Cloud: Soda is another open-source data quality tool that helps data engineers incorporate testing into their pipelines. It features its own domain-specific language (SodaCL) and has recently introduced SodaGPT, a chat assistant designed to assist with defining quality rules. Soda Cloud is the hosted version.
- dbt test: Given dbt's rising popularity, it seemed fitting to include its testing feature in the comparison.
- whylogs: whylogs is an open-source profiling tool that promises continuous monitoring of your datasets based on its powerful profile feature.
We assess the different tools based on the three following aspects:
- Infrastructure compatibility: A crucial aspect of a data quality tool is its compatibility with existing cloud environments and data pipelines. It's essential to consider whether the tool provides native integration with orchestration tools like Airflow or Prefect.
One notable advantage of the tools we are evaluating is that they are all Python-based. This characteristic inherently provides a high degree of flexibility, allowing for seamless integration across various platforms. However, this section focuses on evaluating the connectors that are natively supported by the data quality tools to facilitate the integration.
- Data integration: Beyond infrastructure, the range of data integrations offered by a data quality tool often plays a pivotal role in tool selection. You should assess whether the tool integrates with a wide variety of data sources or specializes in specific data types, such as tabular data.
- Reporting and dashboard integration: Data quality, by itself, isn't useful unless it can be translated into actionable insights. The third critical aspect is integration with reporting, incident management, or dashboard tools. While some data quality tools might include these features, it's equally important to ensure they can seamlessly integrate with existing platforms that handle dashboard creation and report generation.
In essence, a comprehensive evaluation of a data quality tool should include its infrastructure compatibility, data integration capabilities, and its potential to seamlessly blend into your reporting and dashboard ecosystem.
These three key aspects form the baseline when evaluating data quality tools. However, we also look at additional facets that are nice to have but not strictly necessary:
- Features and functionality: Does the tool offer advanced features like data profiling, automated assistance, and options for data cleansing and standardization? Additionally, does it extend its capabilities beyond basic data quality tasks?
- Ease of use: Consider how user-friendly the tool is during setup and when defining data quality rules. A tool that is easy to configure and use can streamline your data quality processes.
- Data quality rule coverage: Examine the tool's library of pre-defined data quality rules and its flexibility in allowing you to create custom rules. This helps you tailor the tool to your specific data quality requirements.
- Pricing and licensing: Investigate the tool's pricing model, including whether it is open source or offers a paid hosted version. Understand the benefits of the paid version and any installation options (e.g., hybrid solutions for handling sensitive data).
- Meta indicators: Assess the tool's overall maturity and community support. Factors like the number of stars, the volume of reported issues, and the activity level on platforms like Slack and GitHub can indicate the tool's maturity and growing user base.
With our evaluation criteria in place and a selection of data quality tools ready for comparison, it's time to begin!
For the sake of clarity and brevity, we won't cover every criterion for each tool. Instead, we'll focus on their strengths and weaknesses to provide a comprehensive overview.
The first tool we'll investigate is the one and only Great Expectations. With more than 8900 ⭐ on GitHub, Great Expectations stands out as the most popular open-source tool, with a very active community on Slack as well.
Great Expectations is a Python-based data quality tool built by Superconductive, which secured $40M in funding, to improve the confidence in your data, no matter what.
Great Expectations offers a wide range of integrations towards orchestrators and compute engines:
- For orchestration, both Prefect and Airflow are supported. Specifically, Prefect has its dedicated task definition, while Airflow has an operator definition, backed by Astronomer.
- When it comes to compute engines for file system sources, both Pandas and Spark are compatible as external engines. For handling larger files, Spark is the recommended option due to its ability to distribute tasks and manage data efficiently.
- On the SQL side, GX taps into SQLAlchemy for support, enabling compatibility with the most popular SQL databases. Additionally, Great Expectations works with Databricks, Snowflake, and BigQuery.
In terms of data source connections, GX offers natively supported connectors to Azure, GCP, and AWS storage.
One of the strongest suits is the reporting capabilities of GX. A multi-level approach is available to determine what to hide or reveal in the validation output report. Interactive HTML reports, called Data Docs, are built automatically to provide a clear visualization of data quality results, which can be shared with non-technical users. The output validation results are also available in JSON format. This allows for integration into existing reporting solutions or dashboards.
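As a rough illustration of how the JSON output could feed an existing dashboard, the sketch below reduces a validation result to a few fields using only the standard library. The payload is a simplified, hand-written stand-in. Its field names (`success`, `results`, `expectation_config`) are assumptions modeled on the output we have seen, so compare them with the JSON your own GX version produces before relying on them.

```python
import json

# Hand-written, simplified stand-in for a validation result file.
# Field names are assumptions; compare them with the JSON your GX version produces.
RAW_RESULT = """
{
  "success": false,
  "results": [
    {"success": true,
     "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null",
                            "kwargs": {"column": "order_id"}}},
    {"success": false,
     "expectation_config": {"expectation_type": "expect_column_values_to_be_between",
                            "kwargs": {"column": "amount", "min_value": 0}}}
  ]
}
"""


def summarize(raw: str) -> dict:
    """Reduce a validation result to a few fields a dashboard could ingest."""
    payload = json.loads(raw)
    results = payload.get("results", [])
    # Keep only the failed expectations, with the column they apply to.
    failed = [
        {
            "expectation": r["expectation_config"]["expectation_type"],
            "column": r["expectation_config"]["kwargs"].get("column"),
        }
        for r in results
        if not r.get("success", False)
    ]
    return {
        "success": payload.get("success", False),
        "evaluated": len(results),
        "failed": failed,
    }


summary = summarize(RAW_RESULT)
print(json.dumps(summary, indent=2))
```

The flattened summary could then be written to a table or pushed to whichever reporting tool is already in place.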
On top of that, webhooks can be defined to trigger events or incidents in Slack or Teams.
Another interesting feature is the metric and evaluation parameter stores. These stores save the results of data quality validation runs, allowing you to reuse them in subsequent runs or analyze metrics trends more effectively.
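To make the idea of reusing a previous run's metrics more concrete, here is a minimal sketch in plain Python. `MetricStore` and `within_tolerance` are hypothetical names invented for this illustration and are not part of the GX API. The sketch only shows the underlying idea: keep a metric per run and compare the next run against it.

```python
from dataclasses import dataclass, field


@dataclass
class MetricStore:
    """Toy in-memory store keeping one value per metric per run (hypothetical, not a GX class)."""

    history: dict = field(default_factory=dict)

    def record(self, metric: str, value: float) -> None:
        self.history.setdefault(metric, []).append(value)

    def previous(self, metric: str):
        values = self.history.get(metric, [])
        return values[-1] if values else None


def within_tolerance(store: MetricStore, metric: str, current: float, tolerance: float = 0.1) -> bool:
    """Pass when the current value is within +/- tolerance of the previous run."""
    last = store.previous(metric)
    if last is None:
        return True  # first run: nothing to compare against
    return abs(current - last) <= tolerance * abs(last)


store = MetricStore()
row_counts = [1000, 1040, 700]  # made-up row counts for three consecutive runs
outcomes = []
for row_count in row_counts:
    outcomes.append(within_tolerance(store, "orders.row_count", row_count))
    store.record("orders.row_count", row_count)

print(outcomes)
```

In this made-up series, the third run drops by far more than 10% compared with the second one, so it is the only run flagged.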
However, during our various experiences with GX, we encountered some pain points:
- Learning curve: Understanding and defining expectations can be a bit complex for newcomers on top of all the custom terminologies that you need to understand before even using GX.
- Configuration overhead: Setting up GX can be cumbersome. You need to create the data context, configure the execution engine and the data connectors, and then create the expectation suite with batch data. Once you have defined the suite, you finally create a checkpoint to schedule a data quality run. That's a lot.
- Performance concerns: While GX works well for many smaller datasets, performance can become a concern with large datasets or high-frequency validation, especially without Spark, since the only other available option is Pandas. On top of that, building a lot of Data Docs causes severe performance issues.
- Limited customization: The Data Docs feature is powerful but isn't easily customizable. Additionally, building new custom expectations is difficult and way too granular.
- Cloud connectivity: While a lot of connectors exist, we faced issues connecting securely to Azure as well as configuring cloud-based stores.
The primary drawback of GX is its lack of a stable version. This tool has been in beta for several years, and upgrading minor versions will potentially disrupt your setup and make a lot of the documentation irrelevant. Furthermore, while we've had numerous presentations about GX Cloud, we haven't had the opportunity to test it firsthand.
To conclude, GX is a good tool with a lot of integrations and features, like Data Docs and its stores. However, the lack of a proper v1 and the drawbacks we faced during our different projects make it difficult to confidently choose this tool for our data quality needs.
Soda is a Belgian company that builds the open-source data quality library SodaCore, alongside its paid offering SodaCloud, which is based on SodaLibrary. They raised $13M in Series A funding in 2021 and are comfortably sitting on 1500 ⭐️ on GitHub.
Just like Great Expectations, Soda offers an extensive range of integrations with existing tools:
- Prefect supports an official Soda task. For Airflow, however, there is no official Soda operator, even though an extensive guide on how to implement Soda with Airflow exists.
- Soda offers multiple connectors with the most popular data warehouses. It focuses on databases so its scalability is inherently determined by the database's capabilities.
- Soda is compatible with Spark, Pandas, and Dask, but only through programmatic scans.
Soda's primary strength lies in its user-centric approach, emphasizing simplicity and efficiency. Starting with Soda requires minimal setup: by just defining your configuration and check files, you can swiftly move into the monitoring phase. This straightforwardness doesn't come at the cost of flexibility. Additionally, on top of the extensive amount of data quality checks that Soda provides, you can also define SQL queries to integrate more specific tests that might not be covered by Soda.
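The combination of declarative checks and custom SQL can be sketched with the standard library alone. The example below is not SodaCL and does not use Soda: the check layout (`name`, `query`, `max_allowed`), the `run_checks` helper, and the `orders` table are all hypothetical, and an in-memory SQLite database stands in for a warehouse. It only illustrates the pattern of a SQL query that yields a metric, compared against a threshold.

```python
import sqlite3

# In-memory database standing in for a warehouse table (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 20.0, "BE"), (2, -5.0, "BE"), (3, 12.5, None), (4, 40.0, "FR")],
)

# Declarative checks: a name, a SQL query returning one number, and the largest accepted value.
# This layout is a hypothetical illustration, not SodaCL syntax.
CHECKS = [
    {"name": "no_negative_amounts",
     "query": "SELECT COUNT(*) FROM orders WHERE amount < 0",
     "max_allowed": 0},
    {"name": "few_missing_countries",
     "query": "SELECT COUNT(*) FROM orders WHERE country IS NULL",
     "max_allowed": 1},
]


def run_checks(connection, checks):
    """Run each query and compare the resulting metric with its threshold."""
    outcomes = {}
    for check in checks:
        (value,) = connection.execute(check["query"]).fetchone()
        outcomes[check["name"]] = {"value": value, "passed": value <= check["max_allowed"]}
    return outcomes


scan_results = run_checks(conn, CHECKS)
for name, outcome in scan_results.items():
    print(name, outcome)
```

Keeping the checks as plain data, separate from the code that runs them, is what makes this style quick to set up and easy to review.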
A notable feature under experimentation is the DuckDB connector. This is especially advantageous for those who wish to rapidly validate and test their checks. With the DuckDB connector, you can execute these tests either on your local machine or, for greater scalability and real-world testing conditions, directly in a cloud environment.
Further elevating its value proposition is SodaCloud, Soda's cloud-hosted variant. It's not just a replica of the core tool hosted in the cloud; it's an enriched platform. Through its intuitive user interface, users can effortlessly build data agreements, keeping all stakeholders on the same page. Monitoring becomes a breeze with the ability to closely follow metrics, ensuring that they remain within the defined parameters and instantly spotting any anomalies. Moreover, for those who require advanced functionalities, features like reconciliation and the innovative SodaGPT are readily available.
In the event of data discrepancies or issues, SodaCloud shines with its comprehensive incident management system. Instead of merely alerting you, it aids in tracking the incident, ensuring timely resolution. Furthermore, understanding that many teams already have established workflows, SodaCloud seamlessly integrates with popular issue trackers, making it a harmonious addition to any data pipeline.
In our exploration of Soda, we pinpointed several areas of concern:
- SQL-Centric Design: Soda is tailored predominantly for SQL databases, positioning it optimally for such environments. That said, it's not confined exclusively to SQL. Through the programmatic scan feature, integration with other data sources remains feasible.
- Documentation Disparity: Soda's official website shines with detailed documentation for its cloud version and the SodaCL tool. However, the documentation for the open-source variant, SodaCore, is hosted on GitHub, which makes it less convenient to use and navigate.
- SodaCL's Learning Aspect: SodaCL, while intuitive especially when paired with SodaGPT, introduces a new syntax and structure. This entails an inherent learning phase, which users must navigate to fully harness its capabilities.
In conclusion, Soda, based in Belgium, offers the open-source SodaCore and advanced SodaCloud for data quality. With notable Series A funding and a robust GitHub presence, it's comparable to leaders like Great Expectations in integration. While user-centric and versatile, its SQL focus and GitHub documentation for SodaCore present challenges. Yet, Soda remains a formidable contender in the data quality domain.
Next, we are going to dive into dbt. dbt is a robust command-line interface (CLI) tool that empowers users to manipulate data in their data warehouses using SQL statements. It has gained significant traction in the data engineering community due to its SQL-focused design, modular structure, and user-friendly nature. Today, we're particularly interested in exploring dbt's integrated testing feature, which ensures data quality and consistency.
We will only cover the features dbt test brings to the table, as dbt's integrations are extensively discussed in numerous other documents.
dbt testing provides singular tests and generic tests:
- Singular Tests: These are SQL queries designed to identify and return records that fail specific conditions. If you find yourself frequently using a singular test, it can be transformed into a generic test for broader application.
- Generic Tests: These are parametrized SQL queries. You can incorporate generic tests directly into your model as attributes. Currently, dbt supports four predefined generic tests: unique, not_null, accepted_values, and relationships.
You can store the failures in a table to use further in your data quality process.
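The difference between the two kinds of tests can be sketched without dbt itself. The example below uses an in-memory SQLite database and plain string templates. It is not how dbt implements its tests, and the `customers` table and the `failing_rows` helper are hypothetical. The template names `not_null` and `unique` simply mirror two of dbt's built-in generic tests.

```python
import sqlite3

# In-memory database with made-up data: one missing email, one duplicated id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "c@example.com")],
)

# Singular-style test: one hand-written query that returns the failing rows.
SINGULAR_TEST = "SELECT customer_id FROM customers WHERE email IS NULL"

# Generic-style tests: the same idea, parametrized by table and column.
# Identifiers are interpolated only because they are hard-coded here;
# never build SQL this way from untrusted input.
GENERIC_TESTS = {
    "not_null": "SELECT {column} FROM {table} WHERE {column} IS NULL",
    "unique": "SELECT {column} FROM {table} GROUP BY {column} HAVING COUNT(*) > 1",
}


def failing_rows(connection, sql):
    """A test passes when its query returns no rows."""
    return connection.execute(sql).fetchall()


singular_failures = failing_rows(conn, SINGULAR_TEST)
unique_failures = failing_rows(
    conn, GENERIC_TESTS["unique"].format(table="customers", column="customer_id")
)
not_null_failures = failing_rows(
    conn, GENERIC_TESTS["not_null"].format(table="customers", column="customer_id")
)
print(singular_failures, unique_failures, not_null_failures)
```

Because every test returns the offending rows rather than a bare pass or fail, those rows can be written to a table and inspected later.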
In essence, dbt testing is straightforward but integrates seamlessly with your existing dbt models. A few additional tools, such as dbt-expectations, elementary, or even Soda, build on the testing feature to add more capabilities.
Due to the popularity of dbt, we included dbt test as well in the comparison. However, it's not a full data quality solution and needs other integrations to be considered as a data quality framework.
Whylogs is the last tool on the list. Whylogs has been developed by Whylabs and currently has 2400⭐️ on GitHub.
Whylogs is a logging library that builds a profile for your datasets. It can then use those profiles to:
- Monitor changes in the datasets.
- Define data constraints to ensure that the data looks as you expect.
- Visualize key metrics for your datasets.
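To give a feel for what a profile-based workflow looks like, the sketch below builds a tiny statistical summary per column and evaluates constraints against it. It deliberately avoids the whylogs API: `profile` and `check_constraints` are hypothetical helpers written for this illustration, and the data is made up.

```python
from statistics import mean


def profile(rows, columns):
    """Build a tiny per-column summary: count, missing values, min, max, mean."""
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        summary[col] = {
            "count": len(values),
            "missing": len(values) - len(present),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "mean": mean(present) if present else None,
        }
    return summary


def check_constraints(summary, constraints):
    """Each constraint is (column, statistic, predicate); returns the names that fail."""
    return [
        f"{col}.{stat}"
        for col, stat, predicate in constraints
        if not predicate(summary[col][stat])
    ]


yesterday = [{"price": 10.0, "qty": 1}, {"price": 12.0, "qty": 3}]
today = [{"price": 11.0, "qty": 2}, {"price": None, "qty": 4}, {"price": 95.0, "qty": 1}]

reference = profile(yesterday, ["price", "qty"])
current = profile(today, ["price", "qty"])

constraints = [
    ("price", "missing", lambda v: v == 0),
    ("price", "max", lambda v: v <= 50),
    ("qty", "min", lambda v: v >= 1),
]
violations = check_constraints(current, constraints)
# Comparing two profiles is enough to spot a shift; the raw rows are no longer needed.
mean_shift = current["price"]["mean"] - reference["price"]["mean"]
print(violations, mean_shift)
```

The point of the design is that monitoring and constraints operate on the compact summaries, not on the raw data, so profiles can be stored and compared cheaply over time.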
In terms of infrastructure compatibility and data integration, we found that whylogs was a bit lacking:
- It doesn't offer any support for orchestration tools. If you want to schedule whylogs to run daily, you will have to create your own integration.
- Additionally, it doesn't have native support for any cloud providers except for BigQuery.
While you can integrate almost anything into your Python whylogs script, having native cloud integration would significantly streamline the process and expedite the integration of whylogs into current workflows.
Whylogs' profiles offer a centralized solution for maintaining dataset statistics, making them ideal for effective data logging. Beyond tabular data, whylogs also supports formats like images, text, and embeddings. The visualization capabilities of their profiles are impressive. Additionally, Whylabs offers a cloud-based version of whylogs, providing hosting for your profiles along with other enhanced features.
In brief, Whylogs by Whylabs, with 2400⭐️ on GitHub, is a proficient dataset profiling tool. It shines in dataset monitoring and metric visualization but lacks in infrastructure compatibility, especially with orchestration tools and cloud providers, barring BigQuery. While custom integrations are possible, native support would elevate its utility. Its versatility in data formats and cloud offerings, however, underscores its potential value.
In the ever-moving world of data, quality data is pivotal for informed decision-making. This analysis provided an overview of the current open-source data quality tools, focusing on their capabilities and integrations. We investigated Great Expectations, Soda, dbt test, and whylogs. Each tool has its strengths, from extensive integrations to user-centric designs and robust profiling. However, they also come with challenges, from steep learning curves to infrastructure limitations.
The ideal tool hinges on an organization's specific needs and data landscape. While we won't provide a definitive ranking, Soda emerges as a versatile, ready-to-use solution.
As the data landscape continues to evolve, so will the tools, and organizations must remain agile in their selections. This leads us to the next and last point of this article.
The investigations of data quality tools don't end here. As the field of data, and thus data quality, is dynamic and ever-evolving, it's imperative to keep up to date with the latest developments. The next steps for this article will involve continuous investigation of emerging data quality tools and updates on the ones we've already delved into. This ensures that our analysis remains relevant and provides the readers with the most current insights.
If you want us to consider one tool not discussed in this list, please let us know!
Thanks for reading this article!