Data Quality for Notion Databases 🚀
Notion ➕ Great Expectations = 🚀
If you've ever heard of or used Notion (especially their databases) and Great Expectations, you can already imagine what this is about 😉. If not, find a quick ELI5 below.
See our GitHub for more technical details and detailed instructions.
👶 ELI5: Great Expectations
"Great Expectations is a shared, open standard for data quality. It helps data teams eliminate pipeline debt, through data testing, documentation, and profiling" - Great Expectations' website, 2021
In short, with Great Expectations you always know what to expect from your data. They do this via what they call 'Expectations' (didn't see that coming, huh? 🙄), which, as the name implies, are qualities you expect from your data.
Expectations can be as simple as "I want to be sure that this column is never null" or "I want to make sure the row count is always X". If you want to dig deeper or find a list of possible expectations, you can do so at Great Expectations' official site.
If you're done with that and want to dig deeper, our colleague Paolo Léonard wrote a tutorial on writing your own custom expectations here.
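To make the two examples above concrete, here is a toy sketch in plain Python. It deliberately does not call the Great Expectations library: the two functions only borrow the names of the real expectations `expect_column_values_to_not_be_null` and `expect_table_row_count_to_equal`, and the return dictionaries are a simplified assumption, not the library's actual result format. The sample rows are made up.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def expect_column_values_to_not_be_null(rows: List[Row], column: str) -> Dict[str, Any]:
    """Toy check: every row must have a non-empty value in `column`."""
    unexpected = [row for row in rows if row.get(column) in (None, "")]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}


def expect_table_row_count_to_equal(rows: List[Row], value: int) -> Dict[str, Any]:
    """Toy check: the table must contain exactly `value` rows."""
    return {"success": len(rows) == value, "observed_value": len(rows)}


# Made-up sample data, purely for illustration.
rows = [
    {"Name": "Alice", "Favorite Dessert": "Tiramisu"},
    {"Name": "Bob", "Favorite Dessert": None},
]

not_null = expect_column_values_to_not_be_null(rows, "Favorite Dessert")
row_count = expect_table_row_count_to_equal(rows, 2)

print(not_null)
print(row_count)
```

The point is only the shape of the idea: an expectation is a named, parameterised assertion about a table that reports whether it held and how many values broke it.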
👶🏼 ELI5: Notion
I love Notion's own explanation of itself, so I'll point you to it, right here 🎉
In short, it is an all-in-one workspace collaboration tool that has it all: tasks, lists, kanban boards, wikis, and the star of today: databases. Here at Dataroots, we ❤️ Notion and we use it extensively. Being a data-first company, as you can imagine, we have databases for anything and everything you can think of, but we'll talk more about that in a bit.
What took you so long?
As said before, we love Notion and we love Great Expectations, so this marriage was just a matter of time. Well, not only time, to be exact. Before, Notion was only a website, and it was not until May 2021 that they released the first public beta of their own API 🎊. This was the last piece of the puzzle that allowed us to combine it with Great Expectations. Isn't that the beauty of Open Source?
So you get the goal here: use Notion's API to get our databases and run those through Great Expectations to get our results. For the remainder of this blog, we'll focus on our Employee Directory database. I know what you're thinking: "you even have a database for your own employees?! 🤯". You bet we do.
Our Employee Directory contains mundane information about our employees, like phone number, email, and position, but also crucial pieces of data like our favorite dessert. It is of the utmost importance that we know everybody's favorite dessert, so of course this was the first expectation we built.
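Great Expectations works on tabular data, while Notion's API returns each database row as a nested page object, so somewhere in between the rows need flattening. Below is a minimal sketch of that step, not the code from our repo. The property shapes follow Notion's documented property value objects, but the column names, the property type chosen for each column, and the sample values are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def property_to_value(prop: Dict[str, Any]) -> Optional[Any]:
    """Reduce one Notion property object to a plain Python value."""
    kind = prop.get("type")
    payload = prop.get(kind)
    if payload is None:
        return None
    if kind in ("title", "rich_text"):
        # Text properties arrive as a list of fragments; join their plain text.
        text = "".join(part.get("plain_text", "") for part in payload)
        return text or None
    if kind == "select":
        return payload.get("name")
    if kind == "multi_select":
        return [option.get("name") for option in payload]
    if kind in ("email", "phone_number", "number", "checkbox", "url"):
        # These types already carry a plain value.
        return payload
    # Other property types are treated as missing in this sketch.
    return None


def page_to_row(page: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one Notion page (a database row) into a column -> value dict."""
    properties = page.get("properties", {})
    return {name: property_to_value(prop) for name, prop in properties.items()}


# Hypothetical page, trimmed to the fields this sketch reads.
sample_page = {
    "object": "page",
    "properties": {
        "Name": {"type": "title", "title": [{"plain_text": "Alice"}]},
        "Email": {"type": "email", "email": "alice@example.com"},
        "Position": {"type": "select", "select": {"name": "Data Engineer"}},
        "Favorite Dessert": {"type": "rich_text", "rich_text": []},
    },
}

row = page_to_row(sample_page)
print(row)
```

An empty text property becomes `None` here on purpose, so that a "never null" expectation catches an unfilled dessert.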
We designed our tool to be extremely user-friendly and quick to get going. Adding a new database and creating the expectation suite takes around 10-15 minutes if you already know what expectations to include. Allow me to guide you through the 4 easy steps:
1: Create a Notion integration
As always, whenever we're dealing with an API, we need a way to authenticate ourselves. Luckily for us, Notion makes it really simple to create what they call an Integration and give it access to whatever page/db you want.
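Once the integration exists, its secret token goes into every request. The sketch below shows roughly what an authenticated call to Notion's "query a database" endpoint could look like using only the standard library. It is not how our tool does it. The environment variable names `NOTION_TOKEN` and `NOTION_DATABASE_ID` are hypothetical, and the `Notion-Version` value is simply one published API version picked for the example, so check Notion's documentation for the one you want.

```python
import json
import os
import urllib.request
from typing import Any, Dict, Iterator, Optional

NOTION_API = "https://api.notion.com/v1"
NOTION_VERSION = "2022-06-28"  # assumption: pick the API version you target


def build_headers(token: str) -> Dict[str, str]:
    """Headers Notion expects: bearer token plus an explicit API version."""
    return {
        "Authorization": f"Bearer {token}",
        "Notion-Version": NOTION_VERSION,
        "Content-Type": "application/json",
    }


def build_query_request(
    database_id: str, token: str, start_cursor: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) one paginated database query."""
    body: Dict[str, Any] = {"page_size": 100}
    if start_cursor:
        body["start_cursor"] = start_cursor
    return urllib.request.Request(
        url=f"{NOTION_API}/databases/{database_id}/query",
        data=json.dumps(body).encode("utf-8"),
        headers=build_headers(token),
        method="POST",
    )


def iter_pages(database_id: str, token: str) -> Iterator[Dict[str, Any]]:
    """Yield every row of the database, following Notion's cursor pagination."""
    cursor = None
    while True:
        request = build_query_request(database_id, token, cursor)
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
        yield from payload["results"]
        if not payload.get("has_more"):
            break
        cursor = payload["next_cursor"]


if __name__ == "__main__":
    # Hypothetical variable names; nothing is sent unless both are set.
    token = os.environ.get("NOTION_TOKEN")
    database_id = os.environ.get("NOTION_DATABASE_ID")
    if token and database_id:
        print(sum(1 for _ in iter_pages(database_id, token)), "rows fetched")
```

Remember that the integration only sees databases it has been given access to, so a permissions error usually means the database was never shared with it.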
2: Choose your database (just get the URL)
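The URL matters because the database ID is embedded in it: for a database opened as a full page, it is the 32-character hexadecimal string at the end of the path, before any `?v=` view parameter. Here is a small sketch of how that ID could be pulled out. The function name and the example URL are made up, and our tool may well handle this differently.

```python
import re
import uuid
from urllib.parse import urlparse


def database_id_from_url(url: str) -> str:
    """Extract the database ID from a Notion URL as a hyphenated UUID."""
    # Only the last path segment matters; the query string holds the view ID.
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    match = re.search(r"([0-9a-fA-F]{32})$", last_segment.replace("-", ""))
    if match is None:
        raise ValueError(f"No 32-character database ID found in: {url}")
    return str(uuid.UUID(match.group(1)))


# Made-up URL in the usual workspace/ID?v=view layout.
example_url = (
    "https://www.notion.so/acme/0123456789abcdef0123456789abcdef"
    "?v=fedcba9876543210fedcba9876543210"
)
database_id = database_id_from_url(example_url)
print(database_id)
```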
3: Create your Expectation Suite
Using our Jupyter notebook, it is extremely easy to create your expectation suite while doing some data exploration to make sure you know what to expect. Here you can see our dessert-related expectations 🍰 (along with others).
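For readers who have never seen one, an expectation suite is ultimately just a named list of expectations with their parameters, stored as JSON. The sketch below builds such a structure by hand. The key names mirror the layout that older (pre-1.0) Great Expectations releases wrote to disk, and they differ between releases, so treat them as an assumption. The suite name, the helper function, and the `Email` expectation are hypothetical; the expectation types themselves are real ones from the library.

```python
import json
from typing import Any, Dict


def expectation(expectation_type: str, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical helper: one suite entry with its type and parameters."""
    return {"expectation_type": expectation_type, "kwargs": kwargs, "meta": {}}


suite = {
    "expectation_suite_name": "employee_directory",  # hypothetical name
    "expectations": [
        # Everybody must have a favorite dessert.
        expectation("expect_column_values_to_not_be_null", column="Favorite Dessert"),
        # Illustrative extra: no two employees should share an email address.
        expectation("expect_column_values_to_be_unique", column="Email"),
    ],
    "meta": {},
}

suite_json = json.dumps(suite, indent=2)
print(suite_json)
```

In practice the notebook writes this file for you; the sketch is only meant to show that there is no magic behind it.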
4: Run 🚀
Now you have your database and your expectation suite. On top of these 2 things, you'll just need a description to identify your run and you're good to go.
That's it! You have now successfully run an expectation suite (a group of expectations) against your data. You can either see your results as boring .json files OR you can use Great Expectations' sweet, automatic Data Docs.
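If you do go the boring .json route, a few lines are enough to turn a result file into a readable summary. The sketch below assumes the result follows the general shape of a Great Expectations validation result (a `results` list whose entries carry `success`, `expectation_config`, and `result`). The exact keys depend on the library version and on how the run was configured, so the sample dictionary here is hand-written for illustration, not copied from a real run.

```python
from typing import Any, Dict, List


def summarize(validation_result: Dict[str, Any]) -> List[str]:
    """Turn a validation result dict into one PASS/FAIL line per expectation."""
    lines = []
    for item in validation_result.get("results", []):
        config = item.get("expectation_config", {})
        name = config.get("expectation_type", "unknown")
        column = config.get("kwargs", {}).get("column", "<table>")
        status = "PASS" if item.get("success") else "FAIL"
        unexpected = item.get("result", {}).get("unexpected_count")
        suffix = f" ({unexpected} unexpected)" if unexpected else ""
        lines.append(f"{status} {name} [{column}]{suffix}")
    return lines


# Hand-written sample in the assumed shape.
sample_result = {
    "success": False,
    "results": [
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "Favorite Dessert"},
            },
            "result": {"unexpected_count": 18},
        },
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_unique",
                "kwargs": {"column": "Email"},
            },
            "result": {"unexpected_count": 0},
        },
    ],
}

summary = summarize(sample_result)
print("\n".join(summary))
```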
Great Expectations' Data Docs 📊
One of the great things about Great Expectations is its Data Docs. Data Docs are HTML pages that Great Expectations compiles from your expectation suites and validation runs. To learn more, here is the original website.
Here you can see a log of all your previous runs with information on a per-expectation level.
It is also a great place to see a list of your expectation suites and to dive deep into each expectation.
So, what happened to our 'Favorite Dessert' column?
You can see we currently have 18 employees whose favorite dessert we don't know! 🤯 You can be sure that by the time you're reading this, this is no longer the case, as this is utterly unacceptable.
Wrapping Up 🦾
To conclude, we think this is a great tool to implement data quality in your Notion databases. Although Great Expectations may seem a bit overkill for this use case (as it's mostly used with much bigger and more complex databases), we thought it was a great way to combine Notion, which we use extensively internally, ➕ Great Expectations, which we use with a number of clients.
If you've read all the way here, first of all we'd like to say "thanks 🙏🏼", and we hope you're excited and already thinking about how to use this solution yourself.
You can find the open-source repo we used to build this ourselves here. Inside you will find more technical details and all the specific instructions on how to get it running yourself. This tool is open-source ❤️, just like Great Expectations, and Notion's API is public, so we would love for you, the community, to contribute, as this is how great things get built.