This is an open-source rewrite of a similar tool that we developed for one of the biggest advertisers to monitor the quality of data. The solution was battle-tested: it analyzed data across all three clouds, handled sensitive data, and worked behind firewalls. We also used the framework to build 800+ custom data quality checks, analyze petabyte-scale tables daily, and calculate data quality KPIs for 40+ vendors providing data.
DQO.ai keeps the best practices from that first release and avoids the mistakes we made in it.
What was the biggest problem with the first version? It was designed from a UI-first perspective, like all those new Data Observability tools that are coming on the market. We have been there: it works only for the first 6 months, until you want to move the data quality checks between environments (from development to production). It also takes a lot of clicking to rename the tables used in data quality checks when a table is renamed in the data lake.
DQO.ai was redesigned for DevOps and DataOps teams. Here are our design goals:
- configuration stored in Git, alongside the data pipelines or ML code: we store all definitions in YAML files, one file per monitored table
- configuration is easy to edit without using the docs: our YAML files support code completion in VSCode and some other editors
- multi cloud / on-premise: DQO runs as a remote agent, so you can run it inside your own environment and call it from Airflow
- support sensitive data: all data quality metrics are executed from the agent, results are first stored locally in parquet files
- custom data quality checks: the built-in quality checks are written as SQL query templates, using Jinja2 for templating; they can be modified, and new quality checks can be added
- multi platform storage: the data quality results are stored in a Hive compatible partitioning format, you can just push the whole folder to BigQuery, Databricks, Spark, Athena, ... or anything that can query parquet files
- petabyte scale data: we can just analyze date-partitioned data, using the partition date to analyze the table as a time series, which lets you detect anomalies in the data.
- free data quality dashboards: our SaaS version sets up a data lake for each tenant, the data quality metrics are shown on Google Data Studio dashboards. We have just the first dashboard so far, but we are working on releasing more dashboards.
- use dimensions to group similar tables (e.g. one table per country) or analyze each group of rows in a single table (GROUP BY <country> column): data can be analyzed even when it is loaded from different sources into a single table
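
To make the template-based checks and the Hive-compatible storage layout more concrete, here is a minimal sketch in Python. It is not DQO's actual code: the template text, the `render_check` and `partition_path` helpers, the table and column names, and the folder layout are all hypothetical, and Python's standard `string.Template` stands in for Jinja2 to keep the example dependency-free.

```python
from datetime import date
from string import Template

# Hypothetical check template. DQO's real templates use Jinja2 syntax;
# string.Template is used here only as a dependency-free stand-in.
ROW_COUNT_TEMPLATE = Template(
    "SELECT $date_column AS time_period, $dimension AS dimension, "
    "COUNT(*) AS actual_value "
    "FROM $table "
    "GROUP BY $date_column, $dimension"
)


def render_check(table: str, date_column: str, dimension: str) -> str:
    """Fill the SQL template for one monitored table (hypothetical helper)."""
    return ROW_COUNT_TEMPLATE.substitute(
        table=table, date_column=date_column, dimension=dimension
    )


def partition_path(root: str, connection: str, table: str, month: date) -> str:
    """Build a Hive-style key=value folder path (assumed layout, not DQO's)."""
    return (
        f"{root}/connection={connection}/table={table}"
        f"/month={month:%Y-%m}/results.parquet"
    )


sql = render_check("sales.orders", "order_date", "country")
path = partition_path("results", "bigquery_prod", "sales.orders", date(2022, 9, 1))
print(sql)
print(path)
```

Grouping by both the partition date and a dimension column is what turns one table into several time series, one per dimension value. The key=value folder names are what lets engines that understand Hive-style partitioning treat `connection`, `table`, and `month` as queryable columns.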
What kinds of issues can be detected:
- data timeliness: tables are not refreshed
- data completeness: something is missing
- data validity: verify basic rules, like column values matching regular expressions
- data consistency: analyze the behavior of data over time as a time series, for example: detect anomalies such as inconsistent growth of the row count
- uniqueness: check if the data is unique
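
Two of the checks above, validity and consistency, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DQO's implementation: `regex_match_percent` borrows its name from the check in the DQO documentation, but its signature here is hypothetical, and `is_row_count_anomaly` uses a simple median-based threshold chosen only for this example.

```python
import re
from statistics import median


def regex_match_percent(values, pattern):
    """Return the percentage of non-null values that fully match the pattern."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 100.0  # assumption: a column with no values passes the check
    compiled = re.compile(pattern)
    matched = sum(1 for v in non_null if compiled.fullmatch(v))
    return 100.0 * matched / len(non_null)


def is_row_count_anomaly(history, latest, max_deviation=0.5):
    """Flag the latest row count if it deviates from the historical median
    by more than max_deviation, expressed as a fraction of the median."""
    baseline = median(history)
    if baseline == 0:
        return latest != 0
    return abs(latest - baseline) / baseline > max_deviation


emails = ["a@example.com", "b@example.com", "not-an-email", None]
print(regex_match_percent(emails, r"[^@\s]+@[^@\s]+\.[a-z]+"))

daily_rows = [1000, 1020, 990, 1010, 1005]
print(is_row_count_anomaly(daily_rows, 1015))  # close to the median
print(is_row_count_anomaly(daily_rows, 300))   # sharp drop in row count
```

A production check would run as SQL against the monitored table rather than over values pulled into Python; the point here is only to show what each metric measures.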
Repository is here: https://github.com/dqoai/dqo
We are still working on the documentation, but this article should give you a good understanding: https://docs.dqo.ai/latest/check_reference/validity/regex_match_percent/regex_match_percent/
There is also a getting-started blog post here: https://dqo.ai/getting-started-with-dqo-ai/