Data workflow management is a recurring issue at many companies. There are several open source projects to address it[1][2][3][4].
But as data starts moving around, inconsistencies start cropping up. Did a new enumerated value show up in a field? Did a new field appear in the source? Did an int get switched to a bigint? Does a field contain mojibake[5] and need fixing[6]? Did you start seeing nulls instead of empty strings somewhere? Are you seeing missing values where you don't expect them? Did the phone number formatting from the source change out from under you? Are duplicates showing up in a presumably unique field? Is there a discrepancy between your source and your data warehouse?
None of these workflow management systems seem to explicitly address this area. For all the data plumbing projects, there seems to be a distinct lack of anything to ensure you've got clean water running through all that plumbing.
The traditional solutions to this seem to be Informatica[7] and Talend[8]. The only newer one I've come across is a very young (and Hadoop-only) project from eBay[9].
I'm curious how companies here are approaching this issue. Is your company rolling its own solution? Or using something I haven't mentioned? Or maybe using one of the old-school, heavyweight products like Informatica? Or maybe doing nothing at all, under the assumption that your databases and data warehouses don't need the same level of unit, regression, and sanity testing that your applications receive?
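To make the question concrete, here is a rough sketch of the kind of sanity checks I have in mind, written as plain Python over a batch of rows. Everything in it is made up for illustration: the `check_rows` helper, the `schema` and `allowed` structures, and the sample fields (`id`, `status`, `phone`, `fax`) are assumptions, not part of any of the tools above.

```python
from collections import Counter


def check_rows(rows, schema, allowed, unique_field):
    """Return a list of problem descriptions for a batch of row dicts.

    schema:       field name -> expected Python type (hypothetical format)
    allowed:      field name -> set of permitted values for enumerated fields
    unique_field: name of a field whose values should never repeat
    """
    problems = []
    for i, row in enumerate(rows):
        # Fields that appeared in, or vanished from, the source.
        for field in sorted(set(row) - set(schema)):
            problems.append(f"row {i}: unexpected field {field!r}")
        for field in sorted(set(schema) - set(row)):
            problems.append(f"row {i}: missing field {field!r}")

        for field, expected_type in schema.items():
            if field not in row:
                continue
            value = row[field]
            if value is None:
                # Nulls where a real value was expected.
                problems.append(f"row {i}: null in {field!r}")
            elif not isinstance(value, expected_type):
                # Type drift, e.g. a number arriving as a string.
                problems.append(
                    f"row {i}: {field!r} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
            elif field in allowed and value not in allowed[field]:
                # A new enumerated value the pipeline has not seen before.
                problems.append(f"row {i}: new value {value!r} in {field!r}")

    # Duplicates in a field that is supposed to be unique.
    counts = Counter(row.get(unique_field) for row in rows)
    for value, n in counts.items():
        if value is not None and n > 1:
            problems.append(f"{unique_field!r} value {value!r} appears {n} times")
    return problems


# Made-up sample data, purely for illustration.
rows = [
    {"id": 1, "status": "active", "phone": "555-0100"},
    {"id": 1, "status": "archived", "phone": None},
    {"id": 2, "status": "active", "phone": "555-0101", "fax": ""},
]
schema = {"id": int, "status": str, "phone": str}
allowed = {"status": {"active", "inactive"}}

for problem in check_rows(rows, schema, allowed, unique_field="id"):
    print(problem)
```

Something like this is easy to bolt onto a single job, but it doesn't cover source-to-warehouse reconciliation or format drift, and maintaining it across dozens of pipelines is exactly the part I'd hope a dedicated tool would handle.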
[1] github.com/spotify/luigi
[2] github.com/pinterest/pinball
[3] azkaban.github.io/
[4] github.com/apache/incubator-airflow
[5] en.wikipedia.org/wiki/Mojibake
[6] ftfy.readthedocs.io/en/latest/
[7] informatica.com/products/data-quality.html
[8] talend.com/products/data-quality
[9] ebay.github.io/griffin/