We know from experience that a qualitative understanding of critical data segments is necessary when developing ML models. Better tooling makes finding these data slices faster and more systematic, and it helps when communicating with domain experts.
We built Sliceguard to go from a raw dataset to an interactive report on critical data slices in just three lines of code:
https://github.com/Renumics/sliceguard
Behind the scenes, we use hierarchical clustering and explainable AI techniques to detect and rank data slices based on features, metadata and embeddings.
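To make the idea concrete, here is a minimal, library-free sketch of slice detection by hierarchical clustering. It is not Sliceguard's actual implementation: the function names (`agglomerative_clusters`, `rank_slices`), the `max_distance` and `min_support` parameters, and the toy data are all assumptions made for illustration. It groups samples with average-linkage agglomerative clustering, computes accuracy per cluster, and ranks clusters by how far they fall below the overall accuracy.

```python
from math import dist


def agglomerative_clusters(points, max_distance):
    """Average-linkage agglomerative clustering (hypothetical helper).

    Repeatedly merges the two closest clusters until the closest pair
    is farther apart than max_distance.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > max_distance:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def rank_slices(points, y_true, y_pred, max_distance, min_support=2):
    """Rank clusters by accuracy drop relative to the whole dataset."""
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    slices = []
    for members in agglomerative_clusters(points, max_distance):
        if len(members) < min_support:
            continue  # ignore slices too small to be meaningful
        acc = sum(y_true[i] == y_pred[i] for i in members) / len(members)
        slices.append(
            {"indices": sorted(members), "accuracy": acc, "drop": overall - acc}
        )
    # Largest drop first: these are the most critical slices.
    slices.sort(key=lambda s: s["drop"], reverse=True)
    return overall, slices


# Toy data: three well-separated groups; the model fails mostly in the second.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (9.0, 0.0), (9.1, 0.1)]
y_true = [1, 1, 1, 0, 0, 0, 1, 1]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1]

overall, slices = rank_slices(points, y_true, y_pred, max_distance=1.0)
for s in slices:
    print(s["indices"], round(s["accuracy"], 2), round(s["drop"], 2))
```

On this toy data, the cluster containing samples 3 to 5 is ranked first because its accuracy is well below the overall accuracy. A real pipeline would cluster on features, metadata or embeddings rather than hand-written 2D points, and would use an optimized clustering library instead of this quadratic loop.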
Here is some more information on Sliceguard:
- Works on structured data, unstructured data (image, audio, text, multimodal) and hybrid datasets.
- Works directly on existing pandas DataFrames.
- Computes embeddings automatically and includes AutoML functionality to pinpoint problems without any setup.
- Includes an interactive GUI for slice inspection that supports multimodal data and can be configured with drag-and-drop.