Show HN: Cleanlab Vizzy – automatically find label errors and bad data

playground.cleanlab.ai

1 point

4 years ago

Cleanlab (https://github.com/cleanlab/cleanlab) is a family of algorithms for automatically finding issues in datasets. It might seem surprising that it’s possible to automatically identify label errors and out-of-distribution data; Cleanlab does this using the algorithms published in https://arxiv.org/abs/1911.00068.

Cleanlab’s algorithms, while clever, are actually relatively simple. To help myself (and others!) build intuition for how they work, I built Vizzy, an interactive demo that runs in the browser. Vizzy lets you experiment with an example dataset, tweak the labels, and run Cleanlab to automatically find issues like label errors and out-of-distribution data.
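
To give a rough feel for why the core idea is simple, here is a minimal pure-Python sketch of the thresholding step from confident learning. It is not cleanlab's API or Vizzy's code: the function name `find_label_issues_sketch` and the toy data are made up for illustration, and the real library does more (for example, ranking and pruning the candidates). The sketch assumes you already have out-of-sample predicted probabilities for every example.

```python
from typing import List


def find_label_issues_sketch(labels: List[int],
                             pred_probs: List[List[float]]) -> List[int]:
    """Return indices of examples whose given label looks suspicious.

    Hypothetical helper for illustration only, not part of cleanlab.
    """
    num_classes = len(pred_probs[0])

    # Per-class threshold: the average predicted probability of class k
    # over the examples that are currently labeled k.
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)

    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is confident about for this example.
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # no confident class, so this sketch makes no call
        # Most likely class among the confident ones.
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            issues.append(i)
    return issues


# Toy data: example 2 is labeled 0, but the model strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.4, 0.6],
]
print(find_label_issues_sketch(labels, pred_probs))  # prints [2]
```

The per-class thresholds are what make this more robust than simply comparing each label with the model's top prediction: a class the model is systematically less sure about gets a lower bar.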

Vizzy includes a JavaScript port of (a part of) cleanlab, along with other neat technical nuggets, including ML model training in the browser (taking features from a pretrained ResNet-18, applying truncated SVD, and fitting an SVM model for speed). If you’re interested in the details of how Vizzy works, check out this blog post: https://cleanlab.ai/blog/cleanlab-vizzy/

I’m happy to answer any questions related to Vizzy, cleanlab, or confident learning and data-centric AI in general!