Machine learning runs on tensor-based computations (tensors are the language ML models think in). I recently discovered a project that makes it much easier to set up and run machine learning projects, and lets you create and store datasets in a deep-learning-native format.
Hub by Activeloop (https://github.com/activeloopai/Hub) is an open-source Python package that arranges data in NumPy-like arrays. It integrates with deep learning frameworks such as TensorFlow and PyTorch for faster GPU processing and training. In addition, you can update data stored in the cloud, create machine learning pipelines using the Hub API, and interact with datasets (e.g. visualize them) on the Activeloop platform (https://app.activeloop.ai). The real benefit for me is that I can stream my datasets without storing them on my machine. My own datasets run to 10GB or more, but it works just as well with 100GB+ datasets such as ImageNet (https://docs.activeloop.ai/datasets/imagenet-dataset).
Hub lets us store image, audio, and video data in a form that can be accessed very quickly. The data can live in GCS/S3 buckets, on local storage, or on Activeloop cloud, and it can be used directly to train TensorFlow/PyTorch models, so you don't need to set up separate data pipelines. The package also comes with data version control, dataset search queries, and support for distributed workloads.
For me personally, the simplicity of the API stands out. For instance:
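To make the "no data pipeline" point a bit more concrete, here is a minimal sketch of feeding a Hub dataset to PyTorch. It assumes Hub 2.x, where a dataset exposes a pytorch() method, and it assumes batches come back keyed by tensor name (images and labels, as in the CIFAR-10 dataset). The helper names make_loader and first_batch_shapes and the default batch size are my own, not part of Hub.

```python
def make_loader(ds, batch_size=32, shuffle=True):
    """Wrap a Hub dataset in a PyTorch-style dataloader.

    `ds` is assumed to be a Hub 2.x dataset, for example the result of
    hub.load("hub://activeloop/cifar10-train").
    """
    # Assumption: ds.pytorch() accepts these keyword arguments (Hub 2.x).
    return ds.pytorch(batch_size=batch_size, shuffle=shuffle, num_workers=0)


def first_batch_shapes(loader, tensor_names=("images", "labels")):
    """Return {tensor_name: shape} for the first batch, as a quick sanity check.

    Assumption: each batch can be indexed by tensor name.
    """
    batch = next(iter(loader))
    return {name: tuple(batch[name].shape) for name in tensor_names}
```

With Hub installed, passing a loaded dataset to make_loader and then calling first_batch_shapes on the result is a quick way to confirm that data is streaming before starting a real training loop.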
Loading datasets in seconds
import hub
ds = hub.load("hub://activeloop/cifar10-train")
This lets you access and train a model on the CIFAR-10 dataset in seconds (see all the available datasets here: https://docs.activeloop.ai/datasets).
Dataset filtering
Hub has its own Pythonic query language (see below), or you can define your own functions to filter the data (since it's a multi-dimensional array, that's easy to do).
ds_view = ds.filter("labels == 'automobile'", scheduler="threaded", num_workers=0)
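If you want to match several classes, the query string can be put together programmatically. This is only a sketch: build_label_query and filter_by_labels are hypothetical helpers of mine (not part of Hub), and they assume that equality checks on labels can be joined with "or" in the query language.

```python
def build_label_query(class_names):
    """Build a query string such as "labels == 'a' or labels == 'b'"."""
    if not class_names:
        raise ValueError("need at least one class name")
    return " or ".join(f"labels == '{name}'" for name in class_names)


def filter_by_labels(ds, class_names):
    """Return a filtered view of `ds` containing only the given classes.

    Passes the same scheduler and num_workers arguments to ds.filter()
    as the one-line example in this post.
    """
    query = build_label_query(class_names)
    return ds.filter(query, scheduler="threaded", num_workers=0)
```

For CIFAR-10, calling filter_by_labels(ds, ["automobile", "truck"]) would ask for both vehicle classes in one view.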
Dataset Version Control
Everything you may already know from git (e.g. commit, log, branch, diff, checkout) also works for Hub using similar commands: https://docs.activeloop.ai/getting-started/dataset-version-control
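As a rough illustration of how close this feels to git, here is a small sketch. It assumes the Hub 2.x method names commit(message) and checkout(name, create=True) on a writable dataset; commit_and_branch is a helper name I made up for this post.

```python
def commit_and_branch(ds, message, branch_name):
    """Commit the current state of `ds`, then start a new branch from it.

    `ds` is assumed to be a writable Hub 2.x dataset.
    Returns whatever ds.commit() returns (in Hub 2.x, the commit id).
    """
    commit_id = ds.commit(message)          # roughly `git commit -m "<message>"`
    ds.checkout(branch_name, create=True)   # roughly `git checkout -b <branch_name>`
    return commit_id
```

After this, ds.log() should show the commit history in a similar way to git log.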
My experience with Hub was amazing: I was able to create a dataset and push data to the cloud within a couple of minutes. I also put together a tutorial on getting started with dataset management in Activeloop Hub so you can finish your ML projects faster.
You will learn:
- Initializing a dataset on the Activeloop Cloud
- Processing the images
- Pushing the data to the cloud
- Data version control
- Data visualization
Read more here: https://www.kdnuggets.com/2022/03/new-way-managing-deep-learning-datasets.html
Activeloop Hub docs: https://docs.activeloop.ai/
Let me know if you have any questions! I just started with Hub quite recently, but might be able to help out. :)
@1abidaliawan