Here's a 2-minute video that shows it off: https://www.youtube.com/watch?v=S4R1gtmM-Lo
How/why does it work? We believe that with the rise of foundation vision models, computer vision will fundamentally change. These powerful models will let any dev “compile” a model ahead of time with a subset of the foundation model’s characteristics, using only text and a web tool. The days of teams of MLEs building complex models and pipelines are ending.
Zeroshot works by combining two powerful pre-trained models, CLIP and DINOv2. The web app lets users quickly create their training sets via text search. Using pre-cached DINOv2 features, we generate a simple linear model that can be trained and deployed without any fine-tuning. Since you can see what’s going into your training set, you can tune your prompts to get the type of performance or detail you want.
CLIP Small -- Size: 335 MB, Latency: 35ms
CLIP Large -- Size: 891 MB, Latency: 276ms
Zeroshot -- Size: 85 MB, Latency: 20ms
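To make the "simple linear model on pre-cached features" idea concrete, here is a rough sketch in Python. It is not Zeroshot's actual training code: the feature vectors are random stand-ins for cached DINOv2 embeddings, the names train_linear_probe and predict are made up for illustration, and the 384-dimension size and all hyperparameters are assumptions. The point is only that once embeddings are cached, training is a small logistic regression that never touches the backbone.

```python
import numpy as np

def train_linear_probe(features, labels, lr=0.5, epochs=300, l2=1e-3):
    """Fit a logistic-regression head on frozen, pre-computed embeddings.

    Hypothetical helper: only the weight vector and bias are learned,
    the model that produced `features` is never updated.
    """
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        logits = features @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (features.T @ grad / n + l2 * w)
        b -= lr * grad.mean()
    return w, b

def predict(features, w, b):
    """Return 1 for the positive class, 0 otherwise."""
    return (features @ w + b > 0).astype(int)

# Synthetic stand-in for cached embeddings (assumption: real ones would be
# loaded from disk). Two classes are shifted apart along one direction.
rng = np.random.default_rng(0)
dim = 384  # illustrative embedding size
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
pos = rng.normal(size=(200, dim)) + 3.0 * direction
neg = rng.normal(size=(200, dim)) - 3.0 * direction
X = np.vstack([pos, neg])
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize each row
y = np.concatenate([np.ones(200), np.zeros(200)])

# Shuffle, then hold out 100 samples for evaluation.
order = rng.permutation(len(X))
X, y = X[order], y[order]
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

w, b = train_linear_probe(X_train, y_train)
accuracy = (predict(X_test, w, b) == y_test).mean()
print(f"held-out accuracy: {accuracy:.3f}")
```

The deployed artifact in a setup like this is just w and b plus the frozen feature extractor, which is one plausible reason a classifier built this way can stay small.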
What’s next? We wanna see how people use (or would use) the tool before deciding what to do next. On the list: clients for iOS and NodeJS, speeding up GPU inference times via TensorRT, offering larger Zeroshot models for better accuracy, easier results refining, support for bringing your own data lake, and model refinement using GPT-V. We’ve got plenty of ideas.