Our demo allows you to search for objects in San Francisco using natural language. You can look for things like Tesla cars, dry patches, boats, and more.
Key features:
- Search using text or by selecting an object from the image as a source ("aim" icon)
- Toggle between object search (default) and tile search ("big" toggle, useful when contextual information matters, like tennis courts)
- Adjust results with downvotes (useful when, for example, the results are mostly images of water)
- Click on tiles to locate them on a map
- Control the number of retrieved tiles with a slider
We use OpenAI's CLIP model (https://openai.com/index/clip/) to embed text and images in the same space, then run a similarity search in that space using either a text query or a source image. Because vanilla CLIP performs poorly on satellite data, we use a CLIP model fine-tuned on pairs of satellite images and OpenStreetMap (https://www.openstreetmap.org/) tags (https://github.com/wangzhecheng/SkyScript). We pre-segment objects with Meta's Segment Anything Model (https://segment-anything.com/) and precompute a CLIP embedding for each object.
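To make the retrieval step concrete, here is a minimal sketch of a top-k cosine-similarity search over precomputed object embeddings. It is an illustration under assumptions, not the demo's actual code: the function names (`normalize`, `cosine`, `search`), the `penalty` parameter, and the way downvotes are applied (subtracting the average similarity to downvoted objects) are all hypothetical, and the tiny 3-dimensional vectors stand in for real CLIP embeddings.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity: dot product of the two unit-length vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def search(
    query: Vector,
    object_embeddings: Sequence[Vector],
    k: int = 5,
    downvoted: Optional[Sequence[Vector]] = None,
    penalty: float = 0.5,
) -> list[tuple[int, float]]:
    """Return up to k (index, score) pairs, best match first.

    `query` is the embedding of a text query or of a source object.
    `downvoted` and `penalty` are hypothetical: each object's score is
    reduced by `penalty` times its average similarity to downvoted objects.
    """
    scored = []
    for index, embedding in enumerate(object_embeddings):
        score = cosine(query, embedding)
        if downvoted:
            avg_neg = sum(cosine(d, embedding) for d in downvoted) / len(downvoted)
            score -= penalty * avg_neg
        scored.append((index, score))
    # Highest score first, then keep the top k.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for precomputed per-object embeddings (assumed data).
objects = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.6, 0.8, 0.0]

print(search(query, objects, k=2))
print(search(query, objects, k=2, downvoted=[[0.0, 1.0, 0.0]]))
```

In a real pipeline the query vector would come from the text or image encoder, and a brute-force loop like this would likely be replaced by a vectorized or approximate nearest-neighbor index once the number of objects grows.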
We'd love to hear your thoughts! What worked well for you? Where did it fail? What features do you wish it had? Any real-world problems you think this could help with?