1. Partitioning: I'm thinking of using distributed k-means and inverted multi-index quantizers for efficient data partitioning.
2. On-Disk Storage: Because of the scale, I'm storing everything on disk in a compressed sparse row (CSR) format.
3. Distributed Search: I plan to implement a client-server model with multiple servers to handle search operations.
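To make points 2 and 3 a bit more concrete, here is a rough sketch of what I have in mind. It is plain Python, not Faiss code and not Faiss's actual on-disk format. The function names (`build_csr`, `read_list`, `merge_topk`) and the toy data are made up for illustration, and I'm assuming each server returns its hits as (distance, id) pairs where a smaller distance is better.

```python
import heapq
from array import array


def build_csr(assignments, n_lists):
    """Group vector ids by inverted-list id into two flat CSR arrays.

    assignments[i] is the partition (inverted list) that vector i fell into,
    e.g. the nearest k-means centroid. Returns (offsets, ids), where list j
    occupies ids[offsets[j]:offsets[j + 1]].
    """
    counts = [0] * n_lists
    for list_id in assignments:
        counts[list_id] += 1

    # Prefix sum of the counts gives the start offset of every list.
    offsets = array("q", [0] * (n_lists + 1))
    for j in range(n_lists):
        offsets[j + 1] = offsets[j] + counts[j]

    # Scatter each vector id into the next free slot of its list.
    ids = array("q", [0] * len(assignments))
    cursor = list(offsets[:n_lists])
    for vec_id, list_id in enumerate(assignments):
        ids[cursor[list_id]] = vec_id
        cursor[list_id] += 1
    return offsets, ids


def read_list(offsets, ids, list_id):
    """Return the vector ids stored in one inverted list."""
    return list(ids[offsets[list_id]:offsets[list_id + 1]])


def merge_topk(shard_results, k):
    """Merge per-server (distance, id) hits into one global top-k list."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nsmallest(k, all_hits)


# Toy example: 5 vectors assigned to 3 partitions.
offsets, ids = build_csr([2, 0, 2, 1, 0], n_lists=3)
print(list(offsets), list(ids))

# Toy example: two servers each return their local best hits.
print(merge_topk([[(0.1, 7), (0.5, 3)], [(0.2, 9), (0.3, 1)]], k=3))
```

Because both arrays are flat, I expect they could be written to disk as-is (for example with `array.tofile`) and a single list read back with one seek, which is the property I'm after. The client-side merge only needs the top-k from each server, not the full candidate lists.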
Does this approach sound feasible, or am I overlooking something crucial? Any advice or suggestions?
I'm mostly working from this Faiss wiki article: Indexing 1T Vectors (https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors). I think the data is too big for AutoFaiss, but I can use it for smaller-scale experiments.