Is My Approach to Vectorizing and Storing 1.5 Trillion Tokens Reasonable?

2 years ago

I'm planning to index and store 1.5 trillion tokens using Faiss and would love some feedback on my approach:

1. Partitioning: I'm thinking of using distributed k-means and inverted multi-index quantizers for efficient data partitioning.

2. On-Disk Storage: Because of the scale, I plan to store everything on disk in a Compressed Sparse Row (CSR) format.

3. Distributed Search: I plan to implement a client-server model with multiple servers to handle search operations.
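
To make step 3 concrete, here is a minimal sketch of the client-side merge, assuming each server returns its own top-k as (distance, id) pairs, that smaller distances are better (L2-style), and that ids are unique across servers. The names `Hit` and `merge_shard_results` are hypothetical and not part of Faiss.

```python
import heapq
from typing import List, Tuple

# Hypothetical result type: (distance, vector id), smaller distance = better match.
Hit = Tuple[float, int]


def merge_shard_results(per_shard_hits: List[List[Hit]], k: int) -> List[Hit]:
    """Merge the top-k lists returned by each server into one global top-k."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    # nsmallest compares tuples, so hits are ordered by distance first, then id.
    return heapq.nsmallest(k, all_hits)


# Made-up results from three servers for a single query.
shard_a = [(0.12, 7), (0.40, 3)]
shard_b = [(0.05, 1042), (0.33, 2001)]
shard_c = [(0.29, 9000)]

top3 = merge_shard_results([shard_a, shard_b, shard_c], k=3)
print(top3)
```

Each server only needs to send back k candidates per query, so the merge cost on the client grows with the number of servers rather than with the size of the index.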

Does this approach sound feasible, or am I overlooking something crucial? Any advice or suggestions?
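
For a sense of scale, here is a rough back-of-envelope sizing sketch. Everything in it except the 1.5 trillion figure is an assumption for illustration: one vector per token, 16-byte compressed codes, 8-byte ids, and 2 TB of usable disk per server. The function names are hypothetical.

```python
def index_size_bytes(n_vectors: int, code_bytes: int, id_bytes: int = 8) -> int:
    """Size of the stored codes plus one id per vector (ignores centroids and metadata)."""
    return n_vectors * (code_bytes + id_bytes)


def servers_needed(total_bytes: int, bytes_per_server: int) -> int:
    """Ceiling division: number of servers needed to hold total_bytes."""
    return -(-total_bytes // bytes_per_server)


N_TOKENS = 1_500_000_000_000      # from the plan above
TOKENS_PER_VECTOR = 1             # assumption: one vector per token
CODE_BYTES = 16                   # assumption: 16-byte compressed code per vector
BYTES_PER_SERVER = 2 * 10**12     # assumption: 2 TB of usable disk per server

n_vectors = N_TOKENS // TOKENS_PER_VECTOR
total = index_size_bytes(n_vectors, CODE_BYTES)
n_servers = servers_needed(total, BYTES_PER_SERVER)

print(f"{total / 1e12:.0f} TB across at least {n_servers} servers")
```

Under these assumptions the codes and ids alone come to about 36 TB. Changing `TOKENS_PER_VECTOR` (for example, one vector per chunk of text instead of per token) or `CODE_BYTES` moves that number proportionally.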

I'm mostly working off of this article: Indexing 1T Vectors (https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors). I think the data is too big for AutoFaiss, but I can use that for experiments.