Design considerations for a RAG application in production

3 years ago

I've been working on a robust Q&A app for enterprise use, with LlamaIndex (+ LangChain) as the pipeline. I started out with Chroma as my vector DB, which worked pretty well, but I realized that my app runs faster when I load a pre-built index from an S3 bucket than when I store my embeddings in Chroma and generate the index from those embeddings at query time. Am I making any tradeoffs by using a pre-built index in S3 rather than a vector DB to stash embeddings? Has anyone come across this kind of consideration? I've looked at Weaviate (which offers hybrid search) but haven't decided whether to retool my code around it. Basically, I'm looking for whichever implementation gives the fastest response times (the knowledge base is 'large', ~40GB).
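To make the comparison concrete, here is a minimal sketch of the "build once, load at query time" pattern. It is not LlamaIndex's or Chroma's API: the function names (`build_index`, `search`) and the toy 2-dimensional embeddings are hypothetical, and `pickle` bytes stand in for the object that would live in S3. The point is only that normalization and index construction happen offline, so the query path just deserializes and scores.

```python
import math
import pickle


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_index(embeddings):
    """Offline step: turn {doc_id: vector} into a list of (doc_id, unit_vector)."""
    return [(doc_id, normalize(vec)) for doc_id, vec in embeddings.items()]


def search(index, query_vec, top_k=3):
    """Query step: brute-force cosine similarity against the pre-built index."""
    q = normalize(query_vec)
    scored = [(sum(a * b for a, b in zip(q, vec)), doc_id) for doc_id, vec in index]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Offline: build once and serialize (the bytes are a stand-in for an S3 object).
embeddings = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [0.7, 0.7]}
blob = pickle.dumps(build_index(embeddings))

# Query time: deserialize and search, with no index construction on the hot path.
index = pickle.loads(blob)
results = search(index, [0.9, 0.1], top_k=2)
print(results)
```

The tradeoff this hints at: a serialized index is fast to read but has to be rebuilt and re-uploaded whenever documents change, whereas a vector DB is designed for incremental inserts, deletes, and metadata filtering.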

RE Weaviate, this looks interesting: https://github.com/openai/openai-cookbook/blob/main/examples/vector_databases/weaviate/hybrid-search-with-weaviate-and-openai.ipynb
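For anyone unfamiliar with hybrid search, the rough idea is to merge a keyword ranking with a vector ranking. Below is a minimal sketch using reciprocal rank fusion, one common way to combine two ranked lists. This is an illustration of the general technique, not a claim about how Weaviate implements it internally; the function name, the `k=60` constant, and the document IDs are assumptions made for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores 1 / (k + rank) for every list it appears in;
    documents ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword (BM25-style) retriever and a vector retriever.
keyword_hits = ["doc_b", "doc_a", "doc_d"]
vector_hits = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear in both lists (`doc_a`, `doc_b`) end up ahead of documents that only one retriever found.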

On a related note, has anyone embedded a large amount of data like this before? I estimated the total time on CPU at ~29 hours. I've seen demos that cut this to minutes using GPUs: https://www.anyscale.com/blog/build-and-scale-a-powerful-query-engine-with-llamaindex-ray
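For reference, an estimate like this can be produced by timing a small sample and extrapolating. The sketch below shows that arithmetic plus a simple batching helper. All numbers are made up for illustration (they are not measurements from this project), and the `workers` parameter assumes throughput scales linearly across parallel workers, which real hardware will only approximate.

```python
def estimate_embedding_hours(total_chunks, sample_chunks, sample_seconds, workers=1):
    """Extrapolate total embedding time from a timed sample.

    Assumes constant per-chunk cost and linear scaling across workers.
    """
    chunks_per_second = sample_chunks / sample_seconds
    return total_chunks / (chunks_per_second * workers) / 3600


def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical sample: 1,000 chunks embedded in 100 seconds on one CPU worker.
single = estimate_embedding_hours(1_000_000, sample_chunks=1_000, sample_seconds=100)
parallel = estimate_embedding_hours(1_000_000, 1_000, 100, workers=8)
print(f"1 worker: {single:.1f} h, 8 workers: {parallel:.1f} h")

# Batching keeps each embedding call at a fixed size.
batches = list(batched(["c1", "c2", "c3", "c4", "c5"], batch_size=2))
```

Re-running the sample on the target hardware (CPU vs. GPU, different batch sizes) gives a quick sanity check before committing to a full run.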