Sycamore is an LLM-powered semantic data preparation system for building search applications. It introduces the DocSet, a distributed, set-based abstraction that makes processing a large document collection as easy as reading a single document. With Sycamore, you can use LLMs to transform and enrich your unstructured data and prepare it for search. It comes with a scalable distributed runtime, built on Ray, that helps you go from prototype to production.
For example, with Sycamore, you can read a collection of PDFs, partition them into coherent chunks, extract entities such as titles and authors, compute vector embeddings, and load the results into a local OpenSearch cluster, all in a few lines of code.
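To give a feel for the shape of such a pipeline, here is a toy, pure-Python sketch. It is not Sycamore's API: `ToyDocSet`, `Doc`, `extract_title`, `partition`, and `embed` are hypothetical names made up for illustration, the "entity extraction" is a first-line rule rather than an LLM call, the "embedding" is a two-number placeholder, and the final step collects into a list instead of loading an OpenSearch cluster. See the docs for the real calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Doc:
    """A piece of text plus the metadata attached to it so far."""
    text: str
    properties: dict = field(default_factory=dict)


class ToyDocSet:
    """Hypothetical stand-in for a DocSet: a lazy chain of per-document transforms."""

    def __init__(self, docs: Iterable[Doc]):
        self._docs = docs

    def map(self, fn: Callable[[Doc], Doc]) -> "ToyDocSet":
        # One document in, one document out.
        return ToyDocSet(fn(d) for d in self._docs)

    def flat_map(self, fn: Callable[[Doc], List[Doc]]) -> "ToyDocSet":
        # One document in, zero or more documents out (e.g. chunks).
        return ToyDocSet(out for d in self._docs for out in fn(d))

    def collect(self) -> List[Doc]:
        # Nothing runs until the results are requested.
        return list(self._docs)


def extract_title(doc: Doc) -> Doc:
    # Toy rule: treat the first line as the title (an LLM would do this in practice).
    title = doc.text.strip().splitlines()[0]
    return Doc(doc.text, {**doc.properties, "title": title})


def partition(doc: Doc) -> List[Doc]:
    # Split on blank lines; every chunk inherits the parent's properties.
    chunks = [c.strip() for c in doc.text.split("\n\n") if c.strip()]
    return [Doc(c, dict(doc.properties)) for c in chunks]


def embed(doc: Doc) -> Doc:
    # Placeholder "embedding": character count and word count.
    vec = [float(len(doc.text)), float(len(doc.text.split()))]
    return Doc(doc.text, {**doc.properties, "embedding": vec})


raw = [
    Doc("Intro to Ray\n\nRay runs Python tasks in parallel."),
    Doc("Search Basics\n\nAn index maps terms to documents.\n\nQueries are matched against it."),
]

# The same chain works whether `raw` holds two documents or two million.
index = ToyDocSet(raw).map(extract_title).flat_map(partition).map(embed).collect()

for chunk in index:
    print(chunk.properties["title"], "|", chunk.text)
```

The point of the sketch is the chaining: each step is written as if for one document, and the set abstraction applies it to the whole collection.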
To learn more, visit the repo: https://github.com/aryn-ai/sycamore, docs: https://sycamore.readthedocs.io/, and demo: https://www.loom.com/share/53e68b0eb5ab49948111a3fcf6286b7f?...
We’d love for you to try it out, give us feedback, and contribute.