It is about running a series of transformations, defined in SQL, on many tables, with the following constraints:
- All the data can fit in the main memory of a single server (a few TB)
- The data does not change during processing (batch processing; all source files and all output files are Parquet files)
- Only one user at a time runs the transformations (no need for transactions or isolation)
- No need for persistence during the computation of intermediate tables
From my understanding of how a database works, that should remove a lot of the bookkeeping that a standard database has to do in order to provide transactions and ACID properties. In return, one can expect an improvement in performance. I am looking for a kind of "memory-only database".
DuckDB [1] looks like it should work, but I am not sure how mature it is.
[1] https://duckdb.org/ [2] https://arrow.apache.org/