Ask HN: Best practice for in-memory large data transformation

1 point

4 years ago

Do any of you have recommendations on how to set up an in-memory data transformation system? I have been looking at DuckDB [1], but there may be interesting alternatives. Apache Arrow [2] also seems promising in some ways.

The workload is a series of transformations, defined in SQL, over many tables, with these constraints:

- All the data can fit in the main memory of a single server (a few TB)

- The data does not change during processing (batch processing: all source files and all output files are Parquet files)

- Only one user at a time runs the transformations (no need for transactions or isolation)

- No need for persistence during the computation of intermediate tables

From my understanding of how a database works, these constraints should remove much of the bookkeeping that a standard database does to provide transactions and ACID properties, and in return one can expect better performance. I am looking for a kind of "memory-only database".
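To make the workflow concrete, here is a minimal sketch of what I have in mind, assuming DuckDB's Python package and its in-memory mode. The function names (`load_statement`, `export_statement`, `run_pipeline`) and the table and file names in the usage note are hypothetical, and I have not benchmarked this at the multi-TB scale described above.

```python
def load_statement(table: str, parquet_path: str) -> str:
    """Build SQL that materializes one Parquet file as an in-memory table."""
    return f"CREATE TABLE {table} AS SELECT * FROM read_parquet('{parquet_path}')"


def export_statement(table: str, parquet_path: str) -> str:
    """Build SQL that writes one table back out as a Parquet file."""
    return f"COPY {table} TO '{parquet_path}' (FORMAT PARQUET)"


def run_pipeline(sources: dict, transforms: list, outputs: dict) -> None:
    """Load Parquet inputs, run SQL transformations, export Parquet outputs.

    sources:    {table_name: input_parquet_path}   (hypothetical names)
    transforms: SQL statements, executed in order
    outputs:    {table_name: output_parquet_path}
    """
    # Imported here so the SQL helpers above work without DuckDB installed.
    import duckdb

    # ":memory:" keeps the whole database in RAM; nothing is persisted.
    con = duckdb.connect(":memory:")
    try:
        for table, path in sources.items():
            con.execute(load_statement(table, path))
        for sql in transforms:
            con.execute(sql)
        for table, path in outputs.items():
            con.execute(export_statement(table, path))
    finally:
        con.close()
```

A call would look something like `run_pipeline({"orders": "orders.parquet"}, ["CREATE TABLE daily AS SELECT order_date, sum(amount) AS total FROM orders GROUP BY order_date"], {"daily": "daily.parquet"})`. The table names and paths are interpolated directly into the SQL, so this only makes sense for trusted, single-user batch jobs like the one described.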

DuckDB looks like it should work, but I am not sure how mature it is.

[1] https://duckdb.org/

[2] https://arrow.apache.org/