To perform downstream analysis, informaticians MUST use a parser to work with the data. This presents a barrier and also makes standardization harder.
- e.g. many samples in VCF
- e.g. many ploids in ALT
While storing the data sparsely throughout the file is an alternative, this would drastically increase the number of rows and would negatively impact the human-readability of the file (although perhaps not as severely as the delimiters do).
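To make the parsing burden concrete, here is a minimal sketch of a hand-rolled parser for one delimited record, next to the same data as nested JSON. The record, the sample names, and the `parse_record` helper are made up for illustration, and the parser ignores most of the VCF specification (missing values, header handling, INFO fields).

```python
import json

# Hypothetical, simplified VCF-style data line: two ALT alleles, two samples.
# Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample_1 sample_2
line = "chr1\t12345\t.\tA\tG,T\t50\tPASS\t.\tGT:DP\t0/1:12\t1/2:30"
sample_names = ["sample_1", "sample_2"]  # normally read from the #CHROM header line


def parse_record(line, sample_names):
    """Split one delimited record into a nested dict (illustrative only)."""
    cols = line.split("\t")
    format_keys = cols[8].split(":")  # e.g. ["GT", "DP"]
    samples = {}
    for name, raw in zip(sample_names, cols[9:]):
        fields = dict(zip(format_keys, raw.split(":")))
        # Unphased genotypes use "/" and phased use "|" between allele indices.
        fields["GT"] = [int(a) for a in fields["GT"].replace("|", "/").split("/")]
        fields["DP"] = int(fields["DP"])
        samples[name] = fields
    return {
        "chrom": cols[0],
        "pos": int(cols[1]),
        "ref": cols[3],
        "alt": cols[4].split(","),  # multiple alternate alleles
        "samples": samples,
    }


record = parse_record(line, sample_names)

# The same record stored as nested JSON needs no custom delimiter logic.
as_json = json.dumps(record)
assert json.loads(as_json) == record
```

Three different delimiters (tab, colon, slash or pipe, plus comma for ALT) are needed just for this one line, which is the barrier described above.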
---
PROPOSAL: If the Parquet file format is adopted, sub-fields could instead be nested as a JSON field. The JSON could then be read using any of the following:
A) Python = Pandas
> pandas.read_parquet()
> pandas.io.json.json_normalize()
| https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
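As a rough sketch of option A, the snippet below flattens a JSON column into one row per variant and sample. The column names and values are hypothetical, and the frame is built in memory as a stand-in for what `pandas.read_parquet()` might return, so the example runs without a Parquet file. It uses `pandas.json_normalize`, the top-level name for `pandas.io.json.json_normalize` in pandas 1.0 and later.

```python
import json

import pandas as pd

# Stand-in for the frame that pandas.read_parquet("somefile.parquet") might
# return: one row per variant, with per-sample data in a JSON string column.
# The column names and values are hypothetical.
df = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1"],
        "pos": [12345, 67890],
        "samples": [
            json.dumps(
                [
                    {"name": "sample_1", "GT": "0/1", "DP": 12},
                    {"name": "sample_2", "GT": "1/2", "DP": 30},
                ]
            ),
            json.dumps([{"name": "sample_1", "GT": "0/0", "DP": 25}]),
        ],
    }
)

# Decode the JSON strings into Python lists of dicts.
decoded = df.assign(samples=df["samples"].map(json.loads))

# Flatten: one output row per (variant, sample), carrying chrom/pos along.
flat = pd.json_normalize(
    decoded.to_dict(orient="records"),
    record_path="samples",
    meta=["chrom", "pos"],
)
```

If the nested field were stored as a native Parquet list/struct type rather than a JSON string, the `json.loads` step would presumably not be needed, but that depends on how the file is written.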
Supplementally, you could make a hybrid Koalas DataFrame for use with Spark.
B) R = arrow + Tidyverse
> arrow::read_parquet("somefile.parquet", as_tibble = TRUE)
# It is not obvious how to read a nested JSON column from a tibble.
| https://stackoverflow.com/a/55165161/5739514
C) Spark SQL explode()
> df.select(explode("nested_col"))  # nested_col is a placeholder column name
| https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html
The dataframe could then be fed into downstream analytics like scikit-learn, TensorFlow, caret, as well as generic batch/stream/containerized workflows.
In addition to language-agnostic interoperability, the Parquet file would also come with the added benefits of:
(1) Partitioning/sharding, especially on HDFS.
(2) Delta Lake capabilities: longitudinal versioning (time travel), schema enforcement, and ACID transactions.