Genomics Parquet-VCF; nested JSON (read as dataframes) vs. delimited strings

7 years ago

In genomic variant file formats, many sub-fields are typically concatenated into one column (as special-character-separated or delimited strings).

To perform downstream analysis, informaticians MUST run the data through a parser before they can work with it. This presents a barrier to entry and also makes standardization harder.

- e.g. many samples in VCF

- e.g. many ploids in ALT
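To make the parsing burden concrete, here is a minimal sketch of what it takes to unpack one VCF-style line using only the Python standard library. The line itself, the sample names, and the `parse_vcf_line` helper are made up for illustration; a real parser would also need the header for types and would have to handle missing values.

```python
# One VCF-style data line (tab-separated). All values are illustrative only.
line = "20\t14370\t.\tG\tA,T\t29\tPASS\tNS=3;DP=14;AF=0.5,0.25;DB\tGT:GQ:DP\t0|1:48:8\t1|2:43:5"


def parse_vcf_line(line, sample_names):
    """Split a VCF-style line into a nested dict (hypothetical helper, no type handling)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info, fmt = fields[:9]

    # INFO is ';'-separated key=value pairs; flags such as DB have no value.
    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_dict[key] = value.split(",") if sep else True

    # Each sample column is ':'-separated, with keys given by the FORMAT column.
    fmt_keys = fmt.split(":")
    samples = {
        name: dict(zip(fmt_keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }

    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt.split(","),  # ALT is ','-separated
        "info": info_dict,      # values stay as lists of strings
        "samples": samples,
    }


record = parse_vcf_line(line, ["S1", "S2"])
print(record["alt"], record["samples"]["S2"]["GT"])
```

That is four different delimiters (tab, semicolon, comma, colon) in a single row, before any type conversion.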

While storing the data sparsely throughout the file is an alternative, this would drastically increase the number of rows and would negatively impact the human-readability of the file (although perhaps not as severely as the delimiters do).

---

PROPOSAL: If the Parquet file format were adopted, sub-fields could instead be nested in a JSON field. The JSON could then be read using any of:

A) Python = Pandas

> pandas.read_parquet()

> pandas.io.json.json_normalize()

| https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html

Additionally, you could make a hybrid Koalas df for use with Spark.
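A minimal sketch of the flattening step in option A, assuming pandas >= 1.0 (where the function is also exposed as `pandas.json_normalize`) and assuming the nested column has already been loaded as Python dicts. The record layout and field names below are made up for illustration and are not a proposed schema.

```python
import pandas as pd

# Hypothetical nested records, one per variant; values are illustrative only.
records = [
    {
        "chrom": "20",
        "pos": 14370,
        "ref": "G",
        "alt": ["A", "T"],
        "info": {"DP": 14, "AF": [0.5, 0.25]},
        "samples": [
            {"id": "S1", "GT": "0|1"},
            {"id": "S2", "GT": "1|2"},
        ],
    }
]

# Nested dicts become dotted columns such as "info.DP"; lists are kept as objects.
flat = pd.json_normalize(records)

# One row per sample, carrying the variant-level fields along as metadata.
per_sample = pd.json_normalize(records, record_path="samples", meta=["chrom", "pos"])

print(flat.columns.tolist())
print(per_sample)
```

The `record_path`/`meta` form is the one that gives a tidy one-row-per-sample frame without any string splitting.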

B) R = Arrow + Tidyverse

> arrow::read_parquet("somefile.parquet", as_tibble = TRUE)

# it's not obvious how to read a nested json column from tibble df.

| https://stackoverflow.com/a/55165161/5739514

C) Spark SQL explode()

> df.select(explode("nested_col"))  # nested_col is a placeholder for the array/map column

| https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html
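For anyone unfamiliar with `explode()`, here is a plain-Python sketch of the row-per-element behaviour it provides. This is not Spark code, just a stand-in for the semantics; the `explode_rows` helper and the records are hypothetical.

```python
def explode_rows(rows, column):
    """Yield one output row per element of row[column], copying the other fields."""
    for row in rows:
        for element in row[column]:
            out = dict(row)        # shallow copy of the variant-level fields
            out[column] = element  # replace the list with a single element
            yield out


# Hypothetical nested records; values are illustrative only.
variants = [
    {"chrom": "20", "pos": 14370,
     "samples": [{"id": "S1", "GT": "0|1"}, {"id": "S2", "GT": "1|2"}]},
    {"chrom": "20", "pos": 17330,
     "samples": [{"id": "S1", "GT": "0|0"}]},
]

exploded = list(explode_rows(variants, "samples"))
for row in exploded:
    print(row["pos"], row["samples"]["id"], row["samples"]["GT"])
```

Two variants become three rows here, which is the same row growth you would get from storing the data un-nested in the file, except that it only happens at query time.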

The dataframe could then be fed into downstream analytics like scikit-learn, TensorFlow, or caret, as well as generic batch, streaming, or containerized workflows.

In addition to language-agnostic interoperability, the Parquet file would also come with the added benefits of:

(1) Partitioning/sharding, especially on HDFS.

(2) Delta Lake capabilities: longitudinal versioning (time travel), schema enforcement, and ACID transactions.