Ask HN: Best Tools for Data Analysis with Unstructured JSON Documents?

5 years ago

Hi HN,

In 2021 I'm working to open source a behemoth project I've poured over 1,500 hours into. It relates to US Congress bill discovery and analysis (similar to, but different from, GovTrack).

My next major step is to write a data dictionary to bring organization to the undefined/unstructured chaos. The goal is that anyone can quickly start hacking on their own applications with the data and conduct their own analyses, without needing a poli-sci degree to do so. I'd be thrilled if a high school student could pick the data up and start hacking.

Here is an example schema: https://i.imgur.com/Qsoa1aj.png

Currently I use a relational database, and although JSON querying works fine, it isn't exactly easy to build statistical analyses on the fly. Here are some questions I can answer, but not quickly:

1. What's the entire list of unique bill attributes that have ever existed in the dataset? What about only for 2019?

2. How many times was X attribute used in 2019? What was every possible value for it?

3. For all bills and all actions ever recorded, how many unique types of actions have been recorded? (e.g., tabling a bill, holding a vote, passing to committee, etc.)

4. Which bill was most "popular" (most referenced by other bills) in 2020?
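
To make the four questions concrete, here is a minimal Python sketch using only the standard library. Everything about the data shape is an assumption made for illustration: the field names (`id`, `year`, `actions`, `type`, `references`) and the sample bills are hypothetical and not taken from the real schema, and "popular in 2020" is read as "most referenced by bills dated 2020".

```python
import json
from collections import Counter, defaultdict

# Hypothetical sample documents; the real field names will differ.
BILLS = [
    {"id": "hr1", "year": 2019, "sponsor": "A", "subjects": ["tax"],
     "actions": [{"type": "introduced"}, {"type": "vote"}],
     "references": []},
    {"id": "hr2", "year": 2019, "sponsor": "B", "cosponsors": 3,
     "actions": [{"type": "introduced"}, {"type": "tabled"}],
     "references": ["hr1"]},
    {"id": "s1", "year": 2020, "sponsor": "A",
     "actions": [{"type": "introduced"}],
     "references": ["hr1", "hr2"]},
    {"id": "s2", "year": 2020, "sponsor": "C",
     "actions": [{"type": "vote"}],
     "references": ["hr2", "s1"]},
]


def attribute_usage(bills, year=None):
    """Questions 1 and 2: count bills per top-level attribute and collect its values."""
    counts = Counter()
    values = defaultdict(set)
    for bill in bills:
        if year is not None and bill.get("year") != year:
            continue
        for key, value in bill.items():
            counts[key] += 1
            # Serialize so that lists and dicts can be stored in a set.
            values[key].add(json.dumps(value, sort_keys=True))
    return counts, values


def action_types(bills):
    """Question 3: count every action type across all bills."""
    return Counter(
        action["type"]
        for bill in bills
        for action in bill.get("actions", [])
    )


def most_referenced(bills, year):
    """Question 4: the bill referenced most often by bills dated `year`."""
    refs = Counter(
        ref
        for bill in bills
        if bill.get("year") == year
        for ref in bill.get("references", [])
    )
    return refs.most_common(1)


if __name__ == "__main__":
    counts_2019, values_2019 = attribute_usage(BILLS, year=2019)
    print(sorted(counts_2019))             # attributes seen in 2019
    print(counts_2019["sponsor"], values_2019["sponsor"])
    print(len(action_types(BILLS)))        # number of distinct action types
    print(most_referenced(BILLS, 2020))
```

This only walks top-level keys; nested attributes would need a recursive walk, and whether this approach stays fast at the size of the full dataset is untested here.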

I have experience with Elasticsearch, MongoDB, et al. and am intrigued by Typesense. But since I don't often work with statistical analysis, I humbly ask the community whether there are tools I should be considering to answer the above questions (quickly!).

Cheers!