Show HN: Automatically extract data from APIs with dlt and OpenAPI

2 years ago

Hi Show HN, we are Dave, Marcin, Alena, and Adrian, authors of data load tool (dlt), a Python library that automatically creates datasets from any kind of messy, unstructured data.

We launched dlt on HN 7 months ago with a mission to make getting datasets fast and easy. Now dlt users build around a thousand new data sources each month and maintain many thousands of live datasets in production.

Today we are releasing *dlt-init-openapi*, a Python CLI tool that generates a dlt data pipeline from any OpenAPI spec. It brings the time to create a dataset down to a few minutes.

Here’s a Colab demo: https://colab.research.google.com/drive/1MRZvguOTZj1MlkEGzji...

Here’s a video walkthrough: https://youtu.be/b99qv9je12Q

---

In the past, you had to analyze REST API endpoints, response data types, and pagination styles, then write lots of custom Python code to create a dataset. OpenAPI specifications standardize API definitions, which makes APIs easier to work with. dlt-init-openapi uses these specs to automate the manual work that dataset creation used to require. OpenAPI is a widely adopted and still growing standard, and frameworks like FastAPI generate a spec by default.
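To illustrate why a spec makes this automatable, here is a minimal sketch that lists the GET endpoints and their query parameters from an OpenAPI document already parsed into a Python dict. The spec fragment and the function name `list_get_endpoints` are made up for illustration; this is not how dlt-init-openapi is implemented.

```python
# Hypothetical, heavily trimmed OpenAPI 3 document (illustration only).
SPEC = {
    "openapi": "3.0.0",
    "paths": {
        "/pokemon": {
            "get": {
                "parameters": [
                    {"name": "limit", "in": "query"},
                    {"name": "offset", "in": "query"},
                ]
            }
        },
        "/pokemon/{id}": {
            "get": {"parameters": [{"name": "id", "in": "path"}]},
            "delete": {},
        },
    },
}


def list_get_endpoints(spec):
    """Map each path that supports GET to the names of its query parameters."""
    endpoints = {}
    for path, operations in spec.get("paths", {}).items():
        get_op = operations.get("get")
        if get_op is None:
            continue  # this path has no GET operation, so nothing to extract
        endpoints[path] = [
            param["name"]
            for param in get_op.get("parameters", [])
            if param.get("in") == "query"
        ]
    return endpoints


print(list_get_endpoints(SPEC))
```

Everything a pipeline generator needs to start from (paths, methods, parameters) is plain structured data in the spec, so no endpoint has to be explored by hand.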

What can dlt-init-openapi do for you?

- It generates Python scripts with dlt pipelines that you can run to pull data from your API into a structured destination of your choice (Parquet files, SQL DBs, Databricks, Snowflake, etc.)

- Infers and evolves the schema for each endpoint from the actual data

- Discovers pagination style for each endpoint

- Finds and unwraps data entities for each endpoint, even in deeply nested JSON responses

- Discovers the primary key for each entity

- Discovers the authentication scheme and generates code and config files to pass the required credentials

- You always have the last say. The generated code is declarative and ready to hack in case we pick the wrong paginator or response entity.
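To give a feel for what discovering a paginator, a data entity, and a primary key can involve, here is a minimal sketch of three naive heuristics. All function names, paginator labels, and the sample response are hypothetical, and the rules are deliberately simplistic; they are assumptions for illustration, not the logic dlt-init-openapi actually uses.

```python
def guess_paginator(query_params):
    """Guess a pagination style from an endpoint's query parameter names."""
    names = set(query_params)
    if {"limit", "offset"} <= names:
        return "offset"
    if "page" in names:
        return "page_number"
    if "cursor" in names or "after" in names:
        return "cursor"
    return "single_page"


def unwrap_entities(response, path=()):
    """Depth-first search for the first non-empty list of dicts.

    Returns (path_to_list, items), or None if no such list exists.
    """
    if (
        isinstance(response, list)
        and response
        and all(isinstance(item, dict) for item in response)
    ):
        return path, response
    if isinstance(response, dict):
        for key, value in response.items():
            found = unwrap_entities(value, path + (key,))
            if found is not None:
                return found
    return None


def guess_primary_key(items):
    """Pick the first candidate field that is present and unique in every item.

    Assumes candidate values are hashable (ints or strings).
    """
    for candidate in ("id", "uuid", "key"):
        values = [item.get(candidate) for item in items]
        if all(v is not None for v in values) and len(set(values)) == len(values):
            return candidate
    return None


# Hypothetical nested response, used only to exercise the heuristics.
sample = {
    "count": 2,
    "data": {"results": [{"id": 1, "name": "bulbasaur"}, {"id": 2, "name": "ivysaur"}]},
}

entity_path, items = unwrap_entities(sample)
print(guess_paginator(["limit", "offset"]), entity_path, guess_primary_key(items))
```

Heuristics like these can guess wrong, which is why it matters that the generated code is declarative and easy to edit.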

The tool and dlt are both open source. Find the code here: https://github.com/dlt-hub/dlt-init-openapi and here: https://github.com/dlt-hub/dlt