Show HN: Streaming DataFrames–a Pandas-like syntax for real-time data

8 points

2 years ago

Hey all! We’ve built a Pandas-like interface to make it easy to work with streaming data using what we call ‘Streaming DataFrames’.

For example, suppose that you want to convert speed measurements from meters per second to kilometers per hour.

With static data in Pandas, you’d do this:

df["speed_km_h"] = df["speed_m_s"] * 3.6

With Streaming DataFrames, it’s pretty much the same thing…

sdf["speed_km_h"] = sdf["speed_m_s"] * 3.6

…except it’s being done continuously and the updated records can be sent to an output topic in Kafka with almost no delay after they’ve been processed.
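To make the "continuously" part a bit more concrete, here is a minimal plain-Python sketch of the same conversion applied one record at a time as records arrive. It does not use Quix Streams or Kafka at all; the `convert_stream` helper and the sample records are hypothetical and only illustrate the idea of per-record processing.

```python
from typing import Dict, Iterable, Iterator


def convert_stream(records: Iterable[Dict[str, float]]) -> Iterator[Dict[str, float]]:
    """Yield each record with a km/h field added, one record at a time."""
    for record in records:
        # Copy so the incoming record is not mutated.
        out = dict(record)
        out["speed_km_h"] = out["speed_m_s"] * 3.6
        yield out


# Hypothetical input; in a real pipeline these would arrive from a topic.
incoming = [{"speed_m_s": 10.0}, {"speed_m_s": 25.0}]
converted = list(convert_stream(incoming))
```

Because `convert_stream` is a generator, each record is transformed as soon as it is pulled from the input, rather than after the whole dataset has been loaded.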

You can also do a stateful tumbling window like this:

from datetime import timedelta
sdf = sdf.apply(lambda row: row["speed_km_h"]).tumbling_window(timedelta(seconds=30), grace_ms=timedelta(seconds=1)).mean().final()
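If tumbling windows are new to you, the rough idea is fixed-size, non-overlapping time buckets with one aggregate per bucket. Here is a small plain-Python sketch of that idea, not the library's implementation: the `tumbling_mean` function and the sample events are hypothetical, it works on a finished list rather than a live stream, and it ignores the grace period and late-arriving records entirely.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tumbling_mean(
    events: Iterable[Tuple[int, float]], window_ms: int
) -> List[Tuple[int, int, float]]:
    """Group (timestamp_ms, value) pairs into fixed, non-overlapping windows
    and return (window_start, window_end, mean) for each non-empty window."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in events:
        # Each timestamp belongs to exactly one window.
        start = ts - (ts % window_ms)
        buckets[start].append(value)
    return [
        (start, start + window_ms, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical (timestamp_ms, speed_km_h) events.
events = [(1_000, 36.0), (12_000, 54.0), (31_000, 90.0)]
results = tumbling_mean(events, window_ms=30_000)
```

With 30-second windows, the first two events fall into the window starting at 0 ms and the third into the window starting at 30,000 ms, so `results` is `[(0, 30000, 45.0), (30000, 60000, 90.0)]`.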

I know some peeps might be thinking “eeeeww Pandas”, but we wanted to make it more user-friendly for data and ML folks who are interested in moving from batch to real-time or near-real-time processing. Of course there’s Flink and Spark, which are great stream processing tools… but you need specialised talent to operate them. We think there’s a need for something that’s easier for generalists to learn, which is why it’s strictly Python only.

It doesn’t wrap any other technology, so you don’t need to do any cross-language debugging or include SQL statements in your Python code.

If you want to learn more about how it works, check out our launch blog: https://quix.io/blog/introducing-streaming-dataframes

The GitHub repo is here: https://github.com/quixio/quix-streams