Ask HN: Can a large model be trained directly on a db's binlog?

1 point

martythemaniak

2 years ago

At a high level, traditional ML architectures look something like this:

1. Have some services with a database. Users do stuff, services update the database.

2. The service emits some kind of event describing what changed.

3. The stream of events gets consumed and stored.

4. You write a bunch of aggregations on those events and call them "features".

5. Models get trained on these features.

6. Features get calculated every time there's a new event. You run the trained model on the newest feature values to try and predict something useful.
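To make steps 4 and 6 concrete, here's a rough sketch of what the hand-written aggregation usually looks like. The Event fields, the "order_placed" event kind, and the feature names are all made up for illustration; a real pipeline would run this in a stream processor or a feature store rather than a plain loop.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical event schema, invented for this example.
    user_id: int
    kind: str
    amount: float
    ts: int  # event time as a unix timestamp


def compute_features(events):
    """Fold a stream of events into per-user, hand-picked features."""
    feats = defaultdict(
        lambda: {"order_count": 0, "total_spend": 0.0, "last_seen": 0}
    )
    for e in events:
        f = feats[e.user_id]
        if e.kind == "order_placed":
            f["order_count"] += 1
            f["total_spend"] += e.amount
        # Every event, whatever its kind, refreshes the recency feature.
        f["last_seen"] = max(f["last_seen"], e.ts)
    return dict(feats)


events = [
    Event(user_id=1, kind="order_placed", amount=20.0, ts=100),
    Event(user_id=1, kind="page_view", amount=0.0, ts=150),
    Event(user_id=2, kind="order_placed", amount=5.0, ts=120),
]
features = compute_features(events)
```

Every key in that dict is a decision a person made about what might matter, which is the manual work I'm talking about.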

There's a second path, where you might replicate the db in a warehouse, then run ETLs to produce features, but the outline is very similar. Either way, a lot of manual work goes into producing the features that models get trained and run on.

This vaguely reminds me of the way computer vision used to be done - you'd manually run some algorithm to do edge detection, etc., then try to operate on the results. But it turns out that training a large model on a giant pile of images lets the model create and learn its own features from the raw data. So I'm curious - is there an analog here? Can a sufficiently large model be trained directly on the binlog of a database and learn its own features?
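For what it's worth, here's a minimal sketch of how I imagine the input side could look: flatten each row change into a short token sequence and feed the concatenated stream to a sequence model. The dict format for a row change (op, table, before, after) is an assumption for illustration, not the actual binlog format of any database, and the token scheme is just one naive option. Whether a model learns anything useful from this is exactly the open question.

```python
def row_event_to_tokens(event):
    """Turn one row-change record into a flat list of string tokens."""
    tokens = [f"<{event['op']}>", f"table={event['table']}"]
    before = event.get("before") or {}
    after = event.get("after") or {}
    # Emit only the columns whose value actually changed.
    for col in sorted(set(before) | set(after)):
        old, new = before.get(col), after.get(col)
        if old != new:
            tokens.append(f"{col}:{old}->{new}")
    tokens.append("<end>")
    return tokens


def encode(events):
    """Map the whole change stream to integer ids, building a vocab as we go."""
    all_tokens = [t for e in events for t in row_event_to_tokens(e)]
    vocab = {}
    for tok in all_tokens:
        vocab.setdefault(tok, len(vocab))
    return [vocab[t] for t in all_tokens], vocab


# Hypothetical change records, as they might look after decoding a binlog.
changes = [
    {"op": "INSERT", "table": "orders", "before": None,
     "after": {"id": 1, "status": "new"}},
    {"op": "UPDATE", "table": "orders",
     "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}},
]
ids, vocab = encode(changes)
```

The obvious weakness is that raw values (ids, amounts, timestamps) blow up the vocabulary, so a real attempt would probably need to bucket numbers or tokenize at the byte level.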
