1. Have some services with a database. Users do stuff, services update the database.
2. The service emits some kind of event describing what changed.
3. The stream of events gets consumed and stored.
4. You write a bunch of aggregations on those events and call them "features".
5. Models get trained on these features.
6. Features get calculated every time there's a new event. You run the trained model on the newest feature values to try and predict something useful.
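To make the steps above concrete, here's a rough Python sketch of the event-driven path. Everything in it is made up for illustration: the `Event` fields, the `FeatureStore` class, the feature names, the one-hour window, and especially `predict_risky`, which is a fixed rule standing in for a trained model. It's only meant to show where the hand-written aggregation work sits, not how any real feature store does it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical event shape emitted by a service (step 2).
    user_id: str
    kind: str      # e.g. "purchase" or "refund" (made-up event types)
    amount: float
    ts: float      # seconds since some epoch


class FeatureStore:
    """Stores consumed events (step 3) and computes hand-written aggregations (step 4)."""

    def __init__(self):
        self.events = defaultdict(list)

    def consume(self, event):
        self.events[event.user_id].append(event)

    def features(self, user_id, now, window=3600.0):
        # Each entry below is an aggregation a person had to think up and write by hand.
        recent = [e for e in self.events[user_id] if now - e.ts <= window]
        purchases = [e.amount for e in recent if e.kind == "purchase"]
        return {
            "purchase_count_1h": len(purchases),
            "purchase_total_1h": sum(purchases),
            "refund_count_1h": sum(1 for e in recent if e.kind == "refund"),
        }


def predict_risky(features):
    # Stand-in for a trained model (steps 5 and 6): a fixed rule, not a learned one.
    return features["refund_count_1h"] >= 2 and features["purchase_total_1h"] < 50.0


store = FeatureStore()
for ev in [
    Event("u1", "purchase", 20.0, ts=100.0),
    Event("u1", "refund", 20.0, ts=200.0),
    Event("u1", "refund", 5.0, ts=300.0),
]:
    store.consume(ev)                               # step 3: consume and store
    feats = store.features(ev.user_id, now=ev.ts)   # step 6: recompute on every new event
    print(feats, predict_risky(feats))              # step 6: score the newest feature values
```

The part I'm asking about is the body of `features`: every key in that dict is something a human decided was worth computing.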
There's a second path, where you replicate the db into a warehouse and then run ETLs to produce features, but the outline is very similar. Either way, a lot of manual work goes into producing features that models can be trained and run on.
This vaguely reminds me of the way computer vision used to be done - you'd manually run some algorithm to do edge detection, etc., then try to operate on the results. But it turns out that training a large model on a giant pile of images lets the model create and learn its own features from the raw data. So I'm curious - is there an analog here? Can a sufficiently large model be trained directly on the binlog of a database and learn its own features?