And then it's a ton of keyframing. I also found that the timings all have to be earlier than the timestamp you actually want, to account for how long it takes to denoise to your new scene - so like .4 gives smoother but longer, .7 is snappy. To get stuff to keep showing up and stay dynamic you can zoom out continuously. It needs a UI, badly.
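Here's a rough sketch of how the "set it earlier" part could be scripted instead of done by hand. Everything in it is made up for illustration: the `shift_keyframes` name, the 12 fps default and the half-second lead are my assumptions, not settings from any particular tool, so tune the lead to however long your transitions take to settle.

```python
def shift_keyframes(cues, fps=12, lead_seconds=0.5):
    """Turn (seconds, prompt) cues into {frame: prompt}, moved earlier.

    Hypothetical helper: fps and lead_seconds are assumed values.
    """
    shifted = {}
    for seconds, prompt in cues:
        # Start the change early so the new scene has settled by the
        # time you actually wanted it; never go before frame 0.
        start = max(0.0, seconds - lead_seconds)
        shifted[round(start * fps)] = prompt
    return shifted


# Where I want each scene to be fully visible, in seconds.
cues = [(0.0, "a forest"), (4.0, "a city"), (8.0, "the ocean")]
print(shift_keyframes(cues))
# {0: 'a forest', 42: 'a city', 90: 'the ocean'}
```

You'd still have to paste the frame numbers into whatever keyframe format your setup expects.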
If you're on Windows you'll need ffmpeg too.
I watched this when it came out last week: https://www.youtube.com/watch?v=W4Mcuh38wyM
I followed this to get it running locally: https://stablediffusionguides.carrd.co/#one