Vibing Cat Synchroniser - Generating Memes Using Waveform Analysis
If you haven't come across it already, allow me to introduce you to the vibing cat.
This particular cat, nodding expressively as if to the rhythm of a song, has become an internet sensation. I love it, even if the cynical side of me knows that the owner is probably just moving its head. Alongside its newfound life as a chat react emoticon, the vibing cat has gone on to bless many music videos with its ~cat vibes~.
However, if you have a good sense of rhythm, perhaps you too are irked by how the animation is just slightly off beat from the underlying song. Quite frankly, we can't blame the cat, and manual beat matching is quite hard. What can be done?
The Vibing Cat Synchroniser
Sometimes I think I'm beyond the point of marvelling at how hilariously useless your projects are and then you hit me with something like this [1]
Naturally, having concluded that the world needs access to better quality vibing cat memes, I set about writing a program to automate the process. Basically, it performs beat matching and then synchronises the cat to the beat. So let's see some examples!
Responding to pauses in the beat structure
Dealing with a changing tempo
Dexys Midnight Runners - Come On Eileen
Hilarity ensues when the rhythm is fast
DragonForce - Through the Fire and Flames
How it's done
The first step is working out where all the beats are. To achieve this, I used an implementation [2] of the beat tracking algorithm described in this paper: Enhanced Beat Tracking with Context-Aware Neural Networks.
It sounds complicated, but its result is simple: a list of probabilities, one for each 10ms time period in the song. Each probability represents how likely it is that the given 10ms period is part of a beat. As an example, let's analyse a piece of music with a well-defined beat but variable tempo: Come On Eileen.
From about 150s in, the song starts to gradually speed up. If we graph beat certainty against time, we get something looking like this:
As the rhythm speeds up, so the dots get closer together, although at this scale it's kind of imperceptible. It's interesting to note, however, that the certainty of recognising a given beat decreases as the tempo increases. This is because the beat detection is done by a contextually-aware neural network: it takes into account the rhythm that came before, and ranks deviations from it as less certain.
So we set a certainty threshold, and consider only points above this threshold to be "beats".
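As a rough illustration of the thresholding step, here is a minimal sketch in plain Python. It assumes the probabilities arrive as a simple list with one value per 10ms period. The function name `pick_beats`, the default threshold of 0.5 and the local-peak rule are all assumptions made for this example, not necessarily what the project itself does.

```python
FRAMES_PER_SECOND = 100  # one probability per 10ms period, as described above


def pick_beats(activations, threshold=0.5):
    """Return beat times in seconds for local peaks at or above the threshold.

    The threshold value and the local-peak rule are illustrative assumptions.
    """
    beats = []
    for i, p in enumerate(activations):
        if p < threshold:
            continue
        left = activations[i - 1] if i > 0 else 0.0
        right = activations[i + 1] if i + 1 < len(activations) else 0.0
        # Keep only the first frame of a local maximum, so that a beat
        # spanning several frames is counted once.
        if p > left and p >= right:
            beats.append(i / FRAMES_PER_SECOND)
    return beats


# Made-up probabilities: a clear beat, a weak one below the threshold,
# and a beat spread over two frames.
activations = [0.0, 0.1, 0.8, 0.3, 0.0, 0.2, 0.4, 0.1, 0.0, 0.9, 0.9, 0.2]
print(pick_beats(activations))
```

Raising the threshold drops the less certain beats, which is exactly the effect that shows up later when the tempo gets very fast.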
The increase in tempo becomes clearer if we look at the duration of each beat:
As the rhythm speeds up, so the duration ("time since last beat") decreases - we can see this as the part that's sloping downward. The tempo increase is quite dramatic: at beat 360, the rhythm is approximately twice as fast as at beat 320.
Then at beat 366, something interesting happens: the beats are happening so quickly that every other beat is falling below the certainty threshold, resulting in a return to a tempo closer to what we had at the start. And then it just goes crazy during the heavy drumming section, before returning to normal.
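The "time since last beat" measure discussed above is just the difference between consecutive beat times. A small sketch, with hypothetical helper names and made-up beat times, might look like this:

```python
def beat_durations(beat_times):
    """Time since the previous beat, for every beat after the first."""
    return [b - a for a, b in zip(beat_times, beat_times[1:])]


def tempo_bpm(duration):
    """Convert a beat duration in seconds to beats per minute."""
    return 60.0 / duration


# Made-up beat times (seconds) for a song that speeds up.
beats = [0.0, 0.5, 1.0, 1.4, 1.7, 1.95]
for d in beat_durations(beats):
    print(f"{d:.2f}s per beat is about {tempo_bpm(d):.0f} bpm")
```

Halving the duration doubles the tempo, which is why a downward slope on the duration graph corresponds to the song speeding up.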
So now we know with pretty high certainty where all the beats land in the song. So we're done on the music front, right? Not so fast...
The beat break
Quite a few songs exhibit the pattern of beat breaks: the drums will be playing a recognisable beat, and then drop out for a bit before rejoining. Unfortunately, this means our beat detection is going to stop working so reliably. Although it does a great job overall, in practice it's going to drop at least a few beats. Ideally we'd like to not have the cat just sitting there awkwardly during instrumentals, so we need a way of guessing what happens during these intervals.
Let's look at another example:
Taking a look at the beat certainties again, we notice that the beat certainty drops significantly during the break:
To avoid missed beats, we need to detect the breaks and forward-fill them with beats of an appropriate tempo. The beat duration graph gives us a clue of how best to do this:
Look at these massive outliers - they're clearly separated from the group, and nice multiples of the base beat duration too! So let's make the naïve assumption that any long outlier represents a beat break [3], and see how far that gets us.
For each long outlier, we forward fill beats by taking an average of the beat lengths before and after the break, and scaling to fit. And the results look promising:
The bottom row is the detected beats, and the top row is our interpolated beats. Nice!
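The forward-filling idea can be sketched as follows. This is only one reading of the description above: the outlier test (a gap longer than 1.5 times the median duration), the name `fill_breaks` and the way neighbouring beats are averaged are all assumptions made for illustration.

```python
from statistics import median


def fill_breaks(beat_times, outlier_factor=1.5):
    """Insert evenly spaced beats into gaps much longer than a typical beat.

    outlier_factor is an assumed cut-off, not a value from the project.
    """
    durations = [b - a for a, b in zip(beat_times, beat_times[1:])]
    if len(durations) < 3:
        return list(beat_times)
    typical = median(durations)
    limit = outlier_factor * typical

    filled = [beat_times[0]]
    for i, gap in enumerate(durations):
        start = beat_times[i]
        if gap > limit:
            # Average the normal-length beats either side of the break,
            # falling back to the median if neither neighbour is usable.
            neighbours = []
            if i > 0 and durations[i - 1] <= limit:
                neighbours.append(durations[i - 1])
            if i + 1 < len(durations) and durations[i + 1] <= limit:
                neighbours.append(durations[i + 1])
            local = sum(neighbours) / len(neighbours) if neighbours else typical
            # "Scale to fit": use a whole number of beats that spans the gap.
            count = max(1, round(gap / local))
            step = gap / count
            filled.extend(start + step * k for k in range(1, count))
        filled.append(beat_times[i + 1])
    return filled


# A steady 0.5s beat with a 2s break in the middle.
detected = [0.0, 0.5, 1.0, 3.0, 3.5, 4.0]
print(fill_breaks(detected))
```

Because the number of filled beats is rounded to a whole number and then stretched across the gap, the interpolated beats always line up with the detected beats on both sides of the break.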
As an aside, it turns out that this same technique is useful when beats happen too quickly as found in bass drum rolls (for example, at 1:12). The beats get detected with a lower certainty, and so a long outlier beat duration is detected, and we can forward fill with the average. Admittedly, I think it's a bit of a shame that the contextually-aware neural net doesn't spit out rapid hits as beats too, because the resulting cat would be hilarious. But it's a bit too clever for that.
Anyway, we've so far achieved beat detection with forward filling, and it seems from the graphs that the results look promising. Now to actually make the video...
Beat matching the cat
Going back to the source video, you'll note that the timing isn't perfectly consistent. Not that I'm really complaining; it's about as good as you could reasonably expect a cat to be. Either way, we're going to need to perform some correction to account for this.
I started from this video, which had already handily replaced the background with a nice, solid green. After creating a click track with Audacity, I edited the video to precisely 120bpm by stretching off-beat sections using Kdenlive's time stretch feature. This is the resulting video file:
Each beat should be aligned with the point at which the cat's head is in its maximally extended position, and so the idea is that we're going to stretch the video between these points to match the detected beats.
Given that the cat bops exactly 20 beats in our overlay, and with a bpm of 120 and framerate of 30fps, we can calculate that in the video we have precisely 15 frames per cat movement. So to align the cat with the beat, we take the duration of each beat as output by the program, and scale our 15 frame sections across that time, in order.
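The arithmetic is 30 frames per second × 60 seconds ÷ 120 beats per minute = 15 frames per beat. The sketch below shows one way the stretching could be expressed: given an output time, it works out which overlay frame to show. The function name `overlay_frame_at`, and the choice to loop back to the first cat movement after the 20th, are assumptions for the sake of the example.

```python
FPS = 30
OVERLAY_BPM = 120
OVERLAY_BEATS = 20  # the cat bops 20 times in the overlay
FRAMES_PER_BEAT = FPS * 60 // OVERLAY_BPM  # 15 frames per cat movement


def overlay_frame_at(t, beat_times):
    """Return the overlay frame index to show at output time t (seconds).

    Returns None when t falls outside the detected beats.
    """
    for i in range(len(beat_times) - 1):
        start, end = beat_times[i], beat_times[i + 1]
        if start <= t < end:
            # Fraction of the way through this beat, in the range [0, 1).
            progress = (t - start) / (end - start)
            # Assumption: loop the overlay once all 20 movements are used.
            segment = i % OVERLAY_BEATS
            return segment * FRAMES_PER_BEAT + int(progress * FRAMES_PER_BEAT)
    return None


# One fast beat (0.5s) followed by one slow beat (1.0s).
beat_times = [0.0, 0.5, 1.5]
print(overlay_frame_at(0.25, beat_times))  # halfway through the first movement
print(overlay_frame_at(1.0, beat_times))   # halfway through the second movement
```

Both calls land halfway through a 15 frame section even though the second beat lasts twice as long, which is the stretching effect we're after.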
The final piece of the puzzle is compositing our cat video channel onto the music video itself; a pretty trivial operation for ffmpeg. And with that, we're done! Let's see how it all looks in practice:
For such a long instrumental break, I'd say that's not at all bad! Although it could probably be improved slightly by a smarter way of interpolating the break.
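For the curious, the compositing step could be expressed roughly as below. This is not the project's actual command, only a sketch of one plausible invocation built in Python. It uses ffmpeg's `chromakey` and `overlay` filters, and assumes the cat's background is pure green (`0x00FF00`) and that the similarity and blend values shown are adequate. The file names are placeholders.

```python
def composite_command(music_video, cat_video, output, key_colour="0x00FF00"):
    """Build an ffmpeg command that keys out the green and overlays the cat.

    key_colour and the similarity/blend values are assumed, not measured.
    """
    filters = (
        # Make the green background of the second input transparent...
        f"[1:v]chromakey=color={key_colour}:similarity=0.15:blend=0.1[cat];"
        # ...then draw it on top of the music video, stopping with the
        # shorter of the two inputs.
        "[0:v][cat]overlay=shortest=1[out]"
    )
    return [
        "ffmpeg",
        "-i", music_video,
        "-i", cat_video,
        "-filter_complex", filters,
        "-map", "[out]",
        "-map", "0:a?",   # keep the music video's audio if it has any
        "-c:a", "copy",
        output,
    ]


# Placeholder file names, purely for illustration.
cmd = composite_command("music.mp4", "cat_synced.mp4", "out.mp4")
print(" ".join(cmd))
```

The list form can be handed straight to `subprocess.run(cmd, check=True)` once the input files exist.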
I found a cat video funny, so I spent a weekend on this project. I hope you enjoyed reading about it.
If you're interested in the code, take a look at the repository. It's built to be generic, so it can handle non-vibing-cat overlays too!
[1] A friend of mine, upon hearing about this project for the first time.
[2] The library: https://github.com/CPJKU/madmom and documentation for the specific function: https://madmom.readthedocs.io/en/latest/modules/features/beats.html
[3] Ideas on how to tell the difference between genuine pauses and beat breaks welcome.