Tales from optimization hell

With the ever-growing dataset of my KittyScan project came unforeseen performance issues. How I ultimately fixed them, and how I made the process unnecessarily hard for myself, is what I want to discuss in this short blog post.

Let's first look at the use case these issues appeared in. The scanning infrastructure of KittyScan is divided into two components, a "head" and the "scanners". The head is the service responsible for distributing jobs (e.g. "scan this IP", "check if this server uses online mode", ...) to the scanners, which execute these jobs and respond with the results. These two components are connected to each other via a RabbitMQ bus, for load-balancing and buffering reasons.

While the scanners just blindly work through their jobs, it is the head that has to process the responses and persist them in the database.

While the real database schema is a lot more complex, the only relevant detail you need to know is that it has a table called "servers" which contains all basic information about a given server at a given time, such as the MOTD or the player count. This table is already deduplicated by only saving deltas between scans, to minimize the number of entries generated. If I did not do this, the table would grow by around 190k rows per hour. Over the roughly 7 months of runtime, this would have amassed 957 million entries. Using deltas reduced this to only about 100 million entries, a reduction of roughly 90%. Nice.
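To make the idea concrete: "only saving deltas" just means a new row is only written when the newest saved row for that IP differs from the fresh scan result. The head does this check in application code, but expressed as pure SQL (with made-up column names, the real table has many more) it boils down to something like this:

-- Sketch only: column names and values are simplified, not the real schema.
WITH latest AS (
    SELECT motd, player_count
    FROM servers
    WHERE ip_id='0.0.0.0'
    ORDER BY id DESC
    LIMIT 1
)
INSERT INTO servers (ip_id, motd, player_count)
SELECT '0.0.0.0', 'A Minecraft Server', 20
WHERE NOT EXISTS (
    SELECT 1 FROM latest
    WHERE motd = 'A Minecraft Server' AND player_count = 20
);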

Because of this delta check, the head has to compare the current result against the one last saved in the database for the given IP. And because the head receives about 190k answers per hour, it also has to do this check 190k times per hour.

So how do I get the last saved state from the database? Just filter by the IP, sort descending by the (incrementing) ID and take the first row. This results in a query which looks something like this:

SELECT * FROM servers WHERE ip_id='0.0.0.0' ORDER BY id DESC LIMIT 1;

And while this is the most straightforward "head through the wall" approach to the problem, it worked surprisingly well. For the first few months this ran without any issues. Or so I thought. Over time I noticed a small but linear decrease in ingest speed (the rate at which the head consumes the answers given by the scanners). While it started off as "as fast as you can throw stuff at it", it slowly but surely declined to ~70 per second. This wasn't a problem at the time, and I chose the message bus explicitly to allow buffers to build up, but eventually it would not be able to keep up. So I had to look for a solution.

Before the delta check, I have to query some other tables to collect all the necessary data (Players, Favicons, ...). So my first thought was that running all these queries against tables that were also growing was the root cause of the drop in ingest speed. Luckily, entries in all of these tables are only ever added, never edited or removed. A favicon, once saved, will never change or disappear, for example. This made these queries perfect candidates for a simple in-memory cache which only contacts the database if the item has not been queried before. Implementing it was a little tedious but simple. And it was effective: with the caching in place, the ingest speed rose to ~300 per second again.

Problem solved. Not.

About a month later, the ingest speed had again fallen to ~70 per second. So I took another look. I saw that the query above often took ~250ms to finish. That confused me at first, but after some trial and error I noticed that I had forgotten to give the "ip_id" column an index. That was a stupid mistake, which forced the database to search through the raw table data instead of using an optimized lookup structure, but it was easy to fix. Also, as a side note, I want to applaud Postgres here for still being that fast, even with a mistake on that level.
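For anyone wondering how to spot and fix something like this: EXPLAIN shows what the planner actually does with a query, and the fix itself is a one-liner (the index name below is just my pick for this post):

-- Shows the actual query plan; in a case like this it reveals that Postgres
-- has to scan the raw table data because there is no index on ip_id.
EXPLAIN ANALYZE SELECT * FROM servers WHERE ip_id='0.0.0.0' ORDER BY id DESC LIMIT 1;

-- The fix: an index on the filter column. CONCURRENTLY builds it without
-- blocking writes, so the ingest can keep running in the meantime.
CREATE INDEX CONCURRENTLY idx_servers_ip_id ON servers (ip_id);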

So I quickly added that index and, look at that, the query only took ~40ms and the ingest speed was up again; not as high as before, but good enough. It also did not drop again in any way I would have noticed.

That was until a few days ago, when I moved all my services to a new server. The new machine had much more compute and much better I/O, so color me surprised when I saw that the ingest speed was down to ~5 per second and the query took ~4000ms to finish.

Panik.

My first attempt at a fix was to REINDEX the database; maybe the indexes built by pg_restore were faulty. And yes, it got better after this, but only a bit. The ingest speed was up to ~60 per second again, which is barely enough to keep a backlog from building. I knew that I had to fix this, or else the project might not be able to continue.
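In case you have never needed it, that is a one-liner as well (the database name here is just a placeholder):

REINDEX DATABASE kittyscan;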

I searched for every bit I could optimize and found nothing.

Then, while looking at a completely different part of the software, it hit me. A long time ago I had already built a 1:N table between IPs and the last server entry found for them. I did this for a different microservice and forgot about it afterwards. The most ironic part is that this 1:N table is updated by the very same ingest function that also performs the delta checks.

Using this table to reference the latest entry, instead of running the query above, sped things up by a ton. The time to retrieve the data went down from ~40ms to below 0.1ms.
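The lookup now boils down to something like this (the table and column names are simplified for this post, the real schema differs):

SELECT s.*
FROM ip_latest_server ils  -- the 1:N mapping table, name made up here
JOIN servers s ON s.id = ils.server_id
WHERE ils.ip_id='0.0.0.0';

Both sides of that join are simple key lookups, which is why it stays comfortably below a millisecond.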

But curiously, after implementing this fix, I still saw the same messages about the query taking ~400ms, which confused me even more; I thought that maybe I had forgotten to push my changes. But nope. It turned out that I somehow managed to run this inefficient query not once per ingest, but twice.

The second query was originally not part of this ingest function, but after I refactored the head a bit to simplify some processes, it ended up there, and I never considered just reusing the result from the first one.

And with that fixed, we are back to an ingest speed of "as fast as you can throw stuff at it". Hooray.

So what are the lessons I should learn from this? Firstly, don't stop optimizing every time things reach "good enough" again; if I have the time, I should invest it in looking for further improvements. Secondly, monitor my metrics more closely, so performance issues are identified quickly rather than stumbled upon. And thirdly, avoid relying on database "magic". While it certainly helps, it may be lost during the next migration to something else. To be entirely honest, I did not even know that Postgres did so much optimization in the background. But maybe this was also just a happy collection of coincidences.

Welp, that was it for today. Thank you so much for reading and have a nice day!


Join the KittyScan community on Discord today!