Red Alert: Major Incident Due to Performance in the Microservices Deployment
Heredia, Costa Rica, 2022-11-26
Series: From Volkswagen Beetle to Ferrari: Accelerating ASP.Net Microservices, Article 2
In this second article of the series From Volkswagen Beetle to Ferrari: Accelerating ASP.Net Microservices, I will describe the remediation attempts we tried based on the little information I had about the deployed microservices system.
Go-Live Day 1: 3 Hours After Deployment
The morning we went live, all the smoke tests passed and we let the system run, only to start receiving complaints 3 hours later from many customers that the system was so slow it could not be used.
Indeed, we tested it ourselves, and the application would not even log us in (it uses Windows Authentication). So we went hands-on to resolve the problem.
IMPORTANT: I put this up front because it is so incredibly important: we did not have server log collection. Nobody gave logging the importance it deserved, and while I did ask for it, I was told there was no time to implement it due to the business deadlines. Guys, never, ever go live without proper logging. It is the thing that is not needed until it is really, super needed. Don't suffer like me.
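For what it's worth, shipping logs out of a pod does not have to be a big project: in Kubernetes, writing structured logs to stdout is usually enough for a cluster-level collector to pick them up. A minimal sketch, assuming the .NET 6 minimal hosting model (none of this is from our actual codebase):

using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);
// Replace the default providers with a JSON console logger. In K8s, anything
// written to stdout can be scraped by a log collector such as Fluent Bit.
builder.Logging.ClearProviders();
builder.Logging.AddJsonConsole();
var app = builder.Build();
app.Run();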
Attempt 1: Spin up More Pods
Like I said before, my knowledge of microservices was not that deep. I was no expert in the matter at the time (4 months ago) by any measure; I just knew a few things here and there. Of course, K8s 101 (the basic stuff) tells me: more pods, more concurrency.
I went right ahead and started spinning up pods for the Security microservice, which is mainly in charge of performing the login operation. This seemed to alleviate the problem as I spun up more pods, but the symptoms would come back within 5 minutes or so.
Still, by the time I had 12 pods running, the login part of the process seemed to have stabilized.
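For reference, the scaling itself is a one-liner against the deployment (the deployment name here is illustrative, not our actual one):

kubectl scale deployment security-api --replicas=12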
So problem 1, users being unable to log in, seemed to have been solved. Worth noting, however: I could not understand why I needed so many pods when the memory and CPU charts were telling me each pod was doing almost nothing.
Other Pages Not Loading
So after being able to log in, I went ahead and looked at the data parts of the application. They were not loading; some requests were taking 2 minutes or more when they normally take 4 seconds.
So let's apply the remedy, right? Spin up more data pods! Hell, yes. Well, hell no. While this had helped alleviate the login part, it did nothing for the data part. I got up to 20 pods per microservice and still could not get the data to load. Again, the memory and CPU charts in K8s told me the pods had very light usage. This was infuriating: "Come on guys, work! There are a bunch of requests pending, so chop chop!" All my pep talk had no effect on them.
Attempt 2: Try to Blame the Database
At this point I said to myself: "Myself, if the pods aren't working, it is because the database is not providing the data. Go check SQL Server." Hmmm, interesting idea.
Guess what? I spun up DBeaver and tested the queries. They were blazing fast, as expected. So the problem was in the new system. Furthermore, we had kept the monolithic deployment alive as part of the rollback strategy, and the monolithic application was working as fast as ever.
So no, don't blame the database.
Attempt 3: Meddle with .Net
At this point in time, and from quick-reading stuff on the Internet, the thing you see blamed the most for .Net underperforming is synchronous programming. Pretty much every resource will tell you that, in order to have a performant .Net HTTP server, it must be programmed asynchronously. Ours was not. So what's the most popular workaround? Setting the minimum number of thread pool threads.
Setting the minimum number of threads in .Net is super simple and, for the most part, risk-free, so we attempted this as the last remediation measure.
The code is super simple. Something like this:
using System.Threading;

// In Program.cs, but this can also be done in Startup.cs while configuring the app.
// If done in Startup.cs, you can read the minimum thread values from configuration.
if (ThreadPool.SetMinThreads(200, 200))
{
    // Yes, no logging set up, so output to the console.
    Console.WriteLine("Minimum thread count successfully set.");
}
else
{
    Console.WriteLine("Could not set the minimum thread count.");
}
This, just like spinning up new pods, provided merely temporary relief. Shortly after the pod was deployed, the application would go slow again.
I did not understand why the one apparently almighty workaround would not work for us, and to actually get an answer to that specific question you'll have to bear with me for a few more articles. This attempt was hindered by an external factor I won't explain just yet.
Just so you know, I tested all kinds of values and combinations of worker threads and IOCP (I/O completion port) threads, ranging from 50 to 1000.
This attempt did increase the CPU usage of the pods a little. Before it, usage was around 50 mCPU (millicores), which is 5% of one CPU core; setting the minimum threads sometimes got us to 100 mCPU (10%). Still, the system was the same mess.
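In hindsight, something that would have helped was to watch the thread pool itself instead of guessing at values. A small diagnostic sketch of the kind of thing I mean (again writing to the console, since we had no log collection; this is not code from the actual system):

using System.Threading;

// Threads "in use" are the pool's maximum minus whatever is still available.
ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
ThreadPool.GetMaxThreads(out int maxWorker, out int maxIocp);
ThreadPool.GetAvailableThreads(out int freeWorker, out int freeIocp);
Console.WriteLine($"Worker threads in use: {maxWorker - freeWorker} (min {minWorker})");
Console.WriteLine($"IOCP threads in use: {maxIocp - freeIocp} (min {minIocp})");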
Compilation of Things I Did Not Understand So Far
So here's a list of things I saw that I couldn't understand at this point in time.
- Why aren't the pods working? Why, if they have incoming requests, can they not relay those requests quickly to the database?
- Why didn't the minimum thread count workaround solve the issue for me as it has solved it for so many out there?
- Why are the .Net pods so slow? There is a NodeJS pod in the mix, and it was nailing it: a single pod was able to work through its queue of requests. Is NodeJS more capable than .Net? Did we choose our technology stack wrong?
- How come the monolithic (legacy) application can handle the load, even though it is programmed synchronously?
- How come the gateway microservice, despite also being .Net, was performing, to all appearances, OK?
At this point in time, and after being defeated and told to roll back the deployment, the one question that interested me right away was the comparison of .Net vs NodeJS.
Does NodeJS Scale Better Than .Net?
The short answer is no. As a matter of fact, if you do some light searching on the topic, you'll find extensive tests and demonstrations of the capabilities of .Net over its competitors, mainly NodeJS and Java (the TechEmpower Framework Benchmarks are a well-known example).
In short, .Net can handle loads roughly 10 times larger than NodeJS can. Just like that. Clear as the water of Río Celeste here in Costa Rica. So clearly, we were the ones doing something wrong.
Next Step: Asynchronous Programming
After analyzing my options so far, I decided the next course of action was probably converting from synchronous to asynchronous programming. By far a lengthy and difficult task, but it seemed inevitable.
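To give you an idea of what that conversion looks like, here is a hypothetical before-and-after of a typical data call. The repository, query, and names are made up for illustration; they are not from our codebase:

using System.Data.SqlClient;
using System.Threading.Tasks;

public class CustomerRepository
{
    private readonly string _connectionString;

    public CustomerRepository(string connectionString) => _connectionString = connectionString;

    // Before: synchronous. The thread pool thread sits blocked while SQL Server works.
    public string GetCustomerName(int id)
    {
        using var conn = new SqlConnection(_connectionString);
        using var cmd = new SqlCommand("SELECT Name FROM Customers WHERE Id = @id", conn);
        cmd.Parameters.AddWithValue("@id", id);
        conn.Open();
        return (string)cmd.ExecuteScalar();
    }

    // After: asynchronous. The thread goes back to the pool while the query is
    // in flight, free to serve other incoming requests.
    public async Task<string> GetCustomerNameAsync(int id)
    {
        using var conn = new SqlConnection(_connectionString);
        using var cmd = new SqlCommand("SELECT Name FROM Customers WHERE Id = @id", conn);
        cmd.Parameters.AddWithValue("@id", id);
        await conn.OpenAsync();
        return (string)await cmd.ExecuteScalarAsync();
    }
}

Multiply that change across every controller, service, and repository in the call chain, and you can see why I call the task lengthy.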
But before we go through that, we'll make a quick stop and try to understand how .Net works under the hood in terms of its thread pool. Don't miss the next article, as it is probably the most interesting piece of technical information if you want to have a performant .Net API ecosystem.
Follow the blog series if you wish to know what happened next. I will continue to detail my troubleshooting attempts, my discoveries, the K8s configurations I tried, and most importantly, how I progressed through clearing the bottlenecks as I found them.