From RhinoJS To NodeJS. Enhanced With worker_threads 🚀
Good old times
What this application is about? It is a reporting tool. Once a scheduled time it goes through all documents in the account and collects various statistic metrics. Then it forms a CSV file and sends it to the customer as an email attachment. Depending on the document count, it may take hours.
We started by tweaking the Java scheduler code to support calling API requests along with executing the code in the RhinoJS runtime. Doing this allowed us to create a NodeJS server and move all report execution into a new component.
Once we were unblocked to extract report generation into a dedicated service, I have created a new ExpressJS server and an API endpoint. It was naive to think that the old logic will start working in the new environment. I did not know how it works in general, people who wrote this left the company years ago. Commits merged with great (irony detected!) messages like “added a new service”, “refactor CSV generator”, “add new report type”, did not help at all. But there was a good thing: I finally got my “Aha!” moment.
In order to appreciate the value of unit testing, you need to have that “aha!” moment, when unit tests you wrote to save the day
There was a bunch of unit tests, that helped me to sort things out. This brought us confidence, at least we have some important cases covered.
Java world side effects
Our next step was to re-write the code. RhinoJS is a way old engine and it only supports ES5 code. Having ES5 code does not mean that we can not execute it in NodeJS (I am going to share the issues that disallow js execution in our particular case). But we wanted to make the application fresh and safe. In our company, the team decides what libraries and frameworks to use, and I always prefer to use Typescript. One advantage here (along with type checking and making the code documented) is that we have a fully documented API Client library written on Typescript as an npm package. This brings us confidence that we manipulate the correct response data during the report execution. The Typescript static analyzer does an awesome job catching common problems and typos. It took us some time to realize how the old code works, then I split it into small pieces, came up with brand-new modules. As the result, we got a new modern NodeJS application with Kubernetes deployment set up and ready to be shipped into the real world.
RhinoJS is java
Synchronous HTTP requests
The other one is also Java-related. You know, in Java all API requests are synchronous. Guess what? In RhinoJS all API requests are… synchronous as well! So you would never see someApi.get().then(dataHandler). Ugh my bad, we live in a ES5 world. No Promise statements, only callbacks, right?. Nope. You would never see something like this someApi.get(callbackFn). Imagine you have a loop that fetches documents page by page. In the RhinoJS world, everything is okay. If you run the same code in NodeJS, you will never get the proper result. Because the API request is asynchronous, it’s of Promise type, so you need to wait until it is being resolved. I had to review all parts where we execute an API request and change the logic according to the standard. Finally, we were able to launch our new application.
Battling with issues
As I said earlier, we had a bunch of unit tests. And it did not cover everything. Maybe some critical business cases and that’s it. I think it was a Jasmine version 1.0 specs.
Today we run tests using the Jest library. So I made two steps forward. First I rewrote existing tests and ran them against old logic. When everything was okay, I put the newly rewritten code into action and fixed issues that appeared during the test run. In the end, I added more tests so it would hopefully help people to maintain the service.
Unexpected API response time
When we saw all tests passing, we decided to deploy our new application into sandbox environment so we would see if report generation really works. And this is how we caught the first issue. Our API request builder is based on the Axios library. We always initialize Axios instance with 20 seconds timeout by default. Sound reasonable enough, huh? Now we try to fetch yet another page of documents and it takes… 1 minute?! What?? We saw requests hanging for 2 minutes, for 2.5 minutes… At the end of the day, we decided to turn timeout off and it now works like that. You may say that having 2.5 minutes response time for a single API request is terrible. And you are right. But this is what we have at the moment. Sad truth.
Okay, we sorted out the timeout issue. But there was something that preceded this. We would never see the request takes 1 minute to execute if we gave up on logs. The number of logs the old application was producing was ridiculously small. You hardly could understand if something goes wrong. So we decided to log as much as possible. A big part of the code was surrounded using the “full coverage” strategy. This means that for particular function we log when it starts the execution and when it finishes. All critical methods, especially promise requests log when they start, when they resolve, and when they throw an exception. Thus we were able to understand how the application works. We were able to find weak weak parts, bottlenecks, and optimize the code.
We are developing the stateless application so the Kubernetes scheduler can manage the load and move pods around the stack. Every hour we receive thousands of reports for execution. There is no way we can keep the reports queue in the application’s memory. Kubernetes can unexpectedly kill the instance and we would lose a bunch of data. The application may bump into memory limits so it would crash with the OutOfMemory exception. You can deploy a new version and the Kubernetes manager will scale down old instances and spawn new ones. We had to implement some report storage to keep track of the reports queue. And here Redis comes into action.
We created two lists in the Redis instance. One is the whole pending report queue. So we receive yet another request to generate a report. Next, instead of running the report, we put it on the waiting list. Now, when we have resources, we take the first item from the waiting list and put it into the working list (another list in Redis cache). We supply reports in progress with an actual timestamp, so we always know if one NodeJS thread working on it. Imagine Kubernetes scheduler killed one of the application instances and create a new one. New instance checks the Redis queue if there are reports that seem to be in progress, but the timestamp is old. This means they are stuck. Then we move all such reports into the waiting list in the first place (because these reports are the oldest ones we need to generate). And then if there are threads available, we can take the next report and start from the beginning.
We made our service stateless. Redis guarantees all operations we use are atomic, so we never run into data collision. And we implemented a simple and neat auto-recovery approach, so we never lose data.
In the beginning, the biggest weakness of the application was memory consumption. We did not know how it works exactly. So we rewrote it step by step, function by function. We covered 80% of execution flow by logs. We significantly increased test coverage. We use JS classes so most of the methods are now stored in the class prototype and NodeJS would not allocate memory for the same function again and again. Then we found one thing.
Imagine you want to create a report. Previously, we could fetch all documents in the account. Then we map them into the report field. But first, we fetch ALL DOCUMENTS. Can you imagine how much memory it takes if you have thousands of items? This data object is quite a big model with different kinds of information. And we store all documents in memory. This perfectly explains why we had to restart the Java service every few weeks. We started to see the same picture on Datadog graphs for our new NodeJS application. Memory usage was rocketing for big reports. In the end, we were getting OutOfMemory crashes.
You may ask me what was the purpose to store all data in memory? I guess back in the days there were not many documents in the system, so this temporal approach worked all the way. But nowadays we have clients with huge organizational structure, with thousands and thousands of orders per each branch. So the time has come. As I wrote before, we rewrote everything on Typescript, and put more tests into action. And now we just rewrote the way we generate reports. Instead of collecting everything in memory, we take a page of items and convert it into part of the report at that time. We might need to get an additional data — so we can run new requests, validate values, and so forth.
As the result we have report lines related to that documents and maybe some metadata, thus we allow NodeJS garbage collector to free a lot of memory. Right now one instance hits 2Gb memory in the hard times. It never reaches 3Gb limit. Before we tried to run the application with 5Gb memory limit, but it could not survive.
Why logs are so important? I have already told you about the API execution time issue. Here is another example. Logs showed that a bunch of reports bump into HTTP 401 all the time. Not authorized. We selected the first 10 records and checked the account status. Turns out these companies were deleted or deactivated. It is useless to generate reports for a company that does not use the platform, they were just flooding the report queue and waste our resources. We wrote a script and manually deleted report configurations for deactivated accounts. Thus we freed some room for real reports.
When we first started, we did not know anything about the service usage stats. There were no metrics at all, the application was barely sending logs. We did not know for many companies use report generation and how many report configurations are active. In the new implementation, we wanted to address these issues. We enriched logging capabilities so it would be much easier to understand how the application works. We found many big and small problems, got many insights on how people use the reporting tool. Then we added some Datadog metric, so now we know the real service load. We know how much memory it consumes, we know how many reports are in the queue. We even know the longest-running reports and why it takes so much time to generate them. But we most fascinating is that we set up a dashboard to celebrate when we hit 0 reports in the pending queue. We could not imagine it is possible, but all the optimizations had a real impact on the working cycle. In the meantime, we all agree that things may get worse in the future and it would require more effort than just refactoring and optimizations.
Previously it was a Java service that was spawning a new thread with RhinoJS code. In NodeJS there are built-in features that allow running the code in parallel. Cluster module allows creating the process forks that share the same HTTP port so it perfectly fits any kind of HTTP server. Another option is to use a built-in module called worker_threads, which is designed to help with CPU-intensive JS code. In short, you have the main process that can create a new thread with its own event cycle and JS engine. Workers have isolated memory pools and can communicate with the main thread using the event loop messages. We decided that the worker_threads module fits our case perfectly and worth a try. It is a built-in module so you don't need to do something to enable it.
As you might know, spawning a thread on the OS requires resources, so the common practice is to have a pool of workers all the time plus handle thread termination. This means to spawn a new thread so there would always be the same number of threads available. But for some reason in our case, it did not work well. Garbage collection was barely coming into action, threads did not free memory and the application was dying with OutOfMemory signal all the time. After several optimizations, the memory was still leaking. To make sure we are freeing resources, we decided to… kill the thread every time it finishes report execution. As I wrote this is always an overhead to spawn a new thread again and again. In the meantime, it was not a critical metric for us, and since the application started working as expected, memory issues have gone away and we agreed to keep it this way (on the Datadog dashboard screenshot you could see CPU usage peaks that represent the thread spawning. There also memory graphs that go down at some points when threads finish with the execution and being killed)
That being said, we created a WorkerPool that initializes the thread pool for us
I have already mentioned we introduced Redis cache for reports. The generateNextReport() function gets the first available report from the queue and asks WorkerPool to trigger the report execution in some worker thread.
When we have a new report to execute, we create a new thread and send a message with all the data required (report configuration, company id, and more).
To conclude, we take a new report request, spawn a new worker thread, send a message with report configuration for this thread, generate the report, send the message back to the main thread that everything is fine (or there is an error happened), kill the thread and then ask for a new report request. This did not take a lot of effort to convert from us. The report generation code remained the same, we mostly implemented the approach to manage threads and connected it with existing Redis implementation. Simple as that.
As the result, we created a whole new application. Moved out the old RhinoJS code prior to having a scalable k8s-ready NodeJS application.
- It took 4 months instead of what we roughly estimated in the beginning (2 months). The reason is that we did not know the way the applications works. We did not know what a heck is this RhinoJS thingy. And we did not have time to have a spike on it. The whole task was our big research.
- We re-implemented the application using Typescript. And it was amazing. It helped us write consistent stable code. We fixed typos from the old code just by the way (I guess it was something our customers were complaining about). The code itself got more readable and documented. It was super easy to refactor and extend it with new functionality.
- We implemented Datadog metrics and added more logs. Now we clearly understand how to investigate issues. We know the application load. We are getting notifications in case something goes wrong. We know the bottlenecks and weak parts and analyze more data.
- We handle 98% of the execution in the NodeJS. Remember we left a scheduler in the Java code. Except this, everything else happens in the NodeJS application. We receive a new bunch of reports. We spawn new threads and follow each thread’s events events. Everything is under control.
- We finally were able to refactor the code and significantly decreased memory consumption. If we compare with Java virtual machine, the NodeJS application requires less memory for execution. But you always can write not-optimized code. That’s why it was rewritten. New service does not bump into memory limit and feels comfortable under Kubernetes environment.