From RhinoJS To NodeJS. Enhanced With worker_threads 🚀

I want to share a story of migration from RhinoJS to the NodeJS environment. We successfully extracted Javascript code into a dedicated NodeJS service and put the application into our Kubernetes stack configuration. Moreover, we set up parallel execution using the worker_threads module, which is the built-in NodeJS module. I am going to explain key milestones and unexpected issues we faced on our way during the migration period. I will also share improvements we applied to get a fast reliable application in the end.

Good old times

I started my Javascript journey 7 years ago. At that time I did not know much about Angular, React, NodeJS — all that stuff. I was developing JQuery scripts in a small company. Nowadays I develop the code using modern patterns, features, and frameworks. I write NodeJS modules and applications as well.

In our company we develop and support micro-service stacks. We deliver Angular applications in the browser and handle data using the Java backends. And if we happen to use server-side Javascript, we would choose NodeJS without any doubts (do you know any product that uses Deno on production already?). But 8 years ago the company consisted mostly of Java engineers, thus they decided to run the code using RhinoJS. The company was small and we did not experience loads of data we have today. The code was working. But at some point, memory leaks become a real problem. It was a usual practice to restart one backend component once a few weeks. By the way, many customers asked to implement new features. So we raised a discussion and decided to extract Javascript code from the Java service. I had to rewrite the RhinoJS code into the NodeJS application.

What is the RhinoJS thing, you ask me? Well, years ago NodeJS was yet unpopular, and people were trying to run JS code on the server using the Java Virtual Machine. RhinoJS is one such project. It is an engine that compiles Javascript into Java bytecode. Think of RhinoJS as another way to execute your scripts. Unfortunately, with NodeJS as a frontrunner, RhinoJS became redundant and stopped development. It did not receive the support of modern ES6+ features and left outdated.

The application

What this application is about? It is a reporting tool. Once a scheduled time it goes through all documents in the account and collects various statistic metrics. Then it forms a CSV file and sends it to the customer as an email attachment. Depending on the document count, it may take hours.

Technically, each report stores the Quartz configuration and thus schedules itself to run. Java spawns a new thread and then the report is going to be generated. We decided to keep scheduling functionality inside the Java backend and focus more on Javascript code.

First steps

We started by tweaking the Java scheduler code to support calling API requests along with executing the code in the RhinoJS runtime. Doing this allowed us to create a NodeJS server and move all report execution into a new component.

Once we were unblocked to extract report generation into a dedicated service, I have created a new ExpressJS server and an API endpoint. It was naive to think that the old logic will start working in the new environment. I did not know how it works in general, people who wrote this left the company years ago. Commits merged with great (irony detected!) messages like “added a new service”, “refactor CSV generator”, “add new report type”, did not help at all. But there was a good thing: I finally got my “Aha!” moment.

In order to appreciate the value of unit testing, you need to have that “aha!” moment, when unit tests you wrote to save the day

Gergely Orosz

There was a bunch of unit tests, that helped me to sort things out. This brought us confidence, at least we have some important cases covered.

Java world side effects

Our next step was to re-write the code. RhinoJS is a way old engine and it only supports ES5 code. Having ES5 code does not mean that we can not execute it in NodeJS (I am going to share the issues that disallow js execution in our particular case). But we wanted to make the application fresh and safe. In our company, the team decides what libraries and frameworks to use, and I always prefer to use Typescript. One advantage here (along with type checking and making the code documented) is that we have a fully documented API Client library written on Typescript as an npm package. This brings us confidence that we manipulate the correct response data during the report execution. The Typescript static analyzer does an awesome job catching common problems and typos. It took us some time to realize how the old code works, then I split it into small pieces, came up with brand-new modules. As the result, we got a new modern NodeJS application with Kubernetes deployment set up and ready to be shipped into the real world.

RhinoJS is java

There are two things I want to mention that did not allow refactoring to be an easy journey. Both of them rely on a fact that RhinoJS is a part of the Java JVM. That being said, let’s meet the first issue you would face if you ever happen to dive into Rhino Javascript: Java objects in the Javascript code. Yes. You can execute a Javascript function with some Java object as an input parameter. For instance, you need to print the date. How to get the format this date should be rendered in? Remember we have the RhinoJS engine being executed inside some Java application set up and running. So you are passing the localization config from Java to Javascript. What. A. Fun. Luckily enough my colleagues created a single Java object with all configurations so I had to re-create the same on Javascript. Then I had to find all the places that use it and replace them with new data.

Synchronous HTTP requests

The other one is also Java-related. You know, in Java all API requests are synchronous. Guess what? In RhinoJS all API requests are… synchronous as well! So you would never see someApi.get().then(dataHandler). Ugh my bad, we live in a ES5 world. No Promise statements, only callbacks, right?. Nope. You would never see something like this someApi.get(callbackFn). Imagine you have a loop that fetches documents page by page. In the RhinoJS world, everything is okay. If you run the same code in NodeJS, you will never get the proper result. Because the API request is asynchronous, it’s of Promise type, so you need to wait until it is being resolved. I had to review all parts where we execute an API request and change the logic according to the standard. Finally, we were able to launch our new application.

Battling with issues

As I said earlier, we had a bunch of unit tests. And it did not cover everything. Maybe some critical business cases and that’s it. I think it was a Jasmine version 1.0 specs.

Today we run tests using the Jest library. So I made two steps forward. First I rewrote existing tests and ran them against old logic. When everything was okay, I put the newly rewritten code into action and fixed issues that appeared during the test run. In the end, I added more tests so it would hopefully help people to maintain the service.

Unexpected API response time

When we saw all tests passing, we decided to deploy our new application into sandbox environment so we would see if report generation really works. And this is how we caught the first issue. Our API request builder is based on the Axios library. We always initialize Axios instance with 20 seconds timeout by default. Sound reasonable enough, huh? Now we try to fetch yet another page of documents and it takes… 1 minute?! What?? We saw requests hanging for 2 minutes, for 2.5 minutes… At the end of the day, we decided to turn timeout off and it now works like that. You may say that having 2.5 minutes response time for a single API request is terrible. And you are right. But this is what we have at the moment. Sad truth.

Logs coverage

Okay, we sorted out the timeout issue. But there was something that preceded this. We would never see the request takes 1 minute to execute if we gave up on logs. The number of logs the old application was producing was ridiculously small. You hardly could understand if something goes wrong. So we decided to log as much as possible. A big part of the code was surrounded using the “full coverage” strategy. This means that for particular function we log when it starts the execution and when it finishes. All critical methods, especially promise requests log when they start, when they resolve, and when they throw an exception. Thus we were able to understand how the application works. We were able to find weak weak parts, bottlenecks, and optimize the code.

Stateless approach

We are developing the stateless application so the Kubernetes scheduler can manage the load and move pods around the stack. Every hour we receive thousands of reports for execution. There is no way we can keep the reports queue in the application’s memory. Kubernetes can unexpectedly kill the instance and we would lose a bunch of data. The application may bump into memory limits so it would crash with the OutOfMemory exception. You can deploy a new version and the Kubernetes manager will scale down old instances and spawn new ones. We had to implement some report storage to keep track of the reports queue. And here Redis comes into action.

We created two lists in the Redis instance. One is the whole pending report queue. So we receive yet another request to generate a report. Next, instead of running the report, we put it on the waiting list. Now, when we have resources, we take the first item from the waiting list and put it into the working list (another list in Redis cache). We supply reports in progress with an actual timestamp, so we always know if one NodeJS thread working on it. Imagine Kubernetes scheduler killed one of the application instances and create a new one. New instance checks the Redis queue if there are reports that seem to be in progress, but the timestamp is old. This means they are stuck. Then we move all such reports into the waiting list in the first place (because these reports are the oldest ones we need to generate). And then if there are threads available, we can take the next report and start from the beginning.

We made our service stateless. Redis guarantees all operations we use are atomic, so we never run into data collision. And we implemented a simple and neat auto-recovery approach, so we never lose data.

Memory consumption

In the beginning, the biggest weakness of the application was memory consumption. We did not know how it works exactly. So we rewrote it step by step, function by function. We covered 80% of execution flow by logs. We significantly increased test coverage. We use JS classes so most of the methods are now stored in the class prototype and NodeJS would not allocate memory for the same function again and again. Then we found one thing.

Imagine you want to create a report. Previously, we could fetch all documents in the account. Then we map them into the report field. But first, we fetch ALL DOCUMENTS. Can you imagine how much memory it takes if you have thousands of items? This data object is quite a big model with different kinds of information. And we store all documents in memory. This perfectly explains why we had to restart the Java service every few weeks. We started to see the same picture on Datadog graphs for our new NodeJS application. Memory usage was rocketing for big reports. In the end, we were getting OutOfMemory crashes.

You may ask me what was the purpose to store all data in memory? I guess back in the days there were not many documents in the system, so this temporal approach worked all the way. But nowadays we have clients with huge organizational structure, with thousands and thousands of orders per each branch. So the time has come. As I wrote before, we rewrote everything on Typescript, and put more tests into action. And now we just rewrote the way we generate reports. Instead of collecting everything in memory, we take a page of items and convert it into part of the report at that time. We might need to get an additional data — so we can run new requests, validate values, and so forth.

As the result we have report lines related to that documents and maybe some metadata, thus we allow NodeJS garbage collector to free a lot of memory. Right now one instance hits 2Gb memory in the hard times. It never reaches 3Gb limit. Before we tried to run the application with 5Gb memory limit, but it could not survive.

Useless reports

Why logs are so important? I have already told you about the API execution time issue. Here is another example. Logs showed that a bunch of reports bump into HTTP 401 all the time. Not authorized. We selected the first 10 records and checked the account status. Turns out these companies were deleted or deactivated. It is useless to generate reports for a company that does not use the platform, they were just flooding the report queue and waste our resources. We wrote a script and manually deleted report configurations for deactivated accounts. Thus we freed some room for real reports.

Improvements

Metrics

When we first started, we did not know anything about the service usage stats. There were no metrics at all, the application was barely sending logs. We did not know for many companies use report generation and how many report configurations are active. In the new implementation, we wanted to address these issues. We enriched logging capabilities so it would be much easier to understand how the application works. We found many big and small problems, got many insights on how people use the reporting tool. Then we added some Datadog metric, so now we know the real service load. We know how much memory it consumes, we know how many reports are in the queue. We even know the longest-running reports and why it takes so much time to generate them. But we most fascinating is that we set up a dashboard to celebrate when we hit 0 reports in the pending queue. We could not imagine it is possible, but all the optimizations had a real impact on the working cycle. In the meantime, we all agree that things may get worse in the future and it would require more effort than just refactoring and optimizations.

Multi-threading

Previously it was a Java service that was spawning a new thread with RhinoJS code. In NodeJS there are built-in features that allow running the code in parallel. Cluster module allows creating the process forks that share the same HTTP port so it perfectly fits any kind of HTTP server. Another option is to use a built-in module called worker_threads, which is designed to help with CPU-intensive JS code. In short, you have the main process that can create a new thread with its own event cycle and JS engine. Workers have isolated memory pools and can communicate with the main thread using the event loop messages. We decided that the worker_threads module fits our case perfectly and worth a try. It is a built-in module so you don't need to do something to enable it.

As you might know, spawning a thread on the OS requires resources, so the common practice is to have a pool of workers all the time plus handle thread termination. This means to spawn a new thread so there would always be the same number of threads available. But for some reason in our case, it did not work well. Garbage collection was barely coming into action, threads did not free memory and the application was dying with OutOfMemory signal all the time. After several optimizations, the memory was still leaking. To make sure we are freeing resources, we decided to… kill the thread every time it finishes report execution. As I wrote this is always an overhead to spawn a new thread again and again. In the meantime, it was not a critical metric for us, and since the application started working as expected, memory issues have gone away and we agreed to keep it this way (on the Datadog dashboard screenshot you could see CPU usage peaks that represent the thread spawning. There also memory graphs that go down at some points when threads finish with the execution and being killed)

That being said, we created a WorkerPool that initializes the thread pool for us

I have already mentioned we introduced Redis cache for reports. The generateNextReport() function gets the first available report from the queue and asks WorkerPool to trigger the report execution in some worker thread.

When we have a new report to execute, we create a new thread and send a message with all the data required (report configuration, company id, and more).

Inside the thread, we can run any Javascript code to get the result. Once we finish everything we can use the parentPort connector exposed by the worker_threads module to send the message back to the main thread. As you can see the whole idea of this NodeJS feature relies on the standard event loop concept, it is quite simple and neat!

To conclude, we take a new report request, spawn a new worker thread, send a message with report configuration for this thread, generate the report, send the message back to the main thread that everything is fine (or there is an error happened), kill the thread and then ask for a new report request. This did not take a lot of effort to convert from us. The report generation code remained the same, we mostly implemented the approach to manage threads and connected it with existing Redis implementation. Simple as that.

Results

As the result, we created a whole new application. Moved out the old RhinoJS code prior to having a scalable k8s-ready NodeJS application.

  • It took 4 months instead of what we roughly estimated in the beginning (2 months). The reason is that we did not know the way the applications works. We did not know what a heck is this RhinoJS thingy. And we did not have time to have a spike on it. The whole task was our big research.
  • We re-implemented the application using Typescript. And it was amazing. It helped us write consistent stable code. We fixed typos from the old code just by the way (I guess it was something our customers were complaining about). The code itself got more readable and documented. It was super easy to refactor and extend it with new functionality.
  • We implemented Datadog metrics and added more logs. Now we clearly understand how to investigate issues. We know the application load. We are getting notifications in case something goes wrong. We know the bottlenecks and weak parts and analyze more data.
  • We handle 98% of the execution in the NodeJS. Remember we left a scheduler in the Java code. Except this, everything else happens in the NodeJS application. We receive a new bunch of reports. We spawn new threads and follow each thread’s events events. Everything is under control.
  • We finally were able to refactor the code and significantly decreased memory consumption. If we compare with Java virtual machine, the NodeJS application requires less memory for execution. But you always can write not-optimized code. That’s why it was rewritten. New service does not bump into memory limit and feels comfortable under Kubernetes environment.

It was an entertaining journey, for sure. I dug into history and it was dark times when the JS ecosystem just started to rise. Javascript got its own runtime, we now have Typescript above the JS code. Along with JS things, I learned a lot of tips and tricks related to system design. Splunk logs, Datadog metrics, Redis cache, Kubernetes deploy — this actually covers the full lifecycle for real application. We took one step back to understand how it worked back in the days. And made two steps forward. This journey was actually worth it!

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Sergey Nikitin

Sergey Nikitin

Frontend dev @ Catawiki.com, Amsterdam, NL. TypeScript, Angular, React, Webpack, NodeJS. We also can talk about surfing or snowboarding twitter.com/n0th1ng_else