An evening of hunting memory leaks
‘Twas a lovely early afternoon on the Be’er-Sheva campus of Ben-Gurion University when AS mentioned that a system based on BPjs was slowing down after running for about ten minutes. I’m generally happy when I get bug reports (it means that people use the software), but this report sounded like it might be an issue that could be caused by something that may or may not – but quite probably may – be a memory leak.
I don’t like memory leaks. They can be pretty hard to nail down.
We opened JConsole, attached it to the suspect JVM, and watched the memory usage graph for a while. Usage did go down when the GC kicked in, but never quite to the level it was at before. I was toying with the hope that a major GC would kick in and get us back to a civilized level of memory consumption, but to no avail. We were, indeed, looking at a memory leak.
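The check we did by eye in JConsole (does the floor the graph drops to after each GC keep rising?) can also be done from inside the JVM. Below is a minimal sketch using the standard java.lang.management API. The class name PostGcBaseline is made up for illustration, and summing MemoryPoolMXBean.getCollectionUsage() over the heap pools is only a rough approximation of live data, since each pool reports its state after its own most recent collection.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class PostGcBaseline {

    /** Approximate live heap: bytes still used in each heap pool right after its last GC. */
    static long postGcHeapUsed() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip metaspace, code cache, etc.
            }
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if the pool does not support it
            if (afterGc != null) {
                total += afterGc.getUsed();
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample the post-GC baseline a few times. With a leak, this number
        // keeps creeping up from sample to sample instead of levelling off.
        for (int i = 0; i < 5; i++) {
            System.gc(); // only a hint; the JVM is free to ignore it
            Thread.sleep(200);
            System.out.printf("sample %d: %d KiB live after GC%n", i, postGcHeapUsed() / 1024);
        }
    }
}
```

Run next to the suspect workload, a steadily rising number across samples is the same symptom the JConsole graph showed.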
I don’t like memory leaks. They can be pretty hard to nail down. And have a tendency to appear when one needs to write two papers and a PhD thesis.
There was one hope, though – it could be a leak in the system using BPjs, and not in BPjs itself. To find out, we created a simple b-program that waits for external tick events and then toggles a virtual switch. The program used a static set of a few events (the “new” keyword was never used).
This time, looking at the JVM using Java Mission Control (JMC), there was no doubt:
Memory consumption didn’t look too good either:
After the usual start-up mess, the familiar saw-tooth graph slowly creeping upwards appears. Yep, that’s a memory leak, and in BPjs.
I hate memory leaks. The good news was that I didn’t have any plans for the evening.
We ran some extra tests and noticed that when we released a BProgramRunner instance, the leaked memory was freed. I’d have to start there.
Seek-a-Leak
Debugging is like being the detective in a crime movie where you are also the murderer.
— Filipe Fortes (@fortes) November 10, 2013
The main issue with memory leaks is that you need to understand your system at all its levels. Every method and value is a suspect. I once hunted a leak in a Java (1.4.2) Swing application, and it turned out that hiding the caret in a JTextPane prior to displaying the JFrame containing it leaked the entire window, through references left in the timer mechanism responsible for blinking the caret. I chased references using an Eclipse debugger and a pad of paper for three whole days to prove that.
Even more annoying, that bug had already been reported by the time I realized what was going on.
Luckily, BPjs is much smaller than the Java standard library – about 4600 lines. But it relies on Mozilla Rhino, which is also a suspect.
I started from BProgramRunner and traced all the references it held. The JMC report showed that the counts of sync statements and BEvents were in the 100Ks, so the leak had something to do with the runtime of the JavaScript code itself. My hopes for some textbook example of a map somewhere caching instances were gone.
The good news was that I didn’t have any plans for the night, either.
Scopes and Proxies
I was poking around the code, drawing blanks, so I started doing some clean-ups and solving old issues, hoping to come up with something. Issue #32 caught my eye – I had wanted to get rid of the BThreadJsProxy class, but never got around to doing it. OK, issue #32 it is.
In the early days of BPjs, b-threads could call “bsync” to synchronize with their peers. That method was implemented in BThreadJsProxy, a class whose instances were made available to the JavaScript client code by placing them as a scope in the scope hierarchy under which the JavaScript code runs. Later, other runtime features found their way into this class.
As the BPjs library evolved, we moved everything BP to a “bp” object that BPjs made available to the b-program code in a similar way. That bp object, implemented by BProgramJsProxy, is global to the b-program and is not aware of any specific b-thread. The only BPjs runtime feature that required a specific b-thread was the ability to set interrupt handlers. Boilerplate aside, that was the only method left in BThreadJsProxy. Moving it to BProgramJsProxy was non-trivial and seemed unimportant at the time, so the class stayed. With nothing better to do, I moved the interrupt handler mechanism to BProgramJsProxy, thanking whoever decided to add ThreadLocal to Java’s standard library. That’s it – BThreadJsProxy could be removed. The leak would stay, but at least #32 would be solved, so the evening wouldn’t be a complete waste of time.
I started removing references to BThreadJsProxy from the runtime and analysis subsystems, when I encountered the code that placed the proxy in a scope, so that the JavaScript code could call it:
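ThreadLocal is what lets a single, program-global object still answer “which b-thread is calling me?”. The following is a hypothetical sketch, not BPjs’s actual BProgramJsProxy: the runtime records the current b-thread in a ThreadLocal around each b-thread step, and a method like setInterruptHandler reads it instead of needing a per-b-thread proxy object. The names ProgramProxySketch, runAs and interruptHandlerOf are invented for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical program-global proxy that still serves per-b-thread requests. */
public class ProgramProxySketch {

    /** Name of the b-thread currently running on this Java thread, if any. */
    private final ThreadLocal<String> currentBThread = new ThreadLocal<>();

    /** Interrupt handlers keyed by b-thread name; concurrent in case b-threads run in parallel. */
    private final Map<String, Runnable> interruptHandlers = new ConcurrentHashMap<>();

    /** Called by the (hypothetical) runtime around each b-thread step. */
    public void runAs(String bthreadName, Runnable step) {
        currentBThread.set(bthreadName);
        try {
            step.run();
        } finally {
            currentBThread.remove(); // don't let the context outlive the step
        }
    }

    /** Called from b-thread code; note there is no b-thread parameter. */
    public void setInterruptHandler(Runnable handler) {
        String name = currentBThread.get();
        if (name == null) {
            throw new IllegalStateException("setInterruptHandler called outside a b-thread");
        }
        interruptHandlers.put(name, handler);
    }

    public Optional<Runnable> interruptHandlerOf(String bthreadName) {
        return Optional.ofNullable(interruptHandlers.get(bthreadName));
    }

    public static void main(String[] args) {
        ProgramProxySketch bp = new ProgramProxySketch();
        bp.runAs("switch", () -> bp.setInterruptHandler(() -> System.out.println("switch interrupted")));
        bp.interruptHandlerOf("switch").ifPresent(Runnable::run); // prints "switch interrupted"
        System.out.println(bp.interruptHandlerOf("ticker").isPresent()); // prints "false"
    }
}
```

Clearing the ThreadLocal in a finally block matters when steps run on pooled threads: a value left behind would misattribute later calls on that thread and keep its referent reachable for as long as the thread lives.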
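import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;
void setupScope(Scriptable programScope) {
    // Wrap the Java proxy object so the JavaScript code can call its methods.
    Scriptable bthreadProxyScope = (Scriptable) Context.javaToJS(proxy, programScope);
    // Meant to hide java.lang.Object members (equals, ...) from b-thread code.
    bthreadProxyScope.delete("equals");
    // <more deletes>
    bthreadProxyScope.setParentScope(programScope);
    // setup entryPoint's scope s.t. it knows our proxy:
    // hang the proxy scope at the top of entryPoint's scope chain, right under programScope.
    Scriptable penultimateEntryPointScope = ScriptableUtils.getPenultiamteParent(entryPoint);
    penultimateEntryPointScope.setParentScope(bthreadProxyScope);
    scope = entryPoint;
}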
I was quite proud of this code when I wrote it – I rarely get a legitimate excuse for using the word “penultimate” in a variable name. But since everyone is a suspect now, I looked closer. This code adds another scope to the scope hierarchy within which the JavaScript code runs. That’s cool. The not-so-cool part is that it runs each time a “sync” is called. So, each time a “sync” is called, another scope is added to the hierarchy, and none of the earlier ones is ever unlinked. They all stay reachable, caching things like BEvents. And, effectively, causing a memory leak.
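To see why the chain grows instead of being rebuilt, here is a toy model of the setupScope code above. Scope is a hypothetical stand-in for Rhino’s Scriptable (only the parent link is modeled), and penultimate() is my assumption of what ScriptableUtils.getPenultiamteParent does: return the scope just below the top of the chain. Because the proxy spliced in by one call becomes the penultimate scope for the next call, every sync pushes a fresh proxy in below the previous one.

```java
/** Toy model of repeated scope splicing; not Rhino or BPjs code. */
public class ScopeChainGrowth {

    /** Hypothetical stand-in for Scriptable: just a name and a parent link. */
    static final class Scope {
        final String name;
        Scope parent;

        Scope(String name, Scope parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /** Assumed semantics of getPenultiamteParent: the scope just below the top of the chain. */
    static Scope penultimate(Scope s) {
        while (s.parent != null && s.parent.parent != null) {
            s = s.parent;
        }
        return s;
    }

    /** Mirrors setupScope: a new proxy scope is created and spliced in on every call. */
    static void setupScope(Scope programScope, Scope entryPoint, int callNo) {
        Scope proxyScope = new Scope("proxy#" + callNo, programScope);
        penultimate(entryPoint).parent = proxyScope;
    }

    /** Number of scopes from s up to and including the top. */
    static int chainLength(Scope s) {
        int n = 0;
        for (; s != null; s = s.parent) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Scope program = new Scope("program", null);
        Scope entryPoint = new Scope("entryPoint", program);
        for (int sync = 1; sync <= 3; sync++) {
            setupScope(program, entryPoint, sync);
            // prints chain lengths 3, 4, 5: one more scope per sync
            System.out.println("after sync " + sync + ": chain length " + chainLength(entryPoint));
        }
    }
}
```

In the model, a lookup walking up from entryPoint still meets proxy#1 before any later proxy, so nothing looks wrong functionally, while the chain, and everything it references, grows by one scope per sync.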
I hate memory leaks.
Removing the BThreadJsProxy class was one of those nice refactors where you delete more than you add. It also solved the leak:
Memory consumption looks better now, too. After the initial start-up mess, the same program consistently occupies less than 32 MiB:
Takeaways
- Love thy early adopter. And be prepared to get some serious bug reports when a framework starts to gain traction.
- Mozilla Rhino is pretty performant even when used badly – that leak had been there for a while, and nobody noticed until someone used the system for long-running tasks.
- Profiling and inspection tools like JMC and JConsole rock.
- Don’t leave design-related fixes lying around for too long.
- Don’t make plans for the evening – you can never tell whether you’ll have to deal with a memory leak.