Programming JS in a functional style is often more elegant but comes at the
cost of performance. In the past, part of that performance problem has been
due to the native, C++ implementations of the most-often used higher-order
operations, Array.prototype.forEach and friends. In a call like
arr.forEach(f), the passed-in kernel function f ends up needing to
cross the C++-JS boundary on every iteration of the loop, which contributes
to the slowdown. Moreover, the JIT cannot be leveraged, as forEach is not a JS
function and thus cannot be analyzed and optimized.
With the hard work of Till Schneidereit, SpiderMonkey now has self-hosted versions of these higher-order operations. That is, the implementation is in JS, not C++. This gets rid of the C++-JS boundary and, in theory, should leverage the JIT to get much improved performance. So, let’s write a simple benchmark that finds the maximum value in an N×N matrix represented raggedly using vanilla JS arrays, adapted from here.
Note: The following benchmarks should be run on Firefox Nightly. My comments about what’s fast and what’s slow probably (definitely) do not apply to other browsers and versions.
As a preamble, here is a utility benchmark function that creates an N×N matrix
of random numbers, takes a benchmark function f, and runs it.
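The original utility listing is not reproduced here; a minimal sketch along those lines might look like the following (the names makeMatrix and bench are my own, not from the original post):

```javascript
// Build an N×N matrix of random numbers as vanilla JS arrays.
function makeMatrix(n) {
  var matrix = new Array(n);
  for (var i = 0; i < n; i++) {
    matrix[i] = new Array(n);
    for (var j = 0; j < n; j++)
      matrix[i][j] = Math.random();
  }
  return matrix;
}

// Run a benchmark function f on a fresh N×N matrix and report elapsed time.
function bench(name, n, f) {
  var matrix = makeMatrix(n);
  var start = Date.now();
  var result = f(matrix);
  console.log(name + ": " + (Date.now() - start) + " ms");
  return result;
}
```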
First is the gold standard, two nested
for loops. This performs quite well.
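A version along these lines (a sketch; the original listing is omitted from this copy):

```javascript
// Gold standard: find the maximum of a ragged N×N matrix with two
// nested for loops. No closures, no calls in the hot loop.
function maxForLoop(matrix) {
  var max = -Infinity;
  for (var i = 0; i < matrix.length; i++) {
    var row = matrix[i];
    for (var j = 0; j < row.length; j++) {
      if (row[j] > max)
        max = row[j];
    }
  }
  return max;
}
```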
Next up is using the builtin Array.prototype.forEach. The functional
programmer in me rejoices, but this version performs much, much worse than the
for loop version.
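A sketch of the forEach version (the original listing is not shown here):

```javascript
// The same maximum computation using the builtin forEach: two nested
// closures, both capturing the running max from the enclosing scope.
function maxForEach(matrix) {
  var max = -Infinity;
  matrix.forEach(function (row) {
    row.forEach(function (v) {
      if (v > max)
        max = v;
    });
  });
  return max;
}
```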
To convince yourself that a handrolled
forEach does no better, run the
following snippet. Note however that a handrolled version is a bit faster, due
to its not needing to perform extra operations that the builtin version is
required to perform for spec compliance.
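The snippet itself is missing from this copy; a handrolled version might be sketched as follows (myForEach is my own name):

```javascript
// A minimal handrolled forEach. It skips the spec-mandated steps the
// builtin must perform (ToObject, length coercion, hole checks), so it
// is a bit faster, but the closure and call-boundary costs remain.
function myForEach(arr, f) {
  for (var i = 0; i < arr.length; i++)
    f(arr[i]);
}

function maxMyForEach(matrix) {
  var max = -Infinity;
  myForEach(matrix, function (row) {
    myForEach(row, function (v) {
      if (v > max)
        max = v;
    });
  });
  return max;
}
```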
The performance degradation, even with a self-hosted forEach, is due to the
JIT’s inability to efficiently inline both the closures passed to forEach as
well as the nested call to forEach. In a perfect world, the forEach
version, after inlining all the way, should be more or less equivalent to the
nested for loop version.
An old technique exists to compile closures more efficiently: lambda lifting. The optimization is currently not implemented in IonMonkey, SpiderMonkey’s new JIT, but it’s easy to verify its effectiveness by manually optimizing the source.
The transformation is straightforward. Convert all free variables to
arguments, thus obviating the need to create an environment, then lift the
closures to the global scope. Below I choose to collect all free variables
(though there’s only one in both kernels anyways) into a state object, which
is passed explicitly to the lifted kernels.
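A sketch of the lifted version, using the liftedForEach, liftedInnerKernel, and liftedOuterKernel names the text refers to (the exact original listing is not reproduced here):

```javascript
// Lambda-lifted version: kernels are hoisted to the top level, and the
// single free variable (the running max) lives in an explicit state
// object instead of a captured environment.
function liftedForEach(arr, f, state) {
  for (var i = 0; i < arr.length; i++)
    f(arr[i], state);
}

function liftedInnerKernel(v, state) {
  if (v > state.max)
    state.max = v;
}

function liftedOuterKernel(row, state) {
  liftedForEach(row, liftedInnerKernel, state);
}

function maxLifted(matrix) {
  var state = { max: -Infinity };
  liftedForEach(matrix, liftedOuterKernel, state);
  return state.max;
}
```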
The lambda lifted version is faster, but still not up to the speed of the
for loops. What’s going on? It turns out that the call to
liftedForEach is unable to be inlined effectively due to its being called
polymorphically: both liftedInnerKernel and
liftedOuterKernel are passed to
f. We can help the compiler by manually cloning the caller.
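The cloned-caller snippet is also missing from this copy; the idea can be sketched as below, with each copy of the loop hard-wired to a single kernel (the innerForEach/outerForEach names are my own):

```javascript
// Manually clone the higher-order caller so that each copy calls
// exactly one kernel. Every call site is now monomorphic and direct,
// so the JIT can inline all the way down.
function innerKernel(v, state) {
  if (v > state.max)
    state.max = v;
}

function innerForEach(arr, state) {
  for (var i = 0; i < arr.length; i++)
    innerKernel(arr[i], state);
}

function outerKernel(row, state) {
  innerForEach(row, state);
}

function outerForEach(arr, state) {
  for (var i = 0; i < arr.length; i++)
    outerKernel(arr[i], state);
}

function maxCloned(matrix) {
  var state = { max: -Infinity };
  outerForEach(matrix, state);
  return state.max;
}
```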
Finally, comparable performance to the nested for loops. To conclude, there
are two criteria for functional style performance parity:

1. No closure creation: free variables are converted to explicit arguments and the kernels are lifted out of their enclosing scope.
2. Monomorphic call sites: higher-order callers are cloned so each call site sees a single kernel and can be inlined.
The transformations to enable the above criteria are tedious and are surely the purview of the compiler. All that’s needed are brave compiler hackers.