Programming JS in a functional style is often more elegant but comes at the cost of performance. In the past, part of that performance problem has been due to native, C++ implementations of the most-often used higher-order operations like `Array.prototype.forEach` and friends. In a call like `arr.forEach(f)`, the passed-in kernel function `f` ends up needing to cross the C++-JS boundary on every iteration of the `forEach`, causing slowdown. Moreover, the JIT cannot be leveraged, as `forEach` is not a JS function and thus cannot be analyzed and optimized.
With the hard work of Till Schneidereit, SpiderMonkey now has self-hosted versions of these higher-order operations. That is, the implementation is in JS, not C++. This gets rid of the C++-JS boundary and, in theory, should leverage the JIT for much improved performance. So, let's write a simple benchmark that finds the maximum value in an N×N matrix represented raggedly using vanilla JS arrays, adapted from here.
Note: The following benchmarks should be run on Firefox Nightly. My comments about what’s fast and what’s slow probably (definitely) do not apply to other browsers and versions.
As a preamble, here is a utility benchmark function that creates an N×N matrix of random numbers, takes a benchmark function `f`, and runs it `iters` times:
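The original snippet is not shown here, so the following is a sketch; the names `randomMatrix` and `benchmark` and the fixed `N` are illustrative assumptions:

```javascript
var N = 1000;

// Build an N×N ragged matrix of random numbers using plain JS arrays.
function randomMatrix(n) {
  var m = new Array(n);
  for (var i = 0; i < n; i++) {
    m[i] = new Array(n);
    for (var j = 0; j < n; j++)
      m[i][j] = Math.random();
  }
  return m;
}

// Run the benchmark kernel f on a fresh matrix iters times and report elapsed time.
function benchmark(name, f, iters) {
  var mat = randomMatrix(N);
  var start = Date.now();
  for (var i = 0; i < iters; i++)
    f(mat);
  console.log(name + ": " + (Date.now() - start) + " ms");
}
```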
First is the gold standard: two nested `for` loops. This performs quite well.
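The original listing is lost; a minimal reconstruction, assuming the kernel takes the matrix and returns its maximum (the name `maxForLoop` is mine):

```javascript
// Gold standard: two nested for loops, no closures, no calls per element.
function maxForLoop(mat) {
  var max = -Infinity;
  for (var i = 0; i < mat.length; i++) {
    var row = mat[i];
    for (var j = 0; j < row.length; j++) {
      if (row[j] > max)
        max = row[j];
    }
  }
  return max;
}
```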
Next up is using the builtin `Array.prototype.forEach`. The functional programmer in me rejoices, but this version performs much, much worse than the code above.
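A sketch of that version, under the same assumptions as before (`maxForEach` is an illustrative name):

```javascript
// Idiomatic functional style: nested forEach with closures over max.
function maxForEach(mat) {
  var max = -Infinity;
  mat.forEach(function (row) {
    row.forEach(function (v) {
      if (v > max)
        max = v;
    });
  });
  return max;
}
```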
To convince yourself that a handrolled `forEach` does no better, run the following snippet. Note, however, that a handrolled version is a bit faster, as it does not need to perform the extra operations that the builtin version is required to perform for spec compliance.
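Something along these lines; the handrolled loop below skips the spec-mandated checks (hole skipping, length conversion, and so on), and the names are illustrative:

```javascript
// A bare-bones forEach: no hole skipping, no ToObject/ToUint32 dance.
function myForEach(arr, f) {
  for (var i = 0; i < arr.length; i++)
    f(arr[i], i, arr);
}

function maxHandrolled(mat) {
  var max = -Infinity;
  myForEach(mat, function (row) {
    myForEach(row, function (v) {
      if (v > max)
        max = v;
    });
  });
  return max;
}
```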
The performance degradation, even with a self-hosted `forEach`, is due to the JIT's inability to efficiently inline both the closures passed to `forEach` and the nested call to `forEach` itself. In a perfect world, the `forEach` version, after inlining all the way down, should be more or less equivalent to the `for` loop version.
An old technique exists to compile closures more efficiently: lambda lifting. The optimization is not currently implemented in IonMonkey, SpiderMonkey's new JIT, but it's easy to verify its effectiveness by manually optimizing the source.
The transformation is straightforward. Convert all free variables to arguments, thus obviating the need to create an environment, then lift the closures to the global scope. Below, I choose to collect all free variables (though there's only one in both kernels anyway) into a state object, `st`.
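The names `liftedForEach`, `liftedInnerKernel`, `liftedOuterKernel`, and `st` come from the discussion that follows; the surrounding `maxLifted` wrapper is my own reconstruction:

```javascript
// forEach variant that threads an explicit state object to the kernel.
function liftedForEach(arr, f, st) {
  for (var i = 0; i < arr.length; i++)
    f(arr[i], st);
}

// Kernels lifted to the global scope: free variable max lives in st.
function liftedInnerKernel(v, st) {
  if (v > st.max)
    st.max = v;
}

function liftedOuterKernel(row, st) {
  liftedForEach(row, liftedInnerKernel, st);
}

function maxLifted(mat) {
  var st = { max: -Infinity };
  liftedForEach(mat, liftedOuterKernel, st);
  return st.max;
}
```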
The lambda lifted version is faster, but still not up to the speed of the nested `for` loops. What's going on? It turns out that the call to `f` inside `liftedForEach` cannot be inlined effectively because it is polymorphic: both `liftedInnerKernel` and `liftedOuterKernel` are passed in as `f`. We can help the compiler by manually cloning the caller.
Finally, we get performance comparable to the nested `for` loops. To conclude, there are two criteria for functional style performance parity: the closures passed to higher-order functions must be lambda lifted, so no environments need to be created, and the higher-order callers must be cloned, so each call site of the kernel function stays monomorphic and can be inlined.
The transformations to enable the above criteria are tedious and are surely the purview of the compiler. All that’s needed are brave compiler hackers.