Optimizing Purescript's Runtime Performance

Posted on

While writing the WebGL bindings for Purescript and testing them by porting the first 9 examples from learningwebgl.com to Purescript (see here), we soon realized there's quite a performance penalty when using Purescript compared to the JavaScript versions.

Unfortunately we learnt that performance optimization of the generated JavaScript is currently not in focus of
the (otherwise rapid) development of Purescript.

This may be justified, since many domains where JavaScript is used are not performance critical. However, the picture changes if you do animated graphics in the GUI, may it be in 2D or 3D, may it be for games or applications. Since we need good performance close to JavaScript's, we ourselves started to look into implementing optimizations.

One obvious performance hit we singled out quite soon is the naive representation of currying and argument application. This means, a function with n arguments is represented by n functions with one argument in the generated JavaScript code, as shown in the following example:

f (a,b,c,d) {} =>
    f (a) {
        return function (b) {
          return function (c) {
            return function (d) {}}}}

Now for a function call that is saturated, meaning that all arguments can be applied in a single step, the calling code looks as follows:

f(1,"b",2,"d"); => f(1)("b")(2)("d");

We decided to add an optimization to "uncurry" curried functions, whenever partial application of arguments is not needed. For that purpose, the psc-bundle tool seemed like the best candidate: the advantage here is that it
holds a complete Abstract Syntax Tree (AST) of all relevant code in memory (the current functionality of this tool being mainly dead code elimination). Thus the logical way to go for the uncurry optimization is a source-to-source transformation in that AST.

So we added a new option –optimize uncurry to psc-bundle, which when given does the following:

  1. For every curried function, generate an uncurried variant. This is achieved by keeping the innermost function body whilst changing the function's signature and mangling its name with some arbitrary suffix (in our case, $\_$\_$).
  2. Add module exports for all uncurried function variants iff the curried variant is exported.
  3. Replace saturated calls to a curried function with a call to its uncurried variant.
  4. Remove uncalled function variants (both curried and uncurried) in a second pass of dead code elimination.

In the current implementation the (empty) "monadic" argument is never optimized away, although this should be straightforward to implement. Additionally, functions of typeclasses never get optimized this way either.

The results

We tested the optimizing psc-bundle tool against our own application, which unfortunately is not open source. As there seems to be no commonly agreed-upon benchmark suite for Purescript, our own codebase had to do.
On the upside, the codebase is of considerable size, the application itself adresses and generalizes a real-world
use case, incorporates different algorithms and, last but not least, motivates our optimization efforts due to
immediate practical concern.

After the first pass of psc-bundle, the result consists of 133 modules with 666 top-level function definitions, of which 338 (~50%) trigger the generation of an uncurried variant. 216 of these 338 functions are exported. 2038 calls with partial application are replaced by saturated calls. After the second run of dead code elimination, 215 uncurried functions survive, being called from approx. 1300 call sites.

27 functions survive with both curried and uncurried definitions (~4%). This duplication corresponds to an increase in JavaScript code size from 976.1 to 981.0 KByte (0.5%). However, the lines of code shrink from 20225 to 19398, because uncurried functions have a more compact representation.

We then ran the custom benchmark suite of our application, which consists of a chain of 3D animations and runs about 30 seconds, while measuring the median cycle time (which corresponds to rendering one single frame). We took five measurements alternating between optimized and unoptimized version. On Firefox 45.0.1, the median cycle time without optimization was 7.64ms, with the optimized version it was 6.44ms. That is an improvement of 15.65%. With Google Chrome 49.0.2623.110 (64-bit), the median cycle time in the unoptimized version was 15.12 ms, whereas with the optimized version it was 12.18ms – an overall speedup of 19.98%.

Considering the fact that a significant portion of the benchmark targets measuring GPU processing and WebGL calls, the performance improvement regarding JavaScript execution in SpiderMonkey and V8 is astonishingly high: one could
expect even more drastic percentages with code that is run exclusively by those JavaScript engines.

It would be interesting to compare memory consumption as well, but we currently lack in-depth understanding of
the underlying JavaScript infrastructure to be able to produce figures and claim them to be representative. To conclude, here is an overview over our findings so far:

SLOC File size (KByte) Firefox (ms) Chrome (ms) Boost (%)
20225 976.1 7.64 15.12
-p u 19398 981.0 6.44 12.18 15.65 – 19.98%

Loose ends, outlook and potential

The current development version of the uncurry optimization is not quite production-ready yet. There are some corner cases where under certain conditions the second pass of dead code elimination performs to eagerly. As a consequence, functions which actually get called, are eliminated from the AST. Anyone feeling the need for a quick adaption of this optimization, please do not hesitate to jump in here.

Furthermore, only top-level function definitions are considered candidates for this optimization. There's no reason local functions which are generated by the Purescript keywords let and where should not be viable targets for the uncurry optimization, too – except for our lack of time.

As you can imagine, our approach could open up a generalized way of implementing optimizations that happen as source-to-source transformation in the JavaScript AST. Some other areas could be addressed this way, e.g.:

  • Inlining: Assign a weight to a function definition based on the size of its body and the number of call sites it's referenced from; inline the code when it's sufficiently small or gets called from just a single (or very few) site(s).

  • An equivalent of Lambda Calculus' eta conversion: Replace a function f(x,..) = g(x,..) with g(x,..)

In general we hope that the code generator of Purescript will improve in the future regarding performance and memory consumption; especially when generating code for typeclasses. The inlining optimization could be a huge factor here.

So check out our branch of the optimizing Purescript toolchain, see if you can verify our findings and give us your feedback, or get involved if this is an important issue to you.

^JF ^MK