Saturday, March 26, 2011

Unladen Swallow Retrospective

I wrote this while I was at PyCon, but I kept revising it. Anyway, here goes. :)
As is apparent by now, no one has really been doing any work directly on Unladen Swallow or in porting it to py3k. Why not?

Lack of Sponsor Interest

The primary reason is that we weren't able to generate enough internal customers at Google. There are a few reasons for that:
  1. Most Python code at Google isn't performance critical. It's used mainly for tools and prototyping, and most user-facing applications are written in Java and C++.
  2. For those internal customers who were interested, deployment was too difficult: being a drop-in replacement was not enough. Any new Python implementation had to be turn-key. We thought building on CPython instead of starting fresh would sidestep this problem because C extensions and SWIG code would just work. However, simply changing from the previous version of CPython to a 2.6-based Python was too difficult.
  3. Our potential customers eventually found other ways of solving their performance problems that they felt more comfortable deploying.
After my internship was over, I tried to make Unladen the focus of my Master's thesis at MIT, but my advisor felt that the gains so far were insufficient to have a big impact and that the techniques I wanted to implement were no longer considered novel. Most feedback techniques had already been implemented for Smalltalk by Urs Hölzle and tracing for Java by Andreas Gal. That's not to say there are no novel techniques left to discover, but I didn't have any ideas at the time.

Lack of Personal Interest

Most of this was decided around Q1 of 2010. We still could have chosen to pursue it in our spare time, but at that point things looked a little different.
First of all, it's a lot less fun to work on a project by yourself than with other people, especially if it's unclear if you'll even have users.
Secondly, a large part of the motivation for the project was that we felt like PyPy would never try to support CPython C extension modules or SWIG wrapped code. We were very surprised to see PyPy take steps in that direction. That somewhat obviated the need to build a bolt-on JIT for CPython. Also, when the project was launched, PyPy didn't have x86_64 support, but in the meantime they have added it.
Finally, the signals we were getting from python-dev were not good. There was an assumption that if Unladen Swallow were landed in py3k, Google would be there to maintain it, which was no longer the case. Had the merge gone through, the JIT would likely have been disabled by default and ripped out a year later after bitrotting. Only a few developers seemed excited about the new JIT. We never finished the merge, but our hope was that if we had, we could entice CPython developers to hack on the JIT.
So, with all that said for why none of us are working on Unladen anymore, what have we learned?

Lessons About LLVM

First of all, we explored a lot of pros and cons of using LLVM for the JIT code generator. The initial choice to use LLVM was made because at the time none of us had significant experience with x86 assembly, and we really wanted to support x86 and x86_64 and potentially ARM down the road. There were also some investigations of beefing up psyco, which I believe were frustrated by the need for a good understanding of x86.
Unfortunately, LLVM in its current state is really designed as a static compiler optimizer and back end. LLVM code generation and optimization is good but expensive. The optimizations are all designed to work on IR generated by static C-like languages. Most of the important optimizations for Python require high-level knowledge of how the program behaved on previous iterations, and LLVM didn't help us exploit that.
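To make that concrete, here's a toy sketch (nothing like our actual implementation; all the names are made up) of the kind of runtime type feedback a Python JIT wants to gather, which no static IR-level optimizer can infer on its own:

```python
# Toy sketch: record observed operand types at a bytecode site, then
# check whether the site looks monomorphic. A JIT would use feedback
# like this to emit a type-specialized fast path guarded by checks.
from collections import Counter

class CallSiteFeedback:
    def __init__(self):
        self.types = Counter()

    def record(self, left, right):
        self.types[(type(left), type(right))] += 1

    def is_monomorphic(self):
        # Only one (type, type) pair ever seen at this site.
        return len(self.types) == 1

feedback = CallSiteFeedback()

def binary_add(left, right):
    feedback.record(left, right)   # instrumented generic path
    return left + right

for i in range(100):
    binary_add(i, 1)

assert feedback.is_monomorphic()
assert feedback.types[(int, int)] == 100
```

This is exactly the sort of "what happened on previous iterations" knowledge that lives above LLVM's IR.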
An example of needing to apply high-level knowledge to code generation is optimizing Python stack access. LLVM will not fold loads from the Python stack across calls to external functions (i.e. the CPython runtime, which is to say constantly). We eventually wrote a custom alias analysis to solve this problem, but it's an example of the kind of work you have to do if you don't roll your own code generator.
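As a toy illustration of the problem (a Python model, not our actual code): the value stack lives in a heap-allocated frame object, and every call into the runtime is opaque to the optimizer, which must conservatively assume the callee could reach the frame and therefore re-load the stack slot afterwards:

```python
# Toy model: in CPython the value stack is a C array inside the heap-
# allocated PyFrameObject. Runtime calls take object pointers that, as
# far as LLVM can prove, might alias that frame, so a load of the stack
# top before the call cannot be reused after it.
class Frame:
    def __init__(self):
        self.stack = []   # stands in for the in-memory value stack

def runtime_call(frame, func, arg):
    # Opaque to the optimizer: nothing proves a runtime function
    # doesn't reach back into the frame (e.g. via tracing hooks).
    return func(arg)

def eval_loop(frame):
    frame.stack.append(10)           # store to "stack memory"
    top = frame.stack[-1]            # load #1
    runtime_call(frame, abs, -1)     # opaque call: may touch frame.stack
    top_again = frame.stack[-1]      # load #2: must be re-done without
                                     # alias analysis proving top == top_again
    return top_again

assert eval_loop(Frame()) == 10
```

Our alias analysis essentially taught LLVM the frame's aliasing rules so load #2 could be folded into load #1.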
LLVM also comes with other constraints. For example, LLVM doesn't really support back-patching, which PyPy uses for fixing up their guard side exits. It's a fairly large dependency with high memory usage, though I would argue, based on the work Steven Noonan did for his GSoC, that it could be reduced, especially considering that PyPy's memory usage had been higher.
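To sketch what back-patching means here (purely illustrative, in Python rather than machine code): a guard's side exit initially points at an interpreter stub, and once a bridge for that exit gets compiled, the exit is patched in place so later guard failures run the new code directly:

```python
# Toy sketch of back-patching a guard side exit. In PyPy this happens
# at the machine-code level by rewriting a jump target; LLVM's JIT
# offered no supported way to rewrite already-emitted code like this.
def interpreter_stub(value):
    # Initial side-exit target: bail out to the interpreter.
    return ("interpreted", value)

def compiled_bridge(value):
    # Later-compiled fast path for the failing case.
    return ("bridge", value)

class Guard:
    def __init__(self):
        self.side_exit = interpreter_stub   # placeholder target

    def patch(self, target):
        self.side_exit = target             # the "back-patch"

guard = Guard()
assert guard.side_exit(1) == ("interpreted", 1)
guard.patch(compiled_bridge)                # patch after compiling
assert guard.side_exit(1) == ("bridge", 1)
```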
I also spent the summer adding an interface between LLVM's JIT and gdb. This wasn't necessary, but it was a nice tool. I'm not sure what the state of debugging PyPy is, but we may be able to take some of the lessons from that experience and apply them to PyPy.

Takeaways

Personally, before working on this project I had taken a compiler class and an OS class, but this experience really brought together a lot of systems programming skills for me. I'm now quite experienced with gdb, having hacked on it and run it under itself. I also know a lot more about x86, compiler optimization techniques, and JIT tricks, which I'm using extensively in my Master's thesis work.
I'm also proud of our macro-benchmark suite of real-world Python applications, which lives on and which PyPy uses for speed.pypy.org. In all the performance work I've done before and after Unladen, I have to say that our macro-benchmark suite was the most useful. Every performance change was easy to check with a simple before-and-after comparison.
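For flavor, here's a trivial sketch of that before/after style of comparison (not the actual suite, which compares whole interpreters running real applications, and the function names here are made up):

```python
# Toy before/after harness: time a workload under two callables and
# report the ratio. The real macro-benchmark suite did this at the
# level of entire Python implementations over real applications.
import timeit

def compare(before, after, number=5):
    t_before = min(timeit.repeat(before, number=number, repeat=3))
    t_after = min(timeit.repeat(after, number=number, repeat=3))
    return t_before / t_after   # > 1.0 means "after" is faster

def baseline():
    return sum(i * i for i in range(10_000))

def candidate():
    total = 0
    for i in range(10_000):
        total += i * i
    return total

assert baseline() == candidate()   # same answer, possibly different speed
speedup = compare(baseline, candidate)
assert speedup > 0                 # a ratio; 1.2 would mean 20% faster
```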
We also did a fair amount of good work contributing to LLVM, which other LLVM JIT projects, such as Parrot and Rubinius, can benefit from. For example, there used to be a 16 MB limitation on JITed code size, which I helped to fix. LLVM's JIT also has a gdb interface. Jeff also did a lot of work towards being able to inline C runtime functions into the JITed code, as well as fixing memory leaks and adding the TypeBuilder template for building C types in LLVM.
So, while I wish there were more resources and the project could live on, it was a great experience for me, and we did make some material contributions to LLVM and the benchmark suite that live on.

20 comments:

  1. Great post, thanks.

    Small typo: Andreas Gil -> Andreas Gal.

    ReplyDelete
  2. FWIW, Reid and I discussed this before he wrote the post. Some minor clarifications below:

    - "I also spent the summer adding an interface between LLVM's JIT and gdb. This wasn't necessary, but it was a nice tool." I disagree on "wasn't necessary", or at least it should be amended with "in hindsight". We decided to work on the gdb interface as part of a larger strategy to make upstream merger easier. The theory was that it would be hard to get python-dev to maintain something they couldn't debug, especially given how easy it was to debug the existing eval loop.

    - "Secondly, a large part of the motivation for the project was that we felt like PyPy would never try to support CPython C extension modules or SWIG wrapped code." This is part of the story. The lay of the land at the end of 2008 when we began the project was very different than it is today. At the time, PyPy was slow on all the benchmarks we had gathered, it was x86-only, and it didn't support extension modules or SWIG. We believed it would be easier to make CPython convincingly fast than to add x86-64 codegen AND make PyPy faster AND add perfect support for extension modules and SWIG.

    This leads into...

    - "The initial choice to use LLVM was made because at the time none of us had significant experience with x86 assembly, and we really wanted to support x86 and x86_64 and potentially ARM down the road." We also believed that LLVM was a more robust JIT than it turned out to be. Apple was using the JIT engine in real products, and so we took that as a sign that it could work for us as well. While using LLVM helped us get off the ground very, very quickly, it quickly became a liability, and we ended up having to fix lots of bugs in the JIT support. It also brought with it lots of features we ended up not needing and which we had to spend time carving out in the name of memory footprint.

    ReplyDelete
  3. Reid, thanks for taking the time to write up this post-mortem. I had been wondering what happened to the project. Even though the project has lost its legs, there are still a lot of lessons for the authors and observers (like me). For example, it was very helpful to look at a LLVM-based JIT when I sketched out a toy JIT for my class last year.

    ReplyDelete
  4. Thanks for the writeup!

    #1 is the real key. Code in the critical path for performance rarely remains written in Python, so in the end making Python anything less than an order of magnitude or two faster doesn't add up to much.

    As for #2: It is worth noting that most of Google is using CPython 2.6 _today_ but at the time the code base was not ready for that.

    ReplyDelete
  5. That was a great read, thanks for the article.

    Andreas "Gil"... Isn't this guy responsible for many limitations of Python, especially regarding multithreading?? :]

    Running gdb under itself must be really cool.

    ReplyDelete
  6. In looking at Unladen Swallow about 6 months ago, I had the same reaction as gpshead, except that it would also need to add better support for threading as well. But that means replacing reference counting with a GC, which implies, I believe, tearing up the CPython C API so much that every C extension would need a substantial rewrite (Cython ones possibly excepted). It's too bad, because today I suspect someone could put a Python parser on the V8 JavaScript engine and end up with quite a fast implementation.

    ReplyDelete
  7. Great post, thanks.

    The absence of a "production level" JIT in Python is really very sad.
    It leaves Python as just a "good old scripting" language, nothing more.

    ReplyDelete
  8. I'm a vicarious fan of JIT compilation, in particular to support expressive and dynamic languages like Ruby, JavaScript, Python, etc. I'm really impressed with LuaJIT and would like to see other implementations converge on this type of performance (how fast is MacRuby, and does it leverage LLVM?)

    http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=luajit&lang2=pypy

    ReplyDelete
  9. @Carl Thanks, fixed.

    @Collin Also good points.

    @Mark: I'm mostly curious what happened to Rubinius, since Evan Phoenix was working on that around the same time we were doing our thing. Looks like he's moved from EngineYard to Lockdown. =/

    ReplyDelete
  10. @Reid Here is an update of Rubinius:
    http://rubini.us/2011/02/17/rubinius-what-s-next/

    You can follow development on github: https://github.com/evanphx/rubinius

    ReplyDelete
  11. I think you should have looked around for a different advisor at MIT. Lack of novelty needn't have been a constraint. There are several publication paths for work on dynamic language work. But the dynamic and static analysis guys are on different floors at Stata and don't talk to each other much.

    I'm curious why accesses to the Python stack were going through memory at all? Stack-based IRs can/should be converted to register-based implementations. No one uses an in-memory stack for JVM implementations, for instance.

    ReplyDelete
  12. @Scott

    I spent a fair amount of time looking for an advisor, but I couldn't really turn up anything.

    w.r.t. the stack accesses, the main reason to keep them was to maintain compatibility with the interpreter loop, so that should a guard fail it was easy to transfer control out of the machine code. The idea was that if we could teach LLVM enough about the frame aliasing rules, it would remove unnecessary stack traffic. That was a bit of wishful thinking, though.

    ReplyDelete
  13. This is real-world feedback of the most valuable kind, and well-written too.

    It would be easy for someone to take a superficial view of the situation and simply write Unladen off as a "failed project", but its goals were attractive enough to recruit broad support from the developer community. As you and Collin have shown, benefits have come out of the attempt, not only to Python but to LLVM (and hence other projects) as well. Too often people only report on their successes, but this kind of analysis is helpful in demonstrating (among other things) that rational design decisions are difficult and don't always lead to working designs.

    Thanks to everyone who worked on the fork.

    ReplyDelete
  14. Thanks for posting it.
    I'm a Korean.
    and here, I have been looking forward to a release of Unladen Swallow for years. Oh, it's sad, though.
    You did a really good job, and I don't think it's a kind of failure, because you inspired developers around the world, including me (although I'm not as good a programmer as you...).

    ReplyDelete
  15. This comment has been removed by a blog administrator.

    ReplyDelete
  16. Great debriefing. Many good things came of the effort, even if it did not include the intended.

    Are there any microcode machines for LLVM?

    ReplyDelete
  17. May I ask: which ways did your potential customers find to solve their performance problems?

    ReplyDelete
  18. Excellent pieces. Keep posting this kind of information on your blog. I was really impressed by your blog.

    ReplyDelete
  19. I actually think that the Python parser on V8 is worth looking at.

    ReplyDelete