
\documentclass{article}
\usepackage{html,xy}
\xyoption{all}
\def\Psyco{{\bf P}syco}
\def\code#1{\texttt{#1}}
\begin{document}

{\hfill\Large \Psyco, the Python Specializing Compiler \hfill}
\bigskip

{\hfill\it Armin Rigo \hfill (October-December 2001) \hfill}
\bigskip

\noindent\Psyco\ resources and the latest version of this page:\hfill\break
\htmladdnormallink{http://homepages.ulb.ac.be/\~{}arigo/psyco/}{http://homepages.ulb.ac.be/~arigo/psyco/}.
(This document corresponds to \Psyco\ 0.3.4.)
\bigskip

This document is an overview of \Psyco, the Python Specializing Compiler. The aim of the \Psyco\ project is to show that it is possible to execute Python code at speeds approaching those of fully compiled languages. The current prototype operates on i386-compatible processors. Preliminary results show potential speed-ups by a factor of 2 to 100, depending on the kind of code, which means that we can hope for execution speeds closer to fully optimized C than to traditional Python.
More specifically, the goal of \Psyco\ is to investigate the results one could obtain by automated specialization. Indeed, besides a relatively large but mostly language-independent core, \Psyco\ is designed as a straightforward rewrite of the existing C interpreter in a higher-level language than C. The core then acts as a specializing compiler on the Python interpreter. So \Psyco\ can be thought of as a real-world experiment on the advantages that could have been derived from writing a non-trivial application like the Python interpreter in a higher-level language from the beginning. The ``higher-level language'' mentioned is currently only conceptual: it is actually C again, but written to be executed at compile-time instead of run-time. See the file \code{README.txt} included in the distribution for technical hints, as well as the SourceForge project page at \htmladdnormallink{http://sourceforge.net/projects/psyco/}{http://sourceforge.net/projects/psyco/}.

\medskip

In more practical terms, \Psyco\ is a replacement for just the innermost loop of the current CPython implementation of Python, the \code{eval\_frame} function. This means that, unlike some previous attempts, it not only can be easily integrated into the current implementation, but actually requires it in order to run. So \Psyco\ is definitely not the tool you are looking for if all you want is to get rid of the interpreter and distribute fully compiled applications without source. The name \emph{\Psyco}, or \emph{Python Specializing Compiler}, comes from the fact that the emitted machine code is specialized to specific values of the data, not only to the pseudo-code (a.k.a.\ bytecode) to interpret.

\bigskip

\Psyco\ and this document are copyrighted by Armin Rigo. \Psyco\ and all related documentation are protected by the GNU General Public License, version 2.
I donate a copy of \Psyco\ and all related documentation to the Python developers group, free of any license, for any purpose, including modification and redistribution under another license.

\section{Introduction}

An interpreter for a language runs with, as arguments, the pseudo-code to interpret and the values of the variables that this pseudo-code operates on. Following Python's convention, we will call the pseudo-code \emph{bytecode}. From this point of view, compiling means specializing the interpreter on a specific value of the bytecode; that is, after compilation, we obtain machine code specialized in simulating that specific bytecode. This makes it both faster and less general than the interpreter itself, but these are the only differences.

In an interpreter written in a language like C, there is no difference (at least none that a computer can tell) between the variables holding the bytecode to interpret and the variables holding the actual values on which the bytecode operates. Both the code and the data are at the same level. Thus, to derive a compiler from an interpreter, one introduces a differentiation into two levels: \emph{compile-time known} and \emph{run-time known}. We decide that, in the interpreter source code, all variables holding the bytecode are compile-time, and all other variables (the actual Python objects manipulated by the bytecode, in this case) are run-time. In other words, we decide that a good way to speed up the interpreter is to \emph{specialize} it to a particular value of the compile-time variables, i.e.\ to a specific bytecode.

The idea of specialization is fundamental in this document; the particular way a compiler chooses \emph{how} to specialize is not. Indeed, there is in general no best way to choose which variables are compile-time and which are run-time. Let me give a few examples of differentiations:

\begin{itemize}
\item In a usual interpreter, all variables are run-time.
\item In a traditional compiler, the source code is compile-time and everything else is run-time (with the exception of a few common optimizations, like computing arithmetic operations between constants at compile-time).
\item At the other extreme of the usual interpreter, there is the (not quite interesting) case of a compiler that would do all the work itself, and merely emit code that returns the final answer. Of course, such a compiler has to interpret everything itself!
\end{itemize}

Clearly, none of the above choices works well for Python.\footnote{The second choice, the ``compiler'', corresponds to taking Python sources (or bytecodes), looking at each pseudo-instruction in turn, and dumping the sequence of machine instructions that does the same as what the interpreter would do when encountering this pseudo-instruction. This has the effect of letting each bytecode instruction explode into a large block of machine instructions. Attempts on Python have been shown to give about twice the speed of the Python interpreter, at the expense of code size explosion.} This does not mean that no differentiation can give good results on Python; it merely shows that no \emph{classical} differentiation does, and probably that no \emph{fixed} differentiation does.

In \Psyco\ the level of each variable is not predetermined. Actually, I start by assuming that all variables are run-time, and progressively promote them to compile-time when for some reason \Psyco\ guesses that it could perform a significant optimization if it knew their value at compile-time. Of course, ``significant optimization'' is a fuzzy concept. Currently, there are a few simple heuristics, based on the idea that when what comes next depends in an essential way on the value of a variable, it would be a significant optimization to know that value at compile-time. The first thing this applies to is of course the bytecode itself: if we don't know it at compile-time, we are dead.
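The two-level differentiation can be illustrated with a short sketch in Python itself. The names below (\code{interpret}, \code{specialize}, the toy opcodes) are hypothetical illustrations, not part of \Psyco: a toy stack-machine interpreter is specialized on a compile-time bytecode, leaving a residual function in which all the dispatch on opcodes has disappeared.

```python
# Toy stack-machine interpreter and a specializer for it.
# All names here are illustrative only, not Psyco's actual API.

def interpret(bytecode, args):
    # Plain interpreter: both the bytecode and the arguments are run-time.
    stack = []
    for op, arg in bytecode:
        if op == "LOAD_CONST":
            stack.append(arg)
        elif op == "LOAD_ARG":
            stack.append(args[arg])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack.pop()

def specialize(bytecode):
    # "Compiler": the bytecode is promoted to compile-time.  The dispatch
    # on 'op' happens once, here; the residual closures collected in
    # 'steps' contain only the run-time work.
    steps = []
    for op, arg in bytecode:
        if op == "LOAD_CONST":
            steps.append(lambda stack, args, c=arg: stack.append(c))
        elif op == "LOAD_ARG":
            steps.append(lambda stack, args, i=arg: stack.append(args[i]))
        elif op == "ADD":
            def add(stack, args):
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            steps.append(add)

    def compiled(*args):
        # Residual code: no opcode dispatch remains at run-time.
        stack = []
        for step in steps:
            step(stack, args)
        return stack.pop()
    return compiled

code = [("LOAD_ARG", 0), ("LOAD_CONST", 1), ("ADD", None)]
f = specialize(code)       # all bytecode dispatch is paid once, here
assert f(41) == interpret(code, (41,)) == 42
```

The residual \code{compiled} function is to \code{interpret} what \Psyco's emitted machine code is to \code{eval\_frame()}: the same behaviour, specialized on one bytecode.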
But it also applies, say, to the type pointer inside Python objects in numerous operations which, like numeric operations, depend essentially on the type. While emitting code, \Psyco\ manages a list of all the variables, recording which ones are currently run-time or compile-time.\footnote{Could a system use more than two levels of variables, that is, produce code that itself produces code that does the job? This should be explored. I think that we currently lack good heuristics for choosing the levels. This kind of project will probably require us to go to the next step and let the system decide automatically when it should promote variables, by producing code that performs statistical measures at run-time. This is a domain that might require the whole force of projects like Tunes [\htmladdnormallink{http://tunes.org}{http://tunes.org}].}

There is actually a third category in \Psyco: \emph{virtual-time} variables, whose exact value is never physically known, neither at compile-time nor at run-time. They are typical in complex expressions, where the intermediate values would normally be stored in temporary Python objects. Actually building these objects is often not needed; all that is needed is the values that would have been put in the fields of the structure if it had been built. In this situation, the variable pointing to the non-existent structure is virtual-time.

\section{Structure of \Psyco}

Let me first recall that the classical Python implementation has a relatively small ``kernel'', or ``inner loop'': the \code{eval\_frame()} bytecode interpreter. This is a C function called whenever we must execute bytecode. The bytecode itself comes from elsewhere (typically, either from the Python source-level compiler or from a \code{.pyc} file). On the other hand, \code{eval\_frame()} also depends on the rest of the C implementation, because it constantly needs to call other functions to perform the actions required by the bytecode.
\Psyco\ is a substitute for just \code{eval\_frame()}. When called, it works its ``magic'' to optimize speed and reduce the number of outgoing calls (for example, no call to \code{PyInt\_FromLong()} is needed to build temporary integer objects), while still producing a result similar to that of \code{eval\_frame()}. Of course, some of the outgoing calls are essential and must be done as if we were running the original \code{eval\_frame()} (e.g.\ to execute a \code{print} statement).

$$\xymatrix{
 & *+++[F-,]\txt{Python C API\\and various support code} & \\
*+++[F--] \txt{Machine code\\written by \Psyco}
  \ar@<5pt>[r]^{jump} \ar@/^20pt/[ur]^{call} &
*+++[F-,] \txt{Run-time\\dispatcher}
  \ar@<5pt>[l]^{jump} \ar[r]^{call} \ar[u]^{call} &
*+++[F-,] \txt{Specializing\\compiler}
  \ar@{-->}@/^50pt/[ll]^{write} \ar@/_20pt/[ul]_{call}
}$$

\bigskip

\Psyco\ consists of three main parts (second row), only the last two of which (in solid frames) are hard-coded in C. The first part, the \emph{machine code}, is dynamically generated. This is the part that is intended to directly substitute for \code{eval\_frame()}. In other words, conceptually, when this machine code is executed, it must perform just like \code{eval\_frame()}. This machine code contains calls to the C API and to a few new supporting functions. But the most interesting point is that the machine code is not completely written in advance --- if it were, it would just be the same as \code{eval\_frame()}. Instead, when the machine code prematurely ends, it finishes with a jump to fixed code belonging to the second part of \Psyco, the \emph{run-time dispatcher}. The role of the run-time dispatcher is to perform the appropriate actions to let execution continue. If needed, it calls the compiler to emit new code. Sometimes this is not needed, because we can discover that appropriate code was already written somewhere else. When found or built, we jump to this new code and execution goes on.
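The interplay between the machine code, the dispatcher and the compiler can be sketched in a few lines of Python. This is a minimal sketch under assumed names (\code{dispatcher}, \code{compiler}, \code{compiled\_versions}), not \Psyco's internals: code already specialized on a value is found in a look-up table, and the compiler only runs on a miss.

```python
# Sketch of the run-time dispatcher's look-up table (illustrative
# names only; this is not Psyco's actual implementation).

compiled_versions = {}   # promoted value -> code specialized on it
compiler_calls = []      # record of actual compiler invocations

def compiler(value):
    # Stand-in for the specializing compiler: emits "machine code"
    # (here, just a closure) specialized on one compile-time value.
    compiler_calls.append(value)
    return lambda x: x * value

def dispatcher(value):
    # Reuse code already specialized on this value, else compile it
    # once and remember it.
    try:
        return compiled_versions[value]
    except KeyError:
        code = compiled_versions[value] = compiler(value)
        return code

dispatcher(3)(10)   # first occurrence of 3: the compiler runs
dispatcher(3)(20)   # cache hit: the compiler is not called again
dispatcher(7)(10)   # new value: a second specialized version
assert compiler_calls == [3, 7]
```

In the real system the "values" in the table include the bytecode itself and promoted data such as type pointers, and jumping to the returned code replaces the dictionary look-up on later executions when the jump target can be patched in place.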
Now let us consider the \emph{compiler} proper. It is responsible for emitting native machine code specialized over some variables. It never directly executes the emitted machine code blocks; it just emits as much code as possible and returns (usually this means it returns to the dispatcher, which will indeed jump to the just-emitted code, but this is of no concern to the compiler). The code emitted by the compiler usually ends in a jump to the dispatcher, as explained above.

Writing ``as much code as possible'' means until we are blocked --- generally by the desire to promote a run-time variable to compile-time so that compile-time optimizations become possible. In effect, run-time variables are not known before the machine code actually executes, so all we can do is write a jump to the dispatcher. Later, when that jump is actually executed, the dispatcher will know the actual value and be able to call the compiler again to specialize on this particular value. The dispatcher usually manages a look-up table caching the results of previous such compilations: encountering the same value again later will no longer invoke the compiler. In some cases we can even directly fix the original code with the final jump location, thus short-circuiting the dispatcher.

\section{Example}

Here is an example, showing step-by-step what occurs when \Psyco\ runs on the following function:\footnote{Yes, I know the example could be trivially optimized to \code{lambda n:n*(n+1)/2}, but let's pretend we don't know it.}

\begin{verbatim}
def f(n):
    x = i = 0
    while i<n:
        i = i+1
        x = x+i
    return x
\end{verbatim}

At first, only a small stub of machine code is attached to \code{f}; all it does is hand control over to the dispatcher:

\begin{verbatim}
push %ebx            ; give it to the dispatcher
push $0x401a05fc     ; internal data for the dispatcher
call 0x4019b5c0
add  $0x8,%esp       ; restore the stack
jmp  *%eax           ; jump to the sequel
\end{verbatim}

Now consider what occurs when we call this small code as, say, \code{f(2117)}.
It jumps to the dispatcher, passing it a pointer to the bytecode of \code{f} (remember, I am not concerned about how to produce bytecode; Python already does this very well). The dispatcher calls the compiler, which now knows about \code{f}. Passing through the assignments of constant values to \code{i} and \code{x}, it turns them into compile-time variables: after the assignments, \code{i} and \code{x} must point to the constant object \code{0}. No machine code is emitted for that.

Then there is the test \code{i<n}. The value of \code{n} is only known at run-time, and what comes next depends essentially on its type, so the compiler promotes the type of \code{n} to compile-time and emits code that checks it:

\begin{verbatim}
(n is in f->f_fastlocals[0])
mov 0x4(%esi),%edi   ; n->ob_type
mov %edi,%eax        ; ***
cmp $0x80c0ae0,%eax  ; is it &PyInt_Type ?
je  0x8151958        ; if yes, jump there
call 0x814969f       ; otherwise call this
                     ; (hack: the call never returns)
\end{verbatim}
%$

(The instruction marked ``***'' could be optimized out; it simply comes from the fact that the compiler tries to work as fast as possible, and sometimes a potential optimization would involve rewriting a large block of code, so it is avoided.)

In this case it is an integer, so the conditional jump is followed. It jumps to the dispatcher, calling the compiler again. (When the compiler returns, the target of the conditional jump above will be changed to point directly to the new code block.) Now that we know that \code{n} is an integer, we must still compare it with \code{0}. We just write the corresponding machine instruction. According to Python, the result of the comparison should be stored in a Python integer object, either \code{0} or \code{1}. Instead of building a new object, we mark the variable holding it as \emph{virtual-time}, i.e.\ we know that it could actually be computed if needed, just by calling \code{PyInt\_FromLong()}, depending on the outcome of the comparison --- which, remember, is still marked as being in the processor's flags register at this time.

The next instruction in the bytecode is a conditional jump (for the \code{while}).
\Psyco\ sees that the condition is in a virtual-time value that could be computed from the processor flags. The long path to the jump would be to build a value, 0 or 1, representing the outcome of the previous comparison, and then test this value. Instead of doing that, we can (in this quite common case) emit just a conditional jump instruction testing the correct flags.

If you come back to the Python source of \code{f}, you will see that \code{n} only appears here, and all other variables are compile-time. Normally, this means that everything can be computed by the compiler but the conditional jump on the value of \code{n}. Without precaution, the compiler would produce this kind of code:

\begin{verbatim}
cmp $0,%eax
jle xxxxxxxx         ; produced at the beginning
cmp $1,%eax
jle xxxxxxxx         ; produced after one loop
cmp $2,%eax
jle xxxxxxxx         ; produced after two loops
cmp $3,%eax
jle xxxxxxxx         ; and so on...
...
\end{verbatim}

The problem here is that the compiler must go on and on emitting code, because the outcome of the looping condition \code{i<n} is never known at compile-time. To avoid this unbounded unrolling, after a couple of iterations the values of \code{i} and \code{x} are made run-time, which lets the emitted code close into a genuine machine-code loop:

\begin{verbatim}
  cmp %edi,%eax        ; compare i and n
  jge 0x815a0b0        ; if i>=n stop
  push %eax
  add $0x1,%eax        ; i=i+1
  jo  0x401a05ac
  mov $0x3,%ecx        ; x=x+i, with x==3 here
  add %eax,%ecx        ; now x is run-time
  jo  0x401a05ac
@Loop:
  cmp %edi,%eax        ; compare i and n
  nop
  jge 0x815a0b0        ; if i>=n stop
  push %eax
  add $0x1,%eax        ; i=i+1
  jo  0x401a05ac
  push %ecx
  add %eax,%ecx        ; x=x+i
  jo  0x401a05ac
  sub $0xfffffff8,%esp ; cancel the two 'push'
  jmp @Loop
\end{verbatim}
%$

This example contains a lot of unnecessary \code{push} instructions. This is because no value needs to be saved between one iteration and the next, but the compiler is not smart enough to know that. Also, in this case, the values of \code{i} and \code{x} happen to be in the same registers at the beginning and at the end of the loop; in general, \code{jmp} instructions like the final one are preceded by some shuffling between registers and memory to put the values exactly where they are expected by the code we are jumping to.
Finally, when the loop exits by one of the conditional jumps, \Psyco\ compiles the following code, promoting the variable \code{x} (as a pointer to a \code{PyIntObject} structure) from virtual-time to run-time:

\begin{verbatim}
push %eax            ; saved for later, just in case
push %ecx            ; idem
push %ecx            ; the integer value
call 0x8088720
sub  $-16,%esp       ; remove temporary values from the stack
ret                  ; good bye
\end{verbatim}
%$

This finally constructs the integer object in memory, and returns it. Until now, all the manipulated integer objects were purely virtual-time pointers.

A final note: if you afterwards call the same function \code{f} with other arguments, the same code will be reused. In fact, the compiler is very fast, so the real danger is rather code explosion; currently, no emitted code is ever removed from memory. This is clearly not a satisfying solution, and it will need to be addressed in some future version.

\section{The compiler}

Let us take a closer look at the compiler part of \Psyco, the most interesting one; it is the only part that fundamentally depends on the fact that we are interpreting Python and not some other language. Looking at it (file \code{python\_eval.h}) you will find a strong similarity with the code of \code{eval\_frame()} itself. This is because the compiler is best viewed as just the original interpreter in a \emph{two-staged, dynamically separated} implementation. In other words, the compiler is obtained as follows:

\begin{enumerate}
\item Take the original source code of \code{eval\_frame()}.
\item Replace every variable there by a structure in which you can say either that the variable is compile-time, in which case you give its value as usual, or run-time, in which case there is no value (but there is information on where the value will be at run-time, e.g.\ in which processor register or where in memory).
\item Before every single operation, insert a test that checks whether all the involved variables are compile-time; if they are, proceed with the operation; if they are not, the operation's outcome will only be known at run-time, so emit the machine code that will perform it at run-time.
\item Tag operations that you consider should always be done at compile-time. Switches seem to be a good choice: if we know the value on which we switch, the emitted code will be quite compact, but if we do not, we have no better solution than considering all possible cases. So before every switch, promote the necessary run-time values to compile-time. (Remember, this means returning from the compiler, which will later be called again to go on when the value is known.)
\item Optimize all this by finding values that will necessarily end up compile-time, like the Python bytecode, but also all the stack manipulation (because the bytecode is enough to know where objects are pushed to or popped from the stack, even if we don't know what these objects actually are).
\end{enumerate}

Notice that this recipe could be implemented as a computer algorithm. This means it is a priori possible to write a meta-program which, given the source code (in a suitably restricted language) of \emph{any} interpreter, would produce a fully functional specializing compiler to be inserted into \Psyco. The most interesting point lies in the remark that, after all, there is no need for the source code to be that of an interpreter. There are a number of other things that would benefit from specialization.

[XXX references. Look for the Fabius compiler. Difference: in Fabius the two stages are statically chosen, variables do not pass from compile-time to run-time or vice versa. Advantage: speed (only one or two C instructions are needed to emit a machine code instruction, no test needed).
Disadvantage: static stages]

\section{For more information}

The file \code{README.txt} included with the distribution gives a practical overview of \Psyco\ and its current state. Other resources:\hfill\break
\htmladdnormallink{http://homepages.ulb.ac.be/\~{}arigo/psyco/}{http://homepages.ulb.ac.be/~arigo/psyco/}.

\hfill\emph{Armin Rigo.}

\end{document}