<H1>Loop invariant code motion</H1> 

Recently, the jit-unroll-loops branch was merged. It implements the
idea described in
<a href="http://morepypy.blogspot.com/2010/09/using-escape-analysis-across-loop.html">Using Escape Analysis Across Loop Boundaries for Specialization</a>.
That post only talks about virtuals, but the idea turned out
to be more far-reaching. After the metainterpreter produces a trace,
several optimizations are applied to it before it is turned
into binary code. Removing allocations is only one of them. There are,
for instance, also
<ul>
<li> Heap optimizations that remove memory accesses by reusing results
  previously read from or written to the same location.
<li> Reuse of the results of pure operations when the same pure
  operation is executed twice.
<li> Removal of redundant guards.
<li> ...
</ul>
A lot of these optimizations are in one way or another removing
operations from the trace and/or reusing previous results. All of these
optimizations could benefit from being able to operate across loop
boundaries: not only in the sense that operations on loop
invariants could be moved out of the loop entirely, but also in that
results produced at the end of one iteration could be reused at the
beginning of the next, even when no loop invariants are involved.
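The second kind of optimization listed above, reusing the results of
pure operations, can be sketched as a simple value-numbering pass over
a toy trace. This is an illustrative sketch only, not the actual
optimizer code; the operation names merely mimic those in the traces
shown later in this post:

```python
# operation names treated as pure (side-effect free), so their
# results only depend on their arguments
PURE_OPS = {'int_add', 'int_mul', 'getfield_gc_pure'}

def reuse_pure_results(trace):
    """Drop a pure operation when the same operation has already been
    executed on the same arguments, reusing the earlier result."""
    seen = {}     # (op, args) -> result box of the first occurrence
    rename = {}   # removed result box -> the box that replaces it
    out = []
    for res, op, args in trace:
        args = tuple(rename.get(a, a) for a in args)
        key = (op, args)
        if op in PURE_OPS and key in seen:
            rename[res] = seen[key]   # reuse, emit nothing
        else:
            seen[key] = res
            out.append((res, op, args))
    return out

# 'i2' duplicates 'i1', so it is dropped and 'i3' is rewritten:
optimized = reuse_pure_results([
    ('i1', 'int_add', ('i0', 'Const(1)')),
    ('i2', 'int_add', ('i0', 'Const(1)')),
    ('i3', 'int_mul', ('i2', 'i2')),
])
```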

<p>

This is achieved by unrolling the trace into two iterations, and
letting the optimizer work on this two-iteration trace.
The optimizer can optimize the second iteration more than the
first, since it can reuse results from the first iteration. We call the
optimized version of the first iteration the <em>preamble</em> and the
optimized version of the second iteration the <em>loop</em>. The
preamble ends with a jump to the loop, while the loop ends
with a jump to itself. This means that the preamble is executed
once, for the first iteration, while the loop is executed for all
following iterations.
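At the source level, the effect of this unrolling can be pictured as
peeling off the first iteration. The following is a hand-written
Python analogy of the transformation, not what the JIT literally
emits:

```python
def original(n):
    total = 0
    while n > 0:
        n -= 1
        total += n
    return total

def peeled(n):
    # semantically equivalent to original(), with the first
    # iteration peeled off
    total = 0
    if n > 0:            # "preamble": the first iteration, where
        n -= 1           # guards and unboxing would be done once
        total += n
        while n > 0:     # "loop": the optimized copy that jumps
            n -= 1       # back to itself
            total += n
    return total
```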
 
<p>

<h2>Sqrt example</h2>
Here is an example of a Python implementation of sqrt using a fairly
simple algorithm:

<p>
<!-- pygmentize -f html -O full -o t.html t.py -->
  <style type="text/css">
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
pre { line-height: 125%; }
body .hll { background-color: #ffffcc }
body  { background: #f8f8f8; }
body .c { color: #408080; font-style: italic } /* Comment */
body .err { border: 1px solid #FF0000 } /* Error */
body .k { color: #008000; font-weight: bold } /* Keyword */
body .o { color: #666666 } /* Operator */
body .cm { color: #408080; font-style: italic } /* Comment.Multiline */
body .cp { color: #BC7A00 } /* Comment.Preproc */
body .c1 { color: #408080; font-style: italic } /* Comment.Single */
body .cs { color: #408080; font-style: italic } /* Comment.Special */
body .gd { color: #A00000 } /* Generic.Deleted */
body .ge { font-style: italic } /* Generic.Emph */
body .gr { color: #FF0000 } /* Generic.Error */
body .gh { color: #000080; font-weight: bold } /* Generic.Heading */
body .gi { color: #00A000 } /* Generic.Inserted */
body .go { color: #808080 } /* Generic.Output */
body .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
body .gs { font-weight: bold } /* Generic.Strong */
body .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
body .gt { color: #0040D0 } /* Generic.Traceback */
body .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
body .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
body .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
body .kp { color: #008000 } /* Keyword.Pseudo */
body .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
body .kt { color: #B00040 } /* Keyword.Type */
body .m { color: #666666 } /* Literal.Number */
body .s { color: #BA2121 } /* Literal.String */
body .na { color: #7D9029 } /* Name.Attribute */
body .nb { color: #008000 } /* Name.Builtin */
body .nc { color: #0000FF; font-weight: bold } /* Name.Class */
body .no { color: #880000 } /* Name.Constant */
body .nd { color: #AA22FF } /* Name.Decorator */
body .ni { color: #999999; font-weight: bold } /* Name.Entity */
body .ne { color: #D2413A; font-weight: bold } /* Name.Exception */
body .nf { color: #0000FF } /* Name.Function */
body .nl { color: #A0A000 } /* Name.Label */
body .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
body .nt { color: #008000; font-weight: bold } /* Name.Tag */
body .nv { color: #19177C } /* Name.Variable */
body .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
body .w { color: #bbbbbb } /* Text.Whitespace */
body .mf { color: #666666 } /* Literal.Number.Float */
body .mh { color: #666666 } /* Literal.Number.Hex */
body .mi { color: #666666 } /* Literal.Number.Integer */
body .mo { color: #666666 } /* Literal.Number.Oct */
body .sb { color: #BA2121 } /* Literal.String.Backtick */
body .sc { color: #BA2121 } /* Literal.String.Char */
body .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
body .s2 { color: #BA2121 } /* Literal.String.Double */
body .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
body .sh { color: #BA2121 } /* Literal.String.Heredoc */
body .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
body .sx { color: #008000 } /* Literal.String.Other */
body .sr { color: #BB6688 } /* Literal.String.Regex */
body .s1 { color: #BA2121 } /* Literal.String.Single */
body .ss { color: #19177C } /* Literal.String.Symbol */
body .bp { color: #008000 } /* Name.Builtin.Pseudo */
body .vc { color: #19177C } /* Name.Variable.Class */
body .vg { color: #19177C } /* Name.Variable.Global */
body .vi { color: #19177C } /* Name.Variable.Instance */
body .il { color: #666666 } /* Literal.Number.Integer.Long */

  </style>

<div class="highlight"><pre><span class="k">def</span> <span class="nf">sqrt</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">10000</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">y</span> <span class="o">/</span> <span class="mi">2</span>
    <span class="k">while</span> <span class="n">n</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">n</span> <span class="o">-=</span> <span class="mi">1</span>
        <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">/</span><span class="n">x</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span>
    <span class="k">return</span> <span class="n">x</span>
</pre></div>
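As a quick sanity check, here is the same function run on plain
floats; Newton's method converges long before the 10000 iterations
are up:

```python
def sqrt(y, n=10000):
    # Newton's method: x converges to the square root of y
    x = y / 2
    while n > 0:
        n -= 1
        x = (x + y / x) / 2
    return x

# sqrt(1234.0) agrees with the exact square root to machine precision
approx = sqrt(1234.0)
```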
<p>

If it is called with <tt>sqrt(1234.0)</tt>,
<a href="noopt.txt">a fairly long trace</a> is produced. From this trace
the optimizer creates the
following preamble (Loop 1) and loop (Loop 0):

<p><img src="trace1.png"><p>

Looking at the preamble, it starts by making sure that the program is
not currently being profiled (the guard
on <tt>i5</tt>) and that the function object has not been changed
since the trace was made (the guard on <tt>p3</tt>). Somewhat
intermixed with that, the
integer variable <tt>n</tt> is unboxed, by making sure <tt>p11</tt>
points to an integer object and reading the integer value out of
that object.
These operations are not needed in the
loop (and have been removed from it): emitting the same guards again
would be redundant, and <tt>n</tt> becomes a virtual before the
end of the preamble.
<pre>
        guard_value(i5, 0, descr=&lt;Guard6&gt;) 
        guard_nonnull_class(p11, ConstClass(W_IntObject), descr=&lt;Guard7&gt;) 
        guard_value(p3, ConstPtr(ptr15), descr=&lt;Guard8&gt;) 
        i16 = getfield_gc_pure(p11, descr=&lt;W_IntObject.inst_intval&gt;)
</pre>

Next comes a test and a guard implementing the while statement,
followed by the decrementing of <tt>n</tt>. These operations appear
both in the preamble and in the loop:
<pre>
        i18 = int_gt(i16, 0)
        guard_true(i18, descr=&lt;Guard9&gt;) 
        i20 = int_sub(i16, 1)
</pre>

After that, the two floating point variables <tt>x</tt> and <tt>y</tt>
are unboxed. Again, this is only needed in the preamble. Note how the
unboxed value of <tt>y</tt>, called <tt>f23</tt>, is passed unchanged
from the preamble to the loop in the arguments of the jump,
allowing it to be reused. It will not become a virtual
since it is never changed within the loop:
<pre>
        guard_nonnull_class(p12, 17652552, descr=&lt;Guard10&gt;) 
        guard_nonnull_class(p10, 17652552, descr=&lt;Guard11&gt;) 
        f23 = getfield_gc_pure(p10, descr=&lt;W_FloatObject.inst_floatval&gt;)
        f24 = getfield_gc_pure(p12, descr=&lt;W_FloatObject.inst_floatval&gt;)
</pre>

Following that are the actual calculations performed in the loop, in
the form of floating point operations (since the function was called with
a float argument). These appear in both the preamble
and the loop:
<pre>
        i26 = float_eq(f24, 0.000000)
        guard_false(i26, descr=&lt;Guard12&gt;) 
        f27 = float_truediv(f23, f24)
        f28 = float_add(f24, f27)
        f30 = float_truediv(f28, 2.000000)
</pre>

Finally there are some tests checking whether a signal was received
(such as the user pressing ctrl-C), in which case a
signal handler should be executed, or whether we need to hand over to
another thread. This is
implemented with a counter that is decreased once every iteration. It
goes below zero after some specific number of iterations, tunable with
<tt>sys.setcheckinterval</tt>. The counter is read from and written to
a global location, where it can also be made negative by a C-level
signal handler:
<pre>
        i32 = getfield_raw(32479328, descr=&lt;pypysig_long_struct.c_value&gt;)
        i34 = int_sub(i32, 2)
        setfield_raw(32479328, i34, descr=&lt;pypysig_long_struct.c_value&gt;)
        i36 = int_lt(i34, 0)
        guard_false(i36, descr=&lt;Guard13&gt;) 
        jump(p0, p1, p2, p4, p10, i20, f30, f23, descr=&lt;Loop0&gt;)
</pre>
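The mechanism behind these three raw-memory operations can be modelled
in pure Python. This is only an illustrative sketch of the counter, not
the actual RPython code; the names and the decrement step are my own
simplifications (the trace above subtracts 2 per iteration):

```python
class ActionCounter:
    """Toy model of the check-interval counter polled by each loop
    iteration; a real signal handler could also force it negative."""

    def __init__(self, check_interval=100):
        self.check_interval = check_interval
        self.value = check_interval
        self.fired = 0

    def tick(self):
        self.value -= 1          # getfield_raw / int_sub / setfield_raw
        if self.value < 0:       # here the guard_false(i36) fails
            self.fired += 1      # pending signal/thread actions would run
            self.value = self.check_interval
```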

<H1>Bridges</H1>

When a guard fails often enough, the metainterpreter is started again
to produce a new trace starting at the failing guard. The tracing
continues until a previously compiled loop is entered. This could
either be the same loop that contains the failing guard
or some completely different loop. If it is the same loop, executing
the preamble again may be unnecessary, and
it is preferable to end the bridge with a jump directly to
the loop. To achieve this, the optimizer tries to produce <i>short
  preambles</i> that are inlined at the end of bridges, allowing
them to jump directly to the loop. Inlining is better than jumping to
a common preamble because most of the inlined short preamble can
typically be removed again by the optimizer.
Creating such a short
preamble is however not always possible. Bridges jumping to loops for which
no short preamble can be generated have to end with a jump to the
full preamble instead.

<p>

The short preamble is created by comparing the operations in the
preamble with the operations in the loop. The
operations that are in the preamble but not in the loop
are moved to the short preamble, whenever it is safe to move them to
the front of the remaining operations. In other words, the full preamble
is equivalent to the short preamble followed by one iteration of the
loop.
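In its simplest form, this comparison can be sketched as follows. This
is a toy model only: real trace operations carry result boxes, and the
safety check for reordering is much more subtle than shown here:

```python
def make_short_preamble(preamble_ops, loop_ops):
    """Operations present in the preamble but not in the loop are the
    candidates for the short preamble, kept in their original order."""
    loop_set = set(loop_ops)
    return [op for op in preamble_ops if op not in loop_set]

# the guards and unboxing survive; the ops shared with the loop do not
preamble = ['guard_value(i5)', 'getfield(p11)',
            'int_gt(i16)', 'int_sub(i16)']
loop = ['int_gt(i16)', 'int_sub(i16)']
short = make_short_preamble(preamble, loop)
```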

<p>

This much has currently been implemented. To give the full picture
here, there are two more features that
hopefully will be implemented in the near future.
The first is to replace the full preamble, used by the interpreter
when it reaches a compiled loop, with the short preamble.
This is currently not done and is probably not as straightforward as
it might first seem. The problem is where to resume interpreting on a
guard failure. However, implementing it should save some
memory. Not only
because the preamble will become smaller, but mainly because the
guards will appear either in the loop or in the preamble, but not
in both (as they do now). That means there will only be a single bridge,
and not potentially two copies, once the guards are traced.

<p>

With a short preamble, the sqrt example above would result in a trace
like this:

<p><img src="trace2.png"><p>

If it is executed long enough, the last guard will be traced to form a
bridge. The trace inherits the virtuals from its parent. This can
be used to optimize away the part of the inlined short preamble
that deals with virtuals. The resulting bridge should look
something like:

<pre>
    [p0, p1, p2, p3, p4, f5, i6]
    i7 = force_token()
    setfield_gc(p1, i7, descr=&lt;PyFrame.vable_token&gt;)
    call_may_force(ConstClass(action_dispatcher), p0, p1, descr=&lt;VoidCallDescr&gt;)
    guard_not_forced(, descr=&lt;Guard19&gt;) 
    guard_no_exception(, descr=&lt;Guard20&gt;) 

    guard_nonnull_class(p4, 17674024, descr=&lt;Guard21&gt;) 
    f52 = getfield_gc_pure(p4, descr=&lt;W_FloatObject.inst_floatval&gt;)
    jump(p1, p0, p2, p3, p4, i38, f53, f52, descr=&lt;Loop0&gt;)
</pre>

Here the first paragraph comes from the traced bridge and the second
is what remains of the short preamble after optimization. The
box <tt>p4</tt> is
not a virtual (it contains a pointer to <tt>y</tt>, which is never
changed), and a bridge only inherits virtuals
from its parent. This is why the last two
operations currently cannot be removed.

<!--
To remove the remaining
parts of the short preamble a lot more of the optimizer state would
have to be saved in every guard. However it should be possible to
recreate this state from the short preamble. To do that, the bridge
have to know which of it's input boxes corresponds to which of the
output boxes (arguments of the last jump) of the short preamble. One
idea of how to store this information is to introduce some
VFromStartValue virtuals that would be some pseudo virtuals containing
a single
input argument box and it's index. Using this the
infrastructure in place for storing information about virtuals can be
used to also store this index.

XXX again, a lot of details from the code
-->

<p>

Each time the short preamble is inlined, a new copy of each of its
guards is generated. Typically the short preamble is inlined in
several places, so there will be several copies of each of those
guards.
If they fail often enough, bridges
from them will be traced (as with all guards). But since there
typically are several copies of each guard, the same bridge
would be generated in
several places. To prevent this, mini-bridges from the inlined guards
are produced already during the inlining. These mini-bridges contain
nothing but a jump to the preamble.

<p>
The mini-bridges need the arguments of the preamble to be able
to jump to it. These arguments contain, among other things, boxed
versions of the
variables <tt>x</tt> and <tt>y</tt>. Those variables are virtuals in
the loop, and therefore have to be allocated. Currently those allocations
are placed in front of the inlined guard. Moving those allocations into
the mini-bridges is the second feature that
hopefully will be implemented in the near future.
<!--
The current approach actually kills the entire benefit of the inlining in most
real world cases as typically all the virtuals are forced.
-->
After this feature is
implemented, the result should look something like
<p><img src="trace3.png"><p>


<h2>Multiple specialized versions</h2>

Floating point operations were generated in the trace above
because <tt>sqrt</tt> was called with a float argument. If it is
instead called with an int argument, integer operations will be generated.
The somewhat more complex situation is when both ints and floats are
used as arguments. Then the jit needs to generate multiple versions of
the same loop, specialized in different ways. The details of how this
is achieved, given below, are somewhat involved. The casual
reader may well skip to the next section here.

<p>

Consider the case where <tt>sqrt</tt> is first called with a float
argument (but with <tt>n</tt> small enough not to generate the
bridge). Then the trace shown above will be
generated. If <tt>sqrt</tt> is now called with an int argument, the
guard in the preamble testing that the type of the input object is float
will fail:
<pre>
        guard_nonnull_class(p12, 17652552, descr=&lt;Guard10&gt;) 
</pre>
It will fail every iteration, so soon enough a bridge will be
generated from this guard in the preamble. This bridge will end with a
jump to the same loop, and the optimizer will try to inline
the short preamble at the end of it. This will however fail,
since there are now two guards on <tt>p12</tt>: one that makes sure it
is an int and one that makes sure it is a float. The optimizer
will detect that the second guard will always fail and mark the bridge
as invalid. Invalid loops are not passed on to the backend for
compilation.
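Detecting such an always-failing guard can be sketched like this. It is
a toy model of the class-guard bookkeeping, not the optimizer's actual
data structures:

```python
def bridge_is_valid(class_guards):
    """class_guards: (box, required_class) pairs collected while
    inlining.  Two different classes required for the same box mean
    one of the guards always fails, so the bridge is invalid."""
    required = {}
    for box, cls in class_guards:
        if required.setdefault(box, cls) != cls:
            return False
    return True

# the int bridge meets the float guard of the short preamble:
invalid = not bridge_is_valid([('p12', 'W_IntObject'),
                               ('p12', 'W_FloatObject')])
```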

<p>

If a loop is detected to be invalid while inlining the short preamble,
the metainterpreter will continue to trace for yet another
iteration of the loop. This new trace can be compiled as above and
will produce a new loop with a new preamble that is now specialized
for int arguments instead of float arguments. The bridge that
previously became invalid will then be tried again, this time inlining
the short preamble of the new loop instead. This produces a set of
traces connected like this:

<p>
<a href="trace4mag.png"><img src="trace4.png"></a>
<br>(click for some hairy details)
<p>

The height of the boxes in this figure represents how many instructions
they contain (presuming the missing features from the previous section
are implemented). Loop 0 is specialized for floats, and its preamble has
been split into two boxes at the failing guard. Loop 2 is specialized
for ints and is larger than Loop 0. This is mainly because the integer
division in python does not map to the integer division of the
machine, but has to be implemented with several instructions (integer
division in python truncates its result towards minus
infinity, while the machine integer division truncates towards
0). Also, the height of the bridge is about the same as the height of
Loop 2. This is because it contains a full iteration of the loop.
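The division difference mentioned above is easy to demonstrate on a
negative operand (the C behaviour noted in the comments follows the
C99 rule of truncation toward zero, which is also what the machine
instruction does):

```python
# Python's integer division rounds toward minus infinity:
q_python = -7 // 2      # evaluates to -4
r_python = -7 % 2       # evaluates to 1

# In C99 (and in the machine instruction), -7 / 2 == -3 and
# -7 % 2 == -1, so the JIT has to emit a few extra instructions
# to turn the machine result into the Python result.
```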

<p>

<h2>A More Advanced Example</h2>

Let's conclude with an example that is a bit more advanced, where this unrolling
approach actually outperforms the previous approach. Consider
making a
<a href="http://en.wikipedia.org/wiki/Fixed-point_arithmetic">fixed-point</a>
implementation of the square root using 16 bits for the fractional
part. This can be
done using the same implementation
of <tt>sqrt</tt>, but calling it with an object of a class representing
such fixed-point real numbers:

<p>
<div class="highlight"><pre><span class="k">class</span> <span class="nc">Fix16</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">val</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">val</span><span class="p">,</span> <span class="n">Fix16</span><span class="p">):</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">val</span> <span class="o">=</span> <span class="n">val</span><span class="o">.</span><span class="n">val</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">scale</span><span class="p">:</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">val</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">val</span> <span class="o">*</span> <span class="mi">2</span><span class="o">**</span><span class="mi">16</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">val</span> <span class="o">=</span> <span class="n">val</span>

    <span class="k">def</span> <span class="nf">__add__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span>
        <span class="k">return</span>  <span class="n">Fix16</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">val</span> <span class="o">+</span> <span class="n">Fix16</span><span class="p">(</span><span class="n">other</span><span class="p">)</span><span class="o">.</span><span class="n">val</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__sub__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span>
        <span class="k">return</span>  <span class="n">Fix16</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">val</span> <span class="o">-</span> <span class="n">Fix16</span><span class="p">(</span><span class="n">other</span><span class="p">)</span><span class="o">.</span><span class="n">val</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__mul__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span>
        <span class="k">return</span>  <span class="n">Fix16</span><span class="p">((</span><span class="bp">self</span><span class="o">.</span><span class="n">val</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">Fix16</span><span class="p">(</span><span class="n">other</span><span class="p">)</span><span class="o">.</span><span class="n">val</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">),</span> <span class="bp">False</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__div__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span>
        <span class="k">return</span>  <span class="n">Fix16</span><span class="p">((</span><span class="bp">self</span><span class="o">.</span><span class="n">val</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">/</span> <span class="n">Fix16</span><span class="p">(</span><span class="n">other</span><span class="p">)</span><span class="o">.</span><span class="n">val</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>
</pre></div>
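To try this class out on a modern Python 3, <tt>__div__</tt> has to
become <tt>__truediv__</tt> and the division made integral with
<tt>//</tt>. With that adaptation (mine, not part of the original code;
the <tt>to_float</tt> helper is likewise added only for checking), the
fixed-point sqrt converges as expected:

```python
class Fix16(object):
    """16.16 fixed-point number; only the operators sqrt needs."""

    def __init__(self, val, scale=True):
        if isinstance(val, Fix16):
            self.val = val.val
        else:
            self.val = int(val * 2 ** 16) if scale else val

    def __add__(self, other):
        return Fix16(self.val + Fix16(other).val, False)

    def __truediv__(self, other):   # __div__ in the Python 2 original
        return Fix16((self.val << 16) // Fix16(other).val, False)

    def to_float(self):             # helper for checking, not original
        return self.val / 2.0 ** 16

def sqrt(y, n=100):
    # same algorithm as before; 100 Newton iterations are plenty here
    x = y / 2
    while n > 0:
        n -= 1
        x = (x + y / x) / 2
    return x

result = sqrt(Fix16(1234.0))
```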

<p>

Below is a table comparing the runtime of the sqrt function above with
different argument types on different python interpreters. PyPy 1.4.1
was released before the optimizations described in this post were in place,
while they are in place in the
<a href="http://buildbot.pypy.org/nightly/trunk/pypy-c-jit-40390-e1ab35394b0f-linux64.tar.bz2">nightly
  build from January 5</a>,
denoted pypy in the table. Also shown are the running times of the same
algorithms implemented in C and compiled with "gcc -O3
-march=native". Tests were executed on a 2.53GHz Intel Core2
processor with <tt>n=100000000</tt> iterations.
Comparing the integer versions with C may be considered a
bit unfair because of the more advanced integer division operator in
python. The left part of the table shows runtimes of <tt>sqrt</tt> in
a program containing a single call to sqrt (i.e. only a single
specialized version of the loop is needed). The right part shows the
runtime of <tt>sqrt</tt> when it has been called with a different
type of argument before.

<p>

<table>
  <tr><th></th><th colspan=3>First call</th><th></th><th colspan=3>Second call</th></tr>
  <tr><th></th><th>float</th><th>int</th><th>Fix16</th><th>&nbsp;&nbsp;</th>
               <th>float</th><th>int</th><th>Fix16</th></tr>
  <tr><th>cpython</th>
    <td> 28.18 s</td>
    <td> 22.13 s</td>
    <td> 779.04 s</td>
    <td></td>
    <td> 28.07 s</td>
    <td> 22.21 s</td>
    <td> 767.03 s</td>    
  </tr>
  <tr><th>pypy 1.4.1</th>
    <td> 1.20 s</td>
    <td> 6.49 s</td>
    <td> 11.31 s</td>
    <td></td>
    <td> 1.20 s</td>
    <td> 6.54 s</td>
    <td> 11.23 s</td>
  </tr>
  <tr><th>pypy</th>
    <td> 1.20 s</td>
    <td> 6.44 s</td>
    <td> 6.78 s</td>
    <td></td>
    <td> 1.19 s</td>
    <td> 6.26 s</td>
    <td> 6.79 s</td>
  </tr>
  <tr><th>gcc</th>
    <td> 1.15 s</td>
    <td> 1.82 s</td>
    <td> 1.89 s</td>
    <td></td>
    <td> 1.15 s</td>
    <td> 1.82 s</td>
    <td> 1.89 s</td>
  </tr>
</table>

<p>

For this to work in the last case, where Fix16 is the argument type of
the second call,
the trace_limit had to be increased from its default value to prevent
the metainterpreter from aborting while tracing the second version of
the loop. Also, sys.setcheckinterval(1000000) was used to prevent the
bridge from being generated. With the bridge, the performance of the
last case is significantly worse. Maybe this is because the optimizer
currently fails to generate a short preamble for it, but the slowdown
seems too big for that to be the only explanation. Below are the runtime
numbers with checkinterval set to its default value of 100:

<table>
  <tr><th></th><th colspan=3>First call</th><th></th><th colspan=3>Second call</th></tr>
  <tr><th></th><th>float</th><th>int</th><th>Fix16</th><th>&nbsp;&nbsp;</th>
               <th>float</th><th>int</th><th>Fix16</th></tr>
  <tr><th>cpython</th>
    <td> 28.71 s</td>
    <td> 22.09 s</td>
    <td> 781.86 s</td>
    <td></td>
    <td> 28.28 s</td>
    <td> 21.92 s</td>
    <td> 761.59 s</td>
  </tr>
  <tr><th>pypy 1.4.1</th>
    <td> 1.21 s</td>
    <td> 6.48 s</td>
    <td> 11.22 s</td>
    <td></td>
    <td> 1.72 s</td>
    <td> 7.58 s</td>
    <td> 12.18 s</td>
  </tr>
  <tr><th>pypy</th>
    <td> 1.21 s</td>
    <td> 6.27 s</td>
    <td> 7.22 s</td>
    <td></td>
    <td> 1.20 s</td>
    <td> 6.29 s</td>
    <td> 90.47 s</td>
  </tr>
</table>


<h2>Conclusions</h2>
Even though we are seeing speedups in a variety of different small
benchmarks, more complicated examples are not affected much by these
optimizations. This might partly be because larger examples have longer
and more complicated loops, so allowing optimizations to operate
across loop boundaries has a smaller relative effect. Another problem is
that more complicated examples contain more bridges, and bridges
are currently not handled very well (most of the time all virtuals are
forced at the end of the bridge, as explained above). But moving those
forcings into the mini-bridges should fix that.