M3Forth is a 32-bit subroutine-threaded Forth. Its design reflects some interesting constraints, which I'll discuss below, before going into detail on how it works.
Relevant Bits of the Cortex-M3
The Cortex-M3 is a real pleasure to work with, but it has some idiosyncrasies that make it a challenging Forth target.
For example, a typical M3 part has single-cycle RAM latency even at 100MHz, but no branch prediction. This makes clever stack caching difficult, since dynamic stack caching essentially trades loads for branches. In most cases, on the M3, a load or store is cheaper than a branch by a factor of 2-ish. (Static stack caching may still be a win.)
The M3 is a 32-bit machine, and implementations typically have a fixed 4GiB address space, with Flash and RAM placed somewhere within it -- typically separated by at least 1GiB. But the maximum range of a branch instruction in the Thumb-2 instruction set is just ± 16 MiB. In other words, code placed at RAM cannot directly call code placed in Flash, and vice versa. C compilers work around this by calling indirect through a register, but for call-heavy Forth code this would make functions quite large.
Early threading techniques: indirect and direct
I began work on M3Forth by building three different strawman systems: an indirect-threaded kernel (ITC), a direct-threaded kernel (DTC), and a subroutine-threaded kernel (STC). (I also played around with token-threading but rejected it pretty quickly as a poor fit for the M3.)
I expected ITC to win out:
- DTC and STC both use the machine's branch instructions.
- Forth likes to generate code in RAM sometimes.
- A branch instruction in RAM can't reach Flash, so it wasn't clear how a colon-def (for example) would reach the implementation of its code field.
I was immediately surprised to discover that ITC was not substantially slower than traditional DTC, where each word started with a branch to some machine code. ITC requires an additional indirection compared to DTC, but this is relatively cheap on the M3 -- it was the branches that really hurt. On some benchmarks, ITC was actually faster than the simplest version of DTC! Since naïve DTC code couldn't reach Flash from RAM anyway, this seemed like a pretty strong argument in favor of ITC.
However, from my work on PropellerForth, I knew it was important to get the threading model right early on. (PropellerForth is still stuck in ITC because it would be a real pain to rework everything.) So I altered the DTC approach: instead of a branch to the code field routine, I inlined it. Thumb-2 code is nice and small, but this still cost 4 bytes on every word -- but this flavor of DTC was faster than ITC by a good margin. This version of DTC also makes implementing DOES> much easier. This is the threading model M3Forth currently (2010-09-07) uses on the 'default' branch, but it's not the fastest.
As I mentioned above, STC looked like it wouldn't work: code in RAM simply can't reach code in Flash with a call. But a sketch of a Flash-only STC Forth kernel suggested that it would be 20-30% faster than DTC! I have a hard time leaving that kind of performance on the table. Fortunately, I found a solution, described below.
Rewiring the Cortex-M3's memory map: the Farcall Reflector
The M3 uses a fixed memory map, without an MMU -- so the chip designer's decisions are more or less set in stone. On all the M3s I've surveyed, the layout is similar:
- Flash sits at the very bottom of RAM. (Most M3s let you remap the bottom section into RAM to move the vector table, which seems kind of odd, since the M3 lets you set the vector table address...but I digress.)
- The primary SRAM sits somewhere between 512MiB and 1GiB into the address space.
- Between Flash and SRAM is a large unmapped hole. Touching this hole generates a Bus Fault.
Now, a Bus Fault isn't fatal -- programs can set an exception handler to handle it. I set about trying to abuse this mechanism. Here's the approach:
- The Forth kernel pretends that the top few kilobytes of unmapped space alias Flash. That is, calling a Flash address A is equivalent to calling SRAM_BASE - FLASH_SIZE + A. I call this section of address space the reflector.
- Of course, this is a fiction: jumping into this space causes a fault. But the Bus Fault handler rewrites the jump destination to point into Flash and resumes.
This is slower than a true call. Microbenchmarks put it at 25-30 cycles, whereas a call into Flash without prefetch costs about 7, and an indirect call of the same sort appears to cost 10-12. But code in RAM can now call Flash using a single four-byte branch-with-link instruction, instead of an eight-byte sequence that loads the address into a register and jumps indirect. Because Forth code consists mostly of calls, this cuts code size by nearly 50%.
This makes STC practical on the M3, but it doesn't make it fast. While benchmarks that run entirely in RAM or entirely in Flash run about 25% faster than DTC, code that repeatedly calls from RAM to Flash is slower.
By tweaking the reflector implementation, I got a bonus I wasn't expecting. ARM calls and jumps are relative. Instead of using a fixed reflector window, M3Forth now uses a dynamic window that adjusts to the actual amount of code stored in Flash. RAM code can thus pretend that it's in Flash, right above the actual Flash code. It becomes position-independent. This makes it nearly trivial to "save" code from RAM to Flash, since it eliminates most relocations! The reflector window base is held in a register that was otherwise unused (currently r11), making the Bus Fault handler slightly faster.
Removing farcalls by inlining
A casual analysis of Forth binaries suggested that most high-level code makes heavy use of a few primitives (which live in Flash):
- The branch words, which implement all control structures and loops.
- The stack manipulation words, like DUP and SWAP.
- Memory access words, like @ and !.
The easiest way to make such code faster is to copy the primitives into RAM, but this is wasteful. Instead, I'm working on making the compiler smarter about inlining. This is working in the offline Flash compiler, bootstrap.py, and reduces code size substantially by replacing eight-byte traditional Forth branches (like 0BRANCH) with two-byte Thumb-2 instructions. So far it has literaly saved kilobytes.
The online compiler (which generates code in RAM) performs minimal inlining at the moment, but this should remove the last main speed difference between inlined-DTC and STC. Of course, code that runs out of Flash -- such as finished applications -- already performs on par with hand-coded assembly.