16 January 2004 A tiny compiler, part 1 (Revised 18 January 2004) I'm working on a small side project — a prototype for the colorForth compiler that Karig will have. I'm writing a compiler so tiny and so simple that it will fit entirely within a boot sector. It will come with a dictionary containing a single word, which would be used to "poke" bytes of machine code into memory. It will probably be the smallest, simplest compiler you've ever seen that (1) generates real executable machine code and (2) can be extended without changing its source code. Let me explain this a bit. Forth Forth is a programming language invented by Chuck Moore around 1970. It features two stacks — a conventional stack for return addresses, and a second stack (often referred to as just "the stack") used to pass data to subroutines and to give subroutines a way to return data. (Compare this to C and C++, which place return addresses and subroutine arguments on the same stack.) Keeping data and return addresses on separate stacks means that there is no need for code that rearranges the items on the stack or "cleans up" the stack — which means that, all else being the same, a Forth program can be smaller and faster than the equivalent C or C++ program. The consequences of this removal of the need to move items around on a single stack are that the cost of a subroutine call is smaller in Forth than it is in C, and that a subroutine in Forth is slightly faster than a subroutine in C that does the same work. This means that a Forth programmer can break his programs up into small and even tiny subroutines with less worry about the consequences for performance than a C programmer. Forth thus encourages the programmer to produce modular code, and to be aggressive about removing redundancies from his programs, even to the point of producing subroutines containing only three or four instructions.
Forth provides the simplest possible syntax: You make things happen simply by stringing words together. Each word is a subroutine. If a word needs to use data on the stack, the data needs to have been put there by a previous word. Thus the code to add 3 and 5 together is not entered as a single statement such as add (3, 5); it is entered as three discrete words: Forth lets you create (or define) new words. You choose a name to be the word you're defining, and the definition of the word is a sequence of one or more other words. New words are stored in a kind of database called the dictionary. Whenever the compiler needs to execute or to compile a call to a given word, the compiler looks up the word in the dictionary to find the address of the word's definition.
Forth has a reputation for being cryptic. Part of that reputation comes from its syntax — each individual word is a command, and any random string of words in Forth is legal. There are no keywords or special characters to guide the eye and make the structure of the "sentences" plain. We. Are. Not. Used. To. Reading. Each. Word. As. If. It. Were. Independent. Of. All. Others. We are used to reading words by the clause, and we are used to seeing words in certain order — most of us are accustomed to seeing colorForth As the years went by, and more and more people began using Forth, there eventually arose an ANSI standard for Forth — a standard that spelled out all the various words that Forth ought to have, and the way all these words ought to work. Chuck Moore didn't like the complexity of ANS Forth, so he created colorForth for his own use. One special feature of colorForth is that each word in a program has a color that stands for whatever it is that the compiler is supposed to do with that word. For example, a word in red is a word that is being defined, and the words that come after are part of the definition of this new word. A word in green is a word that is to be compiled — that is, the compiler is supposed to generate a CALL instruction to the address of that word's definition. A frequent objection at this point is that some programmers are colorblind. However, that which distinguishes words being defined from words being compiled doesn't need to be color per se; it can be any kind of formatting — different fonts for example, or perhaps just different combinations of upper- and lowercase. The point is to give the word as a whole a certain look, so that a programmer can see at a glance what the compiler is going to do with the word. Chuck Moore used color instead of keywords or special characters to make his source code more compact and easier to decipher — " newword " is more compact than ": newword" or "define newword". The other special feature of colorForth is that it converts keystrokes directly into binary pseudocode (Chuck Moore calls this pre-parsed words; I call it precode). There is no need for the compiler to make sense of an ASCII text file before it can start producing machine code; the precode can be translated into machine code much more simply. ColorForth does not store ASCII source code or ready-to-run machine code; it stores blocks of this precode. Whenever a block of precode is loaded from disk, it is translated into machine code — a process that is so simple that it takes no perceptible time to complete. My compiler My prototype compiler will be even simpler than Chuck Moore's colorForth. ColorForth uses 32-bit stacks and sets up protected mode (or at least a GDT); my compiler will use 16-bit stacks and operate in real mode. ColorForth allows words of any length; my compiler will allow only two-character words. ColorForth uses six colors; my compiler will use only four. ColorForth comes with a vocabulary of words built in; mine will come with only one word in the dictionary. However, one word and four colors will be enough for the user to start adding words to the dictionary. One word The one word is "c," (pronounced "c-comma"). This is the traditional Forth word to move a byte from the stack to the end of the code space, which is just a large (32KB) buffer that contains the machine code that the compiler produces. (In other words, the executable machine code produced by the compiler consists of bytes placed there via "c,".) Four colors Each of the four colors corresponds to one of four commands that the compiler understands. The four commands are as follows:
Precode format My compiler's precode format also differs from colorForth's. A precode word is simply two 7-bit ASCII characters, with the high bits of the two characters (bits 7 and 15 of the word) used to store the color. The four bit patterns representing the colors are: DEFINE equ 0x0000 COMPILE equ 0x0080 EXECUTE equ 0x8000 EXE_HEX equ 0x8080 ; "execute hex" So the pre-parsed word to compile a word "du" would be "du" & COMPILE in NASM syntax. Dictionary format The dictionary is simply a pair of arrays, one after the other. Each array is 256 bytes long, so each has room for 128 items. The first array is reserved for two-character words; the second contains the corresponding machine-code addresses (16-bit offsets into segment zero). To look up an address, I search the first array for the word, and when I find it, I take the offset to that word and use it to retrieve the address for that word's machine code from the other array. Memory map The return stack (pointed to by SS) can reside in its own segment, but everything else — compiler, setup code, dictionary, compiler variables, and code space — must fit within segment zero (64KB). If everything is in the same segment, there will be no need for any far calls, far returns, or segment overrides. (The data segment (DS) must be the same as the code segment (CS) because the compiler stores data into the code space and then must execute it as code.) So the compiler uses two full segments (128KB) of memory. They are used as follows (note that all addresses are in hex): Segment zero (0000:0000 to 0000:FFFF) — the following are ranges of offsets within the segment:
Stack segment (1000:0000 to 1000:FFFF)
Register assignments I'll reserve AX as the register that contains the top item on the data stack (TOS), and SI as the register that points to the second item on the data stack (NOS). SP will do what it always does — point to the top item on the return stack. Obviously these registers need to be saved whenever the compiler needs to use a BIOS service (as it will when it asks the BIOS to wait until the user presses a key), and they'll need to be restored whenever the compiler executes any words in the dictionary. Coming soon I'll go into detail about how the compiler works, and how to use it, in future entries. Check the index for other entries. |