Semi-tangential rant: I went to grad school for compilers, and yet I've been struggling for several months to get started building something real-world. I can build interpreters and parsers in my sleep at this point, but the crucial next step nobody talks about is having a deep knowledge and understanding of the target language that your compiler will be emitting. For a conventional native ahead-of-time compiler, that means knowing x86, the ELF format, OS loading, and more. I suppose this is why toolkits like LLVM are so popular :/

One way around this is to divide the problem: compile to an intermediate format that is closer to what you want your final output to be. FORTH is a reasonable example, and P-code is another (https://en.wikipedia.org/wiki/P-code_machine). Then you can get all the "easy" stuff into a stable state with a runtime library and work your way back to a native code generator later. If you can get that working, you can probably jump to something like assembler output, using as or nasm to produce a "real" executable.
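
To make the "divide the problem" idea concrete, here's a rough sketch in Python. Everything in it is made up for illustration: the opcode names (PUSH/ADD/SUB/MUL), the nested-tuple AST, and the function names are my assumptions, not real P-code or FORTH. The point is only that one small front end can feed an interpreter now and a text-assembly emitter later.

```python
# Toy stack-machine IR. The opcode names (PUSH/ADD/SUB/MUL) are hypothetical,
# chosen for illustration; this is not real P-code or FORTH.

OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL"}


def compile_expr(node):
    """Flatten a nested-tuple AST such as ("+", 1, ("*", 2, 3)) into stack code."""
    if isinstance(node, int):
        return [("PUSH", node)]
    op, left, right = node
    # Post-order walk: emit both operands first, then the operator.
    return compile_expr(left) + compile_expr(right) + [(OPCODES[op], None)]


def run(code):
    """Backend 1: interpret the stack code directly."""
    stack = []
    for opcode, arg in code:
        if opcode == "PUSH":
            stack.append(arg)
            continue
        right = stack.pop()
        left = stack.pop()
        if opcode == "ADD":
            stack.append(left + right)
        elif opcode == "SUB":
            stack.append(left - right)
        elif opcode == "MUL":
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return stack.pop()


def emit_nasm(code):
    """Backend 2: translate the same stack code to x86-64 NASM-style text.

    This is only a fragment: no sections, entry point, or exit sequence.
    """
    x86_op = {"ADD": "add", "SUB": "sub", "MUL": "imul"}
    lines = []
    for opcode, arg in code:
        if opcode == "PUSH":
            lines.append(f"push {arg}")
        else:
            # Right operand is on top of the stack, so it is popped first.
            lines += ["pop rbx", "pop rax", f"{x86_op[opcode]} rax, rbx", "push rax"]
    return lines


program = compile_expr(("+", 1, ("*", 2, 3)))
print(run(program))
print("\n".join(emit_nasm(program)))
```

The interpreter is the "easy" part you can stabilize first; the emitter is deliberately naive (it shuffles everything through the hardware stack), and you'd still need to wrap its output in a proper program before as or nasm would give you an executable.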

I got about halfway there with https://github.com/arlaneenalra/insomniac and have been stuck looking for time to get back to it.

Thanks, particularly for the tip about P-code. Are you aware of any interpreters/assemblers for P-code? The Wikipedia page didn't seem to have any links.

Well, since we are talking about P-code and all...

I recommend you look at Oberon. I've avoided the Wirth family of languages my whole life, but have been messing with Oberon off and on since winter, because it's interesting from a security standpoint. (It was supposed to be my fun Advent hacking project, but work changes and living changes—i.e., moving—caused interference.) Project Oberon is interesting because it involves a language, a system, and a machine, from the ground up, all created from scratch.[1]

Often, when trying to dive deep on some concept, the available literature can get you rolling with a toy (e.g., compilers), but it helps you reach only a facile understanding, and punts on everything around it. You'll be aware of this; I'm pretty sure it's what you're referring to in your comment above. That's mostly avoided with Oberon, because it's a full-fledged toolchain for quasi-real-world use—at least it was in production use at ETH Zurich.

There are some gotchas with Oberon. The first mostly comes down to a lot of vague, hypey comments written by people who haven't dived deep and don't have the level of understanding that their comments suggest. There are numerous examples. I could write those up, but here's one: "It was all done without resorting to assembly anywhere." Then you go look into it, and that's because there's no assembler; what you find instead is inline snippets of hex-encoded machine code and other binary blobs.

The second big gotcha is that Wirth & Co have produced volumes of (what looks like high-quality) literature, but a bunch of it is either out of date, only superficially helpful, poorly written, or contains errors. For example, "Oberon" refers to so many things—including systems and languages that Wirth had nothing to do with and probably should have never been allowed to bear the name—that it makes jwz's old Java rant[2] seem quaint. (Try starting out at the Wikipedia page and making sense of Oberon's evolution or mapping out the family tree, then try referring to primary and secondary sources directly that might clear things up. Good luck.)

I began with a fresh notebook for taking notes and keeping track of errata in the published stuff. I quit keeping track of errata after two days and several chapters, because it was too much. If you're interested, I highly recommend just running a system image from Peter De Wachter's Norebo[3] and using his emulator for Wirth's RISC machine[4]. Familiarize yourself with the basics of how to use Oberon-the-system by playing with it for an hour or so, then crack open the source and just study it directly. Cross-reference Wirth's publications if you want (they're all online), but assume that they're lying about something. I can also share my notes. Stay away from the mailing list; it's populated by USENET-style cranks, and it isn't really an essential component of Oberon development. There's not really a community: Wirth pretty much does his own thing, never posts there, and just does a source dump through his personal website.

Having said all that, studying Oberon won't impart all the knowledge you're looking for. It's in this weird place where it's more than a toy, but it really doesn't directly resemble any of the real-world systems that you're actually interested in. (Which most likely means UNIX; let's just be honest.) But it's probably the kind of stepping stone you need.

So the best resources on ELF I know of are the articles written by Eric Youngdale for Linux Journal[5][6], from back in the 90s when vendors were adopting ELF for the first time and he wrote the Linux implementation. I believe this to be the highest quality treatment of the subject that exists (at least as of a few years ago when I was interested in studying this kind of thing).
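
If you want a feel for what's in an ELF file before reading those, here's a minimal sketch in Python that decodes just the first few header fields. The function name and the returned dictionary keys are my own; the offsets and values are the ones I'm sure of from the ELF header layout (magic bytes, class, byte order, then `e_type` and `e_machine` at offsets 16 and 18). It ignores everything after that, so treat it as a peek, not a parser.

```python
import struct


def read_elf_header(data: bytes) -> dict:
    """Decode a few leading fields of an ELF header (hypothetical helper)."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")

    # e_ident[4] is the file class, e_ident[5] is the data encoding.
    elf_class = {1: "ELF32", 2: "ELF64"}.get(data[4], "unknown")
    byte_order = {1: "<", 2: ">"}.get(data[5])
    if byte_order is None:
        raise ValueError("unknown data encoding")

    # e_type and e_machine are two 16-bit fields right after the
    # 16-byte e_ident array, in the byte order declared above.
    e_type, e_machine = struct.unpack_from(byte_order + "HH", data, 16)
    return {
        "class": elf_class,
        "endian": "little" if byte_order == "<" else "big",
        "type": e_type,        # e.g. 1 = relocatable, 2 = executable, 3 = shared object
        "machine": e_machine,  # e.g. 3 = x86, 62 = x86-64
    }
```

On a Linux box you could feed it the first bytes of any binary, for example `read_elf_header(open("/bin/ls", "rb").read(64))`, and compare the result with what `readelf -h` reports.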

Hope it helps.

1. https://issuu.com/xcelljournal/docs/xcell_journal_issue_91/3...

2. https://www.jwz.org/doc/java.html

3. https://github.com/pdewacht/project-norebo

4. https://github.com/pdewacht/oberon-risc-emu

5. http://www.linuxjournal.com/article/1059

6. http://www.linuxjournal.com/article/1060

EDIT: I forgot the P-code tie-in! P-code is tangentially related to Oberon because it was developed to port/run (a dialect of) Pascal, one of Oberon's predecessors and Wirth's main claim to fame. Here's a sort-of P-code interpreter for Oberon—it actually runs a (fairly capable) subset of the RISC ISA that Oberon proper targets:

https://www.inf.ethz.ch/personal/wirth/CompilerConstruction/...

(Yes, that's the entire implementation. You'll need a compiler for it though. That can be found in the parent directory. To see what a more "fortified" implementation looks like, this one written in C, look at Peter De Wachter's emulator, already mentioned above.)