Recently, I was helping a friend analyze some RNA samples for her work. These samples are huge - like nearly a gigabyte of data. There was this tool which was recommended for the job - mirexpress. It was a small job, perhaps 10 minutes' worth of effort. To make my work easier, I provisioned a beefy (and costly) machine on Azure to do the job, took a quick look at the clock (it said 11 PM), ran the tool, and relaxed. The tool crashed while reading the file.

In an attempt to fix the bug, I opened mirexpress's code. And all my confidence in my programming ability vanished when I saw its innards. I understand that the code may have been written by scientists who had no experience in programming, but I have never been so utterly _disoriented_ by bad code. Anyways, after hacking away at the mess for about 3-4 hours, I realized that this was a fool's errand and thought I'd just phone it in the next day saying I couldn't do it. I went to sleep thinking that it was already late and I'd be late for work the next day.

- 5 minutes later -

I woke up with a start, recalling this nifty tool called awk. I had last used it maybe 3 years ago, and before that only in college. But I could see how awk could do some of the things which mirexpress was claiming to do. So I fired up my computer and wrote an awk script - 2 lines only! TWO FUCKING LINES! And it ran like a charm - ate away at megabytes of sample data and gave me results I could show. So then, like any rational person, I spent the remaining hours re-discovering awk and forgot to sleep. Pissed away the whole next day (and some part of the day after that too!) :-D

It's really fascinating that these nifty little tools invented DECADES ago are still going strong, and there's been no _evolutionary_ leap in the areas where tools like awk/grep/sed excel.

When you start looking at the code of the tools, and how older systems were designed, it really is a testament to good engineering practices because it all scales so well. Small, simple, "one task well" utilities that process data in streams mean the megabyte files of the 80's and 90's are today's terabytes. Sure, they require some work to get things scripted correctly, but it really is pretty amazing what coreutils and a shell script can do.
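To make the streaming point concrete, here's a minimal sketch in Python (not awk, and not the script from the story above) of why this style scales: the input is read one line at a time, so memory use depends on the number of distinct keys, not on the file size. The function name `count_by_field` and the sample records are made up for illustration. It's roughly the idea behind the awk idiom `NF { c[$1]++ } END { for (k in c) print k, c[k] }`.

```python
import io
from collections import Counter
from typing import Iterable


def count_by_field(lines: Iterable[str], field: int = 0) -> Counter:
    """Tally the values of one whitespace-separated field, one line at a time."""
    counts = Counter()
    for line in lines:  # only the current line is held in memory
        parts = line.split()
        if len(parts) > field:  # skip blank or too-short lines
            counts[parts[field]] += 1
    return counts


# Hypothetical sample data; a real run would pass sys.stdin or an open file.
sample = io.StringIO("ACGU 3\nUUGA 1\nACGU 7\n\n")
result = count_by_field(sample)
print(result.most_common())
```

Because the function only needs something it can iterate over, the same code works on a three-line string or a multi-gigabyte file piped in on stdin.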

You might appreciate the original implementation of awk. Good engineering everywhere.

https://github.com/RetroBSD/retrobsd/tree/master/src/cmd/awk

Updated, but still the original:

https://github.com/onetrueawk/awk