Tag: Dive Deep

The Butterfly Effect

I was privileged to have access to computers from an early age, from the humble TI-99/4A of my early elementary years to a snappy Pentium in high school (can’t remember the exact model, but it was pretty expensive; perhaps the 133MHz version?). The influence this access had on my life cannot be overstated.

Young Jud on TI-99/4A
Train up a child in the way they should go

Now that I’m firmly in middle age, and on a career path where I’m regularly evaluating technical talent, I’m reminded of that privilege, and how so many didn’t have it then, and some still don’t have it now. How much untapped potential there must be within these groups!

If we’re going to overcome the lack of diversity in tech, it starts with access; early access, when life-long perceptions are formed. As the saying goes: the best time to plant a tree is 25 years ago, but the second best time is today. Gotta get planting!

There’s Gold In Them Thar Hills

In my conversations with fellow engineers, git comes up quite a bit. I find myself regularly giving advice both tactical and strategic on its effective use. Learning it in detail is a force multiplier, but few people do. Part of the problem is that training materials are all over the map.

Which is why I was so pleased to discover Git from the inside out. Without question the best introduction to git I’ve come across. It perfectly balances teaching basic commands while also explaining what’s actually happening. Despite having used git at a fairly advanced level for 10 years, I still learned some new things, for example that each git add creates an immutable blob object that is retained for a while even if you git add the same file again, and even if you never commit it. Also that it’s pretty easy to decode raw git objects should you ever need to; here’s a script I wrote to do just that, if you’re curious.
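To give a flavor of how little is involved, here is a minimal sketch of decoding a loose object. It is not the script linked above; the function names are made up for illustration, and it assumes a repository using the default SHA-1 object format and an object that has not been moved into a packfile.

```python
import hashlib
import zlib


def decode_git_object(raw: bytes):
    """Decode one loose git object (the compressed bytes of a file under .git/objects/)."""
    data = zlib.decompress(raw)
    # Layout after decompression: b"<type> <size>\x00<body>"
    header, _, body = data.partition(b"\x00")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    # The object ID is the SHA-1 of the uncompressed header + NUL + body.
    object_id = hashlib.sha1(data).hexdigest()
    return kind.decode("ascii"), body, object_id


def encode_blob(content: bytes) -> bytes:
    """Build the compressed bytes git would store for a blob (handy for experiments)."""
    return zlib.compress(b"blob %d\x00" % len(content) + content)
```

To try it on a real repository, read a file such as .git/objects/ab/cdef... in binary mode and pass its bytes to decode_git_object; the returned ID should equal the directory name plus the file name.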

I’ve said before that abstractions are valuable, but they’re not excuses to avoid learning internals, because critical information lies beneath the surface. At the risk of pretentiously quoting myself:

When things go wrong, the engineer must descend into the particulars, and an inability to minimally reason about, if not fully grasp, what lies beneath an abstraction can prove fatal to the debugging process.

I didn’t write the above with version control in mind, but I surely could have. Engineering organizations are full of developers who get stuck the moment a git command fails. You don’t have to be that developer!

Turduckens Everywhere

Did you know you can implement a CPU in Minecraft or write arbitrary computer programs using Magic: the Gathering cards? Computing is powerfully weird, and hidden structures and capabilities can arise from all sorts of odd places.

I read a theory a while back that the Internet was, in some meaningful sense, intelligently conscious, though we may never be able to interact with such an intelligence. The analogy was to human consciousness, which arises from the cells of the brain in a way that no individual cell could ever comprehend, yet is no less real for that. The notion of Turing-complete computers being buried inside all manner of languages and tools has a similar vibe to it.

Like All Good Things

The Morning Paper is wrapping things up, as explained in its final post. A treasure trove of computer science deep dives, it will be missed. But I can certainly relate to the desire to move on, especially when a global pandemic has brought life’s various priorities into sharper focus.

Luckily for all of us there is a rich back catalog of posts that could keep one busy for months. Do check them out. Need a place to start? Here’s my absolute favorite: Applying the Universal Scalability Law to organizations.

We’re All In This Together

I went far too many years into my career before truly trying to understand networking, despite it being an increasingly common source of problems. If I could give my younger self some advice, I’d recommend taking some time to learn networking.

In that spirit I present to you How DNS Works, a “fun and colorful” comic that describes in detail the operation of the Domain Name System. Enjoy!

What The Devil’s In

AWS provides a number of fantastic managed services that make building applications quick and easy. At least for the most part. But there are plenty of interesting gotchas, and instances where the underlying details matter.

This past week I was working on an app that used the Simple Queue Service (SQS) to exchange messages between components, and I had implemented long polling to reduce the cost of repeated API calls. I’d also set a long visibility timeout because the processor took a significant amount of time to handle each message.

During the course of testing I was finding that messages were getting stuck in an “in-flight” state; given the long visibility timeout, this was causing delays in processing because the handler had to wait for the timeout to expire for these stuck messages. But I couldn’t initially figure out why the messages were getting stuck in the first place. I only had one handler thread; why were messages getting pulled in flight, but not getting processed and eventually removed?

It turns out the reason was that in the course of testing I was regularly killing off the handler with Ctrl+C and restarting it. And that interrupt signal was cutting short the long poll API call into SQS. Why did that matter? Because a long poll call fires off a process on the AWS servers that is waiting for messages to show up on the queue so it can return them. That process continues to run even if the client that initiated it dies. Thus if a message shows up on the queue after the client goes away, but before the long poll time expires, it’s taken off the queue as “in flight”, but sits there until the visibility timeout hits because there’s nothing to subsequently process and delete it.

I was unable to figure out the above until I learned more about what actually happens within AWS during an SQS long poll. Finding this thread about the Node.js client helped too (I was writing my client in Python but the behavior is common across all SDK implementations). If I’d only been able to reason at the level of the queue abstraction, I’m not sure I could have solved the problem. Once again, descending into the particulars was the path to a solution.
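One way to sidestep the problem during testing is to let the in-progress long poll finish before the handler exits. The sketch below is an assumption about how that could look, not the code from my app: the GracefulPoller class and its receive, handle, and delete callables are hypothetical stand-ins for the real SQS calls. It relies on Python retrying an interrupted blocking call when the signal handler returns without raising.

```python
import signal


class GracefulPoller:
    """Finish the in-progress long poll (and its messages) before exiting."""

    def __init__(self, receive, handle, delete):
        # receive() -> list of messages; may block for the long-poll wait time
        # handle(msg) processes one message; delete(msg) removes it from the queue
        self.receive = receive
        self.handle = handle
        self.delete = delete
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Only record the request; do not raise, so the blocking call continues.
        self.stop_requested = True

    def install_signal_handlers(self):
        # Replace the default Ctrl+C behaviour (KeyboardInterrupt) with the flag.
        signal.signal(signal.SIGINT, self.request_stop)
        signal.signal(signal.SIGTERM, self.request_stop)

    def run(self):
        while not self.stop_requested:
            # Anything the poll returns is handled and deleted before the
            # stop flag is checked again, so nothing is left in flight.
            for msg in self.receive():
                self.handle(msg)
                self.delete(msg)
```

With boto3, receive would presumably wrap receive_message with a WaitTimeSeconds value and return the response's Messages list (or an empty list), and delete would wrap delete_message with the message's ReceiptHandle. The trade-off is that Ctrl+C can take up to the full wait time to have an effect.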

Turtles All The Way Down

One could make an argument that computer science is the study of effective abstractions. It is no small challenge to build interfaces on lower-level details in a way that enables higher-level capabilities. But once in place, the higher-level constructs become the next layer’s low-level details, and exponentially-growing design power is unlocked.

Nowhere is this more apparent than in the explosion of cloud computing, where hardware itself has been abstracted away, where “serverless architectures” and “managed services” have enabled a form of “pure thought stuff” that Fred Brooks could only dream about.

At least in theory. In reality, there is no perfect abstraction in which the lower-level details become completely irrelevant. We do a disservice to software developers when we pretend that, because high-level abstractions like AWS Lambdas exist, their underlying implementations never need to be understood. When things go wrong, the engineer must descend into the particulars, and an inability to minimally reason about, if not fully grasp, what lies beneath an abstraction can prove fatal to the debugging process.

Consider my previous post. Node’s package management system has enabled an explosion of abstractions that power some of the web’s best tools, but too often developers are not trained on what it’s doing or how to fix problems. Package documentation makes it sound so simple (“just run npm i and you’re golden!”) But if you want to use npm, you need to grok the details, or you’ll never be productive.

As another example, last week I was troubleshooting a deployment to Lambda, and the issue ended up being file permissions inside the zipped code package. One might be inclined to believe that, since Lambda is “serverless”, the upload simply floats into the clouds and magically does its work. But of course that’s untrue: there is a server (with its myriad hardware abstractions), there is an operating system and corresponding system user, there is a disk to which those files are written, and there are file permissions on said disk. And if the files are not readable by the system user (e.g. if they were created on a machine with a restrictive umask) the Lambda cannot function. What seems a minor detail proves critical.
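A check like the following would have caught the problem before upload. It is a rough sketch under assumptions: the function name is invented, and it treats a missing world-readable bit as the failure condition, which matches what bit me but may not cover every way a package can be unreadable.

```python
import io
import stat
import zipfile


def unreadable_entries(zip_bytes: bytes):
    """Return names of archive entries that lack the world-readable bit."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # Zip tools on Unix store the file mode in the high 16 bits.
            mode = info.external_attr >> 16
            if mode == 0:
                continue  # no Unix mode recorded (e.g. archive built elsewhere)
            if not mode & stat.S_IROTH:
                problems.append(info.filename)
    return problems
```

Read the deployment package in binary mode and pass its bytes in; an empty list means every entry that records a mode is readable by other users.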

Is there a way to hide that detail from the user? Maybe? I don’t claim to understand the complex domain of cloud function implementations (if one had to do so to use them, few could), but I’m glad I had sufficient knowledge to know what to consider when I experienced trouble.

To Alcohol and WiFi!

The causes of, and solutions to, all of life’s problems.

It isn’t often that an intermittent network connection is a benefit, but in this case a connection hiccup actually tipped me off to a useful workaround.

When you’re an engineering manager, you’re “important”, which means you have to go to a lot of meetings. And because you’re so very “important”, you can’t be troubled to close your laptop when walking across the office to said meetings, because you might miss someone’s giphy on Slack. Pretty sure I looked like an idiot, but that’s the price you pay for being in charge. Or something.

Anyhow, I’d been fighting an npm issue all morning (natch), where a particular module (bcrypt) was core dumping on my Mac. Not cool, bcrypt, not cool. Couldn’t figure out what was going on, but as is typical, “have you tried erasing your node_modules folder and re-running npm install?” Actually I had, but I was getting desperate, so thought I’d give it one more go. While simultaneously picking up my laptop to head to a meeting (keeping it open as I walked, because “important things” happening on it).

I arrived at the meeting (no idea what it was about, also pretty typical), and when the npm install had finally finished, I tried the program again, and lo and behold, it worked! I think at this moment I audibly exclaimed my excitement, despite the outburst not fitting the context of the meeting; that’s how happy I was. But I was also a bit befuddled. What had changed?

So I pored over the logs, both from the install that didn’t work, and the one that had (God bless anyone that ever has to review an npm log, it’s a special kind of hell).  Check out extracts from the install that failed to run, and the one that worked. Do you see the difference?

Please go look. I’ll wait.

Figure it out?

Did you notice that the binary of bcrypt failed to download in the second log, and npm fell back to compiling from source? That was the secret! Something must’ve been wrong with the prebuilt version for Mac. Now, I never solved what caused the crash in that build, but it was easy enough to work around it with npm install --build-from-source.

But the real serendipity was the likely cause of the download failure. The only explanation I can think of is that our office’s crummy WiFi happened to flake out briefly as I was carrying my open laptop across the hall, just at the moment when the bcrypt binary was being downloaded, causing it to fail. But the network was back by the time the source tarball was downloaded, and the rest of the process finished normally.

Even as I write, it sounds preposterous. What are the odds? Maybe it was something else, I don’t have any proof. But you’ll never convince me.

A Tale As Old As 2001

For the next week or two I’m going to go back through my old drafts and finish them up. That means the stories are at least a year or two old. For this one, I’m curious if Edge finally changed the behavior. Anyone want to try it out?

When you’re debugging a pernicious issue, there’s no greater feeling than Google search auto-completing your first couple search terms and matching a page that describes your problem to a T. The challenge of course is figuring out those magic couple of words.

The team was recently trying to figure out an IE11-only problem (ugh) where our authentication mechanism was failing, but only for a subset of customers, with no obvious commonality. The server would return a Set-Cookie header, but the browser completely ignored it. WTF, Microsoft!

We’d spent an entire day trying to come up with a solution, until finally stumbling into the root cause: underscores in the subdomain. Chrome and Firefox are cool with them, but IE silently refuses to store cookies when they’re present. The details are a fascinating combination of unexpected side effects from a bug fix, misinterpreted web standards, and lingering backwards compatibility. This post captures the story nicely.

My product manager had never been thrilled with the way we’d been handling domain names. While I couldn’t have anticipated our design would lead to this misadventure (and a simple s/_/-/ solved the problem), I probably should have given his critique a closer listen.
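For what it’s worth, the check and the fix are both tiny. This is a sketch with hypothetical function names and a made-up hostname, not our production code; it simply flags DNS labels containing an underscore (classic hostname rules allow only letters, digits, and hyphens) and applies the same substitution as the s/_/-/ above.

```python
def labels_with_underscores(hostname: str):
    """Return the dot-separated labels of a hostname that contain an underscore."""
    return [label for label in hostname.split(".") if "_" in label]


def dash_hostname(hostname: str) -> str:
    """Apply the s/_/-/ fix across the whole hostname."""
    return hostname.replace("_", "-")


# Hypothetical customer subdomain, for illustration only.
host = "acme_corp.example.com"
if labels_with_underscores(host):
    host = dash_hostname(host)
```

Running the check when a subdomain is first created would surface the problem long before a browser silently drops a cookie.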

Way Way Back

I’ve had the pleasure of working with a large variety of technologies over the course of my career. Yesterday I was working on an interface to an old government database, without the aid of documentation, natch. After a few hours I was able to extract data from the system, but I was unable to decode it. Google is a great tool (I couldn’t get 15 minutes into my day without it), but if you don’t know what to search for you can’t find anything. Thankfully some careful guesses led me to the Wikipedia entry for EBCDIC, a character encoding developed in the mid-60s.

“Extended Binary Coded Decimal Interchange Code (EBCDIC) is an eight-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. EBCDIC descended from the code used with punched cards and the corresponding six bit binary-coded decimal code used with most of IBM’s computer peripherals of the late 1950s and early 1960s.”

It’s a fun thing in this industry to say you’ve worked with 50-year-old technology, but only once you’ve figured it out. Surprisingly there’s some modern tooling to interface with this particular encoding (yay Python!), so once I knew what I was dealing with it was straightforward enough. Maybe even easier than what it used to be, if this anecdote is to be believed:

EBCDIC: An alleged character set used on IBM dinosaurs. It exists in at least six mutually incompatible versions, all featuring such delights as non-contiguous letter sequences and the absence of several ASCII punctuation characters fairly important for modern computer languages (exactly which characters are absent varies according to which version of EBCDIC you’re looking at). IBM adapted EBCDIC from punched card code in the early 1960s and promulgated it as a customer-control tactic, spurning the already established ASCII standard. Today, IBM claims to be an open-systems company, but IBM’s own description of the EBCDIC variants and how to convert between them is still internally classified top-secret, burn-before-reading. Hackers blanch at the very name of EBCDIC and consider it a manifestation of purest evil.
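The modern tooling mentioned above amounts to Python’s built-in codecs. A minimal sketch follows, assuming code page 037 (the US/Canada variant); which variant a given system actually uses is something you have to find out, and the helper name and sample bytes are invented for illustration.

```python
def decode_ebcdic(raw: bytes, codec: str = "cp037") -> str:
    """Decode EBCDIC bytes using one of Python's bundled code pages."""
    return raw.decode(codec)


# Five bytes that spell "Hello" in code page 037.
sample = bytes([0xC8, 0x85, 0x93, 0x93, 0x96])
text = decode_ebcdic(sample)

# The "non-contiguous letter sequences" are easy to see: in ASCII, 'i' and 'j'
# are neighbours, while in this code page there is a gap between them.
gap = "j".encode("cp037")[0] - "i".encode("cp037")[0]
```

If the output looks almost right but the punctuation is off, trying another bundled code page such as cp500 or cp1140 is a reasonable next guess.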