Off To The Races

In my previous post I mentioned an issue I had when building a CDK construct. As promised, today I’ll go through the problem I found: a dreaded race condition. As anyone who’s spent much time debugging software knows, race conditions are a pernicious class of bug in which behavior varies depending on the order in which different parts of the system execute, causing intermittent failures.

For background, part of the power of CDK is that it provides a framework for executing raw AWS API calls as part of a larger deployment. This is useful in numerous circumstances. For my construct, it enabled me to output several Managed Blockchain parameters that are available via API call but not from CloudFormation.

Under the hood these API calls are executed in a Lambda function that is created just for this purpose. This function has an IAM role, to which various permission policies are applied. For efficiency, it is only created once during a deployment, and then shared across all the API calls in the stack.

In order to keep my code well-organized, I’ve broken the API calls out into several places: one to gather data on the network member, and another to gather data for each peer node. And as a security best practice I want the permissions to be scoped as narrowly as possible. That means at each point in my construct where I call the function, I attach a policy that allows access only to the specific member or node being queried, via an explicit identifier.
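Here’s a rough sketch of the pattern (not my construct’s exact code): a CDK AwsCustomResource that calls the Managed Blockchain GetMember API, with the attached policy scoped to a single member. The identifiers, ARN shape, and response field path are placeholders for illustration, and I’m using the aws-cdk-lib v2 import style.

```typescript
import {
  Stack,
  StackProps,
  CfnOutput,
  custom_resources as cr,
  aws_iam as iam,
} from "aws-cdk-lib";
import { Construct } from "constructs";

// Placeholder identifiers; in a real construct these come from the
// Managed Blockchain resources being created. The ARN format shown here
// is illustrative, not authoritative.
const networkId = "n-EXAMPLE";
const memberId = "m-EXAMPLE";
const memberArn = `arn:aws:managedblockchain:us-east-1:123456789012:members/${memberId}`;

export class MemberDetailsStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // A custom resource backed by the shared singleton Lambda; it calls
    // the GetMember API during deployment.
    const getMember = new cr.AwsCustomResource(this, "GetMemberDetails", {
      onUpdate: {
        service: "ManagedBlockchain",
        action: "getMember",
        parameters: { NetworkId: networkId, MemberId: memberId },
        physicalResourceId: cr.PhysicalResourceId.of(`member-details-${memberId}`),
      },
      // Narrowly scoped: the singleton Lambda's role is only granted
      // access to this specific member, not a wildcard.
      policy: cr.AwsCustomResourcePolicy.fromStatements([
        new iam.PolicyStatement({
          actions: ["managedblockchain:GetMember"],
          resources: [memberArn],
        }),
      ]),
    });

    // Surface a value that CloudFormation alone doesn't expose
    // (illustrative response field path).
    const caEndpoint = getMember.getResponseField(
      "Member.FrameworkAttributes.Fabric.CaEndpoint"
    );
    new CfnOutput(this, "MemberCaEndpoint", { value: caEndpoint });
  }
}
```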

Here’s the problem: IAM is an eventually consistent service, so policy updates are not immediately effective. Typical propagation time is only a few seconds, but it can take longer in certain circumstances. For the first custom API call in a CDK stack this is not an issue: the policy and role are created, and then the Lambda is created, the latter taking over a minute to fully instantiate because it upgrades its dependencies at launch. On subsequent calls, however, because the Lambda is already warmed up, it runs immediately after the preceding policy update; about half the time that policy is not yet effective, and the function fails with a permission error.

It’s the “sometimes it works, sometimes it doesn’t” nature of race conditions that makes them so difficult to track down. Thankfully I was able to identify and document my experience and pass it along to the CDK team. Anyone want to take a crack at a solution? I described several possible approaches in my write-up, with the “simple retry logic” approach likely being the best.
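To illustrate what I mean by that last approach, here’s a generic sketch of retry-with-backoff around a call that might hit a not-yet-propagated policy. This isn’t CDK code, and in practice the retry would have to live in the framework-provided handler rather than in my construct; the error shapes checked below are just examples (they vary by SDK version).

```typescript
// Retry a call a few times with exponential backoff when it fails with an
// access-denied error, giving IAM time to propagate.
async function callWithRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 5
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err: any) {
      const denied =
        err?.name === "AccessDeniedException" || err?.code === "AccessDenied";
      if (!denied || attempt >= maxAttempts - 1) throw err;
      // Back off 1s, 2s, 4s, ... to ride out IAM propagation delay.
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000));
    }
  }
}
```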

A Number Of Numbers

Back in my math major days in college, I was introduced to the On-Line Encyclopedia of Integer Sequences. It’s exactly what it says on the tin. As part of a class we were encouraged to contribute, which I did.

A few weeks ago I had another idea for a submission, and to my surprise no one else had added it, so once again I had opportunity to contribute a little piece of Internet history.

Here’s a complete list of the sequences I’ve authored over the years:

If the above isn’t enough online notoriety, check out my only published mathematical work, A Probabilistic View of Certain Weighted Fibonacci Sums. I was only a minor contributor, but still got an author credit, which is pretty cool.

The Butterfly Effect

I was privileged to have access to computers from an early age, from the humble TI-99/4A of my early elementary years to a snappy Pentium in high school (can’t remember the exact model, but it was pretty expensive; perhaps the 133MHz version?). The influence this access had on my life cannot be overstated.

Young Jud on TI-99/4A
Train up a child in the way they should go

Now that I’m firmly in middle age, and on a career path where I’m regularly evaluating technical talent, I’m reminded of that privilege, and how so many didn’t have it then, and some still don’t have it now. How much untapped potential there must be within these groups!

If we’re going to overcome the lack of diversity in tech, it starts with access; early access, when life-long perceptions are formed. As the saying goes: the best time to plant a tree is 25 years ago, but the second best time is today. Gotta get planting!

There’s Gold In Them Thar Hills

In my conversations with fellow engineers, git comes up quite a bit. I find myself regularly giving advice both tactical and strategic on its effective use. Learning it in detail is a force multiplier, but few people do. Part of the problem is that training materials are all over the map.

Which is why I was so pleased to discover Git from the inside out. It is without question the best introduction to git I’ve come across, perfectly balancing teaching basic commands with explaining what’s actually happening. Despite having used git at a fairly advanced level for 10 years, I still learned some new things. For example, each git add creates an immutable blob object that is retained for a while even if you git add the same file again, and even if you never commit it. It’s also pretty easy to decode raw git objects should you ever need to; here’s a script I wrote to do just that, if you’re curious.
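If you want a feel for just how simple those objects are, here’s a minimal sketch (not the linked script) of reading a loose object straight off disk. It assumes the default .git layout and uses Node’s built-in zlib.

```typescript
import { inflateSync } from "zlib";
import { readFileSync } from "fs";

// Loose objects live at .git/objects/<first 2 chars of sha>/<remaining 38 chars>,
// zlib-deflated. The decompressed payload starts with a header like "blob 12\0",
// followed by the raw content.
function readLooseObject(
  gitDir: string,
  sha: string
): { type: string; size: number; body: Buffer } {
  const path = `${gitDir}/objects/${sha.slice(0, 2)}/${sha.slice(2)}`;
  const raw = inflateSync(readFileSync(path));
  const nul = raw.indexOf(0);
  const [type, size] = raw.slice(0, nul).toString().split(" ");
  return { type, size: Number(size), body: raw.slice(nul + 1) };
}

// Usage: stage a file, get its blob sha with `git hash-object <file>`,
// then inspect the stored object directly, e.g.:
//   npx ts-node read-object.ts <sha>
const { type, size, body } = readLooseObject(".git", process.argv[2]);
console.log(`${type} (${size} bytes):`);
console.log(body.toString("utf8"));
```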

I’ve said before that abstractions are valuable, but they’re not excuses to avoid learning internals, because critical information lies beneath the surface. At the risk of pretentiously quoting myself:

When things go wrong, the engineer must descend into the particulars, and an inability to minimally reason about, if not fully grasp, what lies beneath an abstraction can prove fatal to the debugging process.

I didn’t write the above with version control in mind, but I surely could have. Engineering organizations are full of developers who get stuck the moment a git command fails. You don’t have to be that developer!

Turduckens Everywhere

Did you know you can implement a CPU in Minecraft or write arbitrary computer programs using Magic: the Gathering cards? Computing is powerfully weird, and hidden structures and capabilities can arise from all sorts of odd places.

I read a theory a while back that the Internet was, in some meaningful sense, intelligently conscious, though we may never be able to interact with such an intelligence. The analogy was to human consciousness arising from the cells of the brain in a way that no individual cell could ever comprehend, yet being no less real for that. The notion of Turing-complete computers buried inside all manner of languages and tools has a similar vibe to it.

Like All Good Things

The Morning Paper is wrapping things up, as explained in its final post. A treasure trove of computer science deep dives, it will be missed. But I can certainly relate to the desire to move on, especially when a global pandemic has brought life’s various priorities into sharper focus.

Luckily for all of us there is a rich back catalog of posts that could keep one busy for months. Do check them out. Need a place to start? Here’s my absolute favorite: Applying the Universal Scalability Law to organizations.

We’re All In This Together

I went far too many years into my career before truly trying to understand networking, despite it being an increasingly common source of problems. If I could give my younger self some advice, I’d recommend taking the time to learn it early.

In that spirit I present to you How DNS Works, a “fun and colorful” comic that describes in detail the operation of the Domain Name System. Enjoy!

What The Devil’s In

AWS provides a number of fantastic managed services that make building applications quick and easy. At least for the most part. But there are plenty of interesting gotchas, and instances where the underlying details matter.

This past week I was working on an app that used the Simple Queue Service (SQS) to exchange messages between components, and I had implemented long polling to reduce the cost of repeated API calls. I’d also set a long visibility timeout because the processor took a significant amount of time to handle each message.

During the course of testing I was finding that messages were getting stuck in an “in-flight” state; given the long visibility timeout, this was causing delays in processing because the handler had to wait for the timeout to expire for these stuck messages. But I couldn’t initially figure out why the messages were getting stuck in the first place. I only had one handler thread; why were messages getting pulled in flight, but not getting processed and eventually removed?

It turns out the reason was that in the course of testing I was regularly killing off the handler with Ctrl+C and restarting it. And that terminate signal was cutting short the long poll API call into SQS. Why did that matter? Because a long poll call fires off a process on the AWS servers that is waiting for messages to show up on the queue so it can return them. That process continues to run even if the client that initiated it dies. Thus if a message shows up on the queue after the client goes away, but before the long poll time expires, it’s taken off the queue as “in flight”, but sits there until the visibility timeout hits because there’s nothing to subsequently process and delete it.

I was unable to figure out the above until I learned more about what actually happens within AWS during an SQS long poll. Finding this thread about the Node.js client helped too (I was writing my client in Python but the behavior is common across all SDK implementations). If I’d only been able to reason at the level of the queue abstraction, I’m not sure I could have solved the problem. Once again, descending into the particulars was the path to a solution.
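For reference, here’s a minimal sketch of the consumer loop in question, written against the AWS SDK for JavaScript v3 rather than the Python client I was actually using; the queue URL and timeout values are placeholders. The key point is that the delete only happens after processing, so a server-side receive with no surviving client strands the message in flight.

```typescript
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QueueUrl = process.env.QUEUE_URL ?? ""; // placeholder queue URL

async function handle(body: string): Promise<void> {
  // Slow processing step (the reason for the long visibility timeout).
  console.log("processing", body);
}

async function pollOnce(): Promise<void> {
  // Long poll: the server holds this request open for up to 20 seconds
  // waiting for a message, and keeps waiting even if this client dies
  // mid-call.
  const { Messages } = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl,
      WaitTimeSeconds: 20,
      VisibilityTimeout: 900, // long timeout because processing is slow
      MaxNumberOfMessages: 1,
    })
  );

  for (const msg of Messages ?? []) {
    await handle(msg.Body ?? "");
    // Only this delete removes the message. If the client was killed after
    // the server-side receive but before this point, the message sits
    // "in flight" until the visibility timeout expires.
    await sqs.send(
      new DeleteMessageCommand({ QueueUrl, ReceiptHandle: msg.ReceiptHandle })
    );
  }
}

async function main(): Promise<void> {
  // Ctrl+C during the awaited receive is exactly what stranded my messages.
  while (true) {
    await pollOnce();
  }
}

main();
```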

Turtles All The Way Down

One could make an argument that computer science is the study of effective abstractions. It is no small challenge to build interfaces on lower-level details in a way that enables higher-level capabilities. But once in place, the higher-level constructs become the next layer’s low-level details, and exponentially growing design power is unlocked.

Nowhere is this more apparent than in the explosion of cloud computing, where hardware itself has been abstracted away, where “serverless architectures” and “managed services” have enabled a form of “pure thought stuff” that Fred Brooks could only dream about.

At least in theory. In reality, there is no perfect abstraction in which the lower-level details become completely irrelevant. We do a disservice to software developers when we pretend that because high-level abstractions like AWS Lambdas exist that their underlying implementations never need to be understood. When things go wrong, the engineer must descend into the particulars, and an inability to minimally reason about, if not fully grasp, what lies beneath an abstraction can prove fatal to the debugging process.

Consider my previous post. Node’s package management system has enabled an explosion of abstractions that power some of the web’s best tools, but too often developers are not trained on what it’s doing or how to fix problems. Package documentation makes it sound so simple (“just run npm i and you’re golden!”). But if you want to use npm, you need to grok the details, or you’ll never be productive.

As another example, last week I was troubleshooting a deployment to Lambda, and the issue ended up being file permissions inside the zipped code package. One might be inclined to believe that since Lambda is “serverless”, the upload simply floats into the clouds and magically does its work. But of course that’s untrue: there is a server (with its myriad hardware abstractions), there is an operating system and corresponding system user, there is a disk to which those files are written, and there are file permissions on said disk. And if the files are not readable by the system user (e.g. if they were created on a machine with a restrictive umask), the Lambda cannot function. What seems a minor detail proves critical.
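The fix itself was mundane: make sure everything in the bundle is readable before zipping it up. Here’s a rough sketch of that kind of pre-package step in Node; the dist directory name is a placeholder, and this isn’t the exact script I used.

```typescript
import { readdirSync, statSync, chmodSync } from "fs";
import { join } from "path";

// Ensure the Lambda runtime user (which is not the user who built the
// package) can read every file in the bundle before it gets zipped.
function makeWorldReadable(dir: string): void {
  for (const name of readdirSync(dir)) {
    const path = join(dir, name);
    const mode = statSync(path).mode;
    if (statSync(path).isDirectory()) {
      // Directories also need the execute bit to be traversable.
      chmodSync(path, mode | 0o555);
      makeWorldReadable(path);
    } else {
      // Every file gets at least the read bit for user/group/other.
      chmodSync(path, mode | 0o444);
    }
  }
}

makeWorldReadable("dist"); // placeholder for the build output that gets zipped
```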

Is there a way to hide that detail from the user? Maybe? I don’t claim to understand the complex domain of cloud function implementations (if one had to do so to use them, few could), but I’m glad I had sufficient knowledge to know what to consider when I experienced trouble.

To Alcohol and WiFi!

The causes of, and solutions to, all of life’s problems.

It isn’t often that an intermittent network connection is a benefit, but in this case a connection hiccup actually tipped me off to a useful workaround.

When you’re an engineering manager, you’re “important”, which means you have to go to a lot of meetings. And because you’re so very “important”, you can’t be troubled to close your laptop when walking across the office to said meetings, because you might miss someone’s giphy on Slack. Pretty sure I looked like an idiot, but that’s the price you pay for being in charge. Or something.

Anyhow, I’d been fighting an npm issue all morning (natch), where a particular module (bcrypt) was core dumping on my Mac. Not cool, bcrypt, not cool. Couldn’t figure out what was going on, but as is typical, the standard advice came up: “have you tried erasing your node_modules folder and re-running npm install?” Actually I had, but I was getting desperate, so I thought I’d give it one more go. While simultaneously picking up my laptop to head to a meeting (keeping it open as I walked, because “important things” were happening on it).

I arrived at the meeting (no idea what it was about, also pretty typical), and when the npm install had finally finished, I tried the program again, and lo and behold, it worked! I think at that moment I audibly exclaimed my excitement, despite the outburst not fitting the context of the meeting; that’s how happy I was. But I was also a bit befuddled. What had changed?

So I pored over the logs, both from the install that didn’t work and the one that had (God bless anyone who ever has to review an npm log; it’s a special kind of hell). Check out extracts from the install that failed to run, and the one that worked. Do you see the difference?

Please go look. I’ll wait.

Figure it out?

Did you notice that the prebuilt bcrypt binary failed to download in the second log, and npm fell back to compiling from source? That was the secret! Something must’ve been wrong with the prebuilt version for Mac. Now, I never figured out what caused the crash in that build, but it was easy enough to work around it with npm install --build-from-source.

But the real serendipity was the likely cause of the download failure. The only explanation I can think of is that our office’s crummy WiFi happened to flake out briefly as I was carrying my open laptop across the hall, just at the moment when the bcrypt binary was being downloaded, causing it to fail. But the network was back by the time the source tarball was downloaded, and the rest of the process finished normally.

Even as I write this, it sounds preposterous. What are the odds? Maybe it was something else; I don’t have any proof. But you’ll never convince me otherwise.