Tag: Dive Deep

What The Devil’s In

AWS provides a number of fantastic managed services that make building applications quick and easy. At least for the most part. But there are plenty of interesting gotchas, and instances where the underlying details matter.

This past week I was working on an app that used the Simple Queue Service (SQS) to exchange messages between components, and I had implemented long polling to reduce the cost of repeated API calls. I’d also set a long visibility timeout because the processor took a significant amount of time to handle each message.

During the course of testing I was finding that messages were getting stuck in an “in-flight” state; given the long visibility timeout, this was causing delays in processing because the handler had to wait for the timeout to expire for these stuck messages. But I couldn’t initially figure out why the messages were getting stuck in the first place. I only had one handler thread; why were messages getting pulled in flight, but not getting processed and eventually removed?

It turns out the reason was that in the course of testing I was regularly killing off the handler with Ctrl+C and restarting it. And that interrupt signal was cutting short the long poll API call into SQS. Why did that matter? Because a long poll call fires off a process on the AWS servers that is waiting for messages to show up on the queue so it can return them. That process continues to run even if the client that initiated it dies. Thus if a message shows up on the queue after the client goes away, but before the long poll time expires, it’s taken off the queue as “in flight”, but sits there until the visibility timeout hits because there’s nothing to subsequently process and delete it.

I was unable to figure out the above until I learned more about what actually happens within AWS during an SQS long poll. Finding this thread about the Node.js client helped too (I was writing my client in Python but the behavior is common across all SDK implementations). If I’d only been able to reason at the level of the queue abstraction, I’m not sure I could have solved the problem. Once again, descending into the particulars was the path to a solution.
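Here’s a minimal sketch of one way to sidestep the problem: catch Ctrl+C, note the request, and let the in-progress long poll finish before exiting. It is not the code from the app described above. The names poll_loop, request_stop, and stop_requested are made up for illustration, and the sqs argument is assumed to be a boto3 SQS client (receive_message and delete_message are its real methods).

```python
import signal
import threading

# Set when the user asks to stop; checked between long polls.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Record the shutdown request instead of aborting the in-progress call."""
    stop_requested.set()


def poll_loop(sqs, queue_url, handle, wait_seconds=20):
    """Long-poll until a stop is requested, finishing any received work first."""
    while not stop_requested.is_set():
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=wait_seconds,  # long poll
        )
        for message in response.get("Messages", []):
            handle(message["Body"])
            # Delete only after the handler succeeds, so nothing is lost.
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
            )


# Wiring it up (assumes boto3 is installed; the queue URL is a placeholder):
#   import boto3
#   signal.signal(signal.SIGINT, request_stop)
#   poll_loop(boto3.client("sqs"), "https://example.invalid/my-queue", print)
```

The trade-off is that Ctrl+C can take up to wait_seconds to have an effect, but a message that arrives during the final poll gets handled and deleted rather than stranded in flight.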

Turtles All The Way Down

One could make an argument that computer science is the study of effective abstractions. It is no small challenge to build interfaces on lower-level details in a way that enables higher-level capabilities. But once in place, the higher-level constructs become the next layer’s low-level details, and exponentially-growing design power is unlocked.

Nowhere is this more apparent than in the explosion of cloud computing, where hardware itself has been abstracted away, where “serverless architectures” and “managed services” have enabled a form of “pure thought stuff” that Fred Brooks could only dream about.

At least in theory. In reality, there is no perfect abstraction in which the lower-level details become completely irrelevant. We do a disservice to software developers when we pretend that, because high-level abstractions like AWS Lambda exist, their underlying implementations never need to be understood. When things go wrong, the engineer must descend into the particulars, and an inability to minimally reason about, if not fully grasp, what lies beneath an abstraction can prove fatal to the debugging process.

Consider my previous post. Node’s package management system has enabled an explosion of abstractions that power some of the web’s best tools, but too often developers are not trained on what it’s doing or how to fix problems. Package documentation makes it sound so simple (“just run npm i and you’re golden!”). But if you want to use npm, you need to grok the details, or you’ll never be productive.

As another example, last week I was troubleshooting a deployment to Lambda, and the issue ended up being file permissions inside the zipped code package. One might be inclined to believe that, since Lambda is “serverless”, the upload simply floats into the clouds and magically does its work. But of course that’s untrue: there is a server (with its myriad hardware abstractions), there is an operating system and corresponding system user, there is a disk to which those files are written, and there are file permissions on said disk. And if the files are not readable by the system user (e.g. if they were created on a machine with a restrictive umask) the Lambda cannot function. What seems a minor detail proves critical.

Is there a way to hide that detail from the user? Maybe? I don’t claim to understand the complex domain of cloud function implementations (if one had to do so to use them, few could), but I’m glad I had sufficient knowledge to know what to consider when I experienced trouble.
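For the curious, the permission bits travel inside the zip itself, so they can be set and checked without touching the filesystem. The sketch below uses Python’s zipfile module. The helper names add_readable_file and unreadable_entries are hypothetical, and it assumes the runtime user is not the file owner, so the world-readable bit is the one that matters.

```python
import io
import zipfile


def add_readable_file(archive, arcname, data, mode=0o644):
    """Write data into the archive with explicit Unix permission bits."""
    info = zipfile.ZipInfo(arcname)
    info.create_system = 3  # 3 = Unix, so the mode bits below are meaningful
    info.external_attr = (0o100000 | mode) << 16  # regular file + permissions
    info.compress_type = zipfile.ZIP_DEFLATED
    archive.writestr(info, data)


def unreadable_entries(archive):
    """Return names of entries that lack the world-readable bit."""
    return [
        info.filename
        for info in archive.infolist()
        if not (info.external_attr >> 16) & 0o004
    ]


# Build a package in memory and confirm nothing is locked down by accident.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    add_readable_file(package, "handler.py", "def handler(event, context): return 'ok'\n")
with zipfile.ZipFile(buffer) as package:
    print(unreadable_entries(package))
```

Running unreadable_entries against a package before uploading is a cheap way to catch a restrictive umask early.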

To Alcohol and WiFi!

The causes of, and solutions to, all of life’s problems.

It isn’t often that an intermittent network connection is a benefit, but in this case a connection hiccup actually tipped me off to a useful workaround.

When you’re an engineering manager, you’re “important”, which means you have to go to a lot of meetings. And because you’re so very “important”, you can’t be troubled to close your laptop when walking across the office to said meetings, because you might miss someone’s giphy on Slack. Pretty sure I looked like an idiot, but that’s the price you pay for being in charge. Or something.

Anyhow, I’d been fighting an npm issue all morning (natch), where a particular module (bcrypt) was core dumping on my Mac. Not cool, bcrypt, not cool. Couldn’t figure out what was going on, but as is typical, “have you tried erasing your node_modules folder and re-running npm install?” Actually I had, but I was getting desperate, so thought I’d give it one more go. While simultaneously picking up my laptop to head to a meeting (keeping it open as I walked, because “important things” happening on it).

I arrived at the meeting (no idea what it was about, also pretty typical), and when the npm install had finally finished, I tried the program again, and lo and behold, it worked! I think at this moment I audibly exclaimed my excitement, despite the outburst not fitting the context of the meeting; that’s how happy I was. But I was also a bit befuddled. What had changed?

So I pored over the logs, both from the install that didn’t work, and the one that had (God bless anyone who ever has to review an npm log, it’s a special kind of hell). Check out extracts from the install that failed to run, and the one that worked. Do you see the difference?

Please go look. I’ll wait.

Figure it out?

Did you notice that the binary of bcrypt failed to download in the second log, and npm fell back to compiling from source? That was the secret! Something must’ve been wrong with the prebuilt version for Mac. Now, I never solved what caused the crash in that build, but it was easy enough to work around it with npm install --build-from-source.

But the real serendipity was the likely cause of the download failure. The only explanation I can think of is that our office’s crummy WiFi happened to flake out briefly as I was carrying my open laptop across the hall, just at the moment when the bcrypt binary was being downloaded, causing it to fail. But the network was back by the time the source tarball was downloaded, and the rest of the process finished normally.

Even as I write, it sounds preposterous. What are the odds? Maybe it was something else, I don’t have any proof. But you’ll never convince me.

A Tale As Old As 2001

For the next week or two I’m going to go back through my old drafts and finish them up. That means the stories are at least a year or two old. For this one, I’m curious if Edge finally changed the behavior. Anyone want to try it out?

When you’re debugging a pernicious issue, there’s no greater feeling than Google search auto-completing your first couple search terms and matching a page that describes your problem to a T. The challenge of course is figuring out those magic couple of words.

The team was recently trying to figure out an IE11-only problem (ugh) where our authentication mechanism was failing, but only for a subset of customers, with no obvious commonality. The server would return a Set-Cookie header, but the browser completely ignored it. WTF, Microsoft!

We’d spent an entire day trying to come up with a solution, until finally stumbling into the root cause: underscores in the subdomain. Chrome and Firefox are cool with them, but IE silently refuses to store cookies when they’re present. The details are a fascinating combination of unexpected side effects from a bug fix, misinterpreted web standards, and lingering backwards compatibility. This post captures the story nicely.

My product manager had never been thrilled with the way we’d been handling domain names. While I couldn’t have anticipated our design would lead to this misadventure (and a simple s/_/-/ solved the problem), I probably should have given his critique a closer listen.
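A check like the following would have flagged the offending hostnames before any customer hit them. It’s only a sketch: the function names and example hostnames are hypothetical, and it assumes the strict letters-digits-hyphens rule for hostname labels is the one worth validating against.

```python
import re

# One hostname label: letters, digits, and hyphens; no leading or trailing hyphen.
LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_strict_hostname(hostname):
    """True if every dot-separated label follows the strict hostname rule."""
    return all(LABEL.fullmatch(label) for label in hostname.split("."))


def fix_hostname(hostname):
    """The same substitution as s/_/-/, applied to the whole hostname."""
    return hostname.replace("_", "-")


for host in ["acme_corp.example.com", "acme-corp.example.com"]:
    print(host, is_strict_hostname(host), fix_hostname(host))
```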

Way Way Back

I’ve had the pleasure of working with a large variety of technologies over the course of my career. Yesterday I was working on an interface to an old government database, without the aid of documentation, natch. After a few hours I was able to extract data from the system, but I was unable to decode it. Google is a great tool (I couldn’t get 15 minutes into my day without it), but if you don’t know what to search for you can’t find anything. Thankfully some careful guesses led me to the Wikipedia entry for EBCDIC, a character encoding developed in the mid-60s.

“Extended Binary Coded Decimal Interchange Code (EBCDIC) is an eight-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. EBCDIC descended from the code used with punched cards and the corresponding six bit binary-coded decimal code used with most of IBM’s computer peripherals of the late 1950s and early 1960s.”

It’s a fun thing in this industry to say you’ve worked with 50-year-old technology, but only once you’ve figured it out. Surprisingly there’s some modern tooling to interface with this particular encoding (yay Python!), so once I knew what I was dealing with it was straightforward enough. Maybe even easier than what it used to be, if this anecdote is to be believed:

EBCDIC: An alleged character set used on IBM dinosaurs. It exists in at least six mutually incompatible versions, all featuring such delights as non-contiguous letter sequences and the absence of several ASCII punctuation characters fairly important for modern computer languages (exactly which characters are absent varies according to which version of EBCDIC you’re looking at). IBM adapted EBCDIC from punched card code in the early 1960s and promulgated it as a customer-control tactic, spurning the already established ASCII standard. Today, IBM claims to be an open-systems company, but IBM’s own description of the EBCDIC variants and how to convert between them is still internally classified top-secret, burn-before-reading. Hackers blanch at the very name of EBCDIC and consider it a manifestation of purest evil.
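As for that modern tooling: Python ships EBCDIC codecs in its standard library. The sketch below assumes code page 037 (cp037), one of several variants; the post doesn’t say which one the database used, so treat the codec name as a placeholder to swap out. The decode_ebcdic helper is just an illustrative wrapper.

```python
def decode_ebcdic(raw, codec="cp037"):
    """Decode EBCDIC bytes to text; cp037 is an assumed variant."""
    return raw.decode(codec)


# Five bytes that look like noise when read as ASCII.
raw = bytes([0xC8, 0x85, 0x93, 0x93, 0x96])
print(decode_ebcdic(raw))

# The non-contiguous letter sequences are easy to see: 'i' and 'j'
# are adjacent in the alphabet but not in the encoding.
print("i".encode("cp037").hex(), "j".encode("cp037").hex())
```

If the decoded text comes out mostly right but with odd punctuation, trying another variant such as cp500 is a reasonable next step.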

Lions And Tigers And Permutations, Oh My!

This blog is turning into a bit of a “debugging adventure of the day”, which wasn’t the original intent, but whatever. I’ll get around to meatier posts someday.

This morning’s adventure was debugging a pretty serious issue with a web application where selections would not save. Initially it only manifested itself on Windows machines, so we thought Microsoft was to blame. Oddly, though, Chrome failed to work but Edge was fine. Weird.

Wanting to check a few other variables, we grabbed another co-worker’s Windows laptop, and it worked fine in Chrome. WTF!?! Nearly identical hardware, same OS, same browser, and same browser version, but one machine worked, and the other didn’t. Cleared caches, rebooted, etc. Same results.

But did you catch the word “nearly” there? Turns out there was a difference between the two laptops. One had a touchscreen, and the other didn’t. Fast-forward through an hour of slogging through a bunch of hacked together JavaScript code to support drag-and-drop on a variety of devices, and sure enough, the “non-mobile yet touchscreen enabled” case had its own special snowflake handler that wasn’t working correctly in Chrome.

Why, you ask? Well, in some browsers, touch events are delayed by 300ms (this is so things like double-tap and dragging can be detected properly). This causes the mouse click event to go first, and when clicking on a button that submits the form, the form gets processed before the touch handler that configures the form’s inputs has had a chance to fire. If you touch the button instead of clicking with the trackpad (a condition that only makes sense on a touch-enabled laptop vs a tablet), it works okay.

That means this situation only manifests itself on a touchscreen-enabled non-mobile device, in Chrome, when the user clicks the submit button with the trackpad vs touching the button. Can’t hardly blame QA for not finding that one.

The Devil Is In The Details

Software development is not for the faint of heart.

  • The SQL DELIMITER statement is particular to the mysql client tool, so you can’t use a standard dump file to bootstrap a new database instance using mysqld --bootstrap. Eight hours later I figured out a workaround, namely exporting stored procedures directly from the mysql.proc table.
  • A single stray character can entirely break a critical piece of system functionality; finding and removing it required a late night from multiple team members.

And that was just yesterday.

Balance Of Probability

Back in college I co-authored a paper entitled “A Probabilistic View of Certain Weighted Fibonacci Sums” (available here). Looking back on it, I wonder if that particular combination of words had ever been used together before. Contrast that to a sentence like “Could you go to the store and pick up some milk?” Certainly that combination of words has been used many times.

All this was brought to mind after I uttered the following this morning:

“For the purposes of our system, both syntactic and semantic errors should cause input to be rejected as malformed.”

Do you think that sentence has ever been uttered before? What about your average everyday statement? How might we go about analyzing such a thing? It’s these questions that will kick off my first significant blog series, “The Tyranny Of The Finite”, which I hope to start next week.

Until then, today’s challenge is to go out and use a sentence that is unlikely to have ever been said before.

Docker Fail

One of my favorite truisms in software development is the following:

“When two things aren’t the same, then they’re different.”

I don’t care how hard you work to make development and production environments the same, I don’t care how portable you think your programming language is across operating systems, and I especially don’t care about claims made on a product’s marketing material. Case in point:

[Docker] makes for efficient, lightweight, self-contained systems and guarantees that software will always run the same, regardless of where it’s deployed.

No doubt the Docker team has worked incredibly hard to make the above statement true, but nothing in life is guaranteed with 100% certainty. Last week I discovered that the cron utility will not execute any crontab or script that has more than one hard link pointed at it (don’t ask me why not, it just doesn’t). And due to the way Docker’s overlay filesystem works, the operating system can report a regular file as having multiple links. At least it does when the container is running under Kubernetes. But not always! When running with docker-compose on my Mac the crontab only reported one hard link, and thus cron worked great.

Took me nearly a day of fiddling to determine why the container worked great locally but failed in Kubernetes. Argh.
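A quick diagnostic would have saved most of that day: ask the operating system how many hard links it reports for the file, from inside the container. The sketch below is based only on the behavior described above (cron skipping anything with a link count other than one). The helper names are hypothetical and the crontab path in the comment is a placeholder.

```python
import os


def link_count(path):
    """Number of hard links the OS reports for this file."""
    return os.stat(path).st_nlink


def cron_would_skip(path):
    """Assumes cron ignores files whose link count is not exactly 1."""
    return link_count(path) != 1


# Example usage inside a container (path is a placeholder):
#   print(link_count("/etc/cron.d/my-job"), cron_would_skip("/etc/cron.d/my-job"))
```

Comparing the output under docker-compose and under Kubernetes makes the difference between the two environments visible in one line.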

Adventures

Yesterday the team spent most of the day debugging an urgent production issue. The root cause: a seemingly innocuous change made to a custom storage library eight months ago. Lesson learned: changes may have unintended side-effects.

And today I’m back to decompiling Java bytecode from a production server. The original source is long gone. Good times.