Author: Jud

Technologist interested in building both systems and organizations that are secure, scaleable, cost-effective, and most of all, good for humanity.
Tests In The Time Of Corona

Tests In The Time Of Corona

A popular discussion topic these days is how society will change due to the COVID-19 pandemic. One obvious likely outcome is even more will be accomplishable online, for example, the AWS certification exam I took from the comfort of my armchair this past Saturday morning. I thought I’d share my experience, and a few specific recommendations (in bold).

The process started with the normal scheduling workflow via the AWS Training website. Once that was complete and I paid the fee, I got an email with a link to instructions on how to run a system test. The recommendation was to do this right away, which I did. This involved installing a local application, verifying my laptop’s webcam and microphone worked, and testing my Internet bandwidth. Given that the application will completely take over your machine, and you must close all other running processes, I’d recommend using a personally-owned computer for this, and not a work computer (especially if you don’t have local admin).

Also as part of the above process, I had to take a picture of my face, my driver’s license (both sides), and four photos of the location where I’d take my exam, one from each of the front, back, left and right. A link sent to my phone via SMS guided me through these steps (so have your phone handy); it went quicker than I expected, and the photos were automagically submitted. Nice!

The day of the exam, about 15 minutes before my scheduled time, I closed everything on my personal laptop, and launched the client app. And here’s where I had my first hiccup: the app required that I upgrade to the latest version. This meant re-downloading and re-installing, which was annoying; I wished the app could have streamlined that process. Once the update was done, I had to go back through the entire system test process (including taking all the photos mentioned earlier). This was annoying, and made me wonder if the earlier test was even worth taking. At minimum, you want to be sure to get started at least 15 minutes before your scheduled start time in case you run into issues.

Once I got upgraded and back through the system test, my webcam became active and indicated I was being recorded. From this point you cannot leave the room (or move out of view of the camera), or open any other applications. I had to wait in this state for about 5 minutes before I was put in contact with a proctor, first via chat, and then via phone call. Be prepared to sit in silence for a bit while you wait. For a final verification I had to do a sweep of the room with my webcam to show it was empty of both other technology and other people. Be sure at this point that everything and everyone is out of the room. I even unplugged my extra monitor, just in case.

As I tried to launch the actual exam, I hit another problem. The app detected instances of chrome running in the background, even though I thought I’d closed everything. I had to go into activity monitor and kill them all before the test would launch. So, it’s worth looking through activity monitor right before you start to avoid this hangup (though the proctor was friendly and patient with me, which I appreciated).

(Oh, and speaking of things to do right before you start, use the restroom and get a drink of water right before you settle in; absolutely no breaks or food or drink are allowed).

Finally I could load the exam and begin. It’s a bit nerve-wracking to be video-recorded, but after a few minutes I tuned it out. You do need to be conscious not to look around the room, else the proctor might ping you to hold still (this didn’t happen to me, but I read about it). The test interface itself was pretty much like the in-person experience, with the ability to skip ahead, flag questions for review, etc.

Once I completed the exam, there was a survey, just like the in-person experience, and I was given the results (I passed, woot). Then the app closed and I was done. Piece of cake.

Even given the gotchas, I’d say the experience is easier and more convenient than scheduling a time with an in-person testing center, driving there, etc, if for no other reason than the available time slots for online proctoring ranged across all days of the week, and nearly all hours of the day (want to take your exam at 3am? knock yourself out). I’ll definitely use this online proctoring again, probably exclusively, as long as I have option to do so.

Have you been thinking about getting AWS certified? If so, I’d encourage you to go right now and get a date on the calendar. You’ve nothing to lose, and it’s now easier than ever.

CloudFormation Derriere Kicker

CloudFormation Derriere Kicker

Are you using the AWS Cloud Development Kit yet? If your job ever involves creating or maintaining infrastructure (on AWS, natch), you absolutely should be. Take the power of CloudFormation (minus the annoying parts) and mix it with the familiarity of your favorite programming language (Python, am I right? Yes I am. But TypeScript, Java, and .NET are also choices), and you get a killer option in the burgeoning Infrastructure as Code toolbox.

Instead of spending a bunch of time telling you why the CDK is so great, here’s some resources to help you to discover so for yourself:

Happy building!

OOPs I Did It Again

OOPs I Did It Again

Have you ever looked at a heavily object-oriented codebase (Java, I’m looking at you) and immediately felt dumber? Did the design patterns touted as the ultimate in software craftsmanship sound amazing when you studied them in school, but now every time you see them in real projects they result in layer upon layer of confusion?

You are not alone.

There is no one approach to rule them all when it comes to software development, and while I wouldn’t be quite as hard on object-oriented programming as the articles linked above (yes, there are multiple ones), it’s not the only, best, or even appropriate pattern to use in all circumstances.

What The Devil’s In

What The Devil’s In

AWS provides a number of fantastic managed services that make building applications quick and easy. At least for the most part. But there are plenty of interesting gotchas, and instances where the underlying details matter.

This past week I was working on an app that used the Simple Queue Service (SQS) to exchange messages between components, and I had implemented long polling to reduce the cost of repeated API calls. I’d also set a long visibility timeout because the processor took a significant amount of time to handle each message.

During the course of testing I was finding that messages were getting stuck in an “in-flight” state; given the long visibility timeout, this was causing delays in processing because the handler had to wait for the timeout to expire for these stuck messages. But I couldn’t initially figure out why the messages were getting stuck in the first place. I only had one handler thread; why were messages getting pulled in flight, but not getting processed and eventually removed?

It turns out the reason was that in the course of testing I was regularly killing off the handler with Ctrl+C and restarting it. And that terminate signal was cutting short the long poll API call into SQS. Why did that matter? Because a long poll call fires off a process on the AWS servers that is waiting for messages to show up on the queue so it can return them. That process continues to run even if the client that initiated it dies. Thus if a message shows up on the queue after the client goes away, but before the long poll time expires, it’s taken off the queue as “in flight”, but sits there until the visibility timeout hits because there’s nothing to subsequently process and delete it.

I was unable to figure out the above until I learned more about what actually happens within AWS during an SQS long poll. Finding this thread about the Node.js client helped too (I was writing my client in Python but the behavior is common across all SDK implementations). If I’d only been able to reason at the level of the queue abstraction, I’m not sure I could have solved the problem. Once again, descending into the particulars was the path to a solution.

Out Of Sight, Out Of Mind

Out Of Sight, Out Of Mind

Today I came across this statement from Alfred North Whitehead, and instantly loved it as an extension of my previous post on abstractions.

“Civilization advances by extending the number of important operations which we can perform without thinking about them.”

That to me is the essence of abstractions. Not that one needn’t ever be required to dig down into the implementation details, but that the layer on top of those details enables them to be ignored to an increasing degree.

Incidentally, this is the second Alfred North Whitehead reference I’ve come across recently, the first being a mention of his book Science and the Modern World in one of my favorite podcasts. Something tells me I need to dive deeper.

Turtles All The Way Down

Turtles All The Way Down

One could make an argument that computer science is the study of effective abstractions. It is no small challenge to build interfaces on lower-level details in a way that enables higher-level capabilities. But once in place, the higher-level constructs become the next layer’s low-level details, and exponentially-growing design power is unlocked.

Nowhere is this more apparent than in the explosion of cloud computing, where hardware itself has been abstracted away, where “serverless architectures” and “managed services” have enabled a form of “pure thought stuff” that Fred Brooks could only dream about.

At least in theory. In reality, there is no perfect abstraction in which the lower-level details become completely irrelevant. We do a disservice to software developers when we pretend that because high-level abstractions like AWS Lambdas exist that their underlying implementations never need to be understood. When things go wrong, the engineer must descend into the particulars, and an inability to minimally reason about, if not fully grasp, what lies beneath an abstraction can prove fatal to the debugging process.

Consider my previous post. Node’s package management system has enabled an explosion of abstractions that power some of the web’s best tools, but too often developers are not trained on what it’s doing or how to fix problems. Package documentation makes it sound so simple (“just run npm i and you’re golden!”) But if you want to use npm, you need to grok the details, or you’ll never be productive.

As another example, last week I was troubleshooting a deployment to Lambda, and the issue ended up being file permissions inside the zipped code package. One might be inclined to believe that since Lambda is “serverless” that the upload simply floats into the clouds and magically does its work. But of course that’s untrue: there is a server (with its myriad hardware abstractions), there is an operating system and corresponding system user, there is a disk to which those files are written, and there are file permissions on said disk. And if the files are not readable by the system user (e.g. if they were created on a machine with a restrictive umask) the Lambda cannot function. What seems a minor detail proves critical.

Is there a way to hide that detail from the user? Maybe? I don’t claim to understand the complex domain of cloud function implementations (if one had to do so to use them, few could), but I’m glad I had sufficient knowledge to know what to consider when I experienced trouble.

To Alcohol and WiFi!

To Alcohol and WiFi!

The causes of, and solutions to, all of life’s problems.

It isn’t often that an intermittent network connect is a benefit, but in this case a connection hiccup actually tipped me off to a useful workaround.

When you’re an engineering manager, you’re “important”, which means you have to go to a lot of meetings. And because you’re so very “important”, you can’t be troubled to close your laptop when walking across the office to said meetings, because you might miss someone’s giphy on Slack. Pretty sure I looked like an idiot, but that’s the price you pay for being in charge. Or something.

Anyhow, I’d been fighting an npm issue all morning (natch), where a particular module (bcrypt) was core dumping on my Mac. Not cool, bcrypt, not cool. Couldn’t figure out what was going on, but as is typical, “have you tried erasing your node_modules folder and re-running npm install?” Actually I had, but I was getting desperate, so thought I’d give it one more go. While simultaneously picking up my laptop to head to a meeting (keeping it open as I walked, because “important things” happening on it).

I arrived at the meeting (no idea what it was about, also pretty typical), and when the npm install had finally finished, I tried the program again, and lo and behold, it worked! I think at this moment I audibly exclaimed my excitement, despite the outburst not fitting the context of the meeting, that’s how happy I was. But I was also a bit befuddled. What had changed?

So I pored over the logs, both from the install that didn’t work, and the one that had (God bless anyone that ever has to review an npm log, it’s a special kind of hell).  Check out extracts from the install that failed to run, and the one that worked. Do you see the difference?

Please go look. I’ll wait.

Figure it out?

Did you notice that the binary of bcrypt failed to download in the second log, and npm fell back to compiling from source? That was the secret! Something must’ve been wrong with the prebuilt version for Mac. Now, I never solved what caused the crash in that build, but it was easy enough to work around it with npm --build-from-source.

But the real serendipity was the likely cause of the download failure. The only explanation I can think of is that our office’s crummy WiFi happened to flake out briefly as I was carrying my open laptop across the hall, just at the moment when the bcrypt binary was being downloaded, causing it to fail. But the network was back by the time the source tarball was downloaded, and the reset of the process finished normally.

Even as I write, it sounds preposterous. What are the odds? Maybe it was something else, I don’t have any proof. But you’ll never convince me.

A Tale As Old As 2001

A Tale As Old As 2001

For the next week or two I’m going to go back through my old drafts and finish them up. That means the stories are at least a year or two old. For this one, I’m curious if Edge finally changed the behavior. Anyone want to try it out?

When you’re debugging a pernicious issue, there’s no greater feeling than Google search auto-completing your first couple search terms and matching a page that describes your problem to a T. The challenge of course is figuring out those magic couple of words.

The team was recently trying to figure out an IE11-only problem (ugh) where our authentication mechanism was failing, but only for a subset of customers, with no obvious commonality. The server would return a Set-Cookie header, but the browser completely ignored it. WTF, Microsoft!

We’d spent an entire day trying to come up with a solution, until finally stumbling into the root cause: underscores in the subdomain. Chrome and Firefox are cool with them, but IE silently refuses to store cookies when they’re present. The details are a fascinating combination of unexpected side effects from a bug fix, misinterpreted web standards, and lingering backwards compatibility. This post captures the story nicely.

My product manager had never been thrilled with the way we’d been handling domain names. While I couldn’t have anticipated our design would lead to this misadventure (and a simple s/_/-/ solved the problem), I probably should have given his critique a closer listen.

It’s Been Awhile

It’s Been Awhile

Howdy friend. It’s been quiet here for some time now, but as is typical around a new year, I’m renewing my efforts to stay active on this blog (especially since I’ve mostly stopped using social media). This is in no small part to me now working for AWS Professional Services as a Senior Consultant in the public sector, a role for which improving my writing will be particularly valuable.

My silence should not be interpreted as inactivity, because a heck of a lot has gone down since I last posted:

  • Got promoted to the Director of Engineering for a 20+ person team (this actually happened in late 2017 but I’ve never mentioned it here)
  • Led that team through a painful acquisition process that required reducing the team by about a third
  • Experienced the joy of having a paycheck delayed by two full weeks during the holiday spending season
  • I celebrated my 40th birthday with a trip to Germany and Ireland
  • Was laid off when my employer ran out of money, without warning and with no final paycheck (about this much more could be said, but going to keep it short for now)
  • Dipped my toes into independent consulting for a few months while searching for a new job
  • Was hired by Amazon as a Senior SDE to work on their Last Mile team (the folks that get packages from delivery stations to your doorstep)
  • Transferred to AWS as I mentioned above

Pretty bonkers 18 months, but things are starting to settle down, and I’m eagerly anticipating the new normal of 2020. More to come!