Author: Jud

Technologist interested in building both systems and organizations that are secure, scalable, cost-effective, and most of all, good for humanity.
Resolution Recap

Relaxing on a much-needed holiday has given me time to wrap up a couple books, bringing this year’s reading to a close (I’ve also finally started Alexander Hamilton, but no way I’m finishing it on my return flight; it’s good but long).

Per my meta-resolution, I aimed to read 44 books this year. I’m finishing at 48, though a few only barely qualify. Here’s this year’s 5-star selections:

How did I do in my objective to read more non-male, non-white authors? The goal was 32 books, and I finished with 14 non-male, 15 non-white, and 4 both, for a total of 33. Mission accomplished? Quantitatively yes, but qualitatively, the mission of broadening horizons is never done; this will continue to be a focus area.

What will I aim for next year (besides the obligatory quantity)? For one, I intend to read more history and biographies. Given my job, I’m also going to do more reading on politics and government. Should be fun!

Evolution

(Editor’s note: the past two posts, Mother Of Invention and Edge Case, and this one form a trilogy of sorts, all related to a particular project I’ve been digging into).

When I first needed a way to get access to AWS from a non-cloud-based computer, I implemented 3 options: hard-coded IAM user credentials (generally bad), user-based Cognito (okay but not super scalable), and X.509 via IoT (good technology, but cumbersome to set up).

This week I had a similar authentication need within an on-premises cluster, and was happy for the chance to learn the most up-to-date approach: IAM Roles Anywhere. I really appreciate the authors of these two blog posts who captured the step-by-step quite a bit better than the official documentation:

I used my own certificate authority because AWS Private CA is too dang expensive; $400 a month doesn’t grow on trees, ya know? Here’s the bash script to create the root CA:

mkdir -p root-ca/certs    # New Certificates issued are stored here
mkdir -p root-ca/db       # Openssl managed database
mkdir -p root-ca/private  # Private key dir for the CA

chmod 700 root-ca/private
touch root-ca/db/index

# Give our root-ca a unique identifier
openssl rand -hex 16 > root-ca/db/serial

# Create the certificate signing request
openssl req -new -config root-ca.conf -out root-ca.csr -keyout root-ca/private/root-ca.key

# Sign our request
openssl ca -selfsign -config root-ca.conf -in root-ca.csr -out root-ca.crt -extensions ca_ext

# Print out information about the created cert
openssl x509 -in root-ca.crt -text -noout

The certificate produced above (root-ca.crt) is what’s used to create the Trust Anchor. Then here’s a script to create a certificate for the process that will be authenticating:

# Provide a name for the output files as a parameter (fail early if it is missing)
entity_name="${1:?usage: pass the entity name as the first argument}"

# Make your private key specific to your end entity
openssl genpkey -out "$entity_name.key" -algorithm RSA -pkeyopt rsa_keygen_bits:2048

# Using your newly generated private key make a certificate signing request
openssl req -new -key "$entity_name.key" -out "$entity_name.csr"

# Print out information about the created request
openssl req -text -noout -verify -in "$entity_name.csr"

# Sign the above request with the root CA
openssl ca -config root-ca.conf -in "$entity_name.csr" -out "$entity_name.crt" -extensions client_ext

# Print out information about the created cert
openssl x509 -in "$entity_name.crt" -text -noout

Special thanks also to the creator of iam-rolesanywhere-session, a Python package that makes it easy to create a refreshable boto3 Session with IAM Roles Anywhere. Seriously, could it be easier?

from iam_rolesanywhere_session import IAMRolesAnywhereSession

roles_anywhere_session = IAMRolesAnywhereSession(
    trust_anchor_arn=my_trust_anchor_arn,
    profile_arn=my_profile_arn,
    role_arn=my_role_arn,
    certificate='my_certificate.crt',
    private_key='my_certificate.key',
)

boto3_session = roles_anywhere_session.get_session()
s3_client = boto3_session.client('s3')
print(s3_client.list_buckets())

This was a good reminder that technology marches ever onward, and what made sense yesterday might not be the best approach today. It was also a reminder that, like DNS, TLS and PKI are some of those things that every technologist ought to know (I’ve queued up this book in my Goodreads for a deeper dive). This isn’t the first time I’ve had to write code to create certificates, but it’s now the last, because I’ll have this reference post plus its associated code repository. And so will you.

Edge Case

I was today years old when I learned that an object key in S3 can end with a slash. Why might someone use such a strange key, you ask? Well, I was working today on a static website served by CloudFront that needs to serve a particular JSON document at /foo/bar/ (note the trailing slash). One option was to create the corresponding object at /foo/bar and then use a CloudFront function to remove the trailing slash. But that adds complexity, cost, and a tiny bit of latency. Could there be a better way?

Indeed there was! Create the object with a key of foo/bar/ (trailing slash included) and Bob’s your uncle. Admittedly it’s a bit tricky to create an object with such a key. The console won’t do it, and neither will the aws CLI (at least not without getting fiddly with encoding, and no one’s got time for that). But boto3 to the rescue: it’ll happily do it.
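
For reference, here's a minimal sketch of what that boto3 upload could look like. The helper name put_json_at_slash_key, the bucket name, and the sample document are all hypothetical; put_object with Bucket, Key, Body, and ContentType is the standard boto3 S3 call. The key is written without a leading slash, on the assumption that CloudFront maps the request path /foo/bar/ to the S3 key foo/bar/.

```python
import json


def put_json_at_slash_key(s3_client, bucket, key, document):
    """Upload `document` as JSON under an S3 key that ends with a slash.

    `s3_client` is anything exposing boto3's S3 `put_object` method.
    """
    if not key.endswith("/"):
        raise ValueError(f"expected a key ending in '/', got {key!r}")
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(document).encode("utf-8"),
        ContentType="application/json",  # lets the browser treat it as JSON
    )
    return key
```

With a real client the call would be something like put_json_at_slash_key(boto3.client("s3"), "my-static-site-bucket", "foo/bar/", {"hello": "world"}), where the bucket name is a placeholder.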

Obligatory bit of additional knowledge: know your slashes.

Mother Of Invention

More often than not, the tool you need to solve a particular programming problem has already been created and is easily discoverable via PyPI, npm, etc. I rejoice in these times.

Sometimes, however, the tool you need does not exist. Yet I still rejoice in these times, because they present an opportunity to create a new thing and share it with the world.

I’m thus here to announce sql-to-odata, a Python package containing tools to facilitate adding an OData interface in front of a SQL database. It’s limited right now to my specific use case (creating static extracts from SQLite), but if there ends up being broader interest, who knows what it might become.

Little Things

One of my favorite tools is ngrok (pronounced en-grok, presumably referencing Stranger in a Strange Land, a book I read as a freshman in high school when I was far too young to appreciate it). If you need to get a locally-running service on the Internet, ngrok can do it in seconds with a single command. I use it all the time when experimenting with and debugging APIs, such as this weekend’s foray into LangChain.

Supposedly it can do a bang-up job of fronting production services also, but I’ve never tried it for that use case. Perhaps someday? In any case, I’m truly grateful it exists.

Drama

I don’t pretend to know everything that’s going on over at OpenAI, nor all the eventual lessons that will come from it, but unlike most tech-elite brouhahas this one might actually matter, as there’s a strong possibility taming artificial intelligence is “the final boss of humanity” (as one of the players in the ongoing saga has said).

Played around some yesterday with ChatGPT’s voice chats, and can totally see how we’re not far from deepening the emotional attachments to our devices. Her has never felt more prescient or likely. It’s mandatory viewing.

Know Thyself

It’s inevitable that over time I’m going to repeat myself here (including post titles). When I’m aware of potential similarities, I try to embed links back to those prior posts. A while back I noted an idea of building a thematic map of all my posts, but I wasn’t sure how to go about doing so. Now that I’ve learned some about embeddings, it was time to try my hand at it.

You can find the code I wrote to accomplish all of this on GitHub. I was inspired by the clustering section of the OpenAI cookbook, but took considerable liberties rewriting the code there, as I’m not a huge fan of typical data science code examples (they’re suitable for notebooks, perhaps, but rarely include meaningful names or a breakdown into logical functions).

First, I had to actually fetch all the post content. I briefly toyed with the WordPress REST API, but couldn’t figure out how to enable it. No worries, though, RSS to the rescue! Unfortunately it’s XML, and I fiddled a bit with using lxml to parse it, but then stumbled upon feedparser, which abstracted away the details. Awesome!

Since it’s the de facto standard for Python data science, I loaded the posts into a pandas DataFrame. I’m still working on my fluency with pandas, numpy, scikit-learn, and matplotlib, amongst other common tools, and I’m grateful for any opportunity to get their power under my fingers.

To compute embeddings for each post, I used the OpenAI API with the text-embedding-ada-002 model. It’s not good to store API keys in code; for local scripts I store all mine in the macOS keychain using keyring. Nice and easy.

Since OpenAI usage costs money, I don’t want to repeatedly call the API with identical inputs if I don’t have to. That’s where cachier comes in (a library I help maintain) so results can be transparently saved to disk for subsequent use.
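
To make the idea concrete, here's a hand-rolled sketch of caching results on disk keyed by the input text. To be clear, this is not cachier's API or implementation (cachier does the equivalent transparently with a decorator); the names cached_embedding, embed_fn, and cache_dir are hypothetical, and embed_fn stands in for whatever function makes the paid API call.

```python
import hashlib
import json
from pathlib import Path


def cached_embedding(text, embed_fn, cache_dir):
    """Return embed_fn(text), reusing a JSON file on disk for repeated inputs."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on a hash of the input so identical text maps to one file.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    embedding = list(embed_fn(text))  # the paid API call would happen here
    cache_file.write_text(json.dumps(embedding))
    return embedding
```

The second call with the same text never reaches embed_fn, which is the whole point when each call costs money.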

Once I had the embeddings, I used K-means clustering to group posts into common themes, and then t-SNE to reduce the dimensionality and produce a visualization of the clusters. To produce a summary of the theme of each cluster I took a sample of posts from each and shoved them into GPT-4.
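
The clustering and projection steps can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the repository code: the function name cluster_and_project, the seed, and the perplexity value are arbitrary choices, and the input is synthetic random data standing in for real post embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE


def cluster_and_project(embeddings, n_clusters, seed=42):
    """Return (cluster label per row, 2-D t-SNE coordinates per row)."""
    matrix = np.asarray(embeddings, dtype=float)
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(matrix)
    # t-SNE requires perplexity to be smaller than the number of samples.
    perplexity = min(15, len(matrix) - 1)
    coords = TSNE(
        n_components=2, perplexity=perplexity, init="random", random_state=seed
    ).fit_transform(matrix)
    return labels, coords


# Synthetic stand-in for embeddings: two tight blobs of 30 points in 32 dimensions.
rng = np.random.default_rng(0)
fake_embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(30, 32)),
    rng.normal(loc=1.0, scale=0.1, size=(30, 32)),
])
labels, coords = cluster_and_project(fake_embeddings, n_clusters=2)
print(np.bincount(labels), coords.shape)
```

From there, a scatter plot of coords colored by labels gives the kind of cluster picture described here. Note that K-means runs on the full-dimensional embeddings; t-SNE is only used to get something drawable in two dimensions.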

To start I tried using 2 clusters, which produced the following distribution:

Pretty interesting that there’s a natural grouping going on. Here’s the themes and sample posts:

Blue Posts

The theme of these posts is the author’s personal and professional experiences with technology, education, open-source contributions, ethical considerations, and the impact of travel and diversity on personal growth and the tech industry.

Orange Posts

The theme of these posts revolves around the reflections, experiences, and insights of a software developer navigating the challenges and nuances of the tech industry.

Of course I had to try with a variety of different numbers of clusters, so I reran with 3, 5, and 8 clusters as well (anyone see a pattern there?)

Of those graphs, to my eye the 5 cluster one seemed the best balance between having enough distinct themes without starting to look too arbitrary. Here’s the summarizations for it:

Blue Posts

The theme of these posts is the author’s personal and professional experiences, challenges, and insights related to technology, software development, and working within the tech industry.

Orange Posts

The theme of these posts revolves around the challenges, insights, and anecdotes from the world of software development and engineering management.

Green Posts

The theme of these posts is the multifaceted nature of software development, encompassing the importance of maintaining code quality, the broad skill set required for effective development, and the challenges and responsibilities that come with the profession.

Red Posts

The theme of these posts is the reflection on and sharing of personal experiences, insights, and best practices related to software development, including contributing to communities, understanding abstractions, effective communication, and professional growth within the tech industry.

Purple Posts

The theme of these posts is the author’s personal reflections on their experiences, interests, and philosophies related to their career, hobbies, and life choices.

What’s next? I’d like a quantitative way to evaluate the quality of the theme clustering and summaries produced. There’s a lot of non-determinism in the functions used here, and with some twiddling I bet I can produce improved results. I’ve got some ideas, but will save them for a future post.

Taste The Rainbow

I’m sure there’s research out there that says people do better work when they’re happy. But anecdotally, it’s an obvious truth. Of course there are limits (“fun with respect to work” will almost always be “work with respect to fun”). But in general, fostering a positive work environment and encouraging employees to take care of themselves is good business.

Last week a colleague of mine was revising the spreadsheet we use for high-level estimation, and as part of her adjustments added a few splashes of color. The highlights had a functional purpose, yes, but they were also simply more pleasant to look at. At a subconscious level, it made me want to work on the spreadsheet.

Isn’t that nice? I suppose the 49″ ultrawide monitor doesn’t hurt either. 😛

Another obvious example of this phenomenon is font quality and syntax highlighting. Take a look at the following “identical” code samples; which one would you rather work with?

Literally as I was drafting this blog I learned about Monaspace. Taking code aesthetics to the next level, I dig it. Describing the process of adjusting glyph widths as “texture healing” is an especially humanizing touch. Happiness matters!

Godwin’s Law Redux

As a tech discussion grows longer, the probability of a mention of Generative AI approaches one. It definitely happened at today’s 4S Tech meetup; we didn’t even make it all the way through the introductions.

Additional common topics: biometrics, productivity hacking, ways to get funding, something someone heard on a podcast.