20 Years on AWS and Never Not My Job

I created my first AWS account at 10:31 PM on April 10th, 2006. I had seen the announcement of Amazon S3 and had been thinking vaguely about the problem of secure backups — even though I didn't start Tarsnap until several months later — and the idea of an online storage service appealed to me. The fact that it was a web service made it even more appealing; I had been building web services since 1998, when I decided that coordinating a world-record-setting computation of Pi over HTTP would be easier than doing it over email.

While I created my AWS account because I was interested in Amazon S3, that was not in fact immediately available to me: In the early days of AWS, you had to specifically ask for each new service to be enabled for your account. My new AWS account did come with two services enabled by default, though — Amazon Simple Queue Service, which most people know as "the first AWS service", and Amazon E-Commerce Service, an API which allowed Amazon affiliates to access Amazon.com's product catalogue — which was the real first AWS service, but which most people have never heard of and which has been quietly scrubbed from AWS history.

It didn't take long before I started complaining about things. By this point I was the FreeBSD Security Officer, so my first interest in anything cloud-related was security. AWS requests are signed with API keys providing both authentication and integrity protection — confirming not only that the user was authorized, but also that the request hadn't been tampered with. There is, however, no corresponding signature on AWS responses — and at the time it was still very common to make AWS requests over HTTP rather than HTTPS, so the possibility of response tampering was very real. I don't recall if anyone from Amazon showed any interest when I posted about this on the (long-disappeared) AWS Developer Forums, but I still think it would be a good thing to have: With requests going over TLS it is obviously less critical now, but end-to-end signing is always going to be better than transport-layer security.
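
As a minimal sketch of what I mean by end-to-end signing (illustrative key and messages, not AWS's actual scheme): the same HMAC construction that protects requests could equally protect responses, so a client could detect tampering even over plain HTTP.

```python
import hashlib
import hmac

def sign(key: bytes, message: bytes) -> str:
    # HMAC-SHA256: only a holder of the key can produce a valid tag,
    # and any change to the message invalidates it.
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(key: bytes, message: bytes, tag: str) -> bool:
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(sign(key, message), tag)

key = b"shared-secret"           # illustrative shared secret
request = b"GET /bucket/object"  # request signing is what AWS already does
response = b"200 OK: object bytes..."

# Signing the request authenticates the client and protects the request
# in transit; signing the *response* the same way would let the client
# confirm the reply is genuine and untampered, with no trust in the wire.
req_tag = sign(key, request)
resp_tag = sign(key, response)
assert verify(key, request, req_tag)
assert not verify(key, response + b" tampered", resp_tag)
```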

Of course, as soon as Amazon EC2 launched I had a new target: I wanted to run FreeBSD on it! I reached out to Jeff Barr via his blog and he put me in touch with people inside Amazon, and in early 2007 I had my first Amazon NDA. (Funny story, in 2007 Amazon was still using fax machines — but I didn't have a fax machine, so my first briefing was delayed while I snail-mailed a wet-ink signature down to Seattle.) Among the features I was briefed on was "Custom Kernels"; much like how AWS Lambda works today, Amazon EC2 launched without any "bring your own kernel" support. Obviously, to bring FreeBSD support to EC2 I was going to need to use this functionality, and it launched in November 2007 when Amazon EC2 gained the ability to run Red Hat; soon after that announcement went out, my FreeBSD account was allowlisted for the internal "publish Amazon Kernel Images" API.

But I didn't wait for this functionality to be offered before providing more feedback about Amazon EC2. In March 2007 I expressed concerns to an Amazonian about the security of Xen — it was at the time still quite a new system and Amazon was the first to be deploying it in truly hostile environments — and encouraged them to hire someone to do a thorough security audit of the code. When the Amazonian I was speaking to admitted that they didn't know who to engage for this, I thought about the people I had worked with in my time as FreeBSD Security Officer and recommended Tavis Ormandy to them. Later that year, Tavis was credited with reporting two vulnerabilities in Xen (CVE-2007-1320 and CVE-2007-1321); whether there is any connection between those events, I do not know.

I also mentioned — in fact in one of Jeff Barr's AWS user meetups in Second Life — that I wanted a way for an EC2 instance to be launched with a read-only root disk and a guaranteed state wipe of all memory on reboot, in order to allow an instance to be "reset" into a known-good state; my intended use case for this was building FreeBSD packages, which inherently involves running untrusted (or at least not-very-trusted) code. The initial response from Amazonians was a bit confused (why not just mount the filesystem read-only?) but when I explained that my concern was about defending against attackers who had local kernel exploits, they understood the use case. I was very excited when EC2 Instance Attestation launched 18 years later.

I ended 2007 with a blog post which I was told was quite widely read within Amazon: Amazon, Web Services, and Sesame Street. In that post, I complained about the problem of Eventual Consistency and argued for a marginally stronger model: Eventually Known Consistency, which still takes the "A" route out of the CAP theorem, but exposes enough internal state that users can also get "C" in the happy path. Amazon S3 eventually flipped from being optimized for Availability to being optimized for Consistency (while still having extremely high Availability), and of course DynamoDB is famous for giving users the choice between Eventual or Strongly consistent reads; but I still think the model of Eventually Known Consistency is the better theoretical model even if it is harder for users to reason about.
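
As a toy model of the idea (entirely my own sketch, not how any AWS service is implemented): an Eventually Known Consistent store always answers reads, taking the "A" exit from the CAP theorem, but exposes enough replica state that the caller learns whether the answer is *known* to be consistent — so in the happy path the caller gets "C" as well.

```python
class EKVStore:
    """Toy sketch of Eventually Known Consistency: reads always
    succeed, and additionally report whether the returned value is
    known to match every replica."""

    def __init__(self, n_replicas: int = 3):
        self.replicas = [{} for _ in range(n_replicas)]

    def write(self, key, value, reachable):
        # Only the replicas we can currently reach get the write;
        # the rest will (eventually) catch up.
        for i in reachable:
            self.replicas[i][key] = value

    def read(self, key, replica: int = 0):
        value = self.replicas[replica].get(key)
        # Happy path: all replicas agree, so consistency is *known*.
        known = all(r.get(key) == value for r in self.replicas)
        return value, known

store = EKVStore()
store.write("k", "v1", reachable=[0, 1, 2])
print(store.read("k"))  # ('v1', True): consistent, and known to be
store.write("k", "v2", reachable=[0])   # one replica missed the write
print(store.read("k"))  # ('v2', False): an answer, with a caveat attached
```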

In early 2008, Kip Macy got FreeBSD working on Xen with PAE — while FreeBSD was one of the first operating systems to run on Xen, it didn't support PAE and I was at the time not competent to write such low-level kernel code, so despite being the driving force behind FreeBSD/EC2 efforts I had to rely on more experienced developers to write the kernel code at the time. I was perfectly comfortable with userland code though — so when Amazon sent me internal "AMI tools" code (necessary for using non-public APIs), I spent a couple weeks porting it to run on FreeBSD. Protip: While I'm generally a tools-not-policy guy, if you find yourself writing Ruby scripts which construct and run bash scripts, you might want to reconsider your choice of languages.

Unfortunately even once I got FreeBSD packaged up into an AKI (Amazon Kernel Image) and AMI (Amazon Machine Image) it wouldn't boot in EC2; after exchanging dozens of emails with Cape Town, we determined that this was due to EC2 using Xen 3.0, which had a bug preventing it from supporting recursive page tables — a cute optimization that FreeBSD's VM code used. The problem was fixed in Xen 3.1, but Xen didn't have stable ABIs at that point, so upgrading EC2 to run on Xen 3.1 would have broken existing AMIs; while it was unfortunate for FreeBSD, Amazon made the obvious choice here by sticking with Xen 3.0 in order to support existing customers.

In March 2008, I received one of those emails which only really seems notable in hindsight:

Hi Colin,

This is Matt Garman from the EC2 team at Amazon.  [...]

Matt was inviting me to join the private Alpha of "Elastic Block Storage" (now generally known as "Elastic Block Store" — I'm not sure if Matt got the name wrong or if the name changed). While I was excited about the new functionality, as I explained to Matt, the best time to talk to me about a new service is before building it. I come from a background of mathematics and theory; I can provide far more useful feedback on a design document than I can from alpha-test access.

By April 2008 I had Tarsnap in private beta and I was working on its accounting code — using Amazon SimpleDB as a storage back-end to record usage and account balances. This of course meant that I had to read the API documentation and write code for signing SimpleDB requests — back then it was necessary, but I still write my own AWS interface code rather than using any of their SDKs — and a detail of the signing scheme caught my eye: The canonicalization scheme had collisions. I didn't have any contacts on the SimpleDB team — and Amazon did not at the time have any "report security issues here" contacts — so on May 1st I sent an email to Jeff Barr starting with the line "Could you forward this onto someone from the SimpleDB team?"
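
A simplified reconstruction of the flaw (not the exact SigV1 algorithm, but it captures the problem): parameters were sorted case-insensitively by name and then concatenated with no delimiter between names, values, or pairs — so distinct requests could canonicalize, and therefore sign, identically.

```python
import hashlib
import hmac

def sigv1_string_to_sign(params: dict) -> str:
    # Simplified reconstruction: sort parameters case-insensitively
    # by name, then concatenate each name and value with no
    # delimiter at all.
    return "".join(k + v for k, v in
                   sorted(params.items(), key=lambda kv: kv[0].lower()))

# Nothing separates names from values, or one pair from the next,
# so these two different requests produce the same string to sign:
a = {"foo": "bar", "fooble": "baz"}
b = {"foo": "barfooblebaz"}
assert a != b
assert sigv1_string_to_sign(a) == sigv1_string_to_sign(b)

# ...and consequently the same signature, whatever the secret key is:
secret = b"illustrative-secret"
sig_a = hmac.new(secret, sigv1_string_to_sign(a).encode(), hashlib.sha1).hexdigest()
sig_b = hmac.new(secret, sigv1_string_to_sign(b).encode(), hashlib.sha1).hexdigest()
assert sig_a == sig_b
```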

While the issue wasn't fixed until December, Amazon did a good job of handling this — and stayed in contact with me throughout. They asked me to review their proposed "signature version 2" scheme; fixed their documentation when I pointed out an ambiguity; corrected what I euphemistically referred to as a "very weird design decision"; and allowlisted my account so I could test my code (which I had written against their documentation) against their API back-end. (I wrote more about this in my blog post AWS signature version 1 is insecure.)

In June 2008 I noticed that NextToken values — returned by SimpleDB when a query returns too many results and then passed back to SimpleDB to get more results — were simply base64-encoded serialized Java objects. This was inherently poor security hygiene: Cookies like that should be encrypted (to avoid leaking internal details) and signed (to protect against tampering). I didn't know how robust Amazon's Java object deserializer was, but this seemed like something which could be a problem (and should have been fixed regardless, as a poor design decision even if not exploitable), so I reported it to one of the people I was now in contact with on the SimpleDB team... and heard nothing back. Six months later, when a (perhaps more security-minded) engineer I had been working with on the signing issue said "let me know if you find more security problems; since we don't yet have a security response page up, just email me" I re-reported the same issue and he wrote it up internally. (Even after this I still never received any response, mind you.)
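
To sketch what I mean (my own illustration, with a hypothetical key and payload): a pagination token should at minimum carry an HMAC tag so the server can reject tampered state. A production design would also encrypt the payload — say with AES-GCM — so internal details don't leak; I omit that half here only because the Python stdlib has no AEAD cipher.

```python
import base64
import hashlib
import hmac
import json

KEY = b"server-side-secret"  # hypothetical key; never sent to clients

def issue_token(state: dict) -> str:
    # Serialize the cursor state and append an HMAC tag so any
    # client-side tampering is detectable on the way back in.
    payload = base64.urlsafe_b64encode(json.dumps(state).encode())
    tag = hmac.new(KEY, payload, hashlib.sha256).hexdigest().encode()
    return (payload + b"." + tag).decode()

def redeem_token(token: str) -> dict:
    payload, tag = token.encode().rsplit(b".", 1)
    expected = hmac.new(KEY, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(expected, tag):
        raise ValueError("tampered pagination token")
    return json.loads(base64.urlsafe_b64decode(payload))

token = issue_token({"last_key": "item-1017"})
assert redeem_token(token) == {"last_key": "item-1017"}
```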

Later in 2008, after Tarsnap was in public beta (but before it had much traction) — and after considerable prompting from Jeff Barr — I considered the possibility of working for Amazon. I had a phone interview with Al Vermeulen and slightly too late learned an important lesson: In a 45 minute interview, spending 30 minutes debating the merits of exceptions with an author of The Elements of Java Style is probably not the best use of time. I still firmly believe that I was correct — exceptions are an inherently poor way of handling errors because they make it easier to write bugs which won't be immediately obvious on casual code inspection — but I also know that it isn't necessary to correct everyone who is wrong.

Finally in November 2008, I drove down to Seattle for an AWS Start-up Tour event and met Amazonians in person for the first time; for me, the highlight of the trip was meeting the engineer I had been working with on the request signing vulnerability. We had a lengthy discussion about security, and in particular my desire for constrained AWS access keys: I was concerned about keys granting access to an entire account and the exposure it would create if they were leaked. I argued for cryptographically derived keys (e.g. hashing the master secret with "service=SimpleDB" to get a SimpleDB-only access key) while he preferred a ruleset-based design, which was more flexible but concerned me on grounds of complexity. Ultimately, I was entirely unsurprised when I was invited to join a private beta of IAM in January 2010 — and also somewhat amused when SigV4 launched in 2012 using derived keys.
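
The key derivation SigV4 eventually shipped is public, and can be sketched roughly as follows (illustrative secret; the chaining itself matches the published scheme): each HMAC link narrows the key's scope, so the key that actually signs requests is only good for one date, region, and service — leaking it exposes far less than leaking the long-term secret would.

```python
import hashlib
import hmac

def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def sigv4_signing_key(secret: str, date: str, region: str, service: str) -> bytes:
    # Chain of HMACs, each step scoping the key more narrowly:
    # long-term secret -> date -> region -> service -> signing key.
    k_date = _hmac(("AWS4" + secret).encode(), date)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")

# e.g. a key good only for SimpleDB, in us-east-1, on one day:
key = sigv4_signing_key("illustrative-secret", "20120215", "us-east-1", "sdb")
assert len(key) == 32  # a fresh 256-bit key, derived, never stored
```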

For most of 2009 I was busy growing Tarsnap. The EC2 team set up some Xen 3.1 hosts for testing and by mid-January I was able to launch and SSH into FreeBSD; but since EC2 had no concrete plans to upgrade away from Xen 3.0, the FreeBSD/EC2 project as a whole was still blocked. I did however notice and report a problem with the EC2 firewall: The default ruleset blocked ICMP, including Destination Unreachable (Fragmentation Required) messages — thereby breaking Path MTU Discovery. In December 2009 a manager in EC2 agreed with my proposed solution (adding a rule to the default ruleset) and wrote "I'll let you know as soon as I have an implementation plan in place and am confident it will happen soon". This was ultimately fixed in 2012, soon after I raised the issue publicly.

By the start of 2010, with EC2 still stuck on an ancient version of Xen, I was starting to despair of ever getting FreeBSD running, so I turned to the next best option: NetBSD, which famously runs on anything. It only took me a week — and a few round trip emails to Cape Town to ask for console logs — to create a NetBSD AMI which could boot, mount its root filesystem, configure the network, and launch sshd. While Amazon was a bit wary about me announcing this publicly — they quite reasonably didn't want me to say anything which could be construed as making a promise on their behalf — they agreed that I could discuss the work with developers outside the NDA, and the NetBSD team were excited to hear about the progress... although a bit confused as to why Amazon was still using paravirtualized Xen rather than HVM.

The lack of HVM continued to be a sore point — especially as I knew EC2 provided Xen/HVM for Windows instances — but in July 2010 Amazon launched "Cluster Compute" instances which supported HVM even for "Linux" images. I wasn't able to boot FreeBSD on these immediately — while HVM solved the paging table problem, there were still driver issues to address — but this gave me some hope for progress, so when Matt Garman mentioned they were "thinking about" making HVM more broadly available I immediately wrote back to encourage such thoughts; by this point it was clear that PV was a technological dead end, and I didn't want Amazon to be stuck on the wrong technology for any longer than necessary.

The first real breakthrough however came with the launch of the new t1.micro instance type in September. While it wasn't publicly announced at the time, this new instance family ran on Xen 3.4.2 — which lacked the bug which made it impossible to run FreeBSD. By mid-November I was able to SSH into a FreeBSD/EC2 t1.micro instance, and on December 13, 2010, I announced that FreeBSD was now available for EC2 t1.micro instances.

Once I'd gotten that far, things suddenly got easier. Amazon now had customers using FreeBSD — and they wanted more FreeBSD. A Solutions Architect put me in touch with a FreeBSD user who wanted support for larger instances, and they paid me for the time it took to get FreeBSD working on Cluster Compute instances; then it was pointed out to me that EC2 didn't really know which OS we were running, and I proceeded to make FreeBSD available on all 64-bit instance types via defenestration. Obviously this meant paying the "windows tax" to run FreeBSD — which Amazon was not very happy about! — but even with the added cost it filled an essential customer need. (This hack finally ceased to be necessary in July 2014, when T2 filled out the stable of instance types which supported running "Linux" on HVM.)

2012 was an exciting year. In April, I had the classic greybeard experience of debugging a network fault; I found that a significant proportion of my S3 requests to a particular endpoint were failing with peculiar errors, including SignatureDoesNotMatch failures. These error responses from Amazon S3 helpfully contained the StringToSign, and I could see that these did not match what I was sending to S3. I had enough samples to identify the failure as a "stuck bit"; so I pulled out traceroute — this was pre-SRD so my packets were traversing a consistent path across the datacenter — and then proceeded to send a few million pings to each host along the path. The Amazonians on the AWS Developer Forums were somewhat bemused when I posted to report that a specific router had a hardware failure... and even more surprised when they were able to confirm the failure and replace the faulty hardware a few days later.
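
The diagnosis itself is mechanical once you have the echoed StringToSign values in hand: XOR what you sent against what the service saw, byte by byte. (Illustrative samples below, not my real 2012 data.) A hardware stuck bit shows up as the same single bit flipping at a consistent position, unlike random corruption, which scatters across positions.

```python
def diff_bits(sent: bytes, received: bytes):
    """Return (byte_offset, xor_mask) for every byte where the string
    we sent differs from the StringToSign echoed back in the error."""
    return [(i, a ^ b)
            for i, (a, b) in enumerate(zip(sent, received)) if a != b]

# Two failed requests: in both, exactly one byte differs, by exactly
# one bit, at the same offset. That pattern points at hardware.
samples = [
    (b"PUT\n\nbucket/key1", b"PUT\n\nbucket/kdy1"),
    (b"GET\n\nbucket/key2", b"GET\n\nbucket/kdy2"),
]
diffs = [diff_bits(sent, recv) for sent, recv in samples]
for d in diffs:
    assert len(d) == 1                   # exactly one corrupt byte
    assert bin(d[0][1]).count("1") == 1  # exactly one flipped bit
```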

The highlight of 2012 however was the first re:Invent — which was short of technical content and had a horrible t-shirt-to-suit ratio, but did give me the opportunity to talk to a number of Amazonians face to face. On one memorable occasion, after attending an Intel talk about "virtual machine security" (delivered by a VP who, in response to my questioning, professed to have no knowledge of "side channel attacks" or how they could affect virtual machines) I turned up at the EC2 booth in the expo hall to rant... and by complete accident ended up talking to a Principal Engineer. I talked about my work exploiting HyperThreading to steal RSA keys, and explained that, while the precise exploit I'd found had been patched, I was absolutely certain there were many more ways that information could leak between two threads sharing a core. I ended with a strong recommendation: Based on my expertise in the field I would never run two EC2 instances in parallel on two threads of the same core. Years later, I was told that this recommendation was why so many EC2 instance families jumped straight to two vCPUs ("large") and skipped the "medium" size.

Time passed. With FreeBSD fundamentally working, I turned to the "nice to haves": merging my FreeBSD patches, simplifying the security update path (including automatically installing updates on first boot), and resizing the root filesystem on first boot. In April 2015, I finished integrating the FreeBSD/EC2 AMI build process into the FreeBSD src tree and handed off image builds to the FreeBSD release engineering team — moving FreeBSD/EC2 across the symbolic threshold from a "Colin" project to "official FreeBSD". I was still the de facto owner of the platform, mind you — but at least I wasn't responsible for running all of the builds.

In October 2016, I took a closer look at IAM Roles for Amazon EC2, which had launched in mid-2012. The more I thought about it, the more concerned I got; exposing credentials via the IMDS — an interface which runs over unauthenticated HTTP and which warned in its documentation against storing "sensitive data, such as passwords" — seemed like a recipe for accidental foot-shooting. I wrote a blog post "EC2's most dangerous feature" raising this concern (and others, such as overly broad IAM policies), but saw no response from Amazon... that is, not until July 2019, when Capital One was breached by exploiting the precise risk I had described, resulting in 106 million customers' information being stolen. In November 2019, I had a phone call with an Amazon engineer to discuss their plans for addressing the issue, and two weeks later, IMDSv2 launched — a useful improvement (especially given the urgency after the Capital One breach) but in my view just a mitigation of one particular exploit path rather than addressing the fundamental problem that credentials were being exposed via an interface which was entirely unsuitable for that purpose.

In May 2019, I was invited to join the AWS Heroes program, which recognizes non-Amazonians who make significant contributions to AWS. (The running joke among Heroes is that a Hero is someone who works for Amazon but doesn't get paid by Amazon.) The program is heavily weighted towards people who help developers learn how to use AWS (via blog posts, YouTube videos, workshops, et cetera), so I was something of an outlier; indeed, I was told that when I was nominated they weren't quite sure what to make of me, but since I had been nominated by a Distinguished Engineer and a Senior Principal Engineer, they felt they couldn't say no.

In March 2021, EC2 added support for booting x86 instances using UEFI; a "BootMode" parameter could be specified while registering an image to declare whether it should be booted using legacy BIOS or modern UEFI. For FreeBSD this was great news: Switching to UEFI mode dramatically sped up the boot process — performing loader I/O in 16-bit mode required bouncing data through a small buffer and cost us an extra 7 seconds of boot time. The only problem was that while all x86 instance types supported legacy BIOS booting, not all instance types supported UEFI — so I had to decide whether to degrade the experience for a small number of users to provide a significant speedup to most users. In June, I requested a BootMode=polyglot setting which would indicate that the image was able to boot either way (which, in fact, FreeBSD images already could) and would instruct EC2 to pick the appropriate boot mode based on the instance. In March 2023, this landed as "BootMode=uefi-preferred", which I had to admit was a friendlier, albeit less geeky, name for it.

One of the most important things about the AWS Heroes program is the briefings Heroes get, especially at the annual "Heroes Summit". In August 2023, we had a presentation about Seekable OCI, and looking at the design I said to myself "hold on, they're missing something here": The speaker made security claims which were true under most circumstances, but did not hold in one particular use case. I wrote to the AWS Security team (unlike in 2008, there was now a well-staffed team with clear instructions on how to get in touch) saying, in part, "I'm not sure if this is them not understanding about [type of attack] or if it's just an issue of confused marketing, but I feel like someone needs to have a conversation with them". My sense was that this could probably be addressed with clear documentation saying "don't do this really weird thing which you probably weren't planning on doing anyway", but since I wasn't particularly familiar with the service I didn't want to make assumptions about how it was being used. After a few email round trips I was assured that the problem had been corrected internally and that the fix would be merged to the public GitHub repository soon. I accepted these assurances — over the years I've developed a good relationship with AWS Security people and trust them to handle such matters — and put it out of my mind.

In December 2023, however, I was talking to some Amazonians at re:Invent and was reminded of the issue. I hadn't heard anything further, which surprised me given that fixing this in code (rather than in documentation) would be fairly intrusive. I asked them to check up on the issue and they promised to report back to me in January, but they never did, and again I stopped thinking about it. The following re:Invent though, in December 2024, I met a Principal Engineer working on OCI and mentioned the issue to him — "hey, whatever happened with this issue?" — but he wasn't aware of it. In January 2025, I raised it again with a Security Engineer; he found the original ticket from 2023 and talked to the team, who pointed at a git commit which they thought fixed it.

The issue had not, in fact, been fixed: The 2023 commit prevented the problem from being triggered by accidental data corruption, but did nothing to prevent a deliberate attack. Once I pointed this out, things got moving quickly; I had a Zoom call with the engineering team a few days later, and by the end of February the problematic feature had been disabled for most customers pending a "major revision".

The largest change in my 20 years of working with Amazon started out as something entirely internal to FreeBSD. In September 2020, the FreeBSD Release Engineering Lead, Glen Barber, asked me if I could take on the role of Deputy Release Engineer — in other words, Hot Spare Release Engineer. As the owner of the FreeBSD/EC2 platform, I had been working with the Release Engineering team for many years, and Glen felt that I was the ideal candidate: reliable, trusted within the project, and familiar enough with release engineering processes to take over if he should happen to "get hit by a bus". While I made a point of learning as much as I could about how Glen managed FreeBSD releases, like most hot spares I never expected to be promoted.

Unfortunately, in late 2022 Glen was hospitalized with pneumonia, and while he recovered enough to leave the hospital a few months later, it became clear that the long-term effects of his hospitalization made it inadvisable for him to continue as release engineer; so on November 17, 2023, Glen decided to step back from the role and I took over as FreeBSD Release Engineering Lead. I like to think that I've done a good job since then — running weekly snapshot builds, tightening schedules, establishing a predictable and more rapid release cadence, and managing four releases a year — but my volunteer hours weren't unlimited, and it became clear that my release engineering commitments were making it impossible to keep up with EC2 support as well as I would have liked.

In April 2024 I confided in an Amazonian that I was "not really doing a good job of owning FreeBSD/EC2 right now" and asked if he could find some funding to support my work, on the theory that at a certain point time and dollars are fungible. He set to work, and within a couple weeks the core details had been sorted out; I received sponsorship from Amazon via GitHub Sponsors for 10 hours per week for a year and addressed a large number of outstanding issues. After a six month hiatus — most of which I spent working full time, unpaid, on FreeBSD 15.0 release engineering — I've now started a second 12-month term of sponsorship.

While I like to think that I've made important contributions to AWS over the past 20 years, it's important to note that this is by no means my work alone. I've had to remind Amazonians on occasion that I do not have direct access to internal AWS systems, but several Amazonians have stepped in as "remote hands" to file tickets, find internal contacts, inspect API logs, and obtain technical documentation for me. Even when people — including very senior engineers — have explicitly offered to help, I'm conscious of their time and call upon them as little as I can; but the fact is that I would not have been able to do even a fraction of what I've accomplished without their help.

Posted at 2026-04-11 05:31 | Permanent link | Comments