<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Thinking through the Cloud]]></title><description><![CDATA[Taking cloud professionalism to new rigor and depths]]></description><link>https://tech.ngperrin.com</link><image><url>https://substackcdn.com/image/fetch/$s_!hJkW!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5ce81b0-f183-4bbd-a2b4-78bc2ac0c6ac_144x144.png</url><title>Thinking through the Cloud</title><link>https://tech.ngperrin.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 30 Apr 2026 15:16:20 GMT</lastBuildDate><atom:link href="https://tech.ngperrin.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Nathaniel Perrin]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[ngperrintech@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[ngperrintech@substack.com]]></itunes:email><itunes:name><![CDATA[Thinking through the Cloud]]></itunes:name></itunes:owner><itunes:author><![CDATA[Thinking through the Cloud]]></itunes:author><googleplay:owner><![CDATA[ngperrintech@substack.com]]></googleplay:owner><googleplay:email><![CDATA[ngperrintech@substack.com]]></googleplay:email><googleplay:author><![CDATA[Thinking through the Cloud]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Whatever happened to GraphQL?]]></title><description><![CDATA[The ascendant, next-gen API that never quite took over]]></description><link>https://tech.ngperrin.com/p/whatever-happened-to-graphql</link><guid 
isPermaLink="false">https://tech.ngperrin.com/p/whatever-happened-to-graphql</guid><dc:creator><![CDATA[Thinking through the Cloud]]></dc:creator><pubDate>Wed, 21 Jan 2026 23:01:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hJkW!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5ce81b0-f183-4bbd-a2b4-78bc2ac0c6ac_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For those who remember, there was a period, roughly between 2018 and 2021, when GraphQL was all the rage, so to speak.</p><p>It was a hot item in the technology world at a time when all the market gaps SaaS could theoretically fill were being plugged as rapidly as possible, which in turn incentivized heavy investment in simplified, standardized tooling to drive the time-to-market metric as low as possible.</p><p>While some technology strategies veered off into no-code solutions, others pushed forward in the evolution of software engineering toward reusable patterns and frameworks that reduced the need for every startup to reinvent the wheel on common technology implementation questions.</p><p>React offered a relatively small but powerful abstraction that lowered the learning curve enough for newcomers to build web apps all on their own.</p><p>Coding bootcamps became a supposed one-way ticket into the lavish payscale of technology jobs, with many graduates touting their credentials as MERN or PERN full-stack engineers after a month or two of training.</p><p>Next.js built on React to offer even further abstraction, making rapid prototyping even more effortless. Few acknowledge that the mixed-to-relative success of LLMs at &#8220;vibe coding&#8221; web applications is due in large part to how simple this has become through platforms like Vercel and Next.js.</p><p>It is easy to tie together the strands of success in the train of evolution.</p><p>But GraphQL is something else.</p><p>During the peak of its hype train, GraphQL was going to displace RESTful API development as the universal solution. REST APIs were going to become obsolete. GraphQL would revolutionize the whole philosophy of middleware. 
At least by some accounts.</p><p>But this did not happen.</p><p>It has been adopted at scale as a solution some of the time, but certainly not all of the time.</p><p>There was never the kind of decisive push toward universal integration that generative AI has seen in the years since the GraphQL craze.</p><p>Why is that?</p><p>Let us find out. We will barely touch the nuts and bolts of GraphQL here, looking instead at its history and reception.</p><h1>What is GraphQL?</h1><p>No proper post-mortem can begin without an account of the history and origins of its subject. </p><p>Like React, GraphQL is a baby of Facebook/Meta. Also like React, it was born to solve the same penumbra of problems around standardization and consistency.</p><p>In 2011 and 2012, smartphone adoption was close to solidified, and the world of mobile apps was about to explode. </p><p>Facebook was consolidating its leadership over the social media market on desktop and needed a firm way to entrench itself in the mobile app experience as well.</p><p>How does one build an application that can interface with Facebook&#8217;s vast troves of data from the limited power and performance of an early smartphone?</p><p>What React was on the UI side, GraphQL was on the query-language side.</p><p>As described in <a href="https://engineering.fb.com/2015/09/14/core-infra/graphql-a-data-query-language/">GraphQL&#8217;s 2015 prospectus when it became open source</a>, the Facebook teams needed a consolidated, centralized interface for delivering features like News Feed without the complexity of data-shape transformation otherwise required between the client and server.</p><p>This document details several advantages of the GraphQL way:</p><ul><li><p>Data shape: The query response mirrors the shape of the request</p></li><li><p>Hierarchy: Embedded objects in one single query without multiple queries</p></li><li><p>Strongly typed: Each level of the GraphQL query has strong typing to validate query 
structure and types before execution.</p></li><li><p>Protocol: It is a stateless middleware, not a storage backend to manage</p></li><li><p>Introspective: It can be queried about itself rather than needing to go review a spec elsewhere</p></li><li><p>Version-light: The returned data&#8217;s shape is determined by the client, not the server</p></li></ul><p>Given these points, those with enough REST API heartburn could see the appeal of such a revolutionary approach, especially when it brands itself as <a href="https://graphql.org/">&#8220;product-centric&#8221;</a>, highly suitable for the rise of rapid SaaS development.</p><h1>But Why GraphQL?</h1><p>The developers and maintainers had their own reasons for suggesting that GraphQL was &#8220;better&#8221;, but let us dive into some points from the broader online discourse about this new technology, particularly from its apologists.</p><h2>1. RESTful Is Not Modern</h2><p>This is a subjective claim to be sure, but it is one that has been asserted by those who would rather see GraphQL displace REST.</p><p>There are several common areas of complaint here.</p><ul><li><p>REST produces a sprawl of endpoints at scale. Each new capability requires a new endpoint. Every new resource requires a whole new set of endpoints for all its associated verbs. The growing surface area introduces overhead, risk of inconsistencies, and dependencies between endpoints.</p></li><li><p>REST couples resources with teams, which does not align with modern enterprise domain-driven design. 
Maintaining a comprehensive enterprise REST API requires complete and total coordination between teams on resources, including cross-domain calls and versioning. At least, so critics claim.</p></li><li><p>RESTful routes mirror the page-based implementation of early web applications, not modern single-page applications, which are component-driven. This leads REST to consistent over- and under-fetching of data, with the client jumping through hoops to get what it needs from the server.</p></li><li><p>REST does not offer type safety for API queries. If we have TypeScript for JavaScript and ORMs for database transactions, the lack of type safety is a significant weakness.</p></li></ul><p>More could be said here on the nature of these issues, including a defense of REST and how it has evolved in recent years, but our main focus here is the sentiments which drove GraphQL proselytization.</p><p>At the very least, many of these are certainly understandable complaints for those trying to build with REST in a modern world, and more than sufficient justification to seek to build something better.</p><h2>2. Strong Contracts</h2><p>If there is one positive thing which GraphQL advocates unanimously praise about this framework, it is the type safety.</p><p>In the world of client-server calls, strong contracts and type safety offer guarantees that eliminate a whole range of errors around what can quite simply be labeled &#8220;miscommunication&#8221; between two different systems.</p><p>The contract guarantees that a server will receive an object of a valid shape and type (enforced from the client side), while the client will receive a valid response object from the server (enforced before the response is sent).</p><p>Additionally, GraphQL offers resolvers with the intelligence to filter and pare down fields to just what is needed, a far cry from the data dumps one could easily find in a REST JSON response body. 
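</p><p>As a toy sketch of that idea (not a real GraphQL implementation; the types and helper below are purely illustrative), a resolver that honors the client&#8217;s field selection returns a response whose shape mirrors the request:</p>

```typescript
// Illustrative sketch only: a hand-rolled, GraphQL-style field selection.
// The client names exactly the fields it wants, and the "resolver"
// returns those fields and nothing more.
type User = { id: string; name: string; email: string; createdAt: string };

const usersTable: Record<string, User> = {
  "1": { id: "1", name: "Ada", email: "ada@example.com", createdAt: "2015-09-14" },
};

function resolveUser(id: string, fields: (keyof User)[]): Partial<User> {
  const row = usersTable[id];
  const selected: Partial<User> = {};
  for (const f of fields) selected[f] = row[f]; // copy only the requested fields
  return selected;
}

// The response mirrors the request: id and name, with no email or createdAt.
console.log(resolveUser("1", ["id", "name"]));
```

<p>A real GraphQL server does this per field through resolvers and a typed schema; the point here is only the shape-mirroring contract.</p><p>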
Where REST is verbose, GraphQL is clean and minimal.</p><p>Furthermore, GraphQL decouples the API contract from the backend implementation without exposing an entire DB implementation or schema to a client, a common antipattern that can emerge when DB schemas are treated as an API interface.</p><p>Critics would of course counter that OpenAPI offers the same kind of contract guarantees, or that one could opt for less over-engineered solutions for this kind of type safety, but this is still one broadly defended advantage of GraphQL.</p><h2>3. Decentralized Ownership</h2><p>As noted above, scaling RESTful APIs is not necessarily easy, considering how resources and routes are coupled in awkward ways with their backend teams.</p><p>GraphQL offers far more flexibility and clarity for delegating subgraphs, or namespaces of the graph, to backend teams, so that different departments or teams can still expose their data through a shared enterprise GraphQL endpoint without the clumsiness such schema maintenance would otherwise involve.</p><p>Put simply, the subgraph is far more suited to microservices, as teams seek to evolve away from monolithic implementations.</p><h2>4. 
Cloud Infrastructure</h2><p>Beyond decoupling subgraphs from a single maintainer, GraphQL&#8217;s philosophy of encapsulation behind one single API surface blends very well with modern cloud infrastructure.</p><p>And because it is decoupled from the backend implementation, it can simultaneously route requests to various microservice backends, including serverless functions such as AWS Lambda, with far more sophistication and flexibility than traditional API implementations.</p><p>This, in addition to managed CSP offerings for GraphQL such as AWS AppSync, or simply throwing together a containerized Apollo server, is cited as a reason many prefer GraphQL.</p><p>On the flip side, this advantage is somewhat neutralized by the fact that many serverless or cloud-native frameworks exist around the RESTful model, the Serverless Framework and AWS SAM, for example. When GraphQL developers cite this advantage, it is not clear what these other frameworks lack on the cloud infrastructure front.</p><h2>5. General Developer Experience</h2><p>A few other miscellaneous points in favor of GraphQL can be filed under DevEx.</p><p>Schema evolution is much more flexible and clean, particularly in that GraphQL avoids versioning as much as possible. Versioning, of course, is required to account for the limited controls around API endpoints, particularly when contracts are weak.</p><p>The implementation philosophy behind GraphQL focuses on returning only the data that is explicitly requested. Fields resolve the way the client needs them to, without jumping through the hoops of server-side requirements in the way that REST implementations demand.</p><p>Additionally, data can be composed with considerable ease through the graph approach, which makes data population in UI flows much cleaner and easier to develop. 
These can be whipped together in a single query rather than needing to juggle a set of queries and/or transforms based on the server-side logic.</p><p>Colocation of data is a developer-experience positive that is hard to overlook once one has experienced it &#8220;done right&#8221;.</p><p>This all ties into a modern web-app approach which is component-driven at its very root. It integrates natively with much of the UI web-app ecosystem, for example through its helpers for pagination, suspense/defer fragments, and even an ESLint plugin to offer guardrails.</p><p>Much of the &#8220;it just works&#8221; ethos that drives modern software development finds a well-matched companion in the world of GraphQL.</p><h1>But then&#8230; Why Not GraphQL?</h1><p>We could go much deeper into the terminology and protocols and patterns of GraphQL. It was difficult enough to avoid this in speaking of its advantages, but that is not the aim of this piece, and there are better places to consider its technical underpinnings and internal dialogue more in depth.</p><p>Again, we are considering the broad sentiments which led to its rise and normalization, rather than its anticipated universalization.</p><p>So while we have seen many of the reasons put forth in favor of GraphQL adoption, particularly in that hype phase, I would like to present some of the arguments made against GraphQL adoption in the years since.</p><h2>1. 
Overengineering</h2><p>The risk with any new technology that purports to be a solution is the possibility that the cognitive load or overhead it takes for (1) humans to learn it as practitioners, (2) it to be integrated within a larger technical ecosystem, or (3) LLMs to interface with it relatively intelligently can outweigh the benefits it ostensibly brings.</p><p>The risk of overengineering, more or less.</p><p>And this seems to be one of the most common complaints in the developer community around GraphQL.</p><p>The advantages that GraphQL offers in terms of type safety and strong contracts, many REST developers claim they can easily match through the allegedly more minimal implementation of OpenAPI and/or TypeScript.</p><p>These previous downsides are purportedly solved through minor alterations or layering without needing the full paradigm shift of GraphQL.</p><p>Of course, many GraphQL advocates are equally convinced that OpenAPI is hardly a simpler or less verbose alternative.</p><p>Whether one goes implementation-first or specification-first, it seems REST offers its own modern suite of solutions that does not require the GraphQL jump.</p><p>These criticisms extend also to AuthZ overhead, schema evolution, and more. But this is the fundamental principle.</p><h2>2. Security Considerations</h2><p>Aside from what are arguably more subjective contentions around GraphQL as an overengineered solution, there are more objective concerns about it from a security perspective.</p><p>By minimizing client overhead and shifting logic to server-side orchestration, and by allowing a single logical endpoint to fetch any number of permutations of data and data shapes from various backends, the question becomes how one can properly enforce AuthZ (i.e., that you are authorized to receive the data you are asking about).</p><p>REST APIs have generally solved for this through integration with OAuth, identity tokens, access tokens, scopes, claims, etc. 
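</p><p>That REST-style approach can be sketched, very roughly, as a per-route scope check (the routes and scope names below are purely illustrative, not a real API):</p>

```typescript
// Illustrative sketch of endpoint-level AuthZ in the REST style: each
// route declares a required OAuth scope, checked before the handler runs.
// Route and scope names are hypothetical.
const requiredScopes: Record<string, string> = {
  "GET /invoices": "invoices:read",
  "POST /invoices": "invoices:write",
};

// In practice, tokenScopes would come from a validated OAuth access token.
function isAuthorized(route: string, tokenScopes: string[]): boolean {
  const required = requiredScopes[route];
  return required !== undefined && tokenScopes.includes(required);
}

console.log(isAuthorized("GET /invoices", ["invoices:read"]));  // true
console.log(isAuthorized("POST /invoices", ["invoices:read"])); // false
```

<p>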
These are applied at the endpoint level, though.</p><p>If GraphQL boasts a single endpoint per environment, how can AuthZ be engineered or implemented within a single endpoint?</p><p>Additionally, the hyper-flexible nature of GraphQL queries leads to a variety of security vulnerabilities, not least of all disclosing broad swathes of one&#8217;s schema to potential bad actors if they simply invoke the <a href="https://owasp.org/www-project-web-security-testing-guide/v42/4-Web_Application_Security_Testing/12-API_Testing/01-Testing_GraphQL#introspection-queries">Introspection Query</a>.</p><p>This can be further exacerbated by malicious actors who could easily execute a DoS attack by hitting the GraphQL server with expensive queries, including cyclical or recursive queries, depending on the complexity and internal relationships of the data shape.</p><p>Which leads us into the next issue.</p><h2>3. Performance and Rate Limiting</h2><p>Depending on how deeply nested the data shape is, overly broad GraphQL queries can be not only computationally expensive to perform but also high in latency, as they resolve both horizontally across various backend data providers (hitting each of them individually, or perhaps even redundantly) and vertically through increasingly deeply nested layers of data.</p><p>This is further compounded by trying to layer AuthZ on top of deeply nested queries.</p><p>The computational lift of not only resolving all data fields but also determining AuthZ for each individual field introduces another N+1 problem on the performance side.</p><p>So ironically, while many argue GraphQL is overengineered for simple use cases and small developer teams, it actually struggles to scale elegantly in enterprise data environments.</p><p>By offloading so much of the logic to the GraphQL resolver server, what should in theory be just a minimal, centralized controller becomes a monolithic bottleneck even in microservice-driven architectures.</p><p>Of course, a counterargument would 
be that one could mitigate this through proper federation and gateway implementation, and that this is simply the result of GraphQL antipatterns rather than of the GraphQL philosophy itself.</p><h2>4. Maintenance Complexity</h2><p>In any case, any experienced owner-operator can begin to imagine the onslaught of complexity that these considerations bring into view.</p><p>Beyond what has been mentioned above, other GraphQL maintainers are rather critical of schema evolution and management as well. </p><p>Since GraphQL&#8217;s philosophy is that there should <em>never</em> be a breaking change, and thus versioning is not needed, many are compelled to go through inordinate levels of effort or gymnastics to fulfill such a contractual promise.</p><p>Versioning is a critical tool in a world of finite operational resources: it lets a service provider meet its contractual agreements within a specific window while accepting the iterative nature of software and API development.</p><p>Committing a service provider to support any and all iterations of a middleware, meeting the needs of both clients and server backends, can prove quite a miserable vow to live with in the long run.</p><p>This, in addition to the fact that modern enterprise teams are not coupled in quite the ways that GraphQL presumes, can add to the disconnect and enterprise acrobatics that it was originally supposed to mitigate.</p><p>It is understandable in this way how many could be burnt out on the GraphQL option if this is their primary experience with it.</p><h1>Conclusion</h1><p>What are we to make of all this?</p><p>Although such criticisms are rather serious, this does not mean that GraphQL is dead. 
It remains quite popular as <strong>one solution</strong> among many, but it is not the universal displacer that many heralded it to be even as recently as a few years ago.</p><p>It will remain a popular, well-supported technology for some time, barring any paradigm-shifting developments which may either render it obsolete or reinvigorate it, such as the <a href="https://hygraph.com/blog/why-the-future-of-ai-is-graphql">purported third wave of GraphQL</a> in light of agentic AI.</p><p>It seems that GraphQL is one of those over-celebrated, even if somewhat ingenious, technologies that succeeded on some level but not without some degree of failure. </p><p>A failure (if it can be called that) to have the luck or foresight to anticipate developments in the cloud and IT world, as well as the behavior of its own logic at scale. Both are divergences from the original intent of GraphQL&#8217;s design in the early 2010s that substantially undermined its universal adoption and kept REST from fading as definitively as SOAP did.</p><p>In its defense, one could argue that it was never meant to be applied or implemented universally; that was merely the <em>zeitgeist</em> of the hype train, which overplayed its use cases. Even so, it remains a curious case study.</p><p>That is why we explore new technologies and ideas, and why they fail, whether through failure of design or implementation.</p><p>So that we ourselves can iterate upon the past and build better, build smarter.</p><h1>Sources</h1><p>Acjay. <em>&#8220;GraphQL Was Not the Future.&#8221;</em> August 14, 2024. 
acjay.com/2024/08/14/graphql-was-not-the-future/.</p><p>Bessey, Matt. <em>&#8220;Why, after 6 Years, I&#8217;m Over GraphQL.&#8221;</em> May 24, 2024. bessey.dev/blog/2024/05/24/why-im-over-graphql/.</p><p>Giroux, Marc-Andre. <em>&#8220;Why, after 8 Years, I Still Like GraphQL Sometimes in the Right Context.&#8221;</em> May 31, 2024. magiroux.com/eight-years-of-graphql.</p><p>Hygraph. <em>&#8220;Why the Future of AI Is GraphQL.&#8221;</em> Hygraph Blog, November 24, 2025. hygraph.com/blog/why-the-future-of-ai-is-graphql.</p><p><em>&#8220;Eight Years of GraphQL.&#8221;</em> Discussion on Hacker News, July 17, 2024. news.ycombinator.com/item?id=46264704.</p><p>Grafbase, Inc. <em>&#8220;Why GraphQL Is Eating the API World.&#8221;</em> May 29, 2025. grafbase.com/blog/why-graphql-is-eating-the-api-world.</p><p>GraphQL Steering Group. <em>GraphQL: A query language for your API.</em> graphql.org.</p><p>GraphQL Project. <em>&#8220;GraphQL: A data query language.&#8221;</em> September 14, 2015. graphql.org/blog/2015-09-14-graphql.</p><p>WunderGraph. <em>&#8220;Six-Year GraphQL Recap.&#8221;</em> WunderGraph Blog, December 3, 2025. wundergraph.com/blog/six-year-graphql-recap.</p><p>WunderGraph. <em>&#8220;GraphQL, REST, and OpenAPI Trend Analysis 2023.&#8221;</em> WunderGraph Blog. wundergraph.com/blog/graphql_rest_openapi_trend_analysis_2023.</p><p>OWASP Foundation. <em>&#8220;Testing GraphQL.&#8221;</em> In <em>Web Security Testing Guide (WSTG) v4.2, API Testing.</em> OWASP.org. 
owasp.org/www-project-web-security-testing-guide/v42/4-Web_Application_Security_Testing/12-API_Testing/01-Testing_GraphQL.</p>]]></content:encoded></item><item><title><![CDATA[5 Considerations for Multi-Tenant Architectures]]></title><description><![CDATA[Designing multi-tenant environments for isolation and co-location]]></description><link>https://tech.ngperrin.com/p/5-considerations-for-multi-tenant</link><guid isPermaLink="false">https://tech.ngperrin.com/p/5-considerations-for-multi-tenant</guid><dc:creator><![CDATA[Thinking through the Cloud]]></dc:creator><pubDate>Sun, 02 Nov 2025 20:39:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mJD6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So you want to build yourself a multi-tenant ecosystem.</p><p>And you want to commit to it, all the way through production.</p><p>If you do, know that it is a lift.</p><p>Today we will consider some of the most critical factors in the planning and evolution of multi-tenant architectures, from my experience doing so for several companies at this 
point, and what I have learned along the way.</p><h1>Terminology</h1><p>First, a note on terminology.</p><p>Multi-tenant architecture can be terminologically confusing, particularly because assumptions are often built into these words, assumptions that are just as often not shared by everyone talking about it.</p><p>For example, if you are hosting an application with four hundred different commercial users, one could argue this is a multi-tenant application, as each user is an independent &#8220;tenant&#8221; living inside the application.</p><p>This same person would then look at another application that consists of five near-identical deployments, with one customer group per deployment, and call it a single-tenant architecture.</p><p>On the flip side, someone could look at that application with four hundred users and say it is single-tenant, as there is one deployment that corresponds to the production environment.</p><p>They would then look at the five-deployment application and say it is multi-tenant, as the production environment has five tenants for the one environment.</p><p>The key distinction here is what &#8220;tenant&#8221; refers to. 
Both views are arguably valid, but they carry very different understandings of what it is multi- with respect to.</p><p>We will define &#8220;tenant&#8221; as an infrastructurally isolated instantiation of an application deployment for purposes of data or functional isolation.</p><p>This assumes that a &#8220;tenant&#8221; is a more granular subdivision of infrastructure beneath the &#8220;environment&#8221; layer (e.g. dev, qa, prod).</p><p>It also means that a highly available (HA) environment architecture, perhaps with an Active/Active or Active/Passive configuration, would not necessarily introduce multi-tenancy, since the data or compute is not strictly bounded into one side or the other.</p><p>Therefore, a single-tenant deployment would mean one instantiation of an application in prod.</p><p>If there is one deployment per customer, that would entail a multi-tenant architecture. At least for our intents and purposes here.</p><h1>1. Do You Really Need Multi-Tenancy?</h1><p>This should be the architect&#8217;s go-to question when facing any net-new consideration: &#8220;Do you really need ______?&#8221; </p><p>If less truly is more, then is multi-tenancy really adding more than it will cost you?</p><p>This really does depend. </p><p>And it can be incredibly hard to forecast these variables if you have not lived through the growing pains of multi-tenant production ownership before.</p><p>Let us take into account some common reasons why companies push for multi-tenancy.</p><h2>A. 
Data Isolation</h2><p>Data isolation can be a very good reason for adopting a multi-tenant architecture, but it can also be an unnecessary one.</p><p>It can provide extremely secure safeguards to isolate the entire lifecycle of a single customer&#8217;s data if you instantiate separate tenants for each customer organization or BU you are working with.</p><p>But &#8220;data isolation&#8221; on its own is too broad and imprecise to be useful for a technologist in the way that it is for sales.</p><p>One must ask <em>what types of data need to be isolated, to what extent, and for what reasons</em>?</p><p>The typical working assumption here is persisted application data, such as in a PostgreSQL database.</p><p>But why is this necessary?</p><p>This requirement is most often driven by compliance or infosec teams, either internally or the customers&#8217;.</p><p>If it is compliance, then the question is which compliance frameworks are in scope for the application stack, or better yet, for individual application components.</p><p>If it falls under an information security program, then the question is which policies, standards, and baselines are in scope for the solution you are architecting.</p><p>Often enough, precisely mapping controls to components can lead to a relieving level of descoping.</p><p>It may be possible to achieve SOC 2 Type 2 compliance without needing multi-tenant PostgreSQL, if you can properly implement Postgres access controls and guardrails.</p><p>The minimum viable requirements for isolation may also shape the multi-tenant model you adopt. 
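</p><p>As a purely illustrative sketch of what such application-level guardrails can look like in a shared database (the helper and names below are hypothetical, and real code would use parameterized queries rather than string concatenation), every read can be forced through a tenant-scoped query builder so that no call site can forget the tenant filter:</p>

```typescript
// Hypothetical guardrail for a shared, tenant-discriminated database:
// the tenant predicate is appended by construction, never by convention.
type TenantId = string;

function scopedQuery(tenant: TenantId, table: string, where = ""): string {
  // Illustrative only; real code would bind parameters, not interpolate.
  const tenantClause = `tenant_id = '${tenant}'`;
  const predicate = where ? `${tenantClause} AND (${where})` : tenantClause;
  return `SELECT * FROM ${table} WHERE ${predicate}`;
}

console.log(scopedQuery("acme", "invoices", "status = 'open'"));
// SELECT * FROM invoices WHERE tenant_id = 'acme' AND (status = 'open')
```

<p>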
Perhaps separate Postgres databases are needed but can live in a shared cluster.</p><p>Multi-tenancy is not a one-size-fits-all approach, as we shall see below.</p><p>Aside from the question of requirements scoping and mapping, another easy-to-miss possibility is whether the application stack itself already provides sufficient data isolation and guardrails, or is within a sprint&#8217;s distance of doing so.</p><p>This is a question of resourcing. For underresourced application teams it may simply be more viable to shift the load to infrastructure.</p><p>That is not a problem in itself, but it does come at a cost, a cost that accrues interest.</p><h2>B. Noisy Neighbor</h2><p>Some workloads are more intensive than others.</p><p>If customers are working inside a shared instance, particularly in an application without effective load testing or related protections, a power user may cause performance degradation for other users through memory or CPU exhaustion, third-party API rate limiting, or exorbitant LLM token consumption.</p><p>Again, the question may come down to (1) how well designed or well implemented the application itself is around handling load and putting in appropriate checks and balances, or (2) what measures of vertical and horizontal scaling are in place to provide elasticity to support bursty compute requirements.</p><p>Sometimes that, or a concerted refactoring effort, may solve it.</p><p>Or if there is a particular customer organization with heavy usage of the application, it may make the most sense across the board to provide them with a designated, isolated environment, so that the blast radius of their system degradation is limited.</p><p>In this case it is more of a stopgap measure from an SRE perspective, but thinking beyond the pure technical architecture, it is a fairly effective customer-relations move for those willing to pay or go along with it. 
</p><p>Where infosec teams are not involved, perhaps the power user wants the comfort of knowing they are using your SaaS offering in their own dedicated environment. Any slowdown is related to their heavy usage, not someone else&#8217;s.</p><p>An upsell to a premium isolated environment may boost ARR, particularly for larger customers, and this SKU approach is a valid business justification for supporting a multi-tenant architecture.</p><p></p><h2>C. Easier Lifecycle Management</h2><p>Perhaps there are no strict data isolation or performance considerations involved, but for whatever reason the application itself does not have robust data lifecycle management for users, groups, or workspaces.</p><p>If you happen to operate in a model with strict customer data deletion requirements, an application data model that really cannot service those needs, and no resourcing to implement such a procedure yourself, you may be backed into needing a multi-tenant solution here.</p><p>However, this is overall a strategically weak reason for multi-tenancy if it is the only one, and it behooves the technology strategy stakeholder to take a good, long look in the mirror and assess how committed they are to an application which cannot handle data lifecycle management well, and how long it will be until that is mitigated.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://tech.ngperrin.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://tech.ngperrin.com/subscribe?"><span>Subscribe now</span></a></p><h1>2. 
Choose Your Multi-Tenancy Model</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mJD6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mJD6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png 424w, https://substackcdn.com/image/fetch/$s_!mJD6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png 848w, https://substackcdn.com/image/fetch/$s_!mJD6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png 1272w, https://substackcdn.com/image/fetch/$s_!mJD6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mJD6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png" width="1260" height="709" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:709,&quot;width&quot;:1260,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:253743,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/177571606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mJD6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png 424w, https://substackcdn.com/image/fetch/$s_!mJD6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png 848w, https://substackcdn.com/image/fetch/$s_!mJD6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png 1272w, https://substackcdn.com/image/fetch/$s_!mJD6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75c0316-4d63-40d2-adb5-35a3dfa9010b_1260x709.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><a href="https://d1.awsstatic.com/onedam/marketing-channels/website/aws/en_US/solutions/approved/documents/architecture-diagrams/multi-tenant-architectures-on-aws.pdf">AWS Reference Architecture</a></p><p>If you do in fact decide to move forward with the multi-tenant architecture, the next question is the layer at which segmentation occurs.</p><p>This is a very broad spectrum and depends on the use case.</p><p>As the AWS example demonstrates above, you could theoretically have a multi-tenant architecture where the application tier is shared, but an individual RDS table is partitioned.</p><p>Or perhaps in the database tier it is the schema that is the partition layer, or the logical database, or the database instance.</p><p>In a silo model, you have dedicated application and database for each tenant, but this could be extended to the networking 
layer (VPC for each customer), or even beyond that to an AWS account per customer (leveraging Account Factory).</p><p>There are multiple dimensions to this, not just the single axis around the application stack itself.</p><p>If you have multi-tenancy, how do you then strategically group or partition:</p><ul><li><p>Telemetry data</p></li><li><p>Audit logs</p></li><li><p>Big Data Analytics (OLAP)</p></li><li><p>SIEM/XDR and other security artifacts</p></li></ul><p>Just to name a few.</p><p>This may or may not be complicated, again depending on the original requirements driving multi-tenancy in the first place.</p><p>But if, for instance, you require multi-tenancy for data isolation purposes, yet your tracing or analytics tooling ends up piping sensitive data to a shared provider anyway, then you may need to rethink your analytics strategy.</p><p>The same goes if your LLM workloads interface with sensitive data. If you depend upon embeddings, does your vector data store meet the criteria for data isolation? Even with purely inferential workloads, does your LLM observability tooling simply pipe all user prompts and context to a shared repository?</p><p>Anything that is tied to a driving requirement for multi-tenancy becomes a factor to consider.</p><p>This is part of the cognitive burden and potential complexity you must accept when moving forward with a multi-tenant architecture.</p><h1>3. Change Management</h1><p>Multi-tenancy introduces a fair degree of complication in the processes and procedures of change management.</p><p>Whereas in single-tenant architectures change or patch management occurs once per environment (thus providing a very focused arena for change control), multi-tenancy often entails release management for each individual tenant.</p><p>Let me provide just a few examples of where the complexity comes into play.</p><h2>A. 
Fleet Deployments</h2><p>The most obvious complicator is of course the deployment to each and every tenant in an environment for each release.</p><p>If you have N tenants in an environment, you will need to run the release playbook at least N times.</p><p>Automation is of course the compensating factor here. </p><p>But this really necessitates a robust CI/CD mechanism, one that can handle the nuances of multi-tenant change management.</p><p>GitOps pipelines are key here, but again they need to be robust for this to be a viable solution.</p><p>If you do not have the platform maturity to adopt full GitOps, then it is best to devise strategic patterns and guidelines for streamlining manual change operations as much as possible.</p><p>Bash and PowerShell can be a safety net in these situations, covering gaps in your infrastructure-as-code or GitOps ecosystem, but they need to be handled astutely and with good judgment.</p><p>For instance, if the change request ticket includes a manual work item tied to a specific release, then there should be a concomitant effort by the ITSM, DevOps, or equivalent team to automate that work item so it can be run against the fleet during the release window.</p><p>Again, it&#8217;s a bit old-school, but the tradeoff of multi-tenancy is that most change-management tasks just become a heavier lift.</p><p>Whereas in smaller application ecosystems it may well be possible to have a polished GitHub flow and CI/CD pipeline that issues a new release for each approved commit, this may not be possible with multi-tenancy.</p><p>If you have a heavyweight, tightly coupled tenant architecture with a rolling-restart deployment strategy, a release would entail a lot of compute pressure to essentially reboot all the servers or containers tied to all tenants.</p><p>It is easy to identify solutions for this problem of course, but the fact that you need a solution for it does make it heavier than if you had 
single-tenancy to begin with.</p><h2>B. QA Strategy</h2><p>QA and UAT rehash the same problems of manual versus automated testing.</p><p>With multi-tenancy, any number of manual QA tasks would need to be multiplied by N tenants, unless you compromise your QA standards on this front and resort to strategic spot checking.</p><p>If you are able to invest in and achieve near-100% QA automation for a single environment, that is all well and good, but you may need to temper your expectations by recognizing just how feasible it is to run compute-intensive testing automations against an entire fleet of tenants, especially if you deploy a release more than once a day.</p><p>Furthermore, the engineering around such QA automation must also take multi-tenancy into account, such as how the headless QA tooling authenticates to each individual tenant, assuming there is not a shared admin user across the tenant environments (bad!).</p><p>Not unsolvable, just more of a lift.</p><p>More than this, you have to account for the variability of QA results between different tenant environments.</p><p>What if QA testing succeeds in all but one tenant? What if it succeeds in 80 percent? 50 percent?</p><p>What are the criteria for a rollback? Do all environments get rolled back or just a few? Will the complexity of multiple live versions be accepted simply because of the urgency of the release or because the lift is not feasible?</p><p>Not all regressions are equal either, and if nearly all tenants are healthy but one or two have the equivalent of a Sev1/P1 release-introduced failure, there has to be tactical consideration of whether you roll back for all, roll back for some, or try to fail the few forward and then patch the rest later.</p><h2>C. 
Version Control</h2><p>This of course leads into the aspects of version control which the multi-tenant stakeholder must account for as well, whether they like it or not.</p><p>If we start with the simplest case, a single monolithic version for the entire tenant SaaS application stack, you have to identify your versioning strategy <em>for the fleet</em>.</p><p>Are there circumstances in which particular tenants may upgrade before others?</p><p>We noted just now the demands of the rollback strategy, but as is very common, various customers may have differing demands on their release cadence.</p><p>Some may prefer &#8220;move fast and break things&#8221; to opt in for new features, while others may simply want their SaaS offering to work as expected all the time, regardless of bells and whistles.</p><p>This may lead to a business-driven requirement for different release streams.</p><p>The advantage of multi-tenancy paired with infrastructure-as-code is that you can manage individual tenants&#8217; versions with great granularity without too much difficulty. </p><p>Sometimes it&#8217;s as simple as a Helm chart values file.</p><p>But you have to account for the compatibility and version support of all deployed versions in the tenant fleet. 
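</p><p>A simple version inventory makes the fan-out explicit. The following toy Python sketch (tenant names and version numbers are invented) groups tenants by release line, showing how many distinct builds a single fix may require:</p>

```python
# Toy sketch with invented tenant names and versions: given the version
# pinned per tenant, group tenants by minor release line to see how many
# distinct builds a single fix would require.

tenant_versions = {
    "acme": "2.4.1",
    "globex": "2.4.1",
    "initech": "2.3.7",  # on a slower, stability-first release stream
    "hooli": "2.5.0",    # opted into the "move fast" stream
}

def hotfix_targets(versions):
    """Group tenants by minor release line (e.g. '2.4'), since a fix
    typically has to be cut and tested once per supported line."""
    lines = {}
    for tenant, version in versions.items():
        line = ".".join(version.split(".")[:2])
        lines.setdefault(line, []).append(tenant)
    return lines

targets = hotfix_targets(tenant_versions)
```

<p>With these invented pins, three release lines are live in the fleet, so one fix means three builds to cut, test, and roll out.</p><p>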
If you need a critical security hotfix, you must be prepared to patch multiple different versions.</p><p>Additionally, if there is some feature flag tooling at play as well, this can quickly lead to a matrix of headaches around how to line up individual tenant versions with the suite of feature flags and their operational lifecycle.</p><p>Now take two dimensions of that matrix and add the third dimension of microservice versioning.</p><p>If you have a microservices ecosystem, then each microservice ostensibly has its own versioning cycle and change management system.</p><p>If some microservices live within the multi-tenant offering while others live outside it but integrate directly with it, how do you orchestrate the versioning of shared versus tenant services for all tenants across all the versions of all the services currently supported?</p><p>This is another example of how multi-tenancy may not seem intimidating at first but can quickly become more than a thought problem as an organization or platform scales outward and upward.</p><h2>D. New Workloads</h2><p>On the note of outward scaling, you can become fairly adept at accounting for multi-tenancy in your regular change management or release process, given a well-defined set of changes you are working with.</p><p>But an organization is not growing or thriving if it is not evolving. 
And the same goes for application architecture.</p><p>The development and integration of new workloads, whether a Redis queue, a proxy server, or a data warehouse, all need their rollout plans strategically aligned with the multi-tenant architecture.</p><p>To put it simply, it is easy enough to issue and mount a third-party API key for a single prod environment, but if requirements demand it, you may need to be able to programmatically generate and mount these for each individual tenant in scope.</p><p>What was relatively simple in single-tenancy becomes a thornier thing to solve for once you have N tenants to manage.</p><h1>4. Other Operational Complexity</h1><p>Outside the realm of change management, there are additional considerations revolving around the ongoing operational complexity of managing multi-tenant workloads.</p><h2>A. Snowflake State Management</h2><p>The beauty of the customer is that they are not all the same.</p><p>Each has their own preferences and interests, and demands for custom-tailored solutions.</p><p>This is not inherently harmful but does pose a risk, particularly when multi-tenant architecture becomes a vehicle for snowflake deployments.</p><p>Perhaps one customer wants a particular third-party integration disabled for security reasons, or another wants different caps set on file or data limits.</p><p>The beauty of multi-tenant architecture paired with infrastructure-as-code is that such granularity is in fact not only feasible but often easy to manage.</p><p>At least at first.</p><p>In a well-designed multi-tenant platform, the actual implementation of snowflake state management can be a breeze.</p><p>But it comes at the cost of application coherence.</p><p>Product and engineering stakeholders may add feature on top of feature over time. 
More often than not these building blocks have dependencies upon one another.</p><p>In a one-size-fits-all model this is not as complicated to manage.</p><p>Yet if each tenant has their own bespoke configuration patterns, and the consistency of such configuration is not rigorously reviewed and maintained, it is only a matter of time before the snowballing of these snowflakes produces a reliability avalanche. </p><p>It is impossible for QA to catch these in pre-production environments unless you commit to some level of multi-tenant QA environment that mirrors the bespoke deployments of production tenants.</p><p>SRE stakeholders are left fighting a losing battle, trying to corral every possible permutation of config pattern that may cause regressions.</p><p>If any of this makes its way back to product and engineering, they themselves can express frustration, because in their view they should not be required to account for each individual customer deployment&#8217;s requirements.</p><p>And so, without strategic coordinated effort, multi-tenant bespoke deployments can devolve into a quagmire of both technical and personnel friction. High drag, low velocity.</p><p>While it is a well-known adage that &#8220;config is cheap&#8221;, one must beware the force of compound interest. What once was cheap may quickly become taxing at scale.</p><p></p><h2>B. Multiplied Incident Response </h2><p>The same kind of operational complexity extends to incident response as well.</p><p>The classification of severity levels for incidents now has to factor in the probable scenario where individual tenants experience severe degradation while other tenants experience none.</p><p>How does this impact your SLI metrics? Do you take the same KPI hit for degradation in a few environments as you would for the whole tenant fleet? If these are accounted for differently, how do you properly distinguish between isolated downtime and fleet-wide downtime? 
What is the criterion for this threshold?</p><p>When an incident does occur in a multi-tenant setting, one must accurately assess and determine the scope of the outage. In most cases one cannot merely report a degradation of &#8220;the prod environment&#8221;. Some tenants may be impacted while others are not.</p><p>Containment and remediation in turn follow this law of operational overhead. If the incident is tied to the tenant level, the lift to implement the fix in one tenant needs to be multiplied by the number of tenants.</p><p>In time-sensitive situations you are largely dependent upon the skills of your incident response team to create scripted remediations that can be run programmatically without negatively impacting unaffected tenants or introducing other regressions due to variation between tenants.</p><p>Institutional knowledge becomes a chief asset in these cases, as those who know the ins and outs of your garden of multi-tenant peculiarities can often best assess the path of least resistance. Those with a documentation-based, or worse, AI assistant-based, understanding of the application ecosystem may blindly step into the traps of bespoke deployments.</p><p>Often enough, a step or two of additional discovery is needed, which in general makes IR a bit heavier in both theory and practice.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://tech.ngperrin.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://tech.ngperrin.com/subscribe?"><span>Subscribe now</span></a></p><p></p><h1>5. 
The Cost</h1><p>Out of all the above reasons, the multiplier factor of hosting multi-tenant environments should be the most acutely felt by the business line.</p><p>This is heavily dependent upon your application architecture: whether the silo consists of isolated web servers, databases, and datastores, or merely stateless API servers, can make all the difference in your hosting costs.</p><p>But in those cases where you do have to multiply a small chunk of CPU, memory, or storage per customer, this can very quickly be felt in the ultimate P&amp;L statement.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> </p><p>If customer adoption is slow, this may be gradually accounted for.</p><p>But if adoption is fast, you lowball the license fees to maximize market capture, and you are committed to the multi-tenant architecture, the preparer of your financial statements may be surprised to learn that the cloud operating costs are a greater debit than the credit of customer revenue.</p><p>That is more of a worst-case scenario in an early startup setting, but the same principle applies.</p><p>Many server-based solutions simply have that base level of overhead to run N servers. Serverless and scale-to-zero may mitigate this to some extent, at least for compute costs, but you may still have to reckon with duplicated storage overhead for supporting individual tenant environments.</p><p>If you are a stakeholder trying to steer your organization away from adopting a multi-tenant architecture for a new solution, it may be through the rigorous arithmetic of computing cloud spend that you are able to get buy-in to avoid or at least mitigate such an approach. 
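</p><p>That arithmetic can fit in a few lines. The following back-of-the-envelope Python sketch (every dollar figure is invented for illustration) shows how a per-tenant infrastructure floor can swallow a lowballed license fee:</p>

```python
# Back-of-the-envelope sketch; all dollar figures here are invented.
# Margin = license revenue minus (shared overhead + per-tenant floor).

def monthly_margin(tenants, fee_per_tenant, shared_cost, per_tenant_cost):
    revenue = tenants * fee_per_tenant
    hosting = shared_cost + tenants * per_tenant_cost
    return revenue - hosting

# A lowballed $99/mo license against a $120/mo per-tenant infrastructure
# floor loses money on every tenant, and scaling only deepens the loss.
margin_10 = monthly_margin(10, fee_per_tenant=99, shared_cost=500, per_tenant_cost=120)
margin_100 = monthly_margin(100, fee_per_tenant=99, shared_cost=500, per_tenant_cost=120)
```

<p>With these invented numbers the margin is negative at ten tenants and worse at a hundred; the per-tenant floor, not the shared overhead, dominates at scale. Even a toy model like this can anchor the buy-in conversation.</p><p>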
If the other reasons fail to be persuasive.</p><p></p><p></p><div><hr></div><h1>Conclusion</h1><p>While this may seem like a largely negative assessment of multi-tenant architecture, this is not to say multi-tenant architecture is inherently &#8220;bad&#8221;.</p><p>There are many cases that justify the leap to this kind of approach.</p><p>It is just that many orgs who are bought into the multi-tenant approach for one reason or another may not fully know what they are getting themselves into.</p><p>And that is why these thoughts are merely presented here as a cautionary tale.</p><p>Multi-tenant environments introduce an additional layer of challenge. </p><p>These are not insurmountable.</p><p>But they steepen the uphill battle technology teams are often engaged in as they fight the ongoing skirmishes of technology operations.</p><p>To build things that work. And stay working for the long haul. Hopefully without getting burnt out.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Not to mention this can often be masked for early startups by the generous credits their CSP may provide them with for initial vendor lock-in.</p></div></div>]]></content:encoded></item><item><title><![CDATA[8 Myths of Building "Lean"]]></title><description><![CDATA[The Art of Requirements II: Balancing alignment and leanness in digital architecture and engineering]]></description><link>https://tech.ngperrin.com/p/the-art-of-requirements-ii-the-myths</link><guid isPermaLink="false">https://tech.ngperrin.com/p/the-art-of-requirements-ii-the-myths</guid><dc:creator><![CDATA[Thinking through the Cloud]]></dc:creator><pubDate>Mon, 13 Oct 2025 19:00:40 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!hJkW!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5ce81b0-f183-4bbd-a2b4-78bc2ac0c6ac_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Like Agile, &#8220;lean&#8221; has become something of a meme or cliche in the business and technical world. This is no fault of Eric Ries or whoever may be deemed responsible for revitalizing the term, but it does require us to exercise some thinking muscles to recover the original idea here.</p><p>Let us first restate what &#8220;lean&#8221; is, then what it is not, before we consider general principles for building lean from a technical design perspective.</p><h1>The Lean MVP</h1><p>We will take Eric Ries&#8217; <em>The Lean Startup</em> as our starting point on this subject.</p><p>The startup, or more broadly the initiative, is concerned with building something new to fill a gap. If there is a gap, then there necessarily is uncertainty around how to fill that gap.</p><p>The core business proposition or objective should not be primarily locked into filling the gap with <em>this particular solution </em>but rather commit itself to solving a particular problem through hypothesis-testing various solutions.</p><p>Thus the core loop of building lean is to identify a business hypothesis with metrics to measure the success, and even more importantly the failure, of said hypothesis. From there, the MVP is the smallest possible test that addresses the riskiest, most uncertain assumption. 
The MVP provides actionable information to inform the initiative, which can then pivot and iterate upon such findings.</p><p>It is validated learning that directs innovation in an auditable way to drive growth.</p><p></p><h1>What Lean Is Not</h1><p>Like waterfall in Agile&#8217;s clothing, many business and technical professionals like to attribute this positive buzzword &#8220;lean&#8221; to many things that are decidedly not lean.</p><p>Any initial prototype is deemed an MVP even if there is no consciously articulated business hypothesis being tested.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Here are some common ways un-lean thinking can disguise itself as &#8220;lean&#8221;.</p><h2>1. Avoiding Planning</h2><p>This is perhaps one of the most common misuses of &#8220;lean&#8221; I encounter across organizations, particularly in startup culture.</p><p>Building technology is not easy. </p><p>The advance of time has led to the proliferation of technologies, their terms, their mappings, and their use cases. 
Things are not getting simpler.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>However, if you are a decision-maker on digital products, platforms, or applications, especially for a freshly-minted start-up, you may struggle to wrap your mind around <em>all the options </em>that are out there and confront you, whether through vendors or opinionated engineers you work with.</p><p>Taking each on its own may be simple enough, but then they all need to be reckoned with together, and the complexity balloons the more factors you take into consideration.</p><p>Given such difficulty and a desire to &#8220;move fast&#8221; or &#8220;just build&#8221;, the temptation is to adopt a &#8220;lean&#8221; methodology and forego any and all consideration of system architecture, whether for the application you are building or for the company&#8217;s operations itself.</p><p>However, this can quickly turn into a let&#8217;s-throw-everything-at-the-wall-and-see-what-sticks mentality.</p><p>This can work, much like you can win at Russian roulette, but it is risky and more often than not <em>costly.</em></p><p>Again, this aversion to systems planning or enterprise architecture is not unfounded.</p><p>Plan-avoidance can take hold as a gut reaction to the endless meeting chains of Waterfall. 
Talking everything through <em>ad nauseam </em>before one could even start doing anything at all.</p><p>Startup founders exiting corporate environments may feel liberated from QBRs and Scrum-of-Scrums and use this newfound freedom to instill a culture of technical anarchy in their startup.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>But as any technology professional is well aware, once there is the glimmer of gold in a dev, sandbox, or even local environment, the immediate question is how soon can we get this into production.</p><p>So if various individual contributors (ICs) are working on projects or features in their silos, whatever shows the slightest sign of fruit may face intense pressure to find its way to a production environment, with minimal other considerations.</p><p>Unless the IC is a master-class contributor, this is where things can get hairy: finding that path to production. 
Some complicating factors include but are not limited to:</p><ul><li><p>How to build or deploy this across environments</p></li><li><p>How are secrets/keys managed</p></li><li><p>How secure is the application</p></li><li><p>Can it scale and/or handle load</p></li><li><p>Is it reliable and sustainable</p></li><li><p>Can it be extended for other features or patterns</p></li><li><p>Is it costly to run with a live user base</p></li><li><p>Can other ICs work on the application without an immense learning curve</p></li><li><p>If AI was used, how much bloat or inconsistent patterns are present (along with all the other gaps AI-driven coding can introduce)</p></li></ul><p></p><p>The prototype-first, plan-later mindset inevitably puts the cart before the horse and can lead to serious delays, frustrations, and costs, simply because basic considerations around the <em>operations </em>of the workload were never addressed, out of a desire to &#8220;think lean&#8221;.</p><p>This is not to say that one should overplan either.</p><p>Planning should take the tradeoffs into account. How much time do you spend validating hypotheses or engaging in systems design before proceeding?</p><p>Ideally you would be working with the production stakeholders to identify the factors that go into the &#8220;pipeline to prod&#8221; so that you do not face hang-ups before you get there.</p><p></p><h2>2. Avoiding Thinking</h2><p>This is similar to the first, but slightly different.</p><p>While planning-avoidance prematurely waves off systems planning, environment management, compliance, etc. 
as &#8220;too heavy&#8221;, thinking-avoidance simply takes the most immediately presented solution or idea as the best one.</p><p>This arises when there is a known problem or set of tradeoffs to consider, in either the technology or the operational structure of an organization.</p><p>You know there is a problem, but identifying the proper solution to that problem may require critical thinking or hypothesis testing.</p><p>The human temptation is to eschew that kind of hard work or hard thinking in favor of either taking the first solution recommended by a particular AI assistant or saying &#8220;we&#8217;ll figure it out later&#8221;.</p><p>Many will quote design principles of minimalism, or again call it &#8220;lean&#8221;, to simply not worry about the problem and cross that bridge when we get to it.</p><p>But again this can prove very costly and regrettable in retrospect, when a day or two of consideration could have saved the firm five or six figures of cloud or personnel spend or lost sales revenue.</p><p>This leads to deferring the answer to whatever is proposed first and following the chain of developments, eyes half shut, to see where it goes.</p><p>&#8220;Lean&#8221; is about actively prioritizing hypothesis testing using data and critical judgment, but like plan-avoidance, thinking-avoidance can often take on the label of lean minimalism to justify passivity or our own human tendency toward laziness.</p><p></p><h2>3. 
Find the Case to Justify the Assumption</h2><p>Another far too easy trap to fall into is shifting the purpose of the MVP or the lean prototype away from solving a question and toward proving an answer.</p><p>Technologists in startup and corporate environments often have a keen interest in their intelligence optics.</p><p>Because intelligence is immaterial and intangible, it becomes an ongoing site of competition around the players who claim to be technical, and even some who deny being technical but will reserve the right to enter technical debates.</p><p>The rivalries and various episodes that emerge from this can be dramatic enough, but it can also engender behavior that is costly to the company or the particular initiative.</p><p>People can attach their ego to this or that proposal or solution they put forward.</p><p>Then they feel the need to protect it, like their own child.</p><p>This is when thinking lean becomes about finding the right test that will prove the pre-selected answer.</p><p>It is the mindset that says, I know what is most lean, but I need to justify it so everyone can see it.</p><p>So MVP hypothesis testing is oriented primarily around reinforcing the credentials of the one who authored the hypothesis rather than an honest assessment of its value delivery.</p><p>This problem is not limited to technologists either; it can appear across a wide range of personalities and roles.</p><p>But it is how one can claim to be lean by finding the minimal acceptable evidence to advance a theory that they have already set in stone.</p><p></p><h2>4. 
High Velocity, High Volume</h2><p>One of the most appetizing facets of a lean methodology is the purported ability to &#8220;move fast&#8221;.</p><p>High developer velocity.</p><p>Low time to market.</p><p>This paves the way for more&#8230; features, features, features.</p><p>Imaginative and more visionary personalities can often struggle to balance the influx of new ideas against discipline, focus, and keeping it minimal.</p><p>They can take the &#8220;lean methodology&#8221; to mean that it&#8217;s got the metabolism to absorb a glut of caloric features without gaining weight.</p><p>Even the most scalable and extensible systems architecture can enable and support feature overloading, but at the cost of UX.</p><p>This can tie into plan avoidance, as the stakeholder may simply want to see if such and such a feature works or gains acceptance.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>If the technology counterparts can keep up, the number of features is itself taken as the definitive success metric, regardless of its impact.</p><p>The trick with &#8220;lean&#8221; is to find the test that can kill a feature before it proves costly to the company, if it should prove not to be worth the effort. Not to create a runway to load up on features <em>ad infinitum</em>.</p><p></p><h2>5. High Velocity, Low Quality</h2><p>A cousin of the previous mentality is the implicit acceptance of low quality work, particularly technical implementation, just in the interest of getting things out the door.</p><p>This can emerge in a variety of ways.</p><p>One, for the business line that operates out of a sense of constant urgency and last-minuteness, the technical IC is compelled to cut corners on their work, even though they know better. 
They hold their breath and finish the user story, knowing they have introduced technical debt.</p><p>Two, perhaps the business line is somewhat patient, but an engineer has mis-estimated the work. They have gone through several iterations, and now it is getting down to the wire, so when they stumble across something that barely works, the engineer will sprint toward that solution and pat themselves on the back for delivering at the eleventh hour.</p><p>Or let us say the product owner or equivalent business stakeholder takes a holistic, architectural mindset to iterative development. In other words, they are sharp and know they need to leave room for systems decisions.</p><p>This can also backfire, particularly in environments where the IC helps draft their own technical requirements.</p><p>Three, the technical IC who wants to minimize the amount of work they need to complete in the short-term will enumerate a long list of reasons or justifications for why they cannot meet certain additional acceptance criteria or systems requirements.</p><p>In their words, it is too darn complicated or difficult, and all the technical gobbledygook frightens off the product owner from pushing them any further on it, settling for whatever minimal quality the IC feels like accomplishing.</p><p>Long-winded pushback of this kind can successfully trick managers out of basic technical standards evaluation (e.g. 
idempotent migrations, retry logic, consistent React state), all because the IC will argue that the low-quality technical work is the &#8220;lean&#8221; option.</p><p>Four, on the other end of the spectrum, an overzealous engineer can propose an incredibly sophisticated and powerful piece of machinery for their feature, and on the off chance they estimate it somewhat accurately, the work looks incredibly expensive and over-extended for the business to accept.</p><p>In the interest of being lean, a manager or other stakeholder will thus descope any and all systems considerations to avoid the overengineered approach they are presented with, simply because often those same overzealous ICs have a very hard time self-selecting or prioritizing what is properly most important for systems design or scalability.</p><p>Thus the engineer is overruled because they could not constrain their production-readiness expectations to something acceptable to the business line.</p><p>In all of these, as some of you may be able to tell, a large part of the problem is perhaps due to org structure, roles, responsibilities, and separation of concerns, but all the same, it is easy to see how ostensible concerns about being &#8220;too heavy&#8221; can be used to ignore considerations around quality, sustainability, reliability, or scalability.</p><h2>6. 
Pennywise</h2><p>When budgets or seed round money are a concern, the myth of building &#8220;lean&#8221; can often find a way to justify saving pennies at the expense of pounds.</p><p>These are often the founders or startup cultures least amenable to the idea of &#8220;you have to spend money to make money&#8221;, perhaps because they have seen that line used to burn entire investor rounds.</p><p>But the idea of building lean can often be used to unnecessarily constrain the technology and talent pool budget in ways that can prove costly in the long run.</p><p>Hiring junior or nearshore engineers has its benefits, especially when getting off the ground, but it also has its downsides if you want a scalable application with sustainable development velocity.</p><p>Going all in on managed services for prototyping may seem the cheap, smart move for avoiding hosting costs, but the sticker shock at a minimal Kubernetes cluster may very soon be outweighed by the costs of getting off AWS Amplify or Vercel.</p><p>There is of course a virtue to fiscal discipline that can be lost on many startup executives with a starry-eyed vision and flush with investor cash, but the other extreme may constrain the startup or initiative&#8217;s potential for growth out of a rhetorical commitment to being &#8220;lean&#8221;, when this is a misapplication of what is meant by a process focused on surgical, scientific precision to solve business questions.</p><p></p><h2>7. 
The Lean Contortionist</h2><p>All of the above myths come out of some sense of reducing or subtracting complexity in order to help build something &#8220;lean&#8221;, even if they misapply the original idea.</p><p>However, there is an entire separate category which has found a way to reinvent something &#8220;heavy&#8221; by virtue of the lean methodology.</p><p>This can arise out of a demand to be &#8220;lean&#8221; enough to be extensible in any and every direction.</p><p>You end up getting buried in a morass of requirements around portability, observability, HA, BC/DR, scalability, OS support, security, compliance, and anything else that can be thought of.</p><p>These can come from senior technical managers or architects who are unable to properly limit the field of vision or constrain their requirements in the traditional MVP sense and thus burden any new digital initiative with a potpourri of non-negotiables so that it is &#8220;lean&#8221; enough to change course to fit any requirements. Another way of just being heavy.</p><p></p><h2>8. Silencing Stakeholders</h2><p>This last one overlaps with many of the above. Depending on how evolved or mature an organization is, it arises when one feels there are simply too many competing voices in the room, drowning out the startup focus on rapid iteration and product velocity.</p><p>This can lead executives and managers to unilaterally exclude particular stakeholders from the iteration process, just so the time-to-market doesn&#8217;t have to be bogged down by meetings and perspectives.</p><p>The temptation is not entirely unsympathetic if your governance, risk, and compliance (GRC) stakeholders multiply your iteration cycle by a factor of three or five with very untargeted, generic feedback.</p><p>Or if your infrastructure team is complaining about competing priorities and they will not be able to get around to it until next quarter. 
And you just want it done.</p><p>Thus the temptation to simply not invite those folks to this or that meeting, and to tell them you did not want to bother them since you know how busy they are.</p><p>You want a lean, tight, small team to help get your feature or product to production, and the stakeholder bandwagon can get in the way of that sometimes.</p><p>But that is not what lean was proposed to accomplish. </p><p></p><h2>Epilogue: Lean + Platform</h2><p>Some of you may have experience with these species of pseudo-lean, or perhaps others besides those listed above.</p><p>There is no one-size-fits-all approach to preventing or alleviating these.</p><p>Some of it comes down to properly weighing tradeoffs. Perhaps you do need to go in blind on some things.</p><p>Perhaps it is an issue of org structure and skillsets.</p><p>One way to provide a foundation to build &#8220;lean&#8221; is the premise of <em>platform engineering</em>.</p><p>Out of the hundred or thousand various considerations that can swim before a technologist&#8217;s eyes, the IC should be able to work with surgical focus that optimizes the value-delivery of their work while minimizing their need to &#8220;reinvent the wheel&#8221;.</p><p>This is why enterprise architecture, baselines, patterns, self-service infrastructure, platform bootstrapping, all of these provide the ingredients to abstract away many sources of noise or secondary concerns from the developer.</p><p>The security, compliance, infrastructure, and whatever other requirements are moved out of their field of vision and distraction.</p><p>This can help them hit the keyboard running. 
And build it.</p><p>Lean.</p><h2></h2><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>That of course does not prevent someone stumbling upon such hypotheses or findings by accident, even if they are not strictly following the MVP model.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p> Even if generative AI advances may abstract away the day-to-day complexity of writing software or managing config files, under-the-hood, that complexity is still very much there.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Alternatively, on the other end of the spectrum, a CTO or technical CEO may be so opinionated that there may be no room for discussion at all. 
The best of these will document and communicate their technical vision to individual team members, but there is often an inverse relationship between technical and communication/management skills.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Or worse they may presume that feature will have to be popular and do everything it takes to make that feature gain user acceptance.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[A History of Storage: Database (DBMS)]]></title><description><![CDATA[A survey of database history and its various paradigms]]></description><link>https://tech.ngperrin.com/p/history-of-storage-database-dbms</link><guid isPermaLink="false">https://tech.ngperrin.com/p/history-of-storage-database-dbms</guid><dc:creator><![CDATA[Thinking through the Cloud]]></dc:creator><pubDate>Thu, 02 Oct 2025 15:51:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mZGN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As we continue this series, looking from a high level at the evolution of persistent storage technologies, we arrive now at the database. More precisely the database management system (DBMS).</p><p>Because of the wide array of systems that can fall under the DBMS, our focus here today is largely on the &#8220;paradigm shifts&#8221; in the field of DBMS technology. 
Again at a very high level, for those wanting to dip their toes in the water.</p><p></p><h1>CODASYL: The Data Base Task Group</h1><p>For those who have read with us before, <a href="https://tech.ngperrin.com/p/a-history-of-block-storage-mainframe">the introduction of magnetic disks in the 1960s</a> paved the way for many other revolutionary developments in storage technology.</p><p>As data stores scaled, so too did the need to access such data efficiently.</p><p>Traditional block storage access was characterized by physical disk addresses. Memory pointers. 
Relationships between data were identified by where they lived.</p><p>The overhead of address management increasingly introduced difficulties that became a priority to remediate.</p><p>In 1959, the Conference on Data Systems Languages (CODASYL) was formed to address technical standardization and develop a common computer programming language.</p><p><a href="https://www.ibm.com/think/topics/cobol">COBOL arose out of this.</a></p><p>The development of COBOL reignited the question of data processing and access standards.</p><p>As part of CODASYL, the Data Base Task Group was formed to invent a new model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mZGN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mZGN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png 424w, https://substackcdn.com/image/fetch/$s_!mZGN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png 848w, https://substackcdn.com/image/fetch/$s_!mZGN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png 1272w, https://substackcdn.com/image/fetch/$s_!mZGN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!mZGN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png" width="850" height="493" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:493,&quot;width&quot;:850,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63883,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/175102439?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mZGN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png 424w, https://substackcdn.com/image/fetch/$s_!mZGN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png 848w, https://substackcdn.com/image/fetch/$s_!mZGN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png 1272w, https://substackcdn.com/image/fetch/$s_!mZGN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f7191cb-7401-401e-ac94-d228ff5d13e3_850x493.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p>By I, Jbw, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15230892</p><p></p><p>They came up with the <a href="https://www.scribd.com/document/141109399/Network-CODASYL-Data-Model#:~:text=The%20document%20discusses%20the%20Network,of%20complex%20relationships%20between%20records.">Network Data Model</a>.</p><p>The primary idea behind this data model was to replace the hierarchical tree-like structure of traditional data modeling with a more graph-like approach where parent and child records had a many-to-many relationship rather than a one-to-many.</p><p>This would offer low-level 
power for data access and processing.</p><p>However, this model never took off in its own right, for a variety of reasons. Foremost among them was the rival we will consider next.</p><p>One item worthy of note here is that the CODASYL Network Model was the first to introduce the distinction between a DDL (Data Definition Language) and a DML (Data Manipulation Language), which is so common in the DBMS landscape today.</p><p></p><h1>Origins of the Relational Database</h1><p>The network approach was not the only proposed solution to data processing needs.</p><p>At IBM, in response to concerns about data scaling, a certain Edgar Codd authored a paper, <a href="https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf">&#8220;A Relational Model of Data for Large Shared Data Banks&#8221;</a>, in 1970, which proposed representing &#8220;collections of relationships&#8221; that could be accessed independently of memory addressing.</p><p>This concern with the user-friendliness of data access patterns is stated most succinctly here:</p><blockquote><p>Users should not normally be burdened with remembering the domain ordering of any relation (for example, the ordering supplier, then part, then project, then quantity in the relation supply). Accordingly, we propose that users deal, not with relations which are domain-ordered, but with relationships which are their domain-unordered counterparts. To accomplish this, domains must be uniquely identifiable at least within any given relation, without using position. (Codd, 380)</p><p></p></blockquote><p>Codd describes the previous model like this. 
If there was a data set on cars and their attributes, and if one wanted details on an individual car, they would need to perform address lookups to:</p><ul><li><p>Use an address reference to find the particular Model in a section of addresses assigned to car models</p></li><li><p>Use an address reference to find the particular Make in a section of addresses assigned to makes of the various cars</p></li><li><p>Use an address reference to find the particular Color in a section of addresses assigned to the colors of various cars</p></li><li><p>Etc.</p></li></ul><p>Instead, the user should only be required to perform one lookup for the particular car, which would then provide a collection of relationships including Model, Make, Color, etc.</p><p>The practical advantages of this are self-evident. This is Data Modeling 101.</p><p>While Codd propounded &#8220;time-varying&#8221; relationships as the secret ingredient of relational modeling, the more significant proposal was the application of relation theory to computer storage access, i.e. storage access independent of memory addressing.</p><p>From here could emerge the concept of data tables with identifying links between them.</p><p>This offers a greater degree of logical abstraction and flexibility than the CODASYL Network Data Model could. Its favorability increased in conjunction with hardware and networking advances that offset the comparative advantages of &#8220;low-level&#8221; power the Network Data Model had offered.</p><p>Relational data would define the landscape for decades, even to this day.</p><p></p><h1>The Introduction of SQL</h1><p>Codd had introduced an idea, but ideas like a relational data model need to find a way into practical implementation.</p><p>How does one build this new relational database (RDBMS)? 
How do we get past an index-based addressing system?</p><p>IBM, though slow at first on the uptake, answered with Structured Query Language (SQL).</p><p>Standardized in 1986/1987, particularly with <a href="https://www.iso.org/obp/ui/en/#iso:std:iso-iec:9075:-1:ed-6:v1:en">ISO/IEC 9075:1987</a>, SQL attempted to accomplish several new things.</p><p>First, it provided an abstraction that allowed the user to access multiple records with one command without needing the address for those records.</p><p>Second, it went beyond the DDL and DML segmentation introduced by CODASYL and identified Data Query Language (DQL) and Data Control Language (DCL) as well.</p><p>Ironically, while SQL made great strides in standardization and hardware interoperability, there remains considerable incompatibility between the particular flavors of SQL, which themselves may or may not align completely with the SQL standard itself.</p><p>We will go over a few of the main household names now. More depth could be provided perhaps in a future post.</p><ul><li><p><a href="https://www.postgresql.org/docs/current/history.html">PostgreSQL</a> was developed initially in the 1980s at Berkeley to replace the Ingres project that came before it (hence post-gres). Ironically it was not originally a SQL project and went through several name changes and functional changes over the years. In 1994, however, it gained its first SQL interpreter and has standardized around it since. It remains completely free and open source.</p></li><li><p><a href="https://dev.mysql.com/doc/refman/8.4/en/history.html">MySQL</a> was developed by a Swedish company (MySQL AB) around the same time with a release in 1995. Eventually standardized on the InnoDB storage engine, it would be swallowed up by Oracle in 2010. While MySQL is open source, it also offers proprietary licenses. 
(MariaDB was an offshoot of MySQL before Oracle&#8217;s acquisition as well.)</p></li><li><p>In the late 1980s, Microsoft sought to establish their own proprietary implementation of SQL, launching <a href="https://winworldpc.com/product/sql-server/10">SQL Server 1.x in 1989</a>, targeting Microsoft&#8217;s enterprise customer base while offering editions for other kinds of users and businesses as well.</p><p></p></li></ul><p></p><h1>The NoSQL Umbrella</h1><p>SQL was the rage of the 90s.</p><p>But as technology changes, so too do its needs.</p><p>It is a testament to the robustness of SQL and the relational model that it remained so definitive and comprehensive for as long as it did.</p><p>In point of fact, needing to categorize very distinct database types such as graph and document under the NoSQL label shows the extent to which the SQL mindset dominated the field.</p><p>By the 2000s, database technologies were moving rapidly in divergent directions.</p><p>We will name a few of the alternative paradigms that emerged and have since become fairly established in the technology world.</p><p></p><h2>Embedded (Key-Value)</h2><p>The embedded database is built around tight coupling with application or hardware logic. It is simple, linear, with low latency. </p><p>As its name indicates, the key-value table features one key mapped to a &#8220;value&#8221;, which could be anything from a simple boolean to complex, nested data. 
In this sense, it is <em>semi-structured </em>data.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fk9g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56d17da0-f552-4f40-9d9a-e5e675dbbfe6_300x203.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fk9g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56d17da0-f552-4f40-9d9a-e5e675dbbfe6_300x203.png 424w, https://substackcdn.com/image/fetch/$s_!Fk9g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56d17da0-f552-4f40-9d9a-e5e675dbbfe6_300x203.png 848w, https://substackcdn.com/image/fetch/$s_!Fk9g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56d17da0-f552-4f40-9d9a-e5e675dbbfe6_300x203.png 1272w, https://substackcdn.com/image/fetch/$s_!Fk9g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56d17da0-f552-4f40-9d9a-e5e675dbbfe6_300x203.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fk9g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56d17da0-f552-4f40-9d9a-e5e675dbbfe6_300x203.png" width="300" height="203" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56d17da0-f552-4f40-9d9a-e5e675dbbfe6_300x203.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:203,&quot;width&quot;:300,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6679,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/175102439?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56d17da0-f552-4f40-9d9a-e5e675dbbfe6_300x203.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fk9g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56d17da0-f552-4f40-9d9a-e5e675dbbfe6_300x203.png 424w, https://substackcdn.com/image/fetch/$s_!Fk9g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56d17da0-f552-4f40-9d9a-e5e675dbbfe6_300x203.png 848w, https://substackcdn.com/image/fetch/$s_!Fk9g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56d17da0-f552-4f40-9d9a-e5e675dbbfe6_300x203.png 1272w, https://substackcdn.com/image/fetch/$s_!Fk9g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56d17da0-f552-4f40-9d9a-e5e675dbbfe6_300x203.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Clescop, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons</p><p></p><p>One common design pattern is a &#8220;Single Table&#8221; design which allows the 
user to forego multiple table lookups in favor of higher-performance queries, given the table&#8217;s data model is properly architected.</p><p>The embedded KV had existed in various forms even before the primary SQL vendors.</p><p>One could argue it was first standardized in its own right by Berkeley DB (BDB), which was released in 1994 as part of Berkeley&#8217;s own Unix derivative, BSD. While it is difficult to track down primary sources around its launch, <a href="https://www.usenix.org/legacy/events/usenix99/full_papers/olson/olson.pdf">here is a presentation of its logic</a> from 1999.</p><p>BDB was acquired by Oracle in 2006, but its model paved the way for many competitors to develop their own ACID-compliant, high-powered DBMSs.</p><p>Amazon DynamoDB was one such answer, <a href="https://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html">announced by then CTO Werner Vogels in 2012</a>.</p><p>Each of the major cloud providers has come up with their own proprietary implementation in the years since.</p><p></p><h2>Document</h2><p>The document database is a more nuanced approach to <em>semi-structured </em>data, further defining the key-value model.</p><p>The document database (not to be confused with a file server that stores document files) revolves around a very particular sense of <em>document</em>.</p><p>This is tricky to define, because there are no generally accepted vendor-agnostic standards or protocols around this form of database.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> This ambiguity can turn this category more into a catch-all than one with hard-and-fast boundaries.</p><p>What is common across the major product families such as MongoDB, DocumentDB, CosmosDB, and Firestore are a few key dimensions:</p><ul><li><p>A document is a standalone JSON-esque aggregation of data.</p></li><li><p>A document is the unit of both storage writes and 
storage retrieval with a unique address inside a larger set.</p></li><li><p>A document is schema-flexible yet indexable by fields.</p></li></ul><p>Its primary strength lies in fast iteration, particularly through hierarchical data. All the data attributes associated with an individual customer, for example, can be coalesced into a single document unit.</p><p>This provides for stronger horizontal scaling but also richer query power than a traditional key-value store offers.</p><p>However, this does come at the cost of poor performance for cross-document constraints and queries; it works best when a document can function as a self-contained microcosm.</p><p>To trace out its history, we can find antecedents in <a href="https://web.archive.org/web/20050112024610/http://www-128.ibm.com/developerworks/lotus/library/ls-NDHistory/">Lotus Notes (1989)</a> and XML DBs like <a href="https://exist-db.org/exist/apps/homepage/timetunnel.html">eXist-db (2000)</a>.</p><p>One Lotus Notes developer left IBM to create a self-funded instantiation of what is generally considered the modern document database: Apache CouchDB.</p><p>From here, the major vendors would kickstart their own versions, but CouchDB has seen widespread success and adoption, even as the backend for the npm registry, <a href="https://blog.npmjs.org/post/167524001790/couchdb-vulnerabilities-and-the-npm-registry.html">vulnerabilities notwithstanding</a>.</p><p></p><h2>Graph</h2><p>While the network data model was set aside in favor of relational data in the 1980s, in the early 2010s graph data would see a revival, driven by widespread social network adoption and the need to think in terms of more network-based data persistence and access.</p><p><a href="https://neo4j.com/news/the-origin-of-neo4j/">Neo4j</a> paved the way for graph data technology by storing data elements as nodes, connected through edges.</p><p>As with document databases, there are no official standards around graph database technology; however, 
efforts have been made to define these through the <a href="https://www.gqlstandards.org/">Graph Query Language (GQL)</a> and the W3C&#8217;s <a href="https://www.w3.org/TR/rdf11-concepts/">RDF graph model</a>, which articulates graph relationships as subject-predicate-object triples.</p><p>While graph databases are not necessarily built to handle bulk OLAP workloads, they remain very powerful in terms of first-class relationships, low-latency queries (at least when well architected), and remarkably flexible schema evolution.</p><p></p><h2>Vector</h2><p>Vector databases are extremely hot at the moment, with the genAI bubble dominating the market.</p><p>Vector data is unique in that it is not a formal protocol or query language but rather mathematically and algorithmically defined in how it handles the dimensionality of data.</p><p>So it is not quite right to frame SQL and vector databases as alternatives, as vector implementations can live, for example, in Postgres with <code>pgvector</code>.</p><p>This goes back further than one would expect, to the proposal of <a href="https://dl.acm.org/doi/10.1145/276698.276876">Approximate Nearest Neighbor (ANN)</a> search in 1998. The basic idea of ANN is that it accepts results that are &#8220;close to the mark&#8221; rather than demanding total precision in what is returned.</p><p>This was combined with mathematical representations of data as <em>vectors</em>, where each dimension (of which there can be very many) is considered a <em>feature</em> of the data.</p><p>The tech giants each took their stab at this in the mid-2010s, with Spotify&#8217;s Annoy and Facebook&#8217;s FAISS, for example. 
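</p><p>To make the vector idea concrete, here is a small, hedged sketch in Python (the function names and the toy three-dimensional &#8220;embeddings&#8221; are invented for illustration): a brute-force, exact nearest-neighbor search ranked by cosine similarity, which is the baseline ranking that ANN libraries such as Annoy and FAISS approximate at far greater speed.</p>

```python
import math

def cosine(a, b):
    # Cosine similarity: how closely two feature vectors point in the
    # same direction, independent of their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, corpus, k=2):
    # Exact k-nearest-neighbor search: score every vector in the corpus
    # and keep the k most similar. ANN methods approximate this ranking
    # without the full linear scan.
    ranked = sorted(corpus, key=lambda doc_id: cosine(query, corpus[doc_id]), reverse=True)
    return ranked[:k]

# Toy corpus: each dimension stands in for one "feature" of the data.
corpus = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.8, 0.2, 0.1],
    "doc-c": [0.0, 0.1, 0.9],
}
print(nearest([1.0, 0.0, 0.0], corpus))  # -> ['doc-a', 'doc-b']
```

<p>At realistic scale (millions of vectors, hundreds of dimensions each), this linear scan is precisely the cost that ANN index structures exist to avoid. 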
But <a href="https://arxiv.org/abs/1603.09320">Hierarchical Navigable Small World (HNSW)</a> graphs would come to define the field.</p><p>There are of course other similarity search techniques, but HNSW is the basis for many vector implementations across major vendors to this day, including Elasticsearch, MongoDB Atlas, and others.</p><p>This, combined with the latest generation of LLMs, has in large part produced the AI revolution that is currently underway.</p><h1>Futures in DBMS</h1><p>The DBMS has come a long way since its introduction in the era of magnetic tape.</p><p>As data continues to grow horizontally, vertically, and in dimensionality, technology will of course innovate to meet the demands placed upon it.</p><p>There are many considerations to take into account here, but one interesting prospect is the advent of the <a href="https://arxiv.org/html/2408.03013v1">AI Autonomous DB</a>, which would further abstract the human from the loop of database development, consumption, and administration.</p><p>As with all things AI at the moment, we will see whether we continue along the asymptotic curve as the major vendors fine-tune their models, or whether another paradigm shift is in the cards to upset the playing field.</p><p></p><div class="footnote" 
data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>At least that I could find publicly available, please send to me if there are.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Planning Technology Delivery: Thinking it through Backwards]]></title><description><![CDATA[What distinguishes technology delivery from software engineering]]></description><link>https://tech.ngperrin.com/p/planning-technology-delivery-thinking</link><guid isPermaLink="false">https://tech.ngperrin.com/p/planning-technology-delivery-thinking</guid><dc:creator><![CDATA[Thinking through the Cloud]]></dc:creator><pubDate>Wed, 24 Sep 2025 16:31:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XQiW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XQiW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XQiW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp 424w, https://substackcdn.com/image/fetch/$s_!XQiW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp 848w, 
https://substackcdn.com/image/fetch/$s_!XQiW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp 1272w, https://substackcdn.com/image/fetch/$s_!XQiW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XQiW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp" width="1200" height="833" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50746,&quot;alt&quot;:&quot;Reverse waterfall&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/174441878?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Reverse waterfall" title="Reverse waterfall" srcset="https://substackcdn.com/image/fetch/$s_!XQiW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp 424w, 
https://substackcdn.com/image/fetch/$s_!XQiW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp 848w, https://substackcdn.com/image/fetch/$s_!XQiW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp 1272w, https://substackcdn.com/image/fetch/$s_!XQiW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8ae7da-7086-443e-839e-8eb00370e8dc_1200x833.webp 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">A reverse waterfall in action</figcaption></figure></div><p>(Image credit: 7NEWS Sydney)</p><p></p><p>Technology delivery is, if nothing else, the traversal of that tricky gap between the current state and the desired state of technology operations.</p><p>Of course, to even have a desired state presumes that you have invested sufficient time and resources in <em>discovery</em> to identify both the current and the desired state to enough of the people involved.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> You have pursued that identification far enough to establish a broadly accepted &#8220;Definition of Done&#8221;.</p><p>Let us say you have arrived at a reasonable initial expectation of the desired state, with acceptance from your stakeholders. And you have a sufficiently strong grasp on the current realities of the organization, its teams, vendors, etc. to begin to plan that journey from start to finish.</p><p>One of the marks that distinguishes the architect or technical delivery manager from the software engineer is their ability to think over what this entails, and to exercise it in practice.</p><p>They make things happen.</p><p></p><h2>Environment Management</h2><p>One starting point for thinking over technology delivery is the question of environment management.</p><p>Many who write software, even with years under their belt and enviable comp packages, have simply spent their entire career in localhost.</p><p>When they have met their acceptance criteria locally and passed PR review, they consider their work finished.</p><p>Whatever goes wrong after localhost is someone else&#8217;s problem because it worked locally, a position often reiterated by their business stakeholders.</p><p>But to think of software development through the lens of only one environment is a radical limitation of software engineering.</p><p>Whether the mindset itself is good or bad, to ignore the multi-environment dimensions of cloud applications is debilitating from a systems engineering perspective.</p><p>And there are quite a few environments out there sometimes.</p><p>sandbox, rnd, dev, qa, staging, nonprod, preprod, uat, prod-a, prod-b, bobs-env-do-not-delete</p><p>Depending on the organization, the path from local to production can turn out to be a marathon, whose procedures may or may not be documented or socialized.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>It is this gap that can stall projects for months, when the question &#8220;How do we get it to production?&#8221; is asked at the end rather than the beginning.</p><p>Every organization that has something running in production has a process, whether they know it or not.</p><p>For every environment the organization has between local and production you will need to identify:</p><ul><li><p>Source code/build artifact release process</p></li><li><p>Cloud resource management</p></li><li><p>Environment variable/secret management</p></li><li><p>Credential/key management</p></li><li><p>Domains</p></li><li><p>QA process</p></li><li><p>GRC requirements</p></li><li><p>Approval chain and change management.<br></p></li></ul><p>Organizations with a robust platform strategy can significantly minimize and abstract away the overhead for all these criteria, offering mature architecture and service delivery patterns for you to extend.</p><p>Other times, organizations can be rather disheveled in handling this, whether they are new or old. 
This is especially true if the organization is migrating between IT service delivery strategies or cloud providers while also introducing new workloads at the same time.</p><p></p><p>If you are working through a project one step at a time, perhaps in an Agile flow with rapid, narrow, iterative development, you do so at the cost of potentially deferring large, expensive questions around architecture and technology delivery.</p><p>The complexity and time required to corral stakeholders, release requirements, and the input of QA, security, infrastructure, or a Change Advisory Board can crush initiatives that are code complete but have nowhere to go.</p><p>Questions that don&#8217;t occur to someone until the end should have been posed at the very beginning. This is one point Waterfall has over Agile (at least in practice, if not in theory).</p><p>Hence a key tenet of technology delivery excellence is thinking through it backwards, so to speak.</p><p>You need to be able to envision not just <em>what </em>it needs to be, but <em>where </em>it needs to be and <em>how </em>it gets there.</p><p></p><h3>Thinking it through Backwards</h3><p></p><p>It is simple enough to offer this phrase &#8220;think it through backwards&#8221;, but what does it look like to think in this way?</p><p>The key is that this is a mindset rather than a set of knowledge that can simply be listed.</p><p>Environment management is one of the most generic, comprehensive examples of how technology delivery needs to be thought through from finish to start as well as from start to finish.</p><p>But you can extend this in any number of directions.</p><p>The possibilities are infinite, and this itself also tends to confuse professionals.</p><p>What are all the requirements that we need to engineer for to ensure technology delivery to a sufficient degree?</p><p>To reiterate a basic truism: you simply cannot check all the theoretical boxes.</p><p>There is no such thing as the infinite, perfect app.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>The application is a set of finite instructions and requirements.</p><p>Well-engineered technology depends upon embracing the cliche that design is the adjudication of competing trade-offs.</p><p>To be able to think through it backwards, you must have a very clear sense of what that desired end state is and architect the software or platform accordingly.</p><p>This is something acquired naturally and iteratively through work experience, if you are willing to pay attention and learn from your missteps.</p><p>For some organizations, TCO (total cost of ownership) is critical to accepting a new workload. Other organizations, startups in particular, waive this as a non-material concern until an investor or board flags the OpEx.</p><p>GRC can be a primary concern in some organizations, a secondary or tertiary consideration in others.</p><p>QA and SRE may have the right to veto a release in some companies, or live in perpetual reactive mode in others.</p><p>The stakeholders you are working with, as well as your project sponsor and their chain to leadership, should most directly inform the prioritization of these factors, not only in architecting the application but in gaining stakeholder consensus to ensure its delivery.</p><p>You can extend past experience to inform these questions, but you should not trust it completely. Things change between and within organizations. Stakeholders move around or out. Budgets change. Technology trends, vendors, and market forces can impact the stakes of your technology delivery.</p><p>This is why even if one were to catalog all of the various possible &#8220;finish line&#8221; items you should think about, it would not remain clear which are material or relevant for your use case.</p><p>Technology delivery is a continuous learning and planning cycle. 
You have to stay up to date on what is going on among people, process, and technology. Losing sight of these can invalidate your whole plan.</p><p>You must spin a spider&#8217;s web to lightly but tightly tie together often vastly different departments. And that web must be maintained as requirements shift, turnover occurs, and even as project sponsors change, so long as you remain committed to your mission of executing delivery.</p><p>Keep asking yourself where the north star for your project is and how you can reverse engineer your path to getting there.</p><p></p><h3>Questions to Ask Yourself</h3><p>To provide at least some specificity, here are some questions you could consider as part of adopting this mindset:</p><ul><li><p>What is the most immediate step that would come before this final objective?</p></li><li><p>Who or what is a gatekeeper to my ability to cross a delivery milestone?</p></li><li><p>Are there comparable examples to my initiative, a tale of success that provides a clear chain of execution for technology delivery?</p></li><li><p>Are there lessons learned from that comparable example, or have other material factors changed? </p></li><li><p>Conversely, have there been any comparable earlier initiatives like this one that have failed or been suspended? For what reasons?</p></li><li><p>Based on the friction or time investment certain questions face, which considerations are worth deferring until the &#8220;rubber hits the road&#8221;? To what extent can I carry out a best-effort analysis and document decision-making here?</p></li><li><p>Which stakeholders do I need to involve now to minimize later delays? Which stakeholders even want to be involved now? How do I document this?</p></li><li><p>Who is most knowledgeable in this organization or BU about getting things done? 
Seek the high signal-to-noise ratio individuals who can convey accurate information efficiently and forthrightly.</p></li><li><p>If working with a new technology library or provider, what open source information is available around production bugs (e.g. GitHub issues), and which ones are most material for my use case?</p></li><li><p>Based on team and personnel resources, how they are allocated, and the business sponsors of that allocation, what will be the relative velocity of various teams in completing their objectives? Will some drag behind others? How can this be remediated or documented?</p></li><li><p>What teams and which team personnel need which level of detail on architecture, delivery particulars, etc.? Do not inundate the junior dev with timelines. Do not flood the product owner with architecture. Know the audience and consider which items of information are crucial for individual personnel to meet their work commitments. Technical verbosity more often leads to losing someone&#8217;s ear than gaining it.</p></li></ul><div><hr></div><p>These may come as obvious questions to some, but they are easy to forget in practice.</p><p>Waste emerges in many forms across technology spend.</p><p>Lack of delivery alignment is one of the most common and preventable contributors to waste and delay that I have observed in my time. </p><p>A robust and clear-headed sense of how to get X to Y is not an easy thing to communicate and learn, but such a capacity can save organizations immensely.</p><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Desired state is of course also fairly difficult to discern. 
Practitioners know how hard it is to get multiple stakeholders to align on an end state, or for that matter even for a single stakeholder to stick to the same picture for more than one meeting.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>There is an irony that while so much of computing technology has been standardized over the decades, there is not a single, definitive framework or model for environment management in depth. No RFC, no dedicated NIST standard. ITIL and of course NIST SP 800-53 explicitly require a separation between &#8220;prod&#8221; and &#8220;non-prod&#8221;, but that is the extent of the control.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>One example of such a rabbit hole: Is your document-processing microservice able to reconstruct corrupted document payloads with non-UTF encodings for a dead language, with readily configurable integrations for Amazon Kinesis, Kafka, RabbitMQ, and IBM MQ, all while developing its own post-quantum encryption algorithm to secure data? 
(With an in-process, automated counter-pentest agent)</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Notes after Passing All 12 AWS Certification Exams]]></title><description><![CDATA[Guides, hints, tips on general methods for passing all AWS certification exams]]></description><link>https://tech.ngperrin.com/p/notes-after-passing-all-12-aws-certification</link><guid isPermaLink="false">https://tech.ngperrin.com/p/notes-after-passing-all-12-aws-certification</guid><dc:creator><![CDATA[Thinking through the Cloud]]></dc:creator><pubDate>Fri, 15 Aug 2025 03:06:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lTW1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lTW1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lTW1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png 424w, https://substackcdn.com/image/fetch/$s_!lTW1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png 848w, https://substackcdn.com/image/fetch/$s_!lTW1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lTW1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lTW1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png" width="845" height="531" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:531,&quot;width&quot;:845,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22112,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/169317253?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lTW1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png 424w, https://substackcdn.com/image/fetch/$s_!lTW1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png 848w, https://substackcdn.com/image/fetch/$s_!lTW1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lTW1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9aa9d08-3521-4947-9ca7-f4979cb4a12a_845x531.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><p>I recently had the dubious pleasure of passing through the gauntlet of AWS exams and emerging certified in all 12 AWS certifications (plus some other retired ones).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> I am fortunate enough to have passed them all on the 
first take, though the Networking Specialty was a close call.</p><p>While the Internet may offer you an abundance of material on the <strong>subject matter</strong> of these exams, I do hope to provide some thoughts and reflections from my own experience that may guide your path as you embark on the AWS certification process.</p><p>Jump to a question of interest to you, and I hope it proves helpful. Good luck.</p><p></p><h1>Why are you taking an AWS exam?</h1><p>SEO content farms have sections with this title as well, but there is a point to asking this question here.</p><p>You must weigh the cost of these exams.</p><p>Unless your employer can bankroll the exam process, that cost falls on you. Even if they are paying the exam fee, you must still invest time, and not just time in a general sense, but that valuable quotient of highly focused and highly energized mental state that can be hard to come by.</p><p>If you are cognitively drained or worn out, you simply will not be able to prepare for or take the exams, even if you engage with AWS services on a daily basis.</p><p>Furthermore, if you do pass the exam, the AWS certification will last three years. 
In most cases, you will need to sit the exam again and pay the exact same fee in three years&#8217; time, just to hold on to the title. This is the AWS version of Continuing Education Credits.</p><p>There are at least five distinct, general reasons to take the exam:</p><h3>1. Career Launch</h3><p>This is in my view the most justified reason to take the AWS exams.</p><p>If you need a resume bump to initiate a career transition into technology, this is a legitimate strategy. It was mine.</p><p>I was a humanities major with a few years of professional experience in nonprofit operations. I had taught myself a few programming languages in that time.</p><p>But the critical push that enabled my true technology career launch was having freshly minted AWS certifications to decorate my resume and profile.</p><p>It started with small upwork.com projects and escalated from there.</p><p>However, to repeat some commonly cited advice, certifications do not get you the job, but they do get you the interview. And that is the crucial step often enough.</p><p>So you should not expect to be handed a job or comfortable contract upon earning your AWS laurels, but they can smooth your path.</p><p></p><h3>2. Career Advancement</h3><p>Let&#8217;s say you have landed, so to speak, in your desired career track, one that affords you progression toward your professional north star.</p><p>There is an impression that AWS certifications can provide the opportunity for a promotion or vertical move.</p><p>From my point of view, I do not think this is true, unless the position you are aiming for specifically and explicitly names the AWS certification in the JD.</p><p>I say this from two vantage points.</p><p>First, in my experience from dozens of interviews for client projects or job roles, it is very rare for the initial screener, the second round, or the round table to ask about certifications. 
The only occasions it has come up are when third-party recruiters remark that a specific certification is on the JD. There is no general interest in cert collectors.</p><p>Second, I have had a number of conversations with various online HR experts and accounts who have deep history with recruiting pipelines. None of them indicated that certifications played a decisive role in their hiring hierarchy.</p><p>Experience, societies &amp; publications, and even degrees play a far greater role on that front.</p><p>So if you are already &#8220;in&#8221; technology, it is in many cases far more strategic to invest in resume-building projects or society memberships than in AWS or other cloud certifications.</p><h3>3. Learn AWS Better</h3><p>One might want to leverage the AWS certification process as a way to learn AWS better.</p><p>This is probably the least efficient justification for taking an AWS exam.</p><p>You are far better off exploring the plethora of AWS-provided workshops and resources or the equally bountiful array of challenges and puzzles published by individuals or communities.</p><p>The AWS documentation even offers example use cases for just about every facet of the platform. And you can evolve beyond their &#8220;hello world&#8221; scenarios with ease.</p><h3>4. APN or Sales Benefits</h3><p>The Amazon Partner Network (APN) offers special tiers and branding for those partners with more certified individuals, an internal incentive to push the certification program.</p><p>If acquiring an AWS certification advances your sales machine, that is fine, as long as you understand the inputs you are providing to make that happen.</p><h3>5. 
For the challenge</h3><p>This was my reason for this most recent gauntlet.</p><p>I floated a while between 5 and 7 AWS certs before letting them lapse.</p><p>But after concluding my previous role and needing a productive outlet while new prospects emerged, I decided to pursue all 12 AWS certifications.</p><p>This is truly a challenge, but a great discipline builder: you know you are investing dozens of hours preparing for a somewhat arbitrary exam that is not connected to hands-on experience, preparation that drains your mind and inhibits your ability to be productive on other things.</p><p>And the exams cost money, even if the fees are potentially tax-deductible.</p><p>I will be honest and say that I do not plan to renew these certifications outside the Professional level unless the incentive structure changes in some way.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> </p><p>For myself, at least. You may feel differently in considering your path.</p><p></p><h2>Which AWS exams do you plan to take?</h2><p>There are several considerations worth thinking through with this.</p><p>As I mentioned above, HR and recruiters are very targeted about the certifications they are looking for, if any at all. Certification collecting is the equivalent of putting buckets out in the rain to collect water.</p><p>It is better to be focused and strategic about which AWS exams you choose. Even an individual one is fine. Or you could do targeted combinations like:</p><ul><li><p>Solutions Architect Associate + Professional</p></li><li><p>AI Practitioner + Machine Learning Associate + Machine Learning Specialty</p></li><li><p>CloudOps Associate + Developer Associate + DevOps Professional</p></li></ul><p>This can be advantageous for a couple of reasons.</p><h3>1. Renewals</h3><p>When Professional certifications renew, their Associate counterparts renew as well. 
Unlike Azure, AWS makes you pay the full fee and retake the whole exam to renew your certification. However, if you stick with a particular bundle, like the ones listed above, you would only need to renew the one Professional-level exam to hold on to the Associate level.</p><h3>2. Content Overlap</h3><p>If you are taking particular exams in rapid succession, this is also useful.</p><p>While it occurs less frequently than you&#8217;d expect, closely coupled AWS exams will feature near-identical questions.</p><p>I took the Machine Learning Associate and Specialty back-to-back, and I was surprised by how much the two exams had in common. (More on that later.)</p><p>Because the exams are more or less focused on your ability to recall AWS best practices and patterns, this can reduce the amount you need to learn during preparation.</p><h3>3. Certification Shelf Life</h3><p>There is a running joke that Google likes to shut down products and services nearly as often as they start them.</p><p>AWS has in the past few years seemed to adopt this methodology, not least of all with its certifications.</p><p>In 2022, I took and passed the AWS Database Specialty and AWS Data Analytics Specialty, both of which have since been decommissioned. The SAP on AWS Specialty and Alexa Skill Builder Specialty exams have also both been shut down since I first started exploring AWS certifications.</p><p>It is mid-2025, and I am convinced that the AWS Machine Learning Specialty will be put down soon as well.</p><p>Some certifications are simply more forward-facing than others. In 2025, AI Practitioner is one of those. The SysOps Administrator is getting rebranded as CloudOps Administrator.</p><p>You can also tell in the exams themselves which ones are newer. 
While the curators who review AWS certification questions do a remarkably good job making sure outmoded services or approaches do not linger on in exam questions, you can still tell when an exam iteration was launched.</p><p>The Data Engineer Associate and Machine Learning Associate exams, though at the Associate level, have a far more up-to-date view of AWS offerings and cloud strategy than their Specialty counterparts.</p><p>Pay attention to that when deciding which ones to take, if you like.</p><p></p><h2>Venue: How do you plan to take the exam?</h2><p>This is more important than it sounds.</p><p>At this time, you can take an AWS exam either from a Pearson VUE testing center or from the comfort of your home.</p><p>I am honestly on the fence about which of these two is better.</p><p>The online option can be a hassle because the Pearson VUE proctors are radically stringent about protocol in extremely arbitrary ways.</p><p>If you take the exam from your workstation, you will have to dismantle much of your desk to clear it off, remove extra monitors, and make sure no &#8220;content&#8221; is visible from any angle in the room.</p><p>The first time I took the DevOps Professional exam, the proctor noticed a painting on the back wall and demanded I take it down, even though it was nailed in. This was all the more surprising because I had taken three other AWS exams from that same room where that was not a concern.</p><p>Beyond the initial screening, you are monitored rather closely. They will interrupt your exam and call you if you have your hand over your mouth during the two- to three-hour exam window.</p><p>But most importantly, you are not allowed to leave the room whatsoever. 
Restroom breaks are simply not permitted with the online proctoring option.</p><p>This becomes far more obtrusive during the three-hour examinations (which include another thirty minutes for check-in), and I have had to rush to finish exams early because of this.</p><p>By contrast, the in-person test centers can vary rather dramatically. If you live in a metropolitan area, you can see for yourself how much this can be the case. Some are in basements, while others have windows. Some have sound machines. The proctors can have varying dispositions, ranging from conciliatory to neutral to DMV hostile.</p><p>The main factor for me is your neighbors in the shared exam room. </p><p>You will most likely be in a room with other exam takers sitting close beside you.</p><p>For whatever reason, these professional certification exams tend to attract individuals who are unable to resist making strange noises or talking to themselves during the exams. This was distracting for me during the examinations I took in person, to the extent that I could not help looking over to see if the individual&#8217;s unusually punctuated breathing was a faint request for medical assistance.</p><p>But you do get restroom breaks at these testing centers, and you are far less at risk of having your exam disqualified for arbitrary reasons than if you take it from your residence.</p><p>So keep that in mind.</p><p></p><h1>Strategies</h1><p>Now that we have taken into account a few preliminary questions, I would like to suggest a few strategies for preparing for and taking these AWS certification exams.</p><h2>Test Prep - AI Assistants</h2><p>First, I would like to say that Tutorials Dojo is by far the gold standard when it comes to AWS certification practice exams. 
They offer the most rigorous and realistic simulation of the AWS exam experience, to the point that I am surprised they are allowed to simulate such an accurate reflection of exam content.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> </p><p>Beyond that, though, I am deeply indebted to AI assistant tools for helping prepare me for the most difficult Specialty exams. If you supply the appropriate AI assistant with the exam content guide and give it a few sample questions, it does an excellent job of generating practice questions for you.</p><p>I have done this first to gauge general preparedness, then to isolate areas for improvement, then to cover specific domains, services, or service features I needed to brush up on.</p><p>I will not provide the prompts or context I used to achieve this, but if you are interested, you can invest some time in working this out yourself.</p><p>However, this does come with a strong caveat. If you provide it with practice questions yourself, it has a moderate chance of hallucination.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>But if you are verifying the answers yourself and continuing to feed it documentation, you will find it a strong partner, coaching you question by question on the knowledge required for the exam.</p><h2>The Exams Play Favorites</h2><p>With any professional certification, there is an ongoing debate over how much that certification really does equate to &#8220;real life experience&#8221;. 
In my eclectic experience across developer, DevOps, cybersecurity, and management roles, I will say the AWS certification exams really do very little to prepare you for the real world.</p><p>Part of this is because (unlike most exams) the real world gives you free access to the Internet, documentation, and now, for the most part, AI assistants.</p><p>But another part of this is because the multiple-choice structure of the exam lends itself to a fairly opinionated outlook on what constitutes the &#8220;right&#8221; answer. This can make a world of difference depending on the exam you are taking.</p><p>The most overt example of this is the AWS Data Engineer Associate and AWS Machine Learning Associate exams.</p><p>In the AWS Data Engineer Associate Exam, the answer that names AWS Glue is nearly always the right answer. That service is the gold standard for that exam.</p><p>In the AWS Machine Learning Associate Exam, the answer that names a SageMaker tool is nearly always the right answer, especially if AWS Glue is an alternative choice. Some questions in the ML exam will even make AWS Glue sound appealing, only to trick you. The answer explanations on those practice exams explain why SageMaker &#8220;is a better fit&#8221;.</p><p>So get to know your exam&#8217;s favorites, because each exam really does revolve around a fairly opinionated set of tools. Even on the wide-ranging Solutions Architect Professional, there is a gamut of AWS services that will simply never be mentioned, by virtue of being too obscure to be noteworthy.</p><h3>The Types of Questions</h3><p>Aside from subject domain, there are a number of different kinds of AWS questions. 
You can actually identify the answer to the question based on a few &#8220;meta&#8221; considerations, without knowing anything about the subject matter or content of the question.</p><p>You can categorize 90% of AWS questions as:</p><ol><li><p>Do you know AWS basics?</p></li></ol><p>There is a question about IAM and S3 in just about every exam.</p><ol start="2"><li><p>Do you know the most AWS-managed approach?</p></li></ol><p>AWS wants vendor lock-in. Therefore your answer should also be the most conducive to vendor lock-in. The custom, roll-your-own solutions are almost always the wrong answer. Picking the most AWS-managed option is a very likely right answer, excepting a few edge cases.</p><ol start="3"><li><p>Do you know which suggested approach aligns with the key priority for this question?</p></li></ol><p>Many of the intermediate- to expert-level questions will have a paragraph or two introducing the situation, but it is always best to start with the one-sentence closer at the end of the question. This will identify the key priority for the question:</p><ul><li><p>&#8220;What is the MOST cost efficient approach?&#8221;</p></li><li><p>&#8220;What is the MOST operationally efficient approach?&#8221;</p></li><li><p>&#8220;What is the MOST secure option?&#8221;</p></li></ul><p>It is absolutely critical to use this lens when scanning both the questions and answers. 
There are a handful of cases in exam questions when a generally acceptable answer is actually incorrect because the question is looking for a more specific priority.</p><ol start="4"><li><p>Do you know this technical domain independently of AWS?</p></li></ol><p>There are a handful of questions surrounding encryption or DNS or machine learning algorithms which you will immediately have an answer to if you have a theoretical or practical grasp of the underlying concepts.</p><p>Technical knowledge irrespective of AWS will give you a shortcut to the answer.</p><ol start="5"><li><p>Have you lived through this very specific scenario? If so, what does the AWS UI show at step 3?</p></li></ol><p>There are a small handful of edge-case questions which you can tell are supposed to be the most difficult ones out there. Unless you have battlefield experience with that specific item they are asking about, you will not know the very specific thing they are asking for.</p><p>If you in fact have had that experience, you will feel rather clever for knowing the answer.</p><p>But these are a small proportion, and you do not need to get all of them right to pass the exam, unless you are struggling on basic or intermediate items.</p><h3>Miscellaneous</h3><p>A few other test-taking strategies I would like to note for the exam-taker:</p><ul><li><p>If the question is particularly long, it is worth starting by reading the answer choices. Sometimes the answer choices are also long and nearly identical, but this is a good thing. 
You can use this to identify what makes each answer choice <em>different</em> from the others, and this makes it easier to eliminate the incorrect options.</p></li><li><p>If you are struggling with a question, it is generally best to try to eliminate the answer choices down to two options (or sets of options), then review the question and see whether (1) you can identify a key architectural concern like cost optimization or security or (2) you can visualize the specific technical steps, because there is often a trick that can eliminate the wrong answers.</p></li><li><p>Depending on your pace and time remaining, it is okay to skip questions without reading them and then return to them later. If I was two-thirds through an exam and encountered a three-paragraph question, I would invariably flag it and come back at the very end when I knew I could focus on it. Not all questions are equal in difficulty, and it may be worth sitting on the most difficult ones until there are no others remaining.</p></li><li><p>Remember that you do not need to get every question right, and also remember that roughly ten questions on your exam are not even scored. Be familiar with the estimated pass rate metrics for your exam so you have a comfortable sense of how many questions you can miss. Aiming for 100% accuracy will do more to obstruct you psychologically than to help you hit your goals. Think in terms of 80:20 Pareto optimization to reach the threshold to pass the exam.</p></li><li><p>Sometimes the answer to one question may accidentally be revealed in another exam question, further down the line. This has happened to me more than once. If you realize during the exam that you have gaps, try to construct answers from the knowledge base provided to you in the exam questions.</p></li><li><p>Know your reading speed. The word counts on these exams are heavy, especially if you are on the Professional exams. 
That is honestly one of the largest contributors to the cognitive difficulty of the exam experience (and also why the Cloud Practitioner exam is a breath of fresh air by contrast). If you are a fast reader, that certainly works to your advantage. If English is not your first language, make sure to exercise the testing accommodation for extended time. Regardless, reading the end of the question and answers first before reading the bulk of the question can help save you some time and reading energy on each individual question.</p></li><li><p>Unless you are taking a Foundational exam, you will not be told your results immediately upon finishing the exam. So do not expect to find out right away. However, for all the exams I took this year, I got the results in about 12 hours or so, depending on whether I took the exam online or at a testing center.</p></li></ul><p></p><h1>Conclusion</h1><p>These words are offered up in the hopes that they may prove useful for those considering AWS certification exams or preparing for them.</p><p>My experience is my own, and I cannot pretend that you will have the same experience if you choose to go through a gauntlet of twelve, three, or even a single exam.</p><p>I merely wish you the best of luck on your endeavor as you build your cloud career.</p><p></p><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>A screenshot of my AWS IQ profile. AWS IQ is a platform for third-party freelancers and agencies to connect on specific projects.</p><p>AWS IQ is an excellent venue to advertise services as an IC, and it does provide an ostentatious view of AWS certifications. 
Yet, see footnote 2.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cE0e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb841d51d-6f01-448a-a279-e1bce43c6904_1062x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cE0e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb841d51d-6f01-448a-a279-e1bce43c6904_1062x1104.png 424w, https://substackcdn.com/image/fetch/$s_!cE0e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb841d51d-6f01-448a-a279-e1bce43c6904_1062x1104.png 848w, https://substackcdn.com/image/fetch/$s_!cE0e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb841d51d-6f01-448a-a279-e1bce43c6904_1062x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!cE0e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb841d51d-6f01-448a-a279-e1bce43c6904_1062x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cE0e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb841d51d-6f01-448a-a279-e1bce43c6904_1062x1104.png" width="668" height="694.4180790960452" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b841d51d-6f01-448a-a279-e1bce43c6904_1062x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1104,&quot;width&quot;:1062,&quot;resizeWidth&quot;:668,&quot;bytes&quot;:213008,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/169317253?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb841d51d-6f01-448a-a279-e1bce43c6904_1062x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cE0e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb841d51d-6f01-448a-a279-e1bce43c6904_1062x1104.png 424w, https://substackcdn.com/image/fetch/$s_!cE0e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb841d51d-6f01-448a-a279-e1bce43c6904_1062x1104.png 848w, https://substackcdn.com/image/fetch/$s_!cE0e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb841d51d-6f01-448a-a279-e1bce43c6904_1062x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!cE0e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb841d51d-6f01-448a-a279-e1bce43c6904_1062x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You will note the two Foundational exams do not display here, but this is made up for by the discontinued Database and Data Analytics Specialty exams which I had taken previously.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I was planning on expanding my consulting leads pipeline through AWS IQ&#8212;their official service for third-party experts. However, while I was in the midst of passing these exams, I discovered that AWS announced its intent to shut down the platform. 
So that motive was rendered nugatory.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>That said, in 2025, their practice exams are starting to show their age. I do not blame them for being unable to keep up with the AWS question machine, but you do notice the practice tests they have for older AWS exams are loaded with questions that no longer make sense or line up with the modern AWS experience.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>There are tricky questions in the AWS exams, but, for example, I was surprised not only that `o3` failed to identify what envelope encryption is, but also that, when I pulled the same question up today, `GPT 5` failed in the exact same way, selecting the answer which indicated the top-level key should be encrypted.</p></div></div>]]></content:encoded></item><item><title><![CDATA[The Purpose of the Prototype]]></title><description><![CDATA[The Art of Requirements I - Pet Projects vs. 
POC]]></description><link>https://tech.ngperrin.com/p/the-art-of-requirements-gathering</link><guid isPermaLink="false">https://tech.ngperrin.com/p/the-art-of-requirements-gathering</guid><dc:creator><![CDATA[Thinking through the Cloud]]></dc:creator><pubDate>Sat, 28 Jun 2025 21:20:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hJkW!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5ce81b0-f183-4bbd-a2b4-78bc2ac0c6ac_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recent advances in LLM technology have raised the perennial specter of job displacement for those knowledge workers concerned with the development of new software platforms and technologies.</p><p>Yet unless such AI agents can break past the asymptotic boundary of their current intelligence and move a gradation or two closer to the threshold of AGI, the development and maintenance of software systems may be assisted but not fully usurped by agentic workflows.</p><p>Generative AI can perform a variety of low-level tasks to an acceptable level within a condensed amount of time, but one dimension that even the latest and greatest models at this time are nowhere near managing is architecture, software and otherwise.</p><div><hr></div><p>Solutions architecture is concerned not with the building of technology for its own sake or to satisfy the tinkering itch of its engineer, nor with a perfected piece of technology living in an isolated cell, nor even with prototyping an MVP to force the hand of stakeholders into technology adoption.</p><p>Each of these can and often does happen.</p><p>Solutions architecture ought, however, to be charged with the responsibility of building not just well in the abstract but in the concrete. 
Just as a building cannot be erected without considering the slope or stability of the ground or the surrounding buildings, so too must digital solutions pay nearly obsessive attention to context.</p><p>That is why the first step of solutions architecture is to understand the organization as it exists now and in the future.</p><p>Even the most greenfield of projects&#8212;freed of seemingly every constraint to roam free in the name of prototyping or R&amp;D&#8212;must still be embedded in a <em>field</em> of some kind, if its seeds are going to take root.</p><p>If your solution is part of a brand new LOB or BU, you will still be constrained by IT governance.</p><p>If your solution is part of a dedicated Rapid Prototyping team with a direct line to the CEO, even your silo must prepare to accommodate compliance, SLAs, and the hasty innovator&#8217;s worst nightmare: Site Reliability Engineering.</p><p>If your solution is part of a brand new company with zero tech debt and seed money for a brand-new totally from scratch green-as-the-grass first-time prototype, your solution will still be constrained by CEO and investor expectations and the criteria by which your product can with sufficient truthfulness be labeled an MVP that has reached the market.</p><p>Any development initiative without a distinct and laser-focused vision of that <em>rubber hits the road</em> moment of integration, when something becomes &#8220;production&#8221;, will likely founder even in the hands of the most skilled technicians and product owners. 
</p><p>Many will not see the light of day, or enough of that daylight to be considered a living, breathing, viable system.</p><p>Pet projects, prototypes, and proof-of-concepts remain what they are: not &#8220;Ready for Production&#8221;.</p><p>This is not to say that such tinkering toys or half-baked projects are without value.</p><p>They are in fact very instrumental.</p><p>The greatest masterpieces of visual and literary art are often the successors of many sketches or &#8220;stillbirth&#8221; pieces. The crumpled-up pieces of paper that are discarded become like rungs of a ladder. They have served their heuristic purpose in helping sharpen and hone the vision and focus of the artist or technician to accomplish the final task.</p><p>Prototypes and proof-of-concepts are immensely powerful tools to advance the mission, to build solutions without the weight of overthinking, to try by doing, and to advance past the tempting dance of navel-gazing among opinionated stakeholders.</p><p>But the point is, if there are aspirations that your solution, your prototype, is going to become something more, you must aggressively wrestle with the question of what you are prototyping toward.</p><p>As simple and glib as it may sound, you must ask what problem your solution will solve and <em>how</em> it will come to do so.</p><p>What makes a solution effective is that it does come into effect. It produces. It makes things happen.</p><p>So to build a solution requires the marrying of the future and the present. It requires an upfront wrestling with properly contextualized questions: what does it mean to be production ready <em>here</em>? What does it mean to be production ready in light of general principles and best practice? What is the concrete roadmap to launch? To iterate? Who pays the bills? Who are the stakeholders?</p><p>That kernel of novelty your solution will introduce, that must be sown in light of these questions. 
If you want to be smart about it.</p><p>Of course, many can get away with a Series X investment round or a chain of promotions without really bothering with these considerations. Many battles in history have been won by the less-than-perfect, simply by virtue of right place, right time. Or by hanging around long enough.</p><p>But there is a difference between doing something and doing something <em>well</em>. The art of solution architecture is concerned with doing the thing well. And that requires forethought and planning.</p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[A History of Storage: Files]]></title><description><![CDATA[From Multics to NFS and beyond]]></description><link>https://tech.ngperrin.com/p/a-history-of-storage-files</link><guid isPermaLink="false">https://tech.ngperrin.com/p/a-history-of-storage-files</guid><dc:creator><![CDATA[Thinking through the Cloud]]></dc:creator><pubDate>Thu, 24 Apr 2025 22:00:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1a6c4f7f-52f9-42e5-9508-d32de122042d_1440x956.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is a series exploring the history of persistent storage systems in the world of computing. Previously, <a href="https://tech.ngperrin.com/p/a-history-of-block-storage-mainframe">we covered block storage</a>.</p><p>Today we go into the world of files.</p><p>Files are an interesting artifact of how the world of computing evolved: a useful abstraction on one level, yet simultaneously a notorious source of complexity for the technologists who interact with them.</p><p>The world is driven by files. Some have said &#8220;Everything is a file&#8221;.</p><p>We shall follow this journey.</p><p></p><h1>The First Files</h1><p>Like many digital-focused terms, files have their start in the tangible world.</p><p>In Latin, <em>filum </em>can refer to a thread or string, or more ominously a cord of fate. 
In the fifteenth century <a href="https://www.etymonline.com/word/file">legal professionals would sometimes use </a><em><a href="https://www.etymonline.com/word/file">filing</a></em><a href="https://www.etymonline.com/word/file"> strings</a> to keep track of their documents, hanging them up for ease of reference.</p><p>The file, then, was not merely an attachment to a document but a sorting technology for preserving and maintaining the organization of information, particularly for future reference by the user or others.</p><p>As punch card computing was developed in the 1940s, the term &#8220;file&#8221; quickly came to be used in reference to how the punch cards themselves were organized as files inside filing cabinets.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WBqg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20fa082a-decb-4a41-abe7-4ca3dbd6cc07_704x972.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WBqg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20fa082a-decb-4a41-abe7-4ca3dbd6cc07_704x972.png 424w, https://substackcdn.com/image/fetch/$s_!WBqg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20fa082a-decb-4a41-abe7-4ca3dbd6cc07_704x972.png 848w, https://substackcdn.com/image/fetch/$s_!WBqg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20fa082a-decb-4a41-abe7-4ca3dbd6cc07_704x972.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WBqg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20fa082a-decb-4a41-abe7-4ca3dbd6cc07_704x972.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WBqg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20fa082a-decb-4a41-abe7-4ca3dbd6cc07_704x972.png" width="704" height="972" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20fa082a-decb-4a41-abe7-4ca3dbd6cc07_704x972.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:972,&quot;width&quot;:704,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:317695,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/161930109?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20fa082a-decb-4a41-abe7-4ca3dbd6cc07_704x972.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WBqg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20fa082a-decb-4a41-abe7-4ca3dbd6cc07_704x972.png 424w, https://substackcdn.com/image/fetch/$s_!WBqg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20fa082a-decb-4a41-abe7-4ca3dbd6cc07_704x972.png 848w, https://substackcdn.com/image/fetch/$s_!WBqg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20fa082a-decb-4a41-abe7-4ca3dbd6cc07_704x972.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WBqg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20fa082a-decb-4a41-abe7-4ca3dbd6cc07_704x972.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Eckert, W.J. <em>Punched card methods in scientific computation. <a href="https://hdl.handle.net/2027/mdp.39015075955537?urlappend=%3Bseq=16%3Bownerid=13510798895635973-20">Link</a>. 
</em></p><p>By 1956, <a href="https://ed-thelen.org/comp-hist/BRL61-ibm03.html">documentation for the IBM 305 RAMAC</a> referred to its hardware as containing &#8220;disk files&#8221;.</p><p>However, the advent of the computer file system did not come in full force until the development of Multics.</p><p></p><h1>File System Pioneering: Multics EFS</h1><p>Multics was an MIT project that began in the 1960s and was the first computer architecture to take a serious stab at developing a file system. Before its release, EPL developers were given documentation to begin building with Multics &#8220;logic&#8221;.</p><p>The 3,000-page Multics System Programmers&#8217; Manual (MSPM), section BE.10.00 &#8220;<a href="https://multicians.org/mspm/be-10-00.660510.elementary-file-system.pdf">The Elementary File System (EFS)</a>&#8221;, documents the first proposals for such a preliminary file system, superseded by revision BE.10.01 the following month.</p><p>Before this point, block storage on tapes and disks was the primary storage mechanism that computers could work with. 
These were raw, addressable chunks of storage without much abstraction.</p><p>Block addresses needed to be manually managed and were coupled to the hardware, a fairly error-prone process.</p><p>No abstraction meant no file metadata, no I/O coordination, no lifecycle management, and no access controls.</p><p>EFS introduced several important innovations which would become staples in the world of files:</p><ul><li><p>Hierarchical structure for files and file metadata via pointers.</p></li><li><p>Logical records and word-level addressing for random/sequential access and offset support, abstracting out the physical layout of the drive.</p></li><li><p>I/O mode flagging to declare files as permanent, foreign, or temporary using bit flags, an early form of file attributes.</p></li><li><p>A unified read/write model with Open/Write/Read/Close operations clearly defined.</p></li><li><p>Bit flags for EOF (end of file), EOR (end of record), and error reporting.</p></li></ul><p>Now we begin to see file type semantics that speak to lifecycle (Temporary, Permanent, Foreign) and more of the abstraction and interoperability that are central to the &#8220;files&#8221; package.</p><p></p><h2>A Second Take: Multics Basic File System</h2><p>The EFS was merely a prototype in the end. Multics took these preliminary elements and built out a far more comprehensive file management system in the final Multics release.</p><p>We see in BG.0 &#8220;<a href="https://multicians.org/mspm/bg-0.661213.basic-file-system-overview.pdf">Overview of the Basic File System</a>&#8221; the unpacking of this new paradigm.</p><p>The Multics Basic File System consists of segments (files) and memory-resident page and segment tables. This is paired with a multi-level storage management system for infrequently accessed data or for backup data that would be sent to offline devices.</p><p>The segment is the heart of the novelty here. 
It is a linear array of data that can be accessed with implicit memory references (like block storage) or explicitly with read/write system calls (like file addressing).</p><p>Multics also added a number of controllers that managed the segments: directories (symbolic naming), access controls, page control, core (memory) control, and device interface modules (DIMs) to abstract I/O interactions.</p><p>But from a long-term perspective we see two other key ideas implemented here.</p><p>First, concurrency and fault handling. For the first time an abstracted layer of locking mechanisms, page faults, and segment faults was implemented to protect the file system from concurrent writes or to handle other potential errors.</p><p>Second, we see user-based Access Control Lists (ACLs), a model we will dig into a bit further here. </p><p>As specified in <a href="https://multicians.org/mspm/bg-9-00.680116.access-control.pdf">BG.9.00</a>, when a user attempts to access a directory branch, the Access Control &#8220;determines the [effective access] mode of the user with respect to this branch and returns this mode and the ring brackets&#8221; to the Directory Control.</p><p>There is an apparent mode and an effective mode.</p><p>The apparent mode is concerned with the read, write, or execute permissions available to the user or group (generic user).</p><p>The effective mode governs access itself and is produced through the ring brackets.</p><p>The ring brackets define the range of privilege levels under which the segment may be accessed or called. They are composed of three integers:</p><ul><li><p>The first two integers are the low and high bounds of the access bracket, specifying the rings allowed to directly access the segment based on the user&#8217;s effective mode. 
They are named <code>access_low</code> and <code>access_high</code>.</p></li><li><p>The third integer is the high bound of the call bracket, identifying the highest ring that may attempt to execute the segment via a controlled gate crossing. It is named <code>call_high</code>.</p></li></ul><p>These are hardware-enforced privilege rings, but they introduce a mechanism for ring crossing via a gate, which is what the call bracket governs. In other words, within the call bracket a segment cannot be accessed directly as belonging to the caller&#8217;s privilege ring, but a ring crossing through a gate mechanism can still allow the user to execute the segment.</p><p>Put more simply, this allows for fine-grained privilege escalations, and it is the foundation of hardware kernel escalation and UNIX&#8217;s <code>setuid</code>.</p><p></p><h1>Unix: Everything is a File</h1><p>Multics was theoretically impressive and provided a groundbreaking implementation of secure, fault-tolerant systems even outside its innovations around files, but we are still one step away from the foundation of modern operating systems today.</p><p>The all-too-famous Ken Thompson and Dennis Ritchie, both contributors to Multics, left for Bell Laboratories. <a href="https://en.wikipedia.org/wiki/Ken_Thompson">The editors at Wikipedia claim</a> Thompson did so in part to build an OS that could support <em>Space Travel</em>, a video game he was developing. The result of this effort was Unix.</p><p>Whatever its origins may be, Unix was where files came into their own. 
Extending Multics&#8217; directory tree model, Unix implemented a hierarchical file system that managed not only content provided by users for storage or reference but a <a href="http://ftp.okass.net/pub/mirror/minnie.tuhs.org/Distributions/Research/Dennis_v1/UNIX_ProgrammersManual_Nov71.pdf#page=171">whole new philosophy</a>.</p><p>&#8220;Everything is a file.&#8221;</p><p>Now, for the first time, devices, sockets, directories, and even processes are presented as files (at least in a heuristic sense).</p><p>This offers enormous advantages in interoperability, simplicity, and composability, as there is now a consistent I/O model for essentially everything a developer could need in the operating system. There are caveats to this, naturally, but that is the gist of Thompson and Ritchie&#8217;s design, and it is impressively effective given that Thompson worked largely alone on the first few versions of Unix.</p><p>Now even remote device mounts can be treated within a unified directory structure, and this opened up brand new horizons in the world of file storage.</p><p>Unix additionally stripped down much of the complexity around ACLs and privilege rings, decoupling them from the hardware and making them far simpler for developers to build against.</p><p>By reducing the learning curve and creating radically extensible systems, Unix paved the way for the macOS and Linux systems that remain with us to this day.</p><p><em><strong>But how did file storage work into all this? </strong></em></p><p>This is where we can enter specifically into the question of file storage.</p><p>Again, up to this point all files were stored as blocks on magnetic disks and tape, and Multics and Unix introduced <em>hierarchical storage, segmented memory, </em>and <em>access controls</em> abstracted from raw blocks.</p><p>Unix had relied on inodes from its earliest editions, and by 1975 Unix V6 had launched, with V7 shortly to follow, maturing the design. Each file is associated with an inode, which stores metadata and points to data blocks. 
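</p><p>This split between inode metadata and directory names is directly observable through the POSIX <code>stat</code> interface. A minimal Python sketch (the file names here are arbitrary):</p>

```python
import os
import stat

# Create a file and inspect the metadata stored in its inode via os.stat().
with open("demo.txt", "w") as f:
    f.write("hello")

info = os.stat("demo.txt")
print(info.st_ino)                  # inode number: the file's identity on this filesystem
print(info.st_size)                 # size lives in the inode, not in the directory entry
print(stat.filemode(info.st_mode))  # permissions live in the inode too, e.g. '-rw-r--r--'

# A hard link is simply a second directory entry pointing at the same inode,
# so both names resolve to identical metadata.
os.link("demo.txt", "alias.txt")
print(os.stat("alias.txt").st_ino == info.st_ino)  # True

os.remove("alias.txt")
os.remove("demo.txt")
```

<p>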
Directories in turn simply map names to inode numbers.</p><p>This decouples file metadata, like ownership, and the storage layout from the name of the item itself, all accessed through a single file API.</p><h1>Unix Evolved: BSD and UFS</h1><p>So where did files go from there? As the Unix ecosystem continued to evolve, so did the systems undergirding files.</p><p>One major child of Unix is BSD (Berkeley Software Distribution), an open-source project <a href="https://docs.freebsd.org/en/articles/explaining-bsd/">with a somewhat contentious relationship with Unix proprietary licensing</a>, but one which still produced a number of innovations that would be incorporated further downstream into modern file systems.</p><p>With the release of BSD 4.2 in 1983 came a far more robust file system called UFS (Unix File System) that included:</p><ul><li><p>Block size increases from 512 or 1024 bytes to 4096 and 8192 bytes.</p></li><li><p>Partial blocks chunked into block fragments to save space.</p></li><li><p>Metadata distributed across cylinders for better performance (lower seek time).</p></li><li><p>Improved linear scan performance for directories.</p></li><li><p>Tunable inodes to adjust file system size as needed to balance storage needs and performance.</p></li></ul><p><br>Across the board, UFS introduced a matured flavor of the Unix file system, one that would become standard not just on BSDs but also for SunOS, Solaris, and others.</p><p>It would also set the stage for Linux.</p><h1>ext: Linux and Modern File Systems</h1><p>Aside from Unix offshoots like BSD, <a href="https://minix1.woodhull.com/index1.html">MINIX</a> was launched in the late eighties for academic purposes, with source code available for general use.</p><p>Linus Torvalds picked up on this free kernel and launched the Linux kernel in 1991, developing it on MINIX using the GNU C Compiler.</p><p>While the earliest versions of the kernel mirrored the MINIX file system, Remy Card implemented the 
first generation of <code>ext</code> (extended file system) in 1992 with <a href="https://kernel.googlesource.com/pub/scm/linux/kernel/git/nico/archive/+/v0.96c">Linux 0.96c</a>. Unfortunately, this first generation did not offer much more than a Linux-native approach to file management.</p><p>However, this changed quickly when <code>ext2</code> launched in 1993 with separate inode and data block bitmaps, fast symbolic links, and file attributes. This generation also suffered from problems with unclean shutdowns, which required <code>fsck</code> to check and repair the file system after a wide variety of failures and errors.</p><p>This was solved in <code>ext3</code> with journaling. Journaling is the extended file system&#8217;s version of a write-ahead log: it records metadata operations before applying them, so that transactions can be completed or rolled back cleanly. This, combined with backwards compatibility with <code>ext2</code>, created a clean upgrade path.</p><p>Yet this generation was limited by scale and performance. 
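</p><p>The core journaling idea can be sketched in miniature. The following toy key-value store (not the actual ext3 implementation; all file names and functions here are invented for illustration) logs each intended update durably before applying it, so a crash mid-update can be repaired by replaying the journal rather than by an <code>fsck</code>-style full scan:</p>

```python
import json
import os

JOURNAL = "journal.log"  # write-ahead journal of intended updates
STORE = "store.json"     # the "main" on-disk structure

def _load():
    if os.path.exists(STORE):
        with open(STORE) as s:
            return json.load(s)
    return {}

def commit(key, value):
    # 1. Append the intent to the journal and force it to stable storage.
    with open(JOURNAL, "a") as j:
        j.write(json.dumps({"key": key, "value": value}) + "\n")
        j.flush()
        os.fsync(j.fileno())
    # 2. Only then apply the change to the main structure.
    data = _load()
    data[key] = value
    with open(STORE, "w") as s:
        json.dump(data, s)
    # 3. Checkpoint: the update is durably applied, so drop the journal.
    os.remove(JOURNAL)

def recover():
    # After a crash, replay any surviving journal entries. Replaying is safe
    # because each entry is idempotent: applying it twice yields the same state.
    data = _load()
    if os.path.exists(JOURNAL):
        with open(JOURNAL) as j:
            for line in j:
                entry = json.loads(line)
                data[entry["key"]] = entry["value"]
        with open(STORE, "w") as s:
            json.dump(data, s)
        os.remove(JOURNAL)
    return data
```

<p>If the process dies between steps 1 and 2, the journal entry survives and <code>recover()</code> re-applies it; if it dies before step 1 completes, the update is simply lost, a trade-off analogous to what a metadata journal provides.</p><p>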
This led to <code>ext4</code>, which remains the default file system in the Linux kernel to this day.</p><p><code>ext4</code> has <a href="https://www.kernel.org/doc/html/v6.1/filesystems/ext4/overview.html#blocks">major new features</a> such as extents that replace block maps to add efficiency for larger files, delayed allocation that batches writes, 64-bit block counts, journal checksums to provide integrity measures for the journal itself, and much larger maximum file and volume sizes.</p><h1>Networked Files for Distributed Systems</h1><p>While we&#8217;ve arrived at the modern implementation of file systems on Linux, we have bypassed many other offshoots along the way, such as Windows and SMB.</p><p>Let us refocus on another historical strain relevant for understanding files in the cloud: files and networking.</p><p>Much like with block storage, where DAS evolved into NAS and SAN, a growing demand emerged for sharing files between UNIX systems across a LAN.</p><p>Sun Microsystems built NFS (Network File System) in 1984 to make remote files behave as if they were local. The protocol was later formalized in <a href="https://datatracker.ietf.org/doc/html/rfc1094">RFC 1094</a>.</p><p>NFSv2 was the first public release of the Network File System and was built on the principle of mountable remote directories. A client mounts a remote directory and can read and write those files as if they were local.</p><p>Initially, communication was carried over the stateless UDP protocol. It worked relatively well over a LAN but not over WANs. And if a server failed, that introduced issues as well.</p><p>These problems were mitigated to some extent by NFSv3 in 1995, which introduced TCP support, 64-bit file sizes and offsets, asynchronous writes, and file attribute caching to reduce round trips, but remained stateless. 
Clients and servers were forced to rely on idempotent operations.</p><p>Nevertheless, by this time NFSv3 was becoming a fairly standard component of the Linux package, particularly in HPC clusters and data centers with distributed systems.</p><p><a href="https://datatracker.ietf.org/doc/html/rfc7530">NFSv4</a> finally introduced a stateful protocol over TCP with open state, file locks, and delegations, alongside compound operations to increase performance.</p><p>Even more importantly, it introduced ACLs for fine-grained permissions across distributed systems, and authentication compatibility with Kerberos and, later, TLS.</p><p>For the first time, all exports also appeared under a single root directory for cleaner client-side management of a global namespace.</p><p>NFSv4 is still the standard, with minor versions such as NFSv4.1 and NFSv4.2 adding stability, parallelism, and other cloud-friendly features, but the architecture has largely been set in stone and seems to have matured fully.</p><h1>The Migration: Files in the Cloud</h1><p>A slight tangent before exploring the evolution of cloud-service-provider-based file solutions.</p><p>The Unix model which we discussed above was not merely a <em>de facto </em>standard for operating systems but in some sense codified as <em>the </em>standard with the publication of POSIX in <a href="https://standards.ieee.org/ieee/1003.1/1388/">IEEE 1003.1</a>.</p><p>Essentially, <a href="https://www.opengroup.org/austin/papers/posix_faq.html">POSIX is a designation</a> that identifies whether a system such as Linux is sufficiently interoperable with other operating systems by meeting the criteria of Unix API compatibility, including file and directory API calls such as <code>read()</code>, <code>write()</code>, and <code>chmod()</code>.</p><p>POSIX compliance has thus become a criterion for identifying whether a service can be deemed sufficiently like a file server.</p><p>This is a primary distinction for <em><strong>object storage 
</strong></em>which more or less came to the forefront of persistent storage paradigms with the release of Amazon S3 in 2006.</p><p>We can cover object storage more in depth later, but it is crucial to note that S3 and object storage are <strong>not </strong>POSIX-compliant, even if they have file- and folder-like semantics.</p><p>Many companies have since migrated from file storage to object storage, but in time the cloud service providers have each in turn released their own POSIX-compliant NFS systems: Amazon EFS, Azure Files (which also includes Windows SMB), and Google Filestore.</p><p>To this day there are several primary advantages to leveraging these POSIX-compliant cloud services over object storage:</p><ul><li><p>Minimal replatforming: a suitable &#8220;lift and shift&#8221; option.</p></li><li><p>Mountable storage that does not need to be accessed through an HTTP API.</p></li><li><p>Much lower latency, particularly on premium SKU offerings.</p><p></p></li></ul><h1>The Future of Files and File Architecture</h1><p>All the same, the cloud computing push toward &#8220;serverless&#8221; and managed-service implementations has increasingly led to the deprecation of file servers in cloud-native engineering.</p><p>File content repositories like Microsoft&#8217;s SharePoint or OneDrive have increasingly usurped the role that local or self-hosted file storage solutions used to play. 
This is not to say that cloud-hosted file storage is now universal (the statistics on this are difficult to pinpoint with accuracy), but it is fair to say that it is at least fairly ubiquitous across SMBs and <a href="https://bigid.com/blog/cloud-migration-trends-and-statistics-2023">enterprise companies</a>.</p><p>Where could file storage advance from here?</p><p>A number of potential paths:</p><ul><li><p>Further performance improvements <a href="https://cioinfluence.com/it-and-devops/top-enterprise-data-storage-trends-of-2024/">through integration with NVMe-oF</a>.</p></li><li><p>The extension of <a href="https://nvlpubs.nist.gov/nistpubs/specialpublications/NIST.SP.800-207.pdf">Zero Trust Architecture</a> to file system access controls.</p></li><li><p><a href="https://arxiv.org/abs/2409.18682">DAOS (Distributed Asynchronous Object Storage)</a> for high-performance computing and POSIX compatibility.</p></li><li><p><a href="https://datatracker.ietf.org/doc/slides-interim-2020-dinrg-01-sessa-an-overview-of-the-interplanetary-file-system-ipfs/">IPFS (InterPlanetary File System)</a>, designed for truly global file identification and access.</p></li><li><p><a href="http://web.mit.edu/6.826/www/notes/HO13.pdf">Semantic file systems</a> that emphasize accessing files based on content rather than hierarchy. 
This is an old proposal, but the advent of generative AI has freshened interest in rethinking how data can be accessed.</p></li></ul><p>If one thing is certain, it is that files are no longer an innovation focus in themselves, but they are likely to piggyback on the advancements around object storage and AI-driven semantic search as those continue to bring new possibilities for how we use data, and consequently for how we store it.</p>]]></content:encoded></item><item><title><![CDATA[AWS and Azure Well-Architected Framework Compared - Part I: Overview]]></title><description><![CDATA[Comparing Amazon and Microsoft's architecture frameworks]]></description><link>https://tech.ngperrin.com/p/aws-and-azure-well-architected-framework</link><guid isPermaLink="false">https://tech.ngperrin.com/p/aws-and-azure-well-architected-framework</guid><dc:creator><![CDATA[Thinking through the Cloud]]></dc:creator><pubDate>Sat, 19 Apr 2025 20:58:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gKYS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gKYS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gKYS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp 424w, 
https://substackcdn.com/image/fetch/$s_!gKYS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!gKYS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!gKYS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gKYS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1023896,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/161176520?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!gKYS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!gKYS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!gKYS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!gKYS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b23ce6-165b-4658-abe3-c8f225abeec9_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h1>Introduction</h1><p>Before we begin, what is architecture?</p><p>Architecture is concerned with the <a href="https://www.etymonline.com/word/architecture">&#8220;tasteful application of scientific and traditional rules of good construction to the materials at hand</a>&#8221;. And even in the term&#8217;s origins from Greek we find the sense of a craftsman weaving together disparate elements into a cohesive, unified whole. Such an idea is engrained directly into the etymology of &#8220;architecture&#8221;.</p><p>For most of its history in the English language, this term was of course applied to physical construction, but much like &#8220;engineering&#8221; and &#8220;infrastructure&#8221; these metaphors have evolved into an abstraction which applies just as much now to the digital landscape as it did to the material world before.</p><p>The quality of building can take a variety of forms.</p><p>You can build out of cheap necessity, hodge-podge ideas, a desire for experimentation, reckless urgency to meet a demand, or just to get by. 
But to build <em>well</em> is different from building something at all.</p><p>This was well understood by the early pioneers of computing when architecture was applied to new things like software and networks.</p><p>At the core, you need architecture to undergird these complex, expensive, and massively scaled expanses of the digital fabric that clothes our world.</p><p>Cloud systems are no different in this regard.</p><p>Quality in cloud &#8220;construction&#8221; can vary widely as well.</p><p>When you wake up Day 1 in the AWS or Azure ecosystem, even if you are a fairly experienced software engineer, you can quite easily work yourself into a five- or six-digit monthly invoice.</p><p>Or something that breaks half the time.</p><p>Or something with a gaping security flaw.</p><p>You can build in the cloud, but to tastefully apply those rules of good construction to the materials at hand is its own art and science: that of cloud architecture.</p><h1>The Advent of the Well-Architected Framework</h1><p>You do not need to go far to find a large pool of anecdotes from the past dozen years of individuals and businesses who, after some cost or configuration blunder, would never go back to the cloud again.</p><p>Like many new technologies, especially those that are neither foolproof nor user-friendly, it is easy for those with negative experiences to discount it because &#8220;it doesn&#8217;t work&#8221;.</p><p>Vendors like Amazon and Microsoft want long-term lock-in from their customers.</p><p>This is why there was a concerted effort both to devise common architectural principles and designs and to provide an educational system with robust credentialing (Solutions Architect) around new &#8220;roles&#8221; in the labor market, earmarked to accelerate the advance of cloud migrations globally.</p><p>To this end, each cloud provider has published their own &#8220;Well-Architected Framework&#8221; to consolidate, in a holistic manner, all 
the considerations that should go into designing, implementing, and maintaining cloud systems within their specific ecosystems.</p><p>By paying attention to specific pillars such as Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and (for AWS) Sustainability, each WAF provides a broad classificatory system that enables focus on particular key areas while maintaining a lens on the general big picture as well.</p><p>This post is the beginning of a series that will compare and contrast the AWS and Azure Well-Architected Frameworks (hereafter referred to as WAF), with <a href="https://cloud.google.com/architecture/framework">Google Cloud&#8217;s WAF</a> potentially added at some future date.</p><p>We will look at not only the content but the format, the philosophy, and the structure of these frameworks and how they mesh with the opinionated ways each vendor cultivates their own offerings.</p><p></p><h1>Revision History</h1><h2>AWS</h2><p>Important to analyzing any publication is an assessment of its history. The differences here between the AWS and Azure WAF already appear quite stark.</p><p>AWS offers a designated &#8220;<a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/document-revisions.html">Document Revisions</a>&#8221; page which outlines the history of the AWS WAF from its initial publication on October 1, 2015, with each subsequent update annotated with a description and change date. Most recently, a &#8220;major update&#8221; on November 6, 2024 refreshed many of the WAF pillars, particularly in light of advances in generative AI.</p><p>Alongside this are framework versions, with the first specified as 2022-03-31 and the most recent as 2024-06-27. 
Four in total are noted though without any description or title attached to these versions.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R0iT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113abc0e-fce6-4e4e-80eb-9d92e1018f0a_959x222.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R0iT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113abc0e-fce6-4e4e-80eb-9d92e1018f0a_959x222.png 424w, https://substackcdn.com/image/fetch/$s_!R0iT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113abc0e-fce6-4e4e-80eb-9d92e1018f0a_959x222.png 848w, https://substackcdn.com/image/fetch/$s_!R0iT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113abc0e-fce6-4e4e-80eb-9d92e1018f0a_959x222.png 1272w, https://substackcdn.com/image/fetch/$s_!R0iT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113abc0e-fce6-4e4e-80eb-9d92e1018f0a_959x222.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R0iT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113abc0e-fce6-4e4e-80eb-9d92e1018f0a_959x222.png" width="959" height="222" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/113abc0e-fce6-4e4e-80eb-9d92e1018f0a_959x222.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:222,&quot;width&quot;:959,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:12386,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/161176520?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113abc0e-fce6-4e4e-80eb-9d92e1018f0a_959x222.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R0iT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113abc0e-fce6-4e4e-80eb-9d92e1018f0a_959x222.png 424w, https://substackcdn.com/image/fetch/$s_!R0iT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113abc0e-fce6-4e4e-80eb-9d92e1018f0a_959x222.png 848w, https://substackcdn.com/image/fetch/$s_!R0iT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113abc0e-fce6-4e4e-80eb-9d92e1018f0a_959x222.png 1272w, https://substackcdn.com/image/fetch/$s_!R0iT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113abc0e-fce6-4e4e-80eb-9d92e1018f0a_959x222.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>What remains unclear from this view is what constitutes a change in framework versions versus the document revisions.</p><p>The first framework version 2022-03-31 does not even have a changelog 
entry for that date.</p><p>The subsequent three framework versions do have changelog notes describing pillar refreshes or updates to best practices, but such notes do not by themselves seem to increment the framework version: the most recent changelog entry, from November 6, 2024, has the most expansive description of changes yet apparently did not constitute any change in the framework version.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OLJ-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90d978b-fea2-4153-89e8-30652c2c809b_916x211.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OLJ-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90d978b-fea2-4153-89e8-30652c2c809b_916x211.png 424w, https://substackcdn.com/image/fetch/$s_!OLJ-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90d978b-fea2-4153-89e8-30652c2c809b_916x211.png 848w, https://substackcdn.com/image/fetch/$s_!OLJ-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90d978b-fea2-4153-89e8-30652c2c809b_916x211.png 1272w, https://substackcdn.com/image/fetch/$s_!OLJ-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90d978b-fea2-4153-89e8-30652c2c809b_916x211.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OLJ-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90d978b-fea2-4153-89e8-30652c2c809b_916x211.png" width="916" height="211" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c90d978b-fea2-4153-89e8-30652c2c809b_916x211.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:211,&quot;width&quot;:916,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49377,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/161176520?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90d978b-fea2-4153-89e8-30652c2c809b_916x211.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OLJ-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90d978b-fea2-4153-89e8-30652c2c809b_916x211.png 424w, https://substackcdn.com/image/fetch/$s_!OLJ-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90d978b-fea2-4153-89e8-30652c2c809b_916x211.png 848w, https://substackcdn.com/image/fetch/$s_!OLJ-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90d978b-fea2-4153-89e8-30652c2c809b_916x211.png 1272w, https://substackcdn.com/image/fetch/$s_!OLJ-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90d978b-fea2-4153-89e8-30652c2c809b_916x211.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><p>There is a level of irony that a document or framework exclusively concerned with architectural best practices should itself have an unclear and confusing versioning system, but this irony 
is all the deeper with the Azure WAF, which has no explicit revision history whatsoever.</p><p></p><h2>Azure</h2><p>The closest Azure comes to this is a &#8220;<a href="https://learn.microsoft.com/en-us/azure/well-architected/whats-new">What&#8217;s new</a>&#8221; page which offers a month-by-month view of changes to the WAF covering only the last 12 months.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sRZl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cbe969-129b-4d6c-9514-b0baec11d7c1_854x747.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sRZl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cbe969-129b-4d6c-9514-b0baec11d7c1_854x747.png 424w, https://substackcdn.com/image/fetch/$s_!sRZl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cbe969-129b-4d6c-9514-b0baec11d7c1_854x747.png 848w, https://substackcdn.com/image/fetch/$s_!sRZl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cbe969-129b-4d6c-9514-b0baec11d7c1_854x747.png 1272w, https://substackcdn.com/image/fetch/$s_!sRZl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cbe969-129b-4d6c-9514-b0baec11d7c1_854x747.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sRZl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cbe969-129b-4d6c-9514-b0baec11d7c1_854x747.png" width="854" height="747" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1cbe969-129b-4d6c-9514-b0baec11d7c1_854x747.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:747,&quot;width&quot;:854,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51399,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/161176520?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cbe969-129b-4d6c-9514-b0baec11d7c1_854x747.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sRZl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cbe969-129b-4d6c-9514-b0baec11d7c1_854x747.png 424w, https://substackcdn.com/image/fetch/$s_!sRZl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cbe969-129b-4d6c-9514-b0baec11d7c1_854x747.png 848w, https://substackcdn.com/image/fetch/$s_!sRZl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cbe969-129b-4d6c-9514-b0baec11d7c1_854x747.png 1272w, https://substackcdn.com/image/fetch/$s_!sRZl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cbe969-129b-4d6c-9514-b0baec11d7c1_854x747.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><p>From a standpoint of comprehensive documentation, this is a fairly weak approach to documenting version history, particularly as the <a href="https://azure.microsoft.com/en-us/blog/introducing-the-microsoft-azure-wellarchitected-framework/">press release</a> announcing the Azure WAF is dated July 20, 2020, and the public GitHub repo attached to this documentation <a href="https://github.com/MicrosoftDocs/well-architected/commit/d9949e085519f12b04bc261759b28e9f6fa46f12">received its first commit</a> on December 1, 2021. 
Even using the <a href="https://web.archive.org/web/20250000000000*/https://learn.microsoft.com/en-us/azure/well-architected/">Wayback Machine</a>, the current URL for the WAF home page had its first snapshot on April 23, 2023.</p><p>There simply has not been a dedicated effort by Azure to provide a systematic view of the evolution of architectural best practices in the way that AWS has.</p><p>This irony is further underscored by the Azure WAF&#8217;s explicit recommendation to maintain <a href="https://learn.microsoft.com/en-us/azure/well-architected/architect-role/architecture-decision-record">architecture decision records (ADRs)</a>, which give stakeholders an auditable tracking system for maintaining accountability during architectural changes.</p><p>This is common sense in the profession, and though it is applicable to all joint architectural decisions, Microsoft does not seem to apply the principle to the WAF itself.</p><p></p><h2>Compared</h2><p>I think this truthfully reflects the respective natures of the AWS and Azure ecosystems around change control. At least from anecdotal practitioner experience, AWS has a somewhat clear process around service changes and the ability to toggle between versions and UI interfaces in the AWS portal.</p><p>Azure is much more cluttered and opaque in this regard: services sometimes have preview options, while at other times options and features get whisked in and out of the Azure portal. If you get really nitty-gritty, you can use explicit API versioning to get Azure to do what you want, but it is often a moderate inconvenience at best when these changes do occur.</p><p></p><h1>Architecture Fundamentals</h1><p>Both WAFs have a collection of pages in their overview sections concerned with the fundamentals of architecture before diving into the individual pillars. 
Here again we see broad differences between the AWS and Azure approach.</p><p></p><h2>AWS</h2><blockquote><p>&#8220;Good intentions never work, you need good mechanisms to make anything happen&#8221; &#8212; Jeff Bezos</p></blockquote><p></p><p>The AWS WAF focuses on architecture as a <strong>practice</strong>. Not in the sense of practicing piano before a recital, but in the sense that as a doctor practices medicine, so too will you practice architecture, as something between art and science.</p><p>For the AWS WAF, this begins with <a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/definitions.html">definitions</a> both of the WAF pillars as well as of the terms contained therein.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wqeH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe82efab1-db51-40c7-99c2-05a3c80f6745_948x530.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wqeH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe82efab1-db51-40c7-99c2-05a3c80f6745_948x530.png 424w, https://substackcdn.com/image/fetch/$s_!wqeH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe82efab1-db51-40c7-99c2-05a3c80f6745_948x530.png 848w, https://substackcdn.com/image/fetch/$s_!wqeH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe82efab1-db51-40c7-99c2-05a3c80f6745_948x530.png 1272w, 
https://substackcdn.com/image/fetch/$s_!wqeH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe82efab1-db51-40c7-99c2-05a3c80f6745_948x530.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wqeH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe82efab1-db51-40c7-99c2-05a3c80f6745_948x530.png" width="948" height="530" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e82efab1-db51-40c7-99c2-05a3c80f6745_948x530.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:530,&quot;width&quot;:948,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:126918,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/161176520?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe82efab1-db51-40c7-99c2-05a3c80f6745_948x530.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wqeH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe82efab1-db51-40c7-99c2-05a3c80f6745_948x530.png 424w, https://substackcdn.com/image/fetch/$s_!wqeH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe82efab1-db51-40c7-99c2-05a3c80f6745_948x530.png 848w, https://substackcdn.com/image/fetch/$s_!wqeH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe82efab1-db51-40c7-99c2-05a3c80f6745_948x530.png 1272w, 
https://substackcdn.com/image/fetch/$s_!wqeH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe82efab1-db51-40c7-99c2-05a3c80f6745_948x530.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8S0O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a4b0243-9860-4338-9dcf-5422869960e5_954x500.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8S0O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a4b0243-9860-4338-9dcf-5422869960e5_954x500.png 424w, https://substackcdn.com/image/fetch/$s_!8S0O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a4b0243-9860-4338-9dcf-5422869960e5_954x500.png 848w, https://substackcdn.com/image/fetch/$s_!8S0O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a4b0243-9860-4338-9dcf-5422869960e5_954x500.png 1272w, https://substackcdn.com/image/fetch/$s_!8S0O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a4b0243-9860-4338-9dcf-5422869960e5_954x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8S0O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a4b0243-9860-4338-9dcf-5422869960e5_954x500.png" width="954" height="500" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a4b0243-9860-4338-9dcf-5422869960e5_954x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:954,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:140403,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/161176520?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a4b0243-9860-4338-9dcf-5422869960e5_954x500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8S0O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a4b0243-9860-4338-9dcf-5422869960e5_954x500.png 424w, https://substackcdn.com/image/fetch/$s_!8S0O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a4b0243-9860-4338-9dcf-5422869960e5_954x500.png 848w, https://substackcdn.com/image/fetch/$s_!8S0O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a4b0243-9860-4338-9dcf-5422869960e5_954x500.png 1272w, https://substackcdn.com/image/fetch/$s_!8S0O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a4b0243-9860-4338-9dcf-5422869960e5_954x500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is an appropriate starting point as terminology often serves as an anchor to help plant the mind in the right kinds of words and concepts that are the most applicable to thinking through the problems of cloud architecture.</p><p></p><p>Following these definitions, come notes &#8220;<a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/on-architecture.html">On architecture</a>&#8221; which provide a definitive statement I would consider fundamental to the nuance of the AWS WAF.</p><p>This page begins by describing a traditional &#8220;on-prem&#8221; customer who centralizes technology architecture into a single team which operates as the gatekeeper working in conjunction with isolated product or feature teams. 
It explicitly calls out TOGAF and Zachman as operating under this paradigm.</p><p>Amazon rejects this approach.</p><p>&#8220;At AWS, we prefer to distribute capabilities into teams rather than having a centralized team with that capability.&#8221;</p><p>Of course, the risk of this approach is losing an independent voice that can judiciously ensure adherence to standards. However, AWS says that they are able to effectively mitigate insider bias from these individual service teams by holding them accountable to centralized <em>practices</em> and <em>mechanisms</em> that ensure standards continue to be met.</p><p></p><p>This, and much of the AWS WAF in general, is an outworking of <a href="https://www.amazon.jobs/content/en/our-workplace/leadership-principles">Amazon&#8217;s famous leadership principles</a>.</p><p>It is this strongly opinionated view of how the business as a whole should be run (i.e., the Amazon philosophy) that has shaped and determined the best practices that AWS prescribes itself.</p><p>For example, &#8220;<a href="https://aws.amazon.com/executive-insights/content/how-amazon-defines-and-operationalizes-a-day-1-culture/">Every Day is Day 1</a>&#8221; emerges in how strongly AWS advocates for thinking in terms of cloud-<strong>native</strong> systems. Workloads need to be continuously reinvented so that they are not hampered by years of vestigial accrual. 
That is why cloud-native design is part and parcel of the AWS package.</p><p></p><p>This is further expounded upon on the next page, &#8220;<a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/general-design-principles.html">general design principles</a>&#8221;, which enumerates six general architectural principles:</p><ol><li><p>&#8220;Stop guessing your capacity needs&#8221; - The cloud enables on-demand elasticity.</p></li><li><p>&#8220;Test systems at production scale&#8221; - Testing at production scale is far easier with pay-as-you-go cloud computing costs.</p></li><li><p>&#8220;Automate with architectural experimentation in mind&#8221; - Automations should be built in such a way as to allow architectural extensibility and modification.</p></li><li><p>&#8220;Consider evolutionary architectures&#8221; - Architecture is not an event but a process. Cloud-native thinking enables this kind of elastic evolution.</p></li><li><p>&#8220;Drive architectures using data&#8221; - Architectural choices can be judged quantitatively, and this can be leveraged for continuous improvement.</p></li><li><p>&#8220;Improve through game days&#8221; - Test and simulate various cases through game days to see how the architecture holds up.</p></li></ol><p></p><p>For the AWS WAF, architecture is not a theory or a science but a practice guided by general, flexible best-practice patterns that may evolve over time.</p><p></p><h2>Azure</h2><p>By contrast, the Azure WAF overview focuses on architecture as a <strong>role</strong>. The doctor may practice medicine, but what does that look like day-to-day? What is it to be a medical professional? 
What is it to be a professional cloud architect?</p><p>This is the anchor point for the Azure WAF.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f3Uv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b2d6c51-e14f-4c19-a298-191ccbdf6448_853x384.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f3Uv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b2d6c51-e14f-4c19-a298-191ccbdf6448_853x384.png 424w, https://substackcdn.com/image/fetch/$s_!f3Uv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b2d6c51-e14f-4c19-a298-191ccbdf6448_853x384.png 848w, https://substackcdn.com/image/fetch/$s_!f3Uv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b2d6c51-e14f-4c19-a298-191ccbdf6448_853x384.png 1272w, https://substackcdn.com/image/fetch/$s_!f3Uv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b2d6c51-e14f-4c19-a298-191ccbdf6448_853x384.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f3Uv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b2d6c51-e14f-4c19-a298-191ccbdf6448_853x384.png" width="853" height="384" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b2d6c51-e14f-4c19-a298-191ccbdf6448_853x384.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:384,&quot;width&quot;:853,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55964,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/161176520?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b2d6c51-e14f-4c19-a298-191ccbdf6448_853x384.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f3Uv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b2d6c51-e14f-4c19-a298-191ccbdf6448_853x384.png 424w, https://substackcdn.com/image/fetch/$s_!f3Uv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b2d6c51-e14f-4c19-a298-191ccbdf6448_853x384.png 848w, https://substackcdn.com/image/fetch/$s_!f3Uv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b2d6c51-e14f-4c19-a298-191ccbdf6448_853x384.png 1272w, https://substackcdn.com/image/fetch/$s_!f3Uv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b2d6c51-e14f-4c19-a298-191ccbdf6448_853x384.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>These responsibilities are listed in the <a href="https://learn.microsoft.com/en-us/azure/well-architected/architect-role/fundamentals">fundamentals of the solution architect</a>:</p><ol><li><p>&#8220;Have a decision-making framework&#8221;</p><ol><li><p>Identify expected decisions in advance.</p></li><li><p>Make informed decisions, considering limitations, constraints, tradeoffs, effort reversibility, and risk.</p></li><li><p>Document decisions in an architecture decision record (ADR) along with the justification.</p></li><li><p>Follow up on implementation.</p></li></ol></li><li><p>&#8220;Know cloud design patterns&#8221;</p><ol><li><p>Evaluate a workload&#8217;s functional and nonfunctional requirements to recognize patterns.</p></li></ol></li><li><p>&#8220;Be forward-thinking&#8221;</p><ol><li><p>Growth model: how will workloads 
scale?</p></li><li><p>Compliance changes and the roadmap</p></li><li><p>Regional expansion for locally-based companies.</p></li><li><p>Product roadmaps. What will be deprecated and what will expand?</p></li></ol></li><li><p>&#8220;Design for supportability&#8221;</p><ol><li><p>Cloud provider support</p></li><li><p>Operational visibility</p></li><li><p>Customer support capabilities</p></li></ol></li><li><p>&#8220;Maintain and enhance your skills&#8221;</p><ol><li><p>Education</p></li><li><p>Community participation</p></li><li><p>Explanatory exercises</p></li></ol></li><li><p>&#8220;Collaborate for success&#8221;</p><ol><li><p>Maximize cloud provider and community leverage</p></li></ol></li><li><p>&#8220;Be methodical in your design approach&#8221;</p><ol><li><p>Combine frameworks such as TOGAF with the WAF.</p></li></ol></li></ol><p></p><p>All of these are geared toward <em>professionalism</em> in the cloud architect world.</p><p>Alongside these fundamentals are sections detailing the <a href="https://learn.microsoft.com/en-us/azure/well-architected/architect-role/checklist">&#8220;Deliverables&#8221;</a> of the architect, wrapped up in a checklist format, which is a core part of the Azure WAF&#8217;s structure in general. 
Each pillar of excellence will get its own checklist as well.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rSeh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2fbdce0-3496-4339-8c85-2fe3dc152c59_845x667.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rSeh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2fbdce0-3496-4339-8c85-2fe3dc152c59_845x667.png 424w, https://substackcdn.com/image/fetch/$s_!rSeh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2fbdce0-3496-4339-8c85-2fe3dc152c59_845x667.png 848w, https://substackcdn.com/image/fetch/$s_!rSeh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2fbdce0-3496-4339-8c85-2fe3dc152c59_845x667.png 1272w, https://substackcdn.com/image/fetch/$s_!rSeh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2fbdce0-3496-4339-8c85-2fe3dc152c59_845x667.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rSeh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2fbdce0-3496-4339-8c85-2fe3dc152c59_845x667.png" width="845" height="667" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2fbdce0-3496-4339-8c85-2fe3dc152c59_845x667.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:667,&quot;width&quot;:845,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96311,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tech.ngperrin.com/i/161176520?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2fbdce0-3496-4339-8c85-2fe3dc152c59_845x667.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rSeh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2fbdce0-3496-4339-8c85-2fe3dc152c59_845x667.png 424w, https://substackcdn.com/image/fetch/$s_!rSeh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2fbdce0-3496-4339-8c85-2fe3dc152c59_845x667.png 848w, https://substackcdn.com/image/fetch/$s_!rSeh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2fbdce0-3496-4339-8c85-2fe3dc152c59_845x667.png 1272w, https://substackcdn.com/image/fetch/$s_!rSeh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2fbdce0-3496-4339-8c85-2fe3dc152c59_845x667.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>This kind of list provides a far more concrete sense of what an architect does day-to-day. Meetings, presentations, reports, documentation. These are part and parcel of the architect lifestyle, most particularly in the corporate and enterprise environments where Microsoft is so deeply embedded and from which Azure drew its initial customer base.</p><p>I will not go into each of these deliverables individually here, but I do find the Azure WAF approach to be more robust in providing guidelines for how to document things like architecture decisions via the ADR (which is again ironic, as the WAF does not have a robust version history system).</p><h2>Compared</h2><p>These overview sections are particularly crafted to fit different audiences. 
</p><p>The AWS WAF is more focused on the technologist: how to speak to someone in technical leadership with the appropriate forms of abstraction to describe varying technology stacks.</p><p>By contrast, the Azure WAF explicitly notes that cloud architecture must comprise both the technical and the business dimensions to work at all, which again makes sense based on Microsoft&#8217;s history of partnerships. </p><p>But beyond that, the Azure WAF is more helpful in this section because it specifically documents the architect&#8217;s <strong>role</strong> as a professional, which is sometimes the more difficult part, in contrast to the more logical dimensions of architecture as a <strong>practice</strong>, the way the AWS WAF frames it.</p><p>Perhaps most importantly when considering the foundations of the AWS and the Azure WAF in parallel, the AWS WAF exhorts its customers to partake fully in the Amazon model in order to thrive in the AWS ecosystem. This means decentralized architecture.</p><p>The Azure WAF explicitly calls attention to its compatibility with many architectural frameworks including TOGAF (see Fundamentals 7a above). </p><p>It does not aim to be a comprehensive philosophy that commandeers how the business is run. 
It rather seeks to meet business stakeholders where they are and to work with how they currently operate, something which enterprise management and leadership teams would probably prefer over the more abrasive Jeff Bezos/AWS tone.</p><p>This is a contrast that will appear even more clearly when examining how AWS and Azure respectively treat the first pillar of &#8220;Operational Excellence&#8221;.</p>]]></content:encoded></item><item><title><![CDATA[Breaking down OAuth 2.0]]></title><description><![CDATA[Running through OAuth 2.0 concepts, grant types, and more]]></description><link>https://tech.ngperrin.com/p/breaking-down-oauth-20</link><guid isPermaLink="false">https://tech.ngperrin.com/p/breaking-down-oauth-20</guid><dc:creator><![CDATA[Thinking through the Cloud]]></dc:creator><pubDate>Thu, 10 Apr 2025 22:01:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e48aecdf-0b72-4ac7-be77-5a336ba0d4d4_124x123.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><pre><code><code>&#8220;The OAuth protocol was originally created by a small community of web developers from a variety of websites and other Internet services who wanted to solve the common problem of enabling delegated access to protected resources.&#8221;
RFC 5849</code></code></pre></blockquote><p>Auth is an ambiguous shorthand which can refer to either authentication (AuthN) or authorization (AuthZ).</p><p>Today, we explore OAuth 2.0. But first, one cannot begin to understand OAuth 2.0 without understanding the context of why AuthN and AuthZ came to be decoupled.</p><h1>Fundamentals: AuthN and AuthZ</h1><p>Authentication is concerned with verifying that an entity is who they claim to be; authorization takes that authentication and then identifies what that agent should be entitled to access.</p><p>Authentication and authorization are logically distinct flows, and consequently it is generally preferable for them to be technically decoupled as well.</p><p>This was not the case for HTTP services for quite some time, prior to the advent of OAuth.</p><p>Some examples of tightly coupled AuthN/AuthZ:</p><ul><li><p>HTTP Basic Auth: adding the <code>Authorization</code> header as Base64-encoded &#8220;<code>username:password</code>&#8221;</p></li><li><p>API keys: static, long-lived keys passed in the HTTP header <code>x-api-key</code> or <code>Authorization: Bearer {key}</code></p></li><li><p>Cookies: server sets a session cookie after authentication that binds the user&#8217;s session to their authentication status.</p></li></ul><p>This is not to say all such couplings are inherently &#8220;bad&#8221;. </p><p>Cookies work well for same-origin web applications. However, when the scope includes third-party integrations or mobile clients, such coupling introduces problems.</p><p>Kerberos is designed to tightly couple authentication and authorization as well, but it is a fairly robust protocol when dealing with enterprise auth management. 
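<p>To make the coupling concrete, here is a minimal sketch of the HTTP Basic scheme listed above, using only Python&#8217;s standard library (the username and password are placeholder values):</p>

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    # RFC 7617: one opaque, Base64-encoded "username:password" credential.
    # The server must derive BOTH who you are (AuthN) and what you may
    # access (AuthZ) from this single value -- the coupling in question.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

print(basic_auth_header("alice", "s3cret"))  # -> Basic YWxpY2U6czNjcmV0
```

<p>Every request carries the full credential, so there is no way to grant a narrower scope or delegate access to a third party without handing over the password itself.</p>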
(Kerberos even backs Microsoft Entra ID&#8217;s directory technology, from what I understand, to this day.)</p><p>But with the rise of third-party API integrations came a distinct need to decouple authentication and authorization.</p><p>Hence, OAuth.</p><h1>OAuth 1.0: A First Stab</h1><p>Web developers came together to find a solution that would abstract the authorization process out of authentication. <a href="https://oauth.net/core/1.0/">OAuth 1.0 was originally published</a> in October 2007 to solve this problem, with <a href="https://datatracker.ietf.org/doc/html/rfc5849">RFC 5849</a> following in April 2010.</p><p>Although it came to be replaced fairly quickly with OAuth 2.0, this first OAuth protocol did lay a lot of the groundwork for its successor protocol.</p><p>As noted in RFC 5849, the traditional client-server authentication model depends on the client using its credentials to access server-hosted resources. However, the rise of cloud and distributed systems introduced requirements for third-party access at scale.</p><p>OAuth attempted to solve this by adding a third role to the client and server model: the resource owner.  In this new paradigm, a server may host resources that it may not necessarily own. Consequently, the client can request access from the resource owner and then proceed to where those resources are hosted to be provided access, all while verifying the identity of the requesting client.</p><p>It should be remembered that <strong>OAuth is purely an authorization protocol</strong>, as it can be too easy to mentally slip up and consider it at least partially authentication-based.</p><p>Delving into the inner workings of OAuth 1.0 may be a useful exercise but is not our intent here as it is largely obsolete. There are several reasons why OAuth 1.0 needed to be replaced by a successor within a short timeframe.</p><p>In brief, OAuth 1.0 suffered from these flaws:</p><ul><li><p>It required message-level security with signature-based verification. 
This is not only intensive but also does not layer properly with transport-layer security technologies such as TLS or with token formats such as JWTs.</p></li><li><p>Every API request needed to be signed using either HMAC-SHA1 or RSA, which is not only difficult to implement but can also introduce performance degradation.</p></li><li><p>Mobile clients and SPAs cannot securely integrate with a signature-based flow</p></li><li><p>Low extensibility made it difficult to add token formats, transport mechanisms, or grant types, which is not ideal for a critical fast-moving technology.</p></li></ul><h1>OAuth 2.0 Fundamentals</h1><p>Lessons were learned and the developer community came up with a fully reimagined authorization protocol to replace OAuth 1.0. </p><p><a href="https://datatracker.ietf.org/doc/html/rfc6749">RFC 6749 emerged</a> in October 2012, defining and laying the groundwork for what remains to this day one of the most widespread authorization protocols in digital technology.</p><p>While OAuth 1.0 defined three roles of client, server, and resource owner, this was further expanded to four roles in OAuth 2.0, defined as follows:</p><ol><li><p>Client: The application making requests for protected resources on the authorized behalf of the resource owner</p></li><li><p>Resource Owner: The end-user or entity capable of granting access to a protected resource.</p></li><li><p>Authorization Server: The server that issues access tokens to the client authorized by the resource owner when that resource owner has been authenticated.</p></li><li><p>Resource Server: The server hosting the protected resources, responding to resource requests carrying valid access tokens.</p><p></p></li></ol><h2>Client Registration</h2><p>Before using OAuth 2.0, clients must register with the authorization server. 
Registration requirements vary by implementation but typically include:</p><ul><li><p>Client type (confidential or public)</p></li><li><p>Redirect URIs</p></li><li><p>Additional server-specific information</p></li></ul><p>Clients are categorized as:</p><ul><li><p>Confidential: Can securely store credentials (e.g. server-side applications)</p></li><li><p>Public: Cannot keep credentials confidential (e.g. SPAs, mobile apps)</p></li></ul><h2>Client Authentication</h2><p>Confidential clients or clients with credentials must authenticate with the authorization server when making requests to the token endpoint. This authentication serves several important purposes:</p><ul><li><p>Enforces the binding of refresh tokens and authorization codes to the correct client</p></li><li><p>Helps a compromised client recover by invalidating previous credentials</p></li><li><p>Implements periodic credential rotation for enhanced security</p></li></ul><p>The method of authentication can vary depending on the implementation, but typically involves the client ID and client secret.</p><h2>OAuth 2.0 Endpoints</h2><p>The OAuth 2.0 protocol utilizes three primary endpoints:</p><ol><li><p>Authorization Endpoint: Used for resource owner interaction and authorization grant issuance</p><ul><li><p>Returns <code>code</code> parameter for Authorization Code flow</p></li><li><p>Returns <code>token</code> parameter for Implicit flow</p></li><li><p>Must authenticate the resource owner before granting authorization</p></li></ul></li><li><p>Token Endpoint: Where clients exchange authorization grants for access tokens</p><ul><li><p>Requires client authentication for confidential clients</p></li><li><p>Issues access tokens and optional refresh tokens</p></li></ul></li><li><p>Redirect Endpoint: The client's endpoint where the authorization server redirects after authorization</p><ul><li><p>Must be registered during client setup</p></li><li><p>Should be protected by TLS</p></li></ul></li></ol><h2>OAuth 2.0 
Tokens</h2><p>Before diving into the grant types, it's important to understand the tokens used in OAuth 2.0:</p><h3>Access Tokens</h3><ul><li><p>Credentials used to access protected resources</p></li><li><p>Represent specific authorization scopes and durations</p></li><li><p>Presented to resource servers when making API requests</p></li></ul><h3>Refresh Tokens</h3><ul><li><p>Used to obtain new access tokens when the current ones expire</p></li><li><p>Never sent to resource servers</p></li><li><p>Provide a way to maintain long-term access without requiring re-authentication</p></li></ul><p></p><h1>OAuth 2.0 Authorization Grant Types</h1><h2>1. Authorization Code Grant</h2><p>The Authorization Code grant is a user-based flow designed for server-side applications that can securely store client secrets.</p><p><strong>Flow</strong>:</p><ol><li><p>The client redirects the resource owner to the authorization server via the user's browser</p></li><li><p>The authorization server authenticates the resource owner</p></li><li><p>The authorization server redirects back to the client with an authorization code</p></li><li><p>The client exchanges this code for an access token through a secure back-channel request</p></li></ol><p><strong>Authorization Request Parameters</strong>:</p><ul><li><p><code>response_type</code>: Must be "code"</p></li><li><p><code>client_id</code>: Client identifier</p></li><li><p><code>redirect_uri</code>: (Optional) Where to send the response</p></li><li><p><code>scope</code>: (Optional) Requested access scope</p></li><li><p><code>state</code>: Recommended parameter to prevent CSRF attacks</p></li></ul><p><strong>Authorization Response</strong>:</p><ul><li><p><code>code</code>: The authorization code that will be used to request an access token</p></li><li><p><code>state</code>: Same value as in the request (if included)</p></li></ul><p><strong>Access Token Request Parameters</strong>:</p><ul><li><p><code>grant_type</code>: Must be 
"authorization_code"</p></li><li><p><code>code</code>: The authorization code received</p></li><li><p><code>redirect_uri</code>: Must match the original request URI (if included initially)</p></li><li><p><code>client_id</code>: Required if client does not authenticate</p></li></ul><p><strong>Access Token Response</strong>:</p><ul><li><p><code>access_token</code>: The access token</p></li><li><p><code>token_type</code>: Type of token (usually "Bearer")</p></li><li><p><code>expires_in</code>: Token lifetime in seconds (optional)</p></li><li><p><code>refresh_token</code>: Token to get a new access token (optional)</p></li><li><p><code>scope</code>: Scope of access token (optional)</p><p></p></li></ul><p>This grant flow has several advantages: refresh tokens let the user avoid reauthenticating, the resource owner&#8217;s credentials are never exposed to the client, and access tokens are passed directly to the client without exposure to other entities. This is one of the most secure OAuth 2.0 flows.</p><p></p><h2>2. 
Implicit Grant</h2><p>The Implicit grant is a simplified version of the Authorization Code flow, primarily designed for single-page applications (SPAs) and mobile clients that cannot make secure back-channel calls.</p><p><strong>Flow</strong>:</p><ol><li><p>The client redirects the user to the authorization endpoint</p></li><li><p>After authentication, the authorization server returns the access token directly in the URL fragment</p></li><li><p>The client application extracts the token from the URL</p></li></ol><p><strong>Authorization Request Parameters</strong>:</p><ul><li><p><code>response_type</code>: Must be "token"</p></li><li><p><code>client_id</code>: Client identifier</p></li><li><p><code>redirect_uri</code>: (Optional) Where to send the response</p></li><li><p><code>scope</code>: (Optional) Requested access scope</p></li><li><p><code>state</code>: Recommended parameter to prevent CSRF attacks</p></li></ul><p><strong>Authorization Response</strong> (in URL fragment):</p><ul><li><p><code>access_token</code>: The access token</p></li><li><p><code>token_type</code>: Type of token (usually "Bearer")</p></li><li><p><code>expires_in</code>: Token lifetime in seconds (optional)</p></li><li><p><code>scope</code>: Scope of access token (optional)</p></li><li><p><code>state</code>: Same value as in the request (if included)</p></li></ul><p></p><p>In contrast to Authorization Code, the Implicit flow does not allow the use of refresh tokens for security reasons. Furthermore, it is less secure as the tokens are exposed in browser URL fragments, which can then be stored in browser history or accessed by any JavaScript running in the browser. It still has the advantage of skipping the authorization code exchange and thus saving a round trip, but is relatively insecure.</p><p>Today, the Implicit flow has largely been deprecated in favor of Auth Code with PKCE (more on this below).</p><p></p><h2>3. 
Resource Owner Password Credentials Grant (ROPC)</h2><p>The ROPC grant is designed for scenarios where there is a high degree of trust between the resource owner and client.</p><p><strong>Flow</strong>:</p><ol><li><p>The client directly collects the resource owner's username and password</p></li><li><p>The client sends these credentials to the authorization server</p></li><li><p>The server validates and returns an access token</p></li></ol><p><strong>Access Token Request Parameters</strong>:</p><ul><li><p><code>grant_type</code>: Must be "password"</p></li><li><p><code>username</code>: Resource owner's username</p></li><li><p><code>password</code>: Resource owner's password</p></li><li><p><code>scope</code>: (Optional) Requested access scope</p></li></ul><p><strong>Access Token Response</strong>:</p><ul><li><p><code>access_token</code>: The access token</p></li><li><p><code>token_type</code>: Type of token (usually "Bearer")</p></li><li><p><code>expires_in</code>: Token lifetime in seconds (optional)</p></li><li><p><code>refresh_token</code>: Token to get a new access token (optional)</p></li><li><p><code>scope</code>: Scope of access token (optional)</p></li></ul><p></p><p>This flow is a strange one, primarily because it reintroduces the authentication and authorization coupling that OAuth 2.0 was designed to separate. It does allow some use of refresh tokens but does not have many advantages. It is altogether discouraged unless absolutely necessary.</p><p></p><h2>4. 
Client Credentials Grant</h2><p>The Client Credentials grant is designed for userless, server-to-server auth flows.</p><p><strong>Flow</strong>:</p><ol><li><p>The client authenticates with its own credentials (client ID and secret)</p></li><li><p>The authorization server validates these credentials and issues an access token</p></li><li><p>The client uses this token to access protected resources</p></li></ol><p><strong>Access Token Request Parameters</strong>:</p><ul><li><p><code>grant_type</code>: Must be "client_credentials"</p></li><li><p><code>scope</code>: (Optional) Requested access scope</p></li></ul><p><strong>Access Token Response</strong>:</p><ul><li><p><code>access_token</code>: The access token</p></li><li><p><code>token_type</code>: Type of token (usually "Bearer")</p></li><li><p><code>expires_in</code>: Token lifetime in seconds (optional)</p></li><li><p><code>scope</code>: Scope of access token (optional)</p></li></ul><p>With this flow, the machine identity can simply reauthenticate automatically, so refresh tokens are not generally used, except perhaps for performance reasons. This is a very common implementation pattern for API communication and microservice-driven environments that rely upon service accounts.</p><p></p><div><hr></div><p>RFC 6749 specified those four OAuth 2.0 authorization grant flow types, but there remained room for more. </p><p>Several more have gained widespread adoption in the years since.</p><div><hr></div><p></p><h2>5. Authorization Code with PKCE</h2><p>How can public clients such as mobile apps or SPAs participate in OAuth 2.0? 
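<p>Before walking through the answer, here is a minimal sketch of the mechanism this section describes: generating a <code>code_verifier</code> and deriving its S256 <code>code_challenge</code>, using only Python&#8217;s standard library:</p>

```python
import base64
import hashlib
import secrets

def make_code_verifier() -> str:
    # 32 random bytes -> 43-character unpadded base64url string (RFC 7636).
    return base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")

def s256_challenge(verifier: str) -> str:
    # code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), no padding.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

# Verifier from the RFC 7636 Appendix B example:
verifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
print(s256_challenge(verifier))  # -> E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM
```

<p>The challenge travels in the front-channel authorization request, while the verifier is revealed only in the back-channel token request, as explained below.</p>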
Authorization Code grant is susceptible to interception attacks while the Implicit flow bypasses authorization grants altogether.</p><p>Proof Key for Code Exchange by OAuth Public Clients (<a href="https://datatracker.ietf.org/doc/html/rfc7636">RFC 7636</a>) was designed to solve just this problem.</p><p>PKCE introduces a new factor to the OAuth equation: the <code>code_verifier</code>.</p><p>Essentially, a unique code verifier value is generated by the client for each authorization request. It is then transformed, typically using SHA-256, into a value known as the <code>code_challenge</code> and sent to the authorization server to fetch an authorization code.</p><p>The authorization code is returned to the client. At this point, the client sends an access token request with the authorization code <strong>and the </strong><code>code_verifier</code> to the token endpoint, which then returns the access token. Thus, an intercepted authorization code cannot be used without the accompanying <code>code_verifier</code>. </p><p><strong>Authorization Request Parameters</strong>:</p><ul><li><p>All standard Authorization Code parameters</p></li><li><p><code>code_challenge</code>: The transformed code verifier</p></li><li><p><code>code_challenge_method</code>: The method used to transform the code verifier (e.g., "S256" for SHA-256)</p></li></ul><p><strong>Access Token Request Parameters</strong>:</p><ul><li><p>All standard Authorization Code token request parameters</p></li><li><p><code>code_verifier</code>: The original code verifier string</p><p></p></li></ul><p>Both the Access Token response and the Refresh Tokens follow the same standard format as Authorization Code.</p><p>This is the recommended flow for all OAuth clients, both public and confidential, as it mitigates the risk of browser redirect theft or network interception.</p><p></p><h2>6. 
Device Authorization Grant</h2><p>The Internet has expanded beyond computers and smartphones to a wide panoply of &#8220;Things&#8221;. What does auth look like in the world of IoT, where devices may not have browsers? </p><p>Device Authorization Grant (<a href="https://datatracker.ietf.org/doc/html/rfc8628">RFC 8628</a>) answers just such a question, at least for Internet-connected devices that can make HTTPS requests.</p><p>This flow follows these steps:</p><ol><li><p>The client sends an access request with a client identifier to the authorization server.</p></li><li><p>The authorization server issues a device code and an end-user code with an end-user verification URI.</p></li><li><p>The client instructs the end user (resource owner) to visit the verification URI on another device (transferring their user agent) and provides them with an end-user code.</p></li><li><p>The end user visits the verification URI, authenticates with the authorization server, then provides their user code. The authorization server will validate the user code and have the end user confirm the device auth request.</p></li><li><p>The device client will poll the authorization server to determine if the end user has confirmed the auth request. 
Once this is completed, the polling response will include the access token.</p><p></p></li></ol><p>Those with smart TVs may already be familiar with this auth flow as one common consumer-facing implementation.</p><p>Or for those working with cloud providers, you may recognize this flow as a method for authenticating your terminal CLI (headless) against a cloud provider.</p><p></p><p>The key innovation in this grant flow is the <strong>device authorization endpoint</strong>, which is entirely separate from the authorization endpoint reached by users.</p><p></p><p><strong>Device Authorization Request Parameters</strong>:</p><ul><li><p><code>client_id</code>: Client identifier</p></li><li><p><code>scope</code>: (Optional) Requested access scope</p></li></ul><p><strong>Device Authorization Response</strong>:</p><ul><li><p><code>device_code</code>: Code used by the client to poll for the access token</p></li><li><p><code>user_code</code>: Code displayed to the user, typically easy to type</p></li><li><p><code>verification_uri</code>: URI the user should visit to authenticate</p></li><li><p><code>verification_uri_complete</code>: URI with the user code embedded (optional)</p></li><li><p><code>expires_in</code>: Lifetime of the device code in seconds</p></li><li><p><code>interval</code>: Minimum time in seconds between polling requests (optional)</p></li></ul><p><strong>Token Request Parameters</strong>:</p><ul><li><p><code>grant_type</code>: Must be "urn:ietf:params:oauth:grant-type:device_code"</p></li><li><p><code>device_code</code>: Device code received in the authorization response</p></li><li><p><code>client_id</code>: Client identifier (if client authentication is not used)</p><p></p></li></ul><p>Refresh tokens are almost a necessity in this flow from a UX perspective because these devices maintain long-term access and the inconvenience of reauthenticating IoT devices is not appealing to most end users.</p><p>This is a very robust pattern for extending the OAuth 
2.0 framework out into the world of IoT and headless device auth (e.g. CLI).</p><p></p><h2>7. Token Exchange Grant</h2><p>Access tokens are useful technology when implemented properly. They are credentials with clearly defined boundaries and expiration that can be revoked remotely as needed.</p><p>But an access token on its own is immutable. If an authenticated user with broad authorization wants to delegate this access token to another user or system that should have fairly limited scope, there is little that can be done without reauthenticating through some other possible authorization flow.</p><p>Token Exchange Grant (<a href="https://datatracker.ietf.org/doc/html/rfc8693">RFC 8693</a>) was the next extension to OAuth 2.0 to solve this problem by allowing for token exchanges for the purpose of delegation, authorized impersonation, token descoping, or service chaining.</p><p>This flow is simple enough. The client presents an existing token to the token endpoint with specifications for the new token. If these are valid, the authorization server will validate the request and issue a new token in the response. 
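</p><p>As a hedged sketch of what such a request body might look like (the token, audience, and scope values below are hypothetical placeholders, not values from any real service), the form parameters can be encoded like this:</p>

```python
# Sketch of an RFC 8693 token exchange request body.
# All token and audience values are hypothetical placeholders.
import urllib.parse

def build_token_exchange_body(subject_token: str, audience: str) -> str:
    """Encode the form body for a token exchange request."""
    params = {
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": subject_token,
        "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "audience": audience,
        "scope": "read",  # request a narrower scope than the original token
    }
    return urllib.parse.urlencode(params)

body = build_token_exchange_body("subject-token-value", "downstream-api")
print(body)
```

<p>This body would be POSTed to the authorization server&#8217;s token endpoint as <code>application/x-www-form-urlencoded</code>.</p><p>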
Simple as that.</p><p><strong>Access Token Request Parameters</strong>:</p><ul><li><p><code>grant_type</code>: Must be "urn:ietf:params:oauth:grant-type:token-exchange"</p></li><li><p><code>subject_token</code>: The token representing the subject</p></li><li><p><code>subject_token_type</code>: The type of the subject token</p></li><li><p><code>actor_token</code>: (Optional) Token representing the actor</p></li><li><p><code>actor_token_type</code>: (Optional) Type of the actor token</p></li><li><p><code>resource</code>: (Optional) Identifier for the target service or resource</p></li><li><p><code>audience</code>: (Optional) Logical name of the target service</p></li><li><p><code>scope</code>: (Optional) Requested scope</p></li><li><p><code>requested_token_type</code>: (Optional) Type of token requested</p></li></ul><p><strong>Access Token Response</strong>:</p><ul><li><p><code>access_token</code>: The issued token</p></li><li><p><code>issued_token_type</code>: Type of the issued token</p></li><li><p><code>token_type</code>: Type of the access token</p></li><li><p><code>expires_in</code>: Lifetime in seconds (optional)</p></li><li><p><code>scope</code>: Scope of the issued token (optional)</p></li><li><p><code>refresh_token</code>: Token to get a new access token (optional)</p></li></ul><h1>Assertions</h1><p>Assertions, in the context of auth, are not original to OAuth 2.0. SAML and WS-Federation had been using assertions long before OAuth hit the scene.</p><p>An assertion is a set of data that allows one entity to make claims about an identity or subject. It functions as a digitally signed statement so that a valid assertion can be taken as fact by the designated recipient (audience) that the identity or subject has the attributes the issuer reports.</p><h2>8. 
Assertion Grants</h2><p><a href="https://datatracker.ietf.org/doc/html/rfc7521">RFC 7521</a> opened the door for defining how assertions could be used as authorization grants within OAuth 2.0, both for SAML (<a href="https://datatracker.ietf.org/doc/html/rfc7522">RFC 7522</a>) and JWT (<a href="https://datatracker.ietf.org/doc/html/rfc7523">RFC 7523</a>).</p><p>At its highest level, an assertion grant flow follows this sequence:</p><ol><li><p>A client obtains an assertion from the issuer</p></li><li><p>The client presents the assertion to the token endpoint</p></li><li><p>The authorization server validates the assertion according to type-specific rules</p></li><li><p>If valid, the authorization server issues an access token to the client.</p></li></ol><p>Broadly speaking, SAML is more commonly adopted in enterprise and legacy environments, particularly for web-based SSO, while JWT is much more compact and widely embraced in modern web and mobile applications as well as API-centered ecosystems.</p><p>We won&#8217;t go into the SAML side here, but let&#8217;s give a little more information on JWT.</p><p></p><h2>JSON Web Tokens</h2><p>JWT (JSON Web Token) emerged around the same time as, though independently of, OAuth 2.0 and was standardized in <a href="https://datatracker.ietf.org/doc/html/rfc7519">RFC 7519</a>, published in May 2015. Although it was not built from OAuth, it did explicitly allocate a subname registration for OAuth (cf. 
10.2), so the two were meant to go hand-in-hand.</p><p>XML in all its bulk and unreadability was an industry standard for quite some time, and as JSON continued to grow in popularity as an interchange format, so too did the desire for a lightweight token format based on JSON, especially as stateless APIs came to fruition.</p><p>JWT accomplishes several things at once which are important for OAuth:</p><ol><li><p>It provides a compact way to encode token information (HTTP headers)</p></li><li><p>Tokens can directly include claims, no external reference required.</p></li><li><p>Built-in signature verification to ensure token integrity</p></li></ol><p>All of these are contained in its simple tripartite format: a header which identifies the signature&#8217;s algorithm, a payload which contains the claims (assertions about the subject), and a signature to ensure token integrity.</p><p>JWT is not just limited to OAuth but has taken wide and deep root across the entire IAM and auth world.</p><p>In the context of OAuth 2.0, though, it takes this format.</p><p><strong>JWT Access Token Request Parameters</strong>:</p><ul><li><p><code>grant_type</code>: Must be "urn:ietf:params:oauth:grant-type:jwt-bearer"</p></li><li><p><code>assertion</code>: The JWT assertion</p></li><li><p><code>scope</code>: (Optional) Requested access scope</p></li></ul><p><strong>Access Token Response</strong>: Standard OAuth 2.0 token response.</p><p><strong>JWT Required Claims</strong>:</p><ul><li><p><code>iss</code>: Issuer - who issued the JWT</p></li><li><p><code>sub</code>: Subject - who the JWT is about</p></li><li><p><code>aud</code>: Audience - recipient(s) the JWT is intended for</p></li><li><p><code>exp</code>: Expiration time</p></li><li><p><code>iat</code>: Issued at time</p></li><li><p><code>jti</code>: JWT ID - unique identifier for the token</p></li></ul><p></p><h1>OAuth 2.0: Best Security Practices</h1><p><a href="https://datatracker.ietf.org/doc/html/rfc9700">RFC 9700</a> came out fairly 
recently as of this writing, and it provides a general collation of security vulnerabilities and best practices around the OAuth 2.0 framework.</p><p>A suitable way to close this breakdown is to visit these threat vectors and best practices.</p><h2>OAuth 2.0 Attack Vectors</h2><p>This RFC publication lays out a number of attacker profiles which are specifically targeted by the mitigation document:</p><ul><li><p>A1: An attacker who operates OAuth clients registered with the targeted authorization server, luring users to their compromised URIs.</p></li><li><p>A2: Network attackers who can eavesdrop, manipulate, block, or spoof messages that are not protected by TLS.</p></li><li><p>A3: Attackers that can read authorization response content</p></li><li><p>A4: Attackers that can read authorization request content</p></li><li><p>A5: Attackers that can acquire a valid access token issued by the authorization server</p></li></ul><p>From these attacker profiles come the following attack types:</p><h3>1. Insufficient Redirection URI Validation</h3><p><strong>Threat Vector: </strong>Redirection URI patterns introduce risk as opposed to complete, literal URI redirect matching. </p><p>If subdomains or paths have wildcard or other regex matching characters, it is possible for an attacker to hijack that subdomain or path, and then steal authorization codes or access tokens.</p><p><strong>Mitigation: </strong>Use exact redirection URI matching.</p><p>Additional steps:</p><ul><li><p>Do not expose open redirectors</p></li><li><p>Attach an arbitrary fragment identifier <code>#_</code> to the redirection URI to prevent the browser from reattaching an earlier URI&#8217;s fragment to the redirect Location (which is insecure)</p></li></ul><h3>2. Credential Leakage via Referer Headers</h3><p><strong>Threat Vector: </strong>Authorization codes or state values can be unintentionally disclosed in the <code>Referer</code> HTTP header. 
Disclosed state values remove protection from CSRF attacks.</p><p><strong>Mitigation: </strong>The page rendered as the result of the OAuth authorization response should not include third-party resources or links to external sites.</p><p>Additional steps:</p><ul><li><p>Suppress the <code>Referer</code> header <a href="https://www.w3.org/TR/2017/CR-referrer-policy-20170126/">through a policy</a></p></li><li><p>Bind the authorization code to a confidential client or PKCE challenge.</p></li><li><p>Authorization codes must be invalidated by the authorization server after their first use at the token endpoint.</p></li><li><p>The state value should be invalidated by the client after its first use at the redirection endpoint.</p></li><li><p>Use the form post response mode instead of a redirect for the authorization response.</p></li></ul><h3>3. Credential Leakage via Browser History</h3><p><strong>Threat Vector: </strong>Authorization codes and access tokens can end up in the browser&#8217;s history of visited URLs (e.g. Implicit flow), which when exposed can lead to replay attacks.</p><p><strong>Mitigation: </strong>Prevent authorization code replay and keep access tokens out of URI query parameters.</p><h3>4. Mix-Up Attacks</h3><p><strong>Threat Vector: </strong>An OAuth client interacts with multiple authorization servers, one of which is under the control of an attacker. The attacker tricks the client into sending authorization codes or access tokens to the wrong server, thus hijacking them.</p><p>This can happen if the client stores the compromised authorization server in its session, redirects the user to that compromised auth server, and then the compromised auth server forwards the request to a legitimate authorization server, using that legitimate server&#8217;s client ID. 
The user will log in on the legitimate server, but that server would then send the authorization code to the attacker.</p><p>Variants of this include intercepting and modifying the user&#8217;s traffic, intercepting the access token in the Implicit grant, per-authorization-server redirect URIs, and OpenID Connect abuse.</p><p><strong>Mitigation: </strong>Every auth response should include the <code>iss</code> issuer ID as a parameter or claim.</p><h3>5. Authorization Code Injection</h3><p><strong>Threat Vector: </strong>An attacker who has acquired a valid authorization code may try to redeem that authorization code for an access token by injecting it into their client, attaching the victim&#8217;s identity to the attacker&#8217;s session.</p><p><strong>Mitigation: </strong>PKCE mitigates this; attaching and checking a random <code>nonce</code> value when requesting the session can prevent this as well.</p><h3>6. Access Token Injection</h3><p><strong>Threat Vector: </strong>An attacker with a valid, stolen access token may inject this into a legitimate client.</p><p><strong>Mitigation</strong>: OAuth 2.0 on its own cannot mitigate this, but integration with OpenID Connect could include the <code>at_hash</code> claim in the ID token, introducing an additional layer of protection.</p><h3>7. Cross-Site Request Forgery</h3><p><strong>Threat Vector: </strong>An attacker injects a request into a legitimate client&#8217;s redirection URI to cause the client to access resources under the attacker&#8217;s control. This is known as CSRF.</p><p><strong>Mitigation</strong>: Linking the request to the user agent session is a sufficient countermeasure and can be implemented by including <code>state</code> or <code>nonce</code> values or using PKCE.</p><h3>8. 
PKCE Downgrade Attack</h3><p><strong>Threat Vector: </strong>If a client relies on PKCE instead of <code>state</code> for CSRF protection, an attacker who can strip the PKCE parameters from the authorization request can downgrade the flow, leaving OAuth open to CSRF.</p><p><strong>Mitigation: </strong>It cannot be guaranteed that OAuth clients will universally implement <code>state</code> properly, so authorization servers must reject any token request containing a <code>code_verifier</code> if the corresponding authorization request carried no <code>code_challenge</code>.</p><h3>9. Access Token Leakage at the Resource Server</h3><p><strong>Threat Vector: </strong>Counterfeit or compromised resource servers can receive access tokens and use the authorization granted in those tokens to target other resource servers.</p><p><strong>Mitigation: </strong>Sender-constrained access tokens to prevent access replay attacks, as well as audience restriction (see the next item for more detail).</p><h3>10. Misuse of Stolen Access Tokens</h3><p><strong>Threat Vector: </strong>Access tokens can be stolen in various ways and then used to access other resource servers.</p><p><strong>Mitigation: </strong>Two primary strategies:</p><ol><li><p>Sender-constrained Access Tokens: The sender is obligated to demonstrate knowledge of a certain secret as a pre-requisite for a resource server to accept the sent token. This can be implemented by either Mutual-TLS Client Authentication and Certificate-Bound Access Tokens (<a href="https://datatracker.ietf.org/doc/html/rfc8705">RFC 8705</a>) or Demonstrating Proof of Possession (<a href="https://datatracker.ietf.org/doc/html/rfc9449">RFC 9449</a>).</p></li><li><p>Audience-Restricted Access Tokens: Audience restriction <code>aud</code> scopes a token to a defined resource server. This also enables the token format to be custom-tailored for various audiences. 
Implementation mechanisms are documented in <a href="https://rfc-editor.org/rfc/rfc6749#section-3.3">RFC 6749 3.3</a> and <a href="https://datatracker.ietf.org/doc/html/rfc8707">RFC 8707</a>.</p></li></ol><h3>11. Open Redirection</h3><p><strong>Threat Vector: </strong>Open redirection at the client or authorization server level can introduce phishing or malicious redirects.</p><p><strong>Mitigation: </strong>On the client side, only redirect if target URLs are allowed or request integrity can be validated (<a href="https://cheatsheetseries.owasp.org/cheatsheets/Unvalidated_Redirects_and_Forwards_Cheat_Sheet.html">OWASP guidelines</a>). On the authorization server side, the client must always be authenticated before redirection.</p><h3>12. 307 Redirect</h3><p><strong>Threat Vector: </strong>If an HTTP 307 (Temporary Redirect) occurs at the OAuth authorization endpoint, this can leak user credentials to the client (which in most authorization grant flows should not be happening).</p><p><strong>Mitigation: </strong>HTTP 303 (See Other) is a better status code in this case because it converts POST to GET and drops the user credentials. </p><p>HTTP 302 is another option but can have more variable behavior depending on the browser.</p><h3>13. TLS Terminating Reverse Proxies</h3><p><strong>Threat Vector</strong>: If a reverse proxy is deployed in front of the application server and TLS traffic terminates at the proxy, an attacker could try to circumvent security controls by manipulating particular HTTP headers such as <code>X-Forwarded-For</code>, thereby gaining the ability to eavesdrop, inject, or replay messages.</p><p><strong>Mitigation: </strong>Reverse proxies must sanitize inbound requests to ensure the authenticity and integrity of all header values relevant to application server security.</p><h3>14. Refresh Token Protection</h3><p><strong>Threat Vector: </strong>Refresh tokens grant the holder the ability to mint fresh access tokens. 
They can be stolen and replayed to gain access tokens.</p><p><strong>Mitigation: </strong>Documented in <a href="https://datatracker.ietf.org/doc/html/rfc6749">RFC 6749</a>. Use TLS in transit. Ensure tokens are confidential, unique, and non-guessable. Bind them to a specific client and leverage expiration, revocation, and rotation.</p><h3>15. Client Impersonating Resource Owner</h3><p><strong>Threat Vector: </strong>A resource server could mistake a client for a resource owner, such as when a malicious client sets its <code>client_id</code> to a value identifying the resource owner, whereupon the resource server may mistakenly believe the client is the resource owner.</p><p><strong>Mitigation: </strong>Authorization servers should not allow clients to influence <code>client_id</code>, or if this is not possible, should enforce another mechanism to distinguish between access tokens issued by client credentials and those issued by user-based authorization grant flows.</p><h3>16. Clickjacking</h3><p><strong>Threat Vector: </strong>Clickjacking (i.e. UI redressing) can trick a user into accidentally clicking on a hidden link that can redirect authorization and cause credential theft or scope manipulation.</p><p><strong>Mitigation: </strong>Authorization servers are responsible for preventing clickjacking by using CSP Level 2+ and <code>X-Frame-Options</code> headers. CSP is preferable for flexibility and multiple allowed origins.</p><h3>17. Attacks on In-Browser Communication Flows</h3><p><strong>Threat Vector: </strong>When OAuth uses <code>postMessage</code> (<a href="https://html.spec.whatwg.org/multipage/web-messaging.html#web-messaging">example</a>) instead of a redirect, which is common in SPAs, it can introduce origin risks and cross-site attacks.</p><p><strong>Mitigation: </strong>Authorization servers must use exact string matching for receiver origins and should only send to trusted origins. 
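</p><p>Though this check runs in browser JavaScript in practice, the pitfall of non-exact origin matching can be sketched in Python (the origins below are hypothetical):</p>

```python
# Illustrative sketch: why receiver-origin checks must use exact string
# matching. A substring check can be fooled by an attacker-chosen domain.
TRUSTED_ORIGIN = "https://app.example.com"  # hypothetical client origin

def naive_check(origin: str) -> bool:
    # Flawed: matches any origin that merely contains the trusted host.
    return "app.example.com" in origin

def exact_check(origin: str) -> bool:
    # Correct: the full origin string must match exactly.
    return origin == TRUSTED_ORIGIN

attacker_origin = "https://app.example.com.evil.test"
print(naive_check(attacker_origin))  # True: the naive check is bypassed
print(exact_check(attacker_origin))  # False: exact matching rejects it
```

<p>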
Clients must reject messages from untrusted origins using strict <code>e.origin</code> checks. The idea is to extend all redirect-based protections to <code>postMessage</code> flows.</p><p></p><h2>Nutshell: 15 Best OAuth 2.0 Security Practices</h2><p>Combining all these various attack vectors and mitigations, here are 15 best practices highlighted in RFC 9700 to ensure tight security on OAuth flows.</p><ol><li><p>Redirect URIs should have exact string matching.</p></li><li><p>Avoid open redirectors.</p></li><li><p>Enforce HTTPS.</p></li><li><p>Never leak credentials during an HTTP 307 redirect (use HTTP 303).</p></li><li><p>CSRF protection: PKCE for public clients, <code>nonce</code> for OpenID Connect, or state-bound CSRF tokens otherwise.</p></li><li><p>Only use secure and transaction-specific PKCE challenge methods (e.g. SHA-256).</p></li><li><p>If using multiple authorization servers, use the <code>iss</code> parameter or, secondarily, distinct redirect URIs per server.</p></li><li><p>Never use the Implicit grant. 
For public clients, use PKCE.</p></li><li><p>Never use the Resource Owner Password Credentials (ROPC) Grant.</p></li><li><p>Use sender-constrained access and refresh tokens (via Mutual TLS or DPoP) where possible.</p></li><li><p>Use token scoping, audience restriction, and fine-grained access.</p></li><li><p>Use asymmetric client authentication.</p></li><li><p>Use authorization server metadata to auto-discover endpoints and enable secure client auto-configuration.</p></li><li><p>Verify <code>postMessage</code> origins.</p></li><li><p>Do not support CORS on the authorization endpoints.</p></li></ol><p></p><h1>Conclusion</h1><p>OAuth 2.0 has a plethora of practitioners, but only a subset of them truly grasp how the framework&#8217;s esoteric (and perhaps clunky) vocabulary fits together.</p><p>Despite its relative complexity, OAuth 2.0 is certainly one of the most robust authorization frameworks ever developed, and it behooves even the junior engineer to learn its general ins and outs as part of their journey as a technologist.</p><p>Understanding these authorization flows not only strengthens how security is implemented but can also open the door for extensibility as more sophisticated, secure, and user-friendly auth experiences continue to be engineered.</p>]]></content:encoded></item><item><title><![CDATA[A History of Storage: Block]]></title><description><![CDATA[How drives, protocols, and networks have evolved to indefinite scale]]></description><link>https://tech.ngperrin.com/p/a-history-of-block-storage-mainframe</link><guid isPermaLink="false">https://tech.ngperrin.com/p/a-history-of-block-storage-mainframe</guid><dc:creator><![CDATA[Thinking through the Cloud]]></dc:creator><pubDate>Sat, 29 Mar 2025 22:32:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cf8482b5-5806-4100-9819-f5f2f58b2d64_5000x3970.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When cloud 
practitioners talk about cloud storage, they generally refer to three primary categories: block, file, and object storage. These are not comprehensive or mutually exclusive categories, but they are consistently useful ways of describing how data may be stored in cloud systems.</p><p>Let us explore the history of each of these and what they entail. This post will focus on <strong>block storage</strong>. We will look at brief pockets in a deep history that could be far more technical and exhaustive than we will choose to be here. This is mainly for the amateur or professional seeking a deeper understanding of the machines we manage.</p><h2>Block Storage: Origins in the Mainframe</h2><p>The early days of computing depended upon expensive, bulky mainframes, where we find the advent of disk-based technology.</p><p>We can start with the <a href="https://www.ibm.com/history/system-360">IBM System/360</a>, the famous &#8220;$5 billion gamble&#8221; introduced in 1964, which Jim Collins ranked among the top three business accomplishments of all time, next to the Boeing 707 and Ford Model T.</p><p>The System/360 presented a landmark computing architecture which consolidated a number of computing advancements within a single framework. This was where the 8-bit definition of &#8220;byte&#8221; was born.</p><p>Alongside the IBM System/360 emerged the <a href="http://www.columbia.edu/cu/computinghistory/2311.html">IBM 2311 Disk Storage drive</a>.</p><p>This disk storage drive provided direct access to 7.25 million 8-bit bytes per removable disk pack. Eight drives could be attached to one single control unit for a total of 58MB with a transfer speed of 156 KB/s.</p><p>The disk was broken into a Cylinder/Head/Sector (CHS) geometry. Mainframe disks (14 inches in diameter) consisted of rotating platters. Each platter had concentric tracks, which in turn were grouped into <em>cylinders</em>, which were stacks of tracks aligned vertically across multiple platters. 
</p><p>Each track in turn was divided into fixed- or variable-sized <em>sectors</em> of data <em><strong>blocks</strong></em>, including the common pattern of 1024-byte blocks that is familiar to technologists today.</p><p>The <em>head</em> was the electromagnetic piece that hovers just above the disk platter and can either read or write data.</p><p>This was a <em>mechanical </em>process.</p><p>Read/write operations consisted of the following steps:</p><ol><li><p>The control unit would send a &#8220;seek&#8221; command to the drive. This would trigger the actuator to move the heads. </p></li><li><p>The head would seek the correct cylinder and the system would wait until it was positioned above the right location. Because of this, the physical distance of cylinder traversal would determine how long the seek took (the IBM 2311 boasted an average seek time of 85ms).</p></li><li><p>When the correct record physically rotated under the head, the control unit signaled the channel to start reading or writing the raw bitstream. When reading, the data would be assembled into a buffer that was passed into main memory. When writing, the control unit would magnetize the cylinder track surface to etch the correct data bits according to a particular format (e.g. Count-Key-Data, abbreviated as CKD) determined by the OS or access method.</p></li></ol><p>While this physical, magnetic procedure may seem foreign to us today, it is this underlying process which still defines how block storage works, including in modern operating systems.</p><p>We can glean these block storage principles from the IBM 2311:</p><ul><li><p>Data is organized into fixed-size blocks (e.g. 1024 bytes)</p></li><li><p>Data is addressed by a numeric identifier</p></li><li><p>The OS or hypervisor reads or writes data based on addressing</p></li><li><p>The storage hardware does not know anything about what is stored (i.e. no metadata)</p></li></ul><p>This is raw data. 
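</p><p>The CHS scheme described above linearizes into the numeric block addressing still in use today. As a sketch (the geometry constants below are illustrative, not the IBM 2311&#8217;s actual layout):</p>

```python
# Classic CHS -> LBA conversion: flatten (cylinder, head, sector) into a
# single linear block address. Geometry values are illustrative only.
HEADS_PER_CYLINDER = 10  # tracks stacked vertically in each cylinder
SECTORS_PER_TRACK = 40   # blocks per track (sectors are 1-indexed)

def chs_to_lba(cylinder: int, head: int, sector: int) -> int:
    """Map a CHS address to a linear block address (LBA)."""
    return (cylinder * HEADS_PER_CYLINDER + head) * SECTORS_PER_TRACK + (sector - 1)

print(chs_to_lba(0, 0, 1))  # 0: first sector of the first track
print(chs_to_lba(0, 1, 1))  # 40: one full track later
```

<p>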
These fundamentals remain at the deepest layers under the hood of computing today.</p><h2>SCSI and Open Standard Disk Storage</h2><p>Block storage had its mainframe successors with the IBM System/370 backed by the IBM 3330 and 3350 Winchester disk drives, the IBM ESA/390 backed by the 3390, as well as competitors such as the Honeywell 6000 or Burroughs B6500/6700.</p><p>These primarily operated under proprietary, internal channel interfaces.</p><p>This changed with the development of the <a href="https://bitsavers.trailing-edge.com/pdf/ansi/X3/SASI_revC_Jan82.pdf">Shugart Associates System Interface (SASI)</a> in the late 1970s, whose primary objective was to attach hard drives to <em>multiple vendors&#8217; </em>computers. It was a move toward vendor neutrality via an open standard.</p><p>Much of SASI was co-opted into a more widely known standard, the <a href="https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub131.pdf">Small Computer System Interface (SCSI)</a>, in 1986.</p><p>SCSI provided generic commands that could be applied across different vendors, such as INQUIRY, READ, WRITE, etc. This provided a higher level of abstraction where the SCSI device could handle seek times and geometry, which no longer needed to be considered by the underlying host. The host would provide the commands and the firmware would translate them into mechanical execution.</p><p>SCSI has since been extended into SAS, iSCSI, and Fibre Channel (FC), all of which remain dependent upon that original SCSI command set.</p><p></p><h2>SATA and PC Block Storage</h2><p>By the 1980s, personal computing was on the rise and with it a concrete need to solve the problem of storage. 
Integrated Drive Electronics (IDE) came about as a way to simplify PC hard drive connections by integrating the disk controller into the disk itself.</p><p>But PC drive storage truly hit its stride with ANSI&#8217;s ATA (AT Attachment) as the <em>de facto </em>standard for PC consumers to interface with the hard drive.</p><p>This early iteration of ATA, now often referred to retroactively as PATA (Parallel ATA), was constrained by issues around signal integrity with the increase of bus speeds, primary/secondary configuration with multiple channels, and bandwidth ceilings generally hitting UDMA/133.</p><p>There was a need for better.</p><p>SATA (Serial ATA) 1.0 was published in 2003 (<a href="https://web.archive.org/web/20111001000000*/http://www.sata-io.org/documents/serialata10a.zip">downloadable as ZIP here</a>) out of the collaboration of multiple hardware vendors.</p><p>SATA 1.0 provided 1.5 Gb/s signaling with ~150 MB/s throughput, all while maintaining backwards compatibility with ATA.</p><p>It was then iterated on with SATA 2.0 (3.0 Gb/s) the following year and by SATA 3.0 in 2009 (6.0 Gb/s), with minor version increments since then. 
There has been no movement beyond this 6 Gb/s standard, primarily because of the push toward NVMe for high performance in the years to come.</p><p>Yet to this day SATA is of crucial importance for HDD storage of massive volumes at relatively low costs and remains a primary archival or backup strategy for cold data storage, as the data at rest costs less per GB when compared against SAS or NVMe solutions.</p><p>Furthermore, SATA is recognized by nearly all hypervisors and operating systems, requiring almost no specialized hardware while maintaining a strong balance of performance and throughput for standard workloads.</p><h2>Fibre Channel, iSCSI and Networked Storage</h2><p>Before the 1980s, compute processes were attached to storage first through mainframe-style channel connections as we have seen above and then through directly attaching disk arrays to the hardware, what we now refer to as DAS (Direct Attached Storage).</p><p>However, personal computing meant multiple computers could be at play within a single data ecosystem, computers that were meant to talk to one another.</p><p>NAS (Network Attached Storage) solutions such as NFS (Network File System) in 1984 and SMB (Server Message Block) several years later were built to allow for <em>file</em>-level sharing over a network.</p><p>This works for files but not for block-level access of data in the way that databases and many workloads require. Organizations needed a way to standardize raw block device access across multiple servers.</p><p>Here is where SAN (Storage Area Network) comes into play as a new model for providing distributed block storage access, first through the invention of <strong>Fibre Channel</strong> in the 1990s.</p><p><a href="https://hsi.web.cern.ch/fcs/spec/overview.htm">Fibre Channel SAN</a> works by encapsulating SCSI commands within Fibre Channel frames that adhere to the FCP (Fibre Channel Protocol). 
Servers could perform SCSI read and write operations over a dedicated FC fabric to a centralized storage array for the first time in history.</p><p>Now there were new possibilities for block-storage fault tolerance and high availability, but this new capability also raised new problems to solve: dedicated networking infrastructure, fabric topology design, and host bus adapter provisioning.</p><p>The main problem is that Fibre Channel requires specialized hardware. As networking bandwidth soared in the late 1990s and IP networks took root, demand emerged for a cheaper and simpler way to handle block storage connectivity.</p><p>iSCSI (Internet SCSI) was standardized in 2004 to tunnel those SCSI commands over TCP/IP, codified in <a href="https://datatracker.ietf.org/doc/html/rfc3720">RFC 3720</a>.</p><p>Now, for the first time, servers could connect to remote block storage arrays using a standard Ethernet NIC or an iSCSI offload engine rather than needing a Fibre Channel host bus adapter.</p><p>iSCSI gained widespread adoption, particularly among cost-sensitive businesses, and as Ethernet continued to reach new speed ceilings (and with offload NICs and jumbo frames coming into the picture), iSCSI came to significantly narrow the performance gap with FC.</p><p>To this day, FC and iSCSI remain common staples in enterprise data centers for storage networking.</p><h2>NVMe and the Solid State Drive</h2><p>HDDs (Hard Disk Drives) have been all the rage for quite some time. They are one of those vestigial tethers back to the early days of mainframe computing because, just like the IBM System/360 we discussed above, they depend upon rotating magnetic platters with a read/write actuator arm to access data. Although they may spin at 7,200 RPM, they still have a seek time. They are fundamentally mechanical.</p><p>SSDs (Solid State Drives) are a fundamental paradigm shift in this regard.
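</p><p>The mechanical penalty SSDs leave behind is easy to quantify. Ignoring seek time entirely, the platter must on average rotate half a revolution before the target sector passes under the head:</p>

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average rotational latency: on average the head waits half a spin."""
    ms_per_revolution = 60_000 / rpm  # 60,000 ms per minute / revolutions per minute
    return ms_per_revolution / 2

for rpm in (5400, 7200, 15000):
    print(f"{rpm} RPM: ~{avg_rotational_latency_ms(rpm):.2f} ms rotational latency")
```

<p>A 7,200 RPM drive thus pays roughly 4 ms per random access before seek time is even counted, orders of magnitude above the microsecond-scale latencies flash delivers.</p><p>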
They are the first quantum leap beyond magnetic platter drives to what we know as flash-based storage.</p><p>Flash memory is not a new thing. It was invented in the 1980s by engineers at Toshiba, deriving ideas and patterns from pre-existing EEPROM technology.</p><p>Flash memory can be divided into NOR flash and NAND flash (referring to the NOR and NAND logic gates). NOR allows writing to an erased location at the granularity of the machine word and offers powerful direct random access down to individual bytes. It has faster read speeds than its cousin NAND.</p><p>NAND flash reads and writes at page granularity and erases at block granularity through a serial access approach. While it may be less efficient for random access tasks, it is considerably more cost-effective for high-density data storage. NAND is the foundation of many flash-based technologies we are familiar with, including USB drives, smartphones, and the SSD.</p><p>NAND flash SSDs no longer have a physical seek time where an actuator head travels over a magnetic platter like HDDs. Instead, reads and writes occur electronically at microsecond-level latencies with significantly higher random I/O.</p><p>However, there remained a problem. </p><p>While SATA provides a wonderful level of host abstraction for read and write operations using the ATA command set, the <strong>standard was designed exclusively for mechanical drives</strong>. Consequently, SATA SSDs encountered bottlenecks from the protocols used to access them.</p><p>The industry solution requires a brief rewind to 2003, when PCIe (Peripheral Component Interconnect Express) was introduced as a general-purpose, high-speed expansion bus standard, years before flash SSDs would come to ride on it as an alternative to SATA.
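</p><p>How much headroom PCIe offers over SATA&#8217;s 600 MB/s ceiling can be estimated from per-lane transfer rates and encoding efficiency alone (a rough sketch that ignores protocol framing overhead):</p>

```python
# Per-lane transfer rate in GT/s and encoding efficiency by generation:
# Gen1/Gen2 use 8b/10b encoding, Gen3 and later use 128b/130b.
PCIE_GENS = {
    "Gen1": (2.5, 8 / 10),
    "Gen2": (5.0, 8 / 10),
    "Gen3": (8.0, 128 / 130),
    "Gen4": (16.0, 128 / 130),
}

SATA3_MB_S = 600  # SATA 3.0 usable throughput ceiling, for comparison

def pcie_throughput_mb_s(gen: str, lanes: int) -> float:
    """Approximate one-direction usable bandwidth of a PCIe link in MB/s."""
    gt_per_s, efficiency = PCIE_GENS[gen]
    return gt_per_s * 1e9 * efficiency / 8 / 1e6 * lanes

for gen in PCIE_GENS:
    x4 = pcie_throughput_mb_s(gen, 4)
    print(f"PCIe {gen} x4: ~{x4:,.0f} MB/s ({x4 / SATA3_MB_S:.1f}x SATA 3.0)")
```

<p>An x4 Gen3 link, a common NVMe SSD configuration, already offers roughly six and a half times SATA&#8217;s ceiling.</p><p>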
PCIe offers direct, high-speed links between the SSD and the CPU, free of the legacy constraints of a disk-era interface.</p><p>This was then paired with <a href="https://nvmexpress.org/wp-content/uploads/NVM-Express-1_0e.pdf">NVMe (Non-Volatile Memory Express), first published in 2011</a>, a host controller interface standard built to fully leverage the parallelism possible in modern SSDs, with the NVMe controller chip residing on the drive alongside the storage medium.</p><p>From the start, NVMe featured much higher IOPS and lower latency than SATA SSDs. Today&#8217;s performance-critical applications depend upon this kind of hardware.</p><p>This has even been extended to <a href="https://nvmexpress.org/specification/nvme-of-specification/">NVMe-oF (NVMe over Fabrics)</a>, which applies the concept of direct-attached flash across entire networks. Just as Fibre Channel and iSCSI expanded SCSI commands to the network layer, so too does NVMe-oF expand NVMe commands to the network layer.</p><p>PCIe and NVMe remain rapidly improving technologies, and it may not be long until HDDs are considered only an archival solution, much like tape storage today. Time shall tell what future horizons there are for block storage.</p><h2>Block Storage and Cloud Architecture</h2><p>Several considerations for cloud professionals can be drawn from this cursory glance through block storage history.</p><p><em>First</em>, and this goes for data storage in general, the lifespan of an organization&#8217;s data assets can often indicate how many protocols it has jumped through, failed to jump through, or been caught in some weird in-between state.</p><p>This is especially the case with many on-premises data solutions or even earlier implementations of cloud block storage.</p><p>History and operations can shape organizational data in a number of ways that may not be obvious at first glance.
As Peter Drucker notes, it is easy to underestimate how long temporary solutions simply live on by default.</p><p>It is always important to ask whether the decisions that shaped a technology or data solution were primarily determined by constraints or requirements that are no longer relevant, or whether there is a true ghost in the past that may emerge if you disrupt the &#8220;way things have always been done here&#8221;. </p><p>Both are possible. This requires case-by-case judgment.</p><p><em>Second, </em>interest in block storage does not seem to be going away anytime soon for IT professionals. If there is one consistency we can trace in this historical thread, it is an intense, sustained demand for optimizing and maximizing low-level data operations using block storage. Even as other data storage technologies wax and wane, block storage has been a consistent forerunner.</p><p>We can trace this to the fundamental advantages it has offered throughout history:</p><ul><li><p>It offers high performance on random reads/writes (e.g. OLTP databases, queues).</p></li><li><p>Because it is not concerned with metadata, block storage can use any filesystem and offers great flexibility for formatting, encryption, and partitioning.</p></li><li><p>VM operating systems are strongly integrated with block volumes (e.g. AWS EBS, Azure Disk), paralleling standard compute orchestration. Block storage is a universally accepted, deeply integrated method for storage that can be transferred and migrated across a huge breadth of technology solutions (in contrast to much more opinionated storage methods such as key-value stores).</p></li></ul><p><em>Third, </em>here are some open-ended use case questions that cloud professionals can use to judge between various block storage solutions:</p><ul><li><p>Are the workload read/writes random or sequential? Is it more read-intensive or write-intensive? 
What are the latency requirements?</p></li><li><p>Does the workload need direct block device access (transactional databases and hypervisors do)?</p></li><li><p>What are the consistency model and redundancy expectations?</p></li><li><p>Is concurrent access a need for volumes, or is it acceptable for only one host to access a volume at a time?</p></li><li><p>What are the capacity, scaling, and cost needs?</p></li><li><p>What are the HA and fault tolerance requirements?</p></li><li><p>What layers and forms of encryption are required for data at rest?</p></li><li><p>How does it integrate with pre-existing hardware, systems, or data? How do your block storage choices align with the trajectory of organizational or industry technologies?</p></li><li><p>What are the retention and migration SLA expectations?</p></li><li><p>What levels of IT storage administration experience are available, and what maintenance overhead is acceptable or preferable for the business line?</p></li></ul><p>Through all of this, it truly is a marvel that, through the radical changes that have swept through computing, one thing remains the same: at the end of the day, we are still reading from and writing to <strong>blocks</strong>.</p>]]></content:encoded></item></channel></rss>