The Well-Architected Frameworks Compared - Part II: Operational Excellence
Thinking through Operational Excellence in the Well-Architected Frameworks
Introduction
Some time ago, I started a series comparing the Well-Architected Frameworks of the major Cloud Service Providers.
That first part considered the introductions and general shape of the frameworks, for example noting how Microsoft’s had a distinctive enterprise flair in contrast to the AWS emphasis on technology enablement and velocity.
Let us continue in this comparison by looking at the “Operational Excellence” pillar across AWS, Azure, and Google Cloud.
Definitions
Let us begin by laying their definitions of operational excellence side-by-side.
Operational excellence (OE) is a commitment to build software correctly while consistently delivering a great customer experience. - AWS
At the core of the Operational Excellence pillar are DevOps practices that ensure workload quality through standardized workflows and team cohesion. - Azure
Operational excellence in the cloud involves designing, implementing, and managing cloud solutions that provide value, performance, security, and reliability. - GCP
At least this is the closest the three frameworks come to defining “operational excellence”. I find it peculiar that there is no deliberate effort to define and delimit what operational excellence entails across these three frameworks.
It seems to be assumed that what “operational excellence” is and how it is achieved can be readily surmised by exploring the respective design principles and contents of the frameworks.
However, if the purpose of a framework is to provide heuristics for linking concepts across domains, it seems to me a missed opportunity to outline at the forefront of the pillar definitions what “operational excellence” entails and its significance across the broader business.
I would argue it is in fact the single most critical element of these six pillars, even if only a minimal level of operational excellence is achieved, because without operations there is no business.
Security, reliability, performance efficiency, cost optimization, and sustainability are ultimately dependent upon the primary need to have business operations in place for the business to be viable.
Of what importance are these other pillars if there is no baseline operational revenue to build off of?
Now it is certainly possible and more often than not probable for a business to attain a baseline of operational sustainability without ever quite reaching that next rung of operational excellence.
So there is an “above and beyond” baked into this kind of pillar, but the category itself remains fundamental for the ongoing viability of the company in general.
For our purposes, let us draw from IBM’s definition and define operational excellence as the daily activities and processes that produce revenue-generating goods and services, paired with the pursuit of effective continuous improvement measures to optimize operational efficiency.
Now let us consider each framework in turn.
AWS - Operational Excellence
Again, AWS defines Operational Excellence as “a commitment to build software correctly, while consistently delivering a great customer experience. The operational excellence pillar contains best practices for organizing your team, designing your workload, operating it at scale, and evolving it over time.”
They break this down into four “practice areas” by which they organize their best practices:
Organization
Prepare
Operate
Evolve
But before delving into those practice areas, AWS offers general design principles that are applicable across each practice area.
8 Design Principles
Organize teams around business outcomes
This is perhaps the most critical factor in whether cloud efforts succeed or fail in any organization.
AWS leaves much unsaid in their description of this principle, something we will go into more later.
Two key sentences though:
“Leadership should be fully invested and committed to a CloudOps transformation with a suitable cloud operating model that incentivizes teams to operate in the most efficient way and meet business outcomes.”
“The organization's long-term vision is translated into goals that are communicated across the enterprise to stakeholders and consumers of your cloud services. Goals and operational KPIs are aligned at all levels.”
Implement observability for actionable insights
“Gain a comprehensive understanding of workload behavior, performance, reliability, cost, and health.”
This is the KPIs principle, the bread and butter of any operations executive.
Safely automate where possible
Apply the same engineering discipline to your operational model as you would to application code. Automate through procedures, runbooks, and infrastructure as code, but do so with guardrails.
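As a minimal sketch of what those guardrails can look like in practice, consider a hypothetical runbook step that is dry-run by default and refuses to act beyond a hard cap; the function, names, and threshold below are illustrative assumptions, not part of the AWS guidance.

```python
# Hypothetical runbook step: restart unhealthy app servers, but only
# within guardrails (dry-run by default, hard cap on blast radius).
from typing import List

MAX_RESTARTS_PER_RUN = 3  # guardrail: cap the blast radius of a single execution

def restart_unhealthy_servers(unhealthy: List[str], dry_run: bool = True) -> List[str]:
    """Return the servers that were (or would be) restarted."""
    if len(unhealthy) > MAX_RESTARTS_PER_RUN:
        # Too many failures at once usually signals a systemic issue;
        # stop and escalate to a human instead of automating blindly.
        raise RuntimeError(
            f"{len(unhealthy)} unhealthy servers exceeds guardrail of "
            f"{MAX_RESTARTS_PER_RUN}; escalating to on-call instead."
        )
    for server in unhealthy:
        if dry_run:
            print(f"[dry-run] would restart {server}")
        else:
            print(f"restarting {server}")  # a real call to your provisioning API would go here
    return unhealthy

# Safe default invocation: only reports what it would do.
restart_unhealthy_servers(["app-01", "app-07"])
```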
Make frequent, small, reversible changes
Design scalable workloads with decoupled components that have independent change cadences. Reduce the blast radius both in the size of the change (frequent, incremental development) and in the scope of what it deploys to (decoupling).
Refine operational procedures frequently
Review and optimize operational procedures. Maintain them so they do not become obsolete.
Anticipate failure
Do not merely consider failure but actively drive and test failure scenarios to understand the risk associated with certain implementation choices.
Learn from all operational events and metrics
Drive improvement through lessons learned from all operational events and failures.
Use managed services
Reduce operational overhead by using AWS managed services where possible. The unsaid side effect is vendor lock-in, but this is one principle that is far too often overlooked.
Best Practices
Organization
Any operational workload runs within an enterprise. Being a part of an enterprise means (1) there are shared goals to pursue and (2) individuals and teams have specific roles to play in pursuit of those shared goals.
Operational objectives and outcomes need to be reviewed both against external customer needs and against internal stakeholder requirements, and should be built upon a culture of collaboration and alignment with organizational governance.
Priorities, threats, and processes should be reviewed frequently to inform proper management decisions and risk acceptance.
Teams and departments should be properly structured and organized with the appropriate areas of responsibility, resources, KPIs, and objectives in order to meet these goals in alignment with the broader organization.
This includes having identifiable owners for applications, platforms, and infrastructure components so that each process can have an identifiable owner as well.
Senior leadership needs to sponsor, advocate for, and support subordinates who are driving organizational evolution and the adoption of best practices.
This in brief is what AWS notes on the subject, alongside three questions:
OPS 1: How do you determine what your priorities are?
OPS 2: How do you structure your organization to support your business outcomes?
OPS 3: How does your organizational culture support your business outcomes?
Prepare
Workloads need to be designed with an understanding of their implementation and expected behavior, as well as for the observability of said behavior.
From a technical perspective this may include metrics, logs, events, and traces. All of these should be tied to business-level KPIs to provide objective, quantitative measures of the workload’s ability to meet business goals.
This should also include proper release control and change management processes that support rapid iteration and feedback.
Make full use of AWS’s monitoring, governance, and observability tooling for its various services, in conjunction with your processes and teams, to continuously evaluate your workloads for improvement.
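To make the idea of tying telemetry to business-level KPIs a bit more concrete, here is a hedged sketch that publishes a hypothetical business metric alongside its underlying technical signal to Amazon CloudWatch using boto3; the namespace, metric names, and dimension are illustrative assumptions rather than anything the framework prescribes.

```python
# Minimal sketch: emit a business KPI as a CloudWatch custom metric so that
# dashboards and alarms can track it alongside technical telemetry.
# Assumes AWS credentials are configured; all names below are illustrative.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def record_order_placed(checkout_latency_ms: float) -> None:
    cloudwatch.put_metric_data(
        Namespace="ExampleShop/Business",        # hypothetical namespace
        MetricData=[
            {
                "MetricName": "OrdersPlaced",    # business outcome, not a server stat
                "Value": 1,
                "Unit": "Count",
                "Dimensions": [{"Name": "Channel", "Value": "web"}],
            },
            {
                "MetricName": "CheckoutLatency", # the technical signal behind the KPI
                "Value": checkout_latency_ms,
                "Unit": "Milliseconds",
            },
        ],
    )
```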
OPS 4: How do you implement observability in your workload?
OPS 5: How do you reduce defects, ease remediation, and improve flow into production?
OPS 6: How do you mitigate deployment risks?
OPS 7: How do you know that you are ready to support a workload?
Operate
“Observability allows you to focus on meaningful data and understand your workload’s interactions and output. By concentrating on essential insights and eliminating unnecessary data, you maintain a straightforward approach to understanding workload performance.”
Success is measured by the achievement of business outcomes. Metrics should be used to evaluate this as well as support optimization discovery and investigation for incident response.
Planned and unplanned operational events must be accounted for through procedural runbooks and forensic playbooks respectively. Team responsibilities around incident response need to be defined in advance, including communication and escalation flows.
Dashboards and notifications should be leveraged to cohesively monitor workload status, organized and centralized around stakeholder needs and requirements.
OPS 8: How do you utilize workload observability in your organization?
OPS 9: How do you understand the health of your operations?
OPS 10: How do you manage workload and operations events?
Evolve
Fourth is the continuous improvement suite.
Develop procedures and processes to actively discover, monitor, and document areas for operational improvement, then dedicate work cycles to implementing, reviewing, and assessing those improvements as part of a broader continuous improvement strategy.
Feedback loops, shared lessons learned, trend analysis, and regular post-mortem reviews are critical to supporting operational excellence across the enterprise.
OPS 11: How do you evolve operations?
Azure - Operational Excellence
Let us turn now to Azure which focuses upon Operational Excellence as “DevOps practices that ensure workload quality through standardized workflows and team cohesion. This pillar defines operating procedures for development practices, observability, and release management.”
5 Design Principles
The Azure framework starts with asking these questions:
Do you execute operations with discipline?
Are customers using the workload with maximum predictability?
How do you learn from experience and collected data to drive continuous improvement?
Operations can devolve into chaotic, high-effort, last-ditch routines and bad habits if there is not clear ownership, leadership, or drive for improvement.
Operational excellence is dependent upon “continuous evaluation and strategic investments”. Start with what is recommended and then customize empirically based on what works for your organization.
“The goals of the Operational Excellence pillar are to do the right thing, to do it the right way, and to solve the right problems as a team.”
Azure organizes this page around five primary principles but then goes beyond that to offer a number of approaches, questions, and details subordinate to each principle to help you think through what it means in practice. I will not include these here, but it is a nice touch.
Embrace DevOps culture
Azure roots their Operational Excellence in “DevOps” which emphasizes the shared mission of the organization and fostering a collaborative environment of shared knowledge while also applying “clear lines of ownership and accountability to each team”.
Establish development standards
Development standards make developers accountable for addressing workload issues prior to release, optimizing for rapid iteration.
Evolve operations with observability
Build a culture of continuous improvement rooted in observability and effective monitoring, especially health modeling to anticipate issues before they become incidents.
Automate for efficiency
Automate where possible to free up human resources for other tasks and to reduce human-driven error.
Adopt safe deployment practices
Develop workload supply chains through automation and modularization to streamline releases and change management and ensure quick recovery.
11 Key Design Strategies
OE:01 - DevOps Culture
OE:02 - Task execution process
OE:03 - Software development practices
OE:04 - Tools and processes
OE:05 - Infrastructure as code
OE:06 - Supply chain for workload development
OE:07 - Monitoring system
OE:08 - Incident response
OE:09 - Testing strategy
OE:10 - Automation design
OE:11 - Safe deployment practices
Design Patterns
The strategies and principles may sound right at a high level, but how one can actually implement operational excellence through proper technical design has not yet been made clear.
Azure offers an additional page, Design patterns, which provides very specific, opinionated guidance for workload implementation.
Patterns:
Anti-Corruption Layer: Shields new systems from legacy models, behaviors, and technical debt.
Choreography: Coordinates distributed services through decentralized, event-driven interactions.
Compute Resource Consolidation: Increases infrastructure density by running more workloads on shared compute.
Deployment Stamps: Deploys application and infrastructure versions as repeatable, controlled units.
Edge Workload Configuration: Manages edge configuration as versioned, auditable operational changes. (This one is not documented separately.)
External Configuration Store: Separates configuration from code for dynamic, environment-specific updates.
Gateway Aggregation: Combines multiple backend calls into a single client-facing request.
Gateway Offloading: Moves shared request-processing tasks from backend nodes to a gateway.
Gateway Routing: Routes requests to backends based on intent, logic, or availability.
Health Endpoint Monitoring: Exposes standard endpoints for checking system health and status. (See the sketch after this list.)
Messaging Bridge: Connects otherwise incompatible messaging systems through an intermediary.
Publisher/Subscriber: Decouples producers and consumers through a broker or event bus.
Quarantine: Validates external assets before allowing workload consumption.
Sidecar: Adds companion functionality beside an application without changing core application code.
Strangler Fig: Incrementally replaces legacy components while the system remains running.
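To make one of these concrete, the following is a minimal sketch of the Health Endpoint Monitoring pattern using only the Python standard library; the endpoint path and dependency checks are illustrative stand-ins, not Azure-prescribed implementation details.

```python
# Minimal Health Endpoint Monitoring sketch: expose a /health endpoint that an
# external monitor can poll. The dependency checks are illustrative stubs.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database() -> bool:
    return True  # stand-in for a real connectivity/latency check

def check_queue() -> bool:
    return True  # stand-in for a real message-broker check

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404)
            self.end_headers()
            return
        checks = {"database": check_database(), "queue": check_queue()}
        healthy = all(checks.values())
        body = json.dumps({"status": "ok" if healthy else "degraded", "checks": checks}).encode()
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```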
Google Cloud - Operational Excellence
The distinctive Google Cloud emphasis is CloudOps and SRE.
Where AWS emphasizes operating workloads at scale and Azure frames the pillar heavily through DevOps, Google Cloud frames Operational Excellence around operational readiness, incident management, resource optimization, automation, and continuous innovation.
Operational readiness is organized across four focus areas: workforce, processes, tooling, and governance.
These include roles and responsibilities, observability, managing service disruptions, cloud delivery, core operations, service levels, cloud financials, operating models, architecture review, and compliance.
Additionally, while AWS and Azure kept their design principles as a separate overlay from specific recommendations or self-assessment questions, the Google Cloud framework subordinates its recommendations specifically beneath their overarching principles.
5 Core Principles
Ensure operational readiness and performance using CloudOps
Prepare workloads for both Day 1 and Day 2 operations through SLOs, SLAs, observability, performance testing, and capacity planning. This is the most foundational “operations” section which most clearly ties cloud engineering to service commitments and measurable operational performance.
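As a small worked example of what an SLO implies operationally, the sketch below translates an availability target into an error budget; the figures are illustrative, not Google Cloud recommendations.

```python
# Worked example: translate an availability SLO into an error budget.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of downtime budget.
print(error_budget_minutes(0.999))   # -> 43.2
# Tightening to 99.99% shrinks the budget to about 4.3 minutes.
print(error_budget_minutes(0.9999))  # -> 4.32
```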
Manage incidents and problems
Reduce incident impact and prevent recurrence through monitoring, clear response procedures, root-cause analysis, retrospectives, and preventive measures. Note that, as the originator of Site Reliability Engineering (SRE), Google couches this principle very heavily in SRE concepts and language.
Manage and optimize cloud resources
Continuously review cloud resources for right-sizing, autoscaling, cost optimization, and utilization management. This overlaps significantly with cost optimization, but Google Cloud rightly treats resource management as an operational discipline because poor resource hygiene becomes both a cost and performance problem.
Automate and manage change
Use IaC, version control, CI/CD, standard procedures, structured change management, automation, and orchestration to reduce the risk and overhead of change.
Continuously improve and innovate
Build a culture of learning, experimentation, retrospectives, feedback, and technical currency. This is Google Cloud’s continuous improvement principle.
Specific Recommendations
Now we can dive in a bit more to the recommendations listed under each core principle.
Ensure operational readiness and performance using CloudOps
Operational readiness is framed around Day 1 and Day 2 operations, organized across workforce, processes, tooling, and governance.
This includes defined ownership, observability, service disruption management, delivery processes, core operations, tooling, service levels, cloud financials, operating models, architecture review, and compliance.
Quite a few things, yes, and it does overlap a fair bit with the other pillars.
Define SLOs and SLAs: Establish measurable service targets and customer-facing commitments for critical workloads.
Implement comprehensive observability: Use metrics, logs, traces, alerts, and business indicators to understand workload health.
Implement performance and load testing: Validate that applications can handle expected and peak demand.
Plan capacity: Forecast resource needs from usage trends, business demand, quotas, and failover requirements.
Continuously monitor and optimize: Review metrics, logs, traces, and performance signals to identify improvements.
Manage incidents and problems
As mentioned earlier, Google Cloud’s incident section is heavily SRE-shaped. It emphasizes monitoring, automation, data-driven insights, actionable alerts, and retrospectives.
Establish clear incident response procedures: Define roles, escalation paths, communication flows, runbooks, and playbooks.
Centralize incident management: Track incidents, ownership, communication, and response progress in a common system.
Conduct thorough post-incident reviews: Use postmortems to identify root causes, systemic contributors, and corrective actions.
Build a knowledge base: Capture incident patterns, troubleshooting guidance, and operational lessons for reuse.
Automate incident response: Use automation to reduce detection, response, remediation, and recovery time.
Manage and optimize cloud resources
This section overlaps cost optimization but treats resource hygiene as an operational discipline. Efficiency and scalability are key here.
Right-size resources: Match resource allocation to actual demand to avoid waste and performance bottlenecks. (See the sketch after this list.)
Use autoscaling: Dynamically scale compute and application capacity as demand changes.
Leverage cost optimization strategies: Use tooling and reviews to identify waste and improve cloud spend efficiency.
Establish cost allocation and budgeting: Assign cost ownership and monitor spend against budgets.
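To illustrate the right-sizing recommendation above, here is a deliberately simple sketch that flags compute instances whose average CPU utilization falls well below capacity; the metric values and threshold are made-up examples, not Google Cloud guidance.

```python
# Illustrative right-sizing sweep: flag instances whose average CPU utilization
# over the review window sits far below capacity. Data and threshold are examples.
AVG_CPU_UTILIZATION = {          # hypothetical figures pulled from your monitoring system
    "web-1": 7.5,                # percent
    "web-2": 62.0,
    "batch-1": 3.2,
}
RIGHTSIZE_THRESHOLD_PCT = 20.0   # below this, consider a smaller machine type

def rightsizing_candidates(utilization: dict[str, float]) -> list[str]:
    return [name for name, cpu in utilization.items() if cpu < RIGHTSIZE_THRESHOLD_PCT]

print(rightsizing_candidates(AVG_CPU_UTILIZATION))  # -> ['web-1', 'batch-1']
```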
Automate and manage change
This is Google Cloud’s closest equivalent to the “DevOps” recommendations. It focuses on IaC, CI/CD, GitOps, and automated testing as mechanisms for safer change.
Adopt IaC: Define infrastructure declaratively for consistency, repeatability, and easier rollback.
Implement version control: Track infrastructure and configuration changes through Git or source control systems.
Build CI/CD pipelines: Automate build, test, and deployment stages for faster, safer releases.
Use configuration management tools: Standardize and automate resource configuration across environments.
Automate testing: Validate code and infrastructure changes before deployment.
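As a hedged sketch of the test-before-deploy gate these recommendations describe, the snippet below runs a test suite and only proceeds to a stubbed deploy step on success; the pytest command and deploy function are illustrative assumptions rather than a prescribed pipeline.

```python
# Minimal CI-style gate: run the test suite, deploy only if it passes.
# The pytest command and deploy stub are illustrative, not a prescribed pipeline.
import subprocess
import sys

def run_tests() -> bool:
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    print(result.stdout)
    return result.returncode == 0

def deploy() -> None:
    print("deploying...")  # stand-in for your real deployment automation

if __name__ == "__main__":
    if run_tests():
        deploy()
    else:
        print("tests failed; aborting deployment")
        sys.exit(1)
```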
Continuously improve and innovate
Just like the other two, Google Cloud frames continuous improvement around learning, experimentation, retrospectives, and feedback.
Foster a culture of learning: Encourage experimentation, knowledge sharing, and blameless learning.
Conduct regular retrospectives: Use structured reviews to identify what to start, stop, and continue.
Stay up-to-date with cloud technologies: Maintain skills through training, certifications, conferences, etc.
Actively seek and incorporate feedback: Use stakeholder and user feedback to improve cloud solutions.
Evaluation and Comparison
There comes a point when reading these frameworks side-by-side that it begins to feel a bit redundant.
That is a good sign insofar as it indicates these frameworks advocate similar principles and do not structurally contradict each other.
All three hone in on standardization, observability, automation, enterprise alignment, and continuous improvement as critical to the ongoing discipline of operational excellence.
They also offer brief pointers to their respective services and tools that can help achieve operational excellence (although AWS offers the least guidance on this front of the three).
However, there are particular nuances and distinctions between these three frameworks that are worth teasing out a bit.
Let’s first boil down each framework to a singular point of emphasis:
AWS - Lifecycle
Azure - Process
Google Cloud - Capability
Google Cloud
Starting with Google Cloud, let us consider what makes them capability-focused.
A capability is the quality of being able to achieve certain outcomes, generally on a repeatable, sustainable basis. Thus, operational capabilities are the instruments of operational excellence.
These are tools or services that either directly or indirectly enable desirable business goals or operational outcomes.
The enterprise toolkit so to speak.
Now because this concept is so broadly generic, the idea of capability does not guide you specifically to what you may want to build to achieve these outcomes (the framework itself nearly exclusively talks about monitoring and observability capabilities for KPIs), but this is ostensibly where the executive or intrapreneur comes in to build or develop these capabilities.
But the Google Cloud framework guides you to think about operations in terms of self-standing capabilities that enable the broader organization.
One could speculate on how this particular emphasis manifests itself in Google’s internal corporate culture and its notoriety for rapidly introducing and sunsetting various services, but that is another discussion.
Furthermore, as Google is the thought leader when it comes to Site Reliability Engineering, much of the language around operational excellence is couched in SRE concepts and vocabulary.
This is something we will cover more in-depth when we reach the “Reliability” Pillar of the frameworks, but it is worth noting here.
Where the Google Cloud framework shines most is in how it solidifies operational excellence around a unified concept of “capability” embedded in modern, up-to-date SRE practices.
Azure
The Azure framework really goes above and beyond here when compared to the other two on the topic of operational excellence.
I personally feel it is the strongest of the three in providing opinionated, structured guidance with applicable takeaways, most notably the highlighted application architecture design patterns.
While AWS and Google Cloud respectively keep a safe distance from implementation details and specific questions, Azure presses in hard, enumerating a broad and in-depth series of questions to address each design principle.
This is part and parcel of its process-driven focus.
If a capability is a “what”, a process is a “how”.
The key question the Azure framework asks again and again is how things are executed. Are there defined, standardized, repeatable methods and processes?
This explains the primary emphasis the Azure framework places on DevOps practices and culture, for these are the foundations of “how” things are operationally done in an organization.
Another recurring word that runs alongside this is discipline.
Since operational excellence is a process and a practice, such practice requires discipline.
This is a particularly helpful motif, especially when the framework suggests that just as you apply a certain level of discipline (i.e., regularity, automation, etc.) to application code to achieve application outcomes, so too should you apply engineering discipline to the design of operational practices and processes.
In my personal experience and observations, the best managers operate from this outlook. Their direct reports consist of human resources with specialized talent and time that can be applied to various ends.
The key is to optimize their outputs through strategic attention to practice and discipline, not in the sense of micro-managing or productivity audits, but rather through understanding how team processes, guides, and workflows either promote or undermine the individual contributor’s ability to be effective.
And again, the technically detailed and opinionated patterns Azure provides are very useful for technical design decisions, something neither of the other two frameworks particularly goes into.
However, I am not entirely convinced that “DevOps” is the right term to anchor this entire pillar in. Partly because it is a rather dated term to apply to architectural standards, and also because it has accumulated a lot of unfortunate terminological baggage in the way that “Agile” has.
Because it can mean all things to all people, it may mislead executives and stakeholders as to what is actually meant.
Platform engineering has picked up as a newer iteration of thinking through both the integration of teams and the segregation of duties within a robust organizational structure model.
I feel it would serve better as a successor term, but Azure opts to use it as a principle subordinate to DevOps.
All that being said, of course every term will inherently become corrupted as it is popularized, so it would also be fair to ask if we truly need to invent a new word every five years just because it comes to be indiscriminately applied and misused by the broader managerial community.
A philosophical question that is not our task here.
AWS
While I did argue that Azure is the strongest overall framework, especially when it comes to applicability both for design patterns and for thinking through practices, I left the AWS framework for last here because I find its design principles carry a lot of weight and important nuance the others do not have.
The design principles are very subtle and precise in ways that are very instructive, although surprisingly the AWS documentation really fails to unpack the inner meaning of the principles beyond framework cliché.
To close, we will take each of these one at a time and consider them.
Organize teams around business outcomes
As I mentioned in our previous entry, the AWS framework does much more to encourage thinking strategically and comprehensively about the organization as a whole, including recommendations to rethink org structure. A bold ask that neither Microsoft nor Google voices.
But this is undeniably a very important question if one is to achieve operational excellence.
Does the department and team structure really align around business outcomes?
Outcomes implies present- and future-tense objectives. There is no qualifying mention of “existing or historical processes”.
This is a very challenging principle because it forces you to rethink your org structure day by day without regard to what worked yesterday or why things ended up a certain way today.
Leadership needs to support a continuous reassessment of how the org structure enables business outcomes, despite how thorny and uncomfortable such considerations would be.
It’s always Day 1 at Amazon.
More than that though is the term organize. To organize is an active, intent-laden process. Things do not organize themselves if you passively let them drift into entropy.
This requires active engagement and thought around how each of the teams and departments fit with each other in service of broader business outcomes.
Teams should not be organized around personalities, politics, siloes, or other historical baggage, but around business outcomes.
Implement observability for actionable insights.
The fundamental problem of observability in any organization is how to separate signal from noise. Alerts and dashboards are more often than not filled with noise that gets ignored.
Observability requires ongoing attention, maintenance, and refinement, but always toward a goal of actionable insights.
Because again, at the end of the day, observing something is meant to provide you with the data to make a decision. And ideally it does so with the least amount of data needed to make the most tactically effective decision.
If there is a glut of observability data that cannot economically lend itself to any decision-making or actionable insights, then it is pointless.
This applies to incident response, KPIs, FinOps, or anything involving data.
This is the underlying upshot of the scientific method. You drill down to only the most essential Yes/No question and identify the experiment you can observe to make a decision.
Operational excellence is contingent upon individuals and teams being able to repeatedly and continuously make these Yes/No decisions with as little cognitive drain as possible.
Safely automate where possible.
Most organizations I have seen follow the motto: “Automate.”
The flaws around this one-word mantra have become far more blatantly apparent with the advent of generative AI and the fallout that entails.
Which is why the two qualifiers AWS offers here are important.
To say, “where possible” indicates that there are processes or procedures that are not possible to automate and one should not try to automate those.
I think a better term here would be “feasible”, because there are also many times where automation is possible but the lemon is not worth the squeeze for the time gained.
Although some other term may still be needed, because I have seen situations in my consulting career where a services-based business was attempting to automate its own service business model in a way that would eliminate or severely reduce its service-based revenue streams. While that was certainly possible and feasible, it was not very strategic.
“Safely” is also a key term as I have lived through the battle scars of executive pressure to automate even if it was unsafe to do so. This is by no means uncommon and so it is worth inserting this word here.
It may be possible to automate, but if you cannot do so safely, where the automation would expose you to operational, security, compliance, or cost risks, then you should evaluate whether the automation route is worth pursuing at all.
Generative AI offers useful lessons to be learned in this regard, as the drive toward automation has generally bypassed many important security and compliance considerations in a way that I feel organizations will only start to wrestle with in the next few years.
Make frequent, small, reversible changes.
This one is punchy.
Frequent: Deploy changes early and often. A common refrain to avoid drift, ballooning change scope, and overall to reduce risk. The more often you make changes, the more accustomed you will be to making them.
Small: This is conventional wisdom that should not require explanation. Change management should be incremental, thus small. The lower the risk the better.
Reversible: This is not as commonly mentioned, but the best kind of change management takes into account a technical or operational need to roll back or revert changes. For particular procedures like SQL migrations this is easier said than done, but it needs to be said anyway.
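For instance, a reversible schema change can be expressed as a paired upgrade/downgrade migration, as in the Alembic-style sketch below; the table and column names are illustrative assumptions, not drawn from the framework.

```python
# Sketch of a reversible migration (Alembic-style): every forward change
# carries an explicit rollback path. Table/column names are illustrative.
from alembic import op
import sqlalchemy as sa

def upgrade() -> None:
    # Small, additive, backward-compatible change: a new nullable column.
    op.add_column("orders", sa.Column("status", sa.String(length=20), nullable=True))

def downgrade() -> None:
    # The reverse operation is defined up front, not improvised during an incident.
    op.drop_column("orders", "status")
```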
Refine operational procedures frequently.
It takes serious management discipline to actively maintain operational procedures, but it is nonetheless important that once written the SOP is not left to sit and rust indefinitely.
Defined review cadences are an important investment to ensure the operational procedures continue to be useful as time goes on.
Anticipate failure.
Anticipate is a well-chosen word here. If you were simply accounting for failure, that would entail simply chalking up a certain “error budget” where you leave some wiggle room for things to go wrong. This in itself is important and underrated, but to anticipate something is to go one step further and think ahead.
Anticipation means you are not only predictively looking ahead and visualizing what failure will entail in concrete terms, but also acting accordingly.
Simulation and live testing of failure scenarios does require more effort but it has far larger payoffs than mere armchair speculation.
This allows you to build and design resiliency around system failure, human input error, or whatever operational lapse may occur, and this lends itself to the “Reliability” pillar.
Learn from all operational events and metrics.
To learn from all events and metrics may seem like a tall order.
But there is a bit of a tautology embedded here in a hidden implication.
Taken literally, an event is by its nature something unique or important; we often throw the term around loosely where it need not apply.
So the idea is that your SIEM or whatever observability system you have in place is fine-tuned to really mitigate and minimize false positives as much as possible.
The fact that you have a false positive is in itself something to learn from. It is an indication that you may need to filter differently.
That way you can pare down operational events and metrics as much as feasibly possible to true positives, and those can drive broader lessons or learnings.
Use managed services
This is not the most nuanced design principle, and it is certainly the unspoken motive of much cloud provider guidance to lock you in more with opinionated systems that are difficult to disentangle from.
But to this day I do see this principle frequently overlooked, especially the longer one has worked in technology.
The premise behind a technology company is that they have solved a particular technological need internally and are offering that solution, often at scale, to external customers.
Hence many generic technologies such as Kubernetes, GraphQL, and React, as well as the original cloud services such as Amazon S3, were born out of specific internal needs at the technology giants.
The vast majority of companies are technology consumers that leverage the intellectual property of the tech producers to solve their own internal problems without needing to reinvent the wheel.
The beauty of the cloud service model is that these externally exposed services are tried and tested across thousands of companies, if not more, in business-critical, high-stakes scenarios, and that experience is used to continually refine and expand those cloud service capabilities.
The consumer only pays whatever licensing or usage model the cloud provider charges without having to undergo the engineering lifecycle of trying to design and build technological solutions from scratch.
And at this point in time the pricing is generally in my view remarkably fair and competitive.
So for individual practitioners or business needs, building a bespoke solution for each individual task, project, or initiative should be a non-starter until they have first seen what managed solution, service, framework, or application might meet the enterprise’s table of requirements.
That is the premise of the shared responsibility model after all.
Focus on your unique core business value proposition and outsource and streamline the rest, as fits your strategy.

