The History of Infra-As-Code
Let's take a look at the history behind some of these frameworks to guess at what the future will look like.
Once upon a time everything was configured manually. *nix servers required physical media for setup; packages were installed and configuration was done at a physical console. Network routing and security was a quagmire of scripts and utilities.
We’ve come a long way since then. With a one-liner like this:
aws ec2 run-instances --image-id ami-0abcdef12 --instance-type t2.micro --key-name MyKeyPair
We have some magic at our fingertips. The Cloud launches virtual hardware, attaches volumes containing a preconfigured operating system, and configures network and security boundaries. Within a few seconds we have a usable system available on the internet. And not only that, this system can easily be scaled and resized with minimal fuss!
This capability makes the uptime of any single system matter far less, and refocuses our priorities on aspects like ease of configuration, repeatability, extensibility and modularization.
These principles might sound a lot more like software engineering than sysadmin work. When it comes to cloud configuration and maintenance it’s not one or the other, but a unique blend that gives rise to ideas like DevOps, Platform Engineering and other fresh disciplines.
The Rise Of Infra-As-Code
This brings us to the topic of IaC. I aim to shed light on where the industry came from, and how its values have changed and adapted over time in step with platform capabilities, and with the shortcomings of each previous infra-as-code tool.
Puppet & Chef
The first of the infra-management tools aimed to solve the question faced by many orgs:
How do we quickly, easily, safely and consistently manage our distributed server environments?
Picture massive scale: thousands of Linux and Windows systems of different classes needing to be updated, upgraded, modified, reported on and generally kept in good working order. It’s a mammoth task requiring many teams, and even then it results in information silos and potentially inconsistent system-wide changes.
Puppet (2005) and later Chef (2009) came along to solve these challenges. They utilized an agent-based architecture: every system you want to manage needs to run a special piece of software, an agent, which receives requests and performs work.
The challenges of the day shaped the architecture and design principles that solved them. Many distributed systems were configured to “check in” with a central server, which would feed them instructions on what changes to make. The agent is smart enough to track system state, so changes already made are not repeated: efficiency through idempotency.
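To make the idea concrete, here’s a minimal shell sketch of idempotency (the package name is purely illustrative): only act when the system isn’t already in the desired state, so running it a second time changes nothing. An agent does something conceptually similar, driven by a central catalog of desired state.

# ensure nginx is installed; do nothing if it already is
if ! dpkg -s nginx >/dev/null 2>&1; then
  apt-get install -y nginx
fi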
Great stuff! Way better than logging into systems or writing custom shell wrappers. This improved teams’ efficiency and outcomes greatly, but it also concentrated more power centrally and demanded a higher degree of discipline and an understanding of the automation in addition to the systems it was serving!
This works great for servers that are already deployed and configured. But the question:
“how do I get my physical systems to a state where they can run an automation agent?”
was about to be answered by the monumental proliferation of the cloud.
Ansible
Enter Ansible in 2012, one of my all-time favourites. Ansible sought to solve the rigidity of previous tools by offering a completely agent-less design and a composable domain-specific language based on YAML.
This revolutionary thinking by Ansible’s designers gave us the ability to write modular, flexible code in one place, while maintaining a great deal of customization and control over how it gets deployed. This was the dawn of the “Playbook”.
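As a rough sketch (the file names and the “webservers” group are hypothetical), a minimal playbook and its agent-less run over SSH might look like this:

cat > webserver.yml <<'EOF'
- hosts: webservers
  become: true
  tasks:
    - name: Ensure nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present
EOF
ansible-playbook -i inventory.ini webserver.yml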
Ansible is an execution framework on top of the SSH protocol for Linux, pushing shell-wrapped, Python-executed commands to targets (WinRM and PowerShell for Windows systems). This allows us to write code which runs locally AND remotely. The state of each action is managed at the module layer, which executes on the target system, and reporting is handed back to the controller.
Modules capture the STDOUT and STDERR, along with the return codes, of commands run on destination machines in a JSON object, which is used for validation and reporting on the controller.
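You can see this reporting loop with a simple ad-hoc command; the output shown in comments below is abridged and representative, and the exact fields vary by module and version:

ansible localhost -m ansible.builtin.command -a "uptime"
# localhost | CHANGED | rc=0 >>
# 10:15:32 up 3 days,  2:04,  1 user,  load average: 0.11, 0.08, 0.05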
Ansible offers tight, feature-rich integrations with cloud providers. This means we can use Ansible to launch our systems, configure our cloud, AND log into those systems to configure and maintain them for the rest of their lifecycle.
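A hedged sketch of that cloud-side story, mirroring the AWS CLI one-liner from earlier (all values are illustrative, and it assumes the amazon.aws collection and its boto3 dependency are installed):

cat > launch.yml <<'EOF'
- hosts: localhost
  connection: local
  tasks:
    - name: Launch an EC2 instance
      amazon.aws.ec2_instance:
        name: demo-server
        image_id: ami-0abcdef12
        instance_type: t2.micro
        key_name: MyKeyPair
        region: us-east-1
EOF
ansible-playbook launch.yml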
The only downfall of Ansible, IMO, is its total flexibility, which can lead to a plethora of custom designs that are difficult to grok if you didn’t write them yourself!
Terraform
In 2014 HashiCorp came along and introduced their own language, HCL (which looks like a superset of JSON), and a cool new infra-management tool named Terraform, which was set to take the world by storm.
By 2014 cloud services were gaining monumental popularity. Developers and engineers everywhere were battling with some major challenges:
How can we maintain portability between cloud providers?
How can we standardize our infra code design and keep it testable?
How can we do all this with one language which is independent of cloud providers?
It’s important to note that AWS CloudFormation was also hugely popular around this time. A pure AWS tool, somewhat esoteric, and a recipe for total vendor lock-in: these drawbacks made some engineering leaders skeptical of over-investment on this front. Mitchell Hashimoto wrote about the need for a fully open-source way to do effective cloud management.
The stage was set for Terraform to solve these problems.
Terraform/HCL enforce a strict modular code style for each atomic service in a given provider. HashiCorp and the community produced a wide array of providers and modules for public/private cloud platforms and other services. TF requires state to be persisted in a file for each stack (separate from the cloud’s own state).
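A minimal sketch of that declarative style, again mirroring the CLI one-liner from earlier (values illustrative):

cat > main.tf <<'EOF'
provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "demo" {
  ami           = "ami-0abcdef12"
  instance_type = "t2.micro"
}
EOF
terraform init   # downloads the AWS provider and initializes local state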
Terraform’s top-level control is focused around three primary state transitions for any given stack declaration (a CLI sketch follows the list):
apply — bring resources from non-existence into existence (and make any necessary changes to components of those resources)
plan — display the resultant set of an apply without changing any resources
destroy — perform a delete action for every declared resource (dangerous if misused)
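In CLI terms, the day-to-day workflow looks roughly like this:

terraform plan      # preview the resultant set of an apply; nothing is changed
terraform apply     # converge real resources toward the declaration
terraform destroy   # delete every declared resource (use with care)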
It’s a little more complex and nuanced than that in the real world, once we start to consider issues like the following (one common mitigation is sketched after the list):
state file locking/corruption
configuration drift (cloud state changes via external channels)
failed API calls (due to security, network or provider downtime)
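For the locking/corruption class of problems, the usual mitigation is a remote state backend with locking. A sketch (the bucket and table names are hypothetical, and both must be created ahead of time):

cat > backend.tf <<'EOF'
terraform {
  backend "s3" {
    bucket         = "my-team-tf-state"
    key            = "prod/network.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tf-state-lock"   # enables state locking
  }
}
EOF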
The biggest drawback I noticed when Terraform was first released was the inability to (elegantly) access systems via the shell. Though I must point out that Mr. Hashimoto had already thought of an architectural pattern, embodied in a tool named Packer, designed to build, distribute and maintain “machine images” (AMIs, for example). In combination with Terraform and other modern practices, we technically shouldn’t need to “log into” stuff anymore. (I’m not sure that’s always true, but that’s another story.)
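A rough sketch of that Packer pattern (an HCL2 template with illustrative values; plugin setup and variables omitted): bake the configuration into a golden image once, then let Terraform launch instances from it.

cat > golden.pkr.hcl <<'EOF'
source "amazon-ebs" "base" {
  region        = "us-east-1"
  source_ami    = "ami-0abcdef12"
  instance_type = "t2.micro"
  ssh_username  = "ubuntu"
  ami_name      = "golden-demo"
}

build {
  sources = ["source.amazon-ebs.base"]
  provisioner "shell" {
    inline = ["sudo apt-get update", "sudo apt-get install -y nginx"]
  }
}
EOF
packer build golden.pkr.hcl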
By now Terraform has been pushed to its limits; people are doing some really complex configuration which stretches the capabilities of HCL and its language constructs.
Pulumi
Pulumi builds upon Terraform (literally) with the goal of offering the user the choice of ANY LANGUAGE to write their code. It’s not dissimilar in design to the AWS CDK, which boils down to CloudFormation; Pulumi boils down to Terraform.
The current breed of infra-code projects is getting so complex that engineers need fully featured language support to write sustainable infra code.
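For example (a Python sketch; the project scaffolding such as Pulumi.yaml and stack config is omitted, and the bucket names are hypothetical), a plain loop replaces the count/for_each gymnastics of a DSL:

cat > __main__.py <<'EOF'
import pulumi_aws as aws

# a real language construct: an ordinary Python loop
for i in range(3):
    aws.s3.Bucket(f"logs-{i}")
EOF
pulumi preview   # requires an initialized Pulumi project and stack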
Because Pulumi really is Terraform providers under the hood (until AWS Native support becomes broad and stable), the overall workflows are very similar to TF: up, preview, destroy, refresh. The deployment engine references cloud and stored state to determine what must change.
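Side by side, the mapping to Terraform’s verbs is almost one-to-one:

pulumi preview   # ~ terraform plan
pulumi up        # ~ terraform apply
pulumi refresh   # reconcile stored state with actual cloud state
pulumi destroy   # ~ terraform destroy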
The Future
It’s been a wild ride watching the incremental improvements and approaches to these problems over the years.
What does the future hold for cloud engineering? Does the current generation of tools need to be pushed to the limits of its capabilities before we can even conceive what the strengths of the next generation should be?
👌