I’ve been known to berate others for abusing environments – despite my personal habits – but I think its time for me to curtail my anger and reconsider exactly what distinguishes one environment from another and why.
We’re used to managing a plethora of environments – production, standby, prod-support, pre-production, performance test, UAT, system-test, development-integration, dev etc. – each of which has its own unique characteristics and purpose and each with a not insignificant cost.
With all those environments we can very easily have 5 or 6 times the infrastructure required to run production sitting mostly idle – and yet still needing to be maintained and patched and consuming kilowatts of power. All this for what can seem like no good reason bar to satisfy some decades old procedural dictate handed down by those upon high.
Unsurprisingly many organisations try to combine responsibilities into a smaller set of environments to save on $’s at the cost of increased risk. And recent trends in dev-ops, cloud and automation are helping to reduce the day-to-day need for all these environments even further. After all, if we can spin up a new server, install the codebase and introduce it into service in a matter of minutes then why not kill it just as quickly? If we can use cheaper t2.micro instances in dev and use m4.large only in prod then why shouldn’t we do so?
So we can shrink the number and size of environments so now we only have 2 or 3 times production and with auto-scaling this baseline capacity can actually be pretty low.
If we can get there…
… and the problem today is that whilst the technology exists, the legacy architectures, standards, procedures and practices adopted over many years by organisations simply don’t allow these tools and techniques to be adopted at anywhere near the pace at which they are developing in the wild. That application written 10 years ago just doesn’t fit with the new cloud strategy the company is trying to develop. In short, revolution is fast (and bloody) and evolution is slow.
Our procedures and standards need to evolve at the same rate as technology and this just isn’t happening.
So I’ve been considering what all these environments are for and why they exist and think it comes down to three concerns; security, impact and truth.
– Security – What’s the security level of the data held? More often than not the production environment is the only one authorised to contain production data. That means it contains sensitive data or PII, has lots of access-control and auditing, firewalls everywhere, tripwires etc. There’s no way every developer is going to get access to this environment. Access is on a needs-to-know basis only… and we don’t need (and shouldn’t want) to know.
– Impact – Whats the impact to the business if the environment dies or runs slow? If dev goes down, no-one cares. Hell, if pre-prod goes down no-one bar prod-support really care.
– Truth – How true to version X does the environment have to be? Production clearly needs to be the correct release of the codebase across the board (MVT aside). If we have the wrong code with the wrong database then it matters. In the development environment?.. if a script fails then frankly it’s not the end of the world, and besides dev is usually going to be version X+n, unstable and flaky in any case.
So in terms of governance it’s those things that keep management awake at night. They want to know who’s got access to what, what they can do, on what boxes, with what assets and what the risk is to data exposure. When we want to push out the next release they want to know the impact if it screws up, that we’ve got a back-out plan for when it does and that we’ve tested it – the release, the install plan and the back-out. In short, they’re going to be a complete pain in the backside. For good reason.
But can we rethink our environments around these concerns and does this help? If we can demonstrate to management that we’ve met these needs then why shouldn’t they let us reduce, remove and recycle environments at will?
Production and stand-by will have to be secure and the truth. But the impact if stand-by goes down isn’t the same. There’s a risk if prod falls over but that’s not the same thing. So allowing data-analysts access to stand-by to run all sorts of wild and crazy queries may not be an issue unless prod falls flat on its face – a risk some will be willing to take to make more use of the tin and avoid environment spread. Better still, if the data in question isn’t sensitive or is just internal-use-only then why not mirror a copy into dev environments to provide a more realistic test data-set for developers?
And if the data is sensitive? Anonymise it and use that; or a decent sample of it, in dev and test environments. Doing so will improve the quality of code by increasing the likelihood developers will detect patterns and edge-cases sooner in the development cycle.
In terms of impact, If the impact to the business of an application outage is low then why insist on the full range of environments when frankly one or two will do? Many internal applications are only used 9 to 5 and have an RTO and RPO of in excess of 24 hrs. The business need to clearly understand what they’re agreeing to but ultimately it’s their $’s we’re spending and once they realise the cost they may be all too willing to take the risk. Having five different environments for every application for the sake of consistency alone isn’t justifiable.
And not all truths are equal. Some components don’t need the same rigour as others and may have lower impact to the business if they’re degraded to some degree. Allowing some components; especially expensive ones, to have fewer environments may complicate topologies and reduce the general comprehensiveness of the system but if we can justify it then so be it. We do though need to make sure this is very clearly understood by all involved else chaos can ensue – especially if some instances span environments (here be dragons).
Finally, if engineering teams paid more attention during development to performance and operability and could demonstrate this then the need for dedicated performance/pre-prod environments may also be reduced. We don’t need an environment matching production to understand the performance profile of the application under load. We just need to consider the systems characteristics and test cases with a willingness (i.e. an acceptance of risk) to extrapolate. A truthful representation of production is usually not necessary.
Risk is everything here and if we think about how the applications concerns stack up against the security risk, the impact risk to the business and risk of things not being the truth, the whole truth and nothing but… then perhaps we can be smarter about how we structure our environments to help reduce the costs involved irrespective of adopting revolutionary technology.