Security, Impact, Truth and Environments

I’ve been known to berate others for abusing environments – despite my personal habits – but I think it’s time for me to curtail my anger and reconsider exactly what distinguishes one environment from another and why.

We’re used to managing a plethora of environments – production, standby, prod-support, pre-production, performance test, UAT, system-test, development-integration, dev etc. – each of which has its own unique characteristics and purpose and each with a not insignificant cost.

With all those environments we can very easily have 5 or 6 times the infrastructure required to run production sitting mostly idle – and yet still needing to be maintained and patched, and consuming kilowatts of power. All this for what can seem like no good reason bar satisfying some decades-old procedural dictate handed down from on high.

Unsurprisingly, many organisations try to combine responsibilities into a smaller set of environments to save $’s at the cost of increased risk. And recent trends in DevOps, cloud and automation are helping to reduce the day-to-day need for all these environments even further. After all, if we can spin up a new server, install the codebase and introduce it into service in a matter of minutes, then why not kill it just as quickly? If we can use cheaper t2.micro instances in dev and m4.large only in prod then why shouldn’t we do so?
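
To make that concrete, here’s a minimal sketch of environment-aware provisioning, assuming AWS via boto3; the AMI id and environment names are hypothetical placeholders, not a prescription.

```python
# A minimal sketch of per-environment sizing, assuming AWS and boto3.
# The AMI id and environment names are hypothetical placeholders.
import boto3

INSTANCE_TYPE_BY_ENV = {
    "dev": "t2.micro",    # cheap, disposable tin
    "test": "t2.micro",
    "prod": "m4.large",   # full-size only where it matters
}

def launch(env, ami="ami-12345678"):
    """Spin up a server sized for its environment."""
    ec2 = boto3.client("ec2")
    return ec2.run_instances(
        ImageId=ami,
        InstanceType=INSTANCE_TYPE_BY_ENV[env],
        MinCount=1,
        MaxCount=1,
    )

def kill(instance_id):
    """...and kill it just as quickly when it's no longer needed."""
    ec2 = boto3.client("ec2")
    return ec2.terminate_instances(InstanceIds=[instance_id])
```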

So we can shrink the number and size of environments until we’re carrying only 2 or 3 times production, and with auto-scaling that baseline capacity can actually be pretty low.

If we can get there…

… and the problem today is that whilst the technology exists, the legacy architectures, standards, procedures and practices adopted over many years by organisations simply don’t allow these tools and techniques to be adopted at anywhere near the pace at which they are developing in the wild. That application written 10 years ago just doesn’t fit with the new cloud strategy the company is trying to develop. In short, revolution is fast (and bloody) and evolution is slow.

Our procedures and standards need to evolve at the same rate as technology and this just isn’t happening.

So I’ve been considering what all these environments are for and why they exist, and I think it comes down to three concerns: security, impact and truth.

– Security – What’s the security level of the data held? More often than not the production environment is the only one authorised to contain production data. That means it contains sensitive data or PII, has lots of access-control and auditing, firewalls everywhere, tripwires etc. There’s no way every developer is going to get access to this environment. Access is on a needs-to-know basis only… and we don’t need (and shouldn’t want) to know.
– Impact – What’s the impact to the business if the environment dies or runs slow? If dev goes down, no-one cares. Hell, if pre-prod goes down no-one bar prod-support really cares.
– Truth – How true to version X does the environment have to be? Production clearly needs to be the correct release of the codebase across the board (MVT aside). If we have the wrong code with the wrong database then it matters. In the development environment? If a script fails then frankly it’s not the end of the world; besides, dev is usually going to be at version X+n, unstable and flaky in any case.

So in terms of governance it’s those things that keep management awake at night. They want to know who’s got access to what, what they can do, on what boxes, with what assets and what the risk is to data exposure. When we want to push out the next release they want to know the impact if it screws up, that we’ve got a back-out plan for when it does and that we’ve tested it – the release, the install plan and the back-out. In short, they’re going to be a complete pain in the backside. For good reason.

But can we rethink our environments around these concerns and does this help? If we can demonstrate to management that we’ve met these needs then why shouldn’t they let us reduce, remove and recycle environments at will?
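
As an illustrative sketch – the environment names and levels below are my own invention, not a standard – we could make the three concerns explicit for every environment and let them, rather than tradition, decide what we keep:

```python
# An illustrative model of environments scored against the three concerns.
# Names and levels are hypothetical; the point is making trade-offs explicit.
from dataclasses import dataclass

@dataclass(frozen=True)
class Environment:
    name: str
    security: str  # sensitivity of data held: "production", "anonymised", "synthetic"
    impact: str    # business impact of an outage: "high", "medium", "low"
    truth: str     # fidelity to the released version: "exact", "close", "loose"

ENVIRONMENTS = [
    Environment("production", "production", "high", "exact"),
    Environment("stand-by",   "production", "medium", "exact"),
    Environment("perf-test",  "anonymised", "low", "close"),
    Environment("dev",        "synthetic",  "low", "loose"),
]

def merge_candidates(envs):
    """Environments demanding the same levels on all three axes are
    candidates for merging or recycling rather than idle duplication."""
    groups = {}
    for e in envs:
        groups.setdefault((e.security, e.impact, e.truth), []).append(e.name)
    return [names for names in groups.values() if len(names) > 1]
```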

Production and stand-by will have to be secure and the truth, but the impact if stand-by goes down isn’t the same as for production. There’s an increased risk should prod then fall over, but a risk is not the same thing as an impact. So allowing data-analysts access to stand-by to run all sorts of wild and crazy queries may not be an issue unless prod falls flat on its face – a risk some will be willing to take to make more use of the tin and avoid environment spread. Better still, if the data in question isn’t sensitive, or is internal-use-only, then why not mirror a copy into dev environments to provide a more realistic test data-set for developers?

And if the data is sensitive? Anonymise it – or a decent sample of it – and use that in dev and test environments. Doing so will improve the quality of code by increasing the likelihood that developers detect patterns and edge-cases sooner in the development cycle.
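
A hedged sketch of what that could look like – the field names and salt are hypothetical, and any real scheme needs reviewing against your data-protection obligations:

```python
# A sketch of anonymise-and-sample, assuming records arrive as dicts.
# Field names and the salt are hypothetical placeholders.
import hashlib
import random

SENSITIVE_FIELDS = {"name", "email", "account_number"}
SALT = "rotate-me-per-extract"  # never reuse a salt across extracts

def anonymise(record):
    out = dict(record)
    for field in SENSITIVE_FIELDS & out.keys():
        digest = hashlib.sha256((SALT + str(out[field])).encode()).hexdigest()
        out[field] = digest[:12]  # stable pseudonym: joins still work, identity doesn't
    return out

def sample_for_dev(records, fraction=0.05, seed=42):
    """Take a reproducible sample and anonymise it for dev/test use."""
    rng = random.Random(seed)
    return [anonymise(r) for r in records if rng.random() < fraction]
```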

In terms of impact, if the impact to the business of an application outage is low then why insist on the full range of environments when frankly one or two will do? Many internal applications are only used 9 to 5 and have an RTO and RPO in excess of 24 hours. The business needs to clearly understand what they’re agreeing to, but ultimately it’s their $’s we’re spending and once they realise the cost they may be all too willing to take the risk. Having five different environments for every application for the sake of consistency alone isn’t justifiable.
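
The arithmetic is worth making explicit. A toy example – every figure below is invented – of the trade the business is actually being asked to make:

```python
# A toy risk-vs-cost comparison; all figures are invented for illustration
# and would need real numbers from your own estate.
extra_env_cost_per_year = 40_000.0   # tin, licences, patching, power
outage_probability_per_year = 0.1    # chance the cut environment would have caught it
cost_per_outage = 150_000.0          # lost time for a 9-to-5 internal app

expected_outage_cost = outage_probability_per_year * cost_per_outage
saving = extra_env_cost_per_year - expected_outage_cost

print(f"Expected outage cost: ${expected_outage_cost:,.0f}/yr")   # $15,000/yr
print(f"Net saving from dropping the environment: ${saving:,.0f}/yr")  # $25,000/yr
```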

And not all truths are equal. Some components don’t need the same rigour as others and may have lower impact to the business if they’re degraded to some degree. Allowing some components – especially expensive ones – to have fewer environments may complicate topologies and reduce the general comprehensibility of the system, but if we can justify it then so be it. We do though need to make sure this is very clearly understood by all involved or else chaos can ensue – especially if some instances span environments (here be dragons).

Finally, if engineering teams paid more attention during development to performance and operability, and could demonstrate this, then the need for dedicated performance/pre-prod environments may also be reduced. We don’t need an environment matching production to understand the performance profile of the application under load. We just need to consider the system’s characteristics and test cases with a willingness (i.e. an acceptance of risk) to extrapolate. A truthful representation of production is usually not necessary.
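
For example, under the (big, stated) assumption that throughput scales roughly linearly with compute until something else saturates, a scaled-down test can be extrapolated rather than replayed on production-sized tin. The numbers below are purely illustrative:

```python
# A sketch of extrapolating load-test results from a small environment,
# assuming roughly linear scaling with vCPUs - a big assumption to be
# accepted explicitly as risk, not treated as a guarantee.
def extrapolate_throughput(measured_rps, test_vcpus, prod_vcpus, efficiency=0.8):
    """Estimate production throughput from a scaled-down test.

    efficiency < 1.0 hedges against contention, I/O limits and other
    effects that stop throughput scaling linearly with compute.
    """
    return measured_rps * (prod_vcpus / test_vcpus) * efficiency

# e.g. 120 req/s measured on 1 vCPU suggests ~384 req/s on 4 vCPUs,
# accepted as a risk-based estimate rather than a guarantee.
estimate = extrapolate_throughput(120.0, test_vcpus=1, prod_vcpus=4)
```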

Risk is everything here, and if we think about how an application’s concerns stack up against the security risk, the impact risk to the business and the risk of things not being the truth, the whole truth and nothing but… then perhaps we can be smarter about how we structure our environments to help reduce the costs involved, irrespective of adopting revolutionary technology.

Traceability

We can have a small server…

…a big server (aka vertical scaling)…

… a cluster of servers (aka horizontal scaling)…

… or even a compute grid (horizontal scaling on steroids).

For resiliency we can have active-passive…

… or active-active…

… or replication in a cluster or grid…

…each with their own connectivity, load-balancing and routing concerns.

From a logical perspective we could have a simple client-server setup…

…a two tier architecture…

…an n-tier architecture…

…a service oriented (micro- or ESB) architecture…

…and so on.

And in each environment we can have different physical topologies depending on the environmental needs, with logical nodes mapped to each environment’s servers…
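
For instance – a purely hypothetical mapping – the same logical nodes might collapse onto far fewer physical servers in the smaller environments:

```python
# A hypothetical logical-to-physical mapping per environment: production
# spreads nodes across dedicated tin, while dev co-locates everything.
TOPOLOGY = {
    "prod": {
        "web": ["prd-web01", "prd-web02"],
        "app": ["prd-app01", "prd-app02"],
        "db":  ["prd-db01"],
    },
    "uat": {
        "web": ["uat-web01"],
        "app": ["uat-app01"],
        "db":  ["uat-db01"],
    },
    "dev": {
        "web": ["dev-box01"],  # everything co-located on one cheap box
        "app": ["dev-box01"],
        "db":  ["dev-box01"],
    },
}
```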

With our functional components deployed on our logical infrastructure using a myriad of other deployment topologies… and on and on and on…

And this functional perspective can be implemented using dozens of design patterns and a plethora of integration patterns.

With each component implemented using whichever products and packages we choose, each responsible for supporting one or more requirements and capabilities…

So the infrastructure we rely on, the products we select, the components we build or buy, the patterns we adopt and use… all exist for nothing but the underlying requirements.

We should therefore be able to trace from requirement through the design all the way to the tin on the floor.

And if we can do that we can answer lots of interesting questions such as “what happens if I turn this box off?”, “what’s impacted if I change this requirement?” or even “which requirements are driving costs?” – which in turn can help improve supportability, maintainability and availability, and reduce costs. You may even find your product sponsor questioning whether they really need this or that feature…
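
As a sketch of the idea – all the names here are hypothetical – even a trivially simple model of those links is enough to start answering such questions:

```python
# An illustrative traceability model: requirements map to components,
# components map to the tin they run on, and a simple query answers
# "what happens if I turn this box off?". All names are hypothetical.
REQUIREMENT_TO_COMPONENTS = {
    "REQ-001 take payments":   ["payment-service", "payment-db"],
    "REQ-002 monthly reports": ["reporting-batch", "reporting-db"],
}

COMPONENT_TO_SERVERS = {
    "payment-service": ["web01", "web02"],
    "payment-db":      ["db01"],
    "reporting-batch": ["batch01"],
    "reporting-db":    ["db01"],  # shared tin: turning db01 off hits both
}

def requirements_on(server):
    """Which requirements are impacted if this box is turned off?"""
    hit = {c for c, servers in COMPONENT_TO_SERVERS.items() if server in servers}
    return [req for req, comps in REQUIREMENT_TO_COMPONENTS.items()
            if hit & set(comps)]

print(requirements_on("db01"))  # both requirements depend on db01
```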

Microsoft Predicts 2016

Microsoft’s predictions for 2016. Worth a read. Lots about machine learning, big-data and encryption. All very optimistic, including one guy (Lucas Joppa) who expects the human race to wake up to technology being the saviour from our impending doom. Unfortunately I fear the human race is all too desperate for a saviour – real or imagined – and that desperation can lead to a belief in false gods little better than the devil we know today. But still, there’s a huge amount that can be done to improve the efficiency and effectiveness of technology in general, and so vast room for improvement. Let’s hope Mr Joppa is right…

… oh, and yes, environmental impact is a non-functional characteristic.