2014/05/31

Mad Memoization (or how to make computers make mistakes)

Memoization is a technique used to cache the results of computationally expensive functions in order to improve performance and throughput on subsequent executions. It can be implemented in a variety of languages but is perhaps best suited to functional programming languages, where the response to a function should be consistent for a given set of input values. It's a nice idea and has some uses, but perhaps isn't all that common since we tend to design programs so that we only call such functions once, when needed, in any case.

I have a twist on this. Rather than remembering the response to a function with a particular set of values, remember the responses to a function and just make a guess at the response next time.

A guess could be made based on the entropy of the input and/or output values. For example, where the response is a boolean value (true or false) and you find that 99% of the time the response is "true" but it takes 5 seconds to work this out, then... to hell with it, just return "true" and don't bother with the computation. Lazy I know.
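For a boolean function, that idea might look something like this (a minimal sketch only - the class name, the 99% threshold and the 1000-call minimum sample are all made up for the example):

import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;

// A toy "guessing memoizer" for an expensive boolean function. Once one answer
// overwhelmingly dominates, it just guesses that answer and skips the work.
public class GuessingMemoizer {
    private final AtomicLong trues = new AtomicLong();
    private final AtomicLong falses = new AtomicLong();

    public boolean call(Callable<Boolean> expensive) throws Exception {
        long t = trues.get();
        long f = falses.get();
        long total = t + f;
        if (total >= 1000 && t >= 0.99 * total) {
            return true;   // lazy: don't bother computing
        }
        if (total >= 1000 && f >= 0.99 * total) {
            return false;
        }
        boolean result = expensive.call();            // do the expensive computation
        (result ? trues : falses).incrementAndGet();  // remember what it answered
        return result;
    }
}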

Of course some of the time the response would be wrong but that's the price you pay for improving performance and throughput.

There would be some (possibly significant) cost to determining the entropy of inputs/outputs, and any function which modifies the internal state of the system (non-idempotent) should be excluded from such treatment for obvious reasons. You'd also only really want to rely on such behaviour when the system is busy and nearly overloaded already and you need a way to quickly get through the backlog - think of it like the exit gates of a rock concert when a fire breaks out: you quickly want to ditch the "check-every-ticket" protocol in favour of a "let-everyone-out-asap" solution.
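The "only when the system is nearly overloaded" part could be as crude as a check on the load average before any guessing is allowed - again just a throwaway sketch, with a made-up threshold:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Gate the guessing behaviour on system load. The 0.9 load-per-core threshold is
// arbitrary; a negative load average (platforms that don't report it) never trips it.
public class LoadGate {
    private static final OperatingSystemMXBean OS = ManagementFactory.getOperatingSystemMXBean();

    public static boolean nearlyOverloaded() {
        double loadPerCore = OS.getSystemLoadAverage() / OS.getAvailableProcessors();
        return loadPerCore > 0.9;
    }
}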

You could even complicate the process a little further and employ a decision tree (based on information gain, for example) when trying to determine the response to a particular set of inputs.

So, you need to identify expensive idempotent functions, calculate the entropy of inputs and outputs, build associated decision trees, get some feedback on the performance and load on the system, and work out at which point to abandon reason and open the floodgates - all dynamically! Piece of piss... (hmm, maybe not).

Anyway, your program would make mistakes when under load but should improve performance and throughput overall. Wtf! Like when would this ever be useful?

  • DoS attacks? Requests could be turned away at the front door to protect services deeper in the system?

  • The Slashdot effect? You may not give the users what they want but you'll at least not collapse under the load.

  • Resiliency? If you're dependent on some downstream component which is not responding (you could be getting timeouts after way too many seconds) then these requests will look expensive and you could fall back to some default response (which may or may not be correct!?).


Ok, perhaps not my best idea to date but I like the idea of computers making mistakes by design rather than through incompetence of the developer (sorry, harsh I know, bugs happen, competent or otherwise).

Right, off to take the dog for a walk, or just step outside then come back in again if she's feeling tired...

 

Go-Daddy: Low TTL DNS Resolution Failures

Some of you may have noticed recently that www.nonfunctionalarchitect.com was not resolving correctly much of the time. At first I thought this was down to DNS replication taking a while, though that shouldn't really explain inconsistent results from the same DNS servers (once a record is picked up it should stick, assuming I don't change the target, which I hadn't).

So eventually I called Go-Daddy support who weren't much help and kept stating that "it works for us" suggesting it was my problem. This despite confirmation from friends and colleagues that they see the same issue from a number of different ISPs. They also didn't want to take the logs I'd captured demonstrating the problem or give me a reference number - a far cry from the recorded message in the queue promising to "exceed my expectations"! But hey, they're cheap...

Anyway... I'd set the TTL (Time To Live) on my DNS records to 600 seconds. This is something I've done since working on migration projects, where you want the DNS TTL to be short to minimise the time clients point at the old server (note: you need to make the change at least 1x the legacy TTL value before you start the migration... and not every DNS server obeys your TTL... but it's still worth doing). This isn't an insane value normally but really depends on whether your nameservers can handle the increased load. I asked the support guy if this was ok and he stated that it was and that all looked fine with my DNS records... Cool, except that my problem still existed!

I had to try something, so I set up a simple shell script on a couple of servers to perform a lookup (nslookup) on Google.com, www.nonfunctionalarchitect.com and pop.nonfunctionalarchitect.com, and set the TTL to 1 day on www and 600 seconds on pop. This should hopefully prove that (a) DNS resolution is working (Google.com resolves), (b) I am suffering a problem on www and/or pop and, with a bit of luck, (c) demonstrate whether increasing the TTL makes any difference.
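For what it's worth, much the same check could be knocked up in plain Java rather than a shell script around nslookup (a rough sketch, not the actual script I used - note the JVM caches successful lookups, so you may want to set the networkaddress.cache.ttl security property to 0 for a fairer test):

import java.net.InetAddress;
import java.net.UnknownHostException;

// Resolve a few hostnames once a minute and count the failures.
// The attempt count and interval are arbitrary choices for the example.
public class DnsCheck {
    public static void main(String[] args) throws InterruptedException {
        String[] hosts = { "google.com",
                           "www.nonfunctionalarchitect.com",
                           "pop.nonfunctionalarchitect.com" };
        int[] failures = new int[hosts.length];
        int attempts = 100;
        for (int attempt = 0; attempt < attempts; attempt++) {
            for (int i = 0; i < hosts.length; i++) {
                try {
                    InetAddress.getByName(hosts[i]);
                } catch (UnknownHostException e) {
                    failures[i]++;
                }
            }
            Thread.sleep(60 * 1000);
        }
        for (int i = 0; i < hosts.length; i++) {
            System.out.println(hosts[i] + ": " + failures[i] + " failures out of " + attempts);
        }
    }
}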

The results show no DNS resolution failures for either Google.com or www.nonfunctionalarchitect.com. On the other hand, pop fails around 10% of the time (12 failures from 129 requests). Here are a few of the results:
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
** server can't find pop.nonfunctionalarchitect.com: NXDOMAIN
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
** server can't find pop.nonfunctionalarchitect.com: NXDOMAIN
** server can't find pop.nonfunctionalarchitect.com: NXDOMAIN
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.

I can think of a number of reasons why this may be happening, including load on the Go-Daddy nameservers, overly aggressive DoS counter-measures, or some misalignment between the nameserver and CNAME configuration. The configuration seems ok but I do wonder about the nameservers' 1hr TTL versus the CNAMEs' 600s TTL. For now it seems more stable at least, and I'll do some experimentation with TTL values later to see if I can pin this down.

In the meantime, if you're getting DNS resolution failures with Go-Daddy and have relatively low TTL values set (<1hr) then consider increasing these to see if that helps.

2014/05/25

Sainsburys Glitch

A computer glitch at Sainsburys prevents delivery of some home orders. Caused by a... "computer fault". I doubt very much it was the computer's fault though! It's highly unlikely it just forgot; rather more likely the poor thing broke a leg (disk), was knocked out (power outage) or was simply told to do something stupid by a piece of wet-ware (either as an erroneous instruction or by design).

Whatever... Once you've fixed the immediate issue then:

  1. Root-cause analysis.

  2. Determine the cost of the incident.

  3. Estimate probability of it occurring again.

  4. Identify options to avoid it in the future.

  5. Cost these options.

  6. Weigh the cost v benefit to see if anything should be done about it.


And do it efficiently! Start with ballpark estimates and rules-of-thumb to see if the arguments have any merit before getting bogged down in the detail (but make these assumptions clear when you explain it to the boss!).

2014/05/22

UK's security branch says Ubuntu most secure end-user OS (maybe)

Kind of late I know, but I've recently completed a desktop rollout project to Windows 7 for a UK gov department and found it interesting that CESG supposedly (see below) think that Ubuntu 12.04 is the most secure end-user OS. There was much discussion on this project around security features and CESG compliance, so I find this topic quite interesting.

They didn't look at a wide range of client devices, so other Linux distributions may prove just as secure, as could OS X, which seems a notable omission to me considering they included Chromebooks in the list. It was also pointed out that the disk encryption and VPN solutions haven't been independently verified and they're certainly not CAPS approved; but then again, neither is Microsoft's BitLocker solution.

The original page under gov.uk seems to have disappeared (likely as a result of all the recent change going on there) but there's a lot on that site which covers end-user device security, including articles on Ubuntu 12.04 and Windows 7.

However, reading these two articles you don't get the view that Ubuntu is more secure than Windows - in fact, quite the opposite. There's a raft of significant risks associated with Ubuntu (well, seven) whilst only one significant risk is associated with Windows (VPN). Some of the Ubuntu issues look a little odd to me; for example, "users can ignore cert warnings" is more a browser issue than an OS one, unless I've misunderstood, as the context isn't very clear. The basic features are there, just not certified to any significant degree. This is an easy argument for the proprietary solution providers to make and a deal clincher for anyone in government not looking to take risks (most of them). I doubt open-source solutions are really any less secure than these but they do need to get things verified if they're to stand up to these challenges. Governments around the world can have a huge impact on the market and the use of open standards and solutions, so helping them make the right decisions seems a no-brainer to me. JFDI guys...

Otherwise, the article does have a good list of the sort of requirements to look out for in end-user devices with respect to security which I reproduce here for my own future use:

  • Virtual Private Network (VPN)

  • Disk Encryption

  • Authentication

  • Secure Boot

  • Platform Integrity and Application Sandboxing

  • Application Whitelisting

  • Malicious Code Detection and Prevention

  • Security Policy Enforcement

  • External Interface Protection

  • Device Update Policy

  • Event Collection for Enterprise Analysis

  • Incident Response

2014/05/19

IE AppContainers and LocalStorage

IE's EPM (Enhanced Protected Mode) provides separate containers for web storage between desktop and Metro mode when using the Internet Zone. There's a page which discusses the detail but never really states why it behaves like this. It seems to me that this is unnecessarily complex and will lead to user confusion and angst - "why does switching to desktop mode lose my session/cookies/storage?" or, more simply, "why do I have to login again?". It's also arguably a security risk, since users will have multiple sessions/cookies active and so could inadvertently leave themselves logged in, or it could lead to duplicate transactions because items may be placed in the basket in separate containers etc. It would be less of a concern if users couldn't easily switch, but of course they can because MS has kindly put a menu item on the Metro page to "View in the Desktop"!? It all seems to be related to providing enterprise users with the ability to maintain and configure a setup which provides greater access/functionality to intranet sites than you would want for untrusted Internet sites (enabling various plugins and the like).

To a degree, fair enough, but it's mostly a result of intranet sites adopting features that weren't standardised or hardened sufficiently in the first place (ActiveX, Java etc.). These need to be got rid of, though this will cost companies dearly - replacing existing functionality with something else, with no significant added value to the business bar adherence to standards and security compliance, is a hard sell.

So MS is, from one viewpoint, forced into this approach. The problem is that it just adds more weight to my view that MS is so dependent on the enterprise customer, and on supporting the legacy of cruft they (MS & corporate intranets) have spawned over so many years, that MS is no longer able to provide a clean, consistent and usable system (some would say they never were...).

Violation of rule #1 - Keep it Simple!

 

2014/05/17

Is lying the solution to a lack of privacy online?

I do wish social networking sites like G+ and FB would stop advertising people's birthdays. Your birth date is one of those "known facts" used by many organisations (banks, government departments etc.) to verify your identity. Providing this data to social networking sites can result in information leakage and contribute to identity theft and security incidents. Combine this with all the other bits of information they capture and it would be quite easy for someone to bypass those security questions every call centre asks as a facade to security - they only need to glean a little info from many sources.

This morning G+ asked me if I wanted to say happy birthday to Peter. I know Peter slightly, but not well enough to be privy to such information, and I have no idea whether it really is his (or your) birthday today. If it is... Happy Birthday! If it's not, then congratulations on lying to Google and Facebook - it's good practice (so long as you can remember the lies you tell).

In a world where privacy is becoming impossible, lying may be our saviour. What a topsy-turvy world we're living in...


Windows 7 Incident

Having recently been responsible for an estate-wide software upgrade programme to Windows 7 for many thousands of devices, I sympathise but have to find this amusing. However, it is an interesting approach to achieving a refresh in particularly short order... Make the best of it guys, treat it as an opportunity to audit your estate... I do hope your backup procedures are working though... ;)


2014/05/15

Pre-emptive Single Task Operating System++

A while ago I wrote a blog entry about a pre-emptive single task operating system that I think the world needs. It seems I'm not the only one: George RR Martin (Game of Thrones) also thinks there's a need for this. His reasoning seems to stem from a security as well as a productivity perspective, but I think I grok what he means. The feature bloat in products such as MS Office these days detracts from their usability. They may be able to boil the ocean but it's not really necessary and just gets in the way of the creative process. However, DOS surely has a limited life and it must be hard to find the h/w components to run it on now. I may fire up a VM with DOS sometime to remind myself of the good old days...

2014/05/13

LocalServe

One of the things I have found irritating in the past is the need to install and configure a web-server each time the urge takes me to try something out. I don't run a local web-server permanently and, being of a JFDI disposition, the hurdle needed to get a server running is usually enough to stall what motivation I've managed to muster. Then I discovered that from Java 7 onwards it's fairly simple to implement your own web-server in plain Java - no need for an application server.
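To give a flavour of how little is needed (a stripped-down sketch using the JDK's built-in com.sun.net.httpserver classes, not LocalServe's actual code):

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// A minimal plain-Java web server - no application server required.
public class TinyServer {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8765), 0);
        server.createContext("/", new HttpHandler() {
            @Override
            public void handle(HttpExchange exchange) throws IOException {
                byte[] body = "Hello from plain Java".getBytes("UTF-8");
                exchange.sendResponseHeaders(200, body.length);
                OutputStream os = exchange.getResponseBody();
                os.write(body);
                os.close();
            }
        });
        server.start(); // serves until the process is killed
    }
}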

LocalServe implements two types of handlers:

1. File-handler - This serves any content (and sub-directories) in the directory in which localserve is run. Any file which is not found returns a 404 and any request for the root of a folder (path ending in "/") attempts to return the index.html file in the directory. Note that localserve does not provide any listing of directories.

If all you want to do is serve static content then the above is sufficient and LocalServe can be run using the command below in the directory you want to serve content from. This will run a webserver on port 8765 by default:

java -jar localserve.jar

The port number can also be changed by adding this to the end - e.g.:

java -jar localserve.jar 5678

2. SQL-handler - Often static content isn't enough and you need to use a database. This handler provides access to a database that can be called from JavaScript (typically via an AJAX request). A configuration file can be specified on the command line when running localserve. This configuration file provides details of a SQLite database and the SQL statements that are supported. Each SQL statement has a logical name, a SQL statement (including bindings), the supported methods (POST or GET) and optionally a redirect (where to send the user on success). Calls to paths starting "/sql/" are sent to the SQL handler and the path element after this is used to match against a logical name in the configuration file. If found, the SQL statement is executed with any HTTP parameters matching the bind names being bound accordingly. Two special names, "PAGE" and "PAGE_SIZE", are defined so that queries which may return many rows can be restricted to returning only certain pages of a certain size. Results from SQL commands are returned in JSON format.
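As an illustration of the bind-variable idea (a hypothetical sketch of how {NAME}-style markers might be turned into a JDBC PreparedStatement - this is not LocalServe's actual implementation, and the class and method names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rewrite {ID}-style markers as '?' placeholders and bind the matching HTTP parameters in order.
public class Binder {
    private static final Pattern BIND = Pattern.compile("\\{([A-Z_]+)\\}");

    public static PreparedStatement bind(Connection conn, String template,
                                         Map<String, String> httpParams) throws SQLException {
        List<String> names = new ArrayList<String>();
        StringBuffer sql = new StringBuffer();
        Matcher m = BIND.matcher(template);
        while (m.find()) {
            names.add(m.group(1));          // remember the bind name, e.g. ID
            m.appendReplacement(sql, "?");  // swap the marker for a JDBC placeholder
        }
        m.appendTail(sql);
        PreparedStatement ps = conn.prepareStatement(sql.toString());
        for (int i = 0; i < names.size(); i++) {
            ps.setString(i + 1, httpParams.get(names.get(i)));
        }
        return ps;
    }
}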

The configuration file can be specified on the command line when running localserve as below:

java -jar localserve.jar config.json
or to run on a port other than 8765:
java -jar localserve.jar config.json 5678

(note that the position of these two parameters doesn't matter).

An example configuration file is shown below:
{ "wsm": {
"connString": "jdbc:sqlite:/path/to/database/file.db",
"driver": "org.sqlite.JDBC",
"operations": [
{ "path": "listbob",
"statement": "select name, id from bob",
"methods": "GET,POST" },
{ "path": "querybob",
"statement": "select id, name from bob where id={ID} order by name desc",
"methods": "GET,POST" },
{ "path": "insertbob",
"statement": "insert into bob (id, name) values ({ID}, {N})",
"redirect": "/sql/querybob",
"methods": "GET,POST" }
]
}
}

The database here contains one very simple table (BOB) as:

CREATE TABLE BOB (ID VARCHAR(20), NAME VARCHAR(100));


The only database included out of the box is SQLite - the main JAR file contains all the libraries needed for it to work. I have tried other databases (notably IBM DB2), which worked fine so long as their JAR libraries can be found.

An example response to something like http://localhost:8765/sql/listbob looks like:
{ "dataset": 
{ "page": 0,
"pageSize": 20,
"record": [
{ "name": "James Brown", "id": "1"} ,
{ "name": "Simple Simon", "id": "2"} ,
{ "name": "Ducky Duncan", "id": "3"}
]
}
}

The attribute names are derived from the column names and are usually lower case. HOWEVER, you may find that if you explicitly state a column alias the attribute may come back in uppercase (e.g. "select id||' - '||name LONG_NAME from bob" will result in an attribute named "LONG_NAME").

Once you have the database setup and working then it's a relatively simple task to use JQuery to submit AJAX requests to the SQL handler to create/read/update/delete/list the database. A hastily knocked up example is below:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<meta charset="utf-8"/>

<script src="jquery-2.0.3.min.js"></script>
<script src="purl.js"></script>
</head>
<body>
...
<script type="text/javascript">

function query(searchId) {
  $.ajax({
    url: "/sql/querybob",
    data: { "PAGE": 0, "PAGE_SIZE": 32000, "ID": searchId },
    success: function(data) {
      var dataset = (JSON.parse(data)).dataset;
      // loop over the returned records and show each one
      for (var i = 0; i < dataset.record.length; i++) {
        alert(dataset.record[i].id + " " + dataset.record[i].name);
      }
    }
  });
}
query(1);
</script>
</body>
</html>

Anyway, it's been useful when I just want to prototype something and it's definitely not intended for any production use. It's just a very simple webserver that lets me get on with prototyping quickly.

If you're interested then the source can be found on GitHub where the LocalServe JAR file can be downloaded. The code is what I call "prototype quality" which means it's been made to work by beating it into some shape with a hammer - it is uncommented and not of production quality.

Java 7 and this JAR are all you should need to run LocalServe. As ever, no assurances, warranties, guarantees etc. are provided and, whether you lose a little data or the world goes into meltdown (and everything in-between), I'll accept no responsibility for any damages caused...

2014/05/05

Entropy - Part 2

A week or so ago I wrote a piece on entropy and how IT systems have a tendency for disorder to increase in a similar manner to the second law of thermodynamics. This article aims to identify what we can do about it...

It would be nice if there were some silver bullet, but the fact of the matter is that, like the second law, the only real way to minimise disorder is to put some work in.

1. Housekeeping

As the debris of life slowly turns your pristine home into something more akin to the local dump, so the daily churn of changes gradually slows and destabilises your previously spotless new IT system. The solution in both cases is to crack on with the weekly chore of housekeeping (or possibly daily if you've kids, cats, dogs etc.). It's often overlooked and forgotten, but a lack of housekeeping is frequently the cause of unnecessary outages.

Keeping logs clean and cycled on a regular basis (e.g. hoovering), monitoring disk usage (e.g. checking you've enough milk), cleaning up temporary files (e.g. discarding those out-of-date tins of sardines), refactoring code (e.g. a spring clean) etc. is not difficult and there's little excuse for not doing it. Reviewing the content of logs and gathering metrics on usage and performance can also help you anticipate how frequently housekeeping is required to ensure smooth running of the system (e.g. you could measure the amount of fluff hoovered up each week and use this as the basis to decide which days and how frequently the hoovering needs doing - good luck with that one!). This can also lead to additional work to introduce archiving capabilities (e.g. self storage) or purging of redundant data (e.g. taking the rubbish down the dump). But like your home, a little housekeeping done frequently is less effort (cost) than waiting till you can't get into the house because the door's jammed and the men in white suits and masks are threatening to come in and burn everything.
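As a flavour of how little effort some of this takes, here's a throwaway sketch that clears down old temporary files (the path and the seven-day threshold are made up for the example):

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.TimeUnit;

// Delete anything under /tmp/myapp that hasn't been modified for a week.
public class TempCleaner {
    public static void main(String[] args) throws IOException {
        final long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
        Files.walkFileTree(Paths.get("/tmp/myapp"), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                if (attrs.lastModifiedTime().toMillis() < cutoff) {
                    Files.delete(file); // old enough - clear it out
                }
                return FileVisitResult.CONTINUE;
            }
        });
    }
}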

2. Standards Compliance

By following industry standards you stand a significantly better chance of being able to patch/upgrade/enhance without pain in the future than if you decide to do your own thing.

That should be enough said on the matter, but the number of times I see teams misusing APIs or writing their own solutions to what are common problems is frankly staggering. We (and me especially) all like to build our own palaces. Unfortunately we lack sufficient exposure to the problem space to produce designs which combine elegance with the flexibility to address the full range of use cases, or the authority and foresight to predict the future and influence it in a meaningful way. In short, standards are generally thought out by better people than you or me.

Once a standard is established then any future work will usually try to build on this or provide a roadmap of how to move from the old standard to the new.

3. Automation

The ability to repeatedly and reliably build the system decreases effort (cost) and improves quality and reliability. Any manual step in the build process will eventually lead to some degree of variance with potentially unquantifiable consequences. There are numerous tools available to help with this (e.g. Jenkins) though unfortunately usage of such tools is not as widespread as you would hope.

But perhaps the real killer feature is test automation which enables you to continuously execute tests against the system at comparatively negligible cost (when compared to maintaining a 24x7 human test team). With this in place (and getting the right test coverage is always an issue) you can exercise the system in any number of hypothetical scenarios to identify issues; both functional and non-functional, in a test environment before the production environment becomes compromised.

Computers are very good at doing repetitive tasks consistently. Humans are very good at coming up with new and creative test cases. Use each appropriately.

Much like housekeeping, frequent testing yields benefits at lower cost than simply waiting till the next major release when all sorts of issues will be uncovered and need to be addressed - many of which may have been around a while though no-one noticed... because no-one tested. Regular penetration testing and review of security procedures will help to proactively avoid vulnerabilities as they are uncovered in the wild, and regular testing of new browsers will help identify compatibility issues before your end-users do. There are some tools to help automate in this space (e.g. Security AppScan and WebDriver) though clearly it does cost to run and maintain such a continuous integration and testing regime. However, so long as the focus is correct and pragmatic then the cost benefits should be realised.

4. Design Patterns

Much like standards compliance, use of design patterns and good practices such as abstraction, isolation and dependency injection can help to ensure changes in the future can be accommodated at minimal effort. I mention this separately though since the two should not be confused. Standards may (or may not) adopt good design patterns and equally non-standard solutions may (or may not) adopt good design patterns - there are no guarantees either way.

Using design patterns also increases the likelihood that the next developer to come along will be able to pick up the code with greater ease than if it's some weird, hare-brained spaghetti bowl of nonsense made up after a rather excessive liquid lunch. Dealing with the daily churn of changes becomes easier, maintenance costs come down and incidents are reduced.

So in summary, entropy should be considered a BAU (Business as Usual) issue and practices should be put in place to deal with it. Housekeeping, standards-compliance, automation through continuous integration and use of design patterns all help to keep the impact of change minimised and keep the level of disorder down.

Next time, some thoughts on how to measure entropy in the enterprise...

2014/05/03

Feedback - Logging and Monitoring

It seems to me that we are seeing an increasing number of issues such as this reported by the Guardian. A lost transaction results in a credit default against an individual with the result that they cannot obtain a mortgage to buy a house. Small error for the company, huge impact for the individual.

The company admitted that despite the request being submitted on their website they did not receive the request!? So either the user pressed submit then walked away without noting the response was something other than "all ok!" or the response was "all ok!" and the company failed to process the request correctly.

If the former then, well, user error for being a muppet... As end users we all need to accept some responsibility and check that we get the feedback we expect.

For the latter, there are several reasons why subsequent processing could have failed: poor transaction management, so the request never gets committed; poor process management, so the request drops into some dead queue never to be dealt with (either through incompetence or through malicious intent); or a system failure and a need to roll back, with resulting data loss.

With the growth in IT over the past couple of decades there are bound to be some quality issues, the result of ever shorter and more demanding deadlines and ever more stringent budgets. Time and effort needs to be spent exploring the hypothetical space of what could go wrong so that at least some conscious awareness and acceptance of the risks is achieved. This being the case, I'm usually quite happy to be overruled by the customer on the basis of cost and time pressures.

However, it's often not expensive to put in place some logging and monitoring - and in this case there must have been something for the company to admit the request had been submitted. Web logs, application logs, database logs etc. are all valuable sources of information when PD'ing (Problem Determination). You do though need to spend at least some time reviewing and auditing these so you can identify issues and deal with them accordingly.

I remember one case where a code change was rushed out which never actually committed any transaction. Fortunately we had the safety net of some judicious logging which allowed us to recover and replay the transactions. WARNING: It worked here but this isn't always a good idea!
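The sort of thing I mean is cheap to do - a minimal sketch (the class and method names are invented for the example), logging the full request up front so it can be identified and, if need be, replayed when downstream processing lets you down:

import java.util.logging.Logger;

// Log the request before and after processing so failed transactions can be spotted
// and, where appropriate, recovered from the logs.
public class OrderService {
    private static final Logger LOG = Logger.getLogger(OrderService.class.getName());

    public void submit(String orderId, String payload) {
        LOG.info("order-received id=" + orderId + " payload=" + payload);
        try {
            process(orderId, payload);                 // the real work goes here
            LOG.info("order-committed id=" + orderId);
        } catch (RuntimeException e) {
            LOG.severe("order-failed id=" + orderId + " reason=" + e);
            throw e;
        }
    }

    private void process(String orderId, String payload) {
        // placeholder for the actual transaction processing
    }
}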

In general though, logging and monitoring are a very good idea. In some cases system defects will be identified; in others, transient issues will be found which may need workarounds to deal with them temporarily. Whatever the underlying issue, it's important to incorporate feedback and quality controls into the design of systems to identify problems before they become disasters. At a rudimentary level logging can help with this, but you need to close the feedback loop through active monitoring, with processes in place to deal with incidents when they arise. It really shouldn't just be something we do only when the customer complains.

I don't know the detail of what happened in this case. It could have been user-error, or we could applaud the company for having logging in place, or they could just have got lucky. In any case, we need to get feedback on how systems and processes are performing and operating in order to deal with issues when they arise, improve quality and indeed the business value of the system itself through continuous improvement.
