Support-Friendly Tooling

Chris Oldwood from The OldWood Thing

One of the techniques I briefly mentioned in my last post “Treat All Test Environments Like Production” was how constraining the test environments by adhering to the Principle of Least Privilege drove us to add diagnostic-specific features to our services and tools.

In some cases that might be as simple as exposing some existing functionality through an extra command line verb or service endpoint. For example, a common technique these days is to add a “version” verb or “--version” switch to allow you to check which build of a particular tool or service you have deployed [1].
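As a rough illustration, here is how both forms might look in a Python console tool built on the standard `argparse` module. The tool name `mytool` and the version string are made up; in practice the build pipeline would usually stamp the version in.

```python
import argparse

# Hypothetical build identifier; a real tool would typically have this
# stamped in by the build pipeline.
__version__ = "1.4.2+build.317"


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mytool")
    # The "--version" switch: argparse prints the string and exits.
    parser.add_argument("--version", action="version",
                        version=f"%(prog)s {__version__}")
    # The "version" verb: the same information via a subcommand.
    verbs = parser.add_subparsers(dest="verb")
    verbs.add_parser("version", help="print the deployed build")
    return parser


def main(argv=None) -> str:
    args = make_parser().parse_args(argv)
    if args.verb == "version":
        message = f"mytool {__version__}"
    else:
        message = "no verb given"
    print(message)
    return message
```

Running `mytool --version` or `mytool version` then answers the “which build is deployed?” question without touching the GUI shell.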

As Bertrand Meyer suggests in his notion of Command/Query Separation (CQS), any behaviour which is a query in nature should have no side-effects and therefore could also be made freely available for diagnostic purposes – security and performance issues notwithstanding. Naturally these queries would be over-and-above any queries you might run directly against your various data stores, i.e. databases, file-system, etc. using the vendors’ own lower-level tools.
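A minimal sketch of the idea, assuming a hypothetical `OrderBook` component: the commands mutate state, the queries only read it, and a diagnostic entry point (the hypothetical `run_diagnostic`) is restricted to the methods known to be queries.

```python
from dataclasses import dataclass, field


@dataclass
class OrderBook:
    """Hypothetical component written with Command/Query Separation in mind."""
    orders: dict = field(default_factory=dict)

    # Command: mutates state and returns nothing.
    def place(self, order_id: str, qty: int) -> None:
        self.orders[order_id] = qty

    # Queries: return data and never mutate state.
    def quantity(self, order_id: str) -> int:
        return self.orders.get(order_id, 0)

    def summary(self) -> dict:
        return {"count": len(self.orders), "total": sum(self.orders.values())}


# Only methods known to be queries are reachable from the diagnostic verb.
DIAGNOSTIC_QUERIES = {"quantity", "summary"}


def run_diagnostic(book: OrderBook, verb: str, *args):
    if verb not in DIAGNOSTIC_QUERIES:
        raise PermissionError(f"{verb!r} is not a side-effect-free query")
    return getattr(book, verb)(*args)
```

Keeping an explicit allow-list, rather than exposing every method, means a later change that sneaks a side-effect into a “query” has to be a deliberate decision.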

Where it gets a little more tricky is on the “command” side as we might need to investigate the operation but without disturbing the current state of the system. In an ideal world it should be possible to execute them against a part of the system reserved for such eventualities, e.g. a special customer or schema that looks and acts like a real one but is owned by the team and therefore its side-effects are invisible to any real users. (This is one of the techniques that falls under the in-vogue term of “testing in production”.)

If the issue can be isolated to a particular component then it’s probably more effective to focus on that part of the system by replaying the operation whilst simultaneously redirecting the side-effects somewhere else (or avoiding them altogether) so that the investigation can be safely repeated. One technique here is to host the component in another type of process, such as a GUI or command line tool, and provide a variety of widgets or switches to control the input and output locations. Alternatively you could use the Null Object pattern to send the side-effects into oblivion.
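To make the Null Object option concrete, here is a small Python sketch. `ResultWriter`, `DatabaseWriter`, `NullWriter` and `recalculate` are all hypothetical names. The component depends only on the writer interface, so the host process can hand it a writer that silently discards everything.

```python
from typing import Protocol


class ResultWriter(Protocol):
    def write(self, key: str, value: float) -> None: ...


class DatabaseWriter:
    """Stand-in for the real, side-effecting writer (hypothetical)."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, float]] = []

    def write(self, key: str, value: float) -> None:
        self.rows.append((key, value))


class NullWriter:
    """Null Object: honours the same interface but discards every write."""

    def write(self, key: str, value: float) -> None:
        pass


def recalculate(prices: dict[str, float], writer: ResultWriter) -> dict[str, float]:
    # The component under investigation only depends on the interface,
    # so the caller decides where (or whether) its side-effects land.
    results = {key: value * 2 for key, value in prices.items()}
    for key, value in results.items():
        writer.write(key, value)
    return results
```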

In its simplest form it might be a case of adding a “--ReadOnly” switch that disables all attempts to write to back-end stores (but leaves logging intact if that won’t interfere). This would give you the chance to safely debug the process locally using production inputs. As an aside, this idea has been formalised in the PowerShell world via the “-WhatIf” switch, which allows you to run a script whilst disabling (where supported) the write actions of any cmdlets.
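Here is one possible shape for such a switch in a Python tool. The `Store` class and its seed data are hypothetical stand-ins for a real back-end. The write path checks the flag but still logs, so you can see what *would* have happened.

```python
import argparse


class Store:
    """Hypothetical back-end store whose writes can be switched off."""

    def __init__(self, read_only: bool) -> None:
        self.read_only = read_only
        self.data = {"balance": 100}
        self.log: list[str] = []

    def read(self, key: str) -> int:
        return self.data[key]

    def write(self, key: str, value: int) -> None:
        if self.read_only:
            # Logging stays on, so the would-be change is still visible.
            self.log.append(f"READ-ONLY: skipped write {key}={value}")
            return
        self.data[key] = value
        self.log.append(f"wrote {key}={value}")


def main(argv=None) -> Store:
    parser = argparse.ArgumentParser()
    parser.add_argument("--ReadOnly", action="store_true",
                        help="disable all writes to back-end stores")
    args = parser.parse_args(argv)
    store = Store(read_only=args.ReadOnly)
    # The operation being debugged: reads real inputs, attempts a write.
    store.write("balance", store.read("balance") + 50)
    return store
```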

If the operation requires some form of bulk processing where there is likely to be far too much output for stdout or because you need a little more structure to the data then you can add multiple switches instead, such as the folder to write to and perhaps even a different format to use which is easier to analyse with the usual UNIX command line tools. If implementing a whole different persistence mechanism for support is considered excessive [2] you could just allow, say, an alternative database connection string to be provided for the writing side and point to a local instance.
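A sketch of what those extra switches might look like. The tool name `bulk-recalc`, the switch names and the default connection string are all hypothetical. Results can be formatted as JSON Lines or TSV for the usual command line tools, or the writing side can be pointed at an alternative database.

```python
import argparse
import csv
import io
import json
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="bulk-recalc")
    parser.add_argument("--output-dir", type=Path,
                        help="write results to files here instead of the database")
    parser.add_argument("--format", choices=["jsonl", "tsv"], default="jsonl",
                        help="file format that suits jq, cut, sort, etc.")
    parser.add_argument("--write-connection",
                        help="alternative connection string for the writing side")
    return parser


def format_rows(rows: list[dict], fmt: str) -> str:
    if fmt == "jsonl":
        # One JSON object per line: easy to grep, or to feed to jq.
        return "".join(json.dumps(row) + "\n" for row in rows)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def write_target(args: argparse.Namespace) -> str:
    # Hypothetical default; a real tool would read this from configuration.
    default_connection = "Server=prod-db;Database=Results"
    if args.output_dir is not None:
        return f"files in {args.output_dir} ({args.format})"
    return f"database {args.write_connection or default_connection}"
```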

Earlier I mentioned that the Principle of Least Privilege helped drive out the need for these customisations and that’s because restricting your access affects you in two ways. The first is that by not allowing you to make unintentional changes you cannot make the situation worse simply through your analysis. For example, if you were mistaken in believing that a particular operation had no side-effects (or it has since acquired some), then those side-effects would be blocked as a matter of security and an error reported. If done in the comfort of a test environment you now know what else you need to “mock out” to be able to execute the operation safely in future. And if the mocking feature somehow gets broken, your lack of privilege has always got your back. This is essentially just the principle of Defence in Depth applied for slightly different reasons.

The second benefit you get is a variation of yet another principle – Design for Testability. To support such features we need to be able to substitute alternative implementations for the real ones, which effectively means we need to “program to an interface, not an implementation”. Of course this will likely already be a by-product of any unit tests we write, but it’s good to know that it serves another purpose outside that use case.

What I’ve described might seem like a lot of work, but you don’t have to go the whole hog and provide a separate host for the components and a variety of command-line switches to enable these behaviours; you could probably get away with just tweaking various configuration settings, which is the approach that initially drove my 2011 post “Testing Drives the Need for Flexible Configuration”. What has usually caused me to go the extra step, though, is the need to use these features more than just once in a blue moon, often automating them for longer-term use. This is something I covered in much more detail very recently in “Libraries, Console Apps & GUIs”.

 

[1] Version information has been embedded in Windows binaries since the 3.x days back in the ’90s, but accessing it easily usually involved using the GUI shell (i.e. Explorer), which is not an option when the machine is remote and has limited access, e.g. in the cloud. Luckily PowerShell provides an alternative route here and I’m sure there are plenty of third party command line tools as well.

[2] Do not underestimate how easy it is these days to serialise whole object graphs into JSON files and then process them with tools like jq.
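To show how little code footnote [2] implies, here is a sketch using only Python’s standard library. The `Portfolio` and `Trade` types are hypothetical. `dataclasses.asdict` recurses through nested dataclasses and lists, so the whole graph becomes plain JSON in one call.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Trade:
    id: str
    notional: float


@dataclass
class Portfolio:
    name: str
    trades: list = field(default_factory=list)


portfolio = Portfolio("demo", [Trade("t1", 1e6), Trade("t2", 2.5e5)])
# asdict() walks the nested dataclasses and lists for us.
snapshot = json.dumps(asdict(portfolio), indent=2)
```

Once written to disk, a filter such as `jq '.trades[] | select(.notional > 500000) | .id' snapshot.json` should pick out the larger trade without any bespoke tooling.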

Instructions that cpus don’t need to support

Derek Jones from The Shape of Code

What instructions can computers do without (an earlier post covered instructions they should support)?

The R in RISC was supposed to stand for Reduced, but in practice almost all the instructions you would expect were supported. What was missing were the really complicated instructions that machines of the time (late 1980s), like the VAX, supported (analysis of instruction set usage showed that these complicated instructions were rarely used; from the compiler perspective, the combinations of operations supported by these instructions rarely occurred in code).

One instruction that was often missing from the early RISC processors was integer multiply. Compilers were expected to generate a series of instructions that had the same effect. Some of the omitted ‘basic’ instructions got added to later versions of the processors that survived commercially (e.g., SPARC).
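A sketch of the classic shift-and-add technique, written in Python with explicit masking to mimic a 32-bit register. This is one common way to build multiplication from simpler operations; it is not a claim about exactly what any particular compiler or runtime routine emitted.

```python
def mul_shift_add(a: int, b: int, bits: int = 32) -> int:
    """Multiply two unsigned `bits`-wide integers using only shifts, adds,
    tests and masks, wrapping on overflow like a hardware register."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    result = 0
    while b:
        if b & 1:                    # low bit of b set: add the shifted multiplicand
            result = (result + a) & mask
        a = (a << 1) & mask          # next power-of-two multiple of a
        b >>= 1
    return result
```

When one operand is a constant the loop unrolls into a fixed sequence, e.g. `x * 10` is `(x << 3) + (x << 1)`, which is the kind of sequence a compiler can emit directly.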

The status register is still a common omission from RISC designs (at least for the integer operations). Where is the data showing that in the grand scheme of things (i.e., processor performance running real programs), status registers slow things down? I know that hardware designers don’t like them because they introduce bottlenecks. I don’t recall ever having seen an analysis of instruction set usage targeted at the impact of status registers on generated code. Pointers welcome.
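For readers wondering what code looks like without a status register, here is a Python sketch of the usual workaround. The carry out of a 32-bit addition is recovered with an unsigned comparison (the sum wrapped if it is smaller than an operand), which is roughly what MIPS or RISC-V code does with `sltu`. The function names are illustrative only.

```python
MASK32 = 0xFFFF_FFFF


def add_with_carry(a: int, b: int, carry_in: int = 0) -> tuple[int, int]:
    """32-bit add returning (sum, carry_out) with no flags register:
    each possible wrap-around is detected by an unsigned comparison."""
    s = (a + b) & MASK32
    c1 = 1 if s < a else 0           # wrapped while adding b
    s2 = (s + carry_in) & MASK32
    c2 = 1 if s2 < s else 0          # wrapped while adding carry_in
    return s2, c1 | c2               # at most one of the two can be set


def add64(x: tuple[int, int], y: tuple[int, int]) -> tuple[int, int]:
    """Add two 64-bit values held as (high, low) 32-bit halves."""
    lo, carry = add_with_carry(x[1], y[1])
    hi, _ = add_with_carry(x[0], y[0], carry)
    return hi, lo
```

The cost is an extra compare (and sometimes an extra add) per word, which is exactly the kind of overhead an instruction set usage study could quantify.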

These days, nobody seems to analyze instruction set usage like they did in times past. Perhaps Intel’s marketing and the demise of almost every cpu vendor has dampened enthusiasm for researching new cpu designs. These days most new cpu designs seem to be fashion driven, rather than data driven.

Do computers need registers? An issue that once attracted lots of research was the optimal number of registers for a processor. The minimum number of registers (or temporary storage locations) needed to evaluate an expression was known by 1970. There were various studies of the impact, on code generation, of increasing/decreasing the number of registers available to the compiler. But these studies were done using 1990s era compilers and modern compilers do many more optimizations; whole program optimization ought to be able to make use of many more registers than are probably available on today’s processors (at least I think so, until somebody does a study that shows otherwise). There is a register-less processor that is supposed to be taking the world by storm, sometime soon.
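The register-count result referred to above is usually associated with the Sethi–Ullman labelling (closely related to Ershov numbers). The Python sketch below computes the label under a simplified machine model that is an assumption, not part of the post: every leaf is loaded into a register, every operator takes two register operands, and nothing is spilled to memory.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Expression tree node: a leaf (variable) or a binary operator."""
    label: str
    left: Optional[Node] = None
    right: Optional[Node] = None


def registers_needed(node: Node) -> int:
    """Sethi-Ullman number of a full binary expression tree under the
    simplified register-register model described above."""
    if node.left is None or node.right is None:
        return 1  # leaf: load it into a register
    left = registers_needed(node.left)
    right = registers_needed(node.right)
    # Evaluate the hungrier subtree first; only a tie costs an extra register.
    return max(left, right) if left != right else left + 1
```

For example, `(a + b) * (c + d)` needs three registers, whereas the right-leaning `a + (b + (c + d))` needs only two.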

Do computers need to support the IEEE floating-point representation? Logarithmic number systems are starting to be used in various devices, but accuracy remains an issue for some applications.
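A toy Python model of a logarithmic number system helps show where the accuracy issue comes from. Values are stored as a sign plus the base-2 logarithm of the magnitude, so multiplication and division become exact-ish additions and subtractions of logs. Addition needs the correction term log2(1 + 2^(lo − hi)), which hardware typically approximates with tables. This sketch omits zero and mixed-sign addition, and the class is illustrative rather than any real device’s format.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LNS:
    """Toy logarithmic number: sign and log2 of the magnitude (non-zero only)."""
    sign: int       # +1 or -1
    log2: float

    @staticmethod
    def from_float(x: float) -> "LNS":
        if x == 0:
            raise ValueError("zero needs a special encoding in a real LNS")
        return LNS(1 if x > 0 else -1, math.log2(abs(x)))

    def to_float(self) -> float:
        return self.sign * 2.0 ** self.log2

    # Multiplication and division reduce to adding and subtracting logs.
    def __mul__(self, other: "LNS") -> "LNS":
        return LNS(self.sign * other.sign, self.log2 + other.log2)

    def __truediv__(self, other: "LNS") -> "LNS":
        return LNS(self.sign * other.sign, self.log2 - other.log2)

    # Addition is the awkward case: log2(2^hi + 2^lo) = hi + log2(1 + 2^(lo - hi)).
    def __add__(self, other: "LNS") -> "LNS":
        if self.sign != other.sign:
            raise NotImplementedError("mixed-sign addition omitted from this sketch")
        hi, lo = max(self.log2, other.log2), min(self.log2, other.log2)
        return LNS(self.sign, hi + math.log2(1 + 2.0 ** (lo - hi)))
```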

Treat All Test Environments Like Production

Chris Oldwood from The OldWood Thing

One of the policies I pushed for from the start when working on a greenfield system many years ago was the notion that we were going to treat all test environments (e.g. dev and UAT) like the production environment.

As you can probably imagine, this was initially greeted with a heavy dose of scepticism. However, all the complaints I could see against the idea were symptoms of dysfunctional behaviour in the delivery process. All the little workarounds and hacks that were used to back up their reasons for granting unfettered access to the environments seemed to be the result of poorly thought out design, inadequate localised testing or organisational problems. (See “Testing Drives the Need for Flexible Configuration” for how we addressed one of those concerns.)

To be clear, I am not suggesting that you should completely disable all access to the environment; on the contrary I believe that this is required even in production for those rare occasions when you just cannot piece together the problem from your monitoring and source code alone. No, what I was suggesting was that we employ the same speed bumps and privileges in our test environments that we would in production. And that went for the database too.

The underlying principle I was trying to enshrine here was that shared testing environments, by their very nature, should be treated with the utmost care to ensure a smooth delivery of change. In the past I have worked on systems where dev and test environments were a free-for-all. The result is that you waste so much time investigating issues that are orthogonal to your actual problem because someone messed with it for their own use and just left it in a broken state. (This is another example of the “Broken Windows” syndrome.)

A secondary point I was trying to make was that your test environments are also, by definition, your practice runs at getting things right. Many organisations have a lot of rigour around how they deploy to production but very little when it comes to the opportunities leading up to it. In essence your dev and test environments give you two chances to get things right before the final performance – if you’re not doing dress rehearsals beforehand how can you expect it to go right on the day? When production deployments go wrong we become fearful of them, then risk aversion kicks in, meaning we do them less often, and a downward spiral begins.

The outcome of this seemingly “draconian” approach to managing the development and test environments was that we also got to practice supporting the system in two other environments, and in a way that prepared us for what we needed to do when the fire was no longer just a drill. In particular we quickly learned what diagnostic tools we should already have on the box and, most importantly, what privileges we needed to perform certain actions. It also affected what custom tools we built and what extra features we added to the services and processes to allow safe use for analysis during support (e.g. a --ReadOnly switch).

The Principle of Least Privilege suggests that for our incident analysis we should only require read access to any resource, such as files, the database, OS logs, etc. If you know that you are protected from making accidental mistakes you can be more aggressive in your approach as you feel confident that the outcome of any mistake will not result in breaking the system any further [1][2]. Only at the point at which you need to make a change to the system configuration or data should the speed bumps kick in and you elevate yourself temporarily, make the change and immediately drop back to mere mortal status again.
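In the post the privileges were enforced by separate OS and database accounts; the Python sketch below is only an application-level analogue of the “elevate, change, drop back” discipline. The `Session` class and the `elevated` helper are hypothetical. The context manager guarantees that write access is revoked again even if the change fails, and it leaves an audit trail.

```python
from contextlib import contextmanager


class Session:
    """Hypothetical support session: read-only unless explicitly elevated."""

    def __init__(self) -> None:
        self.can_write = False
        self.audit: list[tuple] = []

    def update(self, key: str, value: int, store: dict) -> None:
        if not self.can_write:
            raise PermissionError("read-only session: elevate first")
        store[key] = value
        self.audit.append(("update", key, value))


@contextmanager
def elevated(session: Session, reason: str):
    """Grant write access for one change, then always drop back to read-only."""
    session.audit.append(("elevate", reason))
    session.can_write = True
    try:
        yield session
    finally:
        session.can_write = False
        session.audit.append(("drop",))
```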

The database was an area in particular where we had all been bitten before by support issues made worse through the execution of ad-hoc SQL passed around by email or pasted in off the wiki. Instead we added a new schema (i.e. namespace) specifically for admin and support stored procedures that were developed properly, i.e. they were written test-first. (See “You Write Your SQL Unit Tests in SQL” for more on how and why we did it this way.) This meant certain kinds of workaround were easier to apply because they were essentially part of the production codebase, not just some afterthought that nobody maintained.

On the design front this also started to have an interesting effect as we found ourselves wanting to leverage our production service code in new ways to ensure that we avoided violating invariants by hosting the underlying service components inside new containers, i.e. command line tools or making them scriptable. (See “Building Systems as Toolkits”.)

The Interface Segregation Principle is your friend here as it pushes you towards having separate interfaces for reading and writing, making it clearer which components you can direct towards a production service if you’re trying to reproduce an issue locally. For example, our calculation engine support tool allowed you to point any “readers” towards real service endpoints whilst redirecting the writers to /dev/null (i.e. using the Null Object pattern) or to some simple in-memory implementation (think Dictionary) to pass data from one internal task to the next.
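A Python sketch of that split, with every name hypothetical. The reader and writer are separate protocols, so a support tool can wire a reader to a (here stubbed) live service while the writer is a dictionary-backed store that the next internal task reads from.

```python
from typing import Protocol


class PriceReader(Protocol):
    def prices(self) -> dict[str, float]: ...


class ResultWriter(Protocol):
    def write(self, task: str, results: dict[str, float]) -> None: ...


class LiveServiceReader:
    """Stand-in for a reader aimed at a real service endpoint (hypothetical data)."""

    def prices(self) -> dict[str, float]:
        return {"ABC": 10.0, "XYZ": 20.0}


class InMemoryResults:
    """Dictionary-backed writer that later tasks can also read from."""

    def __init__(self) -> None:
        self.by_task: dict[str, dict[str, float]] = {}

    def write(self, task: str, results: dict[str, float]) -> None:
        self.by_task[task] = results


def scale_task(reader: PriceReader, out: ResultWriter) -> None:
    out.write("scaled", {k: v * 2 for k, v in reader.prices().items()})


def total_task(previous: InMemoryResults, out: ResultWriter) -> None:
    out.write("total", {"total": sum(previous.by_task["scaled"].values())})
```

Because each task only asks for the narrow interface it uses, swapping the real writer for a Null Object or an in-memory one does not touch the readers at all.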

I find it somewhat annoying that we went to a lot of effort to give ourselves the best chance of designing and building a supportable system that also provided traceability, only for the infrastructure team to disallow our request for personal per-environment support accounts, saying instead that we needed to share a single one! Even getting them to give us a separate account for dev, UAT and production was hard work. It sometimes feels like the people who complain most about a lack of transparency and rigour are the same ones that deny you access to exactly that.

I know there were times when it felt as though we could drop our guard in dev or UAT “just this once” but I don’t remember us ever doing that. Instead we always used it as an opportunity to learn more about what the real need was and how it could become a bona fide feature rather than just a hack.

 

[1] That’s not entirely true. A BA once concocted a SQL query during support that ended up “bug checking” SQL Server and brought the entire system to a grinding halt. They then did it again by accident after it was restarted :o).

[2] A second example was where someone left the Sysinternals DebugView tool running overnight on a server whereupon it filled up the log window and locked up a service due to the way OutputDebugString works under the covers.

Cubic Line Division – a.k.

a.k. from thus spake a.k.

Last time we took a look at how we can use linear interpolation to approximate a function from a set of points on its graph by connecting them with straight lines. As a consequence the result isn't smooth, meaning that its derivative isn't continuous and is undefined at the x values of the points, known as the nodes of the interpolation.
In this post we shall see how we can define a smooth interpolation by connecting the points with curves rather than straight lines.
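As one possible illustration of “connecting the points with curves”, here is a piecewise cubic Hermite interpolation in Python. The node slopes are estimated by finite differences (Catmull–Rom style in the interior), so the curve and its first derivative are continuous. This is a swapped-in technique for illustration and not necessarily the scheme the post develops; a cubic spline would additionally make the second derivative continuous.

```python
from bisect import bisect_right


def hermite_interpolate(xs: list[float], ys: list[float], x: float) -> float:
    """Piecewise cubic Hermite interpolation through (xs[i], ys[i]).
    Assumes xs is strictly increasing, has at least two nodes,
    and xs[0] <= x <= xs[-1]."""
    n = len(xs)

    def slope(i: int) -> float:
        # One-sided differences at the ends, central differences inside.
        if i == 0:
            return (ys[1] - ys[0]) / (xs[1] - xs[0])
        if i == n - 1:
            return (ys[-1] - ys[-2]) / (xs[-1] - xs[-2])
        return (ys[i + 1] - ys[i - 1]) / (xs[i + 1] - xs[i - 1])

    i = min(max(bisect_right(xs, x) - 1, 0), n - 2)  # segment containing x
    h = xs[i + 1] - xs[i]
    t = (x - xs[i]) / h
    # Cubic Hermite basis functions on [0, 1].
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return (h00 * ys[i] + h10 * h * slope(i)
            + h01 * ys[i + 1] + h11 * h * slope(i + 1))
```

Because neighbouring segments share the slope estimate at their common node, the derivative no longer jumps there, which is exactly the defect of the straight-line version.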

Software engineering is fertile ground for the belief in silver bullets

Derek Jones from The Shape of Code

The idea that there exists some wonderful technique or methodology, which solves one or more perceived software engineering problems, was given a name in 1986; the title of Brooks’ paper No Silver Bullet is a big clue that the author does not think it exists. Indeed, over the years a steady stream of papers have attempted to dispel the idea that silver bullets exist. These attempts have two things in common: the use of reasoning and facts to make their case, and failure to dispel the belief that silver bullets exist in software engineering.

Now, I am a great fan of reasoning using facts, but I am also a fan of evidence-driven approaches to solving problems. There are now over 30 years of evidence that reasoning using facts is not an effective means of convincing people that silver bullets don’t exist.

Belief in silver bullets will not go away until it ceases to be in some people’s interest for them to exist.

If you have something to sell, there is a benefit to having customers believe in silver bullets: the product/research will dramatically improve performance, time to market, costs, profitability, etc…

Belief in silver bullets is not unfounded. Computing has a 70-year history of things going faster, getting cheaper and systems doing what was once thought impossible. The press has bought into this and amazing success stories abound. Having worked on a few projects that delivered faster/cheaper/impossible systems, I know that no silver bullets were involved, just lots of hard work and sometimes being in the right place at the right time. Hard work and happenstance don’t make for feel-good headlines, and rarely get mentioned in the press.

When faced with a problem, the young and inexperienced tend to be optimists; there must be a silver bullet, a better way of doing this that is fast/cheap/efficient. The computing field has been evolving so rapidly that many of those involved are young and inexperienced; fertile ground for belief in silver bullets to flourish.

Consequences of a belief in silver bullets in industry include time/cost overruns on projects and money wasted on tools that are never used. In academia, a belief in silver bullets results in the pointless invention of new programming languages, methodologies, programming techniques, etc.

The belief in silver bullets will not fade away until the rate of change in computing slows to a crawl and most of those involved have gained substantial experience from which they can see that results come from hard work (and some amount of luck).

Organizational structure in the Digital and Agile age

Allan Kelly from Allan Kelly Associates


Someone asked the other day: how should an organisation be designed?

There are two potential answers, which actually aren’t as contradictory as they look at first sight.

The first is very simple: Don’t.

That is, don’t design your organization, don’t set out an organizational chart, don’t set out a plan and aim to restructure your organization to that plan. Rather create the conditions to let a structure emerge.

I suppose it’s the difference between “design” meaning “create a plan for the way you want things to be” and “design” meaning “the way things are arranged.” To differentiate them the first might be called “intentional design” and the latter “emergent design.”

That does not imply all emergent structures are good. As we see in code, emergent designs are not always the best and over time they need refactoring, which implies that at some point there needs to be intentional design.

Put it like this: I’d rather your organization pulls the design rather than you push a design on the organization.

Organizational structure is itself a function of business strategy. And both need to be part emergent and part intentional. Although you might have noticed I tend towards emergent while most of the world tends towards intentional!

Thus it helps to have a reference model of how you think the organization should be, maybe something to steer the organization towards.

So the second answer to the question would be longer:

  • Create standing delivery teams which are embedded in the business line itself. This is sometimes called stream teams, or stream-based development, or “teams aligned to the value stream”, or several other names I can’t think of just now.
  • Each business line is itself a stream of work and digital delivery teams support that work.
  • Teams contain all the skills and authority to do the work that is required for that business stream.
  • The team is part of the stream so the business/technical divide should dissolve. Something I call BusTech.
  • Teams are value seeking and value creating: the team seeks opportunities to create value for the business and delivers on the most valuable ones.
  • Devolve authority to the teams whenever you can. Teams are mini-businesses. (Notice I deliberately don’t use the word empowerment.)
  • Teams grow when the business is successful and more digital capability is needed. And teams shrink when money is tight or less capability is needed.
  • Teams may split (Amoeba style) from time to time. New teams may be in the same business line (addressing another question) or part of another, possibly new, business line.
  • Active – or Agile – Portfolio Management sits on top to monitor progress, provide extra resources, remove resources, etc. There may even be multiple portfolio processes, one at the business line level and perhaps one above multiple business lines.
  • Minimally Viable Teams are started to explore new initiatives, sometimes these go on to be full standing teams but they may also be dissolved if the idea doesn’t validate.
  • Seek to minimise common services between teams because these create bottlenecks, conflicts and delays. Each team should stand alone. This may mean some duplication, and therefore some extra costs, but accept that. Once you have your model working you can fine tune such things later.
  • Don’t worry about planning and synchronisation between teams too much; worry more about getting the teams to release more often and deal with synchronisation issues when they become a problem.

They are the main points at any rate. If you’d like to know more Continuous Digital contains a longer discussion of the topic. (Continuous Digital actually builds on Xanpan in this regard, and the (never finished) Xanpan Appendix discusses the same idea.)


The post Organizational structure in the Digital and Agile age appeared first on Allan Kelly Associates.

#NoProject #NoEstimates workshop

Allan Kelly from Allan Kelly Associates


In August I’m running a 1-day workshop in Zurich with Vasco Duarte on the bleeding edge of Agile: #NoProjects and #NoEstimates for Digital First companies.

This is a pre-conference event for the ALE 2018 conference which is happening the same week in Zurich. Everyone is welcome, you don’t need to attend the conference.

If you book in the next two weeks you get it cheaper; after July 20 the price goes up – although it’s still only a few hundred euros.

Book now, save money and secure your place – places are limited!

For those who can’t get to Zurich in August, I’ve got a Continuous Digital workshop of my own and a half-day management briefing. Right now you can book either of these for private in-house delivery. I’m looking at offering these as public courses here in London; if you are interested get in touch and help me fix a date.

(I have a love-hate relationship with #NoProjects; I’d love to retire the name but it resonates with so many people. So I tend to use #NoProjects when I’m discussing my critique of the project model and Continuous Digital when I’m setting out my preferred alternative.)

The post #NoProject #NoEstimates workshop appeared first on Allan Kelly Associates.