WMI Performance Anomaly: Querying the Number of CPU Cores

Chris Oldwood from The OldWood Thing

As one of the few devs that both likes and is reasonably well-versed in PowerShell I became the point of contact for a colleague that was bemused by a performance oddity when querying the number of cores on a host. He was introducing Ninja into the build and needed to throttle its expectations around how many actual cores there were because hyperthreading was enabled and our compilation intensive build was being slowed by its bad guesswork [1].

The PowerShell query for the number of cores (rather than logical processors) was pulled straight from the Internet and seemed fairly simple:

(Get-WmiObject Win32_Processor |
  measure -Property NumberOfCores -Sum).Sum

However when he ran this it took a whopping 4 seconds! Although he was using a Windows VM running on QEMU/KVM, I knew from benchmarking a while back this setup added very little overhead, i.e. only a percentage or two, and even on my work PC I observed a similar tardy performance. Here’s how we measured it:

Measure-Command {
  (Get-WmiObject Win32_Processor |
  measure -Property NumberOfCores -Sum).Sum
} | % TotalSeconds
4.0867539

(As I write my HP laptop running Windows 11 is still showing well over a second to run this command.)

My first instinct was that this was some weird overhead with PowerShell, what with it being .Net based so I tried the classic native wmic tool under the Git Bash to see how that behaved:

$ time WMIC CPU Get //Format:List | grep NumberOfCores  | cut -d '=' -f 2 | awk '{ sum += $1 } END{ print sum }'
4

real    0m4.138s

As you can see there was no real difference so that discounted the .Net theory. For kicks I tried lscpu under the WSL based Ubuntu 20.04 and that returned a far more sane time:

$ time lscpu > /dev/null

real    0m0.064s

I presume that lscpu will do some direct spelunking but even so the added machinery of WMI should not be adding the kind of ridiculous overhead that we were seeing. I even tried my own C++ based WMICmd tool as I knew that was talking directly to WMI with no extra cleverness going on behind the scenes, but I got a similar outcome.

On a whim I decided to try pushing more work onto WMI by passing a custom query instead so that it only needed to return the one value I cared about:

Measure-Command {
  (Get-WmiObject -Query 'select NumberOfCores from Win32_Processor' |
  measure -Property NumberOfCores -Sum).Sum
} | % TotalSeconds
0.0481644

Lo-and-behold that gave a timing in the tens of milliseconds range which was far closer to lscpu and definitely more like what we were expecting.

While my office machine has some “industrial strength” [2] anti-virus software that could easily be to blame, my colleague’s VM didn’t, only the default of MS Defender. So at this point I’m none the wiser about what was going on although my personal laptop suggests that the native tools of wmic and wmicmd are both returning times more in-line with lscpu so something funky is going on somewhere.

 

[1] Hyper-threaded cores meant Ninja was scheduling too much concurrent work.

[2] Read that as “massively interfering”!

 

Chaining IF and && with CMD

Chris Oldwood from The OldWood Thing

An interesting bug cropped up the other day in a dub configuration file which made me realise I wasn’t consciously aware of the precedence of && when used in an IF statement with cmd.exe.

Batch File Idioms

I’ve written a ton of batch files over the years and, with error handling being a manual affair, the usual pattern is to alternate pairs of statement + error check, e.g.

mkdir folder
if %errorlevel% neq 0 exit /b %errorlevel%

It’s not uncommon for people to explicitly leave off the error check in this particular scenario so that (hopefully) the folder will exist whether not it already does. However it then masks a (not uncommon) failure where the folder can’t be created due to permissions and so I tend to go for the more verbose option:

if not exist "folder" (
  mkdir folder
  if !errorlevel! neq 0 exit /b !errorlevel!
)

Note the switch from %errorlevel% to !errorlevel!. I tend to use setlocal EnableDelayedExpansion at the beginning of every batch file and use !var! everywhere by convention to avoid forgetting this transformation as it’s an easy mistake to make in batch files.

Chaining Statements

In cmd you can chain commands with & (much like ; in bash) with && being used when the previous command succeeds and || for when it fails. This is useful with tools like dub which allow you to define “one liners” that will be executed during a build by “shelling out”. For example you might write this:

mkdir bin\media && copy media\*.* bin\media

This works fine first time but it’s not idempotent which might be okay for automated builds where the workspace is always clean but it’s annoying when running the build repeatedly, locally. Hence you might be inclined to fix this by changing it to:

if not exist "bin\media" mkdir bin\media && copy media\*.* bin\media

Sadly this doesn’t do what the author intended because the && is part of the IF statement “then” block – the copy is only executed if the folder doesn’t exist. Hence this was the aforementioned bug which wasn’t spotted at first as it worked fine for the automated builds but failed locally.

Here is a canonical example:

> if exist "C:\" echo A && echo B
A
B

> if not exist "C:\" echo A && echo B

As you can see, in the second case B is not printed so is part of the IF statement happy path.

Parenthesis to the Rescue

Naturally the solution to problems involving ordering or precedence is to introduce parenthesis to be more explicit.

If you look at how parenthesis were used in the second example right back at the beginning you might be inclined to write this thinking that the parenthesis create a scope somewhat akin to {} in C style languages:

> if not exist "C:\" (echo A) && echo B

But it won’t work as the parenthesis are still part of the “then” statement. (They are useful to control evaluation when mixing compound conditional commands that use, say, || and & [1].)

Hence the correct solution is to use parenthesis around the entire IF statement:

> (if not exist "C:\" echo A) && echo B
B

Applying this to the original problem, it’s:

(if not exist "bin\media" mkdir bin\media) && copy media\*.* bin\media

 

[1] Single line with multiple commands using Windows batch file

Transient Expand-Archive Failures

Chris Oldwood from The OldWood Thing

[I’m sure there is something else going on here but on the off-chance someone else is also observing this and also lost at least they’ll know they’re not alone.]

We have a GitLab project pipeline that started out as a monolithic job but over the last 9 months has slowly been parallelized and now runs as over 150 jobs spread out across a cluster of 4 fairly decent [1] machines with 8 to 10 concurrent jobs per host. More recently we’ve started seeing the PowerShell Expand-Archive cmdlet failing randomly up to 5% of the time with the following error:

Remove-Item : Cannot find path {...} because it does not exist.

The line of code highlighted in the error is:

$expandedItems | % { Remove-Item $_ -Force -Recurse }

If you google this message it suggests this probably isn’t the real error but a problem with the cmdlet trying to clean-up after failing to extract the contents of the .zip file. Sadly the reason why the extraction might have failed in the first place is now lost.

Investigation

While investigating this error message I ran across two main hits – one from Stack Overflow and the other on the PowerShell GitHub project – both about hitting the classic long path problem in Windows. In our case the extracted paths, even including the build agent root, is still only 100 characters so well within the limit as the archive only has one subfolder and the filenames are short.

Also the archive is built with it’s companion cmdlet Compress-Archive so I doubt it’s an impedance mismatch in our choice of tools.

My gut reaction to anything spurious like this is that it’s the virus scanner (AV) [2]. Sadly I have no direct control over the virus scanner product choice or its configuration. In this instance the machines have Trend Micro whereas the other build agents I’ve built are VMs and have Windows Defender [3], but their load is also much lower. I managed to get the build folder excluded temporarily but that appears to have had no effect and nothing was logged in the AV to say it had blocked anything. (The “behaviour monitoring” in modern AV products often gets triggered by build tools which is annoying.)

After discounting the obvious and checking that memory exhaustion also wasn’t a factor as the memory load for the jobs is variable and the worst case loading can cause the page-file to be used, I wondered if there the problem lay with the GitLab runner cache somehow.

Corrupt Runner Cache?

To avoid downloading the .zip file artefact for every job run we utilise the GitLab runner local cache. This is effectively a .zip file of a packages folder in the project working copy that gets packed up and re-used in the other jobs on the same machine which, given our level of concurrency, means it’s constantly in use. Hence I wondered if our archive was being corrupted when the cache was being unpacked as I’ve seen embedded .zip files cause problems in the past for AV tools (even though it supposedly shouldn’t have been touching the folder). So I added a step to test our archive’s integrity before unpacking it by using 7-Zip as there doesn’t appear to be a companion cmdlet Test-Archive. I immediately saw the integrity test pass but the Expand-Archive step fail a few times so I’m pretty sure the problem is not archive corruption.

Workaround

The workaround which I’ve employed is to use 7-Zip for the unpacking step too and so far we’ve seen no errors at all but I’m left wondering why Expand-Archive was intermittently failing. Taking an extra dependency on a popular tool like 7-Zip is hardly onerous but it bumps the complexity up very slightly and needs to be accounted for in the docs / scripts.

In my 2017 post Fallibility I mentioned how I once worked with someone who was more content to accept they’d found an undocumented bug in the Windows CopyFile() function than believe there was a flaw in their code or analysis [4]. Hence I feel something as ubiquitous as Expand-Archive is unlikely to have a decompression bug and that there is some piece of the puzzle here that I’m missing. Maybe the AV is still interfering in some way that isn’t triggered by 7-Zip or the transient memory pressure caused by the heavier jobs is having an impact?

Given the low cost of the workaround (use 7-Zip instead) the time, effort and disruption needed to run further experiments to explore this problem further is sadly too high. For the time being annecdata is the best I can do.

 

[1] 8 /16 cores, 64 / 128 GB RAM, and NVMe based disks.

[2] I once did some Windows kernel debugging to help prove an anti-virus product update was the reason our engine processes where not terminating correctly under low memory conditions.

[3] Ideally servers shouldn’t need anti-virus tools at all but the principle of Defence in Depth suggests the minor performance impact is worth it to potentially help slow lateral movement.

[4] TL;DR: I quickly showed it was the latter at fault not the Windows API.

 

Lose the Source Luke?

Chris Oldwood from The OldWood Thing

We were writing a new service to distribute financial pricing data around the trading floor as a companion to our new desktop pricing tool. The plugin architecture allowed us to write modular components that could tap into the event streams for various reasons, e.g. provide gateways to 3rd party data streams.

Linking New to Old

One of the first plugins we wrote allowed us to publish pricing data to a much older in-house data service which had been sat running in the server room for some years as part of the contributions system. This meant we could eventually phase that out and switch over to the new platform once we had parity with it.

The plugin was a doddle to write and we quickly had pricing data flowing from the new service out to a test instance of the old service which we intended to leave running in the background for soak testing. As it was an in-house tool there was no installer and my colleague had a copy of the binaries lying around on his machine [1]. Also he was one of the original developers so knew exactly what he was doing to set it up.

A Curious Error Message

Everything seemed to be working fine at first but as the data volumes grew we suddenly noticed that the data feed would eventually hang after a few days. In the beginning we were developing the core of the new service so quickly it was constantly being upgraded but now the pace was slowing down the new service was alive for much longer. Given how mature the old service was we assumed the issue was with the new one. Also there was a curious message in the log for the old service about “an invalid transaction ID” before the feed stopped.

While debugging the new plugin code my colleague remembered that the Transaction ID meant the message sequence number that goes in every message to allow for ordering and re-transmission when running over UDP. The data type for that was a 16-bit unsigned integer so it dawned on us that we had probably messed up handling the wrapping of the Transaction ID.

Use the Source Luke

Given how long ago he last worked on the old service he couldn’t quite remember what the protocol was for resetting the Transaction ID so we decided to go and look at the old service source code to see how it handled it. Despite being at the company for a few years myself this all pre-dated me so I left my colleague to do the rummaging.

Not long after my colleague came back over to my desk and asked if I might know where the source code was. Like so many programmers in a small company I was a part-time sysadmin and generally looked after some of servers we used for development duties, such as the one where our Visual SourceSafe repository lived that contained all the projects we’d ever worked on since I joined.

The VCS Upgrade

When I first started at the company there were only a couple of programmers not working on the mainframe and they wrote their own version control system. It was very Heath Robinson and used exclusive file locks to side-step the problem of concurrent changes. Having been used to a few VCS tools by then such as PVCS, Star Versions, and Visual SourceSafe I suggested that we move to a 3rd party VCS product as we needed more optimistic concurrency controls as more people were going to join the team. Given the MSDN licenses we already had along with my own experience Visual SourceSafe (VSS) seemed like a natural choice back then [2].

Around the same time the existing development server was getting a bit long in the tooth so the company forked out for a brand new server and so I set-up the new VSS repository on that and all my code went in there along with all the subsequent projects we started. None of the people that joined after me ever touched any of the old codebase or VCS as it was so mature it hadn’t needed changing in some time and anyway the two original devs where still there to look after it.

The Office Move

A couple of years after I joined, the owners of the lovely building the company had been renting for the last few decades decided they wanted to gut and renovate it as the area in London where we were based was getting a big makeover. Hence we were forced to move to new premises about half a mile away. The new premises were nice and modern and I no longer had the vent from the portable air-conditioning machine from one of the small server rooms pumping out hot air right behind my desk [3].

When moving day came I made sure the new server with all our stuff on got safely transported to the new office’s server room so that we ready to go again on Monday morning. As we stood staring around the empty office floor my colleague pointed to the old development server which had lay dormant in the corner and asked me (rhetorically) whether we should even bother taking it with us. As far as I was concerned everything I’d ever needed had always been on the new server and so I didn’t know what was left that we’d still need.

My colleague agreed and so we left the server to be chucked in the skip when the bulldozers came.

Dormant, But Not Redundant

It turned out their original home-grown version control system had a few projects in it, including the old data service. Luckily one of the original developers who worked on the contributions side still had an up-to-date copy of that and my colleague found a local copy of the code for one of the other services but had no idea how up-to-date it was. Sadly nobody had even a partial copy of the source to the data service we were interested in but we were going to replace that anyway so in the end the loss was far less significant than we originally feared.

In retrospect I can’t believe we didn’t even just take the hard disk with us. The server was a classic tower so took up a far bit of room which was still somewhat at a premium in the new office whereas the disk could probably have sit in a desk drawer or even been fitted as an extra drive in the new midi sized development server.

 

[1] +1 for xcopy deployment which made setting up development and test instances a piece of cake.

[2] There are a lot of stories of file corruption issues with VSS but in the 7 years I’d used it with small teams, even over a VPN, we only had one file corruption issue that we quickly restored from a backup.

[3] We were on the opposite side from the windows too so didn’t even get a cool breeze from those either.

 

The Case of the Curious Commit Message

Chris Oldwood from The OldWood Thing

I had taken a new contract at an investment bank and started working on a very mature codebase which was stored in ClearCase. As a long-time user [1] of version control systems one of the things that bugged me about the codebase were empty commit messages. On a mature codebase where people have come and gone it’s hard enough to work out what was going on just from the code, decent commit messages should be there to give you that extra context around the “why”.

Rallying the Troops

After attempting to sell the virtues of commit messages to my colleagues a couple of times during team meetings there were still the odd one or two that consistently avoided doing so. So I decided to try a name-and-shame approach [2] by emailing a table of names along with their percentage of non-empty commit messages hoping those that appeared at the bottom would consider changing their ways.

At the time I was just getting my head around ClearCase and there were a couple of complaints from people who felt unduly chastised because they didn’t have 100% when they felt they should. It turned out their accounts were used for some automated check-ins which had no message which I didn’t know about, so I excluded those commits and published a revised table.

Progress?

On the plus side this got people discussing what a good commit message looked like and it brought up some question marks around certain practices that others had done. For example a few team members wouldn’t write a formal message but simply paste the ID of the issue from ClearQuest [3]. Naturally this passed my “not empty” test but it raised a question about overly terse commit messages. Given where we were coming from I felt this was definitely acceptable (for the time being) as they were still using the commit message to provide more details, albeit in the form of a link to the underlying business request [5].

However, it got me thinking about whether people were not really playing ball and might be gaming the system so I started looking into overly terse commit messages and I’m glad to say everyone was entering into the spirit of things [4]. Everyone except one person who had never even been on the initial radar but who had a sizable number of commits with the simple message:

    nt

That’s right, just the two letters ‘n’ and ‘t’. (There were others but this was the most prevalent and memorable.)

A Curious Message

Looking at the diffs that went with these messages it wasn’t obvious what “nt” meant. My initial instinct was that it was an abbreviation of some sort, perhaps a business term I was unaware of as the developer was involved in the more maths heavy side of the project. They were far more common before my “shake-up” so I was pleased that whatever this term was it was being replaced by more useful comments now but I was still intrigued. Naturally I walked across the room to the very pleasant developer in question and asked him what “nt” meant.

It turned out it didn’t mean anything, and the developer was largely unaware they even existed! So where did they come from?

The Mist Clears

Luckily while we were chatting he started making a new change and I saw the ClearCase check-out dialog appear and the initial message was a few letters of garbage. I looked at what he intended to type in the editor and it dawned on me what was happening – the “nt” was the latter part of the word “int”.

Just as with Visual SourceSafe, the ClearCase Visual Studio plugin would trigger when you started editing a file and nothing else was checked out at that point. It would pop-up a dialog so you could configure how the check-out was done. For example you might want to put an exclusive lock on the file [6] or you could provide a message so others could see what files were being edited concurrently. By default the focus in this dialog was on the OK button so it was possible to dismiss this dialog without even really seeing it…

Hence this is what was going on:

  1. The dev typed “int” to start a declaration as part of a new set of code changes.
  2. The “i” keypress triggered the ClearCase plugin which noticed this was the start of a new check-out and promptly threw up a dialog with the remaining letters “nt” in the message field.
  3. By then the dev had already pressed “space” at the end of the type name which, due to the default button focus, caused the dialog to immediately disappear.
  4. When he committed the changes at the end he never edited the message anyway, he would just click the commit button and move on.

Case closed. From a UI perspective it probably falls into the same category (although with less disastrous consequences) as those unexpected popups that ask if you want to reboot your machine, NOW. Ouch!

 

[1] I was introduced to them on my very first job and have been fortunate enough to use one on virtually every job since, even if I ended up setting one up :o).

[2] In retrospect I probably didn’t try hard enough to sell it and should have taken a more personal approach for the laggards as maybe there were good reasons why they weren’t doing it, e.g. tooling.

[3] Yes, an enterprise level defect tracking tool with all the pain you’d expect from such a product.

[4] For non-trivial things that is, the message “typo” still appeared for some of those but that raised a whole different set of questions around not compiling or testing changes before committing them!

[5] Including the ticket number at the start of a commit message is something I promote in my Commit Checklist.

[6] This was useful for non-mergeable files like DTS packages and media assets but often ended up creating more harm than good as they got left locked and you had to get an admin to unlock them, and they were in another team.

Planning is Inevitable

Chris Oldwood from The OldWood Thing

Like most programmers I’ve generally tried to steer well clear of getting involved in management duties. The trouble is that as you get older I think this becomes harder and harder to avoid. Once you get the mechanics of programming under control you might find you have more time to ponder about some of those other duties which go into delivering software because they begin to frustrate you.

The Price of Success

Around the turn of the millennium I was working in a small team for a small financial organisation. The management structure was flat and we had the blessing of the owner to deliver what we thought the users needed and when. With a small but experienced team of programmers we could adapt to the every growing list of feature requests from our users. Much of what we were doing at the time was trying to work out how certain financial markets were being priced so there was plenty of experimentation which lead to the writing and rewriting of the pricing engine as we learned more.

The trouble with the team being successful and managing to reproduce prices from other more expensive 3rd party pricing software was that we were then able to replace it. But of course it also has some other less important features that users then decided they needed too. Being in-house and responsive to their changes just means the backlog grows and grows and grows…

The Honeymoon Must End

While those users at the front of the queue are happy their needs are being met you’ll end up pushing others further down the queue and then they start asking when you’re going to get around to them. If you’re lucky the highs from the wins can outweigh the lows from those you have to disappoint.

The trouble for me was that I didn’t like having to keep disappointing people by telling them they weren’t even on the horizon, let alone next on the list. The team was doing well at delivering features and reacting to change but we effectively had no idea where we stood in terms of delivering all those other features that weren’t being worked on.

MS Project Crash Course

The company had one of those MSDN Universal licenses which included a ton of other Microsoft software that we never used, including Microsoft Project. I had a vague idea of how to use it after seeing some plans produced by previous project managers and set about ploughing through our “backlog” [1] estimating every request with a wild guess. I then added the five of us programmers in the team as the “resources” [2] and got the tool to help distribute the work amongst ourselves as best as possible.

I don’t remember how long this took but I suspect it was spread over a few days while I did other stuff, but at the end I had a lovely Gantt Chart that told us everything we needed to know – we had far too much and not enough people to do it in any meaningful timeframe. If I remember correctly we had something like a year’s worth of work even if nothing else was added to the “TODO list” from now on, which of course is ridiculous – software is never done until it’s decommissioned.

For a brief moment I almost felt compelled to publish the plan and even try and keep it up-to-date, after all I’d spend all that effort creating it, why wouldn’t I? Fortunately I fairly quickly realised that the true value in the plan was knowing that we had too much work and therefore something had to change. Maybe we needed more people, whether that was actual programmers or some form of manager to streamline the workload. Or maybe we just needed to accept the reality that some stuff was never going to get done and we should ditch it. Product backlogs are like the garage or attic where “stuff” just ends up, forgotten about but taking up space in the faint hope that one day it’ll be useful.

Saying No

The truth was uncomfortable and I remember it lead to some very awkward conversations between the development team and the users for a while [3]. There is only so long that you can keep telling people “it’s on the list” and “we’ll get to it eventually” before their patience wears out. It was unfair to string people along when we pretty much knew in our hearts we’d likely never have the time to accommodate them, but being the eternal optimists we hoped for the best all the same.

During that period of turmoil having the plan was a useful aid because it allowed is to have those awkward conversations about what happens if we take on new work. Long before we knew anything about “agility” we were doing our best to respond to change but didn’t really know how to handle the conflict caused by competing choices. There was definitely an element of “he who shouts loudest” that had a bearing on what made its way to the top of the pile rather than a quantitative approach to prioritisation.

Even today, some 20 years on, it’s hard to convince teams to throw away old backlog items on the premise that if they are important enough they’ll bubble up again. Every time I see an issue on GitHub that has been automatically closed because of inactivity it makes me a little bit sad, but I know it’s for the best; you simply cannot have a never-ending list of bugs and features – at some point you just have to let go of the past.

On the flipside, while I began to appreciate the futility of tracking so much work, I also think going through the backlog and producing a plan made me more tolerant of estimates. Being that person in the awkward situation of trying to manage someone’s expectations has helped me get a glimpse of what questions some people are trying to answer by creating their own plans and how our schedule might knock onto them. I’m in no way saying that I’d gladly sit through sessions of planning poker simply for someone to update some arbitrary project plan because it’s expected of the team, but I feel more confident asking the question about what decisions are likely to be affected by the information I’m being asked to provide.

Self-Organising Teams

Naturally I’d have preferred someone else to be the one to start thinking about the feature list and work out how we were going to organise ourselves to deal with the deluge of work, but that’s the beauty of a self-organising team. In a solid team people will pick up stuff that needs doing, even if it isn’t the most glamourous task because ultimately what they want is to see is the team succeed [4], because then they get to be part of that shared success.

 

[1] B.O.R.I.S (aka Back Office Request Information System) was a simple bug tracking database written with Microsoft Access. I’m not proud of it but it worked for our small team in the early days :o).

[2] Yes, the air quotes are for irony :o).

[3] A downside of being close to the customer is that you feel their pain. (This is of course a good thing from a process point of view because you can factor this into your planning.)

[4] See “Afterwood – The Centre Half” for more thoughts on the kind of role I seem to end up carving out for myself in a team.

Pair Programming Interviews

Chris Oldwood from The OldWood Thing

Let’s be honest, hiring people is hard and there are no perfect approaches. However it feels somewhat logical that if you’re hiring someone who will spend a significant amount of their time solving problems by writing software, then you should probably at least try and validate that they are up to the task. That doesn’t mean you don’t also look for ways to asses their suitability for the other aspects of software development that don’t involve programming, only that being able to solve a problem with code will encompass a fair part of what they’ll be doing on a day-to-day basis [1].

Early Computer Based Tests

The first time I was ever asked to write code on a computer as part of an interview was way back in the late ‘90s. Back then pair programming wasn’t much of a thing in the Enterprise circles I moved in and so the exercise was very hands-off. They left me in the boardroom with a computer (but no internet access) and gave me a choice of exercises. Someone popped in half way through to make sure I was alright but other than that I had no contact with anyone. At the end I chatted briefly with the interviewer about the task but it felt more like a box ticking affair than any real attempt to gain much of an insight into how I actually behaved as a programmer. (An exercise in separating “the wheat from the chaff”.)

I got the job and then watched from the other side of the table as other people went through the same process. In retrospect being asked to write code on an actual computer was still quite novel back then and therefore we probably didn’t explore it as much as we should have.

It was almost 15 years before I was asked to write code on a computer again as part of an interview. In between I had gone through the traditional pencil & paper exercises which I was struggling with more and more [2] as I adopted TDD and refactoring as my “stepwise refinement” process of choice.

My First Pair Programming Interview

Around 2013 an old friend in the ACCU, Ed Sykes, told me about a consultancy firm called Equal Experts who were looking to hire experienced freelance software developers. Part of their interview process was a simple kata done in a pair programming style. While I had done no formal pair programming up to that time [3] it was a core technique within the firm and so any candidates were expected to be comfortable adopting this practice where preferable.

I was interviewed by Ed Sykes, who played a kind of Product Owner role, and Adam Straughan, who was more hands-on in the experience. They gave me the Roman Numerals kata (decimal to roman conversion), which I hadn’t done before, and an hour to solve it. I took a pretty conventional approach but didn’t quite solve the whole thing in the allotted time as I didn’t quite manage to get the special cases to fall out more naturally. Still, the interviewers must have got what they were after as once again I got the job. Naturally I got involved in the hiring process at Equal Experts too because I really liked the process I had gone through and I wanted to see what it was like on the other side of the keyboard. It seemed so natural that I wondered why more companies didn’t adopt something similar, irrespective of whether or not any pair programming was involved in the role.

Whenever I got involved in hiring for the end client I also used the same technique although I tended to be a lone “technical” interviewer rather than having the luxury of the PO + Dev approach that I was first exposed to but it was still my preferred approach by a wide margin.

Pairing – Interactive Interviewing

On reflection what I liked most about this approach as a candidate, compared to the traditional one, is that it felt less like an exam, which I generally suck at, and more like what you’d really do on the job. Putting aside the current climate of living in a pandemic where many people are working at home by themselves, what I liked most was that I had access to other people and was encouraged to ask questions rather than solve the problem entirely by myself. To wit, it felt like I was interviewing to be part of a team of people, not stuck in a booth and expected to working autonomously [4]. Instead of just leaving you to flounder, the interviewers would actively nudge you to help unblock the situation, just like they (hopefully) would do in the real world. Not everyone notices the same things and as long as they aren’t holding the candidate’s hand the whole time that little nudge should be seen as a positive sign about taking on-board feedback rather than failing to solve the problem. It’s another small, but I feel hugely important, part of making the candidate feel comfortable.

The Pit of Success

We’ve all heard about those interviews where it’s less about the candidate and more about the interviewer trying to show how clever they are. It almost feels like the interviewer is going out of their way to make the interview as far removed from normal operating conditions as possible, as if the pressure of an interview is somehow akin to a production outage. If your goal is to get the best from the candidate, and it should be if you want the best chance of evaluating them fairly, then you need to make them feel as comfortable as possible. You only have a short period of time with them so getting them into the right frame of mind should be utmost in your mind.

One of the problems I faced in that early programming test was an unfamiliar computer. You have a choice of whether to try and adapt to the keyboard shortcuts you’re given or reconfigure the IDE to make it more natural. You might wonder if that’s part of the test which wastes yet more time and adds to the artificial nature of the setting. What about the toolset – can you use your preferred unit testing framework or shell? Even in the classic homogenous environment that is The Windows Enterprise there is often still room for personal preference, despite what some organisations might have you believe [5].

Asking the candidate to bring their own laptop overcomes all of these hurdles and gives them the opportunity to use their own choice of tools thereby allowing them to focus more on the problem and interaction with you and less on yak shaving. They should also have access to the Internet so they can google whatever they need to. It’s important to make this perfectly clear so they won’t feel penalised for “looking up the answer” to even simple things because we all do that for real, let alone under the pressure of an interview. Letting them get flustered because they can’t remember something seemingly trivial and then also worrying about how it’ll look if they google it won’t work in your favour. (Twitter is awash with people asking senior developers to point out that even they google the simple things sometimes and that you’re not expected to remember everything all the time.)

Unfortunately, simply because there are people out there that insist on interviewing in a way designed to trip up the candidate, I find I have to go overboard when discussing the setup to reassure them that there really are no tricks – that the whole point of the exercise is to get an insight into how they work in practice. Similarly reassuring the candidate that the problem is open-ended and that solving it in the allotted is not expected also helps to relax them so they can concentrate more on enjoying the process and feel comfortable with you stopping to discuss, say, their design choices instead of feeling the need to get to the end of yet another artificial deadline instead.

The Exercise

I guess it’s to be expected that if you set a programming exercise that you’d want the candidate to complete it; but for me the exercise is a means to a different end. I’m not interested in the problem itself, it’s the conversation we have that provides me with the confidence I need to decide if the candidate has potential. This implies that the problem cannot be overly cerebral as the intention is to code and chat at the same time.

While there are a number of popular katas out there, like the Roman Numerals conversion, I never really liked any of them. Consequently I came up with my own little problem based around command line parsing. For starters I felt this was a problem domain that was likely to be familiar to almost any candidate even if they’re more GUI oriented in practice. It’s also a problem that can be solved in a procedural, functional, or object-oriented way and may even, as the design evolves, be refactored from one style to the other, or even encompass aspects of multiple paradigms. (Many of the classic katas are very functional in nature.) There is also the potential to touch on I/O with the program usage and this allows the thorny subject of mocking and testability to be broached which I’ve found to be a rich seam of discussion with plenty of opinions.

(Even though the first iteration of the problem only requires supporting “-v” to print a version string I’ve had candidates create complex class hierarchies based around the Command design pattern despite making it clear that we’ll introduce new features in subsequent iterations.)

Mechanics

Aside from how a candidate solves a problem from a design standpoint I’m also interested in the actual mechanics of how they program. I don’t mean whether they can touch type or not – I personally can’t so that would be a poor indicator :o) – no, I mean how they use the tools. For example I find it interesting what they use the keyboard or mouse for, what keyboard shortcuts they use, how they select and move text, whether they use snippets or prefer the editor not to interfere. While I don’t think any of the candidate’s choices says anything significant about their ability to solve the problem, it does provide an interesting avenue for conversation.

It’s probably a very weak indicator but programmers are often an opinionated bunch and one area they can be highly opiniated about is the tools they use. Some people love to talk about what things they find useful, in essence what they feel improves or hinders their productivity. This in turn begs the question about what they believe “productivity” is in a software development context.

Reflection

What much of this observation and conversation boils down to is not about whether they do things the same way I do – on the contrary I really hope they don’t as diversity is important – it’s about the “reflective” nature of the person. How much of what they do is through conscious choice and how much is simply the result of doing things by rote.

In my experience the better programmers I have worked with tend to more aware of how they work. While many actions may fall into the realm of unconscious competence when “in the zone” they can likely explain their rationale because they’re are still (subconsciously) evaluating it in the background in case a better approach is suitable.

(Naturally this implies the people I tend to interview are, or purport to be, experienced programmers where that level of experience is assumed to be over 10 years. I’m not sure what you can expect to take away from this post when hiring those just starting out on their journey.)

An Imperfect Process

Right back at the start I said that interviewing is an imperfect process and while I think pairing with someone is an excellent way to get a window into their character and abilities, so much still comes down to a gut feeling and therefore a subjective assessment.

I once paired with someone in an interview and while I felt they were probably technically competent I felt just a tinge of uneasiness about them personally. Ultimately the final question was “would I be happy to work with this person?” and so I said “yes” because I felt I would be nit-picking to say “no”. As it happens I did end up working with this person and a couple of months into the contract I had to have an awkward conversation with my other two colleagues to see if they felt the same way I did about this team mate. They did and the team mate was “swapped out” after a long conversation with the account manager.

What caused us to find working with this person unpleasant wasn’t something we felt could easily and quickly be rectified. They had a general air of negativity about them and had a habit of making disparaging, sweeping remarks which showed they looked down on database administrators and other non-programming roles. They also lacked an attention to detail causing the rest of us to dot their I’s and cross their T’s. Even after bringing this up directly it didn’t get any better; they really just wanted to get on and write new code and leave the other tasks like reviewing, documenting, deploying, etc. to other people.

I doubt there is anything you can do in an hour of pairing to unearth these kind of undesirable traits [6] to a level that you can adequately assess, which is why the gut still has a role to play. (I suspect it was my many years of experience in the industry working with different people that originally set my spider senses tingling.)

Epilogue

The hiring question I may find myself putting to the client is whether they would prefer to accidentally let a good candidate slip away because the interview let them (the candidate) down or accidentally hire a less suitable candidate that appeared to “walk-the-walk” as well as “talk-the-talk” and potentially become a liability. Since doing pairing interviews this question has come up very rarely with a candidate as it’s been much clearer from the pairing experience what their abilities and attitude are.

 

[1] This doesn’t just apply to hiring individuals but can also work for whole teams, see “Choosing a Supplier: The Hackathon”.

[2] See “Afterwood – The Interview” for more on how much I dislike the pen & paper approach to coding interviews.

[3] My first experience was in a Cyber Dojo evening back in September 2010 that Jon Jagger ran at Skills Matter in London. I wrote it up for the ACCU: “Jon Jagger’s Coding Dojo”.

[4] Being a long-time freelancer this mode of operation is not unexpected as you are often hired into an organisation specifically for your expertise; your contributions outside of “coding” are far less clear. Some like the feedback on how the delivery process is working while others do not and just want you to write code.

[5] My In The Toolbox article “Getting Personal” takes a look at the boundary between team conventions and personal freedom for choices in tooling and approach.

[6] I’m not saying this person could not have improved if given the right guidance, they probably could have and I hope they actually have by now; they just weren’t right for this particular environment which needed a little more sensitivity and rigour.


Fast Hardware Hides Many Sins

Chris Oldwood from The OldWood Thing

Way back at the beginning of my professional programming career I worked for a small software house that wrote graphics software. Although it had a desktop publisher and line-art based graphics package in its suite it didn’t have a bitmap editor and so they decided to outsource that to another local company.

A Different User Base

The company they chose to outsource to had a very high-end bitmap editing product and so the deal – to produce a cut-down version – suited both parties. In principle they would take their high-end product, strip out the features aimed at the more sophisticated market (professional photographers) and throw in a few others that the lower end of the market would find beneficial instead. For example their current product only supported 24-bit video cards, which were pretty unusual in the early to mid ‘90s due to their high price, and so supporting 8-bit palleted images was new to them. Due to the large images their high-end product could handle using its own virtual memory system they also demanded a large, fast hard disk too.

Even though I was only a year or two into my career at that point I was asked to look after the project and so I would get the first drop of each version as they delivered it so that I could evaluate their progress and also keep an eye on quality. The very first drop I got contained various issues that in retrospect did not bode well for the project, which ultimately fell through, although that was not until much later. (Naturally I didn’t have the experience I have now that would probably cause me to pull the alarm chord much sooner.)

Hard Disk Disco

One of the features that they partially supported but we wanted to make a little more prominent was the ability to see what the RGB value of the pixel under the cursor was – often referred to now as a colour dropper or eye dropper. When I first used the feature on my 486DX PC I noticed that it was a somewhat laggy; this surprised me as I had implemented algorithms like Floyd-Steinberg dithering so knew a fair bit about image manipulation and what algorithms were expensive and this definitely wasn’t one! As an aside I had also noticed that the hard disk light on my PC was pretty busy too which made no sense but was probably worth mentioning to them as an aside.

After feeding back to them about this and various other things I’d noticed they made some suggestions that their virtual memory system was probably overly aggressive as the product was designed for more beefier hardware. That kind of made sense and I waited for the next drop.

On the next drop they had apparently made various changes to their virtual memory system which helped it cope much better with smaller images so they didn’t page unnecessarily but I still found the feature laggy, and as I played with it some more I noticed that the hard disk light was definitely flashing lots when I moved the mouse although it didn’t stop flashing entirely when I stopped moving it. For our QA department who only had somewhat smaller 386SX machines it was almost even more noticeable.

DBWIN – Airing Dirty Laundry

At our company all the developers ran the debug version of Windows 3.1. enhanced mode with a second mono monitor to display messages from the Windows APIs to point out bugs in our software, but it was also very interesting to see what errors other software generated too [1]. You probably won’t be surprised to discover that the bitmap editor generated a lot of warnings. For example Windows complained about the amount of extra (custom) data it was storing against a window handle (hundreds of bytes) which I later discovered was caused by them constantly copying image attribute data back-and-forth as individual values instead of allocating a single struct with the data and copying that single pointer around.

Unearthing The Truth

Anyway, back to the performance problem. Part of the deal enabled our company to gain access to the bitmap editor source code which they gave to us earlier than originally planned so that I could help them by debugging some of their gnarlier crashes [2]. Naturally the first issue I looked into was the colour dropper and I quickly discovered the root cause of the dreadful performance – they were reading the application’s .ini file every time [3] the mouse moved! They also had a timer which simulated a WM_MOUSEMOVE message for other reasons which was why it still flashed the hard disk light even when the mouse wasn’t actually moving.

When I spoke to them about it they explained that once upon a time they ran into a Targa video card where the driver returned the RGB values as BGR when calling GetPixel(). Hence what they were doing was checking the .ini file to see if there was an application setting there to tell them to swap the GetPixel() result. Naturally I asked them why they didn’t just read this setting once at application start-up and cache the value given that the user can’t swap the video card whilst the machine (let alone the application) was running. Their response was simply a shrug, which wasn’t surprising by that time as it was becoming ever more apparent that the quality of the code was making it hard to implement the features we wanted and our QA team was turning up other issues which the mostly one-man team was never going to cope with in a reasonable time frame.

Epilogue

I don’t think it’s hard to see how this feature ended up this way. It wasn’t a prominent part of their high-end product and given the kit their users ran on and the kind of images they were dealing with it probably never even registered with all the other swapping going on. While I’d like to think it was just an oversight and one should never optimise until they have measured and prioritised there were too many other signs in the codebase that suggested they were relying heavily on the hardware to compensate for poor design choices. The other is that with pretty much only one full-time developer [5] the pressure was surely on to focus on new features first and quality was further down the list.

The project was eventually canned and with the company I was working for struggling too due to the huge growth of Microsoft Publisher and CorelDraw I only just missed the chop myself. Sadly neither company is around today despite quality playing a major part in the company I worked for and it being significantly better than many of the competing products.

 

[1]  One of the first pieces of open source software I ever published (on CiX) was a Mono Display Adapter Library.

[2] One involved taking Windows “out at the knees” – not even CodeView or BoundsChecker would trap it – the machine would just restart. Using SoftICE I eventually found the cause – calling EndDialog() instead of DestroyWindow() to close a modeless dialog.

[3] Although Windows cached the contents of the .ini file it still needed to stat() the file on every read access to see if it had changed and disk caching wasn’t exactly stellar back then [4].

[4] See this tweet of mine about how I used to grep my hard disk under Windows 3.1 :o).

[5] I ended up moonlighting for them in my spare time by writing them a scanner driver for one of their clients while they concentrated on getting the cut-down bitmap editor done for my company.

Simple Tables From JSON Data With JQ and Column

Chris Oldwood from The OldWood Thing

My current role is more of a DevOps role and I’m spending more time than usual monitoring and administrating various services, such as the GitLab instance we use for source control, build pipelines, issue management, etc. While the GitLab UI is very useful for certain kinds of tasks the rich RESTful API allows you to easily build your own custom tools to to monitor, analyse, and investigate the things you’re particularly interested in.

For example one of the first views I wanted was an alphabetical list of all runners with their current status so that I could quickly see if any had gone AWOL during the night. The alphabetical sorting requirement is not something the standard UI view provides hence I needed to use the REST API or hope that someone had already done something similar first.

GitLab Clients

I quickly found two candidates: python-gitlab and go-gitlab-client which looked promising but they only really wrap the API – I’d still need to do some heavy lifting myself and understand what the GitLab API does. Given how simple the examples were, even with curl, it felt like I wasn’t really saving myself anything at this point, e.g.

curl --header "PRIVATE-TOKEN: $token" "https://gitlab.example.com/api/v4/runners"

So I decided to go with a wrapper script [1] approach instead and find a way to prettify the JSON output so that the script encapsulated a shell one-liner that would request the data and format the output in a simple table. Here is the kind of JSON the GitLab API would return for the list of runners:

[
  {
   "id": 6,
   "status": "online"
   . . .
  }
,
  {
   "id": 8,
   "status": "offline"
   . . .
  }
]

JQ – The JSON Tool

I’d come across the excellent JQ tool for querying JSON payloads many years ago so that was my first thought for at least simplifying the JSON payloads to the fields I was interested in. However on further reading I found it could do some simple formatting too. At first I thought the compact output using the –c option was what I needed (perhaps along with some tr magic to strip the punctuation), e.g.

$ echo '[{"id":1, "status":"online"}]' |\
  jq -c
[{"id":1,"status":"online"}]

but later I discovered the –r option provided raw output which formatted the values as simple text and removed all the JSON punctuation, e.g.

$ echo '[{"id":1, "status":"online"}]' |\
  jq -r '( .[] | "\(.id) \(.status)" )'
1 online

Naturally my first thought for the column headings was to use a couple of echo statements before the curl pipeline but I also discovered that you can mix-and match string literals with the output from the incoming JSON stream, e.g.

$ echo '[{"id":1, "status":"online"}]' |\
   jq -r '"ID Status",
          "-- ------",
          ( .[] | "\(.id) \(.status)" )'
ID Status
-- ------
1 online

This way the headings were only output if the command succeeded.

Neater Tables with Column

While these crude tables were readable and simple enough for further processing with grep and awk they were still pretty unsightly when the values of a column were too varied in length such as a branch name or description field. Putting them on the right hand side kind of worked but I wondered if I could create fixed width fields ala printf via jq.

At this point I stumbled across the StackOverflow question How to format a JSON string as a table using jq? where one of the later answers mentioned a command line tool called “column” which takes rows of text values and arranges them as columns of similar width by adjusting the spacing between elements.

This almost worked except for the fact that some fields had spaces in their input and column would treat them by default as separate elements. A simple change of field separator from a space to a tab meant that I could have my cake and eat it, e.g.

$ echo '[ {"id":1, "status":"online"},
          {"id":2, "status":"offline"} ]' |\
  jq -r '"ID\tStatus",
         "--\t-------",
         ( .[] | "\(.id)\t\(.status)" )' |\
  column -t -s $'\t'
ID  Status
--  -------
1   online
2   offline

Sorting and Limiting

While many of the views I was happy to order by ID, which is often the default for the API, or in the case of jobs and pipelines was a proxy for “start time”, there were cases where I needed to control the sorting. For example we used the runner description to store the hostname (or host + container name) so it made sense to order by that, e.g.

jq 'sort_by(.description|ascii_downcase)'

For the runner’s jobs the job ID ordering wasn’t that useful as the IDs were allocated up front but the job might start much later if it’s a latter part of the pipeline so I chose to order by the job start time instead with descending order so the most recent jobs were listed first, e.g.

jq ‘sort_by(.started_at) | reverse’

One other final trick that proved useful occasionally when there was no limiting in the API was to do it with jq instead, e.g

jq "sort_by(.name) | [limit($max; .[])]"

 

[1] See my 2013 article “In The Toolbox – Wrapper Scripts” for more about this common technique of simplifying tools.

Weekend Maintenance as Chaos Engineering

Chris Oldwood from The OldWood Thing

I was working on a new system – a grid based calculation engine for an investment bank – and I was beginning to read about some crazy ideas by Netflix around how they would kill off actual production servers to test their resilience to failure. I really liked this idea as it had that “put your money where your mouth is” feel to it and I felt we were designing a system that should cope with this kind of failure, and if it didn’t, then we had learned something and needed to fix it.

Failure is Expected

We had already had a few minor incidents during its early operation caused by dodgy data flowing down from upstream systems and had tackled that by temporarily remediating the data to get the system working but then immediately fixed the code so that the same kind of problem would not cause an issue in future. The project manager, who had also worked on a sister legacy system to one I’d worked on before, had made it clear from the start that he didn’t want another “support nightmare” like we’d both seen before [1] and pushed the “self-healing” angle which was a joy to hear. Consequently reliability was always foremost in our minds.

Once the system went live and the business began to rely on it the idea of randomly killing off services and servers in production was a hard prospect to sell. While the project manager had fought to help us get a UAT environment that almost brought us parity with production and was okay with us using that for testing the system’s reliability he was less happy about going to whole hog and adopting the Netflix approach. (The organisation was already very reserved and despite our impeccable record some other teams had some nasty failures that caused the organisation to become more risk adverse rather than address then root problems.)

Planned Disruption is Good!

Some months after we had gone live I drew the short straw and was involved with a large-scale DR test. We were already running active/active by making use of the DR facilities during the day and rotated the database cluster nodes every weekend [2] to avoid a node getting stale, hence we had a high degree of confidence that we would cope admirably with the test. Unfortunately there was a problem with one of the bank’s main trade systems such that it wouldn’t start after failover to DR that we never really got to do a full test and show that it was a no-brainer for us.

While the day was largely wasted for me as I sat around waiting for our turn it did give me time to think a bit more about how we would show that the system was working correctly and also when the DR test was finished and failed back over again that it had recovered properly. At that point I realised we didn’t need to implement any form of Chaos Engineering ourselves as the Infrastructure team were already providing it, every weekend!

It’s common for large enterprises to only perform emergency maintenance during the week and then make much more disruptive changes at the weekend, e.g. tearing parts of the network up, patching and rebooting servers, etc. At that time it was common for support teams to shut systems down and carefully bring them back up after the maintenance window to ensure they were operating correctly when the eastern markets opened late Sunday evening [3]. This was the perfect opportunity to do the complete opposite – drive the system hard over the weekend and see what state it was after the maintenance had finished – if it wasn’t still operating normally we’d missed some failure modes.

An Aria of Canaries

We were already pushing through a simple canary request every few minutes which allowed us to spot when things had unusually gone south but we wanted something heavier that might drive out subtler problems so we started pushing through heavy loads during the weekend too and then looked at what state they were in at the end of the weekend. These loads always had a lower priority than any real work so we could happily leave them to finish in the background rather than need to kill them off before the working week started. (This is a nice example of using the existing features of the system to avoid it disrupting the normal workload.)

This proved to be a fruitful idea as it unearthed a couple of places where the system wasn’t quite as reliable as we’d thought. For example we were leaking temporary files when the network was glitching and the calculation was restarted. Also the load pushed the app servers over the edge memory-wise and highlighted a bug in the nanny process when the machine was short of memory. There was also a bug in some exponential back-off code that backed off a little too far as it never expected an outage to last most of the weekend :o).

Order From Chaos

When they finally scheduled a repeat DR test some months later after supposedly ironing out the wrinkles with their key trade capture systems our test was a doddle as it just carried on after being brought back to life in the DR environment and similarly after reverting back to PROD it just picked up where it had left off and retried those jobs that had failed when the switchover started. Rather than shying away from the weekend disruption we had used it to our advantage to help improve its reliability.

 

[1] Eventually the team spends so much time fire-fighting there is no time left to actually fix the system and it turns into an endless soul-destroying job.

[2] Rotating the database cluster primary causes the database to work with an empty cache which is a great way to discover how much your common queries rely on heavily cached data. In one instance a 45-second reporting query took over 15 minutes when faced with no cached pages!

[3] See Arbitrary Cache Timeouts for an example where constant rebooting masked a bug.