Chris Oldwood – Page 5 – ACCU World of Code

The error message was somewhat flummoxing, largely because it was so generic, but also because the data all came from a database extract rather than manual input:

Input string was not in a correct format.

Naturally I looked carefully at all the various decimal and date values as I knew this was the kind of message you get when parsing those kinds of values when they’re incorrectly formed, but none of them appeared to be at fault. The DateTime error message is actually slightly different [1] but I’d forgotten that at the time and so I eyeballed the dates as well as decimal values just in case.

Then I remembered that empty string values also cause this error, but lo and behold I was not missing any optional decimals or dates in my table either. Time to hit the debugger and see what was going on here [2].

The Plot Thickens

I changed the settings for the FormatException error type to break on throw, sent in my data to the service, and waited for it to trip. It didn’t take long before the debugger fired into life and I could see that the code was trying to parse a decimal value as a double but the string value was “0100/04/01”, i.e. the 1st April in the year 100. WTF!

I immediately went back to my table and checked my data again, aware that a date like this would have stood out a mile first time around, but I was happy to assume that I could have missed it. This time I used some regular expressions just to be sure my eyes were not deceiving me.

The thing was I knew what column the parser thought the value was in but I didn’t entirely trust that I hadn’t mucked up the file structure and added or removed an errant comma in the CSV input file. I didn’t appear to have done that and so the value that appeared to be causing this problem was the decimal number “100.04”, but how?

None of this made any sense and so I decided to debug the client code, right from reading in the CSV data file through to sending it across the wire to the service, to see what was happening. The service was invoked via a fairly simple WCF client assembly and as I stepped into that code I came across a method called NormaliseDate()...

The Mist Clears

What this method did was to attempt to parse the input string value as a date and if it was successful it would rewrite it in an unusual (to me) “universal” format – YYYY/MM/DD [3].

The first two parsing attempts it did were very specific, i.e. it used DateTime.ParseExact() to match the intended output format and the “sane” local time format of DD/MM/YYYY. So far, so good.

However the third and last attempt, for whatever reason, just used DateTime.Parse() in its no-frills form and that was happy to take a decimal number like “100.04” and treat it as a date in the format YYY.MM! At first I wondered if it was treating it as a serial or OLE date of some kind but I think it’s just more liberal in its choice of separators than the author of our method intended [4].
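As a rough illustration of the stricter alternative, the sketch below accepts only the two exact shapes mentioned above (YYYY/MM/DD and DD/MM/YYYY) and has no permissive fallback, so a decimal such as “100.04” is left alone. It is written in C++17 rather than the .NET of the original client library, and the names normalise_date, matches_pattern and is_valid_date are hypothetical; it is not the real NormaliseDate() method.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <optional>
#include <string>

// True if the text has the same length as the pattern, has a digit wherever
// the pattern has 'd', and matches every other pattern character exactly.
static bool matches_pattern(const std::string& text, const std::string& pattern)
{
    if (text.size() != pattern.size())
        return false;
    for (std::size_t i = 0; i != text.size(); ++i)
    {
        if (pattern[i] == 'd')
        {
            if (!std::isdigit(static_cast<unsigned char>(text[i])))
                return false;
        }
        else if (text[i] != pattern[i])
        {
            return false;
        }
    }
    return true;
}

// Calendar check (Gregorian rules) so that e.g. 31/02/2017 is rejected.
static bool is_valid_date(int year, int month, int day)
{
    if (year < 1 || month < 1 || month > 12 || day < 1)
        return false;
    static const int days_in_month[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    const bool leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
    const int last_day = days_in_month[month - 1] + ((month == 2 && leap) ? 1 : 0);
    return day <= last_day;
}

// Returns the date rewritten as YYYY/MM/DD, or nothing if the text is not
// exactly one of the two supported formats.
std::optional<std::string> normalise_date(const std::string& text)
{
    int year = 0, month = 0, day = 0;
    if (matches_pattern(text, "dddd/dd/dd"))
    {
        year = std::stoi(text.substr(0, 4));
        month = std::stoi(text.substr(5, 2));
        day = std::stoi(text.substr(8, 2));
    }
    else if (matches_pattern(text, "dd/dd/dddd"))
    {
        day = std::stoi(text.substr(0, 2));
        month = std::stoi(text.substr(3, 2));
        year = std::stoi(text.substr(6, 4));
    }
    else
    {
        return std::nullopt;  // deliberately no "try anything" third attempt
    }
    if (!is_valid_date(year, month, day))
        return std::nullopt;
    char buffer[32];
    std::snprintf(buffer, sizeof buffer, "%04d/%02d/%02d", year, month, day);
    return std::string(buffer);
}
```

With this shape a caller would treat an empty result as "not a date" and pass the original string through untouched, which is the behaviour the decimal column needed.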

Naturally there are no unit tests for this code or any type of regression test suite that shows what kind of scenarios this method was intended to support. Due to lack of knowledge around deployment and use in the wild of the client library I was forced to pad the values in the input file with trailing zeroes in the short term to work around the issue, yuck! [5]

JSON Parsers

This isn’t the first time I’ve had a run-in with a date parser. When I was working on REST APIs I always got frustrated by how permissive the JSON parser would be in attempting to coerce a string value into a date (and time). All we ever wanted was to keep it simple and only allow ISO-8601 format timestamps in UTC unless there was a genuine need to support other formats.

Every time I started writing the acceptance tests though for timestamp validation I’d find that I could never quite configure the JSON parser to reject everything but the desired format. In the earlier days of my time with ASP.Net even getting it to stop accepting local times was a struggle and even caused us a problem as we discovered a US/UK date format confusion error which the parser was hiding from us.

In the end we resorted to creating our own Iso8601DateTime type which used the .Net DateTimeOffset type under the covers but effectively allowed us to use our own custom JSON serializer methods to only support the exact format we wanted.
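The idea of accepting exactly one timestamp shape can be sketched as follows. This is not the Iso8601DateTime type described above (which was .NET and wrapped DateTimeOffset); it is a C++17 approximation under stated assumptions: the only accepted form is YYYY-MM-DDTHH:MM:SSZ, fractional seconds and leap seconds are rejected, and UtcTimestamp and parse_iso8601_utc are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

struct UtcTimestamp
{
    int year, month, day, hour, minute, second;
};

// Accepts only "YYYY-MM-DDTHH:MM:SSZ"; anything else (offsets, local times,
// other separators, out-of-range fields) yields no value.
std::optional<UtcTimestamp> parse_iso8601_utc(const std::string& text)
{
    static const std::string shape = "dddd-dd-ddTdd:dd:ddZ";  // 'd' marks a digit
    if (text.size() != shape.size())
        return std::nullopt;
    for (std::size_t i = 0; i != shape.size(); ++i)
    {
        const bool ok = (shape[i] == 'd')
            ? std::isdigit(static_cast<unsigned char>(text[i])) != 0
            : text[i] == shape[i];
        if (!ok)
            return std::nullopt;
    }

    const auto field = [&text](std::size_t pos, std::size_t len)
    {
        return std::stoi(text.substr(pos, len));
    };
    const UtcTimestamp ts{field(0, 4), field(5, 2), field(8, 2),
                          field(11, 2), field(14, 2), field(17, 2)};

    static const int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    if (ts.month < 1 || ts.month > 12)
        return std::nullopt;
    const bool leap = (ts.year % 4 == 0 && ts.year % 100 != 0) || (ts.year % 400 == 0);
    const int last_day = days[ts.month - 1] + ((ts.month == 2 && leap) ? 1 : 0);
    if (ts.day < 1 || ts.day > last_day)
        return std::nullopt;
    if (ts.hour > 23 || ts.minute > 59 || ts.second > 59)
        return std::nullopt;
    return ts;
}
```

Owning the parser in this way means the acceptance tests can list the variants that must be rejected, rather than discovering which ones a general-purpose parser quietly lets through.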

More recently JSON.Net has gotten better at letting you control the format and parsing of dates but it’s still not perfect and there are unit tests in past codebases that show variants that would unexpectedly pass, despite using the strictest settings. I wouldn’t be surprised if our Iso8601DateTime type was still in use as I can only assume everyone else is far less pedantic about the validation of datetimes and those that are have taken a similar route to ensure they control parsing.

A Dangerous Game

One should not lose sight though of the real issue here, which is the attempt to classify string values by attempting to parse them. If you limit yourself to a single locale you might get away with it, but when you try and do that across arbitrary locales you’re just asking for trouble.

[1] “String was not recognized as a valid DateTime.”

[2] This whole fiasco falls squarely in the territory I’ve covered before in my Overload article “Terse Exception Messages”. Fixing this went to the top of my backlog, especially after I discovered it was a problem for our users too.

[3] Why they didn’t just pick THE universal format of ISO-8601 is anyone’s guess.

[4] I still need to go back and read the documentation for this method because it clearly caters for scenarios I just don’t normally see in my normal locale or user base.

[5] That’s what happens with tactical solutions, no one ever quite gets around to documenting anything because they never think it’ll survive for very long...

December 15, 2017

Wit Limits

Chris Oldwood from The OldWood Thing

I’ve used the lightning talks at the last two ACCU conferences as a means of subjecting a captive audience to my dreadful array of programming / IT / geek one-liners. (My previous two ACCU stand-up routines are published on this blog as “The Daily Stand-Up” and “Stand-Up and Deliver”.) This year was no different, but I wasn’t sure if I had enough “decent” new or unused material to survive the whole 5 minutes; unluckily for the audience I had...

Hence, here are the 34 one-liners I delivered under the title “Wit Limits” [1] at this year’s ACCU conference:

“I thought it was odd when the doctor prescribed ‘programming’ to help me cope with my migraine; then I realised he said ‘codeine’.”

“These news reports of drone strikes are quite disturbing, but what I don’t understand is why we allowed delivery bots to form unions in the first place.”

“When we have chips at the seaside and I run out of ketchup I like to go round dipping them in other people’s. I call it crowd saucing.”

“The marketing department said we needed to be more disruptive, so I dropped the production database and deleted all the source code.”

“Our product doesn’t have a road map, it has a star map. Each release depends on whatever new shiny thing the developers become infatuated with next.”

“We’ve recently started using CRC cards. We now add a 32-bit checksum to each user story to stop the product owner messing with it mid-sprint.”

“Our Scrum Master is forever asking what we did yesterday, what we’re doing today, and what our impediments are. He’s a big fan of continuous interrogation.”

“I’ve always been envious of the autonomy granted to James Bond, but I guess that’s what you get when you’re M-powered.”

“Teams that refuse to do planning poker have really gone up in my estimation.”

“I’ve always felt it’s important to allow slack time in a schedule. I mean, how else are you going to keep up with all the instant messages?”

“The problem with people who are Prince certified is that they want to manage projects like it’s 1999.”

“Someone recently told me there is a new build system written entirely in F#, but I reckon it’s just Fake news.”

“I know he invented object-orientation, but was the Hexagonal Architecture also invented by Alan Key?”

“Guido seemed somewhat subdued when I asked him about how the Python enhancement process was going, so I gave him a PEP talk.”

“I recently went to see beauty and the beast; a system where the back-end was written in Python and the front-end in JavaScript.”

“I once worked at an online china shop. The CEO said we needed to move fast and break things, so I hired a bull.”

“The problem with Amazon’s Dynamo DB is that it stops working when they stop peddling it.”

“Companies that securely store my important data in offsite data centres really get my back up.”

“Vampires never use database replication as they can’t see their data in the mirror.”

“The other day a sysadmin asked me what I was using to provision hardware; he said that he was using Terraform. I replied, ‘Application Form’.”

“Whenever I provision some new hardware I like to do it in batches of a hundred. My motto is ‘infra-penny, infra-pound’.”

“Calvin Klein once offered me a modelling contract but I had to turn it down when I discovered they still used Rational Rose.”

“The other day I felt really uncomfortable after we had a massive disagreement about whether to use dashes or slashes to prefix our console app switches. I hate command line arguments.”

“I like to think of myself as a pragmatist. When the code doesn’t compile due to warnings, I just pragma them out.”

“I reckon Vim should be classified as a Class A drug on the grounds that it’s impossible to quit.”

“I’m pretty disappointed that my ZX81 based mule racing game keeps falling over. I guess I shouldn’t have called it 1K Donkey.”

“Surely to create safe self-driving cars we first have to solve the Halting Problem?”

“Never use someone that can’t write regular expressions to perform job interviews – they tend to be a bad judge of character.”

“When Robocop eats breakfast in the morning does he use his cereal port?”

“If you hit the Levis REST API twice, on endpoints they haven’t implemented, you’ll get a pair of 501’s.”

“The last time my wife and I tried to plait my daughter’s hair concurrently it ended in dreadlock.”

“Someone has been sending me tiny photos of my bank’s login page. I think I’m being subjected to a micro-fiching attack.”

“The last time I hired a rowing boat I could turn left and turn right, but not move forwards or backwards. I reckon it must have had exclusive oars.”

“I’ve always felt it’s important that my kids are well grounded so when they go to bed at night I attach a wire from their ear to the radiator.”

[1] I also used this title for an “agile” focused routine at Agile in the City: Birmingham the month before. However the less said about this performance the better...

December 6, 2017

Network Saturation

Chris Oldwood from The OldWood Thing

The first indication that we seemed to have a problem was when some of the background processing jobs failed. The support team naturally looked at the log files where the jobs had failed and discovered that the cause was an inability to log-in to the database during process start-up. Naturally they tried to log-in themselves using SQL Server Management Studio or run a simple “SELECT GetDate();” style query via SQLCMD and discovered a similar problem.

Initial Symptoms

With the database appearing to be up the spout they raised a priority 1 ticket with the DBA team to investigate further. Whilst this was going on I started digging around the grid computation services we had built to see if any more light could be shed on what might be happening. This being the Windows Server 2003 era I had to either RDP onto a remote desktop or use PSEXEC to execute remote commands against our app servers. What surprised me was that these were behaving very erratically too.

This now started to look like some kind of network issue and so a ticket was raised with the infrastructure team to find out if they knew what was going on. In the meantime the DBAs came back and said they couldn’t find anything particularly wrong with the database, although the transaction log consumption was much higher than usual at this point.

Closing In

Eventually I managed to remote onto our central logging service [1] and found that the day’s log file was massive by comparison and eating up disk space fast. TAILing the central log file I discovered page upon page of the same error about some internal calculation that had failed on the compute nodes. At this point it was clearly time to pull the emergency cord and shut the whole thing down as no progress was being made for the business and very little in diagnosing the root of the problem.

With the tap now turned off I was able to easily jump onto a compute node and inspect its log. What I discovered there was every Monte Carlo simulation of every trade it was trying to value was failing immediately in some set-up calculation. The “best efforts” error handling approach meant that the error was simply logged and the valuation continued for the remaining simulations – rinse and repeat.

Errors at Scale

Of course what compounded the problem was the fact that there were approaching 100 compute nodes all sending any non-diagnostic log messages, i.e. all warnings and errors, across the network to one central service. This service would in turn log any error level messages in the database’s “error log” table.

Consequently with each compute node failing rapidly (see “Black Hole - The Fail Fast Anti-Pattern”) and flooding the network with thousands of log messages per second the network eventually became saturated. Those processes which had long-lived network connections (we used a high-performance messaging product for IPC) would continue to receive and generate traffic, albeit slowly, but establishing new connections usually resulted in some form of timeout being hit instead.

The root cause of the compute node set-up calculation failure was later traced back to some bad data which itself had resulted from poor error handling in some earlier initial batch-level calculation.

Points of Failure

This all happened just before Michael Nygard published his excellent book Release It! Some months later when I finally read it I found myself frequently nodding my head as his tales of woe echoed my own experiences.

One of the patterns he talks about in his book is the use of bulkheads to stop failures “jumping the cracks”. On the compute nodes the poor error handling strategy meant that the same error occurred over-and-over needlessly instead of failing once. The use of a circuit breaker could also have mitigated the volume of errors generated and triggered some kind of cooling off period.
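For readers who have not met the pattern, a circuit breaker with a cooling off period might look something like the sketch below. It is a deliberately simplified, single-threaded C++17 version (no half-open state, no locking), not the design from Release It! nor anything the system in question contained; the class name, threshold and cool-off parameters are all assumptions. The time is passed in explicitly so the behaviour can be tested without sleeping.

```cpp
#include <chrono>

class CircuitBreaker
{
public:
    using Clock = std::chrono::steady_clock;

    CircuitBreaker(int failure_threshold, Clock::duration cool_off)
        : m_threshold(failure_threshold), m_cool_off(cool_off)
    {
    }

    // True if the caller may attempt the operation at time `now`.
    bool allow(Clock::time_point now)
    {
        if (!m_open)
            return true;
        if (now - m_opened_at < m_cool_off)
            return false;  // still cooling off: skip the work (and the logging)
        // Cool-off has elapsed: close the circuit and start counting afresh.
        m_open = false;
        m_failures = 0;
        return true;
    }

    // A success resets the run of consecutive failures.
    void record_success() { m_failures = 0; }

    // Enough consecutive failures trips the breaker open.
    void record_failure(Clock::time_point now)
    {
        if (++m_failures >= m_threshold)
        {
            m_open = true;
            m_opened_at = now;
        }
    }

private:
    int m_threshold;
    Clock::duration m_cool_off;
    int m_failures = 0;
    bool m_open = false;
    Clock::time_point m_opened_at{};
};
```

A compute node wrapping each simulation in allow() / record_failure() would log the first few failures and then go quiet for the cool-off period, instead of emitting one error per simulation per trade.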

Duplicating the operational log data in the same database as the business data might have been a sane thing to do when the system was tiny and handling manual requests, but as the system became more automated and scaled out this kind of data should have been moved elsewhere where it could be used more effectively.

One of the characteristics of a system like this is that there are a lot of calculations forming a pipeline, so garbage-in, garbage-out means something might not go pop right away but sometime later when the error has compounded. In this instance an error return value of -1 was persisted as if it was normal data instead of being detected. Later stages could do sanity checks on data to avoid poisoning the whole thing before it’s too late. It should also have been fairly easy to run a dummy calculation on the core inputs before opening the flood gates to mitigate a catastrophic failure, at least, for one due to bad input data.
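One way to stop a sentinel such as -1 masquerading as data is to make "no result" a distinct state and to check the inputs before fanning out. The C++17 sketch below illustrates that idea only; calculate_rate and inputs_are_sane are hypothetical stand-ins, not the calculations from the system described.

```cpp
#include <cmath>
#include <optional>
#include <vector>

// Hypothetical batch-level calculation: failure is reported as "no value"
// rather than as a magic number that could be persisted by mistake.
std::optional<double> calculate_rate(double notional, double divisor)
{
    if (divisor == 0.0 || !std::isfinite(notional) || !std::isfinite(divisor))
        return std::nullopt;
    return notional / divisor;
}

// Pre-flight check to run before opening the flood gates: every core input
// must be present and finite, otherwise the batch should not start.
bool inputs_are_sane(const std::vector<std::optional<double>>& inputs)
{
    for (const auto& input : inputs)
    {
        if (!input || !std::isfinite(*input))
            return false;
    }
    return true;
}
```

Because an empty optional cannot be written to a numeric column without an explicit decision, the failure has to be handled at the point it occurs rather than several stages downstream.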

Aside from the impedance mismatch in the error handling of different components there was also a disconnect in the error handling in the code that was biased towards one-off trader and support calculations, where the user is present, versus batch processing where the intention is for the system to run unattended. The design of the system needs to take both needs into consideration and adjust the error handling policy as appropriate. (See “The Generation, Management and Handling of Errors” for further ideas.)

Although the system had a monitoring page it only showed the progress of the entire batch – you needed to know the normal processing speed to realise something was up. A dashboard needs a variety of different indicators to show elevated error rates and other anomalous behaviour, ideally with automatic alerting when things start heading south. Before you can do that though you need the data to work from, see “Instrument Everything You Can Afford To”.

The Devil is in the (Non-Functional) Details

Following Gall’s Law to the letter this particular system had grown over many, many years from a simple ad-hoc calculation tool to a full-blown grid-based compute engine. In the meantime some areas around stability and reliability had been addressed but ultimately the focus was generally on adding support for more calculation types rather than operational stability. The non-functional requirements are always the hardest to get buy-in for on an internal system but without them it can all come crashing down and end in tears with some dodgy inputs.

[1] Yes, back then everyone built their own logging libraries and tools like Splunk.

December 2, 2017

Fallibility

Chris Oldwood from The OldWood Thing

I’ve generally been pretty fortunate with the people I’ve found myself working with. For the most part they’ve all been continuous learners and there has always been some give and take on both sides so that we’ve learned different things from each other. Many years ago on one particular contract I had the misfortune to be thrown a curveball twice, by two different teammates. This post is a reflection on both theirs and my behaviour.

The Unsolicited Review

The first incident occurred when I had only been working on the project for a few weeks. Whilst adding some new behaviour to one of the support command-line tools I spotted some C++ code similar to this:

std::vector<string*> hosts;

for (. . .)
    hosts.push_back(new string(. . .));

Having been used to using values, the RAII idiom and smart pointers for so long in C++ I was genuinely surprised by it. Naturally I flicked back through the commit log to see who wrote it and whether they could shed any light on it. This was also out of place given the rest of the code I’d seen. I discovered not only who the author was, but realised they were sitting but a few feet away and so decided to tap them up if they weren’t busy to find out a little more.
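For comparison, the value-based form of that fragment can be sketched as below. The original loop bounds and string contents were elided, so the make_hosts helper, its prefix and its count parameter are purely illustrative assumptions; the point is only that the vector owns its strings and nothing needs deleting.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: builds a list of host names by value, so the vector
// owns the strings and they are released automatically when it goes away.
std::vector<std::string> make_hosts(const std::string& prefix, std::size_t count)
{
    std::vector<std::string> hosts;
    hosts.reserve(count);
    for (std::size_t i = 1; i <= count; ++i)
        hosts.push_back(prefix + std::to_string(i));
    return hosts;  // moved (or elided) on return, not copied element by element
}
```

There is no new and therefore no matching delete to forget, which is why this kind of change tends to fall out naturally from a memory leak clean-up.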

Although I cannot be sure, I believe that I approached them in a friendly manner and enquired why this particular piece of code used raw pointers instead of one of the more usual resource management techniques [1]. What I expected was the usual kind of “Doh!” reply that we often give when we notice we’ve done something silly. What I absolutely wasn’t prepared for was the look of anger on their face followed by them barking “Are you reviewing my code? Have I asked you to do that?”

In somewhat of a daze I apologised for interrupting them and left the code as-was for the time being until I had due cause to fix it – I didn’t want to be seen to be going behind someone’s back either at this point as that might only cause even more friction.

Not long after this episode I had to work more closely with them on the build and deployment scripts. They would make code changes but then make no effort to test them, so even when I knew they were wrong I felt I should wait for the build to fail (a 2 hour process!) rather than be seen to “review” it.

Luckily the person left soon after, but I had already been given the remit to fix as many memory leaks as possible so could close out my original issue before that point.

Whose Bug?

The second incident features someone I actually referred to very briefly in a post over 5 years ago (“Can Code Be Too Simple?”), but that was for a different reason a little while after the following one.

I got pulled into a support conversation after some compute nodes appeared to be failing to load the cache file for a newly developed cache mechanism. For some reason the cache file appeared to be corrupted and so every time the compute process started, it choked on loading it. The file was copied from a UNC share on-demand and so the assumption was that this was when the corruption was happening.

What I quickly discovered was that the focus of the investigation was around the Windows API call CopyFile(). The hypothesis was that there was a bug in this function which was causing the file to become truncated.

Personally I found this hypothesis somewhat curious. I suggested to the author that the chances of there being a bug in such a core Windows API call in a version of Windows Server that was five years old was incredibly slim – not impossible of course, but highly unlikely. Their response was that “my code works” and therefore the bug must be in the Windows call. Try as I might to get them to entertain other possibilities and to investigate other avenues – that our code elsewhere might have a problem – they simply refused to accept it.

Feeling their analysis was somewhat lacklustre I took a look at the log files myself for both the compute and nanny processes and quickly discovered the source of the corruption. (The network contention copying the file was causing it to exceed the process start-up timeout and it was getting killed by the nanny during the lengthy CopyFile() call [2].)

Even when I showed them the log messages which backed up my own hypothesis they were still somewhat unconvinced until the fix went in and the problem went away.

Failure is Always an Option

Although I hadn’t heard it back then, this quote from Jeffrey Snover really sums up the attitude I’ve always tried to adopt with my team mates:

“When confronted by conflict respond with curiosity.”

Hence whenever someone has found a fault in my code or I might have done the same with theirs I do not just assume I’m right. In the first example I was 99% sure I knew how to fix the code but that wasn’t enough, I wanted to know if I was missing something I didn’t know about C++ or the codebase, or if the same was true for the author. In short I wanted to fix the root cause not just the symptoms.

In the second example there was clearly a conflict in our approaches. I’m willing to accept that any bug is almost certainly of my own making and that I’ll spend as much time as possible working on that basis until the only option left is for it to be in someone else’s code. Although I was okay to entertain their hypothesis, I also wanted to understand why they felt so sure of their own work as Windows API bugs are, in my experience, pretty rare and well documented [3].

Everyone has their off days and I’m no exception. If these had been one of those I’d not be writing about them. On the contrary these were just the beginning of some further unfortunate experiences. Both people continued to display tendencies that showed they were overconfident in their approach whilst also making it difficult for anyone else to critique their work. For (supposedly) experienced professionals I would have expected a little more personal reflection and openness.

The consequence of being such a closed book is that it is hard for others who may be able to provide valuable insights and learning to want to do so. When you work with people who are naturally reflective and inquisitive you get a buzz from helping them grow, and likewise when they teach you something new in return. With junior programmers you can allow for a certain amount of arrogance [4] and that’s a challenge worth taking on, but with much older programmers the view that “an old dog can’t learn new tricks” makes the prospect far less rewarding.

As an “old dog” myself I know that I probably have to work a little harder these days to appear open and attentive to change and I believe that process starts by accepting I’m far from infallible.

[1] In this instance simply using string values directly was more than adequate.

[2] The immediate fix of course was simply to copy to a temporary filename and then rename on completion, see “Copy & Rename (Like Copy & Swap But For File-Systems)”.

[3] The “Intriguing SCHTASKS Bug” that I found back in 2011 was certainly unusual, but a little googling turned up an answer reasonably quickly.

[4] See “The Downs and Ups of Being an ACCU Member” for my own watershed moment about how high the bar really goes.

October 20, 2017

Good Stories Assure the Architecture

Chris Oldwood from The OldWood Thing

One of the problems a team can run into when they adopt a more agile way of working is they struggle to frame their backlog in terms of user-focused stories. This is a problem I’ve written about before in “Turning Technical Tasks Into User Stories” which looked at the problem for smaller units of work. Even if the team can buy into that premise for the more run-of-the-mill features it can still be a struggle to see how that works for the big ticket items like the system’s architecture.

The Awkward Silence

What I’ve experienced is that the team can start to regress when faced with discussions around what kind of architecture to aim for. With a backlog chock full of customer pleasing functionality the architectural conversations might begin to take a bit of a back seat as the focus is on fleshing out the walking skeleton with features. Naturally the nervousness starts to set in as the engineers begin to wonder when the architecture is going to get the attention it rightly deserves. It’s all very well supporting a handful of “friendly” users but what about when you have real customers who’ve entrusted you with their data and they want to make use of it without a moment’s notice at any hour of the day?

The temptation, which should be resisted, can be to see architectural work as outside the scope of the core backlog – creating a separate backlog for stuff “the business does not understand”. This way can lead to a split in the backlog, and potentially even two separate backlogs – a functional and a non-functional one. This just makes prioritisation impossible. Also burying the work kills transparency, eventually erodes trust, and still doesn’t get you the answers you really need.

Instead, the urge should be to frame the architectural concerns in terms the stakeholder does understand, so that the business can be more informed about their actual benefits. In addition, when “The Architecture” is a journey and not a single destination there is no longer one set of benefits to aim for; there are multiple trade-offs as the architecture evolves over time, changing at each step to satisfy the ongoing needs of the customer(s) along the way. There is in essence no “final solution”, there is only “what we need for the foreseeable future”.

Tell Me a Story

So, what do I mean by “good stories”? Well, the traditional way this goes is for an analyst to solicit some non-functional requirements for some speculative eventual system behaviour. If we’re really lucky it might end up in the right ballpark at one particular point in the future. What’s missing from this scene is a proper conversation, a proper story – one with a beginning, a middle, and an end – where we are today, the short term and the longer term vision.

But not only do we need to get a feel for their aspirations we also need quantifiable metrics about how the system needs to perform. Vague statements like “fast enough” are just not helpful. A globally accessible system with an anticipated latency in the tens of milliseconds will need to break the laws of physics unless we trade-off something else. We also need to know how those exceptional events like Cyber Monday are to be factored into the operation side.

It’s not just about performance either. In many cases end users care that their data is secure, both in-flight (over the network) and at rest, although they likely have no idea what this actually means in practice. Patching servers is a technical task, but the bigger story is about how the team responds to a vulnerability which may make patching irrelevant. Similarly database backups are not the issue; it’s about service availability – you cannot be highly available if the loss of an entire data centre potentially means waiting for a database to be restored from scratch elsewhere.

Most of the traditional conversations around non-functional requirements focus entirely on the happy path; for me the conversation doesn’t really get going until you start talking about what needs to happen when the system is down. It’s never a case of “if”, but “when” it fails and therefore mitigating these problems features heavily in our architectural choices. It’s an uncomfortable conversation as we never like discussing failure but that’s what having “grown up” conversations means.

Incremental Architecture

Although I’ve used the term “story” in this post’s title, many of the issues that need discussing are really in the realm of “epics”. However we shouldn’t get bogged down in the terminology, instead the essence is to remember to focus on the outcome from the user’s perspective. Ask yourselves how fast, how secure, how available, etc. it needs to be now, and how those needs might change in response to the system’s, and the business’s growth.

With a clearer picture of the potential risks and opportunities we are better placed to design and build in small increments such that the architecture can be allowed to emerge at a sustainable rate.

October 13, 2017 (updated October 16, 2017)

The User-Agent is not Just for Browsers

Chris Oldwood from The OldWood Thing

One of the trickiest problems when you’re building a web service is knowing who your clients are. I don’t mean your customers, that’s a much harder problem, no, I literally mean you don’t know what client software is talking to you.

Although it shouldn’t really matter who your consumers are from a technical perspective, once your service starts to field requests and you’re working out what and how to monitor it, knowing this becomes far more useful.

Proactive Monitoring

For example the last API I worked on we were generating 404’s for a regular stream of requests because the consumer had a bug in their URL formatting and erroneously appended an extra space for one of the segments. We could see this at our end but didn’t know who to tell. We had to spam our “API Consumers” Slack channel in the hope the right person would notice [1].

We also had consumers sending us the wrong kind of authorisation token, which again we could see but didn’t know which team to contact. Although having a Slack channel for the API helped, we found that people only paid attention to it when they noticed a problem. It also appeared, from our end, that devs would prefer to fumble around rather than pair with us on getting their client end working quickly and reliably.

Client Detection

Absent any other information a cloud hosted service pretty much only has the client IP to go on. If you’re behind a load balancer then you’re looking at the X-Forwarded-For header instead which might give you a clue. Of course if many of your consumers are also services running in the cloud or behind the on-premise firewall they all look pretty much the same.

Hence as part of our API documentation we strongly encouraged consumers to supply a User-Agent field with their service name, purpose, and version, e.g. MyMobileApp:Test/1.0.56. This meant that we would now have a better chance of talking to the right people when we spotted them doing something odd.
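On the service side, a value in this shape can be split back into its parts for logging and monitoring. The following is only a rough sketch: the ClientIdentity type and its TryParse method are hypothetical names, and it assumes the "name:purpose/version" convention from the example above, with the purpose being optional.

```csharp
using System;

// Hypothetical value type for the "name:purpose/version" convention.
public sealed class ClientIdentity
{
    public string Name { get; private set; }
    public string Purpose { get; private set; }
    public string Version { get; private set; }

    // Returns false (and a null identity) when the header does not
    // follow the convention, so callers can log it as "unknown client".
    public static bool TryParse(string userAgent, out ClientIdentity identity)
    {
        identity = null;
        if (string.IsNullOrWhiteSpace(userAgent))
            return false;

        // Only the first product token is considered,
        // e.g. "MyMobileApp:Test/1.0.56".
        string product = userAgent.Trim().Split(' ')[0];

        int slash = product.LastIndexOf('/');
        if (slash <= 0 || slash == product.Length - 1)
            return false;

        string nameAndPurpose = product.Substring(0, slash);
        string version = product.Substring(slash + 1);

        // The purpose is optional: "MyService/2.1" is accepted too.
        int colon = nameAndPurpose.IndexOf(':');
        string name = colon < 0
            ? nameAndPurpose
            : nameAndPurpose.Substring(0, colon);
        string purpose = colon < 0
            ? ""
            : nameAndPurpose.Substring(colon + 1);

        if (name.Length == 0)
            return false;

        identity = new ClientIdentity
        {
            Name = name,
            Purpose = purpose,
            Version = version
        };
        return true;
    }
}
```

The three parts can then be attached to each request's log entry or metrics, and anything that fails to parse can be bucketed separately rather than rejected.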

From a monitoring perspective we can then use the User-Agent in various ways to slice-and-dice our traffic. For example we can now successfully attribute load to various consumers. We can also filter out certain behaviours from triggering alerts when we know, for example, that it’s their contract tests passing bad data on purpose.

By providing us with a version number we can also see when they release a new version and help them ensure they’ve deprecated old versions. Whilst you would expect service owners to know exactly what they’ve got running where, you’d be surprised how many don’t know they have old instances lying around. It also helps identify who the laggards are that are holding up removal of your legacy features.

Causality

A somewhat related idea is the use of “trace” or “correlation” IDs, which is something I’ve covered before in “Causality - A Mechanism for Relating Distributed Diagnostic Contexts”. These are unique IDs for diagnosing problems with requests and it’s useful to include a prefix for the originating system. However that system may not be your actual client if there are various other services between you and them. Hence the causality ID covers the end-to-end journey whereas the User-Agent covers the local client-server hop.
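As a rough illustration of an ID that carries a prefix for the originating system, something like the following would do. The CausalityId class, the "prefix-guid" layout and the "MOB" prefix are assumptions made for this sketch; they are not the format described in the Causality article.

```csharp
using System;

// Hypothetical helper: the "PREFIX-guid" layout is an assumption.
public static class CausalityId
{
    // Builds an ID such as "MOB-3f2a..." where "MOB" names the
    // system that originated the request.
    public static string Create(string originPrefix)
    {
        if (string.IsNullOrEmpty(originPrefix) || originPrefix.Contains("-"))
        {
            throw new ArgumentException(
                "Prefix must be non-empty and contain no '-'.",
                "originPrefix");
        }

        // The "N" format yields 32 hex digits with no dashes, so the
        // first '-' in the ID always separates prefix from unique part.
        return originPrefix + "-" + Guid.NewGuid().ToString("N");
    }

    // Recovers the originating system from an ID, or null if the ID
    // has no prefix.
    public static string OriginOf(string id)
    {
        if (id == null)
            return null;

        int dash = id.IndexOf('-');
        return dash > 0 ? id.Substring(0, dash) : null;
    }
}
```

The prefix tells you where a request started, while the User-Agent on the same request tells you which service is immediately upstream.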

You would think that the benefit of passing a User-Agent was fairly clear – it allows providers to proactively help consumers fix their problems. And yet, like so many non-functional requirements, it sits lower down their backlog because it’s only optional [2]. Not only that, but by masking themselves consumers actually hamper the delivery of new features because you’re working harder than necessary to keep the existing lights on.

[1] Ironically the requests were for some automated tests which they didn’t realise were failing!

[2] We wanted to make the User-Agent header mandatory on all non-production environments [3] to try and convince our consumers of the benefits but it didn’t sit well with the upper echelons.

[3] The idea being that its use in production then becomes automatic but does not exclude easy use of diagnostic tools like CURL for production issues.

October 12, 2017

Don’t Hide the Solution Structure

Chris Oldwood from The OldWood Thing

Whenever you join an existing team and start work on their codebase you need to orientate yourself so that you have a feel for the system’s architecture and design. If you’re lucky there is some documentation, perhaps nice diagrams to give you an overview. Hopefully you also have an extensive suite of tests to tell you how the system behaves.

More than likely there is nothing or very little to go on, and if it’s a truly legacy system any documentation could well be way out of date. At this point you pretty much only have the source code to work from. Whilst this is the source of truth, the amount of code you need to read to become au fait with all the various high-level concepts depends in part on how well it’s laid out.

Static Structure

Irrespective of whether you like to think of your layers in terms of onions or brick walls, all code essentially gets organised on disk and that means the solution structure is hierarchical in nature. In the most popular languages that support namespaces, these are also hierarchical and are commonly laid out on disk to reflect the same hierarchy [1].

Although the compiler is happy to just hoover up source code from the entire solution and largely ignore the relative position of the callers and callees there are useful conventions, which if honoured, allow you to reason and refactor the code more easily due to lower coupling. For example, defining an interface in the same source file as a class that implements it suggests a different inheritance use than when the interface sits externally further up the hierarchy. Also, seeing code higher up the hierarchy referencing types deeper down in an unrelated branch is another smell, of an abstraction potentially depending on an implementation detail.

Navigating the Structure

One of the things I’ve noticed in recent years whilst pairing is that many developers appear to navigate the source code solely through their IDE, and within the IDE by using features like “go to definition (implementation)”. Some very rarely see the solution structure because they hide it to gain more screen real estate for the source file of current interest [2].

Hence the only time the solution structure is visible is when there is a need to add a new source file. My purely anecdotal evidence suggests that this will be added without a great deal of thought as the code can be easily located in future directly by the author through its class name or another reference; they never have to consider where it “logically” resides.

Sprawling Suburbs

The net result is that namespaces and packages suffer from urban sprawl as they slowly accrete more and more code. This newer code adds more dependencies and so the package as a whole acquires an ever increasing number of dependencies. Left unchecked this can lead to horrible cyclic dependencies that are a nightmare to resolve.

I recently had the opportunity to revisit the codebase for a greenfield system I had started a few years before. We initially partitioned the code into a few key assemblies to get ourselves going and so I was somewhat surprised to still see the same assemblies a few years later, albeit massively overgrown with extra responsibilities. As a consequence even their simple home-grown tools had bizarre dependencies dragged in through bloated shared libraries [3].

Take a Stroll

So in future, instead of taking the Underground (subway) through your codebase every day, stop, and take a stroll every now-and-then around the paths. The same rules about cohesion within the methods of a class also apply at the higher levels of design – classes in a namespace, namespaces in an assembly, assemblies in a solution, etc. Then you’ll find that as the system grows it’s easier to refactor at the package level [4].

(For more on this topic see my older post “Who’s Maintaining the 100 Foot View?”.)

[1] Annoyingly this is not a common practice in the C++ codebases I’ve worked on.

[2] If I was being flippant I might suggest that if you really need the space the code may be too complicated, as I once did on Twitter here.

[3] I once dragged in a project’s shared library for a few useful extension methods to use in a simple console app and found I had pulled in an IoC container and almost a dozen other NuGet dependencies!

[4] In C# the internal access modifier has zero effect if you stick all your code into one assembly.

October 11, 2017

Every Commit Needs the Rationale to Support It

Chris Oldwood from The OldWood Thing

Each and every change to a codebase should be performed for a very specific reason – we shouldn’t just change some code because we feel like it. If you follow a checklist (mental or otherwise), such as the one I described in “Commit Checklist”, then each commit should be as cohesive as possible with any unintentional edits reverted to spare our blushes.

However, whilst the code can say what behaviour has changed, we also need to say why it was changed. The old adage “use the source Luke” is great for reminding us that the only source of truth is the code itself, but changes made without any supporting documentation make software archaeology [1] incredibly difficult in the future.

The Commit Log

Take the following one line change to the JSON serialization settings used when persisting to a database:

DateTimeZoneHandling = DateTimeZoneHandling.Utc;

This single-line edit appeared in a commit all by itself. Now, any change which has the potential to affect the storage or retrieval of the system’s data is something which should not be entered into lightly. Even if the change was done to make what is currently a default setting explicit, this fact still needs to be recorded – the rationale is important.

The first port of call for any documentation around a change is probably the commit message. Given that it lives with the code and is (usually) immutable it stands the best chance of remaining intact over time. In the example above the commit message was simply:

“Bug Fix: added date time zone handling to UTC for database json serialization”

In the same way that poor code comments have a habit of simply stating what the code does, the same malaise can affect commit messages by merely restating what was changed. Our example largely suffers from this, but it also teases us by additionally mentioning that it was done to fix a bug. Suddenly we have so many more unanswered questions about the change.

Code Change Comments

In the dim and distant past it was not unusual to use code comments to annotate changes as well as to describe the behaviour of the code. Before the advent of version control features like “blame” (aka annotate) it was non-trivial to track down the commit where any particular line of code changed. As such it seemed easier to embed the change details in the code itself rather than the VCS tool, especially if the supporting documentation lived in another system; you could just use the Change Request ID as the comment.

As you can imagine this sorta worked okay at first but as the code continued to change and refactoring became more popular these comments became as distracting and pointless as the more traditional kind. It also did nothing to help reduce the overhead of tracking the how-and-why in different places.

Feature Trackers

The situation originally used to be worse than this as new features might be tracked in one place by the business whilst bugs were tracked elsewhere by the development team. This meant that the “why” could be distributed right across time and space without the necessary links to tie them all together.

The desire to track all work in one place in an Enterprise tool like JIRA has at least reduced the number of places you need to look for “the bigger picture”, assuming you use the tool for more than just recording estimates and time spent, but of course there are lightweight alternatives [2]. Hence recording the JIRA number or Trello card number in the commit message is probably the most common approach to linking these two sides of the change.

As an aside, one of the reasons many teams haven’t historically put all their documentation in their source code repo is because it’s often been inaccessible to non-developer colleagues, either due to lack of permissions or technical ability. Fortunately tools like GitHub have started to bridge this divide.

Executable Specifications

One of the oldest problems in software development has been keeping the supporting documentation and code in sync. As features evolve it becomes harder and harder to know what the canonical reason for any change is because the current behaviour may be the sum of all previous related requirements.

An ever-growing technique for combating this has been to express the documentation, i.e. the requirements, in code too, in the form of tests. At a high level these are acceptance tests, with more technical behaviours expressed as unit or integration tests.

This brings me back to my earlier example. It’s incredibly rare that any code change would be committed without some kind of corresponding change to the automated tests. In this instance the bug must have manifested itself in the persistence layer and I’d expect at least one new test to be added (or an existing one fixed) to illustrate what the bug is. Hence the rationale for the change is to fix a bug, and the rationale can largely be described through the use of one or more well written tests rather than in prose.
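To make that concrete, here is a sketch of the kind of test that could have accompanied the one-line change. The actual bug was never recorded, so the scenario is an assumption: it supposes local times were being persisted without conversion. The class and method names are hypothetical, and plain exceptions stand in for a test framework's assertions.

```csharp
using System;
using Newtonsoft.Json;

public static class PersistenceSerializationTests
{
    // Hypothetical stand-in for the settings used by the persistence layer.
    public static JsonSerializerSettings CreateSettings()
    {
        return new JsonSerializerSettings
        {
            DateTimeZoneHandling = DateTimeZoneHandling.Utc
        };
    }

    // Assumed rationale: timestamps must be stored as UTC so that readers
    // in other time zones see the same instant.
    public static void Local_times_are_persisted_as_utc()
    {
        var local = new DateTime(2017, 10, 11, 12, 0, 0, DateTimeKind.Local);
        var settings = CreateSettings();

        string json = JsonConvert.SerializeObject(local, settings);
        DateTime roundTripped =
            JsonConvert.DeserializeObject<DateTime>(json, settings);

        // The stored text carries the UTC designator...
        if (!json.EndsWith("Z\""))
            throw new Exception("Expected a UTC timestamp but got " + json);

        // ...and reading it back yields the same instant, marked as UTC.
        if (roundTripped.Kind != DateTimeKind.Utc)
            throw new Exception("Expected DateTimeKind.Utc");
        if (roundTripped != local.ToUniversalTime())
            throw new Exception("Round trip changed the instant");
    }
}
```

A test name like this records the why alongside the what, which is exactly the information the commit message left out.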

Exceptions

There are of course no absolutes in life and fixing a spelling mistake should not require pages of notes, although spelling incorrectly on purpose probably does [3].

The point is that there is a balance to be struck if we are to trade-off the short and long term maintenance of the system. It might be tempting to rely on tribal knowledge or the product owner’s notes to avoid thinking about how the rationale is best expressed, but finding a way to encode that information in executable form, such as through tests, provides both the present reviewer and the future software archaeologist with the most usable representation.

[1] See my “Software Archaeology” article for more about spelunking a codebase’s history.

[2] I’ve written about the various tools I’ve used in the past in “Feature Tracking”.

[3] The HTTP “referer” header being a notable exception, see Wikipedia.

June 15, 2017

Refactoring – Before or After?

Chris Oldwood from The OldWood Thing

I recently worked on a codebase where I had a new feature to implement but found myself struggling to understand the existing structure. Despite pairing a considerable amount I realised that without other people to easily guide me I still got lost trying to find where I needed to make the change. I felt like I was walking through a familiar wood but the exact route eluded me without my usual guides.

I reverted the changes I had made and proposed that now might be a good point to do a little reorganisation. The proposal was met with a brief and light-hearted game of “Kent Beck Quote Tennis” – some suggested we do the refactoring before the feature whilst others preferred after. I felt there was a somewhat superficial conflict here that I hadn’t really noticed before and wondered what the drivers might be to taking one approach over the other.

Refactor After

If you’re into Test Driven Development (TDD) then you’ll have the mantra “Red, Green, Refactor” firmly lodged in your psyche. When practicing TDD you first write the test, then make it pass, and finally finish up by refactoring the code to remove duplication or otherwise simplify it. Kent Beck’s Test Driven Development: By Example is probably the de facto read for adopting this practice.

The approach here can be seen as one where the refactoring comes after you have the functionality working. From a value perspective most of it comes from having the functionality itself – the refactoring step is an investment in the codebase to allow future value to be added more easily later.

Just after adding a feature is the point where you’ve probably learned the most about the problem at hand and so ensuring the design best represents your current understanding is a worthwhile aid to future comprehension.

Refactor Before

Another saying from Kent Beck that I’m particularly fond of is “make the change easy, then make the easy change” [1]. Here he is alluding to a dose of refactoring up-front to mould the codebase into a shape that is more amenable to allowing you to add the feature you really want.

At this point we are not adding anything new but are leaning on all the existing tests, and maybe improving them too, to ensure that we make no functional changes. The value here is about reducing the risk of the new feature by showing that the codebase can safely evolve towards supporting it. More importantly, it also gives the earliest visibility to others about the new direction the code will take [2].

We know the least amount about what it will take to implement the new feature at this point but we also have a working product that we can leverage to see how it’s likely to be impacted.

Refactor Before, During & After

Taken at face value, these two pieces of advice might appear to contradict each other about when the best time to refactor is. Of course this is really a straw man argument as the best time is in fact “all the time” – we should continually keep the code in good shape [3].

That said, the act of refactoring should not occur within a vacuum; it should be driven by a need to make a more valuable change. If the code never needed to change we wouldn’t be doing it in the first place and this should be borne in mind when working on a large codebase where there might be a temptation to refactor purely for the sake of it. Seeing stories or tasks go on the backlog which solely amount to a refactoring is a smell and they should be heavily scrutinised.

Emergent Design

That said, there are no absolutes and whilst I would view any isolated refactoring task with suspicion, that is effectively what I was proposing back at the beginning of this post. One of the side-effects of emergent design is that you can get yourself into quite a state before a cohesive design finally emerges.

Whilst on paper we had a number of potential designs all vying for a place in the architecture we had gone with the simplest possible thing for as long as possible in the hope that more complex features would arrive on the backlog and we would then have the forces we needed to evaluate one design over another.

Hence the refactoring decision became one between digging ourselves into an even deeper hole first, and then refactoring heavily once we had made the functional change, or doing some up-front preparation to solidify some of the emerging concepts first. There is the potential for waste if you go too far down the up-front route but if you’ve been watching how the design and feature list have been emerging over time it’s likely you already know where you are heading when the time comes to put the design into action.

[1] I tend to elide the warning from the original quote about the first part potentially being hard when saying it out loud because the audience is usually well aware of that :o).

[2] See “The Cost of Long-Lived Feature Branches” for a cautionary tale about storing up changes.

[3] See “Relentless Refactoring” for the changes in attitude towards this practice.

June 12, 2017

Stack Overflow With Custom JsonConverter

Chris Oldwood from The OldWood Thing

[There is a Gist on GitHub that contains a minimal working example and summary of this post.]

We recently needed to change our data model so that what was originally a list of one type, became a list of objects of different types with a common base, i.e. our JSON deserialization now needed to deal with polymorphic types.

Naturally we googled the problem to see what support, if any, Newtonsoft’s JSON.Net had. Although it has some built-in support, like many built-in solutions it stores fully qualified type names, which we didn’t want in our JSON; we just wanted simple technology-agnostic type names like “cat” or “dog” that we would be happy to map manually somewhere in our code. We didn’t want to write all the deserialization logic manually, but were happy to give the library a leg-up with the mapping of types.

JsonConverter

Our searching quickly led to the following question on Stack Overflow: “Deserializing polymorphic json classes without type information using json.net”. The lack of type information mentioned in the question meant the exact .Net type (i.e. name, assembly, version, etc.), and so the answer describes how to do it where you can infer the resulting type from one or more attributes in the data itself. In our case it was a field unsurprisingly called “type” that held a simplified name as described earlier.

The crux of the solution involves creating a JsonConverter and implementing the two methods CanConvert and ReadJson. If we follow that Stack Overflow post’s top answer we end up with an implementation something like this:

public class CustomJsonConverter : JsonConverter
{
    public override bool CanConvert(Type objectType)
    {
        return typeof(BaseType).IsAssignableFrom(objectType);
    }

    public override object ReadJson(JsonReader reader,
               Type objectType, object existingValue,
               JsonSerializer serializer)
    {
        JObject item = JObject.Load(reader);

        if (item.Value<string>("type") == "Derived")
        {
            return item.ToObject<DerivedType>();
        }
        else
        . . .
    }
}

This all made perfect sense and even agreed with a couple of other blog posts on the topic we unearthed. However when we plugged it in we ended up with an infinite loop in the ReadJson method that resulted in a StackOverflowException. Doing some more googling and checking the Newtonsoft JSON.Net documentation didn’t point out our “obvious” mistake and so we resorted to the time honoured technique of fumbling around with the code to see if we could get this (seemingly promising) solution working.

A Blind Alley

One avenue that appeared to fix the problem was manually adding the JsonConverter to the list of Converters in the JsonSerializerSettings object instead of using the [JsonConverter] attribute on the base class. We went back and forth with some unit tests to prove that this was indeed the solution and even committed this fix to our codebase.

However I was never really satisfied with this outcome and so decided to write this incident up. I started to work through the simplest possible example to illustrate the behaviour but when I came to repro it I found that neither approach worked – attribute or serializer settings – I always got into an infinite loop.

Hence I questioned our original diagnosis and continued to see if there was a more satisfactory answer.

ToObject vs Populate

I went back and re-read the various hits we got with those additional keywords (recursion, infinite loop and stack overflow) to see if we’d missed something along the way. The two main candidates were “Polymorphic JSON Deserialization failing using Json.Net” and “Custom inheritance JsonConverter fails when JsonConverterAttribute is used”. Neither of these explicitly references the answer we initially found and what might be wrong with it – they give a different answer to a slightly different question.

However, these answers suggest deserializing the object in a different way. Instead of using ToObject<DerivedType>() to do all the heavy lifting, they suggest creating the uninitialized object yourself and then using Populate() to fill in the details, like this:

{
    JObject item = JObject.Load(reader);

    if (item.Value<string>("type") == "Derived")
    {
        var @object = new DerivedType();
        serializer.Populate(item.CreateReader(), @object);
        return @object;
    }
    else
    . . .
}

Plugging this approach into my minimal example worked, and for both the converter techniques too: attribute and serializer settings.
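Putting the pieces together, a complete converter using the Populate() approach might look something like the sketch below. The Animal, Cat and Dog types, the AnimalConverter name and the sample JSON are hypothetical stand-ins for BaseType and DerivedType; this is not the code from the Gist, and it uses the attribute technique only.

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// Hypothetical model: names are illustrative only.
[JsonConverter(typeof(AnimalConverter))]
public abstract class Animal
{
    public string Name { get; set; }
}

public class Cat : Animal { public bool Indoor { get; set; } }
public class Dog : Animal { public string Breed { get; set; } }

public class AnimalConverter : JsonConverter
{
    public override bool CanConvert(Type objectType)
    {
        return typeof(Animal).IsAssignableFrom(objectType);
    }

    // Only reading is customised; writing uses the default behaviour.
    public override bool CanWrite { get { return false; } }

    public override void WriteJson(JsonWriter writer, object value,
        JsonSerializer serializer)
    {
        throw new NotSupportedException();
    }

    public override object ReadJson(JsonReader reader, Type objectType,
        object existingValue, JsonSerializer serializer)
    {
        if (reader.TokenType == JsonToken.Null)
            return null;

        JObject item = JObject.Load(reader);

        // Map the simple, technology-agnostic name to a concrete type.
        Animal animal;
        switch (item.Value<string>("type"))
        {
            case "cat": animal = new Cat(); break;
            case "dog": animal = new Dog(); break;
            default:
                throw new JsonSerializationException("Unknown animal type");
        }

        // Fill in the object we created instead of calling ToObject<T>().
        serializer.Populate(item.CreateReader(), animal);
        return animal;
    }
}

public static class Demo
{
    public static List<Animal> Load()
    {
        const string json =
            "[{\"type\":\"cat\",\"name\":\"Tom\",\"indoor\":true}," +
            "{\"type\":\"dog\",\"name\":\"Rex\",\"breed\":\"collie\"}]";

        return JsonConvert.DeserializeObject<List<Animal>>(json);
    }
}
```

Keeping the name-to-type mapping in a single switch means the JSON only ever contains the simple names, at the cost of having to update the converter whenever a new derived type is added.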

Unanswered Questions

So I’ve found another technique that works, which is great, but I still lack closure around the whole affair. For example, how come the answer in the original Stack Overflow question “Deserializing polymorphic json classes” didn’t work for us? That answer has plenty of up-votes and so should be considered pretty reliable. Has there been a change to Newtonsoft’s JSON.Net library that has somehow caused this answer to now break for others? Is there a new bug that we’ve literally only just discovered (we’re using v10)? Why don’t the JSON.Net docs warn against this if it really is an issue, or are we looking in the wrong part of the docs?

As described right at the beginning I’ve published a Gist with my minimal example and added a comment to the Stack Overflow answer with that link so that anyone else on the same journey has some other pieces of the jigsaw to work with. Perhaps over time my comment will also acquire up-votes to help indicate that it’s not so cut-and-dried. Or maybe someone who knows the right answer will spot it and point out where we went wrong.

Ultimately though this is probably a case of not seeing the wood for the trees. It’s so easy when you’re trying to solve one problem to get lost in the accidental complexity and not take a step back. Answers on Stack Overflow generally carry a large degree of gravitas, but they should not be assumed to be infallible. All documentation can go out of date even if there are (seemingly) many eyes watching over it.

When your mind-set is one that always assumes the bugs are of your own making, unless the evidence is overwhelming, then those times when you might actually not be entirely at fault seem to feel all the more embarrassing when you realise the answer was probably there all along but you discounted it too early because your train of thought was elsewhere.

Author: Chris Oldwood
