Network Saturation

Chris Oldwood from The OldWood Thing

The first indication that we had a problem was when some of the background processing jobs failed. The support team naturally looked at the log files where the jobs had failed and discovered that the cause was an inability to log in to the database during process start-up. They then tried to log in themselves using SQL Server Management Studio, or to run a simple “SELECT GetDate();” style query via SQLCMD, and discovered a similar problem.

Initial Symptoms

With the database appearing to be up the spout, they raised a priority 1 ticket with the DBA team to investigate further. Whilst this was going on, I started digging around the grid computation services we had built to see if any more light could be shed on what might be happening. This being the Windows Server 2003 era, I had to either RDP onto a remote desktop or use PSEXEC to execute remote commands against our app servers. What surprised me was that these were behaving very erratically too.

This now started to look like some kind of network issue, so a ticket was raised with the infrastructure team to find out if they knew what was going on. In the meantime the DBAs came back and said they couldn’t find anything particularly wrong with the database, although the transaction log consumption was much higher than usual at this point.

Closing In

Eventually I managed to remote onto our central logging service [1] and found that the day’s log file was massive by comparison and eating up disk space fast. Tailing the central log file I discovered page upon page of the same error about some internal calculation that had failed on the compute nodes. At this point it was clearly time to pull the emergency cord and shut the whole thing down, as no progress was being made for the business and very little in diagnosing the root of the problem.

With the tap now turned off I was able to easily jump onto a compute node and inspect its log. What I discovered there was that every Monte Carlo simulation of every trade it was trying to value was failing immediately in some set-up calculation. The “best efforts” error handling approach meant that the error was simply logged and the valuation continued with the remaining simulations: rinse and repeat.

Errors at Scale

Of course what compounded the problem was the fact that there were close to 100 compute nodes all sending any non-diagnostic log messages, i.e. all warnings and errors, across the network to one central service. This service would in turn log any error-level messages in the database’s “error log” table.

Consequently, with each compute node failing rapidly (see “Black Hole - The Fail Fast Anti-Pattern”) and flooding the network with thousands of log messages per second, the network eventually became saturated. Those processes which had long-lived network connections (we used a high-performance messaging product for IPC) would continue to receive and generate traffic, albeit slowly, but establishing new connections usually resulted in some form of timeout being hit instead.

The root cause of the compute node set-up calculation failure was later traced back to some bad data which itself had resulted from poor error handling in some earlier initial batch-level calculation.

Points of Failure

This all happened just before Michael Nygard published his excellent book Release It! Some months later when I finally read it I found myself frequently nodding my head as his tales of woe echoed my own experiences.

One of the patterns he talks about in his book is the use of bulkheads to stop failures “jumping the cracks”. On the compute nodes the poor error handling strategy meant that the same error occurred over and over needlessly instead of failing once. The use of a circuit breaker could also have mitigated the volume of errors generated and triggered some kind of cooling-off period.
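By way of illustration only (the system in question wasn’t written in Python, and the names and thresholds here are invented), a minimal circuit breaker sketch looks something like this:

import time

class CircuitBreaker:
    """Stop calling a failing operation once too many errors occur,
    then allow it again after a cooling-off period."""

    def __init__(self, max_failures=5, cooling_off_secs=60):
        self.max_failures = max_failures
        self.cooling_off_secs = cooling_off_secs
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooling_off_secs:
                raise RuntimeError("circuit open: skipping call")
            # Cooling-off period has expired, so try again from scratch.
            self.opened_at = None
            self.failures = 0
        try:
            result = operation(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

Wrapping the per-simulation set-up calculation in something like this would have meant that, after a handful of identical failures, a node stopped hammering both the calculation and the central log service until the cooling-off period expired.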

Duplicating the operational log data in the same database as the business data might have been a sane thing to do when the system was tiny and handling manual requests, but as the system became more automated and scaled out this kind of data should have been moved to a separate store where it could be used more effectively.

One of the characteristics of a system like this is that there are a lot of calculations forming a pipeline, so garbage-in, garbage-out means something might not go pop right away but sometime later when the error has compounded. In this instance an error return value of -1 was persisted as if it were normal data instead of being detected. Later stages could do sanity checks on the data to catch the poison before it’s too late. It should also have been fairly easy to run a dummy calculation on the core inputs before opening the flood gates to mitigate a catastrophic failure, at least one due to bad input data.
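As a sketch (in Python rather than the original code, and with invented names), the kind of sanity check and pre-flight dummy calculation meant here might look like:

def validate_result(name, value):
    """Reject sentinel/error values before they get persisted as real data."""
    if value is None or value == -1:
        raise ValueError(f"calculation '{name}' produced an error value: {value!r}")
    return value

def smoke_test(core_inputs, run_one_valuation):
    """Run a single dummy valuation on the core inputs before the full batch."""
    sample = core_inputs[0]
    validate_result("smoke test", run_one_valuation(sample))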

Aside from the impedance mismatch in the error handling of different components, there was also a disconnect in the error handling within the code itself, which was biased towards one-off trader and support calculations, where a user is present, rather than batch processing, where the intention is for the system to run unattended. The design of the system needs to take both needs into consideration and adjust the error handling policy as appropriate. (See “The Generation, Management and Handling of Errors” for further ideas.)
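One way of expressing that difference, sketched here in Python with invented names, is an explicit error handling policy chosen by the caller:

from enum import Enum

class ErrorPolicy(Enum):
    INTERACTIVE = 1  # a user is watching: report the error and carry on
    UNATTENDED = 2   # batch run: fail fast rather than grind on regardless

def handle_error(error, policy, log):
    log.error("calculation failed: %s", error)
    if policy is ErrorPolicy.UNATTENDED:
        # Don't plough through thousands of doomed simulations; stop the
        # batch (or trip a circuit breaker) and let the operators decide.
        raise error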

Although the system had a monitoring page, it only showed the progress of the entire batch; you needed to know the normal processing speed to realise something was up. A dashboard needs a variety of different indicators to show elevated error rates and other anomalous behaviour, ideally with automatic alerting when things start heading south. Before you can do that, though, you need the data to work from; see “Instrument Everything You Can Afford To”.
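As an example of the sort of indicator that would have surfaced this failure (a Python sketch with made-up thresholds, not what we actually had), a sliding-window error rate with an alert threshold is enough to catch the “everything is failing” case:

from collections import deque
import time

class ErrorRateMonitor:
    """Track errors over a sliding window and flag an anomalous rate."""

    def __init__(self, alert_threshold=100, window_secs=60):
        self.alert_threshold = alert_threshold
        self.window_secs = window_secs
        self.timestamps = deque()

    def record_error(self):
        now = time.monotonic()
        self.timestamps.append(now)
        # Discard anything that has fallen outside the window.
        while self.timestamps and now - self.timestamps[0] > self.window_secs:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.alert_threshold  # True => alert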

The Devil is in the (Non-Functional) Details

Following Gall’s Law to the letter, this particular system had grown over many, many years from a simple ad-hoc calculation tool to a full-blown grid-based compute engine. In the meantime some areas around stability and reliability had been addressed, but ultimately the focus was generally on adding support for more calculation types rather than operational stability. The non-functional requirements are always the hardest to get buy-in for on an internal system, but without them it can all come crashing down and end in tears when some dodgy inputs arrive.

 

[1] Yes, back then everyone built their own logging libraries and their own tools like Splunk.

The User-Agent is not Just for Browsers

Chris Oldwood from The OldWood Thing

One of the trickiest problems when you’re building a web service is knowing who your clients are. I don’t mean your customers (that’s a much harder problem); no, I literally mean you don’t know what client software is talking to you.

Although it shouldn’t really matter who your consumers are from a technical perspective, once your service starts to field requests and you’re working out what and how to monitor it, knowing this becomes far more useful.

Proactive Monitoring

For example, on the last API I worked on we were generating 404s for a regular stream of requests because the consumer had a bug in their URL formatting and erroneously appended an extra space to one of the segments. We could see this at our end but didn’t know who to tell. We had to spam our “API Consumers” Slack channel in the hope the right person would notice [1].

We also had consumers sending us the wrong kind of authorisation token, which again we could see but didn’t know which team to contact. Although having a Slack channel for the API helped, we found that people only paid attention to it when they noticed a problem. It also appeared, from our end, that devs would prefer to fumble around rather than pair with us on getting their client end working quickly and reliably.

Client Detection

Absent any other information, a cloud-hosted service pretty much only has the client IP to go on. If you’re behind a load balancer then you’re looking at the X-Forwarded-For header instead, which might give you a clue. Of course if many of your consumers are also services running in the cloud or behind the on-premise firewall they all look pretty much the same.
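For what it’s worth, here is a minimal sketch (in Python, not tied to any particular framework) of digging the original caller out of that header; X-Forwarded-For is conventionally a comma-separated list of hops with the client first, and of course it can be spoofed:

def client_ip(headers, peer_address):
    """Best guess at the caller's address: the first X-Forwarded-For entry,
    falling back to the socket peer address."""
    forwarded = headers.get("X-Forwarded-For", "")
    if forwarded:
        return forwarded.split(",")[0].strip()
    return peer_address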

Hence as part of our API documentation we strongly encouraged consumers to supply a User-Agent field with their service name, purpose, and version, e.g. MyMobileApp:Test/1.0.56. This meant that we would now have a better chance of talking to the right people when we spotted them doing something odd.
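On the consumer side the ask is tiny. A sketch using Python’s requests library (the URL here is invented) would be:

import requests

headers = {
    # <name>:<purpose>/<version>, as suggested in the API documentation
    "User-Agent": "MyMobileApp:Test/1.0.56",
}

response = requests.get("https://api.example.com/trades", headers=headers)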

From a monitoring perspective we can then use the User-Agent in various ways to slice and dice our traffic. For example, we can now successfully attribute load to various consumers. We can also filter out certain behaviours from triggering alerts when we know, for example, that it’s their contract tests passing bad data on purpose.
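Server side, a sketch of that slicing and dicing (Python again, assuming the <name>:<purpose>/<version> convention above, where a purpose of “Test” marks contract-test traffic) might be:

import re
from collections import Counter

UA_PATTERN = re.compile(r"(?P<name>[^:/]+)(?::(?P<purpose>[^/]+))?/(?P<version>\S+)")

def attribute_load(user_agents):
    """Count requests per consumer so load can be attributed."""
    load = Counter()
    for ua in user_agents:
        match = UA_PATTERN.match(ua or "")
        load[match.group("name") if match else "<unknown>"] += 1
    return load

def should_alert(user_agent):
    """Suppress alerts for known test traffic that sends bad data on purpose."""
    match = UA_PATTERN.match(user_agent or "")
    return not (match and match.group("purpose") == "Test")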

By providing us with a version number we can also see when they release a new version and help them ensure they’ve deprecated old versions. Whilst you would expect service owners to know exactly what they’ve got running where, you’d be surprised how many don’t know they have old instances lying around. It also helps identify who the laggards are that are holding up removal of your legacy features.

Causality

A somewhat related idea is the use of “trace” or “correlation” IDs, which is something I’ve covered before in “Causality - A Mechanism for Relating Distributed Diagnostic Contexts”. These are unique IDs for diagnosing problems with requests, and it’s useful to include a prefix for the originating system. However, that system may not be your actual client if there are various other services between you and them. Hence the causality ID covers the end-to-end journey, whereas the User-Agent covers the local client-server hop.
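A sketch of that idea (Python, with an invented header name) is to mint the ID at the originating system, prefix it with that system’s name, and pass it along unchanged on every downstream call:

import uuid

def new_causality_id(originating_system):
    """Mint an end-to-end ID, prefixed with the system that started the request."""
    return f"{originating_system}-{uuid.uuid4()}"

def forward_headers(incoming, originating_system="my-service"):
    """Reuse the caller's ID if they sent one, otherwise start a new chain."""
    causality_id = incoming.get("X-Causality-Id") or new_causality_id(originating_system)
    return {"X-Causality-Id": causality_id}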

You would think that the benefit of passing it was fairly clear: it allows providers to proactively help consumers fix their problems. And yet, like so many non-functional requirements, it sits lower down their backlog because it’s only optional [2]. Not only that, but by masking themselves consumers actually hamper the delivery of new features, because you’re working harder than necessary to keep the existing lights on.

 

[1] Ironically the requests were for some automated tests which they didn’t realise were failing!

[2] We wanted to make the User-Agent header mandatory on all non-production environments [3] to try and convince our consumers of the benefits but it didn’t sit well with the upper echelons.

[3] The idea being that its use in production then becomes automatic but does not exclude easy use of diagnostic tools like curl for production issues.