monitoring – ACCU World of Code

My current role is more of a DevOps role and Iâ€™m spending more time than usual monitoring and administrating various services, such as the GitLab instance we use for source control, build pipelines, issue management, etc. While the GitLab UI is very useful for certain kinds of tasks the rich RESTful API allows you to easily build your own custom tools to to monitor, analyse, and investigate the things youâ€™re particularly interested in.

For example one of the first views I wanted was an alphabetical list of all runners with their current status so that I could quickly see if any had gone AWOL during the night. The alphabetical sorting requirement is not something the standard UI view provides hence I needed to use the REST API or hope that someone had already done something similar first.

GitLab Clients

I quickly found two candidates: python-gitlab and go-gitlab-client which looked promising but they only really wrap the API â€“ Iâ€™d still need to do some heavy lifting myself and understand what the GitLab API does. Given how simple the examples were, even with curl, it felt like I wasnâ€™t really saving myself anything at this point, e.g.

curl --header "PRIVATE-TOKEN: $token" "https://gitlab.example.com/api/v4/runners"

So I decided to go with a wrapper script [1] approach instead and find a way to prettify the JSON output so that the script encapsulated a shell one-liner that would request the data and format the output in a simple table. Here is the kind of JSON the GitLab API would return for the list of runners:

[
{
   "id": 6,
   "status": "online"
   . . .
}
,
{
   "id": 8,
   "status": "offline"
   . . .
}
]

JQ â€“ The JSON Tool

Iâ€™d come across the excellent JQ tool for querying JSON payloads many years ago so that was my first thought for at least simplifying the JSON payloads to the fields I was interested in. However on further reading I found it could do some simple formatting too. At first I thought the compact output using the â€“c option was what I needed (perhaps along with some tr magic to strip the punctuation), e.g.

$ echo '[{"id":1, "status":"online"}]' |\
jq -c
[{"id":1,"status":"online"}]

but later I discovered the â€“r option provided raw output which formatted the values as simple text and removed all the JSON punctuation, e.g.

$ echo '[{"id":1, "status":"online"}]' |\
jq -r '( .[] | "\(.id) \(.status)" )'
1 online

Naturally my first thought for the column headings was to use a couple of echo statements before the curl pipeline but I also discovered that you can mix-and match string literals with the output from the incoming JSON stream, e.g.

$ echo '[{"id":1, "status":"online"}]' |\
   jq -r '"ID Status",
          "-- ------",
          ( .[] | "\(.id) \(.status)" )'
ID Status
-- ------
1 online

This way the headings were only output if the command succeeded.

Neater Tables with Column

While these crude tables were readable and simple enough for further processing with grep and awk they were still pretty unsightly when the values of a column were too varied in length such as a branch name or description field. Putting them on the right hand side kind of worked but I wondered if I could create fixed width fields ala printf via jq.

At this point I stumbled across the StackOverflow question How to format a JSON string as a table using jq? where one of the later answers mentioned a command line tool called â€œcolumnâ€ which takes rows of text values and arranges them as columns of similar width by adjusting the spacing between elements.

This almost worked except for the fact that some fields had spaces in their input and column would treat them by default as separate elements. A simple change of field separator from a space to a tab meant that I could have my cake and eat it, e.g.

$ echo '[ {"id":1, "status":"online"},
          {"id":2, "status":"offline"} ]' |\
jq -r '"ID\tStatus",
         "--\t-------",
         ( .[] | "\(.id)\t\(.status)" )' |\
column -t -s $'\t'
ID Status
-- -------
1   online
2   offline

Sorting and Limiting

While many of the views I was happy to order by ID, which is often the default for the API, or in the case of jobs and pipelines was a proxy for â€œstart timeâ€, there were cases where I needed to control the sorting. For example we used the runner description to store the hostname (or host + container name) so it made sense to order by that, e.g.

jq 'sort_by(.description|ascii_downcase)'

For the runnerâ€™s jobs the job ID ordering wasnâ€™t that useful as the IDs were allocated up front but the job might start much later if itâ€™s a latter part of the pipeline so I chose to order by the job start time instead with descending order so the most recent jobs were listed first, e.g.

jq â€˜sort_by(.started_at) | reverseâ€™

One other final trick that proved useful occasionally when there was no limiting in the API was to do it with jq instead, e.g

jq "sort_by(.name) | [limit($max; .[])]"

[1] See my 2013 article â€œIn The Toolbox â€“ Wrapper Scriptsâ€ for more about this common technique of simplifying tools.

November 18, 2019November 18, 2019

Arbitrary Cache Timeouts

Chris Oldwood from The OldWood Thing

Like many other programmers Iâ€™ve probably added my fair share of caches to systems over the years, and as we know from the old joke, one of the two hardest problems in computer science is knowing when to invalidate them. Itâ€™s a hard question, to be sure, but a really annoying behaviour you can run into as a maintainer is when the invalidation appears to be done arbitrarily, usually by specifying some timeout seemingly plucked out of thin air and maybe even changed equally arbitrarily. (It may not be, but documenting such decisions is usually way down the list of important things to do.)

Invalidation

If there is a need for a cache in production, and letâ€™s face it thatâ€™s the usual driver, then any automatic invalidation is likely to be based on doing it as infrequently as possible to ensure the highest hit ratio. The problem is that that value can often be hard-coded and mask cache invalidation bugs because it rarely kicks in. The knee-jerk reaction to â€œthings behaving weirdlyâ€ in production is to switch everything off-and-on again thereby implicitly invalidating any caches, but this doesnâ€™t help us find those bugs.

The most recent impetus for this post was just such a bug which surfaced because the cache invalidation logic never ran in practice. The cache timeout was set arbitrarily large, which seemed odd, but I eventually discovered it was supposed to be irrelevant because the service hosting it should have been rebooted at midnight every day! Due to the password for the account used to run the reboot task expiring it never happened and the invalidated items then got upset when they were requested again. Instead of simply fetching the item from the upstream source and caching it again, the cache had some remnants of the stale items and failed the request instead. Being an infrequent code path it didnâ€™t obviously ring any bells so took longer to diagnose.

Design for Testability

While itâ€™s useful to avoid throwing away data unnecessarily in production we know that the live environment rarely needs the most flexibility when it comes to configuration (see â€œTesting Drives the Need for Flexible Configurationâ€). On the contrary, Iâ€™d expect to have any cache being cycled reasonably quickly in a test environment to try and flush out any issues as Iâ€™d expect more side-effects from cache misses than hits.

If you are writing any automated tests around the caching behaviour that is often a good time to consider the other non-functional requirements, such as monitoring and support. For example, does the service or tool hosting the cache expose some means to flush it manually? While rebooting a service may do the trick it does nothing to help you track down issues around residual state and often ends up wreaking havoc with any connected clients if theyâ€™re not written with a proper distributed system mindset.

Another scenario to consider is if the cache gets poisoned; if there is no easy way to eject the bad data youâ€™re looking at the sledgehammer approach again. If your cache is HA (highly available) and backed by some persistent storage getting bad data out could be a real challenge when youâ€™re under the cosh. One system I worked on had random caches poisoned with bad data due to a threading serialization bug in an external library.

Monitoring

The monitoring side is probably equally important. If you generate no instrumentation data how do you know if your cache is even having the desired effect? One team I was on added a new cache to a service and we were bewildered to discover that it was never used. It turned out the WCF service settings were configured to create a new service instance for every request and therefore a new cache was created every time! This was despite the fact that we had unit tests for the cache and they were happily passing [1].

Itâ€™s also important to realise that a cache without an eviction policy is just another name for a memory leak. You cannot keep caching data forever unless you know there is a hard upper bound. Hence youâ€™re going to need to use the instrumentation data to help find the sweet spot that gives you the right balance between time and space.

We also shouldnâ€™t blindly assume that caches will continue to provide the same performance in future as they do now; our metrics will allow us to see any change in trends over time which might highlight a change in data thatâ€™s causing it to be less efficient. For example one cache I saw would see its efficiency plummet for a while because a large bunch of single use items got requested, cached, and then discarded as the common data got requested again. Once identified we disabled caching for those kinds of items, not so much for the performance benefit but to avoid blurring the monitoring data with unnecessary â€œglitchesâ€ [2].

[1] See â€œMan Cannot Live by Unit Testing Aloneâ€ for other tales of the perils of that mindset.

[2] This is a topic I covered more extensively in my Overload article â€œMonitoring: Turning Noise Into Signalâ€.

December 6, 2017

Network Saturation

Chris Oldwood from The OldWood Thing

The first indication that we seemed to have a problem was when some of the background processing jobs failed. The support team naturally looked at the log files where the jobs had failed and discovered that the cause was an inability to log-in to the database during process start-up. Naturally they tried to log-in themselves using SQL Server Management Studio or run a simple â€œSELECT GetDate();â€ style query via SQLCMD and discovered a similar problem.

Initial Symptoms

With the database appearing to be up the spout they raised a priority 1 ticket with the DBA team to investigate further. Whilst this was going on I started digging around the grid computation services we had built to see if any more light could be shed on what might be happening. This being the Windows Server 2003 era I had to either RDP onto a remote desktop or use PSEXEC to execute remote commands against our app servers. What surprised me was that these were behaving very erratically too.

This now started to look like some kind of network issue and so a ticket was raised with the infrastructure team to find out if they knew what was going on. In the meantime the DBAs came back and said they couldnâ€™t find anything particularly wrong with the database, although the transaction log consumption was much higher than usual at this point.

Closing In

Eventually I managed to remote onto our central logging service [1] and found that the dayâ€™s log file was massive by comparison and eating up disk space fast. TAILing the central log file I discovered page upon page of the same error about some internal calculation that had failed on the compute nodes. At this point it was clearly time to pull the emergency chord and shut the whole thing down as no progress was being made for the business and very little in diagnosing the root of the problem.

With the tap now turned off I was able to easily jump onto a compute node and inspect its log. What I discovered there was every Monte Carlo simulation of every trade it was trying to value was failing immediately in some set-up calculation. The â€œbest effortsâ€ error handling approach meant that the error was simply logged and the valuation continued for the remaining simulations â€“ rinse and repeat.

Errors at Scale

Of course what compounded the problem was the fact that there were approaching 100 compute nodes all sending any non-diagnostic log messages, i.e. all warnings and errors, across the network to one central service. This service would in turn log any error level messages in the databaseâ€™s â€œerror logâ€ table.

Consequently with each compute node failing rapidly (see â€œBlack Hole - The Fail Fast Anti-Patternâ€) and flooding the network with thousands of log messages per-second the network eventually became saturated. Those processes which had long-lived network connections (we used a high-performance messaging product for IPC) would continue to receive and generate traffic, albeit slowly, but establishing new connections usually resulted in some form of timeout being hit instead.

The root cause of the compute node set-up calculation failure was later traced back to some bad data which itself had resulted from poor error handling in some earlier initial batch-level calculation.

Points of Failure

This all happened just before Michael Nygard published his excellent book Release It! Some months later when I finally read it I found myself frequently nodding my head as his tales of woe echoed my own experiences.

One of the patterns he talks about in his book is the use of bulkheads to stop failures â€œjumping the cracksâ€. On the compute nodes the poor error handling strategy meant that the same error occurred over-and-over needlessly instead of failing once. The use of a circuit breaker could also have mitigated the volume of errors generated and triggered some kind of cooling off period.

Duplicating the operational log data in the same database as the business data might have been a sane thing to do when the system was tiny and handling manual requests, but as the system became more automated and scaled out this kind of data should have been moved elsewhere where it could be used more effectively.

One of the characteristics of a system like this is that there are a lot of calculations forming a pipeline, so garbage-in, garbage-out means something might not go pop right away but sometime later when the error has compounded. In this instance an error return value of â€“1 was persisted as if it was normal data instead of being detected. Latter stages could do sanity checks on data to avoid poisoning the whole thing before itâ€™s too late. It should also have been fairly easy to run a dummy calculation on the core inputs before opening the flood gates to mitigate a catastrophic failure, at least, for one due to bad input data.

Aside from the impedance mismatch in the error handling of different components there was also a disconnect in the error handling in the code that was biased towards one-off trader and support calculations, where the user is present, versus batch processing where the intention is for the system to run unattended. The design of the system needs to take both needs into consideration and adjust the error handling policy as appropriate. (See â€œThe Generation, Management and Handling of Errorsâ€ for further ideas.)

Although the system had a monitoring page it only showed the progress of the entire batch â€“ you needed to know the normal processing speed to realise something was up. A dashboard needs a variety of different indicators to show elevated error rates and other anomalous behaviour, ideally with automatic alerting when the things start heading south. Before you can do that though you need the data to work from, see â€œInstrument Everything You Can Afford Toâ€.

The Devil is in the (Non-Functional) Details

Following Gallâ€™s Law to the letter this particular system had grown over many, many years from a simple ad-hoc calculation tool to a full-blown grid-based compute engine. In the meantime some areas around stability and reliably had been addressed but ultimately the focus was generally on adding support for more calculation types rather than operational stability. The non-functional requirements are always the hardest to get buy-in for on an internal system but without them it can all come crashing down and end in tears with some dodgy inputs.

[1] Yes, back then everyone built their own logging libraries and tools like Splunk.

October 13, 2017October 16, 2017

The User-Agent is not Just for Browsers

Chris Oldwood from The OldWood Thing

One of the trickiest problems when youâ€™re building a web service is knowing who your clients are. I donâ€™t mean your customers, thatâ€™s a much harder problem, no, I literally mean you donâ€™t know what client software is talking to you.

Although it shouldnâ€™t really matter who your consumers are from a technical perspective, once your service starts to field requests and youâ€™re working out what and how to monitor it, knowing this becomes far more useful.

Proactive Monitoring

For example the last API I worked on we were generating 404â€™s for a regular stream of requests because the consumer had a bug in their URL formatting and erroneously appended an extra space for one of the segments. We could see this at our end but didnâ€™t know who to tell. We had to spam our â€œAPI Consumersâ€ Slack channel in the hope the right person would notice [1].

We also had consumers sending us the wrong kind of authorisation token, which again we could see but didnâ€™t know which team to contact. Although having a Slack channel for the API helped, we found that people only paid attention to it when they noticed a problem. It also appeared, from our end, that devs would prefer to fumble around rather than pair with us on getting their client end working quickly and reliably.

Client Detection

Absent any other information a cloud hosted service pretty much only has the client IP to go on. If youâ€™re behind a load balancer then youâ€™re looking at the X-Forwarded-For header instead which might give you a clue. Of course if many of your consumers are also services running in the cloud or behind the on-premise firewall they all look pretty much the same.

Hence as part of our API documentation we strongly encouraged consumers to supply a User-Agent field with their service name, purpose, and version, e.g. MyMobileApp:Test/1.0.56. This meant that we would now have a better chance of talking to the right people when we spotted them doing something odd.

From a monitoring perspective we can then use the User-Agent in various ways to slice-and-dice our traffic. For example we can now successfully attribute load to various consumers. We can also filter out certain behaviours from triggering alerts when we know, for example, that itâ€™s their contract tests passing bad data on purpose.

By providing us with a version number we can also see when they release a new version and help them ensure theyâ€™ve deprecated old versions. Whilst you would expect service owners to know exactly what theyâ€™ve got running where, youâ€™d be surprised how many donâ€™t know they have old instances lying around. It also helps identify who the laggards are that are holding up removal of your legacy features.

Causality

A somewhat related idea is the use of â€œtraceâ€ or â€œcorrelationâ€ IDs, which is something Iâ€™ve covered before in â€œCausality - A Mechanism for Relating Distributed Diagnostic Contextsâ€. These are unique IDs for diagnosing problems with requests and itâ€™s useful to include a prefix for the originating system. However that system may not be your actual client if there are various other services between you and them. Hence the causality ID covers the end-to-end where the User-Agent can cover the local client-server hop.

You would think that the benefit of passing it was fairly clear â€“ it allows providers to proactively help consumers fix their problems. And yet like so many non-functional requirements it sits lower down their backlog because itâ€™s only optional [2]. Not only that but by masking themselves it actually hampers delivery of new features because youâ€™re working harder than necessary to keep the existing lights on.

[1] Ironically the requests were for some automated tests which they didnâ€™t realise were failing!

[2] We wanted to make the User-Agent header mandatory on all non-production environments [3] to try and convince our consumers of the benefits but it didnâ€™t sit well with the upper echelons.

[3] The idea being that its use in production then becomes automatic but does not exclude easy use of diagnostic tools like CURL for production issues.

Category: monitoring

Simple Tables From JSON Data With JQ and Column

Arbitrary Cache Timeouts

Network Saturation

The User-Agent is not Just for Browsers