Quality control in a zero cost of replication business

Derek Jones from The Shape of Code

When a new manufacturing material becomes available, its use is often integrated with existing techniques, e.g., using scientific management techniques for software production.

Customers want reliable products, and companies that sell unreliable products don’t make money (and may even lose lots of money).

Quality assurance of manufactured products is a huge subject, and lots of techniques have been developed.

Needless to say, quality assurance techniques applied to the production of hardware are often touted (and sometimes applied) as the solution for improving the quality of software products (whatever quality is currently being defined as).

There is a fundamental difference between the production of hardware and software:

  • Hardware is designed, a prototype made and this prototype refined until it is ready to go into production. Hardware production involves duplicating an existing product. The purpose of quality control for hardware production is ensuring that the created copies are close enough to identical to the original that they can be profitably sold. Industrial design has to take into account the practicalities of mass production, e.g., can this device be made at a low enough cost.
  • Software involves the same design, prototype, refinement steps, in some form or another. However, the final product can be perfectly replicated at almost zero cost, e.g., downloadable file(s), burn a DVD, etc.

Software production is a once-off process, and applying techniques designed to ensure the consistency of a repetitive process doesn’t sound like a good idea. Software production is not at all like mass production (the build process comes closest to this form of production).

Sometimes people claim that software development does involve repetition, in that a tiny percentage of the possible source code constructs are used most of the time. The same is also true of human communications, in that a few words are used most of the time. Does the frequent use of a small number of words make speaking/writing a repetitive process in the way that manufacturing identical widgets is repetitive?

The virtually zero cost of replication (and distribution, via the internet, for many companies) does more than remove a major phase of the traditional manufacturing process. Zero cost of replication has a huge impact on the economics of quality control (assuming high quality is considered to be equivalent to high reliability, as measured by number of faults experienced by customers). In many markets it is commercially viable to ship software products that are believed to contain many mistakes, because the cost of fixing them is so very low; unlike the cost of hardware, which is non-trivial and involves shipping costs (if only for a replacement).

Zero defects is not an economically viable mantra for many software companies. When companies employ people to build the same set of items, day in day out, there is economic sense in having them meet together (e.g., quality circles) to discuss saving the company money, by reducing production defects.

Many software products have a short lifespan, source code has a brief and lonely existence, and many development projects are never shipped to paying customers.

In software development companies it makes economic sense for quality circles to discuss the minimum number of known problems they need to fix, before shipping a product.

Agile: Prix fixe or a la carte?

Allan Kelly from Allan Kelly Associates


Advert: September micro-workshops – spaces limited

User Stories Masterclass, Agile Estimation & Forecasting, Maximising value delivered

Early bird discounts & free tickets for unemployed/furloughed

Book with code Blog15 for 15% discount or get more details


“They don’t do Scrum so much as ScrumBut”

“We don’t do Scrum by the book, we changed it”

“We follow SAFe, except we’ve tailored it”

“We do a mix of agile methods”

“They call it agile but it isn’t really”

You’ve heard all these comments, right? But have you noticed the tone of voice? The context in which they are said?

In my experience people say these things in a guilty way; what they mean to say is:

“They don’t do Scrum so much as Scrum but we don’t do it the way we should”

“We don’t do Scrum by the book, we changed it, we dropped the Scrum Master, we flex our sprints, …”

“We follow SAFe, except we’ve tailored it by dropping the agile coaches, the technical aspects and …”

“We do a mix of agile methods, we don’t do anything properly and it’s half-baked”

“They call it agile but I don’t think they really understand what agile is”

Practitioners aren’t helped by advisors – coaches, trainers, consultants, what-not – who go around criticising teams for not following “Brand X Method” properly. But forget about them.

I want to rid you of your guilt. Nobody should feel guilty for not doing Scrum by the book, or SAFe the right way, or perfect Kanban.

Nobody, absolutely no person or organization I have ever met or heard of, does any method by the book.

After all “agile is a journey” and you might just be at a different point on the journey right now. To me agile is learning and there is more learning to be done – should we criticise people because they haven’t learned something?

All these methods offer a prix fixe menu: you pay a fixed price and you get a set menu.

In reality all agile methods should be seen as an à la carte menu: pick what you like, mix and match.

In fact, don’t just pick from the Scrum menu or the SAFe menu, pick across the menus: Scrum, XP, Kanban, SAFe, LeSS, DaD, whatever!

And do not feel guilty about it.

Do it.

My agile method, Xanpan, explicitly says: mix and match. Xanpan lays out a model but it also says: change things, find what works for you, steal from others.

The only thing you can get wrong in agile is doing things the same as you did 3 months ago. Keep experimenting, keep trying new ways, new ideas. If you improve then great, if not, roll back and try something else.

In other words, keep learning.

Picture: Thanks to Andersen Jensen for the above photo on Unsplash.


Subscribe to my blog newsletter and download Project Myopia for Free

The post Agile: Prix fixe or a la carte? appeared first on Allan Kelly Associates.

Agile & OKRs – the end is in sight

Allan Kelly from Allan Kelly Associates

The end is in sight for my new book “Little Book of Agile and OKRs”. There are only a few more (short) chapters I want to write and I have put the wheels in motion to get a professional cover.

After that there is a lot of editing – including a professional copy edit – and perhaps a change of title, plus an audio version.

Anyone buying “Little Book of Agile and OKRs” will receive updates for free as they are published on LeanPub.

And if you are prepared to trade a little of your time I’ll give you Little Book of Agile & OKRs for free. I’m looking for reviewers: right now I’d like feedback on my content, and in a few months I’ll be looking for reviews on Amazon.


Subscribe to my blog newsletter and download Project Myopia for Free

The post Agile & OKRs – the end is in sight appeared first on Allan Kelly Associates.

How to send an SMS using netcat (via SMPP)

Andy Balaam from Andy Balaam's Blog

SMPP is a binary protocol used by phone companies to send text messages, otherwise known as SMS messages.

It can work over TCP, so we can use netcat on the command line to send messages.

Setting up

[Note: the netcat I am using is Ncat 7.70 on Linux.]

The server that receives messages is called an SMSC. You may have your own one, but if not, you can use the CloudHopper one like this:

sudo apt install make maven  # (or similar on non-Debian-derived distros)
git clone https://github.com/fizzed/cloudhopper-smpp.git
cd cloudhopper-smpp

If you are a little slow, like me, I’d suggest making it wait a bit longer for bind requests before giving up on you. To do that, edit the main() method of src/test/java/com/cloudhopper/smpp/demo/ServerMain.java to add a line like this: configuration.setBindTimeout(500000); on about line 80, near the other similar lines. This will make it wait 500 seconds for you to send a BIND_TRANSCEIVER, instead of giving up after just 5 seconds.

Once you’ve made that change, you can run:

make server

Now you have an SMSC running!

Leave that open, and go into another terminal, and type:

mkfifo tmpfifo
nc 0.0.0.0 2776 < tmpfifo | xxd

The mkfifo part creates a “fifo” – a named pipe through which we will send our SMPP commands.

The nc part starts Ncat, connecting to the SMSC we started.

The xxd part will take any binary data coming out of Ncat and display it in a more human-readable way.

Leave that open too, and in a third terminal type:

exec 3> tmpfifo

This makes everything we send to file descriptor 3 go into the fifo, and therefore into Ncat.

Now we have a way of sending binary data to Ncat, which will send it on to the SMSC and print out any responses.

Note: we will be using SMPP version 3.4 since it is in the widest use, even though it is not the newest.

Terminology

“SMPP” is the protocol we are speaking, which we are using over TCP/IP.

An SMSC is a server (which receives messages intended for phones and sends back responses and receipts).

We will be acting as an ESME or client (which sends messages intended for phones and receives responses and receipts).

The units of information that are passed back and forth in SMPP are called “PDUs” (Protocol Data Units) – these are just bits of binary data that flow over the TCP connection between two computers.

The spec talks about “octets” – this means 8-bit bytes.

ENQUIRE_LINK

First, we’ll check the SMSC is responding, by sending an ENQUIRE_LINK, which is used to ask the SMSC whether it’s there and working.

Go back to the third terminal (where we ran exec) and type this:

LEN16='\x00\x00\x00\x10'
ENQUIRE_LINK='\x00\x00\x00\x15'
NULL='\x00\x00\x00\x00'
SEQ1='\x00\x00\x00\x01'

echo -n -e "${LEN16}${ENQUIRE_LINK}${NULL}${SEQ1}" >&3

Explanation: an ENQUIRE_LINK PDU consists of:

  • 4 bytes to say the length of the whole PDU in bytes. ENQUIRE_LINK PDUs are always 16 bytes, “00000010” in hex. I called this LEN16.
  • 4 bytes to say what type of PDU this is. ENQUIRE_LINK is “00000015” in hex. I called this ENQUIRE_LINK.
  • 4 bytes that are always zero in ENQUIRE_LINK. I called this NULL.
  • 4 bytes that identify this request, called a sequence number. The response from the server will include this so we can match responses to requests. I called this SEQ1.

Check back in the second terminal (where you ran nc). If everything worked, you should see something like this:

00000000: 0000 0010 8000 0015 0000 0000 0000 0001  ................

Ignoring the first and last parts (which are how xxd formats its output), the response we receive is four 4-byte parts, very similar to what we sent:

  • 4 bytes to say the length of the whole PDU in bytes. Here it is “00000010” hex, or 16 decimal.
  • 4 bytes to say what type of PDU this is. Here it is “80000015” in hex, which is the code for ENQUIRE_LINK_RESP.
  • 4 bytes for the success status of the ENQUIRE_LINK_RESP. This is always “00000000”, which means success and is called ESME_ROK in the spec.
  • 4 bytes that match the sequence number we sent. This is “00000001”, as we expected.

BIND_TRANSCEIVER

Now we can see that the SMSC is working, let’s “bind” to it. That means something like logging in: we convince the SMSC that we are a legitimate client, and tell it what type of connection we want, and, assuming it agrees, it will hold the connection open for us for as long as we need.

We are going to bind as a transceiver, which means both a transmitter and receiver, so we can both send messages and receive responses.

Send the bind request like this:

LEN32='\x00\x00\x00\x20'
BIND_TRANSCEIVER='\x00\x00\x00\x09'
NULL='\x00\x00\x00\x00'
SEQ2='\x00\x00\x00\x02'
SYSTEM_ID="sys\x00"
PASSWORD="pas\x00"
SYSTEM_TYPE='typ\x00'
SYSTEM_ID='sys\x00'
PASSWORD='pas\x00'
INTERFACE_VERSION='\x01'
INTERFACE_VERSION='\x34'
ADDR_TON='\x00'
ADDR_NPI='\x00'
ADDRESS_RANGE='\x00'

echo -n -e "${LEN32}${BIND_TRANSCEIVER}${NULL}${SEQ2}${SYSTEM_ID}${PASSWORD}${SYSTEM_TYPE}${INTERFACE_VERSION}${ADDR_TON}${ADDR_NPI}${ADDRESS_RANGE}" >&3

Explanation: this PDU is 32 bytes long, so the first thing we send is “00000020” hex, which is 32.

Then we send “00000009” for the type: BIND_TRANSCEIVER, 4 bytes of zeros, and a sequence number – this time I used 2.

That was the header. Now the body of the PDU starts with a system id (basically a username), a password, and a system type (extra info about who you are). These are all variable-length null-terminated strings, so I ended each one with \x00.

The rest of the body is some options about the types of phone number we will be sending from and sending to – I made them all “00” hex, which means “we don’t know”.

If it worked, you should see this in the nc output:

00000000: 0000 0021 8000 0009 0000 0000 0000 0002  ...!............
00000010: 636c 6f75 6468 6f70 7065 7200 0210 0001  cloudhopper.....

As before, the first 4 bytes are for how long the PDU is – 33 bytes – and the next 4 bytes are for what type of PDU this is – “80000009” is for BIND_TRANSCEIVER_RESP which is the response to a BIND_TRANSCEIVER.

The next 4 bytes are for the status – these are zeroes which indicates success (ESME_ROK) again. After that is our sequence number (2).

The next 12 bytes are the characters of the word “cloudhopper” followed by a zero – this is the system id of the SMSC.

The remaining bytes are a “TLV”, an optional part of the response. We can see “0210 0001” – a tag of “0210” (sc_interface_version) and a length of “0001” – but xxd actually delayed the last byte of the output, so we can’t see the value yet: it is “34”. Together they mean “the interface version we support is SMPP 3.4”.

SUBMIT_SM

The reason why we’re here is to send a message. To do that, we use a SUBMIT_SM:

LEN61='\x00\x00\x00\x3d'
SUBMIT_SM='\x00\x00\x00\x04'
SEQ3='\x00\x00\x00\x03'
SERVICE_TYPE='\x00'
SOURCE_ADDR_TON='\x00'
SOURCE_ADDR_NPI='\x00'
SOURCE_ADDR='447000123123\x00'
DEST_ADDR_TON='\x00'
DEST_ADDR_NPI='\x00'
DESTINATION_ADDR='447111222222\x00'
ESM_CLASS='\x00'
PROTOCOL_ID='\x01'
PRIORITY_FLAG='\x01'
SCHEDULE_DELIVERY_TIME='\x00'
VALIDITY_PERIOD='\x00'
REGISTERED_DELIVERY='\x01'
REPLACE_IF_PRESENT_FLAG='\x00'
DATA_CODING='\x03'
SM_DEFAULT_MSG_ID='\x00'
SM_LENGTH='\x04'
SHORT_MESSAGE='hihi'
echo -n -e "${LEN61}${SUBMIT_SM}${NULL}${SEQ3}${SERVICE_TYPE}${SOURCE_ADDR_TON}${SOURCE_ADDR_NPI}${SOURCE_ADDR}${DEST_ADDR_TON}${DEST_ADDR_NPI}${DESTINATION_ADDR}${ESM_CLASS}${PROTOCOL_ID}${PRIORITY_FLAG}${SCHEDULE_DELIVERY_TIME}${VALIDITY_PERIOD}${REGISTERED_DELIVERY}${REPLACE_IF_PRESENT_FLAG}${DATA_CODING}${SM_DEFAULT_MSG_ID}${SM_LENGTH}${SHORT_MESSAGE}" >&3

LEN61 is the length in bytes of the PDU, SUBMIT_SM is the type of PDU, and SEQ3 is a sequence number, as before.

SOURCE_ADDR is a null-terminated (i.e. it ends with a zero byte) string of ASCII characters saying who the message is from. This can be a phone number, or a name (but the rules about what names are allowed are complicated and region-specific). SOURCE_ADDR_TON and SOURCE_ADDR_NPI give information about what type of address we are providing – we set them to zero to mean “we don’t know”.

DESTINATION_ADDR, DEST_ADDR_TON and DEST_ADDR_NPI describe the phone number we are sending to.

ESM_CLASS tells the SMSC how to treat your message – we use “store and forward” mode, which means keep it and send it when you can.

PROTOCOL_ID tells it what to do if it finds duplicate messages – we use “Replace Short Message Type 1”, which I guess means use the latest version you received.

PRIORITY_FLAG means how important the message is – we used “interactive”.

SCHEDULE_DELIVERY_TIME is when to send – we say “immediate”.

VALIDITY_PERIOD means how long should this message live before we give up trying to send it (e.g. if the user’s phone is off). We use “default” so the SMSC will do something sensible.

REGISTERED_DELIVERY gives information about whether we want a receipt saying the message arrived on the phone. We say “yes please”.

REPLACE_IF_PRESENT_FLAG is also about duplicate messages (I’m not sure how it interacts with PROTOCOL_ID) – the value we used means “don’t replace”.

DATA_CODING states what character encoding you are using to send the message text – we used “Latin 1”, which means ISO-8859-1.

SM_DEFAULT_MSG_ID allows us to use one of a handful of hard-coded standard messages – we say “no, use a custom one”.

SM_LENGTH is the length in bytes of the “short message” – the actual text that the user will see on the phone screen.

SHORT_MESSAGE is the short message itself – our message is all ASCII characters, but we could use any bytes and they will be interpreted as characters in ISO-8859-1 encoding.

You should see a response in the other terminal like this:

00000020: 3400 0000 1180 0000 0400 0000 0000 0000  4...............

The initial “34” is the left-over byte from the previous message as mentioned above. After that, we have:

“00000011” for the length of this PDU (17 bytes).

“80000004” for the type – SUBMIT_SM_RESP which tells us whether the message was accepted (but not whether it was received).

“00000000” for the status – zero means “OK”.

The last two bytes are chopped off again, but what we actually get back is:

“00000003”, which is the sequence number, and then:

“00” which is a null-terminated ASCII message ID: in this case the SMSC is saying that the ID it has given this message is “”, which is probably not very helpful! If this ID were not empty, it would help us later if we receive a delivery receipt, or if we want to ask about the message, or change or cancel it.

DELIVER_SM

If you stop the SMSC process (the one we started with make server) by pressing Ctrl-C, start a different one with make server-echo, and then repeat the other commands (note that you need to be quick, because you only get 5 seconds to bind before it gives up on you – make similar changes to ServerEchoMain to those we made in ServerMain if this causes problems), you will receive a delivery receipt from the server, which looks like this:

“0000003d” for the length of this PDU (61 bytes).

“00000005” for the type (DELIVER_SM).

“00000000” for the unused command status.

“00000001” as a sequence number. Note, this is unrelated to the sequence number of the original message: to match it with the original message, we must use the message ID provided in the SUBMIT_SM_RESP.

“0000003400” to mean we are using SMPP 3.4. (This is a null-terminated string of bytes.)

“00” and “00” for the TON and NPI of the source address, followed by the source address itself, which is a null-terminated ASCII string: “34343731313132323232323200”. This translates to “447111222222”, which was the destination address of our original message. Note: some SMSCs switch the source and destination addresses like this in their delivery receipts, and some don’t, which makes life interesting.

“00” and “00” for the TON and NPI of the destination address, followed by “34343730303031323331323300” for the address itself, which translates to “447000123123”, as expected.

The DELIVER_SM PDU continues with much of the information repeated from the original message, and the SMSC is allowed to provide a short message as part of the receipt – in our example the cloudhopper SMSC repeats the original message. Some SMSCs use the short message to provide information such as the message ID and the delivery time, but there is no formal standard for how to provide it. Other SMSCs use a TLV to provide the message ID instead.

In order to complete the conversation, you should provide a DELIVER_SM_RESP, and then an UNBIND, but hopefully based on what we’ve done and the SMPP 3.4 standard, you should be able to figure it out.
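If you want a head start, here is a rough sketch of what those two PDUs could look like. This is my own addition based on the SMPP 3.4 spec rather than part of the walkthrough above, so double-check it against the spec: the command IDs are “80000005” for DELIVER_SM_RESP and “00000006” for UNBIND, and the DELIVER_SM_RESP body is just an empty, null-terminated message ID.

LEN17='\x00\x00\x00\x11'
DELIVER_SM_RESP='\x80\x00\x00\x05'
NULL='\x00\x00\x00\x00'
DELIVER_SEQ='\x00\x00\x00\x01'  # must echo the sequence number of the DELIVER_SM we received
MESSAGE_ID='\x00'               # message_id is unused in this response, so just a null terminator

echo -n -e "${LEN17}${DELIVER_SM_RESP}${NULL}${DELIVER_SEQ}${MESSAGE_ID}" >&3

LEN16='\x00\x00\x00\x10'
UNBIND='\x00\x00\x00\x06'
SEQ4='\x00\x00\x00\x04'         # our next sequence number

echo -n -e "${LEN16}${UNBIND}${NULL}${SEQ4}" >&3

If all is well the SMSC should answer the UNBIND with an UNBIND_RESP (type “80000006”, status “00000000”, and your sequence number), after which you can close the connection (exec 3>&- closes the fifo).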

You did it

SMPP is a binary protocol layered directly on top of TCP, which makes it slightly harder to work with by hand than the HTTP protocols with which many of us are more familiar, but I hope I’ve convinced you it’s possible to understand what’s going on without resorting to some kind of heavyweight debugging tool or library.

Happy texting!

Extreme value theory in software engineering

Derek Jones from The Shape of Code

As its name suggests, extreme value theory deals with extreme deviations from the average, e.g., how often will rainfall be heavy enough to cause a river to overflow its banks.

The initial list of statistical topics I thought ought to be covered in my evidence-based software engineering book included extreme value theory. At the time, and even today, there were/are no books covering “Statistics for software engineering”, so I had no prior work to guide my selection of topics. I was keen to cover all the important topics, had heard of extreme value theory in several (non-software) contexts, and jumped to the conclusion that it must be applicable to software engineering.

Years pass: the draft accumulates a wide variety of analysis techniques applied to software engineering data, but no use of extreme value theory.

Something else does not happen: I don’t find any ‘Using extreme value theory to analyse data’ books. Yes, there are some really heavy-duty maths books available, but nothing of a practical persuasion.

The book’s Extreme value section becomes a subsection, then a subsubsection, and ends up inside a comment (I cannot bring myself to delete it).

It appears that extreme value theory is more talked about than used. I can understand why. Extreme events are newsworthy; rivers that don’t overflow their banks are not news.

Just over a month ago a discussion cropped up on the UK’s C++ standards’ panel mailing list: was email traffic down because of COVID-19? The panel’s convenor, Roger Orr, posted some data on monthly volumes. Oh, data :-)

Monthly data is a bit too granular for detailed analysis over relatively short periods. After some poking around, Roger was able to send me the date&time of every post to the WG21’s Core and Lib reflectors since February 2016 (there have been various changes of hosts and configurations over the years, and the dates of posts since 2016 were straightforward to obtain).

During our email exchanges, Roger had mentioned that every now and again a huge discussion thread comes out of nowhere. Woah, sounds like WG21 could do with some extreme value theory. How often are huge discussion threads likely to occur, and how huge is a once in 10-years thread that they might have to deal with?

There are two techniques for analysing the distribution of extreme values present in a sample (both based around the generalized extreme value distribution):

  • Generalized Extreme Value (GEV) uses block maxima, e.g., maximum number of daily emails sent in each month,
  • Generalized Pareto (GP) uses peak over threshold: pick a threshold and extract day values for when more than this threshold number of emails was sent.
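For reference (this is the standard definition from the statistics literature, not something from the original post), the generalized extreme value distribution that both techniques build on has the cumulative distribution function:

G(x) = \exp\left\{-\left[1 + \xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi}\right\}, \qquad 1 + \xi\,\frac{x-\mu}{\sigma} > 0,

where \mu is the location, \sigma > 0 the scale, and \xi the shape parameter; the limit \xi \to 0 gives the Gumbel form \exp\{-\exp[-(x-\mu)/\sigma]\}.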

The plots below show the maximum number of monthly emails that are expected to occur (y-axis) within a given number of months (x-axis), for WG21’s Core and Lib email lists. The circles are actual occurrences, and the dashed lines are 95% confidence intervals; GEV was used for these fits (code+data):

Expected maximum for emails appearing on C++'s core and lib reflectors within a given period

The 10-year return value for Core is around a daily maximum of 70 ±30, and closer to 200 ±100 for Lib.

The model used is very simplistic, and fails to take into account the growth in members joining these lists and traffic lost when a new mailing list is created for a new committee subgroup.

If any readers have suggestions for uses of extreme value theory in software engineering, please let me know.

Postlude. This discussion has reordered events. My original interest in the mailing list data was the desire to find some evidence for the hypothesis that the volume of email increased as the date of the next WG21 meeting approached. For both Core and Lib, the volume actually decreases slightly as the date of the next meeting approaches; see code for details. Also, the volume of email at the weekend is around 60% lower than during weekdays.

Simple Tables From JSON Data With JQ and Column

Chris Oldwood from The OldWood Thing

My current role is more of a DevOps role and I’m spending more time than usual monitoring and administrating various services, such as the GitLab instance we use for source control, build pipelines, issue management, etc. While the GitLab UI is very useful for certain kinds of tasks, the rich RESTful API allows you to easily build your own custom tools to monitor, analyse, and investigate the things you’re particularly interested in.

For example one of the first views I wanted was an alphabetical list of all runners with their current status so that I could quickly see if any had gone AWOL during the night. The alphabetical sorting requirement is not something the standard UI view provides hence I needed to use the REST API or hope that someone had already done something similar first.

GitLab Clients

I quickly found two candidates: python-gitlab and go-gitlab-client, which looked promising, but they only really wrap the API – I’d still need to do some heavy lifting myself and understand what the GitLab API does. Given how simple the examples were, even with curl, it felt like I wasn’t really saving myself anything at this point, e.g.

curl --header "PRIVATE-TOKEN: $token" "https://gitlab.example.com/api/v4/runners"

So I decided to go with a wrapper script [1] approach instead and find a way to prettify the JSON output so that the script encapsulated a shell one-liner that would request the data and format the output in a simple table. Here is the kind of JSON the GitLab API would return for the list of runners:

[
  {
   "id": 6,
   "status": "online"
   . . .
  }
,
  {
   "id": 8,
   "status": "offline"
   . . .
  }
]

JQ – The JSON Tool

I’d come across the excellent JQ tool for querying JSON payloads many years ago so that was my first thought for at least simplifying the JSON payloads to the fields I was interested in. However on further reading I found it could do some simple formatting too. At first I thought the compact output using the -c option was what I needed (perhaps along with some tr magic to strip the punctuation), e.g.

$ echo '[{"id":1, "status":"online"}]' |\
  jq -c
[{"id":1,"status":"online"}]

but later I discovered the -r option provided raw output which formatted the values as simple text and removed all the JSON punctuation, e.g.

$ echo '[{"id":1, "status":"online"}]' |\
  jq -r '( .[] | "\(.id) \(.status)" )'
1 online

Naturally my first thought for the column headings was to use a couple of echo statements before the curl pipeline but I also discovered that you can mix and match string literals with the output from the incoming JSON stream, e.g.

$ echo '[{"id":1, "status":"online"}]' |\
   jq -r '"ID Status",
          "-- ------",
          ( .[] | "\(.id) \(.status)" )'
ID Status
-- ------
1 online

This way the headings were only output if the command succeeded.

Neater Tables with Column

While these crude tables were readable and simple enough for further processing with grep and awk they were still pretty unsightly when the values of a column were too varied in length, such as a branch name or description field. Putting them on the right hand side kind of worked, but I wondered if I could create fixed-width fields a la printf via jq.

At this point I stumbled across the StackOverflow question How to format a JSON string as a table using jq? where one of the later answers mentioned a command line tool called “column” which takes rows of text values and arranges them as columns of similar width by adjusting the spacing between elements.

This almost worked except for the fact that some fields had spaces in their input and column would treat them by default as separate elements. A simple change of field separator from a space to a tab meant that I could have my cake and eat it, e.g.

$ echo '[ {"id":1, "status":"online"},
          {"id":2, "status":"offline"} ]' |\
  jq -r '"ID\tStatus",
         "--\t-------",
         ( .[] | "\(.id)\t\(.status)" )' |\
  column -t -s $'\t'
ID  Status
--  -------
1   online
2   offline

Sorting and Limiting

While for many of the views I was happy to order by ID, which is often the default for the API, or in the case of jobs and pipelines was a proxy for “start time”, there were cases where I needed to control the sorting. For example we used the runner description to store the hostname (or host + container name) so it made sense to order by that, e.g.

jq 'sort_by(.description|ascii_downcase)'

For the runner’s jobs the job ID ordering wasn’t that useful, as the IDs were allocated up front but the job might start much later if it’s a later part of the pipeline, so I chose to order by the job start time instead, in descending order so the most recent jobs were listed first, e.g.

jq 'sort_by(.started_at) | reverse'

One other final trick that proved useful occasionally when there was no limiting in the API was to do it with jq instead, e.g.

jq "sort_by(.name) | [limit($max; .[])]"

 

[1] See my 2013 article “In The Toolbox – Wrapper Scripts” for more about this common technique of simplifying tools.

Weekend Maintenance as Chaos Engineering

Chris Oldwood from The OldWood Thing

I was working on a new system – a grid based calculation engine for an investment bank – and I was beginning to read about some crazy ideas by Netflix around how they would kill off actual production servers to test their resilience to failure. I really liked this idea as it had that “put your money where your mouth is” feel to it and I felt we were designing a system that should cope with this kind of failure, and if it didn’t, then we had learned something and needed to fix it.

Failure is Expected

We had already had a few minor incidents during its early operation caused by dodgy data flowing down from upstream systems and had tackled that by temporarily remediating the data to get the system working but then immediately fixed the code so that the same kind of problem would not cause an issue in future. The project manager, who had also worked on a sister legacy system to one I’d worked on before, had made it clear from the start that he didn’t want another “support nightmare” like we’d both seen before [1] and pushed the “self-healing” angle which was a joy to hear. Consequently reliability was always foremost in our minds.

Once the system went live and the business began to rely on it, the idea of randomly killing off services and servers in production was a hard prospect to sell. While the project manager had fought to help us get a UAT environment that almost brought us parity with production, and was okay with us using that for testing the system’s reliability, he was less happy about going the whole hog and adopting the Netflix approach. (The organisation was already very reserved and, despite our impeccable record, some other teams had some nasty failures that caused the organisation to become more risk averse rather than address the root problems.)

Planned Disruption is Good!

Some months after we had gone live I drew the short straw and was involved with a large-scale DR test. We were already running active/active by making use of the DR facilities during the day and rotated the database cluster nodes every weekend [2] to avoid a node getting stale, hence we had a high degree of confidence that we would cope admirably with the test. Unfortunately there was a problem with one of the bank’s main trade systems, such that it wouldn’t start after failover to DR, so we never really got to do a full test and show that it was a no-brainer for us.

While the day was largely wasted for me as I sat around waiting for our turn, it did give me time to think a bit more about how we would show that the system was working correctly, and also, once the DR test was finished and we had failed back over again, that it had recovered properly. At that point I realised we didn’t need to implement any form of Chaos Engineering ourselves as the Infrastructure team were already providing it, every weekend!

It’s common for large enterprises to only perform emergency maintenance during the week and then make much more disruptive changes at the weekend, e.g. tearing parts of the network up, patching and rebooting servers, etc. At that time it was common for support teams to shut systems down and carefully bring them back up after the maintenance window to ensure they were operating correctly when the eastern markets opened late Sunday evening [3]. This was the perfect opportunity to do the complete opposite – drive the system hard over the weekend and see what state it was after the maintenance had finished – if it wasn’t still operating normally we’d missed some failure modes.

An Aria of Canaries

We were already pushing through a simple canary request every few minutes which allowed us to spot when things had unusually gone south but we wanted something heavier that might drive out subtler problems so we started pushing through heavy loads during the weekend too and then looked at what state they were in at the end of the weekend. These loads always had a lower priority than any real work so we could happily leave them to finish in the background rather than need to kill them off before the working week started. (This is a nice example of using the existing features of the system to avoid it disrupting the normal workload.)

This proved to be a fruitful idea as it unearthed a couple of places where the system wasn’t quite as reliable as we’d thought. For example we were leaking temporary files when the network was glitching and the calculation was restarted. Also the load pushed the app servers over the edge memory-wise and highlighted a bug in the nanny process when the machine was short of memory. There was also a bug in some exponential back-off code that backed off a little too far as it never expected an outage to last most of the weekend :o).

Order From Chaos

When they finally scheduled a repeat DR test some months later, after supposedly ironing out the wrinkles with their key trade capture systems, our test was a doddle: it just carried on after being brought back to life in the DR environment, and similarly, after reverting back to PROD, it just picked up where it had left off and retried those jobs that had failed when the switchover started. Rather than shying away from the weekend disruption we had used it to our advantage to help improve its reliability.

 

[1] Eventually the team spends so much time fire-fighting there is no time left to actually fix the system and it turns into an endless soul-destroying job.

[2] Rotating the database cluster primary causes the database to work with an empty cache which is a great way to discover how much your common queries rely on heavily cached data. In one instance a 45-second reporting query took over 15 minutes when faced with no cached pages!

[3] See Arbitrary Cache Timeouts for an example where constant rebooting masked a bug.

One Thing Or Another – a.k.

a.k. from thus spake a.k.

Several years ago we took a look at memoryless processes in which the probability that we should wait for any given length of time for an event to occur is independent of how long we have already been waiting. We found that this implied that the waiting time must be exponentially distributed, that the waiting time for several events must be gamma distributed and that the number of events occurring in a unit of time must be Poisson distributed.
These govern continuous memoryless processes in which events can occur at any time but not those in which events can only occur at specified times, such as the roll of a die coming up six, known as Bernoulli processes. Observations of such processes are known as Bernoulli trials and their successes and failures are governed by the Bernoulli distribution, which we shall take a look at in this post.
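For reference (this formula is not part of the excerpt above, just the standard definition), a Bernoulli trial with success probability p has the probability mass function

\Pr(X = k) = p^k (1 - p)^{1-k}, \qquad k \in \{0, 1\},

so \Pr(X = 1) = p and \Pr(X = 0) = 1 - p.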
