How to send an SMS using netcat (via SMPP)

Andy Balaam from Andy Balaam's Blog

SMPP is a binary protocol used by phone companies to send text messages, otherwise known as SMS messages.

It can work over TCP, so we can use netcat on the command line to send messages.

Setting up

[Note: the netcat I am using is Ncat 7.70 on Linux.]

The server that receives messages is called an SMSC. You may have your own one, but if not, you can use the CloudHopper one like this:

sudo apt install make maven  # (or similar on non-Debian-derived distros)
git clone https://github.com/fizzed/cloudhopper-smpp.git
cd cloudhopper-smpp

If you are a little slow, like me, I’d suggest making it wait a bit longer for bind requests before giving up on you. To do that, edit the main() method of src/test/java/com/cloudhopper/smpp/demo/ServerMain.java to add a line like this: configuration.setBindTimeout(500000); on about line 80, near the other similar lines. This will make it wait 500 seconds for you to send a BIND_TRANSCEIVER, instead of giving up after just 5 seconds.

Once you’ve made that change, you can run:

make server

Now you have an SMSC running!

Leave that open, and go into another terminal, and type:

mkfifo tmpfifo
nc 0.0.0.0 2776 < tmpfifo | xxd

The mkfifo part creates a “fifo” – a named pipe through which we will send our SMPP commands.

The nc part starts Ncat, connecting to the SMSC we started.

The xxd part will take any binary data coming out of Ncat and display it in a more human-readable way.

Leave that open too, and in a third terminal type:

exec 3> tmpfifo

This makes everything we send to file descriptor 3 go into the fifo, and therefore into Ncat.

Now we have a way of sending binary data to Ncat, which will send it on to the SMSC and print out any responses.

Note: we will be using SMPP version 3.4 since it is in the widest use, even though it is not the newest.

Terminology

“SMPP” is the protocol we are speaking, which we are using over TCP/IP.

An SMSC is a server (which receives messages intended for phones and sends back responses and receipts).

We will be acting as an ESME or client (which sends messages intended for phones and receives responses and receipts).

The units of information that are passed back and forth in SMPP are called “PDUs” (Protocol Data Units) – these are just bits of binary data that flow over the TCP connection between two computers.

The spec talks about “octets” – this means 8-bit bytes.

ENQUIRE_LINK

First, we’ll check the SMSC is responding, by sending an ENQUIRE_LINK, which is used to ask the SMSC whether it’s there and working.

Go back to the third terminal (where we ran exec) and type this:

LEN16='\x00\x00\x00\x10'
ENQUIRE_LINK='\x00\x00\x00\x15'
NULL='\x00\x00\x00\x00'
SEQ1='\x00\x00\x00\x01'

echo -n -e "${LEN16}${ENQUIRE_LINK}${NULL}${SEQ1}" >&3

Explanation: an ENQUIRE_LINK PDU consists of:

  • 4 bytes to say the length of the whole PDU in bytes. ENQUIRE_LINK PDUs are always 16 bytes, “00000010” in hex. I called this LEN16.
  • 4 bytes to say what type of PDU this is. ENQUIRE_LINK is “00000015” in hex. I called this ENQUIRE_LINK.
  • 4 bytes that are always zero in ENQUIRE_LINK. I called this NULL.
  • 4 bytes that identify this request, called a sequence number. The response from the server will include this so we can match responses to requests. I called this SEQ1.

Check back in the second terminal (where you ran nc). If everything worked, you should see something like this:

00000000: 0000 0010 8000 0015 0000 0000 0000 0001  ................

Ignoring the first and last parts (which are how xxd formats its output), the response we receive is four 4-byte parts, very similar to what we sent:

  • 4 bytes to say the length of the whole PDU in bytes. Here it is “00000010” hex, or 16 decimal.
  • 4 bytes to say what type of PDU this is. Here it is “80000015” in hex, which is the code for ENQUIRE_LINK_RESP.
  • 4 bytes for the success status of the ENQUIRE_LINK_RESP. This is always “00000000”, which means success and is called ESME_ROK in the spec.
  • 4 bytes that match the sequence number we sent. This is “00000001”, as we expected.

BIND_TRANSCEIVER

Now we can see that the SMSC is working, let’s “bind” to it. That means something like logging in: we convince the SMSC that we are a legitimate client, and tell it what type of connection we want, and, assuming it agrees, it will hold the connection open for us for as long as we need.

We are going to bind as a transceiver, which means both a transmitter and receiver, so we can both send messages and receive responses.

Send the bind request like this:

LEN32='\x00\x00\x00\x20'
BIND_TRANSCEIVER='\x00\x00\x00\x09'
NULL='\x00\x00\x00\x00'
SEQ2='\x00\x00\x00\x02'
SYSTEM_ID="sys\x00"
PASSWORD="pas\x00"
SYSTEM_TYPE='typ\x00'
SYSTEM_ID='sys\x00'
PASSWORD='pas\x00'
INTERFACE_VERSION='\x01'
INTERFACE_VERSION='\x34'
ADDR_TON='\x00'
ADDR_NPI='\x00'
ADDRESS_RANGE='\x00'

echo -n -e "${LEN32}${BIND_TRANSCEIVER}${NULL}${SEQ2}${SYSTEM_ID}${PASSWORD}${SYSTEM_TYPE}${INTERFACE_VERSION}${ADDR_TON}${ADDR_NPI}${ADDRESS_RANGE}" >&3

Explanation: this PDU is 32 bytes long, so the first thing we send is “00000020” hex, which is 32.

Then we send “00000009” for the type: BIND_TRANSCEIVER, 4 bytes of zeros, and a sequence number – this time I used 2.

That was the header. Now the body of the PDU starts with a system id (basically a username), a password, and a system type (extra info about who you are). These are all variable-length null-terminated strings, so I ended each one with \x00.

The rest of the body is some options about the types of phone number we will be sending from and sending to – I made them all “00” hex, which means “we don’t know”.
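
For reference, here is how the 32 bytes we just sent break down (this is just the variables above written out byte by byte):

# 00 00 00 20   length = 32 bytes
# 00 00 00 09   command id = BIND_TRANSCEIVER
# 00 00 00 00   command status (unused in requests)
# 00 00 00 02   sequence number = 2
# 73 79 73 00   system_id "sys" plus null terminator
# 70 61 73 00   password "pas" plus null terminator
# 74 79 70 00   system_type "typ" plus null terminator
# 34            interface_version = SMPP 3.4
# 00            addr_ton ("we don't know")
# 00            addr_npi ("we don't know")
# 00            address_range (empty string)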

If it worked, you should see this in the nc output:

00000000: 0000 0021 8000 0009 0000 0000 0000 0002  ...!............
00000010: 636c 6f75 6468 6f70 7065 7200 0210 0001  cloudhopper.....

As before, the first 4 bytes are for how long the PDU is – 33 bytes – and the next 4 bytes are for what type of PDU this is – “80000009” is for BIND_TRANSCEIVER_RESP which is the response to a BIND_TRANSCEIVER.

The next 4 bytes are for the status – these are zeroes which indicates success (ESME_ROK) again. After that is our sequence number (2).

The next 12 bytes are the characters of the word “cloudhopper” followed by a zero – this is the system id of the SMSC.

The remaining bytes we can see (“0210 0001”) are the start of a “TLV”, or optional part of the response. xxd actually delayed the last byte of the output, so we can’t see it yet, but it is “34”. Together, these five bytes tell us “the interface version we support is SMPP 3.4” – the breakdown below shows how.
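
This TLV follows the standard SMPP 3.4 tag/length/value layout (the tag 0x0210 is sc_interface_version in the spec):

# 02 10   tag    = 0x0210, sc_interface_version
# 00 01   length = the value is 1 byte long
# 34      value  = 0x34, i.e. SMPP version 3.4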

SUBMIT_SM

The reason why we’re here is to send a message. To do that, we use a SUBMIT_SM:

LEN61='\x00\x00\x00\x3d'
SUBMIT_SM='\x00\x00\x00\x04'
SEQ3='\x00\x00\x00\x03'
SERVICE_TYPE='\x00'
SOURCE_ADDR_TON='\x00'
SOURCE_ADDR_NPI='\x00'
SOURCE_ADDR='447000123123\x00'
DEST_ADDR_TON='\x00'
DEST_ADDR_NPI='\x00'
DESTINATION_ADDR='447111222222\x00'
ESM_CLASS='\x00'
PROTOCOL_ID='\x01'
PRIORITY_FLAG='\x01'
SCHEDULE_DELIVERY_TIME='\x00'
VALIDITY_PERIOD='\x00'
REGISTERED_DELIVERY='\x01'
REPLACE_IF_PRESENT_FLAG='\x00'
DATA_CODING='\x03'
SM_DEFAULT_MSG_ID='\x00'
SM_LENGTH='\x04'
SHORT_MESSAGE='hihi'
echo -n -e "${LEN61}${SUBMIT_SM}${NULL}${SEQ3}${SERVICE_TYPE}${SOURCE_ADDR_TON}${SOURCE_ADDR_NPI}${SOURCE_ADDR}${DEST_ADDR_TON}${DEST_ADDR_NPI}${DESTINATION_ADDR}${ESM_CLASS}${PROTOCOL_ID}${PRIORITY_FLAG}${SCHEDULE_DELIVERY_TIME}${VALIDITY_PERIOD}${REGISTERED_DELIVERY}${REPLACE_IF_PRESENT_FLAG}${DATA_CODING}${SM_DEFAULT_MSG_ID}${SM_LENGTH}${SHORT_MESSAGE}" >&3

LEN61 is the length in bytes of the PDU, SUBMIT_SM is the type of PDU, and SEQ3 is a sequence number, as before.

SOURCE_ADDR is a null-terminated (i.e. it ends with a zero byte) string of ASCII characters saying who the message is from. This can be a phone number, or a name (but the rules about what names are allowed are complicated and region-specific). SOURCE_ADDR_TON and SOURCE_ADDR_NPI give information about what type of address we are providing – we set them to zero to mean “we don’t know”.

DESTINATION_ADDR, DEST_ADDR_TON and DEST_ADDR_NPI describe the phone number we are sending to.

ESM_CLASS tells the SMSC how to treat your message – we use “store and forward” mode, which means keep it and send it when you can.

PROTOCOL_ID tells it what to do if it finds duplicate messages – we use “Replace Short Message Type 1”, which I guess means use the latest version you received.

PRIORITY_FLAG means how important the message is – we used “interactive”.

SCHEDULE_DELIVERY_TIME is when to send – we say “immediate”.

VALIDITY_PERIOD means how long should this message live before we give up trying to send it (e.g. if the user’s phone is off). We use “default” so the SMSC will do something sensible.

REGISTERED_DELIVERY gives information about whether we want a receipt saying the message arrived on the phone. We say “yes please”.

REPLACE_IF_PRESENT_FLAG is also about duplicate messages (I’m not sure how it interacts with PROTOCOL_ID) – the value we used means “don’t replace”.

DATA_CODING states what character encoding you are using to send the message text – we used “Latin 1”, which means ISO-8859-1.

SM_DEFAULT_MSG_ID allows us to use one of a handful of hard-coded standard messages – we say “no, use a custom one”.

SM_LENGTH is the length in bytes of the “short message” – the actual text that the user will see on the phone screen.

SHORT_MESSAGE is the short message itself – our message is all ASCII characters, but we could use any bytes and they will be interpreted as characters in ISO-8859-1 encoding.
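
If you change the message text, SM_LENGTH and the overall PDU length both need recalculating. Here is a rough sketch of that arithmetic (my own addition, assuming plain ASCII text and the same 13-byte null-terminated addresses used above):

MESSAGE='hihi'
SM_LEN=${#MESSAGE}     # bytes in the short message (4 for "hihi")
# 16-byte header, service_type plus source TON/NPI (3), source address (13),
# dest TON/NPI (2), destination address (13), the ten 1-byte fields from
# esm_class through sm_length (10), then the message itself:
PDU_LEN=$(( 16 + 3 + 13 + 2 + 13 + 10 + SM_LEN ))
printf 'SM_LENGTH=\\x%02x  total length=%d (hex %02x)\n' "$SM_LEN" "$PDU_LEN" "$PDU_LEN"
# prints: SM_LENGTH=\x04  total length=61 (hex 3d)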

You should see a response in the other terminal like this:

00000020: 3400 0000 1180 0000 0400 0000 0000 0000  4...............

The initial “34” is the left-over byte from the previous message as mentioned above. After that, we have:

“00000011” for the length of this PDU (17 bytes).

“80000004” for the type – SUBMIT_SM_RESP which tells us whether the message was accepted (but not whether it was received).

“00000000” for the status – zero means “OK”.

The last two bytes are chopped off again, but what we actually get back is:

“00000003”, which is the sequence number, and then:

“00” which is a null-terminated ASCII message ID: in this case the SMSC is saying that the ID it has given this message is “”, which is probably not very helpful! If this ID were not empty, it would help us later if we receive a delivery receipt, or if we want to ask about the message, or change or cancel it.

DELIVER_SM

If you stop the SMSC process (the one we started with make server) by pressing Ctrl-C, start a different one with make server-echo, and then repeat the other commands, you will receive a delivery receipt from the server. (Note: you need to be quick, because you only get 5 seconds to bind before it gives up on you – make changes to ServerEchoMain similar to those we made in ServerMain if this causes problems.) The receipt looks like this:

“0000003d” for the length of this PDU (61 bytes).

“00000005” for the type (DELIVER_SM).

“00000000” for the unused command status.

“00000001” as a sequence number. Note: this is unrelated to the sequence number of the original message; to match it with the original message, we must use the message ID provided in the SUBMIT_SM_RESP.

“0000003400” to mean we are using SMPP 3.4. (This is a null-terminated string of bytes.)

“00” and “00” for the TON and NPI of the source address, followed by the source address itself, which is a null-terminated ASCII string: “34343731313132323232323200”. This translates to “447111222222”, which was the destination address of our original message. Note: some SMSCs switch the source and destination addresses like this in their delivery receipts, and some don’t, which makes life interesting.

“00” and “00” for the TON and NPI of the destination address, followed by “34343730303031323331323300” for the address itself, which translates to “447000123123”, as expected.

The DELIVER_SM PDU continues with much of the information repeated from the original message, and the SMSC is allowed to provide a short message as part of the receipt – in our example the cloudhopper SMSC repeats the original message. Some SMSCs use the short message to provide information such as the message ID and the delivery time, but there is no formal standard for how to provide it. Other SMSCs use a TLV to provide the message ID instead.

In order to complete the conversation, you should provide a DELIVER_SM_RESP, and then an UNBIND, but hopefully based on what we’ve done and the SMPP 3.4 standard, you should be able to figure it out.
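
To give you a head start, here is a minimal sketch of those last two PDUs. It assumes the DELIVER_SM arrived with sequence number 1 (as in the example above), that an empty message ID is acceptable in the response, and that 4 is our next free sequence number for the UNBIND:

LEN16='\x00\x00\x00\x10'
LEN17='\x00\x00\x00\x11'
NULL='\x00\x00\x00\x00'
SEQ1='\x00\x00\x00\x01'
SEQ4='\x00\x00\x00\x04'
DELIVER_SM_RESP='\x80\x00\x00\x05'
UNBIND='\x00\x00\x00\x06'
EMPTY_MESSAGE_ID='\x00'

# DELIVER_SM_RESP: 17 bytes, the header plus an empty null-terminated message_id.
# Its sequence number must echo the one from the DELIVER_SM we received (1 here).
echo -n -e "${LEN17}${DELIVER_SM_RESP}${NULL}${SEQ1}${EMPTY_MESSAGE_ID}" >&3

# UNBIND: a bare 16-byte header with our own next sequence number.
echo -n -e "${LEN16}${UNBIND}${NULL}${SEQ4}" >&3

If all is well you should see an UNBIND_RESP (command id “80000006”) appear in the xxd output, after which you can tidy up with exec 3>&- and rm tmpfifo.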

You did it

SMPP is a binary protocol layered directly on top of TCP, which makes it slightly harder to work with by hand than the HTTP protocols with which many of us are more familiar, but I hope I’ve convinced you it’s possible to understand what’s going on without resorting to some kind of heavyweight debugging tool or library.

Happy texting!

Extreme value theory in software engineering

Derek Jones from The Shape of Code

As its name suggests, extreme value theory deals with extreme deviations from the average, e.g., how often will rainfall be heavy enough to cause a river to overflow its banks.

The initial list of statistical topics I thought ought to be covered in my evidence-based software engineering book included extreme value theory. At the time, and even today, there were/are no books covering “Statistics for software engineering”, so I had no prior work to guide my selection of topics. I was keen to cover all the important topics, had heard of extreme value theory in several (non-software) contexts, and jumped to the conclusion that it must be applicable to software engineering.

Years pass: the draft accumulates a wide variety of analysis techniques applied to software engineering data, but no use of extreme value theory.

Something else does not happen: I don’t find any ‘Using extreme value theory to analyse data’ books. Yes, there are some really heavy-duty maths books available, but nothing of a practical persuasion.

The book’s Extreme value section becomes a subsection, then a subsubsection, and ends up inside a comment (I cannot bring myself to delete it).

It appears that extreme value theory is more talked about than used. I can understand why. Extreme events are newsworthy; rivers that don’t overflow their banks are not news.

Just over a month ago a discussion cropped up on the UK’s C++ standards’ panel mailing list: was email traffic down because of COVID-19? The panel’s convenor, Roger Orr, posted some data on monthly volumes. Oh, data :-)

Monthly data is a bit too granular for detailed analysis over relatively short periods. After some poking around Roger was able to send me the date&time of every post to the WG21’s Core and Lib reflectors since February 2016 (there have been various changes of hosts and configurations over the years, and the dates of posts since 2016 were straightforward to obtain).

During our email exchanges, Roger had mentioned that every now and again a huge discussion thread comes out of nowhere. Woah, sounds like WG21 could do with some extreme value theory. How often are huge discussion threads likely to occur, and how huge is a once in 10-years thread that they might have to deal with?

There are two techniques for analysing the distribution of extreme values present in a sample (both based around the generalized extreme value distribution):

  • Generalized Extreme Value (GEV) uses block maxima, e.g., maximum number of daily emails sent in each month,
  • Generalized Pareto (GP) uses peak over threshold: pick a threshold and extract day values for when more than this threshold number of emails was sent.

The plots below show the maximum number of monthly emails that are expected to occur (y-axis) within a given number of months (x-axis), for WG21’s Core and Lib email lists. The circles are actual occurrences, and the dashed lines are 95% confidence intervals; GEV was used for these fits (code+data):

Expected maximum for emails appearing on C++'s core and lib reflectors within a given period

The 10-year return value for Core is around a daily maximum of 70 +-30, and closer to 200 +-100 for Lib.

The model used is very simplistic, and fails to take into account the growth in members joining these lists and traffic lost when a new mailing list is created for a new committee subgroup.

If any readers have suggestions for uses of extreme value theory in software engineering, please let me know.

Postlude. This discussion has reordered events. My original interest in the mailing list data was the desire to find some evidence for the hypothesis that the volume of email increased as the date of the next WG21 meeting approached. For both Core and Lib, the volume actually decreases slightly as the date of the next meeting approaches; see code for details. Also, the volume of email at the weekend is around 60% lower than during weekdays.

Simple Tables From JSON Data With JQ and Column

Chris Oldwood from The OldWood Thing

My current role is more of a DevOps role and I’m spending more time than usual monitoring and administrating various services, such as the GitLab instance we use for source control, build pipelines, issue management, etc. While the GitLab UI is very useful for certain kinds of tasks, the rich RESTful API allows you to easily build your own custom tools to monitor, analyse, and investigate the things you’re particularly interested in.

For example one of the first views I wanted was an alphabetical list of all runners with their current status so that I could quickly see if any had gone AWOL during the night. The alphabetical sorting requirement is not something the standard UI view provides hence I needed to use the REST API or hope that someone had already done something similar first.

GitLab Clients

I quickly found two candidates: python-gitlab and go-gitlab-client which looked promising but they only really wrap the API – I’d still need to do some heavy lifting myself and understand what the GitLab API does. Given how simple the examples were, even with curl, it felt like I wasn’t really saving myself anything at this point, e.g.

curl --header "PRIVATE-TOKEN: $token" "https://gitlab.example.com/api/v4/runners"

So I decided to go with a wrapper script [1] approach instead and find a way to prettify the JSON output so that the script encapsulated a shell one-liner that would request the data and format the output in a simple table. Here is the kind of JSON the GitLab API would return for the list of runners:

[
  {
   "id": 6,
   "status": "online"
   . . .
  }
,
  {
   "id": 8,
   "status": "offline"
   . . .
  }
]

JQ – The JSON Tool

I’d come across the excellent JQ tool for querying JSON payloads many years ago so that was my first thought for at least simplifying the JSON payloads to the fields I was interested in. However on further reading I found it could do some simple formatting too. At first I thought the compact output using the -c option was what I needed (perhaps along with some tr magic to strip the punctuation), e.g.

$ echo '[{"id":1, "status":"online"}]' |\
  jq -c
[{"id":1,"status":"online"}]

but later I discovered the -r option provided raw output which formatted the values as simple text and removed all the JSON punctuation, e.g.

$ echo '[{"id":1, "status":"online"}]' |\
  jq -r '( .[] | "\(.id) \(.status)" )'
1 online

Naturally my first thought for the column headings was to use a couple of echo statements before the curl pipeline but I also discovered that you can mix-and match string literals with the output from the incoming JSON stream, e.g.

$ echo '[{"id":1, "status":"online"}]' |\
   jq -r '"ID Status",
          "-- ------",
          ( .[] | "\(.id) \(.status)" )'
ID Status
-- ------
1 online

This way the headings were only output if the command succeeded.

Neater Tables with Column

While these crude tables were readable and simple enough for further processing with grep and awk they were still pretty unsightly when the values of a column were too varied in length such as a branch name or description field. Putting them on the right hand side kind of worked but I wondered if I could create fixed width fields ala printf via jq.

At this point I stumbled across the StackOverflow question How to format a JSON string as a table using jq? where one of the later answers mentioned a command line tool called “column” which takes rows of text values and arranges them as columns of similar width by adjusting the spacing between elements.

This almost worked except for the fact that some fields had spaces in their input and column would treat them by default as separate elements. A simple change of field separator from a space to a tab meant that I could have my cake and eat it, e.g.

$ echo '[ {"id":1, "status":"online"},
          {"id":2, "status":"offline"} ]' |\
  jq -r '"ID\tStatus",
         "--\t-------",
         ( .[] | "\(.id)\t\(.status)" )' |\
  column -t -s $'\t'
ID  Status
--  -------
1   online
2   offline

Sorting and Limiting

While for many of the views I was happy to order by ID, which is often the default for the API and, in the case of jobs and pipelines, was a proxy for “start time”, there were cases where I needed to control the sorting. For example we used the runner description to store the hostname (or host + container name) so it made sense to order by that, e.g.

jq 'sort_by(.description|ascii_downcase)'

For the runner’s jobs the job ID ordering wasn’t that useful as the IDs were allocated up front but the job might start much later if it’s a later part of the pipeline, so I chose to order by the job start time instead, in descending order, so the most recent jobs were listed first, e.g.

jq 'sort_by(.started_at) | reverse'

One final trick that proved useful occasionally, when there was no limiting in the API, was to do it with jq instead, e.g.

jq "sort_by(.name) | [limit($max; .[])]"

 

[1] See my 2013 article “In The Toolbox – Wrapper Scripts” for more about this common technique of simplifying tools.

Weekend Maintenance as Chaos Engineering

Chris Oldwood from The OldWood Thing

I was working on a new system – a grid based calculation engine for an investment bank – and I was beginning to read about some crazy ideas by Netflix around how they would kill off actual production servers to test their resilience to failure. I really liked this idea as it had that “put your money where your mouth is” feel to it and I felt we were designing a system that should cope with this kind of failure, and if it didn’t, then we had learned something and needed to fix it.

Failure is Expected

We had already had a few minor incidents during its early operation caused by dodgy data flowing down from upstream systems and had tackled that by temporarily remediating the data to get the system working but then immediately fixed the code so that the same kind of problem would not cause an issue in future. The project manager, who had also worked on a sister legacy system to one I’d worked on before, had made it clear from the start that he didn’t want another “support nightmare” like we’d both seen before [1] and pushed the “self-healing” angle which was a joy to hear. Consequently reliability was always foremost in our minds.

Once the system went live and the business began to rely on it the idea of randomly killing off services and servers in production was a hard prospect to sell. While the project manager had fought to help us get a UAT environment that almost brought us parity with production, and was okay with us using that for testing the system’s reliability, he was less happy about going the whole hog and adopting the Netflix approach. (The organisation was already very reserved and, despite our impeccable record, some other teams had some nasty failures that caused the organisation to become more risk averse rather than address the root problems.)

Planned Disruption is Good!

Some months after we had gone live I drew the short straw and was involved with a large-scale DR test. We were already running active/active by making use of the DR facilities during the day and rotated the database cluster nodes every weekend [2] to avoid a node getting stale, hence we had a high degree of confidence that we would cope admirably with the test. Unfortunately there was a problem with one of the bank’s main trade systems, such that it wouldn’t start after failover to DR, so we never really got to do a full test and show that it was a no-brainer for us.

While the day was largely wasted for me as I sat around waiting for our turn, it did give me time to think a bit more about how we would show that the system was working correctly, and also that it had recovered properly once the DR test was finished and we had failed back over again. At that point I realised we didn’t need to implement any form of Chaos Engineering ourselves as the Infrastructure team were already providing it, every weekend!

It’s common for large enterprises to only perform emergency maintenance during the week and then make much more disruptive changes at the weekend, e.g. tearing parts of the network up, patching and rebooting servers, etc. At that time it was common for support teams to shut systems down and carefully bring them back up after the maintenance window to ensure they were operating correctly when the eastern markets opened late Sunday evening [3]. This was the perfect opportunity to do the complete opposite – drive the system hard over the weekend and see what state it was in after the maintenance had finished – if it wasn’t still operating normally we’d missed some failure modes.

An Aria of Canaries

We were already pushing through a simple canary request every few minutes which allowed us to spot when things had unusually gone south but we wanted something heavier that might drive out subtler problems so we started pushing through heavy loads during the weekend too and then looked at what state they were in at the end of the weekend. These loads always had a lower priority than any real work so we could happily leave them to finish in the background rather than need to kill them off before the working week started. (This is a nice example of using the existing features of the system to avoid it disrupting the normal workload.)

This proved to be a fruitful idea as it unearthed a couple of places where the system wasn’t quite as reliable as we’d thought. For example we were leaking temporary files when the network was glitching and the calculation was restarted. Also the load pushed the app servers over the edge memory-wise and highlighted a bug in the nanny process when the machine was short of memory. There was also a bug in some exponential back-off code that backed off a little too far as it never expected an outage to last most of the weekend :o).

Order From Chaos

When they finally scheduled a repeat DR test some months later, after supposedly ironing out the wrinkles with their key trade capture systems, our test was a doddle: the system just carried on after being brought back to life in the DR environment and, similarly, after reverting back to PROD it just picked up where it had left off and retried those jobs that had failed when the switchover started. Rather than shying away from the weekend disruption we had used it to our advantage to help improve its reliability.

 

[1] Eventually the team spends so much time fire-fighting there is no time left to actually fix the system and it turns into an endless soul-destroying job.

[2] Rotating the database cluster primary causes the database to work with an empty cache which is a great way to discover how much your common queries rely on heavily cached data. In one instance a 45-second reporting query took over 15 minutes when faced with no cached pages!

[3] See Arbitrary Cache Timeouts for an example where constant rebooting masked a bug.

One Thing Or Another – a.k.

a.k. from thus spake a.k.

Several years ago we took a look at memoryless processes in which the probability that we should wait for any given length of time for an event to occur is independent of how long we have already been waiting. We found that this implied that the waiting time must be exponentially distributed, that the waiting time for several events must be gamma distributed and that the number of events occurring in a unit of time must be Poisson distributed.
These govern continuous memoryless processes in which events can occur at any time but not those in which events can only occur at specified times, such as the roll of a die coming up six, known as Bernoulli processes. Observations of such processes are known as Bernoulli trials and their successes and failures are governed by the Bernoulli distribution, which we shall take a look at in this post.

Testing Testing 1 2 – video blog

Allan Kelly from Allan Kelly Associates


Testing isn’t, or shouldn’t be, about finding bugs. Type-1 Testing is about ensuring you can go to Type-2 Testing and get some useful feedback. Customer feedback is the really valuable stuff: does your product address the need you saw? Is there more to do? More value to be got? – and does the value delivered justify the cost of doing it?

Testing-Testing 1 2 is a 13-minute video blog, a rerecording of a private presentation I did during lock-down; I hope you find it interesting.


Now booking: September micro-workshops – spaces limited

User Stories Masterclass, Agile Estimation & Forecasting, Maximising value delivered

Early bird discounts & free tickets for unemployed/furloughed

Book with code Blog15 for 15% discount or get more details


The post Testing Testing 1 2 – video blog appeared first on Allan Kelly Associates.

Scientific management of software production

Derek Jones from The Shape of Code

When Frederick Taylor investigated the performance of workers in various industries, at the start of the 1900’s, he found that workers organise their work to suit themselves; workers were capable of producing significantly more than they routinely produced. This was hardly news. What made Taylor’s work different was that having discovered the huge difference between actual worker output and what he calculated could be achieved in practice, he was able to change work practices to achieve close to what he had calculated to be possible. Changing work practices took several years, and the workers did everything they could to resist it (Taylor’s The principles of scientific management is an honest and revealing account of his struggles).

Significantly increasing worker output pushed company profits through the roof, and managers everywhere wanted a piece of the action; scientific management took off. Note: scientific management is not a science of work, it is a science of the management of other people’s work.

The scientific management approach has been successfully applied to production where most of the work can be reduced to purely manual activities (i.e., requiring little thinking by those who performed them). The essence of the approach is to break down tasks into the smallest number of component parts, to simplify these components so they can be performed by less skilled workers, and to rearrange tasks in a way that gives management control over the production process. Deskilling tasks increases the size of the pool of potential workers, decreasing labor costs and increasing the interchangeability of workers.

Given the almost universal use of this management technique, it is to be expected that managers will attempt to apply it to the production of software. The software factory was tried, but did not take off. The use of chief programmer teams had its origins in the scarcity of skilled staff; the idea is that somebody who knows what they are doing divides up the work into chunks that can be implemented by less skilled staff. This approach is essentially the early stages of scientific management, but it did not gain traction (see “Programmers and Managers: The Routinization of Computer Programming in the United States” by Kraft).

The production of software is different in that once the first copy has been created, the cost of reproduction is virtually zero. The human effort invested in creating software systems is primarily cognitive. The division between management and workers is along the lines of what they think about, not between thinking and physical effort.

Software systems can be broken down into simpler components (assuming all the requirements are known), but can the implementation of these components be simplified such that they can be implemented by less skilled developers? The process of simplification is practical when designing a system for repetitive reproduction (e.g., making the same widget again and again), but the first implementation of anything is unlikely to be simple (and only one implementation is needed for software).

If it is not possible to break down the implementation such that most of the work is easy to do, can we at least hire the most productive developers?

How productive are different developers? Programmer productivity has been a hot topic since people started writing software, but almost no effective research has been done.

I have no idea how to measure programmer productivity, but I do have some ideas about how to measure their performance (a high performance programmer can have zero productivity by writing programs, faster than anybody else, that don’t do anything useful, from the client’s perspective).

When the same task is repeatedly performed by different people it is possible to obtain some measure of average/minimum/maximum individual performance.

Task performance improves with practice, and an individual’s initial task performance will depend on their prior experience. Measuring performance based on a single implementation of a task provides some indication of minimum performance. To obtain information on an individual’s maximum performance they need to be measured over multiple performances of the same task (and of course working in a team affects performance).

Should high performance programmers be paid more than low performance programmers (ignoring the issue of productivity)? I am in favour of doing this.

What about productivity payments, e.g., piece work?

This question is a minefield of issues. Manual workers have been repeatedly found to set informal quotas amongst themselves, i.e., setting a maximum on the amount they will produce during a shift (see “Money and Motivation: An Analysis of Incentives in Industry” by William Whyte). Thankfully, I don’t think I will be in a position to have to address this issue anytime soon (i.e., I don’t see a reliable measure of programmer productivity being discovered in the foreseeable future).

Is Agile is obvious?

Allan Kelly from Allan Kelly Associates


From time-to-time people say to me:

“Agile is obvious”

When I’m being honest it is kind of hard to argue with them, it is certainly “obvious” to me. But at the same time agile is not obvious, or rather, the opposite of agile is also obvious. For example,

Agile says: obviously, you don’t know the future so don’t plan and research too far into the future.
Non-agile thinking says: obviously, failure to plan is planning to fail, obviously you need a plan of action, you need to plan for the future.

Agile says: obviously, people work best when they are self-motivated and given a say in what they do.
Non-agile says: obviously, people are lazy and will do as little as possible, therefore someone needs to manage them.

Agile says: high quality makes it easier to change in the future, obviously.
Non-agile says: obviously, quality is an endless quest, there is no point in polishing something which isn’t important, 20% of the effort gives 80% of the reward so don’t do any more.

Agile emphasises the here and now, the soon, obviously requirements can be handled just-in-time, so live for today.
Non-agile says: if we don’t think about the future we will obviously duplicate work and incur additional costs.

And my own entry: obviously, software development has diseconomies of scale, therefore optimise for lots of small. The opposite is equally obvious: economies of scale are what makes modern business – and the cloud – successful, so exploit them.

There are a number of obvious examples that go with that:

Agile says: obviously we should test every change and new feature by itself to avoid the complications of interacting changes.
Non-agile says: obviously, full test runs are slow and expensive so bundle work together and test it en masse.

Both agile and not-agile are obvious. What you consider obvious depends on your starting point. Once you start thinking “agile” a lot of things become obvious. But if you are not thinking agile then, if you are thinking some other model, then the opposite is also obvious.

Some would term this “An Agile Mindset”. However I don’t want to do that: I find the idea of “an agile mindset” too nebulous. I also note that most of the people I hear talking about “an agile mindset” seem to be clinging on to some piece of holy lore which I consider not very agile and they believe is totally agile (usually the project model and upfront requirements).

Instead I find myself going back to Theory-X and Theory-Y. In general people fall into one camp or the other. If you, your philosophy on work and life, align with theory-Y then all the “agile is obvious” statements above are indeed obvious. Conversely, if you generally follow a theory-X philosophy then all the non-agile statements are obvious.

Perhaps surprisingly I find people can flip, and be flipped, from X to Y. What is more difficult is getting people to unlearn behaviours and actions which they acquired with a theory-X mindset. Even if some element of theory-Y (and agile) is now obvious people need help to learn the new way and let go of the old. Some people can do this by themselves, others need help – or at least help speeding up the change.

Yes, that’s part of my job as an Agile Guide. Sometimes just talking (and reflecting on recent events) helps. Sometimes exercises (or process miniatures, as they are sometimes called) help. Sometimes it is by experiments; exposing people to others can help as well – conferences, user groups.

Rarely do people change because they went on training and were lectured to, but good training incorporates talking, reflection, exercises, etc. Such training is less training and more about practicing the future.

Obviously, my training is like that: I aim to make my training courses a rehearsal for future actions. Actually, while I “sell” training I prefer to think of it as a rehearsal or kaikaku event – kaikaku events are also called a “kaizen blitz”; they are big change events from the people who brought you kaizen, more on them another time.

So when someone I’ve worked with turns around and says “Agile is obvious” I take it as a sign of success. They no longer see agile as something strange: it is normal, it is obvious.


Subscribe to my blog newsletter and download Project Myopia for Free

The post Is Agile is obvious? appeared first on Allan Kelly Associates.

September micro-workshops

Allan Kelly from Allan Kelly Associates

I’m running all three of my half-day workshops again in September: User Stories Masterclass, Agile Estimation & Forecasting, and Maximising value delivered.

You can read reviews of these workshops on the Allan Kelly agile training pages where you will find more details too.
Tickets are on-sale now with Tito – with a 20% early bird discount. (I’m using Tito this time because it promises to handle VAT better for those of you outside Europe.) If you have any problems with Tito or would prefer to receive an invoice contact me directly via e-mail, allan at allankelly dot net.

Blog readers can get another 15% off with the code “Blog15”.

Plus, if you book and pay for one workshop you will receive a code for 50% off the other workshops – buy one get two half price offer.

As before there are a few free tickets for the unemployed and furloughed. I might release more unemployed free tickets nearer the time so join the wait list if you are unemployed and miss out.

The post September micro-workshops appeared first on Allan Kelly Associates.