Greenback backup

Paul Grenyer from Paul Grenyer


Why

When Naked Element was still a thing, we used DigitalOcean almost exclusively for our clients’ hosting. For the sorts of projects we were doing it was the most straightforward and cost-effective solution. DigitalOcean provided managed databases, but there was no facility to automatically back them up. This led us to develop a Python-based program which was triggered once a day to perform the backup, push it to AWS S3 and send a confirmation or failure email.

We used Python due to familiarity, ease of use and low installation dependencies. I’ll demonstrate this later on in the Dockerfile. S3 was used for storage as DigitalOcean did not have their equivalent, ‘Spaces’, available in their UK data centre. The closest is in Amsterdam, but our clients preferred to have their data in the UK. 

Fast forward to May 2020 and I’m working on a personal project which uses a PostgreSQL database. I tried to use a combination of AWS and Terraform for the project’s infrastructure (as this is what I am using for my day job) but it just became too much effort to bend AWS to my will, and it’s also quite expensive. I decided to move back to DigitalOcean and got the equivalent setup sorted in a day. I could have taken advantage of AWS’ free tier for the database for 12 months, but AWS backup storage is not free and I wanted to keep as much as possible with one provider and within the same virtual private cloud (VPC).

I was back to needing my own backup solution. The new project I am working on uses Docker to run the main service. My Droplet (that’s what DigitalOcean calls its Linux server instances) setup is minimal: non-root user setup, firewall configuration and Docker install. The DigitalOcean Marketplace includes a Docker image so most of that is done for me with a few clicks. I could also have installed Python and configured a backup program to run each evening, but I’d also have to install the right version of the PostgreSQL client, which isn’t currently in the default Ubuntu repositories, so is a little involved. As I was already using Docker it made sense to create a new Docker image to install everything and run a Python program to schedule and perform the backups. Of course some might argue that a whole Ubuntu install and configure in a Docker image is a bit much for one backup scheduler, but once it’s done it’s done and it can easily be installed and run elsewhere as many times as needed.

There are two more decisions to note. My new backup solution uses DigitalOcean Spaces, as I’m not bothered about my data being in Amsterdam, and I haven’t implemented an email server yet so there are no notification emails. This resulted in me jumping out of bed as soon as I woke each morning to check Spaces to see if the backup had worked, rather than just checking for an email. It took two days to get it all working correctly!

What

I reached for Naked Element’s trusty Python backup program affectionately named Greenback after the arch enemy of Danger Mouse (Green-back up, get it? No, me neither…) but discovered it was too specific and would need some work, but would serve as a great template to start with.

It’s worth noting that I am a long way from a Python expert. I’m in the ‘reasonable working knowledge with lots of help from Google’ category. The first thing I needed the program to do was create the backup. At this point I was working locally where I had the correct PostgreSQL client installed, db_backup.py:

import os
import subprocess
from datetime import datetime

db_connection_string = os.environ['DATABASE_URL']

class GreenBack:
    def backup(self):
        # Name the backup file after the current date and time
        datestr = datetime.now().strftime("%d_%m_%Y_%H_%M_%S")
        backup_suffix = ".sql"
        backup_prefix = "backup_"

        destination = backup_prefix + datestr + backup_suffix
        backup_command = 'sh backup_command.sh ' + db_connection_string + ' ' + destination
        subprocess.check_output(backup_command.split(' '))
        return destination

I want to keep anything sensitive out of the code and out of source control, so I’ve brought in the connection string from an environment variable. The method constructs a filename based on the current date and time, calls an external bash script to perform the backup:

# connection string
# destination
pg_dump $1 > $2

and returns the backup file name. Of course for Ubuntu I had to make the bash script executable. Next I needed to push the backup file to Spaces, which means more environment variables:

region=''
access_key=os.environ['SPACES_KEY']
secret_access_key=os.environ['SPACES_SECRET']
bucket_url=os.environ['SPACES_URL']
backup_folder='dbbackups'
bucket_name='findmytea'

These give the program access to Spaces. Then another method performs the upload:

import boto3

class GreenBack:
    ...
    def archive(self, destination):
        # Create a session against the Spaces (S3-compatible) endpoint
        session = boto3.session.Session()
        client = session.client('s3',
                                region_name=region,
                                endpoint_url=bucket_url,
                                aws_access_key_id=access_key,
                                aws_secret_access_key=secret_access_key)

        # Upload the backup file, then remove the local copy
        client.upload_file(destination, bucket_name, backup_folder + '/' + destination)
        os.remove(destination)

It’s worth noting that DigitalOcean implemented the Spaces API to match the AWS S3 API so that the same tools can be used. The archive method creates a session and pushes the backup file to Spaces and then deletes it from the local file system. This is for reasons of disk space and security. A future enhancement to Greenback would be to automatically remove old backups from Spaces after a period of time.
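
That enhancement might look something like the sketch below, which reuses the configuration variables and boto3 client setup shown above; the prune method, its thirty-day retention default and the timezone-aware cutoff are assumptions made for illustration, not part of Greenback today:

import boto3
from datetime import datetime, timedelta, timezone

class GreenBack:
    ...
    def prune(self, days_to_keep=30):
        # Sketch only: delete backups in the Spaces folder older than the
        # retention period, using the same client configuration as archive().
        session = boto3.session.Session()
        client = session.client('s3',
                                region_name=region,
                                endpoint_url=bucket_url,
                                aws_access_key_id=access_key,
                                aws_secret_access_key=secret_access_key)

        cutoff = datetime.now(timezone.utc) - timedelta(days=days_to_keep)
        listing = client.list_objects_v2(Bucket=bucket_name, Prefix=backup_folder + '/')
        for backup in listing.get('Contents', []):  # 'Contents' is absent when the folder is empty
            if backup['LastModified'] < cutoff:
                client.delete_object(Bucket=bucket_name, Key=backup['Key'])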

The last thing the Python program needs to do is schedule the backups. A bit of Googling revealed an event loop which can be used to do this:

import asyncio
import logging

class GreenBack:
    last_backup_date = ""

    def callback(self, n, loop):
        # Run the backup the first time the callback fires on a new day
        today = datetime.now().strftime("%Y-%m-%d")
        if self.last_backup_date != today:
            logging.info('Backup started')
            destination = self.backup()
            self.archive(destination)

            self.last_backup_date = today
            logging.info('Backup finished')
        # Re-schedule this callback to run again in n seconds
        loop.call_at(loop.time() + n, self.callback, n, loop)
...

event_loop = asyncio.get_event_loop()
try:
    bk = GreenBack()
    bk.callback(60, event_loop)
    event_loop.run_forever()
finally:
    logging.info('closing event loop')
    event_loop.close()

On startup, callback is executed. It checks last_backup_date against the current date and, if they don’t match, runs the backup and updates last_backup_date. Whether or not a backup was run, the callback method is then re-added to the event loop with a one-minute delay. Calling event_loop.run_forever after the initial call to callback means the program waits forever and the process continues.

Now that I had a Python backup program I needed to create a Dockerfile that would be used to create a Docker image to setup the environment and start the program:

FROM ubuntu:xenial as ubuntu-env
WORKDIR /greenback

RUN apt update
RUN apt -y install python3 wget gnupg sysstat python3-pip

RUN pip3 install --upgrade pip
RUN pip3 install boto3 --upgrade
RUN pip3 install asyncio --upgrade

RUN echo 'deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main' > /etc/apt/sources.list.d/pgdg.list
RUN wget https://www.postgresql.org/media/keys/ACCC4CF8.asc
RUN apt-key add ACCC4CF8.asc

RUN apt update
RUN apt -y install postgresql-client-12

COPY db_backup.py ./
COPY backup_command.sh ./

ENTRYPOINT ["python3", "db_backup.py"]

The Dockerfile starts with an Ubuntu image. This is a bare bones, but fully functioning Ubuntu operating system. The Dockerfile then installs Python, its dependencies and the Greenback dependencies. Then it installs the PostgreSQL client, including adding the necessary repositories. Following that it copies the required Greenback files into the image and tells it how to run Greenback.

I like to automate as much as possible so while I did plenty of manual Docker image building, tagging and pushing to the repository during development, I also created a BitBucket Pipeline, which would do the same on every check in:

image: python:3.7.3

pipelines:
  default:
    - step:
          services:
            - docker
          script:
            - IMAGE="findmytea/greenback"
            - TAG=latest
            - docker login --username $DOCKER_USERNAME --password $DOCKER_PASSWORD
            - docker build -t $IMAGE:$TAG .
            - docker push $IMAGE:$TAG

Pipelines, BitBucket’s cloud-based Continuous Integration and Continuous Deployment feature, is familiar with Python and Docker, so it was quite simple to make it log in to Docker Hub, build, tag and push the image. To enable the pipeline all I had to do was add the bitbucket-pipelines.yml file to the root of the repository, check it in, follow the BitBucket pipeline process in the UI to enable it and then add the build environment variables so the pipeline could log into Docker Hub. I’d already created the image repository in Docker Hub.

The Greenback image shouldn’t change very often and there isn’t a straightforward way of automating the updating of Docker images from Docker Hub, so I wrote a bash script to do it, deploy_greenback:

sudo docker pull findmytea/greenback
sudo docker kill greenback
sudo docker rm greenback
sudo docker run -d --name greenback --restart always --env-file=.env findmytea/greenback:latest
sudo docker ps
sudo docker logs -f greenback

Now, with a single command I can fetch the latest Greenback image, stop and remove the currently running container, start a new container from the new image, list the running containers to reassure myself the new instance is running and follow the Greenback logs. When the latest image is run, the container is named for easy identification, configured to restart when the Docker service is restarted and told where to read the environment variables from. The environment variables are in a local file called .env:

DATABASE_URL=...
SPACES_KEY=...
SPACES_SECRET=...
SPACES_URL=https://ams3.digitaloceanspaces.com

And that’s it! Greenback is now running in a Docker image instance on the application server and backs up the database to Spaces just after midnight every night.

Finally

While Greenback isn’t a perfect solution, it works, is configurable, provides a good platform for future enhancements and should require minimal configuration to be used with other projects in the future.
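
One obvious enhancement would be to restore the notification emails the original Naked Element version sent. A minimal sketch of how that might look using Python’s standard smtplib is below; the SMTP_HOST, SMTP_USER and SMTP_PASSWORD environment variables and the addresses are placeholders, not anything Greenback currently defines:

import os
import smtplib
from email.message import EmailMessage

def notify(subject, body):
    # Sketch only: email a backup success/failure report via an SMTP relay.
    # SMTP_HOST, SMTP_USER and SMTP_PASSWORD are hypothetical settings.
    msg = EmailMessage()
    msg['Subject'] = subject
    msg['From'] = 'greenback@example.com'
    msg['To'] = 'admin@example.com'
    msg.set_content(body)

    with smtplib.SMTP(os.environ['SMTP_HOST'], 587) as smtp:
        smtp.starttls()
        smtp.login(os.environ['SMTP_USER'], os.environ['SMTP_PASSWORD'])
        smtp.send_message(msg)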

Greenback is checked into a public BitBucket repository and the full code can be found here:

https://bitbucket.org/findmytea/greenback/

The Greenback Docker image is in a public repository on Docker Hub and can be pulled with Docker:

docker pull findmytea/greenback

No replies to 135 research data requests: paper titles+author emails

Derek Jones from The Shape of Code

I regularly email researchers referring to a paper of theirs I have read, and asking for a copy of the data to use as an example in my evidence-based software engineering book; of course their work is cited as the source.

Around a third of emails don’t receive any reply (a small number ask why they should spend time sorting out the data for me, and I wrote a post to answer this question). If there is no reply after roughly six months, I follow up with a reminder, saying that I am still interested in their data (maybe 15% respond). If the data looks really interesting, I might email again after 6-12 months (I have outstanding requests going back to 2013).

I put some effort into checking that a current email address is being used. Sometimes the work was done by somebody who has moved into industry, and if I cannot find what looks like a current address I might email their supervisor.

I have had replies to later emails, apologizing, saying that the first email was caught by their spam filter (the number of links in the email template was reduced to make it look less like spam). Sometimes the original email never percolated to the top of their todo list.

There are around 135 unreplied email requests (the data was automatically extracted from my email archive and is not perfect); the list of papers is below (the title is sometimes truncated because of the extraction process).

Given that I have collected around 620 software engineering datasets (there are several ways of counting a dataset), another 135 would make a noticeable difference. I suspect that much of the data is now lost, but even 10 more datasets would be nice to have.

After the following list of titles is a list of the 254 authors’ last known email addresses. If you know any of these people, please ask them to get in touch.

If you are an author of one of these papers: ideally send me the data, otherwise email to tell me the status of the data (I’m summarising responses, so others can get some idea of what to expect).

50 CVEs in 50 Days: Fuzzing Adobe Reader
A Change-Aware Per-File Analysis to Compile Configurable Systems
A Design Structure Matrix Approach for Measuring Co-Change-Modularity
A Foundation for the Accurate Prediction of the Soft Error
AGENT-BASED SIMULATION OF THE SOFTWARE DEVELOPMENT PROCESS: A CASE STUDY
A Large Scale Evaluation of Automated Unit Test Generation Using
A large-scale study of the time required to compromise
A Large-Scale Study On Repetitiveness, Containment, and
Analysing Humanly Generated Random Number Sequences: A Pattern-Based
Analysis of Software Aging in a Web Server
Analyzing and predicting effort associated with finding & fixing
Analyzing CAD competence with univariate and multivariate
Analyzing Differences in Risk Perceptions between Developers
Analyzing the Decision Criteria of Software Developers Based on
An analysis of the effect of environmental and systems complexity on
An Empirical Analysis of Software-as-a-Service Development
An Empirical Comparison of Forgetting Models
An empirical study of the textual similarity between
An error model for pointing based on Fitts' law
An Evolutionary Study of Linux Memory Management for Fun and Profit
An examination of some software development effort and
An Experimental Survey of Energy Management Across the Stack
Anomaly Trends for Missions to Mars: Mars Global Surveyor
A Quantitative Evaluation of the RAPL Power Control System
Are Information Security Professionals Expected Value Maximisers?:
A replicated and refined empirical study of the use of friends in
ARRAY LAYOUTS FOR COMPARISON-BASED SEARCHING
A Study of Repetitiveness of Code Changes in Software Evolution
A Study on the Interactive Effects among Software Project Duration, Risk
Bias in Proportion Judgments: The Cyclical Power Model
Capitalization of software development costs
Configuration-aware regression testing: an empirical study of sampling
Cost-Benefit Analysis of Technical Software Documentation
Decomposing the problem-size effect: A comparison of response
Determinants of vendor profitability in two contractual regimes:
Diagnosing organizational risks in software projects:
Early estimation of users’ perception of Software Quality
MEASURING USER’S PERCEPTION AND OPINION OF SOFTWARE QUALITY
Empirical Analysis of Factors Affecting Confirmation
Estimating Agile Software Project Effort: An Empirical Study
Estimating computer depreciation using online auction data
Estimation fulfillment in software development projects
Ethical considerations in internet code reuse: A
Evaluating. Heuristics for Planning Effective and
Evaluating Pair Programming with Respect to System Complexity and
Evidence-Based Decision Making in Lean Software Project Management
Explaining Multisourcing Decisions in Application Outsourcing
Exploring defect correlations in a major. Fortran numerical library
Extended Comprehensive Study of Association Measures for
Eye gaze reveals a fast, parallel extraction of the syntax of
Factorial design analysis applied to the performance of
Frequent Value Locality and Its Applications
Historical and Impact Analysis of API Breaking Changes:
How do i know whether to trust a research result?
How do OSS projects change in number and size?
How much is “about” ? Fuzzy interpretation of approximate
Humans have evolved specialized skills of
Identifying and Classifying Ambiguity for Regulatory Requirements
Identifying Technical Competences of IT Professionals. The Case of
Impact of Programming and Application-Specific Knowledge
Individual-Level Loss Aversion in Riskless and Risky Choices
Industry Shakeouts and Technological Change
Inherent Diversity in Replicated Architectures
Initial Coin Offerings and Agile Practices
Interpreting Gradable Adjectives in Context: Domain
Is Branch Coverage a Good Measure of Testing Effectiveness?
JavaScript Developer Survey Results
Knowledge Acquisition Activity in Software Development
Language matters
Learning from Evolution History to Predict Future Requirement Changes
Learning from Experience in Software Development:
Learning from Prior Experience: An Empirical Study of
Links Between the Personalities, Views and Attitudes of Software Engineers
Making root cause analysis feasible for large code bases:
Making-Sense of the Impact and Importance of Outliers in Project
Management Aspects of Software Clone Detection and Analysis
Managing knowledge sharing in distributed innovation from the
Many-Core Compiler Fuzzing
Measuring Agility
Mining for Computing Jobs
Mining the Archive of Formal Proofs.
Modeling Readability to Improve Unit Tests
Modeling the Occurrence of Defects and Change
Modelling and Evaluating Software Project Risks with Quantitative
Moore’s Law and the Semiconductor Industry: A Vintage Model
More Testers – The Effect of Crowd Size and Time Restriction in
Motivations for self-assembling into project teams
Networks, social influence and the choice among competing innovations:
Nonliteral understanding of number words
Nonstationarity and the measurement of psychophysical response in
Occupations in Information Technology
On information systems project abandonment
On the Positive Effect of Reactive Programming on Software
ON THE USE OF REPLACEMENT MESSAGES IN API DEPRECATION:
On Vendor Preferences for Contract Types in Offshore Software Projects:
Peer Review on Open Source Software Projects:
Parameter-based refactoring and the relationship with fan-in/fan-out
Participation in Open Knowledge Communities and Job-Hopping:
Pipeline management for the acquisition of industrial projects
Predicting the Reliability of Mass-Market Software in the Marketplace
Prototyping A Process Monitoring Experiment
Quality vs risk: An investigation of their relationship in
Quantitative empirical trends in technical performance
Reported project management effort, project size, and contract type.
Reproducible Research in the Mathematical Sciences
Semantic Versioning versus Breaking Changes
Software Aging Analysis of the Linux Operating System
Software reliability as a function of user execution patterns
Software Start-up failure An exploratory study on the
Spatial estimation: a non-Bayesian alternative
System Life Expectancy and the Maintenance Effort: Exploring
Testing as an Investment
The enigma of evaluation: benefits, costs and risks of IT in
THE IMPACT OF PLANNING AND OTHER ORGANIZATIONAL FACTORS
The impact of size and volatility on IT project performance
The Influence of Size and Coverage on Test Suite
The Marginal Value of Increased Testing: An Empirical Analysis
The nature of the times to flight software failure during space missions
Theoretical and Practical Aspects of Programming Contest Ratings
The Performance of the N-Fold Requirement Inspection Method
The Reaction of Open-Source Projects to New Language Features:
The Role of Contracts on Quality and Returns to Quality in Offshore
The Stagnating Job Market for Young Scientists
Time Pressure — A Controlled Experiment of Test-case Development and
Turnover of Information Technology Professionals:
Unconventional applications of compiler analysis
Unifying DVFS and offlining in mobile multicores
Use of Structural Equation Modeling to Empirically Study the Turnover
Use Two-Level Rejuvenation to Combat Software Aging and
Using Function Points in Agile Projects
Using Learning Curves to Mine Student Models
Virtual Integration for Improved System Design
Which reduces IT turnover intention the most: Workplace characteristics
Why Did Your Project Fail?
Within-Die Variation-Aware Dynamic-Voltage-Frequency

Author emails (automatically extracted and manually checked to remove people who have replied on other issues; I hope I have caught them all).

Aaron.Carroll@nicta.com.au   abaker@ucar.edu   abd_elzamly@yahoo.com
actjn@siu.edu   agopal@rhsmith.umd.edu   akbar.namin@ttu.edu
aken@nsuok.edu   akmassey@umbc.edu   alessandro.murgia@uantwerpen.be
alexander.budzier@sbs.ox.ac.uk   alinebrito@dcc.ufmg.br   allen.p.nikora@jpl.nasa.gov
Allen.P.Nikora@jpl.nasa.gov   Altaf.Ahmad@asu.edu   Ana.Aizcorbe@bea.gov
angel.garcia@uc3m.es   anhnt@iastate.edu   a.pinna@diee.unica.it
arho.suominen@vtt.fi   arie.vandeursen@tudelft.nl   asang@ntu.edu.sg
austen.rainer@canterbury.ac.nz   awfboh@ntu.edu.sg   bent.flyvbjerg@sbs.ox.ac.uk
bf@ul.ie   bjg@empiricalreality.com   bojan.spasic@avl.com
bramesh@gsu.edu   brent@cosc.canterbury.ac.nz   brent.martin@canterbury.ac.nz
briand@simula.no   brian.fitzgerald@lero.ie   bronevetsky1@llnl.gov
burairah@utem.edu.my   calikli@chalmers.se   canton@mnec.gr
cc05@vokac.org   celio.santana@gmail.com   cguo13@hawk.iit.edu
charngda@ccr.buffalo.edu   charngdalu@yahoo.com   chenyy@comp.nus.edu.sg
chris.sauer@sbs.ox.ac.uk   christian.korunka@univie.ac.at   christopher.lidbury10@imperial.ac.uk
clitecky@business.siu.edu   cmagee@mit.edu   corey.phelps@mcgill.ca
cotroneo@unina.it   cthompson@cs.berkeley.edu   dagsj@ifi.uio.no
daniela.munteanu@univ-provence.fr   daniel.milroy@colorado.edu   dan@silverthreadinc.com
david@merobe.com   david.nembhard@oregonstate.edu   der.herr@hofr.at
dgrtwo@princeton.edu   dhkim@astate.edu   director@scit.edu
discy@nus.edu.sg   djl68@pitt.edu   dlautner@hawk.iit.edu
dport@hawaii.edu   dprtchan@nus.edu.sg   dredman@avsi.aero
drobinson@stackoverflow.com   dskusumo.itt@gmail.com   dwheeler@ida.org
eherrman@eva.mpg.de   Enrique.Dans@ie.edu   erik.arisholm@testify.no
ermira.daka@sheffield.ac.uk   etovar@fi.upm.es   fjshull@sei.cmu.edu
foreverheart9@gmail.com   founders@triplebyte.com   fschweitzer@ethz.ch
ghs2@psu.edu   gleison.brito@dcc.ufmg.br   glpkm@hotmail.com
gordon.fraser@uni-passau.de   greg@bronevetsky.com   gul.calikli@gu.se
guschroko@student.gu.se   hankhoffmann@cs.uchicago.edu   hannes.holm@foi.se
hannes.holm@ics.kth.se   hata@is.naist.jp   hbarth@wesleyan.edu
hello@ponyfoo.com   hiroshi.igaki@oit.ac.jp   hirtle@pitt.edu
hoan@iastate.edu   hora@dcc.ufmg.br   hrideshg@iastate.edu
huang@umd.edu   huazhe@cs.uchicago.edu   hwu28@hawk.iit.edu
ichischneider@gmail.com   I.Deary@ed.ac.uk   ilaria.lunesu@diee.unica.it
info@targetprocess.com   james@jpallister.com   jarmo.ahonen@uef.fi
jasmin.blanchette@mpi-inf.mpg.de   jasonweiyi@gmail.com   javier.alonso@duke.edu
jean-luc.autran@univ-provence.fr   jfmendes@ua.pt   jgo@ua.pt
jianh@illinois.edu   jimbo@business.siu.edu   jmunson@uidaho.edu
jo-anne.lefevre@carleton.ca   john.krogstie@ntnu.no   john.zhang@business.uconn.edu
jordan.weissmann@slate.com   jose.campos@sheffield.ac.uk   josephborel@aol.com
jselby@maplesoft.com   jselby@maplesoft.com   June.Verner@gmail.com
junyang@engr.pitt.edu   justinek@alumni.stanford.edu   justin.hollands@drdc-rddc.gc.ca
j.visser@sig.eu   kaisa.still@vtt.fi   kantor@cs.technion.ac.il
kevin.mcdaid@dkit.ie   kewusi@lmu.edu   klaas-jan.stol@lero.ie
K.Markantonakis@rhul.ac.uk   konstantinos.chronis@gmail.com   ktrivedi@duke.edu
laertexavier@dcc.ufmg.br   larissanadja@copin.ufcg.edu.br   lcao@odu.edu
leo@susaventures.com   lionel.briand@uni.lu   lsarigia@pme.duth.gr
lucia.2009@smu.edu.sg   magnus@magnusdettmar.com   mail@kaidence.org
ma.khan@uleth.ca   manuel.oriol@ch.abb.com   marc.schulz@rwth-aachen.de
Marek@gryting.biz   marie-jeanne.lesot@lip6.fr   mariusz.musial@ericpol.com
maruyama@atr.jp   matthias.biggeleben@open-xchange.com   matthias.stuermer@iwi.unibe.ch
mcknight@bus.msu.edu   mdettmar@deloitte.com   mdettmar@deloitte.se
melanie@cs.columbia.edu   Michael.english@lero.ie   michael.english@ul.ie
michael.grottke@fau.de   Michael.Grottke@wiso.uni-erlangen.de   Michael@targetprocess.com
mika.mantyla@aalto.fi   mika.mantyla@oulu.fi   mingshu@iscas.ac.cn
mischael.schill@inf.ethz.ch   misof@ksp.sk   mjaber@ryerson.ca
monica.pais@ifgoiano.edu.br   monicaspais@gmail.com   morin@scs.carleton.ca
mschermann@scu.edu   mtov@dcc.ufmg.br   mzhu@ets.org
ncerpa@utalca.cl   Neil.Stewart@warwick.ac.uk   Nelson.W.Green@jpl.nasa.gov
nick.wells@jobstats.co.uk   o.alexy@tum.de   oliver.krancher@iwi.unibe.ch
Oliver.Laitenberger@horn-company.de   olivier.gendreau@polymtl.ca   paula.j.savolainen@uef.fi
paulmcb@seas.upenn.edu   paul@strassmann.com   pchatzog@pme.duth.gr
perry@mail.utexas.edu   philippe.roche@st.com   phoonakker@cqpi.engr.wisc.edu
pierre.robillard@polymtl.ca   ploaiza@lsm.in2p3.fr   P.Love@curtin.edu.au
pokech@uonbi.ac.ke   psidhu@cmu.edu   pvk@pvk.ca
pyzychen@gmail.com   ren@iit.edu   rh13@aub.edu.lb
ricardo.colomo@uc3m.es   rkiyer@illinois.edu   robert.benkoczi@uleth.ca
roberto.natella@unina.it   roberto.pietrantuono@unina.it   salvaneschi@cs.tu-darmstadt.de
saurabh.dighe@intel.com   sdorogov@ua.pt   sebastien.lefort@lip6.fr
sebastien.sauze@l2mp.fr   shaji@scit.edu   shilin@itechs.iscas.ac.cn
show@um.edu.my   siegfrie@adelphi.edu   simona.ibba@diee.unica.it
simon.gaechter@nottingham.ac.uk   simonk@rpi.edu   simvrh@gmail.com
sl@monochromata.de   soenke.albers@the-klu.org   songxue@microsoft.com
s.raemaekers@sig.eu   sriram.vangal@intel.com   ssg@engr.uconn.edu
stavrino@eap.gr   stavrino@gmail.com   stefan@garage-coding.com
sterusso@unina.it   steve.a.shogren@gmail.com   svkbharathi@scit.edu
swilson@tcd.ie   tamada@cse.kyoto-su.ac.jp   tien@iastate.edu
tien.n.nguyen@utdallas.edu   tjleffel@gmail.com   tkabdelh@nps.edu
tsunoda@info.kindai.ac.jp   tung@iastate.edu   victoria@stodden.net
wangyi@us.ibm.com   William.L.Taber@jpl.nasa.gov   wmhan@takming.edu.tw
wobbrock@uw.edu   wq@itechs.iscas.ac.cn   xenos@eap.gr
xhua@hawk.iit.edu   xiao.qu@us.abb.com   yanglusi@comp.nus.edu.sg
ychen200@cba.ua.edu   yi.wang@rit.edu   yiw@ics.uci.edu
yoaval@checkpoint.com   zhangx@nku.edu   zhij@cs.toronto.edu
Zhongju.Zhang@asu.edu   zibran@cs.uno.edu

Algorithms are now commodities

Derek Jones from The Shape of Code

When I first started writing software, developers had to implement most of the algorithms they used; yes, hardware vendors provided libraries, but the culture was one of self-reliance (except for maths functions, which were technical and complicated).

Developers read Donald Knuth’s The Art of Computer Programming, it was the reliable source for step-by-step algorithms. I vividly remember seeing a library copy of one volume, where somebody had carefully hand-written, in very tiny letters, an update to one algorithm, and glued it to the page over the previous text.

Algorithms were important because computers were not yet fast enough to solve common problems at an acceptable rate; developers knew the time taken to execute common instructions and instruction timings were a topic of social chit-chat amongst developers (along with the number of registers available on a given cpu). Memory capacity was often measured in kilobytes, every byte counted.

This was the age of the algorithm.

Open source commoditized algorithms, and computers got a lot faster with memory measured in megabytes and then gigabytes.

When it comes to algorithm implementation, developers are now spoilt for choice; why waste time implementing the ‘low level’ stuff when there are plenty of other problems waiting to be solved?

Algorithms are now like the bolts in a bridge: very important, but nobody talks about them. Today developers talk about story points, features, business logic, etc. Given a well-defined problem, many are now likely to search for an existing package, rather than write code from scratch (I certainly work this way).

New algorithms are still being invented, and researchers continue to look for improvements to existing algorithms. This is a niche activity.

There are companies where algorithms are not commodities. Google operates on a scale where what appears to others as a small improvement can save the company millions (purely because a small percentage of a huge amount can be a lot). A company’s core competency may include an algorithmic component (whose non-commodity nature gives the company its edge over the competition), while its non-core work treats algorithms as a commodity.

Knuth’s The Art of Computer Programming played an important role in making viable algorithms generally available; while the volumes are frequently cited, I suspect they are rarely read (I have not taken any of my three volumes off the shelf, to read, for years).

A few years ago, I suddenly realised that the book about software engineering I was working on not only did not contain an algorithms chapter, but that its 103 uses of the word algorithm all refer to it as a concept.

Today, we are in the age of the ecosystem.

Algorithms have not yet completed their journey to obscurity, which has to wait until people can tell computers what they want and not be concerned about the implementation details (or genetic algorithm programming gets a lot better).

Beta Animals – a.k.

a.k. from thus spake a.k.

Several years ago we took a look at the gamma function Γ, which is a generalisation of the factorial to non-integers, being equal to the factorial of a non-negative integer n when passed an argument of n+1 and smoothly interpolating between them. Like the normal cumulative distribution function Φ, it and its related functions are examples of special functions; so named because they frequently crop up in the solutions to interesting mathematical problems but can't be expressed as simple formulae, forcing us to resort to numerical approximation.
This time we shall take a look at another family of special functions derived from the beta function B.
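
For reference, the beta function being referred to has the standard definition

\[
\mathrm{B}(a, b) = \int_0^1 t^{a-1} (1 - t)^{b-1} \,\mathrm{d}t = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}
\]

which ties it to the gamma function described above (for which \(\Gamma(n+1) = n!\) for non-negative integers \(n\)); this is a textbook identity rather than anything specific to the article.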

I’m an Agile Guide

Allan Kelly from Allan Kelly Associates


Anyone who keeps a keen eye on LinkedIn might have noticed I recently changed my job description to Agile Guide. I feel “guide” more accurately reflects what I do: part coach, part advisor, part teacher.

I work as a consultant – a hired gun – but “consultant” is a very vague term and covers a lot of ground. Plus a lot of people in the technology industry have a very negative view of consultants. I’ve been known to share that view myself so while consultant might be an accurate description it was also vague and open to misinterpretation.

Many people consider me an Agile Coach, and I have worked as an agile coach. However – as I’ve written before – this too is a conflicted term. Most of us who go by the title “agile coach” like to talk about helping people help themselves, unlocking the individual, respecting the individual as the expert, and so on. I agree with a lot of that and I do it. Sometimes.

I also know what professional coaches do and I don’t feel I’m one of them. I have a lot of respect for real coaches. Such coaches put their own opinions second and I don’t. I am prepared to tell people the way I think it should be – they are free to ignore my advice but I’m prepared to say it.

That’s why I also regard myself as part teacher: not just direct training sessions (which I do) but also one-on-one and small free-format group sessions.

So what title should I use?

I’ve struggled with this for years. My epiphany came a few weeks ago: Agile guide. I help others to get more agile, coaching is one tool but so is direct advice and teaching.

Hadn’t others thought of “Agile Guide”? So I checked LinkedIn myself. One person. Someone I respect, someone I call a friend: Woody Zuill.

I checked in with Woody and his thinking parallels mine.

So I’m an Agile Guide – I help individuals, teams and enterprises become more agile in a digital world.

Part coach, part advisor, part teacher, plus thinker and route finder. I use skills of coaching, teaching and consultancy.

Who knows, maybe, it will catch on. After all, as Woody pointed out, we have both changed the world already.



beta: Evidence-based Software Engineering – book

Derek Jones from The Shape of Code

My book, Evidence-based software engineering: based on the publicly available data is now out on beta release (pdf, and code+data). The plan is for a three-month review, with the final version available in the shops in time for Christmas (I plan to get a few hundred printed, and made available on Amazon).

The next few months will be spent responding to reader comments, and adding material from the remaining 20 odd datasets I have waiting to be analysed.

You can either email me with any comments, or add an issue to the book’s Github page.

While the content is very different from my original thoughts, 10-years ago, the original aim of discussing all the publicly available software engineering data has been carried through (in some cases more detailed data, in greater quantity, has supplanted earlier less detailed/smaller datasets).

The aim of only discussing a topic if public data is available has been slightly bent in places (because I thought data would turn up and it didn’t, or I wanted to connect two datasets, or I have not yet deleted what has been written).

The outcome of these two aims is that the flow of discussion is very disjoint, even disconnected. Another reason might be that I have not yet figured out how to connect the material in a sensible way. I’m the first person to go through this exercise, so I have no idea where it’s going.

The roughly 620+ datasets are three to four times more than I thought were publicly available. More data is good news, but it required more time to analyse and discuss.

Depending on the quantity of issues raised, updates of the beta release will happen.

As always, if you know of any interesting software engineering data, please tell me.

The future of office working

Allan Kelly from Allan Kelly Associates

[Photo: the author’s garden office, used as his home office]

The lockdown hasn’t affected my work environment that much. This is a picture of my home office, my garden office, my “man cave.” I am very lucky. When I’m not on site with clients I spend most of my time working here. Still, I miss visiting clients and conferences – although I did squeeze in Agile on the Beach New Zealand in the closing days of the old world.

A friend of mine works for Canonical – the Ubuntu Linux people. The vast majority of the people at Canonical work remotely, so while my friend lives in New Zealand he works with a team spread out across the world.

Once or twice a year, his whole team co-locates. They all fly together and work together for a couple of weeks, a sprint. Then they return home and work remotely with each other for another six months, then repeat. Once or twice I’ve seen Tim as he works in, or passes through, London. But as luck would have it the week I was in New Zealand he was on his travels.

In all this talk of how wonderful working from home can be, I don’t think there has been enough discussion about what makes it a good experience and what makes it a bad one. The Canonical story highlights a couple of points which I think are missing from much of the current discussion of remote/home working.

Firstly the team are equal: they are all remote.

Over the years I have observed big differences between the way teams which are all distributed operate and the way teams with only one or two remote members operate. When the bulk of a team are co-located and only one or two are remote there is an unequal relationship. My intuition says teams are better off when they are either all remote or all co-located. When five of the team are co-located and a sixth member is remote then things are more difficult.

The way lockdown came about this year teams went from one to the other overnight, literally. Which also means it was very egalitarian. Everyone was in the same position. That can only be a good thing.

The second thing I take from the Canonical story is that face-to-face time is still valued. In fact, as we unlock I expect home working will be more common and more remote tools will be used, but we will value co-locating and face-to-face contact all the more.

Where companies keep people remote I expect we will see the emergence of deliberate co-location. This might be a two week block like my friend or it might be a few hours or a whole day. I can easily imagine benefit from an agile team coming together for one day a fortnight to close out a sprint, run a retrospective and then plan the next sprint. (That will have a knock on effect in the office market as companies stop renting offices for years and rent meeting rooms by the day.)

In fact this will be essential. It is easier to build social capital – trust, comradeship – when you are face-to-face. For the last three months we have been running on the social capital and experience teams built up over years of working together. The social capital account now needs topping up.

Existing teams, existing social capital and egalitarian distribution helped people work during the lockdowns. As we slowly move back to our offices that is going to change and we will need to make deliberate efforts to replenish social capital, keep people equal and build new teams.

New teams will benefit from working together face-to-face to get to know one another. Build some social capital and norms.

New employees in particular will need time together. Especially those, like fresh graduates, who are new to work altogether. Such people need to learn how to work. Doing that in a distributed environment is a challenge.

Those new to the workforce face an additional challenge: space.

Home working is OK for me, I have a garden big enough for an office (and I had the money to buy one). How many people have been working off their kitchen table? Or in their bedroom? Mixing your sleeping space with your work space can damage sleep patterns and add to stress.

How many new workers have that space?

I remember when I was a fresh graduate finding my way in the world I lived in house shares. I had a room in a shared house with similar people. Some of them I counted as friends but not all of them. Some of them had bad habits. Going to an office for work was important, staying in the house – especially when they were all staying in! – would have been too much.

Then there is the question of broadband: I have 200Mb fibre to the door but many people make do with a lot less and that can be variable. Not that 200Mb comes cheap…

If a company expects someone to work from home then I think it should pay for good broadband internet. It should also provide the equipment – I have heard of only a few employers who have given staff money to buy equipment for home.

In February and March many people all over the world found themselves having to work from home. It’s been a better experience than many expected, but I know people are flagging, and I know people keen to get back to the office. In future I expect we’ll value face-to-face contact even as many people stay at home.

In the coming months people aren’t just going to walk back to the office and pick up where they left off. Work will be more varied. But now that the initial pleasure and surprise of working from home is over, we need to have a conversation about what companies need to do to support working from home in future and how it can be made sustainable.



Time value profiles

Allan Kelly from Allan Kelly Associates

[Figure: a time-value profile, showing value declining as delivery is delayed]

User Stories Masterclass online, 6 July NOW BOOKING – Code BLOG_READER for 30% discount + 25% Early Bird discount

New: Agile Estimating & Forecasting, 13 July NOW BOOKING – Code BLOG_READER for 45% discount + 25% Early Bird discount

The picture above is a time-value profile: it shows how value changes with time. It is a graphic illustration of cost of delay.

A fine wine might increase in value over time but most things – think product, project, feature or just story – decay with time. Having something today is worth more than having it next week.

I invented time-value profiles – although I’m happy to acknowledge Don Reinertsen’s influence. I’ve included time-value profiles in many presentations and courses (they are a key part of my value workshop) but oddly, while I’ve mentioned them in this blog before, I’ve never described them. So here goes…

Imagine we want to build a feature for a product. Naturally we ask: “what is this worth?”

Money is the obvious way to measure value but strictly speaking money is not itself valuable – unless you happen to want small colourful pieces of paper or decorated lumps of metal. Money is a store of value. The value of money is not the money itself but what you can exchange the money for. And because money can be traded for a wide variety of things, which are themselves valuable, money is a useful medium for comparison and measurement.

So the question “what is this worth?” may be answered qualitatively (“a vaccine is valuable because it saves lives”) or quantitatively (“a vaccine is worth $10 trillion because it allows life to return to normal”). In order to compare competing opportunities and valuations, and in order to draw a graph, giving value a numerical quantity helps greatly.

A time-value profile shows quantitative value over time when value is measured numerically: maybe in hard money like dollars or yen, or an abstract measure like business points, wooden dollars or Atlantic shillings (I just invented that but it works).

The graph starts today: we say “If we had feature X today it would be worth 100,000 shillings”. Maybe it is worth 100,000 because that is what a customer would pay for it, or maybe because we could sell 100,000 units at 1 shilling each, or so on.

But we don’t have X today. “If we get feature X next month it will be worth 90,000 shillings.” One month delay, one month late to the customer, one month later on Amazon, costs 10,000 shillings.

“If feature X is 3 months away then it is worth less than 50,000 shillings.” And so on.
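
To make the arithmetic concrete, the figures quoted above can be dropped into a small sketch; the month two and month three values below are invented for the example, and cost of delay is simply the value lost by not having the feature today:

# Illustrative time-value profile in shillings; the month 0 and 1 figures
# come from the text above, the later months are assumed for the example.
time_value = {0: 100_000, 1: 90_000, 2: 70_000, 3: 45_000}

def cost_of_delay(profile, months_late):
    # Value lost by delivering months_late months from now rather than today
    return profile[0] - profile[months_late]

print(cost_of_delay(time_value, 1))  # 10000: one month's delay costs 10,000 shillings
print(cost_of_delay(time_value, 3))  # 55000: three months' delay costs 55,000 shillings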

Now, the unit of value – dollars, francs, shillings, wood – is of little importance. Sure, $1,000,000 is very different to ₽1,000,000 (Russian roubles if you don’t know), but as long as you don’t mix currencies the actual currency is unimportant.

What is important is the shape of the curve and, especially, where abrupt changes happen. Look again at the graph above: between months two and three there is a sudden drop in value. That has scheduling implications.

Once you start to think about time-value profiles then it becomes obvious that value is a function of time and we need to understand what that function looks like for any given work – project, product, initiative, feature, story, just anything in fact.

It should also become clear that the question “how long will it take to build X?” needs to be inverted: “how long have we got to build X?”, “how much of X could we build?” or “in the time we have, what could we create to satisfy need X?”

And then “how much of the available value can we capture?”. Having X might be worth 100,000 but having a half of X might still be worth 50,000 more than not having X.

As I’ve written before: to any given problem there are multiple solutions. Engineering is not about creating the best possible solution, it is about creating a solution within constraints – as my widgets exercise shows.

Add in capacity planning and a whole new paradigm of scheduling opens up.

Not that I wish to ignore costs – and effort estimates – but they are secondary, and the subject of another blog. I’ll write more about this, and perhaps put something into a workshop, in the meantime my value workshop is the best place to find out more.



How should involved if-statement conditionals be structured?

Derek Jones from The Shape of Code

Which of the following two if-statements do you think will be processed by readers in less time, and with fewer errors, when given the value of x, and asked to specify the output?

// First - sequence of subexpressions
if (x > 0 && x < 10 || x > 20 && x < 30)
   print("a");
else
   print("b");

// Second - nested ifs
if (x > 0 && x < 10)
   print("c");
else if (x > 20 && x < 30)
   print("d");
else
   print("e");

Ok, the behavior is not identical, in that the else if-arm produces different output than the preceding if-arm.

The paper Syntax, Predicates, Idioms — What Really Affects Code Complexity? analyses the results of an experiment that asked this question, including more deeply nested if-statements, the use of negation, and some for-statement questions (this post only considers the number of conditions/depth of nesting components). A total of 1,583 questions were answered by 220 professional developers, with 415 incorrect answers.

Based on the coefficients of regression models fitted to the results, subjects processed the nested form both faster and with fewer incorrect answers (code+data). As expected performance got slower, and more incorrect answers given, as the number of intervals in the if-condition increased (up to four in this experiment).

I think short-term memory is involved in this difference in performance; or at least I can concoct a theory that involves a capacity-limited memory. Comprehending an expression (such as the conditional in an if-statement) requires maintaining information about the various components of the expression in working memory. When the first subexpression of x > 0 && x < 10 || x > 20 && x < 30 is false, and the subexpression after the || is processed, there is no ‘now forget what went before’ point like there is for the nested if-statements. I think that the single expression form consumes more working memory than the nested form.

Does the result of this experiment (assuming it is replicated) mean that developers should be recommended to write sequences of conditions (e.g., the first if-statement example) as something like:

if (x > 0 && x < 10)
   print("a");
else if (x > 20 && x < 30)
   print("a");
else
   print("b");

Duplicating code is not good, because both arms have to be kept in sync; ok, a function could be created, but this is extra effort. As other factors are taken into account, the costs of the nested form start to build up; is the benefit really worth the cost?

Answering this question is likely to need a lot of work, and it would be a more efficient use of resources to address questions about more commonly occurring conditions first.

A commonly occurring use is testing a single range; some of the ways of writing the range test include:

if (x > 0 && x < 10) ...

if (0 < x && x < 10) ...

if (10 > x && x > 0) ...

if (x > 0 && 10 > x) ...

Does one way of testing the range require less effort for readers to comprehend, and be more likely to be interpreted correctly?

There have been some experiments showing that people are more likely to give correct answers to questions involving information expressed as linear syllogisms, if the extremes are at the start/end of the sequence, such as in the following:

     A is better than B
     B is better than C

and not the following (which got the lowest percentage of correct answers):

     B is better than C
     B is worse than A

Your author ran an experiment to find out whether developers were more likely to give correct answers for particular forms of range tests in if-conditions.

Out of a total of 844 answers, 40 were answered incorrectly (roughly one per subject; it was a paper and pencil experiment, so no timings). It's good to see that the subjects were so competent, but with so few mistakes made the error bars are very wide, i.e., too few mistakes were made to be able to say that one representation was less mistake-prone than another.

I hope this post has got other researchers interested in understanding developer performance when processing if-statements, and that they will be running more experiments to help shed light on the processes involved.

Further On A Very Cellular Process – student

student from thus spake a.k.

You will no doubt recall my telling you of my fellow students' and my latest pastime of employing Professor B------'s Experimental Clockwork Mathematical Apparatus to explore the behaviours of cellular automata, which may be thought of as simplistic mathematical simulacra of animalcules such as amoebas.
Specifically, if we put together an infinite line of imaginary boxes, some of which are empty and some of which contain living cells, then we can define a set of rules to determine whether or not a box will contain a cell in the next generation depending upon its own, its left and its right neighbours contents in the current one.
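
For readers who would rather see the rules in code than in clockwork, here is a minimal sketch of the kind of one-dimensional automaton being described; the use of Wolfram's rule-numbering convention, and rule 30 in particular, is an assumption made for the example rather than anything stated above:

def step(cells, rule=30):
    # One generation of an elementary cellular automaton: each box's next
    # state is looked up from the states of its left neighbour, itself and
    # its right neighbour. Boxes beyond the ends are treated as empty.
    padded = [0] + cells + [0]
    return [(rule >> ((padded[i - 1] << 2) | (padded[i] << 1) | padded[i + 1])) & 1
            for i in range(1, len(padded) - 1)]

# A single live cell in the middle of a row of empty boxes
row = [0] * 15 + [1] + [0] * 15
for _ in range(8):
    print(''.join('#' if cell else '.' for cell in row))
    row = step(row)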