Slides: Snake in KotlinJS slides
I’ve blogged about setting up a WireGuard VPN server earlier this year. It’s been running well since, but I needed to take care of some overdue maintenance tasks. Trying to log into the server this morning and I am greeted with “no route to host”. Eh? A quick check on my Vultr UI showed that […]
tldr; A paper makes various claims based on suspect data. A replication finds serious problems with the data extraction and analysis. A rebuttal paper spins the replication issues as being nothing serious, and actually validating the original results, i.e., the rebuttal is all smoke and mirrors.
When I first saw the paper: A Large-Scale Study of Programming Languages and Code Quality in Github, the pdf almost got deleted as soon as I started scanning the paper; it uses number of reported defects as a proxy for code quality. The number of reported defects in a program depends on the number of people using the program, more users will generate more defect reports. Unfortunately data on the number of people using a program is extremely hard to come by (I only know of one study that tried to estimate number of users); studies of Java have also found that around 40% of reported faults are requests for enhancement. Most fault report data is useless for the model building purposes to which it is put.
Two things caught my eye, and I did not delete the pdf. The authors have done good work in the past, and they were using a zero-truncated negative binomial distribution; I thought I was the only person using zero-truncated negative binomial distributions to analyze software engineering data. My data analysis alter-ego was intrigued.
Spending a bit more time on the paper confirmed my original view, it’s conclusions were not believable. The authors had done a lot of work, this was no paper written over a long weekend, but lots of silly mistakes had been made.
Lots of nonsense software engineering papers get published, nothing to write home about. Everybody gets writes a nonsense paper at some point in their career, hopefully they get caught by reviewers and are not published (the statistical analysis in this paper was probably above the level familiar to most software engineering reviewers). So, move along.
At the start of this year, the paper: On the Impact of Programming Languages on Code Quality: A Reproduction Study appeared, published in TOPLAS (the first was in CACM, both journals of the ACM).
This replication paper gave a detailed analysis of the mistakes in data extraction, and the sloppy data analyse performed in the original work. Large chunks of the first study were cut to pieces (finding many more issues than I did, but not pointing out the missing usage data). Reading this paper now, in more detail, I found it a careful, well argued, solid piece of work.
This publication is an interesting event. Replications are rare in software engineering, and this is the first time I have seen a take-down (of an empirical paper) like this published in a major journal. Ok, there have been previous published disagreements, but this is machine learning nonsense.
The Papers We Love meetup group ran a mini-workshop over the summer, and Jan Vitek gave a talk on the replication work (unfortunately a problem with the AV system means the videos are not available on the Papers We Love YouTube channel). I asked Jan why they had gone to so much trouble writing up a replication, when they had plenty of other nonsense papers to choose from. His reasoning was that the conclusions from the original work were starting to be widely cited, i.e., new, incorrect, community-wide beliefs were being created. The finding from the original paper, that has been catching on, is that programs written in some languages are more/less likely to contain defects than programs written in other languages. What I think is actually being measured is number of users of the programs written in particular languages (a factor not present in the data).
The sequence: publication, replication, rebuttal is how science is supposed to work. Scientists disagree about published work and it all gets thrashed out in a series of published papers. I’m pleased to see this is starting to happen in software engineering, it shows that researchers care and are willing to spend time analyzing each others work (rather than publishing another paper on the latest trendy topic).
From time to time I had considered writing a post about the first two articles, but an independent analysis of the data meant some serious thinking, and I was not that keen (since I did not think the data went anywhere interesting).
In the academic world, reputation and citations are the currency. When one set of academics publishes a list of mistakes, errors, oversights, blunders, etc in the published work of another set of academics, both reputation and citations are on the line.
I have not read many academic rebuttals, but one recurring pattern has been a pointed literary style. The style of this Rebuttal paper is somewhat breezy and cheerful (the odd pointed phrase pops out every now and again), attempting to wave off what the authors call general agreement with some minor differences. I have had some trouble understanding how the rebuttal points discussed are related to the problems highlighted in the replication paper. The tone of the medium post is that there is nothing to see here, let’s all move on and be friends.
An academic’s work is judged by the number of citations it has received. Citations are used to help decide whether someone should be promoted, or awarded a grant. As I write this post, Google Scholar listed 234 citations to the original paper (which is a lot, most papers have one or none). The abstract of the Rebuttal paper ends with “…and our paper is eminently citable.”
The claimed “Point-by-Point Rebuttal” takes the form of nine alleged claims made by the replication authors. In four cases the Claim paragraph ends with: “Hence the results may be wrong!”, in two cases with: “Hence, FSE14 and CACM17 can’t be right.” (these are references to the original conference and journal papers, respectively), and once with: “Thus, other problems may exist!”
The rebuttal points have a tenuous connection to the major issues raised by the replication paper, and many of them are trivial issues (compared to the real issues raised).
Summary bullet points (six of them) at the start of the Rebuttal discuss issues not covered by the rebuttal points. My favourite is the objection bullet point claiming a preference, in the replication, for the use of the Bonferroni correction rather than FDR (False Discovery Rate). The original analysis failed to use either technique, when it should have used one or the other, a serious oversight; the replication is careful and does the analysis using both.
I would be very surprised if the Rebuttal paper, in its current form, gets published in any serious journal; it’s currently on a preprint server. It is not a serious piece of work.
Somebody who has only read the Rebuttal paper would take away a strong impression that the criticisms in the replication paper were trivial, and that the paper was not a serious piece of work.
What happens next? Will the ACM appoint a committee of the great and the good to decide whether the CACM article should be retracted? We are not talking about fraud or deception, but a bunch of silly mistakes that invalidate the claimed findings. Researchers are supposed to care about the integrity of published work, but will anybody be willing to invest the effort needed to get this paper retracted? The authors will not want to give up those 234, and counting, citations.
Like many other programmers I’ve probably added my fair share of caches to systems over the years, and as we know from the old joke, one of the two hardest problems in computer science is knowing when to invalidate them. It’s a hard question, to be sure, but a really annoying behaviour you can run into as a maintainer is when the invalidation appears to be done arbitrarily, usually by specifying some timeout seemingly plucked out of thin air and maybe even changed equally arbitrarily. (It may not be, but documenting such decisions is usually way down the list of important things to do.)
If there is a need for a cache in production, and let’s face it that’s the usual driver, then any automatic invalidation is likely to be based on doing it as infrequently as possible to ensure the highest hit ratio. The problem is that that value can often be hard-coded and mask cache invalidation bugs because it rarely kicks in. The knee-jerk reaction to “things behaving weirdly” in production is to switch everything off-and-on again thereby implicitly invalidating any caches, but this doesn’t help us find those bugs.
The most recent impetus for this post was just such a bug which surfaced because the cache invalidation logic never ran in practice. The cache timeout was set arbitrarily large, which seemed odd, but I eventually discovered it was supposed to be irrelevant because the service hosting it should have been rebooted at midnight every day! Due to the password for the account used to run the reboot task expiring it never happened and the invalidated items then got upset when they were requested again. Instead of simply fetching the item from the upstream source and caching it again, the cache had some remnants of the stale items and failed the request instead. Being an infrequent code path it didn’t obviously ring any bells so took longer to diagnose.
Design for Testability
While it’s useful to avoid throwing away data unnecessarily in production we know that the live environment rarely needs the most flexibility when it comes to configuration (see “Testing Drives the Need for Flexible Configuration”). On the contrary, I’d expect to have any cache being cycled reasonably quickly in a test environment to try and flush out any issues as I’d expect more side-effects from cache misses than hits.
If you are writing any automated tests around the caching behaviour that is often a good time to consider the other non-functional requirements, such as monitoring and support. For example, does the service or tool hosting the cache expose some means to flush it manually? While rebooting a service may do the trick it does nothing to help you track down issues around residual state and often ends up wreaking havoc with any connected clients if they’re not written with a proper distributed system mindset.
Another scenario to consider is if the cache gets poisoned; if there is no easy way to eject the bad data you’re looking at the sledgehammer approach again. If your cache is HA (highly available) and backed by some persistent storage getting bad data out could be a real challenge when you’re under the cosh. One system I worked on had random caches poisoned with bad data due to a threading serialization bug in an external library.
The monitoring side is probably equally important. If you generate no instrumentation data how do you know if your cache is even having the desired effect? One team I was on added a new cache to a service and we were bewildered to discover that it was never used. It turned out the WCF service settings were configured to create a new service instance for every request and therefore a new cache was created every time! This was despite the fact that we had unit tests for the cache and they were happily passing .
It’s also important to realise that a cache without an eviction policy is just another name for a memory leak. You cannot keep caching data forever unless you know there is a hard upper bound. Hence you’re going to need to use the instrumentation data to help find the sweet spot that gives you the right balance between time and space.
We also shouldn’t blindly assume that caches will continue to provide the same performance in future as they do now; our metrics will allow us to see any change in trends over time which might highlight a change in data that’s causing it to be less efficient. For example one cache I saw would see its efficiency plummet for a while because a large bunch of single use items got requested, cached, and then discarded as the common data got requested again. Once identified we disabled caching for those kinds of items, not so much for the performance benefit but to avoid blurring the monitoring data with unnecessary “glitches” .
 See “Man Cannot Live by Unit Testing Alone” for other tales of the perils of that mindset.
 This is a topic I covered more extensively in my Overload article “Monitoring: Turning Noise Into Signal”.
MongoDB has a handy command to rename a collection, db.collectionName.renameCollection(). There is currently no equivalent to rename a database. Now if we accept that from time to time, one positively, absolutely just has to rename a database in MongoDB, well, there are a couple of options. Unfortunately they aren’t quite as straight forward as single […]
A very interesting article discussing SpaceX’s dramatically lower launch costs has convinced me that, in a decade or two, it will become economically viable to send people to Mars. Whether lots of people will be willing to go is another matter, but let’s assume that a non-trivial number of people decide to spend many years living in a colony on Mars; what computing hardware and software should they take with them?
Reliability and repairability are crucial. Same-day delivery of replacement parts is not an option; the opportunity for Earth/Mars travel occurs every 2-years (when both planets are on the same side of the Sun), and the journey takes 4-10 months.
Given the much higher radiation levels on Mars (200 mS/year; on Earth background radiation is around 3 mS/year), modern microelectronics will experience frequent bit-flips and have a low survival rate. Miniaturization is great for packing billions of transistors into a device, but increases the likelihood that a high energy particle traveling through the device will create a permanent short-circuit; Moore’s law has a much shorter useful life on Mars, compared to Earth. The lesser high energy particles can flip the current value of one or more bits.
Reliability and repairability of electronics, compared to other compute and control options, dictates minimizing the use of electronics (pneumatics is a viable replacement for many tasks; think World War II submarines), and simple calculation can be made using a slide rule or mechanical calculator (both are reliable, and possible to repair with simple tools). Some of the issues that need to be addressed when electronic devices are a proposed solution include:
- integrated circuits need to be fabricated with feature widths that are large enough such that devices are not unduly affected by background radiation,
- devices need to be built from exchangeable components, so if one breaks the others can be used as spares. Building a device from discrete components is great for exchangeability, but is not practical for building complicated cpus; one solution is to use simple cpus, and integrated circuits come in various sizes.
- use of devices that can be repaired or new ones manufactured on Mars. For instance, core memory might be locally repairable, and eventually locally produced.
There are lots of benefits from using the same cpu for everything, with ARM being the obvious choice. Some might suggest RISC-V, and perhaps this will be a better choice many years from now, when a Mars colony is being seriously planned.
Commercially available electronic storage devices have lifetimes measured in years, with a few passive media having lifetimes measured in decades (e.g., optical media); some early electronic storage devices had lifetimes likely to be measured in decades. Perhaps it is possible to produce hard discs with expected lifetimes measured in decades, research is needed (or computing on Mars will have to function without hard discs).
The media on which the source is held will degrade over time. Engraving important source code on the walls of colony housing is one long term storage technique; rather like the hieroglyphs on ancient Egyptian buildings.
What about displays? Have lots of small, same size, flat-screens, and fit them together for greater surface area. I don’t know much about displays, so won’t say more.
Computers built from discrete components consume lots of power (much lower power consumption is a benefit of fabricating smaller devices). No problem, they can double as heating systems. Switching power supplies can be very reliable.
Radio communications require electronics. The radios on the Voyager spacecraft have been operating for 42 years, which suggests to me that reliable communication equipment can be built (I know very little about radio electronics).
What about the software?
Repairability requires that software be open source, or some kind of Mars-use only source license.
The computer language of choice is obviously C, whose advantages include:
- lots of existing, heavily used, operating systems are written in C (i.e., no need to write, and extensively test, a new one),
- C compilers are much easier to implement than, say, C++ or Java compilers. If the C compiler gets lost, somebody could bootstrap another one (lots of individuals used to write and successfully sell C compilers),
- computer storage will be a premium on Mars based computers, and C supports getting close to the hardware to maximise efficient use of resources.
The operating system of choice may not be Linux. With memory at a premium, operating systems requiring many megabytes are bad news. Computers with 64k of storage (yes, kilobytes) used to be used to do lots of useful work; see the source code of various 1980’s operating systems.
Applications can be written before departure. Maintainability and readability are marketing terms, i.e., we don’t really know how to do this stuff. Extensive testing is a good technique for gaining confidence that software behaves as expected, and the test suite can be shipped with the software.
I feel compelled to write this blog because I keep coming across the wrong type of Product Owner. I feel bad about writing this blog because a) I’ve made these points before in other forums so I’m repeating myself, and b) at the end of the day you, your team, and your organization, is free to define and use any title you like for any role you like, you are free to define any given role as you like.
So let me set out my model of a Product Owner and then at least there is a model to compare any other definition with.
Our old friend the Triangle of Constraints can help here – also know as “The Iron Triangle” and pictured above (I like to call it the FRT triangle). Now notice the version I use is slightly different from the more common model:
- Rather than “cost” I label one side of the triangle “People”. I could label it resources but in software development resources are overwhelmingly people and the knowledge they bring. People deserve respect, calling them “resources” makes them sound like paperclips.
- For software development costs are function of how many people you have and how long you have them for: costs = people x time. OK, there are some other “resources” to add to costs, e.g. buying laptops, renting time in the cloud, and so on but these are often themselves a function of the number of people you have. Such costs are a small increment on top of the wage bill.
Now the number of people you have is fixed in the short term, or to be more accurate: it is upward fixed. People can get ill or resign at anytime but adding people takes time. So in the short run one can consider that dimension fixed.
Time is also fixed. There is usually a business deadline, or rather a business benefit which is time elastic so you have a date to aim for. And on agile teams there are sprint deadlines (two-weeks, two-weeks, two weeks). So a large part time is fixed.
The final side of the triangle is labelled features or functionality, but might be labelled “requirements”, “the what” or “what are we building” – I like to think of it as the demand side.
With me so far? – so far that should be uncontroversial.
Now the traditional Project Manager role, and to a lesser degree the newer Delivery Manager role, tend to regard the third side – the what side – as fixed. There is a thing to be delivered. It is a known thing. It has been decided on and the manager’s job is to get it delivered.
To this end Project Managers are trained to regard the “thing to be built” as a given, preferably fixed, thing. Their training centres on the other sides: cost and time. They are trained both in rationing these commodities and allocating them in an efficient way. When things go wrong these managers ask for more time (which means more money because the same people need paying) or more people (which both costs more and makes things worse because of Brook’s Law).
So to summarise: traditional Project Managers focus on “when” and the input variables: people/resources and money.
Can you guess what I’m going to say next?
Product Owners – plus Product Managers and Business Analysts – focus on the “what”. What do we need to build next? What has the most benefit? What should we be building for the future?
For Product Owners the time and people are fixed. (This is most obvious in an agile environment but is actually true everywhere sooner or later.)
The thing being built is negotiable, the desired outcome may be achieved by different routes, different technologies and different solutions – the different time and cost will be a consideration but outcome is the primary focus.
In other words: Product Owners are all about the what.
In order to operate in the what-space product owners need authority and legitimacy to flex what they are building. When they don’t have that they are reduced to backlog administrators simply ordering the backlog and feeding it to technical teams. That turns the role into a type of Project or Delivery Manager.
So if you need to tell a real Product Owner from all the other misinterpretations of the role ask:
- Does the product owner focus on what?
- Can the product owner discuss different solutions and approaches to achieve an outcome?
- Is the PO flexible about the backlog? (as opposed to slavishly trying to deliver it all)
Real product owners can answer Yes to all three.
(Notice I’m deliberately being careful in what I say about “Delivery Managers.” This role is still emerging and as such its wrong to generalise about it too much. In so much as a Delivery Manager brings management skills, communication and organization to an effort it can be a positive role. When a Delivery Manager is relabelling of the Project Manager role it can be damaging.)
Now that said, the fact that some organizations choose to define the “Product Owner” role as a role closer to “Project Manager” or “Delivery Manager” rather than a role closer to “Product Manager”, “Business Analyst” or (heaven forbid) business owner causes a lot of confusion.
Perhaps I’m wrong here, perhaps the “Product Owner” is a type of “delivery manager” but I think the majority of writers, thinkers and practitioners agree with me.
Even if you disagree with me I hope we can agree on one thing: because there are different interpretations and implementations of the role there is room for confusion; and that confusion makes it harder to fill the role and harder to be seen as a successful Product Owner.
Like this post? – Like to receive these posts by e-mail?
New book: The Art of Agile Product Ownership
Good man! Good man!
I must say that the contrast between the warmth of this fire and the frost outside brings most vividly to my mind an occasion during my tenure as the Empress's ambassador to the land of Oz; specifically the time that I attended King Quadling Rex's winter masked ball during which his southern palace was overrun by an infestation of Snobbles!
The change was reasonably simple: we had to denormalise some postcode data which was currently held in a centralised relational database into some new fields in every client’s database to remove some cross-database joins that would be unsupported on the new SQL platform we were migrating too .
As you might imagine the database schema changes were fairly simple – we just needed to add the new columns as nullable strings into every database. The next step was to update the service code to start populating these new fields as addresses were added or edited by using data from the centralised postcode database .
At this point any new data or data that changed going forward would have the correctly denormalised state. However we still needed to fix up any existing data and that’s the focus of this post.
To fix-up all the existing client data we needed to write a tool which would load each client’s address data that was missing its new postcode data, look it up against the centralised list, and then write back any changes. Given we were still using the cross-database joins in live for the time being to satisfy the existing reports we could roll this out in the background and avoiding putting any unnecessary load on the database cluster.
The tool wasn’t throw-away because the postcode dataset gets updated regularly and so the denormalised client data needs to be refreshed whenever the master list is updated. (This would not be that often but enough to make it worth spending a little extra time writing a reusable tool for the job for ops to run.)
Clearly this isn’t rocket science, it just requires loading the centralised data into a map, fetching the client’s addresses, looking them up, and writing back the relevant fields. The tool only took a few hours to write and test and so it was ready to run for the next release during a quiet period.
When that moment arrived the tool was run across the hundreds of client databases and plenty of data was fixed up in the process, so the task appeared completed.
With all the existing postcode data now correctly populated too we should have been in a position to switch the report generation feature toggle on so that it used the new denormalised data instead of doing a cross-database join to the existing centralised store.
While the team were generally confident in the changes to date I suggested we should just do a sanity check and make sure that everything was working as intended as I felt this was a reasonably simple check to run.
An initial SQL query someone knocked up just checked how many of the new fields had been populated and the numbers seemed about right, i.e. very high (we’d expect some addresses to be missing data due to missing postcodes, typos and stale postcode data). However I still felt that we should be able to get a definitive answer with very little effort by leveraging the existing we SQL we were about to discard, i.e. use the cross-database join one last time to verify the data population more precisely.
Close, but No Cigar
I massaged the existing report query to show where data from the dynamic join was different to that in the new columns that had been added (again, not rocket science). To our surprise there were quite a significant number of discrepancies.
Fortunately it didn’t take long to work out that those addresses which were missing postcode data all had postcodes which were at least partially written in lowercase whereas the ones that had worked were entirely written in uppercase.
Hence the bug was fairly simple to track down. The tool loaded the postcode data into a dictionary (map) keyed on the string postcode and did a straight lookup which is case-sensitive by default. A quick change to use a case-insensitive comparison and the tool was fixed. The data was corrected soon after and the migration verified.
Why didn’t this show up in the initial testing? Well, it turned out the tools used to generate the test data sets and also to anonymize real client databases were somewhat simplistic and this helped to provide a false level of confidence in the new tool.
Testing in Production
Whenever we make a change to our system it’s important that we verify we’ve delivered what we intended. Oftentimes the addition of a feature has some impact on the front-end and the customer and therefore it’s fairly easy to see if it’s working or not. (The customer usually has something to say about it.)
However back-end changes can be harder to verify thoroughly, but it’s still important that we do the best we can to ensure they have the expected effect. In this instance we could easily check every migrated address within a reasonable time frame and know for sure, but on large data sets this might unfeasible so you might have to settle for less. Also the use of feature switches and incremental delivery meant that even though there was a bug it did not affect the customers and we were always making forward progress.
Testing does not end with a successful run of the build pipeline or a sign-off from a QA team – it also needs to work in real life too. Ideally the work we put in up-front will make that more likely but for some classes of change, most notably where actual customer data is involved, we need to follow through and ensure that practice and theory tie up.
 Storage limitations and other factors precluded simply moving the entire postcode database into each customer DB before moving platforms. The cost was worth it to de-risk the overall migration.
 There was no problem with the web service having two connections to two different databases, we just needed to stop writing SQL queries that did cross-database joins.