Growth in number of packages for widely used languages

Derek Jones from The Shape of Code

These days a language’s ecosystem of add-ons, such as packages, is often more important than the features provided by the language (which usually only vary in their syntactic sugar, and built-in support for some subset of commonly occurring features).

Use of a particular language grows and shrinks, sometimes over very many decades. Estimating the number of users of a language is difficult, but a possible proxy is ecosystem activity in the form of package growth/decline. However, it will take several decades to accumulate the data needed to test how effective this proxy might be.

Where are we today?

The Module Counts website is the home for a project that counts the number of libraries/packages/modules contained in 26 language-specific repositories. Daily data, in some cases going back to 2010, is available as a csv :-) The following are the most interesting items I discovered during a fishing expedition.

The csv file contains totals, and some values are missing (which means specifying an ‘ignore missing values’ argument to some functions). Some repos have been experiencing large average daily growth (e.g., 65 for PyPI, and 112 for Maven Central-Java), while others are more subdued (e.g., 0.7 for PERL and 3.9 for R’s CRAN). Apart from a few days, the daily change is positive.
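
In pandas, for instance, the missing-value handling is a one-argument affair (a minimal sketch, not the actual analysis code; the file name and column names are my assumptions about the csv layout):

```python
import pandas as pd

# Hypothetical layout: a 'date' column plus one package-total column per repository.
counts = pd.read_csv("modulecounts.csv", parse_dates=["date"], index_col="date")

# Daily change in each repository's package total.
daily_change = counts.diff()

# Mean daily growth; pandas' mean() skips missing values by default (skipna=True),
# which is the 'ignore missing values' argument mentioned above.
print(daily_change[["PyPI", "CRAN"]].mean(skipna=True))
```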

Is the difference in the order of magnitude growth due to number of active users, number of packages that currently exist, a wide/narrow application domain (Python is wide, while R’s is narrow), the ease of getting a package accepted, or something else?

The plots below show how PyPI has been experiencing exponential growth of a kind (the regression model fitted to the daily total has the form $e^{a\,days+b\,days^2}$, where $a$ and $b$ are fitted coefficients and days is the number of days since 2010-01-01; the red line is the daily diff of this equation), while Ruby has been experiencing a linear decline since late 2014 (all code+data):

Daily change in the number of packages in PyPI and Rubygems.
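
A minimal sketch of fitting such a model, regressing the log of the daily total on days and days² and exponentiating (my stand-in for the actual code+data, reusing the hypothetical csv layout above):

```python
import numpy as np
import pandas as pd

counts = pd.read_csv("modulecounts.csv", parse_dates=["date"], index_col="date")
pypi = counts["PyPI"].dropna()

days = (pypi.index - pd.Timestamp("2010-01-01")).days.to_numpy()
b2, b1, b0 = np.polyfit(days, np.log(pypi.to_numpy()), deg=2)  # highest power first

fitted = np.exp(b0 + b1 * days + b2 * days**2)  # total ~ e^(b0 + b1*days + b2*days^2)
daily_diff = np.diff(fitted)  # the 'red line': the daily change implied by the fit
```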

Will the five-year decline in new submissions to Rubygems continue, and does this point to an eventual demise of Ruby (a few decades from now)? Rubygems has years to go before it reaches PERL’s low growth rate (I think PERL is in terminal decline).

Are there any short term patterns, say at the weekly level? Autocorrelation is a technique for estimating the extent to which today’s value is affected by values from the immediate past (usually one or two measurement periods back, i.e., yesterday or the day before that). The two plots below show the autocorrelation for daily changes, with lag in days:

Autocorrelation of daily changes in PyPI and Maven-Java package counts.
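
Computing the autocorrelation is a one-liner per lag in pandas (same hypothetical csv as above); local peaks at lags 7, 14, 21, … are the weekly pattern:

```python
import pandas as pd

counts = pd.read_csv("modulecounts.csv", parse_dates=["date"], index_col="date")
change = counts["PyPI"].diff().dropna()

# Pearson correlation between the daily changes and the same series lagged by 'lag' days.
for lag in range(1, 29):
    print(f"lag {lag:2d}: {change.autocorr(lag=lag):+.3f}")
```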

The recurring 7-day ‘peaks’ show the impact of weekends (I assume). Is the larger ‘weekend-effect’ for Java, compared to PyPI, due to Java usage including a greater percentage of commercial developers (who tend not to work at the weekend)?

I did not manage to find any seasonal effect, e.g., more submissions during the winter than the summer. But I only checked a few of the languages, and only for a single peak (see code for details).

Another way of tracking package evolution is version numbering. For instance, how often do version numbers change, and which component changes, e.g., major/minor? There have been a couple of studies looking at particular repos over a few years, but nobody is yet recording broad coverage daily, over the long term 😉
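
As a hypothetical illustration of the bookkeeping such long-term recording would involve, classifying which component changed between two simple dotted version strings:

```python
# Toy classifier for dotted version strings; real repositories have messier
# schemes (pre-release tags, epochs, etc.), so this is illustrative only.
def bump_type(old: str, new: str) -> str:
    old_parts = [int(x) for x in old.split(".")]
    new_parts = [int(x) for x in new.split(".")]
    names = ["major", "minor", "patch"]
    for i, (o, n) in enumerate(zip(old_parts, new_parts)):
        if o != n:
            return names[i] if i < len(names) else f"component {i}"
    return "no change"

print(bump_type("1.4.2", "1.5.0"))  # -> minor
```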

New podcasts and video

AllanAdmin from Allan Kelly Associates

Before Christmas I recorded a couple of new podcasts, which went live this week. The first was with Luke Szymer for his Align Remotely podcast series; it focused on the topic of my new book, Succeeding with OKRs in Agile, and is released in two parts. Luke also has a new book, also called Align Remotely; it is timely and well worth reading, as is anything Luke writes.

The second podcast was with Ian Gill of Agility by Nature. This was a casual, wide-ranging conversation.

Finally, the video above: Living Continuously with Agile and Digital.

From time to time I also give private presentations to companies. Sometimes these are existing conference or user group presentations, sometimes new material. While companies generally pay me for these presentations, I always feel the need to share further. So, after removing any client-specific references, I’m re-recording these and posting them on YouTube in a Private Presentations playlist.




Words I avoid using: should, empower, commitment

AllanAdmin from Allan Kelly Associates

For the record, there are a few words I avoid using if I can.

Should: “we should feed the starving millions”, “we should create world peace.”

Should is useless.
It is a declaration of what should be, but also an admission of defeat: we give up immediately, we don’t even try.

Empower and empowerment: “I will empower the team”

It was Henry Mintzberg who alerted me to the problems with this word: empowerment is a loan. Empowerment is not real power, not real authority.

That I empower you means “I have the power, I am going to lend it to you… but I am still responsible, and if you screw up I’m taking it right back.” That’s why I prefer to talk about devolving, distributing and even sharing authority.

Commitment: “The Scrum team committed to delivering 20 points”.

Actually, my dislike of commitment is usually confined to software teams, and specifically to older implementations of Scrum.

First, commitment tends to be one-sided: the development team are expected to commit, but not their customers. And in an environment where the team is not completely independent (i.e., there are times when it needs non-team members to do something), it is unfair to ask them to commit.

This is very true in large companies, where teams are often constrained by a multitude of rules, demarcation lines and restrictions. Such teams don’t have the power to commit on their own; they need others – and superiors – to join in making things happen.

Second, because of those problems, the word “commitment” itself has changed meaning. Originally, when a team said “We commit” it meant “We are going to make this happen; come hell or high water, we will do everything in our power to make this happen.” Over time, because the team couldn’t move heaven and earth due to company policy, commitment has become devalued. Today, “commitment” has come to mean “This is the work we plan to do this sprint and we will try our best (but don’t get your hopes up too high).”

I’m sure there are some more words I avoid using, but less often; I’ll make a note of them next time I’m tempted, and report back.




Payback time-frame for research in software engineering

Derek Jones from The Shape of Code

What are the major questions in software engineering that researchers should be trying to answer?

A high level question whose answer is likely to involve life, the universe, and everything is: What is the most cost-effective way to build software systems?

Viewing software engineering research as an attempt to find the answer to a big question mirrors physicists’ quest for a Grand Unified Theory of how the Universe works.

Physicists have the luxury of studying the Universe at their own convenience; the Universe does not need their input to do a better job.

Software engineering is not like physics. Once a software system has been built, the resources have been invested, and there is no reason to recreate it using a more cost-effective approach (the zero cost of software duplication means that manufacturing cost is the cost of the first version).

Designing and researching new ways of building software systems may be great fun, but the time and money needed to run the realistic experiments required to evaluate their effectiveness means those experiments are unlikely to be run. Searching for more cost-effective software development techniques this way (pay to run the experiments, wait for the results) is going to be expensive and time-consuming: a theory is proposed, experiments are run, results are analysed; rinse and repeat until a good-enough cost-effective technique is found. One iteration will take many years, and this iterative process is likely to take many decades.

Very many software systems are being built and maintained, and each of these is an experiment. Data from these ‘experiments’ provides a cost-effective way of improving software engineering: study existing practices to figure out how they work (or don’t work).

Given the volume of ongoing software development, most of the payback from any research investment is likely to occur in the near future, not decades from now; the evidence shows that source code has a short and lonely existence. Investing for a payback that might occur 30 years from now makes no sense; researchers I talk to often use this time-frame when I ask them about the benefits of their research, i.e., just before they are about to retire. Investing in software engineering research only makes economic sense when it is focused on questions that are expected to start providing payback in, say, 3-5 years.

Who is going to base their research on existing industry practices?

Researching existing practices often involves dealing with people issues, and many researchers in computing departments are not that interested in the people side of software engineering, or rather they are more interested in the computer side.

Algorithm-oriented is how I would describe researchers who claim to be studying software engineering. I am frequently told about the potential for huge benefits from the discovery of more efficient algorithms. For many applications, algorithms are now commodities, i.e., they are good enough. Those with a career commitment to studying algorithms have a blinkered view of the likely benefits of their work (most of those I have seen are studying incremental improvements, and are very unlikely to make a major breakthrough).

The number of researchers studying what professional developers do, with the aim of improving it, is very small (I am excluding the growing number of fake researchers doing surveys). While I hope there will be a significant growth in numbers, I’m not holding my breath (at least in the short term; as for the long term, Planck’s experience with quantum mechanics applies: “Science advances one funeral at a time”).

Static site migration – starting the optimisation, already

Timo Geusch from The Lone C++ Coder's Blog

Now that I’ve got the static site up and running, it’s obviously time to switch over immediately, right? Not so fast. After QA’ing my deployment process in production, it was time to check how the two compared from a performance perspective. I like to use several different tests, starting with Pingdom, then using PageSpeed Insights for more details. The Pingdom speed test gave it a thumbs up, but they’re not running the currently dominant search engine.

Software effort estimation is mostly fake research

Derek Jones from The Shape of Code

Effort estimation is an important component of any project, software or otherwise. While effort estimation is something that everybody in industry is involved with on a regular basis, it is a niche topic in software engineering research. The problem is researcher attitude (e.g., they are unwilling to venture into the wilds of industry), which has stopped them acquiring the estimation data needed to build realistic models. A few intrepid people have risked an assault on their ego and talked to people in industry; the outcome has been, until very recently, a small collection of tiny estimation datasets.

In a research context, the term effort estimation is actually a hangover from the 1970s; effort correction more accurately describes the behavior of most models since the 1990s. In the 1970s, models took various quantities (e.g., estimated lines of code) and calculated an effort estimate. Later models have included an estimate as input to the model, producing a corrected estimate as output. For the sake of appearances, I will use the existing terminology.

Which effort estimation datasets do researchers tend to use?

A 2012 review of datasets used for effort estimation using machine learning between 1991 and 2010 found that the top three were: Desharnais with 24 papers (29%), COCOMO with 19 papers (23%), and ISBSG with 17. A 2019 review of datasets used for effort estimation using machine learning between 1991 and 2017 found the top three to be NASA with 17 papers (23%), the COCOMO data and ISBSG joint second with 16 papers (21%), and Desharnais third with 14. The 2012 review included more sources in its search than the 2019 review, and subjectively your author has noticed a greater use of the NASA dataset over the last five years or so.

How large are these datasets that have attracted so many research papers?

The NASA dataset contains 93 rows (that is not a typo, there is no power-of-ten missing), COCOMO 63 rows, Desharnais 81 rows, and ISBSG is licensed by the International Software Benchmarking Standards Group (academics can apply for limited-time use for research purposes, i.e., not pay the $3,000 annual subscription). The China dataset contains 499 rows, and is sometimes used (there is no mention of a supercomputer being required for this amount of data ;-).

Why are researchers involved in software effort estimation feeding tiny datasets from the 1990s into machine learning algorithms?

Grant money. Research projects are more likely to be funded if they use a trendy technique, and for the last decade machine learning has been the trendiest technique in software engineering research. What data is available to learn from? Those estimation datasets that were flogged to death in the 1990s using non-machine-learning techniques, e.g., regression.

Use of machine learning also has the advantage of not needing to know anything about the details of estimating software effort. Everything can be reduced to a discussion of the machine learning algorithms, with performance judged by a chosen error metric. Nobody actually looks at the predicted estimates to discover that the models are essentially producing the same answer, e.g., one learner predicts 43 months, 2 weeks, 4 days, 6 hours, 47 minutes and 11 seconds, while a ‘better’ fitting one predicts 43 months, 2 weeks, 2 days, 6 hours, 27 minutes and 51 seconds.
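
For example, MMRE (mean magnitude of relative error) is one metric that has been widely used, and widely criticised, in this literature (a sketch with made-up numbers):

```python
# MMRE: mean over projects of |actual - predicted| / actual.
def mmre(actual, predicted):
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

actual    = [12.0, 30.0, 7.5]   # actual effort, e.g., person-months (made up)
predicted = [10.0, 41.0, 6.0]   # one model's estimates (made up)
print(round(mmre(actual, predicted), 2))  # 0.24
```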

How many ways are there to do machine learning on datasets containing less than 100 rows?

A paper from 2012 evaluated the possibilities, using 9 learners times 10 data pre-processing options (e.g., log transform or discretization) times 7 error metrics, giving 630 possible final models; they picked the top 10 performers.
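
The combinatorics are easy to reproduce (illustrative names, not the 2012 paper’s actual lists):

```python
from itertools import product

learners = [f"learner_{i}" for i in range(1, 10)]        # 9 learners
preprocessing = ["none", "log", "discretize"] + \
                [f"prep_{i}" for i in range(1, 8)]       # 10 pre-processing options
metrics = [f"metric_{i}" for i in range(1, 8)]           # 7 error metrics

models = list(product(learners, preprocessing, metrics))
print(len(models))  # 9 * 10 * 7 = 630
```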

This 2012 study has not stopped researchers continuing to twiddle the knobs available to them; anything to keep the paper mills running.

To quote the authors of one review paper: “Unfortunately, we found that very few papers (including most of our own) paid any attention at all to properties of the data set.”

Agile techniques are widely used these days, and datasets from the 1990s are not applicable. What datasets do researchers use to build Agile effort estimation models?

A 2020 review of Agile development effort estimation found 73 papers. The most popular dataset, containing 21 rows, was used by nine papers. Three papers used simulated data! At least some authors were going out and finding data, even if it contains fewer rows than the NASA dataset.

As researchers in business schools have shown, large datasets can be obtained from industry; ISBSG actively solicits data from industry and now has data on 9,500+ projects (as far as I can tell a small amount for each project, but that is still a lot of projects).

Are there any estimates on GitHub? Some open source projects use JIRA, which includes support for making estimates. Some story point estimates can be found on GitHub, but the actuals are missing.

A handful of researchers have obtained and released estimation datasets containing thousands of rows, e.g., the SiP dataset contains 10,100 rows and the CESAW dataset contains over 40,000 rows. These datasets are generally ignored, perhaps because when presented with lots of real data researchers have no idea what to do with it.

On Tug O’ War – student

student from thus spake a.k.

The Baron and Sir R-----'s latest wager comprised first placing a draught piece upon the fifth lowest of a column of twelve squares, and subsequently moving it up or down by one space depending upon the outcome of a coin toss until such time as it should escape, either by moving above the topmost or below the bottommost square. In the former outcome the Baron should have had a prize of three coins and in the latter Sir R----- should have had two.
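
A quick gambler’s-ruin calculation (my sketch, assuming a fair coin and that the loser of each outcome pays the winner’s prize) gives the expected value of the wager:

```latex
% Number the squares 1..12 and add absorbing states 0 (escape below) and
% 13 (escape above); the piece performs a symmetric random walk from square 5.
% For a fair coin, P(hit 13 before 0 | start at k) = k/13, hence
P(\text{escape above}) = \tfrac{5}{13}, \qquad
P(\text{escape below}) = \tfrac{8}{13},
% and, with the Baron receiving three coins in the former case and paying
% two in the latter,
E[\text{Baron}] = 3 \cdot \tfrac{5}{13} - 2 \cdot \tfrac{8}{13} = -\tfrac{1}{13}.
```

On this reckoning the wager slightly favours Sir R-----, by one thirteenth of a coin per game.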

Static site should be fixed now

Timo Geusch from The Lone C++ Coder's Blog

Ah yes, the guy who used to wear the “I don’t often test my code, but if I do, I do it in production” T-shirt in an ironic way followed his own advice, unironically. The deployment script was ultra-efficient and mainly removed the static site when updating it. Think about all the bandwidth this conserved! […]


Moving this blog to a static site – this time I’m serious (because org-mode)

Timo Geusch from The Lone C++ Coder's Blog

I have been toying with the idea of migrating this blog to a static site to simplify its maintenance for some time. While WordPress is a great tool, this blog is a side project and any time I have to spend maintaining WordPress gets deducted from the time I have to write for the blog. […]
