Patterns of regular expression usage: duplicate regexs

Derek Jones from The Shape of Code

Regular expressions are widely used, but until recently they were rarely studied empirically (i.e., just theory research).

This week I discovered two groups studying regular expression usage in source code. The VTLeeLab has various papers analysing 500K distinct regular expressions, from programs written in eight languages and StackOverflow; Carl Chapman and Peipei Wang have been looking at testing of regular expressions, and also ran an interesting experiment (I will write about this when I have decoded the data).

Regular expressions are interesting, in that their use is likely to be purely driven by an application requirement; the use of an integer literals may be driven by internal housekeeping requirements. The number of times the same regular expression appears in source code provides an insight (I claim) into the number of times different programs are having to solve the same application problem.

The data made available by the VTLeeLab group provides lots of information about each distinct regular expression, but not a count of occurrences in source. My email request for count data received a reply from James Davis within the hour :-)

The plot below (code+data; crates.io has not been included because the number of regexs extracted is much smaller than the other repos) shows the number of unique patterns (y-axis) against the number of identical occurrences of each unique pattern (x-axis), e.g., far left shows number of distinct patterns that occurred once, then the number of distinct patterns that each occur twice, etc; colors show the repositories (language) from which the source was obtained (to extract the regexs), and lines are fitted regression models of the form: NumPatterns = a*MultOccur^b, where: a is driven by the total amount of source processed and the frequency of occurrence of regexs in source, and b is the rate at which duplicates occur.

Number of distinct patterns occurring a given number of times in the source stored in various repositories

So most patterns occur once, and a few patterns occur lots of times (there is a long tail off to the unplotted right).

The following table shows values of b for the various repositories (languages):

StackOverflow   cpan    godoc    maven    npm  packagist   pypi   rubygems
    -1.8        -2.5     -2.5    -2.4    -1.9     -2.6     -2.7     -2.4

The lower (i.e., closer to zero) the value of b, the more often the same regex will appear.

The values are in the region of -2.5, with two exceptions; why might StackOverflow and npm be different? I can imagine lots of duplicates on StackOverflow, but npm (I’m not really familiar with this package ecosystem).

I am pleased to see such good regression fits, and close power law exponents (I would have been happy with an exponential fit, or any other equation; I am interested in a consistent pattern across languages, not the pattern itself).

Some of the code is likely to be cloned, i.e., cut-and-pasted from a function in another package/program. Copy rates as high as 70% have been found. In this case, I don’t think cloned code matters. If a particular regex is needed, what difference does it make whether the code was cloned or written from scratch?

If the same regex appears in source because of the same application requirement, the number of reuses should be correlated across languages (unless different languages are being used to solve different kinds of problems). The plot below shows the correlation between number of occurrences of distinct regexs, for each pair of languages (or rather repos for particular languages; top left is StackOverflow).

Correlation of number of identical pattern occurrences, between pairs of repositories.

Why is there a mix of strong and weakly correlated pairs? Is it because similar application problems tend to be solved using different languages? Or perhaps there are different habits for cut-and-pasted source for developers using different repositories (which will cause some patterns to occur more often, but not others, and have an impact on correlation but not the regression fit).

There are lot of other interesting things that can be done with this data, when connected to the results of the analysis of distinct regexs, but these look like hard work, and I have a book to finish.

Building an OpenBSD WireGuard VPN server part 3 – Unbound DNS filtering

Timo Geusch from The Lone C++ Coder's Blog

In part 2, I reconfigured my WireGuard VPN to use an Unbound DNS server on the VPN server rather than rely on a third party server I had used for the original quick and dirty configuration. It was important for me to set up a validating DNS server, which I did in that part. In […]

The post Building an OpenBSD WireGuard VPN server part 3 – Unbound DNS filtering appeared first on The Lone C++ Coder's Blog.

Funky Givens – a.k.

a.k. from thus spake a.k.

We have recently been looking at how we can use a special case of Francis's QR transformation to reduce a real symmetric matrix M to a diagonal matrix Λ by first applying Householder transformations to put it in tridiagonal form and then using shifted Givens rotations to zero out the off diagonal elements.
The columns of the matrix of transformations V and the elements on the leading diagonal of Λ are the unit eigenvectors and eigenvalues of M respectively and they consequently satisfy

    M × V = V × Λ

and, since the product of V and its transpose is the identity matrix

    M = V × Λ × VT

which is known as the spectral decomposition of M.
Last time we saw how we could efficiently apply the Householder transformations in-place, replacing the elements of M with those of the matrix of accumulated transformations Q and creating a pair of arrays to represent the leading and off diagonal elements of the tridiagonal matrix. This time we shall see how we can similarly improve the implementation of the Givens rotations.

Source code has a brief and lonely existence

Derek Jones from The Shape of Code

The majority of source code has a short lifespan (i.e., a few years), and is only ever modified by one person (i.e., 60%).

Literate programming is all well and good for code written to appear in a book that the author hopes will be read for many years, but this is a tiny sliver of the source code ecosystem. The majority of code is never modified, once written, and does not hang around for very long; an investment is source code futures will make a loss unless the returns are spectacular.

What evidence do I have for these claims?

There is lots of evidence for the code having a short lifespan, and not so much for the number of people modifying it (and none for the number of people reading it).

The lifespan evidence is derived from data in my evidence-based software engineering book, and blog posts on software system lifespans, and survival times of Linux distributions. Lifespan in short because Packages are updated, and third-parties change/delete APIs (things may settle down in the future).

People who think source code has a long lifespan are suffering from survivorship bias, i.e., there are a small percentage of programs that are actively used for many years.

Around 60% of functions are only ever modified by one author; based on a study of the change history of functions in Evolution (114,485 changes to functions over 10 years), and Apache (14,072 changes over 12 years); a study investigating the number of people modifying files in Eclipse. Pointers to other studies that have welcome.

One consequence of the short life expectancy of source code is that, any investment made with the expectation of saving on future maintenance costs needs to return many multiples of the original investment. When many programs don’t live long enough to be maintained, those with a long lifespan have to pay the original investments made in all the source that quickly disappeared.

One benefit of short life expectancy is that most coding mistakes don’t live long enough to trigger a fault experience; the code containing the mistake is deleted or replaced before anybody notices the mistake.

Framing the question frames the answers – my crown jewels

Allan Kelly from Allan Kelly Associates

iStock-149794120s-2020-02-5-12-18.jpg

Today I’m giving away my crown jewels. Once you have read this post I can’t run my favourite exercise with you. There is a bit of experiential learning I can’t give you. So if you’ve rather have the experience stop reading and go and book yourself on my May workshop.

I’m describing an exercise that models what happens in “the real world(tm).” Plenty of the people who have done the exercise recognise it was a real life problem. The insights are many, and some a little disturbing.

Dozens of teams and the answers are always the same. I live in dread that someone will guess and ruin the exercise but it never happens. Now I’m telling the world that might change.

On screen I put a story something like:

As a widget maker I want an online store to sell my widgets so that I can make money

I separate the room into teams. Each team represents a technology supplier – an agency, an outsourcer, whatever. I want each team to competitively bid on a piece of work. Each team gets to think about the problem and estimate the work. At the end I want each team to be ready to name their price, how long it will take and how many people they need. They may add any more details they like, e.g. staging, design, technology, etc. (and most do).

The teams on the right get a story which says:

As an international widget maker I want to sell direct to customers so that I can cut out distributors. I anticipate $10million turnover within 3 years and have budgeted $1.2m for this project.

15 minutes later the teams on the right read out their bids. They always want a million give or take. They want months, if not years. They want teams of half a dozen or more engineers, testers, UXD, analysts and project managers. They may propose an ongoing maintenance contract too.

Very disconcerting for these teams is that while they are answering and taking questions the other teams, those on the left, often burst out laughing – literally – when they hear these proposals.

What neither side knows is that they have different stories. The teams on the left get a story saying:

As an artisan widget maker I want to sell my widgets to customers so that I can give up my day job. When I make $100,000 a year in sales I can live my dream. My accountant tells me a WordPress website will cost $5,000.

These teams want a week or two, an engineer or two and perhaps $10,000 tops. In some cases they say “We can do it this afternoon, we’ll set up Etsy.” Even if they don’t want to use Esty or eBay they probably propose an OpenSource solution.

So what do you think?

True, it is a semi-artificial set-up but I would argue that these situations happen all the time in “the real world” work environment. However they are usually well disguised and hard to see. Maybe now you will spot them.

That aside there are many potential lessons this exercise illustrates, almost everyone is worth a discussion in its own right. To keep things brief I’ll just highlight some of them:

  • Anchoring (cognitive bias): individuals are anchored to those numbers, bigger number lead them to frame their answers as bigger numbers.
  • Assumptions: people jump to assumptions, people automatically fill in the blanks when they lack information and the information they fill in flows from the numbers mentioned. Few questions get asked.
  • Different solutions: the key lesson for me, confronted with similar problems, people (especially engineers) are capable of formulating very different solutions. Those solutions have different time and cost implications.
  • Problem bounding: presenting the same problem with different bounding constraints results in massively different solutions.
  • Effort estimates, and therefore cost estimates, flow from value: whether through anchoring assumptions or solution designs the estimates (time and money) flow from the value available NOT the other way around.
  • Prior experience often goes out the window. In one run a low-end teams told me: “We did this last week. A digital consultant showed us how to install WordPress and Magento for online retail in the morning and in the afternoon we did it ourselves.” While this team came up with a low cost proposal their colleagues who were given the $1m story forgot everything they learned last week.
  • People don’t ask questions: I answer questions while teams are creating their answers but people rarely challenge what is asked for. Maybe its because I’m usually in some position of authority as a consultant or workshop trainer and my word should be followed.

Occasionally a team with the million dollar story say “We could do this with eBay/WordPress/Shopify.” I keep a poker face and let them be. Inevitably left alone for long enough they talk themselves into a much more complex and expensive bid.

In fact, the longer I give teams the higher the estimates go. I heard a team in Australia say three times “Those estimates look low, lets double them.” And they did. (Again, planning has diminishing returns.)

So far nobody has offered two solutions: you could offer up a Shopify solution and a custom build solution but nobody has.

While we are going through the exercise the minimal viable product idea often gets mentioned – usually by the teams on the right. So recently I introduced a third story. This read the same as the international widget maker but has an extra paragraph underneath:

MacAllan consulting has advised the company to take an iterative and agile approach to this work using a minimally viable product model.

How do you think teams respond?

Think for a minute… your answer is?

It makes no difference.

Or rather, so far I’ve not had any of the million dollar teams propose anything close to the $5,000 solution. In one case a team with the MVP story actually came in more expensive – and longer – than the million dollar team without the MVP clause.

My learning here: talking MVP makes no difference. If you want an MVP you have to impose constraints. (Hence try an MVT.)

People continue to fill in the blanks after the charade is exposed. I’ve heard software architects argue forcefully they these are different problems because of the money involved and therefore require different architecture. They clearly feel cheated and want to justify the proposal they have made. I suspect they feel I’ve made them look silly and want to undo that impression, I’m sorry if I’ve made anyone feel silly.

I wonder how often that happens in the work place? How many of us would back down in real life? I’d like to think I would but I would probably first try and justify my position.

The architects have a point, in many ways the stories are functionally the same but the differences lie in the non-functional requirements: load, throughput, security and so on. But all of that is inferred by people from the price tag without question.

It makes me said that teams ask so few questions. People easily see themselves as a tailor not as a consultant (my Tailor or Image consultant post.)

Then there are the questions about the bidding process and companies bidding on the work.

Imagine you are the buyer: one supplier bids really low, the others were much higher but inline with your expectations. Would you trust the low bid? Have they blow their credibility?

And as a bidder: if you know the client plans to spend $1,000,000 why bid lower? The engineers cost estimates and designs aren’t relevant. Ideally you bid just below the competition. You are the lowest price with all the credibility and maximum revenue.

For that matter, should you be bidding on this at all?

If you work for a small e-commerce provider in rural Cornwall you may never know about, let alone, bid on a million dollar piece of work from an American multi-national. And if you did, would anyone take you seriously?

Suppose you got your big break: you walk in and offer a quick, low cost solution based on something like Shopify. Would they take you seriously? Would they want to listen?

Do corporations increase their own costs simply by being?

Conversely, if you work for a big consultancy and bid on million dollar deals every week will you be interested in bidding on a $5,000 piece of work? Of course not!

But that also means that if a corporation approaches you for a million dollar online shop, even if you could deliver it in a week’s time running on a third party platform do you have any incentive?

I don’t have answers to these questions. Indeed, there aren’t any right answers. All answers are valid, just some answers are “better” in some places than others.

Ultimately the lesson I take away from this is: we craft solutions within constraints.

More specifically: Engineers engineer within constraints, that is what engineers do.

That reinforces my belief that one needs to really understand benefit (value) and how that changes with time. From there we can work back to a solution.

If you would like to see this exercise for real then book yourself my Requirements, Backlogs and User Stories workshop. If you are in London Learning Connexions are running this again in May.


Like this post? – Like to receive these posts by e-mail? XanpanNewlite-2020-02-5-12-18.jpg

Xanpan

Subscribe to my newsletter & receive a free eBook “Xanpan: Team Centric Agile Software Development”

The post Framing the question frames the answers – my crown jewels appeared first on Allan Kelly Associates.

Dear customer, the truth about IT

Allan Kelly from Allan Kelly Associates

Dear Customer, the truth about IT projects” was originally published in The Agile Journal way back in 2012. Unfortunately the essay has stood the test of time and is still regularly re-discovered online.

A couple of weeks ago I recorded an audio version of Dear Customer which I have now posted online. You can find links to the original publication and of a PDF of the essay at the same site.

O, and the essay forms the prologue to my Xanpan book, but newsletter subscribers probably know that because they downloaded a free copy of Xanpan when they subscribed!

The post Dear customer, the truth about IT appeared first on Allan Kelly Associates.

Blog Post #300

Chris Oldwood from The OldWood Thing

I signed off My 200th Blog Post in November 2014 with the following words:
See you again in a few years.
At the time I didn’t think it would take me over 5 years to write another 100 blog posts, but it has. Does this mean I’ve stopped writing and gone back to coding, reading, and gaming more on my daily commute? No, the clue is also in that blog post:
My main aspiration was that writing this blog would help me sharpen my writing skills and give me the confidence to go and write something more detailed that might then be formally published.
No, I haven’t stopped writing; on the contrary, since my first “proper” [1] article for ACCU in late 2013 I’ve spent far more of my time writing further articles, somewhere around the 60 mark at the last count. These have often been longer and also required more care and attention but I’ve probably still written a similar amount of words in the last five years to the previous five.

Columnist

My “In The Toolbox” column for C Vu was a regular feature from 2013 to 2016 but that has tailed off for now and been replaced by a column on the final page of ACCU’s Overload. After it’s editor Frances Buontempo suggested the title “Afterwood” in the pub one evening how could I not accept?

In my very first Afterwood, where I set out my stall, I described how the final page of a programming journal has often played host to some entertaining writers in the past (when printed journals were still all the rage) and, while perhaps a little late to the party given the demise of the printed page, I’m still glad to have a stab at attempting such a role.

This 300th blog post almost coincided with the blog’s 10th anniversary 9 months ago but I had a remote working contract at the time so my long anticipated “decade of writing” blog post was elevated to an Afterwood instead due to the latter having some semblance of moral obligation unlike the former [2]. That piece, together with this one which focuses more on this blog, probably forms the whole picture.

Statistics

I did wonder if I’d ever get bored of seeing my words appear in print and so far I haven’t; it still feels just that little bit more special to have to get your content past some reviewers, something you don’t have with your own blog. Being author and editor for my blog was something I called out as a big plus in my first anniversary post, “Happy Birthday, Blog”. 

Many of us programmers aren’t as blessed in the confidence department as people in some other disciplines so we often have to find other ways to give ourselves that little boost every now and then. The blog wins out here as you can usually see some metrics and even occasionally the odd link back from other people’s blogs or Stack Overflow, which is a nice surprise. (Metrics only tell you someone downloaded the page, whereas a link back is a good indication they actually read it too :o). They may also have agreed, which would be even more satisfying!)

While we’re on the subject of “vanity” metrics I’ve remained fairly steadfast and ignored them. I did include a monthly “page views”counter on the sidebar just to make sure that it hadn’t got lost in the ether, search-engine wise. It’s never been easy searching for my own content; I usually have to add “site:chrisoldwood.blogspot.com” into the query, but it’s not that big an issue as first-and-foremost it’s notes for myself, other readers are always a bonus. For a long time my posts about PowerShell exit codes (2011) and Subversion mergeinfo records (2010) held the top spots but for some totally unknown reason my slightly ranty post around NTLM HTTP proxies (2016) is now dominating and will likely take over the top spot. Given there are no links to it (that I can find) I can only imagine it turns up in search engine queries and it’s not what people are really looking for. Sorry about that! Maybe there are devs and sysadmins out there looking for NTLM HTTP proxy therapy and this page is it? :o) Anyway, here are the top posts as of today:



Somewhat amusingly the stats graph on my 200th blog post shows a sudden meteoric rise in page views. Was I suddenly propelled to stardom? Of course not. It just so happened that my most recent post at the time got some extra views after the link was retweeted by a few people who’s follower count is measured in the thousands. It happened again a couple of years later, but in between it’s sat around the 4,500 views / month from what I can tell.




The 1 million views mark is still some way off, probably another 2.5 years, unless I manage to write something incredibly profound before then. (I won’t hold my breath though as 10 years of sample data must be statistically valid and it hasn’t happened so far.)

The Future

So, what for the future? Hopefully I’m going to keep plodding along with both my blog and any other outlets that will accept my written word. I have 113 topics in my blog drafts folder so I’m not out of ideas just yet. Naturally many of those should probably be junked as my opinion has undoubtedly changed in the meantime, although that in itself is something to write about which is why I can’t bring myself to bin them just yet – there is still value there, somewhere.

Two things I have realised I’ve missed, due to spending more time writing, is reading books (both technical and fiction) and writing code outside of work, i.e. my free tools. However, while I’ve sorely missed both of these pursuits I have in no way regretted spending more time writing as software development is all about communication and therefore it was a skill that I felt I definitely needed to improve. My time can hardly be considered wasted.

Now that I feel I’ve reached an acceptable level of competency in my technical writing I’m left wondering whether I’m comfortable sticking with that or whether I should try and be more adventurous. Books like The Goal show that technical subjects can presented in more entertaining ways and I’m well aware that my writing is still far too dry. My suspicion is that I need to get back to reading more fiction, and with a more critical eye, before I’ll truly feel confident enough to branch out more regularly into other styles [3].

Where I signed off my 200th post with a genuine expectation that I’d be back again for my 300th I’m less sure about the future. Not that I’ll have given up writing, more that I’m less sure this blog will continue to be the place where I express myself most. Here’s to the next 100 posts.


[1] I wrote a few reviews of branch meetings and book reviews before then, but that didn’t feel quite the same to me as writing about technical aspects of the craft itself. The latter felt like you were exposing more of your own thoughts rather than “simply” recording the opinions of others.

[2] See “Missing the Daily Commute by Train” about why my volume of writing is highly correlated with where I’m working at the time.

[3] To date my efforts to be more adventurous have been limited to my Afterwood left-pad spoof “Knocked for Six” and the short poem “Risk-a-Verse”.