ACCU World of Code – Page 83 – Aggregating coding blogs

The post Setting up my own VPN server on Vultr with Centos 7 and WireGuard appeared first on The Lone C++ Coder's Blog.

January 21, 2019

Setting up my own VPN server on Vultr with Centos 7 and WireGuard

The Lone C++ Coder's Blog from The Lone C++ Coder's Blog

As an IT consultant, I travel a lot. I mean, a lot. Part of the pleasure is having to deal with day-to-day online life on open, potentially free-for-all hotel and conference WiFi. In other words, the type of networks you really want to do your online banking, ecommerce and other potentially sensitive operations on. After seeing one too many ads for VPN services on bad late night TV I finally decided I needed to do something about it. Ideally I intended to do this on the cheap and learn something in the process. I also didn’t want to spend the whole weekend trying to set it up, which is how WireGuard entered the picture. I only really needed to protect my most sensitive device - my personal travel laptop.

As I’m already a customer at Vultr (affiliate link) I decided to just spin up another of their tiny instances and set it up as my WireGuard VPN server. Note that I’m not setting up a VPN service for the whole family, all my friends and some additional people, all I’m trying to do is secure some of my online communications a little bit more.

I also decided to document this experiment, both for my own reference and in the hope that it will be useful for someone else. Readers will need to have some experience setting up and administering a Linux server. Come on in and follow along!

January 21, 2019

Teaching basic data analysis to programmers: summer internship

Derek Jones from The Shape of Code

Software engineering is one of the topics in this year’s summer internships being sponsored by R-Studio. The spec says: “Data Science Training for Software Engineers – Develop course materials to teach basic data analysis to programmers using software engineering problems and data sets.”

It’s good to see interest in data analysis of software engineering data start to gain traction.

What topics might basic data analysis for programmers include? I have written about statistical techniques that I think are useful in software engineering, but I don’t think this list would be regarded as basic. Techniques that I think are basic are:

a picture is worth a thousand words, so obviously visualization is a major topic,
building regression models is good for helping to understand what is going on.

Anything else? Well, I don’t know.

An alternative approach to teaching basic data analysis is to give examples of the kind of useful things it can be used to do. Software developers are fast learners, and given the motivation have the skills needed to find and learn techniques that they think are of use. In a basic course, I would put the emphasis on motivating developers to think that data analysis can help them do a better job.

I would NOT, repeat, not, include any material on machine learning. Software engineering data sets tend to be too small to obtain reliable results from machine learning, and I don’t want to encourage clueless button pushers.

What are the desirable skills in the summer intern? I would say that being able to write readable material is the most important, with statistical knowledge ranked second; the level of software engineering knowledge is unimportant. Data analysis tends to follow the same pattern whatever the subject; so it’s important to get somebody who knows about data analysis.

A social science major is the obvious demographic for this intern (they do lots of data analysis); the last people to consider are students majoring in a computing subject.

January 18, 2019

On Onwards And Downwards – student

student from thus spake a.k.

When last they met, the Baron challenged Sir R----- to evade capture whilst moving rooks across and down a chessboard. Beginning with a single rook upon the first file and last rank, the Baron should have advanced it to the second file and thence downwards in rank in response to which Sir R----- should have progressed a rook from beneath the board by as many squares and if by doing so had taken the Baron's would have won the game. If not, Sir R----- could then have chosen either rook, barring one that sits upon the first rank, to move to the next file in the same manner with the Baron responding likewise. With the game continuing in this fashion and ending if either of them were to take a rook moved by the other or if every file had been played upon, the Baron should have had a coin from Sir R----- if he took a piece and Sir R----- one of the Baron's otherwise.

January 18, 2019

Choosing between two reasonably fitting probability distributions

Derek Jones from The Shape of Code

I sometimes go fishing for a probability distribution to fit some software engineering data I have. Why would I want to spend time fishing for a probability distribution?

Data comes from events that are driven by one or more processes. Researchers have studied the underlying patterns present in many processes and in some cases have been able to calculate which probability distribution matches the pattern of data that it generates. This approach starts with the characteristics of the processes and derives a probability distribution. Often I don’t really know anything about the characteristics of the processes that generated the data I am looking at (but I can often make what I like to think are intelligent guesses). If I can match the data with a probability distribution, I can use what is known about processes that generate this distribution to get some ideas about the kinds of processes that could have generated my data.

Around nine months ago, I learned about the Conway–Maxwell–Poisson distribution (or COM-Poisson). This looked as if it might find some use in fitting software engineering data, and I added it to my list of distributions to keep in mind. I saw that the R package COMPoissonReg supports the fitting of COM-Poisson distributions.

This week I came across one of the papers about COM-Poisson that I was reading nine months ago, and decided to give it a go with some count-data I had.

The Poisson distribution involves count-data, i.e., non-negative integers. Lots of count-data samples are well described by a Poisson distribution, and it is one of the basic distributions supported by statistical packages. Processes described by a Poisson distribution are memory-less, in that the probability of an event occurring is independent of when previous events occurred. When there is a connection between events, the Poisson distribution is not such a good fit (depending on the strength of the connection between events).

While a process that generates count-data may not meet the requirements needed to be exactly described by a Poisson distribution, the behavior may be close enough to give good-enough results. R supports a quasipoisson distribution to help handle the ‘near-misses’.
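One rough way to judge how near a ‘near-miss’ is: a Poisson distribution has a variance equal to its mean, so the variance-to-mean ratio of a sample gives a quick hint. The sketch below is in Python rather than R, and both the helper name `dispersion_index` and the sample counts are made up for illustration; it is not part of the analysis described in this post.

```python
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio of a sample of counts.

    For Poisson data the ratio is close to 1; a ratio well above 1
    suggests overdispersion, well below 1 underdispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return pvariance(counts) / m


# Made-up counts, for illustration only (not the Racket data).
sample = [0, 1, 1, 2, 2, 2, 3, 3, 4, 12]
ratio = dispersion_index(sample)
print(f"variance/mean = {ratio:.2f}")
```

A ratio far from 1 is only a hint that a plain Poisson model may fit poorly; it does not say which alternative distribution to use.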

Sometimes count-data has a distribution that looks nothing like a Poisson. The Negative-binomial distribution is the obvious next choice to try (this can be viewed as a combination of different Poisson distributions; another combination is the Poisson inverse gaussian distribution).

The plot below (from a paper analyzing usage of record data structures in Racket; Tobias Pape kindly sent me the data) shows the number of Racket structure types that contain a given number of fields (red pluses), along with lines showing fitted Negative binomial and COM-Poisson distributions (code+data):

Number of Racket structure types containing a given number of fields.

I’m interested in understanding the processes that are generating the data, and having two distributions do such a reasonable job of fitting the data has given me more possible distinct explanations for what is going on than I wanted (if I were interested in prediction, then either distribution looks like it would do a good-enough job).

What are the characteristics of the processes that generate data having each of the distributions?

A Negative binomial can be viewed as a combination of Poisson distributions (the combination having a Gamma distribution). We could create a story around multiple processes being responsible for the pattern seen, with each of these processes having the impact of a Poisson distribution. Sounds plausible.
A COM-Poisson distribution can be viewed as a Poisson distribution which is length dependent. We could create a story around the probability of a field being added to a structure type being dependent on the number of existing fields it contains. Sounds plausible (it’s a slightly different idea from preferential attachment).
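To make the two candidates concrete, here is a minimal Python sketch of both probability mass functions, using the standard textbook forms. The function names, the parameter values, and the truncation of the COM-Poisson normalising series at 200 terms are all assumptions made for illustration; this is not the code used to produce the fits above, which relied on R.

```python
import math


def com_poisson_pmf(k, lam, nu, terms=200):
    """P(X = k) for a COM-Poisson distribution with rate lam and dispersion nu.

    The normalising constant is an infinite series; it is truncated after
    `terms` terms here, which is adequate only when lam is modest and nu
    is not close to zero.
    """
    def log_weight(j):
        # log of lam**j / (j!)**nu
        return j * math.log(lam) - nu * math.lgamma(j + 1)

    log_z = math.log(sum(math.exp(log_weight(j)) for j in range(terms)))
    return math.exp(log_weight(k) - log_z)


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def neg_binomial_pmf(k, r, p):
    """P(X = k) for a Negative binomial with size r (may be non-integer)
    and success probability p."""
    log_coeff = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coeff + r * math.log(p) + k * math.log(1 - p))


# With nu = 1 the COM-Poisson reduces to the ordinary Poisson.
print(com_poisson_pmf(2, 3.0, 1.0), poisson_pmf(2, 3.0))
```

In this form the ratio of successive COM-Poisson probabilities, P(k)/P(k-1), is lam / k**nu, which is one way to read the length-dependence story: with nu above 1 the chance of one more field falls off faster than under a Poisson, and with nu below 1 more slowly.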

When fitting a distribution to data, I usually go with the ‘brand-name’ distributions (i.e., the one with most name recognition, provided it matches well enough; brand names are an easier sell than the less well known names).

The Negative binomial distribution is the brand-name here. I had not heard of the COM-Poisson distribution until nine months ago.

Perhaps the authors of the Racket analysis paper will come up with a theory that prefers one of these distributions, or even suggests another one.

Readers of my evidence-based software engineering book need to be aware of my brand-name preference in some of the data fitting that occurs.

January 15, 2019

The pImpl Idiom

Arne Mertz from Simplify C++!

The pImpl idiom is a useful idiom in C++ to reduce compile-time dependencies. Here is a quick overview of what to keep in mind when we implement and use it. […]

The post The pImpl Idiom appeared first on Simplify C++!.

January 14, 2019

Visual Lint 6.5.6.302 has been released

Products, the Universe and Everything from Products, the Universe and Everything

This is a recommended maintenance update for Visual Lint 6.0 and 6.5. The following changes are included:

Modified generated Vera++ command lines to replace the -showrules option with --show-rule. In consequence the minimum supported version of Vera++ is now 1.2.1.
When a Visual Studio 2017 project using the /Zc:alignedNew or /Zc:alignedNew+ option is loaded the C++ 17 __STDCPP_DEFAULT_NEW_ALIGNMENT__ preprocessor symbol will now be included in the generated analysis configuration.
Corrected the value of _MSC_FULL_VER referenced in the PC-lint Plus compiler indirect files for Visual Studio .NET 2002 and 2003 (co-rb-vs2002.lnt and co-rb-vs2003.lnt respectively).

Download Visual Lint 6.5.6.302

January 10, 2019

Wanted: 99 effort estimation datasets

Derek Jones from The Shape of Code

Every now and again, I stumble upon a really interesting dataset. Previously, when this happened I wrote an extensive blog post; but the SiP dataset was just too big and too detailed, it called out for a more expansive treatment.

How big is the SiP effort estimation dataset? It contains 10,100 unique task estimates, from ten years of commercial development using Agile. That’s around two orders of magnitude larger than other, current, public effort datasets.

How detailed is the SiP effort estimation dataset? It contains the (anonymized) identity of the 22 developers making the estimates, for one of 20 project codes, dates, plus various associated items. Other effort estimation datasets usually just contain values for estimated effort and actual effort.

Data analysis is a conversation between the person doing the analysis and the person(s) with knowledge of the application domain from which the data came. The aim is to discover information that is of practical use to the people working in the application domain.

I suggested to Stephen Cullum (the person I got talking to at a workshop, a director of Software in Partnership Ltd, and supplier of data) that we write a paper having the form of a conversation about the data; he bravely agreed.

The result is now available: A conversation around the analysis of the SiP effort estimation dataset.

What next?

I’m looking forward to seeing what other people do with the SiP dataset. There are surely other patterns waiting to be discovered, and what about building a simulation model based on the characteristics of this data?

Turning software engineering into an evidence-based discipline requires a lot more data; I plan to go looking for more large datasets.

Software engineering researchers are a remarkably unambitious bunch of people. The SiP dataset should be viewed as the first of 100 such datasets. With 100 datasets we can start to draw general, believable conclusions about the processes involved in software effort estimation.

Readers, today is the day you start asking managers to make any software engineering data they have publicly available. Yes, it can be anonymized (I am willing to do that for people who are looking to release data). Yes, ‘old’ data is useful (data from the 1980s could have an interesting story to tell; SiP runs from 2004-2014). Yes, I will analyze any interesting data that is made public for free.

Ask, and you shall receive.

January 9, 2019

Error handling omitted for brevity

Allan Kelly from Allan Kelly Associates

Q: What is the difference between programming in college and programming in the real world?
A: Error handling

Do you remember when you were learning to program? Do you remember those text books you had back in college? And do you remember what they said about error handling?

As I remember it most of what they said about error handling was:

/* error handling omitted for brevity */

Or perhaps:

(* error handling omitted for brevity *)

Back in college error handling hardly got a mention, and if it did it was to abort the program. Yet in the real world 80% of what you program is error handling, or rather exceptions, the corner cases, what happens when things go wrong.

I’ve been saying this for years but this week I realised how shocking this was.

A couple of years ago a paper entitled “Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems” (2014; you know it’s an academic paper because it has 8 authors) was momentarily famous on Twitter. I grabbed it and had a quick read but this week I had reason to go back and look at it again. In the process I found a 20-minute video presentation by one of the authors.

To cut a long story short, the authors looked at the source code for large open source applications (Cassandra, MapReduce, etc.) and software failures. Among various findings they reported:

Finding 1: “A majority (77%) of the failures require more than one input event to manifest, but most of the failures (90%) require no more than 3” – so even if they didn’t happen very often, they were difficult to simulate in system testing
Finding 9: “A majority of the production failures (77%) can be reproduced by a unit test.” (Yes, the reoccurrence of 77% is suspicious but I think it is an improbable but genuine coincidence; please read the paper or watch the video before you fault the paper on this.)
Finding 10: “Almost all catastrophic failures (92%) are the result of incorrect handling of non-fatal errors explicitly signalled in software.”
Finding 11: “35% of the catastrophic failures are caused by trivial mistakes in error handling logic – ones that simply violate best programming practices; and that can be detected without system specific knowledge.”

The authors even created a tool to scan code for some of these problems. In many cases they found code like:

catch (…) {
// TODO
}
catch (Exception e) {
/* will never happen */
}
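Handlers like these can be found mechanically. The sketch below is a Python analogue written for this illustration, not the tool the paper’s authors built: it parses Python source with the standard `ast` module and reports `except` handlers whose body does nothing. The function names and the sample source are hypothetical.

```python
import ast


def find_swallowed_exceptions(source):
    """Return the line numbers of `except` handlers whose body does nothing.

    A handler counts as empty here when every statement in it is `pass`,
    `...`, or a bare constant such as a string used as a note.
    """
    empty = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler) and all(_is_noop(s) for s in node.body):
            empty.append(node.lineno)
    return sorted(empty)


def _is_noop(stmt):
    if isinstance(stmt, ast.Pass):
        return True
    # `...` or a bare string/number expression with no effect
    return isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)


# Hypothetical source to scan; `risky` and `log` are never called,
# the text is only parsed.
sample = '''
try:
    risky()
except ValueError:
    pass  # TODO
try:
    risky()
except Exception as exc:
    log(exc)
    raise
'''

print(find_swallowed_exceptions(sample))
```

A check this simple only catches the most blatant cases; a handler that logs and carries on regardless would pass unnoticed.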

My old jibe about error handling looked very real.

This morning I pulled some old books off my shelves and was shocked by what I found:

First the book I was prescribed at not one but two University programming courses: “Problem Solving and Structured Programming in Modula-2” by Elliot B. Koffman (1988).

I can’t find “Error handling omitted” in this book; my memory was wrong, but the book is worse. I can’t find any error handling to speak of! I found one example which returns a boolean success/fail flag but there is no discussion of what to do with it. “Error handling” is not even in the index, let alone the table of contents – actually “Error” isn’t even there.

Each chapter ends with a “Common Programming Errors” section but this section is mostly about compile time errors.

Next I looked at the silver book, Wirth’s “Pascal User Manual and Report” (1991). I can only find two references to “errors” (nothing on exceptions). Both these references are in the report section and don’t say anything about how to program error handling.

As I looked at more old books I noticed how they just assumed everything worked well.

K&R is slightly better – “The C Programming Language” by Kernighan and Ritchie (1988) that is. Most of the examples here do check for errors, then printf. Sometimes that is it, sometimes they return 0 or break. On page 164 they say:

“We have generally not worried about exit status in our small illustrative programs, but any serious program should take care to return sensible, useful status values.”

In other words: Error handling omitted for brevity.
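For contrast, here is roughly what “sensible, useful status values” can look like in a small program. This is a sketch in Python rather than C, and the program itself (a line counter) and the choice of status codes 0, 1 and 2 are assumptions made for illustration.

```python
import sys


def count_lines(path):
    # Binary mode avoids decoding errors; only line endings are counted.
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def main(argv):
    """Return 0 on success, 2 on a usage error, 1 if the file cannot be read."""
    if len(argv) != 2:
        print(f"usage: {argv[0]} FILE", file=sys.stderr)
        return 2
    try:
        print(count_lines(argv[1]))
    except OSError as exc:
        print(f"{argv[0]}: cannot read {argv[1]}: {exc}", file=sys.stderr)
        return 1
    return 0

# In a script, finish with: sys.exit(main(sys.argv))
```

The error paths take up about as many lines as the work itself, which is rather the point.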

Moving away from the introductory books I turned to what might be the longest single volume technical book I ever read. A book I quoted as a bible, a book whose author I still put on a pedestal: “Large Scale C++ Software Design”, John Lakos (1996). While John does say a bit more about error handling it does not feature in the index and there is no dedicated section on it. Looking at it now I am in disbelief: how could a book on large scale C++ not have at least one chapter on error handling?

Of the books I looked at this morning only Kernighan and Pike’s “Practice of Programming” (1999) gave any coverage to error handling. And that isn’t saying much.

OK, these are all ancient books. Have things changed? – you tell me.

I hope more recent books, in more modern languages, have got better – and my old (1999) copy of “Learning Python” (Ascher) contains a whole chapter on exceptions as does Stroustrup’s “C++ Programming Language” (2000).

But I am sure error and exception handling hasn’t got any simpler. I can’t believe that JavaScript, PHP, Swift, and similar have somehow made the problem go away. “Throw exception(blah, blah, blah)” might be a great improvement over “return -1” but I can’t imagine handling these cases has got easier.

Based on the “Simple Testing” paper, efforts to train programmers in error handling need to be redoubled.

The post Error handling omitted for brevity appeared first on Allan Kelly Associates.