GDPR has a huge impact on empirical software engineering research

Derek Jones from The Shape of Code

The EU’s General Data Protection Regulation (GDPR) is going to have a huge impact on empirical software engineering research. After 25 May 2018, analyzing source code will never be the same again.

I am not a lawyer and nothing qualifies me to talk about the GDPR.

People put their name in source code, bug tracking databases and discussion forums; this is personal identifying information.

Researchers use personal names to obtain information about a wide variety of activities, e.g., how much code did individuals write, how many bug reports did they process, contributions in discussions of one sort or another.

Open source licenses give others all kinds of rights (e.g., ability to use and modify source code), but they do not contain any provisions for processing personal data.

Adding a “I hereby give permission for anybody to process information about my name in any way they see fit.” clause to licenses is not going to help.

The GDPR requires (article 5: Principles relating to processing of personal data):

“Personal data shall be: … collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes;”

That is, personal data can only be processed for the specific reason it was collected, i.e., if you come up with another bright idea for analysis of data that has just been collected, it may be necessary to obtain consent, from those whose personal data it is, before trying out the bright idea.

It is not possible to obtain blanket permission (article 6, Lawfulness of processing):

“…the data subject has given consent to the processing of his or her personal data for one or more specific purposes;”, i.e., consent has to be obtained from the data subject for each specific purpose.

Github’s Global Privacy Practices shows that Github are intent on meeting the GDPR requirements, they include: “GitHub provides clear methods of unambiguous, informed consent at the time of data collection, when we do collect your personal data.”. Processing personal information, about an EU citizen, contained in source code appears to be a violation of Github’s terms of service.

The GDPR has many other requirements, e.g., right to obtain information on what information is held and right to be forgotten. But, the upfront killer is not being able to cheaply collect lots of code and then use personal information to help with the analysis.

There are exceptions for: Processing for archiving, scientific or historical research or statistical purposes. Can somebody who blogs and is writing a book claim to be doing scientific research? People who know more about these exceptions than me, tell me that there could be a fair amount of paperwork involved when making use of the exception, i.e., being able to show that privacy safeguards are in place.

Then, there is the issue of what constitutes personal information. Git’s hashing algorithm makes use of the committer’s name and/or email address. Is a git hash personal identifying information?

A good introduction to the GDPR for developers.

Can you get a deadlock with a single lock and an IO operation?

Timo Geusch from The Lone C++ Coder's Blog

Quite a while ago, I answered a question about the basic deadlock scenario on Stack Overflow. More recently, I got an interesting comment on it. The poster asked if it was possible to get a deadlock with a single lock and an I/O operation. My first gut reaction was “no, not really”, but it got […]

The post Can you get a deadlock with a single lock and an IO operation? appeared first on The Lone C++ Coder's Blog.

Can you get a deadlock with a single lock and an IO operation?

The Lone C++ Coder's Blog from The Lone C++ Coder's Blog

Quite a while ago, I answered a question about the basic deadlock scenario on Stack Overflow. More recently, I got an interesting comment on it. The poster asked if it was possible to get a deadlock with a single lock and an I/O operation. My first gut reaction was “no, not really”, but it got me thinking. So let’s try to unroll the scenario and see if we can reason at least about my gut feeling.

Reliability chapter added to “Empirical software engineering using R”

Derek Jones from The Shape of Code

The Reliability chapter of my Empirical software engineering book has been added to the draft pdf (download here).

I have been working on this draft for four months and it still needs lots of work; time to move on and let it stew for a while. Part of the problem is lack of public data; cost and schedule overruns can be rather public (projects chapter), but reliability problems are easier to keep quiet.

Originally there was a chapter covering reliability and another one covering faults. As time passed, these merged into one. The material kept evaporating in front of my eyes (around a third of the initial draft, collected over the years, was deleted); I have already written about why most fault prediction research is a waste of time. If it had not been for Rome I would not have had much to write about.

Perhaps what will jump out at people most, is that I distinguish between mistakes in code and what I call a fault experience. A fault_experience=mistake_in_code + particular_input. Most fault researchers have been completely ignoring half of what goes into every fault experience, the input profile (if the user does not notice a fault, I do not consider it experienced) . It’s incredibly difficult to figure out anything about the input profile, so it has been quietly ignored (one of the reasons why research papers on reported faults are such a waste of time).

I’m also missing an ‘interesting’ figure on the opening page of the chapter. Suggestions welcome.

I have not said much about source code characteristics. There is a chapter covering source code, perhaps some of this material will migrate to reliability.

All sorts of interesting bits and pieces have been added to earlier chapters. Ecosystems keeps growing and in years to come somebody will write a multi-volume tomb on software ecosystems.

I have been promised all sorts of data. Hopefully some of it will arrive.

As always, if you know of any interesting software engineering data, please tell me.

Source code chapter next.

No more pointers

Simon Brand from Simon Brand

One of the major changes at the most recent C++ standards meeting in Jacksonville was the decision to deprecate raw pointers in C++20, moving to remove them completely in C++23. This came as a surprise to many, with a lot of discussion as to how we’ll get by without this fundamental utility available any more. In this post I’ll look at how we can replace some of the main use-cases of raw pointers in C++20.

Three of the main reasons people use raw pointers are:

  • Dynamic allocation & runtime polymorphism
  • Nullable references
  • Avoiding copies

I’ll deal with these points in turn, but first, an answer to the main question people ask about this change.

The elephant in the room

What about legacy code? Don’t worry, the committee have come up with a way to both move forward the language boldly forward without breaking all the millions of lines of C++ which people have written over the years: opt-in extensions.

If you want to opt-in to C++20’s no-pointers feature, you use #feature.

#feature <no_pointers> //opt-in to no pointers
#feature <cpp20>       //opt-in to all C++20 features

This is a really cool new direction for the language. Hopefully with this we can slowly remove features like std::initializer_list so that new code isn’t bogged down with legacy as much as it is today.

Dynamic allocation & runtime polymorphism

I’m sure most of you already know the answer to this one: smart pointers. If you need to dynamically allocate some resource, that resource’s lifetime should be managed by a smart pointer, such as std::unique_ptr or std::shared_ptr. These types are now special compiler-built-in types rather than normal standard library types. In fact, std::is_fundamental<std::unique_ptr<int>>::value now evaluates to true!

Nullable references

Since references cannot be rebound and cannot be null, pointers are often used to fulfil this purpose. However, with C++20, we have a new type to fulfil this purpose: std::optional<T&>. std::optional was first introduced in C++17, but was plagued with no support for references and no monadic interface. C++20 has fixed both of these, so now we have a much more usable std::optional type which can fill the gap that raw pointers have left behind.

Avoiding copies

Some people like to use raw pointers to avoid copies at interface boundaries, such as returning some resource from a function. Fortunately, we have much better options, such as (Named) Return Value Optimization. C++17 made some forms of copy elision mandatory, which gives us even more guarantees for the performance of our code.

Wrapping up

Of course there are more use-cases for raw pointers, but this covers three of the most common ones. Personally, I think this is a great direction to see the language going in, and I look forward to seeing other ways we can slowly make C++ into a simpler, better language.