Modeling visual studio C++ compile times

Derek Jones from The Shape of Code

Last week I spotted an interesting article on the compile-time performance of C++ compilers running under Microsoft Windows. The author had obviously put a lot of work into gathering the data, and had taken care to have multiple runs to reduce the impact of random effects (128 runs to be exact); but, as if often the case, the analysis of the data was lackluster. I posted a comment asking for the data, and a link was posted the next day :-)

The compilers benchmarked were: Visual Studio 2015, Visual Studio 2017 and clang 7.0.1; the compilers were configured to target: C++20, C++17, C++14, C++11, C++03, or C++98. The source code used was 100 system headers.

If we are interested in understanding the contribution of each component to overall compile-time, the obvious fist regression model to build is:

compile_time = header_x+compiler_y+language_z

where: header_x are the different headers, compiler_y the different compilers and language_z the different target languages. There might be some interaction between variables, so something more complicated was tried first; the final fitted model was (code+data):

compile_time = k+header_x+compiler_y+language_z+compiler_y*language_z

where k is a constant (the Intercept in R’s summary output). The following is a list of normalised numbers to plug into the equation (clang is the default compiler and C++03 the default language, and so do not appear in the list, the : symbol represents the multiplication; only a few of the 100 headers are listed, details are available):

                             Estimate Std. Error  t value Pr(>|t|)    
               (Intercept)                  headerany 
               1.000000000                0.051100398 
               headerarray             headerassert.h 
               0.522336397               -0.654056185 
...
            headerwctype.h            headerwindows.h 
              -0.648095154                1.304270250 
              compilerVS15               compilerVS17 
              -0.185795534               -0.114590143 
             languagec++11              languagec++14 
               0.032930014                0.156363433 
             languagec++17              languagec++20 
               0.192301727                0.184274629 
             languagec++98 compilerVS15:languagec++11 
               0.001149643               -0.058735591 
compilerVS17:languagec++11 compilerVS15:languagec++14 
              -0.038582437               -0.183708714 
compilerVS17:languagec++14 compilerVS15:languagec++17 
              -0.164031495                         NA 
compilerVS17:languagec++17 compilerVS15:languagec++20 
              -0.181591418                         NA 
compilerVS17:languagec++20 compilerVS15:languagec++98 
              -0.193587045                0.062414667 
compilerVS17:languagec++98 
               0.014558295 

As an example, the (normalised) time to compile wchar.h using VS15 with languagec++11 is:
1-0.514807638-0.183862162+0.033951731-0.059720131

Each component adds/substracts to/from the normalised mean.

Building this model didn’t take long. While waiting for the kettle to boil, I suddenly realised that an additive model was probably inappropriate for this problem; oops. Surely the contribution of each component was multiplicative, i.e., components have a percentage impact to performance.

A quick change to the form of the fitted model:

log(compile_time) = k+header_x+compiler_y+language_z+compiler_y*language_z

Taking the exponential of both side, the fitted equation becomes:

compile_time = e^{k}e^{header_x}e^{compiler_y}e^{language_z}e^{compiler_y*language_z}

The numbers, after taking the exponent, are:

               (Intercept)                  headerany 
              9.724619e+08               1.051756e+00 
...
            headerwctype.h            headerwindows.h 
              3.138361e-01               2.288970e+00 
              compilerVS15               compilerVS17 
              7.286951e-01               7.772886e-01 
             languagec++11              languagec++14 
              1.011743e+00               1.049049e+00 
             languagec++17              languagec++20 
              1.067557e+00               1.056677e+00 
             languagec++98 compilerVS15:languagec++11 
              1.003249e+00               9.735327e-01 
compilerVS17:languagec++11 compilerVS15:languagec++14 
              9.880285e-01               9.351416e-01 
compilerVS17:languagec++14 compilerVS15:languagec++17 
              9.501834e-01                         NA 
compilerVS17:languagec++17 compilerVS15:languagec++20 
              9.480678e-01                         NA 
compilerVS17:languagec++20 compilerVS15:languagec++98 
              9.402461e-01               1.058305e+00 
compilerVS17:languagec++98 
              1.001267e+00 

Taking the same example as above: wchar.h using VS15 with c++11. The compile-time (in cpu clock cycles) is:
9.724619e+08*3.138361e-01*7.286951e-01*1.011743e+00*9.735327e-01

Now each component causes a percentage change in the (mean) base value.

Both of these model explain over 90% of the variance in the data, but this is hardly surprising given they include so much detail.

In reality compile-time is driven by some combination of additive and multiplicative factors. Building a combined additive and multiplicative model is going to be like wrestling an octopus, and is left as an exercise for the reader :-)

Given a choice between these two models, I think the multiplicative model is probably closest to reality.

xkcd-style plots in MatPlotLib

Frances Buontempo from BuontempoConsulting

Most programmers I know are familiar with xkcd, the webcomic of romance, sarcasm, math, and language. In order to create diagrams for my machine learning book, I wanted a way to create something I could have fun with.

I discovered that Python's MatPlotLib library has an xkcd style, which you simply wrap round a plot. This allowed me to piece together what I needed using line segments, shapes, and labels.

Given a function, f, which draws what you need on some axes ax, use the style like this, and you're done:

with plt.xkcd():
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    f(ax)

    plt.show()

For example to explain what happens when you fire cannons at different angles:


I gave a talk at Skillsmatter, called "Visualisation FTW", which was recorded, so you can watch it if you want. I also wrote this up for ACCU's CVu magazine. The ACCU runs an annual best article survey, and I was runner up, which was a pleasant surprise. You need to be a member to view the article, but it covers the same ground as the short talk at Skillsmatter.

Buy a copy of my book, or go play with xkcd style pictures. Have fun; I did.



London Python Meetup January 2019 – Async Python and GeoPandas

Andy Balaam from Andy Balaam's Blog

It was a pleasure to go to the London Python Meetup organised by @python_london. There were plenty of friendly people and interesting conversations.

I gave a talk “Making 100 million requests with Python aiohttp” (slides) explaining the basics of writing async code in Python 3 and how I used that to make a very large number of HTTP requests.

Andy giving the presentation

(Photo by CB Bailey.)

Hopefully it was helpful – there were several good questions, so I am optimistic that people were engaged with it.

After that, there was an excellent talk by Gareth Lloyd called “GeoPandas, the geospatial extension for Pandas” in which he explained how to use the very well-developed geo-spatial data tools available in the Python ecosphere to transform, combine, plot and analyse data which includes location information. I was really impressed with how easy the libraries looked to use, and also with the cool Jupyter notebook Gareth used to explain the ideas using live demos.

London Python Meetups seem like a cool place to meet Pythonistas of all levels of experience in a nice, low-pressure environment!

Meetup link: aiohttp / GeoPandas

Visual Lint 7.0.5.314 has been released

Products, the Universe and Everything from Products, the Universe and Everything

This is a recommended maintenance update for Visual Lint 7.0. The following changes are included:

  • If an output report folder is specified on the command line and its path is relative, VisualLintConsole now resolves it before starting analysis. This ensures that the path referenced in any status/progress messages is is unambiguous.

  • Fixed a bug in VisualLintConsole in the copying of Javascript baggage files to the (optional) output report folder.

  • Fixed a minor display bug in the VisualLintGui "Projects" Display.

Visual Lint 7.0.5.314 has been released

Products, the Universe and Everything from Products, the Universe and Everything

This is a recommended maintenance update for Visual Lint 7.0. The following changes are included:

  • If an output report folder is specified on the command line and its path is relative, VisualLintConsole now resolves it before starting analysis. This ensures that the path referenced in any status/progress messages is is unambiguous.

  • Fixed a bug in VisualLintConsole in the copying of Javascript baggage files to the (optional) output report folder.

  • Fixed a minor display bug in the VisualLintGui "Projects" Display.

Setting up my own VPN server on Vultr with Centos 7 and WireGuard

Timo Geusch from The Lone C++ Coder's Blog

As an IT consultant, I travel a lot. I mean, a lot. Part of the pleasure is having to deal with day-to-day online life on open, potentially free-for-all hotel and conference WiFi. In other words, the type of networks you really want to do your online banking, ecommerce and other potentially sensitive operations on. After […]

The post Setting up my own VPN server on Vultr with Centos 7 and WireGuard appeared first on The Lone C++ Coder's Blog.

Setting up my own VPN server on Vultr with Centos 7 and WireGuard

The Lone C++ Coder's Blog from The Lone C++ Coder's Blog

As an IT consultant, I travel a lot. I mean, a lot. Part of the pleasure is having to deal with day-to-day online life on open, potentially free-for-all hotel and conference WiFi. In other words, the type of networks you really want to do your online banking, ecommerce and other potentially sensitive operations on. After seeing one too many ads for VPN services on bad late night TV I finally decided I needed to do something about it. Ideally I intended to this on the cheap and learn something in the process. I also didn’t want to spend the whole weekend trying to set it up, which is how WireGuard entered the picture. I only really needed to protect my most sensitive device - my personal travel laptop.

As I’m already a customer at Vultr (affiliate link) I decided to just spin up another of their tiny instances and set it up as my WireGuard VPN server. Note that I’m not setting up a VPN service for the whole family, all my friends and some additional people, all I’m trying to do is secure some of my online communications a little bit more.

I also decided to document this experiment, both for my own reference and in the hope that it will be useful for someone else. Readers will need to have some experience setting up and administering Linux server. Come on in and follow along!

Teaching basic data analysis to programmers: summer internship

Derek Jones from The Shape of Code

Software engineering is one of the topics in this year’s summer internships being sponsored by R-Studio. The spec says: “Data Science Training for Software Engineers – Develop course materials to teach basic data analysis to programmers using software engineering problems and data sets.”

It’s good to see interest in data analysis of software engineering data start to gain traction.

What topics might basic data analysis for programmers include? I have written about statistical techniques that I think are useful in software engineering, but I don’t think this list would be regarded as basic. Techniques that are think are basic are:

  • a picture is worth a thousand words, so obviously visualization is a major topic,
  • building regression models is good for helping to understand what is going on.

Anything else? Well, I don’t know.

An alternative approach to teaching basic data analysis is to give examples of the kind of useful things it can be used to do. Software developers are fast learners, and given the motivation have the skills needed to find and learn techniques that they think are of use. In a basic course, I would put the emphasis on motivating developers to think that data analysis can help them do a better job.

I would NOT, repeat, not, include any material on machine learning. Software engineering data sets tend to be too small to obtain reliable results from machine learning, and I don’t want to encourage clueless button pushers.

What are the desirable skills in the summer intern? I would say that being able to write readable material is the most important, with statistical knowledge ranked second; the level of software engineering knowledge is unimportant. Data analysis tends to follow the same pattern whatever the subject; so it’s important to get somebody who knows about data analysis.

A social science major is the obvious demographic for this intern (they do lots of data analysis); the last people to consider are students majoring in a computing subject.

On Onwards And Downwards – student

student from thus spake a.k.

When last they met, the Baron challenged Sir R----- to evade capture whilst moving rooks across and down a chessboard. Beginning with a single rook upon the first file and last rank, the Baron should have advanced it to the second file and thence downwards in rank in response to which Sir R----- should have progressed a rook from beneath the board by as many squares and if by doing so had taken the Baron's would have won the game. If not, Sir R----- could then have chosen either rook, barring one that sits upon the first rank, to move to the next file in the same manner with the Baron responding likewise. With the game continuing in this fashion and ending if either of them were to take a rook moved by the other or if every file had been played upon, the Baron should have had a coin from Sir R----- if he took a piece and Sir R----- one of the Baron's otherwise.