Running an experiment is very time-consuming. I am always surprised that people put so much effort into gathering the data and then spend so little effort analyzing it.
The Computer Language Benchmarks Game looks like a fun benchmark; it compares the performance of 27 languages using various toy benchmarks (they could not be said to be representative of real programs). And, yes, lots of boxplots and tables of numbers; great eye-candy, but what do they all mean?
The authors, like good experimentalists, make all their data available. So, what analysis should they have done?
A regression model is the obvious choice, and the following three lines of R (four lines if you count the blank line) build one, providing lots of interesting performance information:
    cl=read.csv("Computer-Language_u64q.csv.bz2", as.is=TRUE)

    cl_mod=glm(log(cpu.s.) ~ name+lang, data=cl)
    summary(cl_mod)
The following is a cut-down version of the output from the call to summary, which summarizes the model built by the call to glm:
                        Estimate Std. Error t value Pr(>|t|)
    (Intercept)         1.299246   0.176825   7.348 2.28e-13 ***
    namechameneosredux  0.499162   0.149960   3.329 0.000878 ***
    namefannkuchredux   1.407449   0.111391  12.635  < 2e-16 ***
    namefasta           0.002456   0.106468   0.023 0.981595
    namemeteor         -2.083929   0.150525 -13.844  < 2e-16 ***
    langclojure         1.209892   0.208456   5.804 6.79e-09 ***
    langcsharpcore      0.524843   0.185627   2.827 0.004708 **
    langdart            1.039288   0.248837   4.177 3.00e-05 ***
    langgcc            -0.297268   0.187818  -1.583 0.113531
    langocaml          -0.892398   0.232203  -3.843 0.000123 ***

        Null deviance: 29610  on 6283  degrees of freedom
    Residual deviance: 22120  on 6238  degrees of freedom
What do all these numbers mean?
We start with glm's first argument, which is a specification of the regression model we are trying to fit:

    log(cpu.s.) ~ name+lang

Here cpu.s. is cpu time, name is the name of the program and lang is the language. I found these by looking at the column names in the data file. There are other columns in the data, but I am running in quick & simple mode. As a first stab, I thought cpu time would depend on the program and language.

Why take the log of the cpu time? Well, the model fitted using cpu time was very poor; the values range over several orders of magnitude and logarithms are a way of compressing this range (and the fitted model was much better).
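The model fitted is:

$$\text{cpu.s.} = e^{\text{Intercept} + \text{name} + \text{lang}} = e^{\text{Intercept}} \times e^{\text{name}} \times e^{\text{lang}}$$

where name and lang stand for the fitted coefficients of the chosen program and language.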
Plugging in some numbers, to predict the cpu time used by, say, the program chameneosredux written in the language clojure, we get:

$$e^{1.299246} \times e^{0.499162} \times e^{1.209892} = e^{3.0083} \approx 20.3$$

(values taken from the first column of numbers above).
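The same arithmetic can be checked in R. This is a minimal sketch: the variable names are mine, and the three coefficients are simply copied from the Estimate column of the summary output.

```r
# Coefficients copied from the Estimate column of the summary output.
intercept           <- 1.299246
name_chameneosredux <- 0.499162
lang_clojure        <- 1.209892

# The model was fitted to log(cpu.s.), so add on the log scale,
# then exponentiate to get back to cpu time.
log_cpu       <- intercept + name_chameneosredux + lang_clojure
predicted_cpu <- exp(log_cpu)

print(round(predicted_cpu, 1))
```

With the fitted model to hand, something like exp(predict(cl_mod, newdata=data.frame(name="chameneosredux", lang="clojure"))) should give the same value, assuming those are the level names used in the data file.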
This model assumes there is no interaction between program and language. In practice some languages might perform better/worse on some programs. Changing the first argument of glm to log(cpu.s.) ~ name*lang adds an interaction term, which does produce a better fitting model (but it's too complicated for a short blog post; another option is to build a mixed-model by using lmer from the lme4 package).
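To show what comparing the two model specifications looks like, here is a sketch using made-up data. The toy data frame, its program and language names, and the simulated effect sizes are all hypothetical stand-ins, not the benchmark data; only the two formulas come from the text.

```r
set.seed(1)

# Hypothetical stand-in data: 3 programs x 3 languages, 20 runs each.
toy <- expand.grid(name = c("p1", "p2", "p3"),
                   lang = c("a", "b", "c"),
                   run  = 1:20)

# Simulated log cpu time, with a deliberate program/language
# interaction: program p3 is unusually slow in language c.
toy$log_cpu <- 1 + 0.5 * (toy$name == "p2") + 1.0 * (toy$lang == "b") +
               1.5 * (toy$name == "p3" & toy$lang == "c") +
               rnorm(nrow(toy), sd = 0.2)
toy$cpu.s. <- exp(toy$log_cpu)

# Additive model (no interaction) and model with an interaction term.
additive <- glm(log(cpu.s.) ~ name + lang, data = toy)
interact <- glm(log(cpu.s.) ~ name * lang, data = toy)

# A lower AIC suggests a better trade-off between fit and complexity.
print(AIC(additive, interact))

# F test of whether the interaction terms improve the fit.
print(anova(additive, interact, test = "F"))
```

The interaction model estimates a separate adjustment for each program/language pair, which is why its summary output gets long very quickly on the real data.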
We can compare the relative cpu time used by different languages. The multiplication factor for clojure is $e^{1.209892} \approx 3.35$, while for ocaml it is $e^{-0.892398} \approx 0.41$. So clojure consumes 8.2 times as much cpu time as ocaml.
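The multiplication factors and their ratio are a one-liner each in R. Again a sketch, with my own variable names and the coefficients copied from the summary output:

```r
# Language coefficients copied from the Estimate column.
lang_clojure <-  1.209892
lang_ocaml   <- -0.892398

# Each factor multiplies the cpu time predicted for the baseline
# language (the one absorbed into the intercept).
factor_clojure <- exp(lang_clojure)
factor_ocaml   <- exp(lang_ocaml)

# Ratio of the two factors; equivalent to exp(lang_clojure - lang_ocaml).
ratio <- factor_clojure / factor_ocaml

print(round(c(clojure = factor_clojure, ocaml = factor_ocaml, ratio = ratio), 2))
```

Because the model has no interaction term, this ratio is the same for every program.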
How accurate are these values, from the fitted regression model?
The second column of numbers in the summary output lists the standard error of each estimate in the first column (i.e., the estimated standard deviation of that estimate). So the clojure multiplier is actually $e^{1.209892 \pm 1.96 \times 0.208456}$, i.e., between 2.2 and 5.0 (the multiplication by 1.96 is used to give a 95% confidence interval); the ocaml multiplier is $e^{-0.892398 \pm 1.96 \times 0.232203}$, i.e., between 0.3 and 0.6.
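The interval calculation can be sketched as a small helper. The function name ci_factor is hypothetical (it is not part of R), and the estimates and standard errors are copied from the first two columns of the summary output.

```r
# Hypothetical helper: 95% interval for a multiplication factor,
# computed on the log scale and then exponentiated.
ci_factor <- function(estimate, std_error, z = 1.96) {
  exp(estimate + c(lower = -1, upper = 1) * z * std_error)
}

ci_clojure <- ci_factor( 1.209892, 0.208456)
ci_ocaml   <- ci_factor(-0.892398, 0.232203)

print(round(ci_clojure, 2))
print(round(ci_ocaml, 2))
```

Given the fitted model, exp(confint.default(cl_mod)) should produce the same kind of interval for every coefficient at once, since confint.default also uses the estimate plus or minus a normal quantile times the standard error.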
The fourth column of numbers is the p-value for the fitted parameter, i.e., the probability of seeing an estimate at least this far from zero if the true value were zero. A value lower than 0.05 is a common criterion, so there are question marks over the fit for the program fasta and the language gcc. In fact many of the compiled languages have high p-values; perhaps they ran so fast that a large percentage of start-up/close-down time got included in their numbers. Something for the people running the benchmark to investigate.
Isn't it easy to get interesting numbers by building a regression model? It took me 10 minutes (OK, I spend a lot of time fitting models). After spending many hours/days gathering data, spending a little more time learning to build simple regression models is well worth the effort.