Performance of Java 2D drawing operations (part 3: image opacity)

Andy Balaam from Andy Balaam's Blog

Series: operations, images, opacity

Not because I was enjoying it, but because I felt compelled, I continued my quest to understand the performance of various Java 2D drawing operations. I’m hoping to make my game Rabbit Escape faster, especially on the Raspberry Pi, so you may see another post sometime actually trying this stuff out on a Pi.

But for now, here are the results of my investigation into how different patterns of opacity in images affect rendering performance.

You can find the code here: gitlab.com/andybalaam/java-2d-performance.

Results

  • Images containing partially-opaque pixels are no slower than images containing a mix of fully-transparent and fully-opaque pixels
  • Large transparent areas in images are drawn quite quickly, but transparent pixels mixed with non-transparent are slow

Advice

  • Still avoid any transparency whenever possible
  • It’s relatively OK to use large transparent areas on images (e.g. a fixed-size animation where a character moves through the image)
  • Don’t bother restricting pixels to be either fully transparent or fully opaque – partially-opaque is fine

Opacity patterns in images

Non-transparent images drew at 76 FPS, and transparent ones dropped to 45 FPS.

I went further into investigating transparency by creating images that were:

  • All pixels 50% opacity (34 FPS)
  • Half pixels 0% opacity, half 100%, mixed up (34 FPS)
  • Double the size of the original image, but the extra area is fully transparent, and the original area is non-transparent (41 FPS)

I concluded that partial opacity costs no more than restricting pixels to be fully transparent or fully opaque, but that large contiguous areas of transparency are drawn relatively quickly compared with images containing complex patterns of transparent and opaque pixels.
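
For reference, the kinds of test images described above can be generated along these lines – a sketch rather than the actual code in the repository, with an arbitrary colour chosen for illustration:

import java.awt.image.BufferedImage

// Sketch: an ARGB test image whose alpha channel follows one of the
// opacity patterns described above (not the repository's code).
fun patternImage(width: Int, height: Int, pattern: String): BufferedImage {
    val img = BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB)
    for (y in 0 until height) {
        for (x in 0 until width) {
            val alpha = when (pattern) {
                "50pc" -> 128                                    // every pixel 50% opaque
                "0pc100pc" -> if ((x + y) % 2 == 0) 0 else 255   // mixed fully-transparent/fully-opaque
                else -> 255                                      // fully opaque
            }
            img.setRGB(x, y, (alpha shl 24) or 0x3366CC)         // alpha in the top byte, arbitrary colour below
        }
    }
    return img
}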

Numbers

Transparency and opacity

Test FPS
large nothing 90
large images20 largeimages 76
large images20 largeimages transparentimages 45
large images20 largeimages transparent50pcimages 34
large images20 largeimages transparent0pc100pcimages 34
large images20 largeimages transparentareaimages 41

Feedback please

Please do get back to me with tips about how to improve the performance of my experimental code.

Feel free to log issues, make merge requests or add comments to the blog post.

Performance of Java 2D drawing operations (part 2: image issues)

Andy Balaam from Andy Balaam's Blog

Series: operations, images

In my previous post I examined the performance of various drawing operations in Java 2D rendering. Here I look at some specifics around rendering images, with an eye to finding optimisations I can apply to my game Rabbit Escape.

You can find the code here: gitlab.com/andybalaam/java-2d-performance.

Results

  • Drawing images with transparent sections is very slow
  • Drawing one large image is slower than drawing many small images covering the same area(!)
  • Drawing images outside the screen is slower than not drawing them at all (but faster than drawing them onto a visible area)

Advice

  • Avoid transparent images where possible
  • Don’t bother pre-rendering your background tiles onto a single image
  • Don’t draw images that are off-screen

Images with transparency

All the images I used were PNG files with a transparency layer, but in most of my experiments there were no transparent pixels. When I used images with transparent pixels the frame rate was much lower, dropping from 78 to 46 FPS. So using images with transparent pixels causes a significant performance hit.

I’d be grateful if someone who knows more about it can recommend how to improve my program to reduce this impact – I suspect there may be tricks I can do around setComposite or setRenderingHint or enabling/encouraging hardware acceleration.
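
For what it’s worth, the sort of thing I have in mind looks like the sketch below – untested speculation on my part, using the standard RenderingHints constants and a GraphicsConfiguration-created image, not something I have measured:

import java.awt.Graphics2D
import java.awt.GraphicsEnvironment
import java.awt.RenderingHints
import java.awt.Transparency
import java.awt.image.BufferedImage

// Untested ideas: ask Java 2D to favour speed over quality when blending
// alpha, and create images via the GraphicsConfiguration so they have a
// chance of being managed/accelerated by the system.
fun hintForSpeed(g2d: Graphics2D) {
    g2d.setRenderingHint(
        RenderingHints.KEY_ALPHA_INTERPOLATION,
        RenderingHints.VALUE_ALPHA_INTERPOLATION_SPEED
    )
    g2d.setRenderingHint(
        RenderingHints.KEY_RENDERING,
        RenderingHints.VALUE_RENDER_SPEED
    )
}

fun compatibleTranslucentImage(width: Int, height: Int): BufferedImage =
    GraphicsEnvironment
        .getLocalGraphicsEnvironment()
        .defaultScreenDevice
        .defaultConfiguration
        .createCompatibleImage(width, height, Transparency.TRANSLUCENT)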

Composite images

I assumed that drawing a single image would be much faster than covering the same area of the screen by drawing lots of small images. In fact, the result was the opposite: drawing lots of small images was much faster than drawing a single image covering the same area.

The code for a single image is:

g2d.drawImage(
    singleLargeImage,
    10,
    10,
    null
)

and for the small images it is:

for (y in 0 until 40)
{
    for (x in 0 until 60)
    {
        g2d.drawImage(
            compositeImages[(y*20 + x) % compositeImages.size],
            10 + (20 * x),
            10 + (20 * y),
            null
        )
    }
}

The single large image was rendered at 74 FPS, whereas covering the same area using repeated copies of 100 images was rendered at 80 FPS. I ran this test several times because I found the result surprising, and it was consistent every time.

I have to assume some caching (possibly via accelerated graphics) of the small images is the explanation.

Drawing images off the side of the screen

Drawing images off the side of the screen was faster than drawing them in a visible area, but slower than not drawing them at all. I tested this by adding 10,000 to the x and y positions of the images being drawn (I also tested subtracting 10,000, with similar results). Not drawing any images ran at 93 FPS, drawing images on-screen ran at 80 FPS, and drawing them off-screen ran at 83 FPS – so drawing images off the side of the screen still takes significant time compared with not drawing them at all.

Advice: check whether images are on-screen, and avoid drawing them if not.
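
Something like the following is enough – a sketch of mine, not code from the repository, assuming the image is drawn un-scaled at (x, y):

import java.awt.Graphics2D
import java.awt.image.BufferedImage

// Skip the drawImage call when the image falls completely outside the canvas.
fun drawIfVisible(
    g2d: Graphics2D,
    image: BufferedImage,
    x: Int,
    y: Int,
    canvasWidth: Int,
    canvasHeight: Int
) {
    val onScreen = x + image.width > 0 && x < canvasWidth &&
                   y + image.height > 0 && y < canvasHeight
    if (onScreen) {
        g2d.drawImage(image, x, y, null)
    }
}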

Numbers

Transparency

Test FPS
large nothing 95
large images20 largeimages 78
large images20 largeimages transparentimages 46

Composite images

(Lots of small images covering an area, or a single larger image.)

Test FPS
large nothing 87
large largesingleimage 74
large compositeimage 80

Offscreen images

Test FPS
large nothing 93
large images20 largeimages 80
large images20 largeimages offscreenimages 83

Feedback please

Please do get back to me with tips about how to improve the performance of my experimental code.

Feel free to log issues, make merge requests or add comments to the blog post.

Performance of Java 2D drawing operations

Andy Balaam from Andy Balaam's Blog

I want to remodel the desktop UI of my game Rabbit Escape to be more convenient and nicer looking, so I took a new look at game-loop-style graphics rendering onto a canvas in a Java 2D (Swing) UI.

Specifically, how fast can it be, and what pitfalls should I avoid when I’m doing it?

Results

  • Larger windows are (much) slower
  • Resizing images on-the-fly is very slow, even if they are the same size every time
  • Drawing small images is fast, but drawing large images is slow
  • Drawing rectangles is fast
  • Drawing text is fast
  • Drawing Swing widgets in front of a canvas is fast
  • Creating fonts on-the-fly is a tiny bit slow

Code

You can find the full code (written in Kotlin) at gitlab.com/andybalaam/java-2d-performance.

Basically, we make a JFrame and a Canvas and tell them not to listen to repaints (i.e. we control their drawing).

val app = JFrame()
app.ignoreRepaint = true
val canvas = Canvas()
canvas.ignoreRepaint = true

Then we add any buttons to the JFrame, and the canvas last (so it displays behind):

app.add(button)
app.add(canvas)

Now we make the canvas double-buffered and get hold of a buffer image for it:

app.isVisible = true
canvas.createBufferStrategy(2)
val bufferStrategy = canvas.bufferStrategy
val bufferedImage = GraphicsEnvironment
    .getLocalGraphicsEnvironment()
    .defaultScreenDevice
    .defaultConfiguration
    .createCompatibleImage(config.width, config.height)

Then inside a tight loop we draw onto the buffer image:

val g2d = bufferedImage.createGraphics()
try
{
    g2d.color = backgroundColor
    g2d.fillRect(0, 0, config.width, config.height)

    ... the different drawing operations go here ...

and then swap the buffers:

    val graphics = bufferStrategy.drawGraphics
    try {
        graphics.drawImage(bufferedImage, 0, 0, null)
        if (!bufferStrategy.contentsLost()) {
            bufferStrategy.show()
        }
    } finally {
        graphics.dispose()
    }
} finally {
    g2d.dispose()
}

Results

Baseline: some rectangles

I decided to compare everything against drawing 20 rectangles at random points on the screen, since that seems like a minimal requirement for a game.

My test machine is an Intel Core 2 Duo E6550 2.33GHz with 6GB RAM and a GeForce GT 740 graphics card (I have no idea whether it is being used here – I assume not). I am running Ubuntu 18.04.1 Linux, OpenJDK Java 1.8.0_191, and Kotlin 1.3.20-release-116. (I expect the results would be identical if I were using Java rather than Kotlin.)

I ran all the tests in two window sizes: 1600×900 and 640×480. 640×480 was embarrassingly fast for all tests, but 1600×900 struggled with some of the tasks.

Drawing rectangles looks like this:

g2d.color = Color(
    rand.nextInt(256),
    rand.nextInt(256),
    rand.nextInt(256)
)
g2d.fillRect(
    rand.nextInt(config.width / 2),
    rand.nextInt(config.height / 2),
    rand.nextInt(config.width / 2),
    rand.nextInt(config.height / 2)
)

In the small window, the baseline (20 rectangles) ran at 553 FPS. In the large window it ran at 87 FPS.

I didn’t do any statistics on these numbers because I am too lazy. Feel free to do it properly and let me know the results – I will happily update the article.

Fewer rectangles

When I reduced the number of rectangles to do less drawing work, I saw small improvements in performance. In the small window, drawing 2 rectangles instead of 20 increased the frame rate from 553 to 639, but there is a lot of noise in those results, and other runs were much closer. In the large window, the same reduction improved the frame rate from 87 to 92. This is not a huge speed-up, showing that drawing rectangles is pretty fast.

Adding fixed-size images

Drawing pre-scaled images looks like this:

g2d.drawImage(
    image,
    rand.nextInt(config.width),
    rand.nextInt(config.height),
    null
)

When I added 20 small images (40×40 pixels) to be drawn in each frame, the performance was almost unchanged. In the small window, the run drawing 20 images per frame (as well as the rectangles) actually ran faster than the one without (561 FPS versus 553), suggesting the difference is negligible and I should do some statistics. In the large window, the 20 images version ran at exactly the same speed (87 FPS).

So, it looks like drawing small images costs almost nothing.

When I moved to large images (400×400 pixels), the small window slowed down from 553 to 446 FPS, and the large window slowed from 87 to 73 FPS, so larger images clearly have an impact, and we will need to limit the number and size of images to keep the frame rate acceptable.

Scaling images on the fly

You can scale an image on the fly as you draw onto a Canvas. (Spoiler: don’t do this!)

My code looks like:

val s = config.imageSize
val x1 = rand.nextInt(config.width)
val y1 = rand.nextInt(config.height)
val x2 = x1 + s
val y2 = y1 + s
g2d.drawImage(
    unscaledImage,
    x1, y1, x2, y2,
    0, 0, unscaledImageWidth, unscaledImageHeight,
    null
)

Note the 10-argument form of drawImage is being used. You can be sure you have avoided this situation if you use the 4-argument form from the previous section.

Note: the resulting image is the same size every time, and the Java documentation implies that scaled images may be cached by the system, but I saw a huge slow-down when using the 10-argument form of drawImage above.

On-the-fly scaled images slowed the small window from 446 to 67 FPS(!), and the large window from 73 to 31 FPS, meaning the exact same rendering took over twice as long.

Advice: check you are not using one of the drawImage overloads that scales images! Pre-scale them yourself (e.g. with getScaledInstance as I did here).
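
For example, pre-scaling can be done once, outside the game loop, along these lines (a sketch – the helper is mine, not from the repository):

import java.awt.Image
import java.awt.image.BufferedImage

// Scale once up front, then draw the result with the 4-argument form of
// drawImage, which does no scaling at draw time.
fun preScale(original: BufferedImage, targetWidth: Int, targetHeight: Int): BufferedImage {
    val scaled = original.getScaledInstance(targetWidth, targetHeight, Image.SCALE_SMOOTH)
    val result = BufferedImage(targetWidth, targetHeight, BufferedImage.TYPE_INT_ARGB)
    val g = result.createGraphics()
    try {
        g.drawImage(scaled, 0, 0, null)
    } finally {
        g.dispose()
    }
    return result
}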

Displaying text

Drawing text on the canvas like this:

g2d.font = Font("Courier New", Font.PLAIN, 12)
g2d.color = Color.GREEN
g2d.drawString("FPS: $fpsLastSecond", 20, 20 + i * 14)

had a similar impact to drawing small images – i.e. it only affected the performance very slightly and is generally quite fast. In fact the small window ran slightly faster with text added (581 FPS versus the 553 FPS baseline), again suggesting the difference is within the noise, and the large window went from 87 to 88 FPS.

Creating the font every time (as shown above) slowed the process a little more, so it is worth moving the font creation out of the game loop and only doing it once. The slowdown just for creating the font was 581 to 572 FPS in the small window, and 88 to 86 FPS in the large.
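
The fix is as simple as hoisting the Font construction out of the loop – a sketch (the variable name is mine):

// Created once, outside the game loop
val hudFont = Font("Courier New", Font.PLAIN, 12)

// Inside the game loop, reuse it:
g2d.font = hudFont
g2d.color = Color.GREEN
g2d.drawString("FPS: $fpsLastSecond", 20, 20 + i * 14)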

Swing widgets

By adding Button widgets to the JFrame before the Canvas, I was able to display them in front. Their rendering and focus worked as expected, and they had no impact at all on performance.

The same was true when I tried adding these widgets in front of images rendered on the canvas (instead of rectangles).

Turning everything up to 11

When I added everything I had tested all at the same time: rectangles, text with a new font every time, large unscaled images, and a large window, the frame rate dropped to 30 FPS. This is a little slow for a game already, and if we had more images to draw it could get even worse. However, when I pre-scaled the images the frame rate went up to 72 FPS, showing that Java is capable of running a game at an acceptable frame rate on my machine, so long as we are careful how we use it.

Numbers

Small window (640×480)

Test FPS
nothing 661
rectangles2 639
rectangles20 553
rectangles20 images2 538
rectangles20 images20 561
rectangles20 images20 largeimages 446
rectangles20 images20 unscaledimages 343
rectangles20 images20 largeimages unscaledimages 67
rectangles20 text2 582
rectangles20 text20 581
rectangles20 text20 newfont 572
rectangles20 buttons2 598
rectangles20 buttons20 612

Large window (1600×900)

Test FPS
large nothing 93
large rectangles2 92
large rectangles20 87
large rectangles20 images2 87
large rectangles20 images20 87
large rectangles20 images20 largeimages 73
large rectangles20 images20 unscaledimages 82
large rectangles20 images20 largeimages unscaledimages 31
large rectangles20 text2 89
large rectangles20 text20 88
large rectangles20 text20 newfont 86
large rectangles20 buttons2 88
large rectangles20 buttons20 87
large images20 buttons20 largeimages 74
large rectangles20 images20 text20 buttons20 largeimages newfont 72
large rectangles20 images20 text20 buttons20 largeimages unscaledimages newfont 30

Feedback please

Please do get back to me with tips about how to improve the performance of my experimental code.

Feel free to log issues, make merge requests or add comments to the blog post.

Impact of group size and practice on manual performance

Derek Jones from The Shape of Code

How performance varies with group size is an interesting question that is still an unresearched area of software engineering. The impact of learning is also an interesting question and there has been some software engineering research in this area.

I recently read a very interesting study involving both group size and learning, and Jaakko Peltokorpi kindly sent me a copy of the data.

That is the good news; the not so good news is that the experiment was not about software engineering, but the manual assembly of a contraption of the experimenters' devising. Still, this experiment is an example of the impact of group size and learning (through repeating the task).

Subjects worked in groups of one to four people and repeated the task four times. The time taken to assemble a bespoke, floor-standing rack with some odd-looking connections between components was measured (the image in the paper shows something that might function as a floor-standing book-case, if shelves were added, apart from some component connections getting in the way).

The following equation is a very good fit to the data (code+data). There is theory explaining why log(repetitions) applies, but the division by group-size was found by suck-it-and-see (in another post I found that time spent planning increased with team size).

time = 0.16 + 0.53/groupSize - log(repetitions)*(0.1 + 0.22/groupSize)

There is a strong repetition/group-size interaction. As the group size increases, repetition has less of an impact on improving performance.
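
As a quick sanity check of the fitted model – a sketch of mine, which assumes the log is the natural logarithm (an assumption; the fit above does not say):

import kotlin.math.ln

// Predicted assembly time (hours) from the fitted equation above.
fun predictedTime(groupSize: Int, repetitions: Int): Double =
    0.16 + 0.53 / groupSize - ln(repetitions.toDouble()) * (0.1 + 0.22 / groupSize)

fun main() {
    println(predictedTime(1, 1))   // ~0.69 hours: one person, first attempt
    println(predictedTime(1, 4))   // ~0.25 hours: one person, fourth attempt
}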

The following plot shows one way of looking at the data (larger groups take less time, but the difference declines with practice):

Time taken (hours) for various group sizes, by repetition.

and here is another (a group of two is not twice as fast as a group of one; with practice smaller groups are converging on the performance of larger groups):

Time taken (hours) for various repetitions, by group size.

Would the same kind of equation fit the results from solving a software engineering task? Hopefully somebody will run an experiment to find out :-)

Publishing information on project progress: will it impact delivery?

Derek Jones from The Shape of Code

Numbers for delivery date and cost estimates, for a software project, depend on who you ask (the same is probably true for other kinds of projects). The people actually doing the work are likely to have the most accurate information, but their estimates can still be wildly optimistic. The managers of the people doing the work have to plan (i.e., make worst/best case estimates) and deal with people outside the team (i.e., sell the project to those paying for it); planning requires knowledge of where things are and where they need to be, while selling requires being flexible with numbers.

A few weeks ago I was at a hackathon organized by the people behind the Project Data and Analytics meetup. The organizers (Martin Paver & co.) had obtained some very interesting project related data sets. I worked on the Australian ICT dashboard data.

The Australian ICT dashboard data was courtesy of the Queensland state government, which has a publicly available dashboard listing digital project expenditure; the Victorian state government also has a dashboard listing ICT expenditure. James Smith has been collecting this data on a monthly basis.

What information might meaningfully be extracted from monthly estimates of project delivery dates and costs?

If you were running one of these projects, and had to provide monthly figures, what strategy would you use to select the numbers? Obviously keep quiet about internal changes for as long as possible (today’s reduction can be used to offset a later increase, or vice versa). If the client requests changes which impact date/cost, then obviously update the numbers immediately; the answer to the question about why the numbers changed is that, “we are responding to client requests” (i.e., we would otherwise still be on track to meet the original end-points).

What is the intended purpose of publishing this information? Is it simply a case of the public getting fed up with overruns, with the publication of monthly numbers being seen as a solution?

What impact could monthly publication have? Will clients think twice before requesting an enhancement, fearing public push back? Will companies doing the work make more reliable estimates, or work harder?

Project delivery dates/costs change because new functionality/work-to-do is discovered, because the appropriate staff could not be hired, and because of other assorted unknown knowns and unknowns.

Who is looking at this data (apart from half a dozen people at a hackathon on the other side of the world)?

Data on specific projects can only be interpreted in the context of that project. There is some interesting research to be done on the impact of public availability on client and vendor reporting behavior.

Will publication have an impact on performance? One way to get some idea is to run an A/B experiment. Some projects have their data made public, others don’t. Wait a few years, and compare project performance for the two publication regimes.

Time taken to compile a source file

Derek Jones from The Shape of Code

How long will it take to compile a source file?

When computers were a lot slower than they are today, this question was of general interest. Job scheduling is more effective when reliable runtime estimates are available, and developers want to know if there is enough time to get a coffee before the compile finishes.

An embarrassing fact about compile time performance used to be that a large percentage of compile time was spent doing lexical analysis [“The cost of lexical analysis”, I cannot find an online copy]. Why was this embarrassing? Compiler writers like to boast about all the fancy optimizations their compiler does; but doing fancy stuff consumes lots of resources, so why were compilers spending so much of their time doing simple things like lexical analysis? The reality was that fancy compiler optimizations were not commercially viable until developer computers contained tens of megabytes of memory, i.e., very few pre-1990 compilers did any real optimization (people are still fussing over lexer performance).

An analysis of the data in Captain Dennis Miller’s Masters thesis (late Rome period) finds that compile time is proportional to the square root of the number of tokens in the source (code+data); more complicated models are a slightly better fit. Where did square root come from? I expected a linear relationship, but would be willing to go with log. The measurements are from Ada compilers in the mid 1980s. I know several people who worked on Ada compilers during that time, and they were implementing the latest fancy optimizations (Ada was going to be the next big thing and the venture capital was flowing; big companies with big computers were going to be paying lots of money to use Ada, but then microcomputers came along). I think that square root is driven by OS resource limitations: the compilers are using lots of memory, and a noticeable amount of time is spent swapping.

So computers got a lot faster and people lost interest in estimates of how long it would take to compile individual files. I have not seen any interest in predicting how long it would take to compile whole projects (just complaints about how long it takes). There has been some work on progress indicators, updated as compilation progresses, which is a step in the right direction. Perhaps somebody has recorded compile time information and thrown machine learning at it; I usually ignore machine learning papers applied to software engineering and perhaps I have missed something. Pointers to project compile time prediction work welcome.

Then along came just-in-time compilation. Now people want to estimate how long it will take to generate machine code from some intermediate form, that is being interpreted.

The plot below (thanks to Rafael Auler for kindly supplying the data from his paper) shows the time taken to generate code from functions containing a given number of LLVM instructions (an intermediate code), at optimization level O3. The red line is a regression fit to one of the ‘arms’ and shows constant time for less than 100’ish instructions and then a linear relationship. I have no idea why the time is roughly constant for a large number of functions.

Time taken to convert functions containing a given number of LLVM instructions to machine code

There is a lot of variation for functions containing the same number of instructions. This is to be expected when lots of different optimizations are being tried; sometimes a function will contain lots of the kind of code that a particular optimization spends a lot of time processing, and sometimes the code will not contain anything interesting (i.e., no optimizations are found).

Main memory: the crucial component that vendors don’t mention

Derek Jones from The Shape of Code

CPU performance hogs the limelight when people discuss the year-on-year increases in computing power that used to occur.

This focus on cpu performance was/is driven by marketing: the people with the money either don’t want customers thinking about the performance impact of main memory size or speed, or want them to treat the processor as the most important component of a computer. Vendors want processor performance to drive customer purchase decisions.

Hardware manufacturers used to entice new customers with low cost machines containing minimal memory. Once a customer started to use their shiny new computer, they found that it did save them lots of time and money, but also that they needed more memory (which could only be bought from the manufacturer, and was not cheap).

The plot below shows the prices IBM charged for System 360s, in 1966. Anti-trust investigations uncover all kinds of interesting data, like selling low-spec equipment at a loss to entice customers and make life difficult for competitors (code+data for all plots).

Profit margin on IBM 360s sold with various memory sizes

The plot below (data from the 19 Aug 1985 issue of ComputerWorld) shows how the price of computers increased as the minimum amount of memory they supported increased.

Yes, in 1985 top end computers came with over 50M of memory; but most customers thought themselves lucky if they had a few megabytes.

If the processor is slow, it just takes longer for programs to run. If the computer does not have enough memory, programs cannot run. For most applications memory requirements are addressed first, followed by processor performance; memory is the number one issue. The optimizations that commercial compilers could perform were limited by the memory capacity of developer machines.

List price of computers, in 1985, supporting the given minimum amount of memory

Intel’s main line of business used to be selling memory chips, but these chips became commodity items as more companies entered the market; Intel bet the farm on selling processors and the rest is history. As a seller of a unique product it was/is in Intel’s interest to spend lots of money on marketing the benefits of processor performance; sellers of commodity items (such as memory chips) don’t have nearly as much to gain from generic product marketing, because customers may choose to buy from other sellers (in such markets sellers have to concentrate on marketing themselves).

Memory capacity/speed and cpu speed are two aspects of system performance; they need to be balanced to meet the requirements of customer applications. The plot below shows the SPEC cpu integer performance of 4,332 systems running at various clock rates; the colors denote the different peak memory transfer rates of the memory chips in these systems (code+data).

SPEC cpu integer performance vs. cpu clock rate

These days (and perhaps in the past, I don’t have any data), memory performance is a much better predictor of system performance, but vendors don’t have an incentive to market this fact.

Adding a concurrency limit to Python’s asyncio.as_completed

Andy Balaam from Andy Balaam's Blog

Series: asyncio basics, large numbers in parallel, parallel HTTP requests, adding to stdlib

In the previous post I demonstrated how the limited_as_completed method allows us to run a very large number of tasks using concurrency, but limiting the number of concurrent tasks to a sensible limit to ensure we don’t exhaust resources like memory or operating system file handles.

I think this could be a useful addition to the Python standard library, so I have been working on a modification to the current asyncio.as_completed method. My work so far is here: limited-as_completed.

I ran similar tests to the ones I ran for the last blog post with this code to validate that the modified standard library version achieves the same goals as before.

I used an identical copy of timed from the previous post and updated versions of the other files because I was using a much newer version of aiohttp along with the custom-built python I was running.

server looked like:

#!/usr/bin/env python3

from aiohttp import web
import asyncio
import random

async def handle(request):
    await asyncio.sleep(random.randint(0, 3))
    return web.Response(text="Hello, World!")

app = web.Application()
app.router.add_get('/{name}', handle)

web.run_app(app)

client-async-sem needed me to add a custom TCPConnector to avoid a new limit on the number of concurrent connections that was added to aiohttp in version 2.0. I also needed to move the ClientSession usage inside a coroutine to avoid a warning:

#!/usr/bin/env python3

from aiohttp import ClientSession, TCPConnector
import asyncio
import sys

limit = 1000

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)

async def run(r):
    with ClientSession(connector=TCPConnector(limit=limit)) as session:
        url = "http://localhost:8080/{}"
        tasks = []
        # create instance of Semaphore
        sem = asyncio.Semaphore(limit)
        for i in range(r):
            # pass Semaphore and session to every GET request
            task = asyncio.ensure_future(
                bound_fetch(sem, url.format(i), session))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.ensure_future(run(int(sys.argv[1]))))

My new code that uses my proposed extension to as_completed looked like:

#!/usr/bin/env python3

from aiohttp import ClientSession, TCPConnector
import asyncio
import sys

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

limit = 1000

async def print_when_done():
    with ClientSession(connector=TCPConnector(limit=limit)) as session:
        tasks = (fetch(url.format(i), session) for i in range(r))
        for res in asyncio.as_completed(tasks, limit=limit):
            await res

r = int(sys.argv[1])
url = "http://localhost:8080/{}"
loop = asyncio.get_event_loop()
loop.run_until_complete(print_when_done())
loop.close()

and with these, we get similar behaviour to the previous post:

$ ./timed ./client-async-sem 10000
Memory usage: 73640KB	Time: 19.18 seconds
$ ./timed ./client-async-stdlib 10000
Memory usage: 49332KB	Time: 18.97 seconds

So the implementation I plan to submit to the Python standard library appears to work well. In fact, I think it is better than the one I presented in the previous post, because it uses on_complete callbacks to notice when futures have completed, which reduces the busy-looping we were doing to check for and yield finished tasks.

The Python issue is bpo-30782 and the pull request is #2424.

Note: at first glance, it looks like the aiohttp.ClientSession’s limit on the number of connections (introduced in version 1.0 and then updated in version 2.0) gives us what we want without any of this extra code, but in fact it only limits the number of connections, not the number of futures we are creating, so it has the same problem of unbounded memory use as the semaphore-based implementation.

Making 100 million requests with Python aiohttp

Andy Balaam from Andy Balaam's Blog

Series: asyncio basics, large numbers in parallel, parallel HTTP requests, adding to stdlib

I’ve been working on how to make a very large number of HTTP requests using Python’s asyncio and aiohttp.

Paweł Miech’s post Making 1 million requests with python-aiohttp taught me how to think about this, and got us a long way, with 1 million requests running in a reasonable time, but I need to go further.

Paweł’s approach limits the number of requests that are in progress, but it uses an unbounded amount of memory to hold the futures that it wants to execute.

We can avoid using unbounded memory by using the limited_as_completed function I outlined in my previous post.

Setup

Server

We have a server program “server”:

(Note it differs from Paweł’s version because I am using an older version of aiohttp which has fewer convenient features.)

#!/usr/bin/env python3.5

from aiohttp import web
import asyncio
import random

async def handle(request):
    await asyncio.sleep(random.randint(0, 3))
    return web.Response(text="Hello, World!")

async def init():
    app = web.Application()
    app.router.add_route('GET', '/{name}', handle)
    return await loop.create_server(
        app.make_handler(), '127.0.0.1', 8080)

loop = asyncio.get_event_loop()
loop.run_until_complete(init())
loop.run_forever()

This just responds “Hello, World!” to every request it receives, but after an artificial delay of 0-3 seconds.

Synchronous client

As a baseline, we have a synchronous client “client-sync”:

#!/usr/bin/env python3.5

import requests
import sys

url = "http://localhost:8080/{}"
for i in range(int(sys.argv[1])):
    requests.get(url.format(i)).text

This waits for each request to complete before making the next one. Like the other clients below, it takes the number of requests to make as a command-line argument.

Async client using semaphores

Copied mostly verbatim from Making 1 million requests with python-aiohttp we have an async client “client-async-sem” that uses a semaphore to restrict the number of requests that are in progress at any time to 1000:

#!/usr/bin/env python3.5

from aiohttp import ClientSession
import asyncio
import sys

limit = 1000

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)

async def run(session, r):
    url = "http://localhost:8080/{}"
    tasks = []
    # create instance of Semaphore
    sem = asyncio.Semaphore(limit)
    for i in range(r):
        # pass Semaphore and session to every GET request
        task = asyncio.ensure_future(bound_fetch(sem, url.format(i), session))
        tasks.append(task)
    responses = asyncio.gather(*tasks)
    await responses

loop = asyncio.get_event_loop()
with ClientSession() as session:
    loop.run_until_complete(asyncio.ensure_future(run(session, int(sys.argv[1]))))

Async client using limited_as_completed

The new client I am presenting here uses limited_as_completed from the previous post. This means it can make a generator that provides the futures to wait for as they are needed, instead of making them all at the beginning.

It is called “client-async-as-completed”:

#!/usr/bin/env python3.5

from aiohttp import ClientSession
import asyncio
from itertools import islice
import sys

def limited_as_completed(coros, limit):
    futures = [
        asyncio.ensure_future(c)
        for c in islice(coros, 0, limit)
    ]
    async def first_to_finish():
        while True:
            await asyncio.sleep(0)
            for f in futures:
                if f.done():
                    futures.remove(f)
                    try:
                        newf = next(coros)
                        futures.append(
                            asyncio.ensure_future(newf))
                    except StopIteration as e:
                        pass
                    return f.result()
    while len(futures) > 0:
        yield first_to_finish()

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

limit = 1000

async def print_when_done(tasks):
    for res in limited_as_completed(tasks, limit):
        await res

r = int(sys.argv[1])
url = "http://localhost:8080/{}"
loop = asyncio.get_event_loop()
with ClientSession() as session:
    coros = (fetch(url.format(i), session) for i in range(r))
    loop.run_until_complete(print_when_done(coros))
loop.close()

Again, this limits the number of requests to 1000.

Test setup

Finally, we have a test runner script called “timed”:

#!/usr/bin/env bash

./server &
sleep 1 # Wait for server to start

/usr/bin/time --format "Memory usage: %MKB\tTime: %e seconds" "$@"

# %e Elapsed real (wall clock) time used by the process, in seconds.
# %M Maximum resident set size of the process in Kilobytes.

kill %1

This runs each process, ensuring the server is restarted each time it runs, and prints out how long it took to run, and how much memory it used.

Results

When making only 10 requests, the async clients worked faster because they launched all the requests simultaneously and only had to wait for the longest one (3 seconds). The memory usage of all three clients was fine:

$ ./timed ./client-sync 10
Memory usage: 20548KB	Time: 15.16 seconds
$ ./timed ./client-async-sem 10
Memory usage: 24996KB	Time: 3.13 seconds
$ ./timed ./client-async-as-completed 10
Memory usage: 23176KB	Time: 3.13 seconds

When making 100 requests, the synchronous client was very slow, but all three clients worked eventually:

$ ./timed ./client-sync 100
Memory usage: 20528KB	Time: 156.63 seconds
$ ./timed ./client-async-sem 100
Memory usage: 24980KB	Time: 3.21 seconds
$ ./timed ./client-async-as-completed 100
Memory usage: 24904KB	Time: 3.21 seconds

At this point let’s agree that life is too short to wait for the synchronous client.

When making 10000 requests, both async clients worked quite quickly, and both had increased memory usage, but the semaphore-based one used almost twice as much memory as the limited_as_completed version:

$ ./timed ./client-async-sem 10000
Memory usage: 77912KB	Time: 18.10 seconds
$ ./timed ./client-async-as-completed 10000
Memory usage: 46780KB	Time: 17.86 seconds

For 1 million requests, the semaphore-based client took 25 minutes on my (32GB RAM) machine. It only used about 10% of my CPU, and it used a lot of memory (over 3GB):

$ ./timed ./client-async-sem 1000000
Memory usage: 3815076KB	Time: 1544.04 seconds

Note: Paweł’s version only took 9 minutes on his laptop and used all his CPU, so I wonder whether I have made a mistake somewhere, or whether my version of Python (3.5.2) is not as good as a later one.

The limited_as_completed version ran in a similar amount of time but used 100% of my CPU, and used a much smaller amount of memory (162MB):

$ ./timed ./client-async-as-completed 1000000
Memory usage: 162168KB	Time: 1505.75 seconds

Now let’s try 100 million requests. The semaphore-based version lasted 10 hours before it was killed by Linux’s OOM Killer, but it didn’t manage to make any requests in this time, because it creates all its futures before it starts making requests:

$ ./timed ./client-async-sem 100000000
Command terminated by signal 9

I left the limited_as_completed version over the weekend and it managed to succeed eventually:

$ ./timed ./client-async-as-completed 100000000
Memory usage: 294304KB	Time: 150213.15 seconds

So its memory usage was still very bounded, and it managed to do about 665 requests/second over an extended period, which is almost identical to the throughput of the previous cases.

Conclusion

Making a million requests is usually enough, but when we really need to do a lot of work while keeping our memory usage bounded, it looks like an approach like limited_as_completed is a good way to go. I also think it’s slightly easier to understand.

In the next post I describe my attempt to get something like this added to the Python standard library.

Python 3 – large numbers of tasks with limited concurrency

Andy Balaam from Andy Balaam's Blog

Series: asyncio basics, large numbers in parallel, parallel HTTP requests, adding to stdlib

I am interested in running large numbers of tasks in parallel, so I need something like asyncio.as_completed, but taking an iterable instead of a list, and with a limited number of tasks running concurrently. First, let’s try to build something pretty much equivalent to asyncio.as_completed. Here is my attempt, but I’d welcome feedback from readers who know better:

# Note this is not a coroutine - it returns
# an iterator - but it crucially depends on
# work being done inside the coroutines it
# yields - those coroutines empty out the
# list of futures it holds, and it will not
# end until that list is empty.
def my_as_completed(coros):

    # Start all the tasks
    futures = [asyncio.ensure_future(c) for c in coros]

    # A coroutine that waits for one of the
    # futures to finish and then returns
    # its result.
    async def first_to_finish():

        # Wait forever - we could add a
        # timeout here instead.
        while True:

            # Give up control to the scheduler
            # - otherwise we will spin here
            # forever!
            await asyncio.sleep(0)

            # Return anything that has finished
            for f in futures:
                if f.done():
                    futures.remove(f)
                    return f.result()

    # Keep yielding a waiting coroutine
    # until all the futures have finished.
    while len(futures) > 0:
        yield first_to_finish()

The above can be substituted for asyncio.as_completed in the code that uses it in the first article, and it seems to work. It also makes a reasonable amount of sense to me, so it may be correct, but I’d welcome comments and corrections.

my_as_completed above accepts an iterable and returns a generator producing results, but inside it starts all tasks concurrently, and stores all the futures in a list. To handle bigger lists we will need to do better, by limiting the number of running tasks to a sensible number.

Let’s start with a test program:

import asyncio
async def mycoro(number):
    print("Starting %d" % number)
    await asyncio.sleep(1.0 / number)
    print("Finishing %d" % number)
    return str(number)

async def print_when_done(tasks):
    for res in asyncio.as_completed(tasks):
        print("Result %s" % await res)

coros = [mycoro(i) for i in range(1, 101)]

loop = asyncio.get_event_loop()
loop.run_until_complete(print_when_done(coros))
loop.close()

This uses asyncio.as_completed to run 100 tasks and, because I adjusted the asyncio.sleep command to wait longer for earlier tasks, it prints something like this:

$ time python3 python-async.py
Starting 47
Starting 93
Starting 48
...
Finishing 93
Finishing 94
Finishing 95
...
Result 93
Result 94
Result 95
...
Finishing 46
Finishing 45
Finishing 42
...
Finishing 2
Result 2
Finishing 1
Result 1

real    0m1.590s
user    0m0.600s
sys 0m0.072s

So all 100 tasks were completed in 1.5 seconds, indicating that they really were run in parallel, but all 100 were allowed to run at the same time, with no limit.

We can adjust the test program to run using our customised my_as_completed function, and pass in an iterable of coroutines instead of a list by changing the last part of the program to look like this:

async def print_when_done(tasks):
    for res in my_as_completed(tasks):
        print("Result %s" % await res)
coros = (mycoro(i) for i in range(1, 101))
loop = asyncio.get_event_loop()
loop.run_until_complete(print_when_done(coros))
loop.close()

But we get similar output to last time, with all tasks running concurrently.

To limit the number of concurrent tasks, we limit the size of the futures list, and add more as needed:

from itertools import islice
def limited_as_completed(coros, limit):
    futures = [
        asyncio.ensure_future(c)
        for c in islice(coros, 0, limit)
    ]
    async def first_to_finish():
        while True:
            await asyncio.sleep(0)
            for f in futures:
                if f.done():
                    futures.remove(f)
                    try:
                        newf = next(coros)
                        futures.append(
                            asyncio.ensure_future(newf))
                    except StopIteration as e:
                        pass
                    return f.result()
    while len(futures) > 0:
        yield first_to_finish()

We start limit tasks at first, and whenever one ends, we ask for the next coroutine in coros and set it running. This keeps the number of running tasks at or below limit until we start running out of input coroutines (when next throws and we don’t add anything to futures), then futures starts emptying until we eventually stop yielding coroutine objects.

I thought this function might be useful to others, so I started a little repo over here and added it: asyncioplus/limited_as_completed.py. Please provide merge requests and log issues to improve it – maybe it should be part of standard Python?

When we run the same example program, but call limited_as_completed instead of the other versions:

async def print_when_done(tasks):
    for res in limited_as_completed(tasks, 10):
        print("Result %s" % await res)
coros = (mycoro(i) for i in range(1, 101))
loop = asyncio.get_event_loop()
loop.run_until_complete(print_when_done(coros))
loop.close()

We see output like this:

$ time python3 python-async.py
Starting 1
Starting 2
...
Starting 9
Starting 10
Finishing 10
Result 10
Starting 11
...
Finishing 100
Result 100
Finishing 1
Result 1

real	0m1.535s
user	0m1.436s
sys	0m0.084s

So we can see that the tasks are still running concurrently, but this time the number of concurrent tasks is limited to 10.

See also

To achieve a similar result using semaphores, see Python asyncio.semaphore in async-await function and Making 1 million requests with python-aiohttp.

It feels like limited_as_completed is more re-usable as an approach but I’d love to hear others’ thoughts on this. E.g. could/should I use a semaphore to implement limited_as_completed instead of manually holding a queue?