Impact of native language on variable naming

Derek Jones from The Shape of Code

When creating a variable name, to what extent are developers influenced by their native human language?

There is lots of evidence that variable names are either English words, abbreviations of English words, or some combination of these two. Source code containing a large percentage of identifiers using words from other languages does exist, but it requires effort to find; there is a widely expressed view that source should be English based (based on my experience of talking to non-native English speakers, and even the odd paper discussing the issue, e.g., Language matters).

Given that variable names can prove information that reduces the effort needed to understand code, and that most code is only ever read by the person who wrote it, developers should make the most of their expertise in using their native language.

To what extent do non-native English-speaking developers make use of their non-English native language?

I have found it very difficult to even have a discussion around this question. When I broach the subject with non-native English speakers, the response is often along the lines of “our develo0pers speak good English.” I am careful to set the scene by telling them of my interest in naming, and that I think there are benefits for developers to make use of their native language. The use of non-English languages in software development is not yet a subject that is open for discussion.

I knew that sooner or later somebody would run an experiment…

How Developers Choose Names is another interesting experiment involving Dror Feitelson (the paper rather confusingly refers to it as a survey, a post on an earlier experiment).

What makes this experiment interesting is that bilingual subjects (English and Hebrew) were used, and the questions were in English or Hebrew. The 230 subjects (some professional, some student) were given a short description and asked to provide an appropriate variable/function/data-structure name; English was used for 26 of the question, and Hebrew for the other 21 questions, and subjects answered a random subset.

What patterns of Hebrew usage are present in the variable names?

Out of 2017 answers, 14 contained Hebrew characters, i.e., not enough for statistical analysis. This does not mean that all the other variable names were only derived from English words, in some cases Hebrew words appeared via transcription using the 26 English letters. For instance, using “pinuk” for the Hebrew word that means “benefit” in English. Some variables were created from a mixture of Hebrew and English words, e.g., deservedPinuks and pinuksUsed.

Analysing this data requires someone who is fluent in Hebrew and English. I am not a fluent, or even non-fluent, Hebrew speaker. My role in this debate is encouraging others, and at last I have some interesting data to show people.

The paper spends time showing how for personal preferences result in a wide selection of names being chosen by different people for the same quantity. I cannot think of any software engineering papers that have addressed this issue for variable names, but there is lots of evidence from other fields; also see figure 7.33.

Those interested in searching source code for the impact of native-language might like to look at the names of variables appearing as operands of the bitwise and logical operators. Some English words occur much more frequently in the names of these variable, compared to variables that are operands of arithmetic operators, e.g., flag, status, and signal. I predict that non-native English-speaking developers will make use of corresponding non-English words.

Naming is hard – or is it?

Phil Nash from level of indirection

Following Peter Hilton's excellent ACCU talk, at last week's conference in Bristol, "How to name things - the hardest problem in programming", a few of us were discussing some of the points raised - and some not raised.

He had discussed identifier length without any mention of Uncle Bob's guideline, whereby the length of a variable name should be proportional to it's scope (i.e. large or global scopes need longer, descriptive, names whereas in smaller, local, scopes shorter, more concise - even single letter - names are appropriate). This seemed all the more of an omission given that he later referenced the book, Clean Code.

It wasn't that Peter disgreed with Uncle Bob (who doesn't, half the time?) that surprised me but that he didn't even mention it in passing. I thought it was fairly well known. Actually I double checked and it is not discussed fully in the book, which only says, "The length of a name should correspond to the size of its scope". This is expanded considerably in Clean Coders (video) episode 2. Also, of course, this is not really "Uncle Bob's rule". Kevlin Henney recalls that he first heard of it in the 90s and it may well have been kicking around before that. Bob calls it "The Scope Rule".

Kevlin was one of those discussing this afterwards. After initially toying with The Scope Rule in the 90s he came to consider it not particularly useful. This, too, surprised me as I had found it worked quite well for me. Or so I thought. Further discussion with Kevlin led to the conclusion that I had read more of my own interpretation into The Scope Rule than I had realised! So I started musing over exactly what my interpretation was.

A transparent reference

As it happened a concept key to clarifying matters came from another great talk at the same conference just the day before. Didier Verna's "Referential transparency is overrated". In this talk Didier discussed various ways that useful idioms in Lisp required violating referential transparency. At one point he explained how "hidden" variables may be introduced by one macro that were then used by another. This worked because the thing being referred to was named very generically - so both macros agreed on the name. He drew on the term "Anaphora" from linguistics, which is where one part of an expression - usually a pronoun - stands in for a more specialised part - such as a person's name - introduced earlier in the context. For example, just now I used the word, "he" to refer to Didier Verna. It was clear who I was talking about because his was the most recently specified name in the current context. In fact I used this anaphoric term a couple of times - and many, many, times in this article. If I had had to fully qualify "Didier Verna" every time writing would quickly become very cumbersome. Anaphora is used very frequently in natural language - usually to good effect.

Scope Creep

I believe this is key to understanding why and when shorter identifiers can be used too. When I had been talking with Kevlin it had become apparent that we had different interpretations of the word, "scope". I realised that I had subconsciously expanded the specific technical meaning to include a more general idea of "context" - including the anaphoric context.

To make this clear I might write some (C++) code like this:

	std::string s = getNextString();
	if( !s.empty() )
		std::cout << "received string: " << s << std::endl;

Many corporate, or personal, coding standards would balk at such practice! Single character identifiers? Way to obfuscate the code!

But how has it obfuscated anything? Look at it as an anaphoric entity. In this case the variable name 's' is anaphoric. We know it is "the next string" because we saw it being introduced by the function call, "getNextString()". We then use it twice on the next couple of lines. There are no other strings being introduced in the same vicinity to confuse it with, and the context in which it is used is kept small. There is no ambiguity and the full identity is revealed in the immediate vicinity.

Sustainability

But what if we add more code, or move parts of this elsewhere? Certainly code evolves over time in ways that can make things less clear if we don't change them. That's true regardless. Naming of the entities at play should always form part of your consideration when refactoring or otherwise modifying existing code. Does it make this code less "sustainable" (to reference another property that Kevlin likes to talk about)? I don't think so. In the worst case, if you don't immediately notice that a short name has become unclear because it's usage has drifted out of anaphoric range, you'll notice the next time you look at it and, momentarily, think "why is there an free variable called 's' here? What on earth is that. You'll take a moment to find it's original declaration, work out what it is, then decide to encode that in the name by renaming the variable at that point. Variable renaming is one of the safest and most ubiquituous refactorings around so I have no qualms about deferring such identity expansion to such a time as it is needed.

Why?

But what about the other side of the argument? Is there any advantage to using a short, even single character, variable name in the first place?

This is often cast as a matter of optimising for typing speed - in world where we typically read these names many many times more than we write them.

While introducing, even small, speed bumps to writing code might discourage spending more time than necessary writing code (which in turn may discourage certain refactorings) it's not really about typing performance at all - It's about readability! Consider again the linguistic definition of anaphora: substituting an, unambiguous, subsequent reference to an entity with a shorter form (e.g. a pronoun) that means the same thing. We do this all the time in natural speech and the written word. Why? Because it would sound unnatural and cumbersome to fully qualify every entity we talk about all the time!

The same applies in programming. Where it is perfectly clear from the immediate context what an identifier refers to then using greater verbosity actually increases the cognitive friction! The more unnecessary and redundant noise and ceremony we can strip away from our code the easier it will be to read, in a shorter period of time. That fact that anaphora is so common in natural language should give us a clue as to our ability to code with it's use in a natural and efficient way.

Now I've only mentally organised my thoughts around this as a result of ruminating on those two talks - and some of the offshoot discussions - but I realise this is essentially how I had interpreted The Scope Rule. Now I've worked it through when I go back and compare it with what Mr Martin actually said his version sounds like a poor proxy for the anaphoric interpretation.

So naming - good naming - is still hard. We've only just discussed one narrow aspect here. But perhaps this has made some of it that little bit easier.