Sunday, 7 November 2010

Myths and realities of for..in

Probably the second most enduring myth about JavaScript is that the for..in statement loops through the indexes of an array. It doesn't, and if you write code that assumes it does, then even if the code doesn't break in its nice cozy nest on your computer, it's likely to as soon as you expose it to the complexities of the outside world.

In this article, I'll explain the myth, the reality of what for..in actually does, and discuss various ways to loop through the contents of an array correctly (and efficiently).

The Myth

According to the myth, you can loop through the contents of an array like this (let's assume we have a display function that writes a line of text somewhere):

var stuff, index;
stuff = ['a', 'b', 'c'];
for (index in stuff) { // <== WRONG
    display("stuff[" + index + "] = " + stuff[index]);
}

And in a limited environment, that is indeed likely to show this output:

stuff[0] = a
stuff[1] = b
stuff[2] = c

But you can't rely on that, doing so will come back and bite you.

The Reality

for..in loops through the enumerable property names of an object, not the indexes of an array. This is much more powerful than something that just loops through array indexes. Here's an example:

var person, name;
person = {
    fname: "Joe",
    lname: "Bloggs",
    age:   47
};
for (name in person) {
    display("person[" + name + "] = " + person[name]);
}

There we have an object, person, which has three properties: fname, lname, and age. The loop displays each of these properties and its value, for instance:

person[lname] = Bloggs
person[fname] = Joe
person[age] = 47

Note that the properties are not necessarily listed in any particular order (they're not alphabetical, they're not in order of when or how they were added to the object, or in any other order you can rely on; it completely varies depending on the JavaScript implementation).

Now you may be thinking "Well, sure, so it does something different with objects than it does with arrays, so what?" But it doesn't. This is an important point and so I'm going to emphasize it:

Arrays in JavaScript are just objects (with one special feature), and array indexes are just property names. In fact, the name "array" is a bit misleading, because JavaScript arrays are not like arrays in most other languages. For one thing, JavaScript arrays are not contiguous blocks of memory. For another, the indexes into them are not offsets. In fact, array "indexes" aren't even numbers, they're strings. The only reason we get away with things like stuff[0] is that the 0 gets automatically converted into "0" for us. (Don't believe me? Read Section 11.2.1 ["Property Accessors"] of the spec.) So we don't really have array "indexes" at all, we just have property names that consist entirely of the digits 0-9 (without extraneous leading zeroes). But it's convenient to call those property names "indexes."

(Update: Here I'm talking about JavaScript's standard Array type. Many environments now also support new typed arrays, which are different.)

The one special feature of arrays is, of course, their length property, which behaves in a way to make arrays seem more like arrays in other languages (I won't go into the details here, not that they're complex; check out Section 15.4 ["Array Objects"] of the spec).

"Hey wait!" you say, "Speaking of length, if for..in really loops through the property names of the object, why don't we see length in the list?!" Good question. It's because for..in loops through the enumerable property names of an object, and length is not an enumerable property. Arrays (and other objects) are defined with lots of non-enumerable properties. (We'll come back to that in a moment.)

For now, let's consider what happens if we add another (enumerable) property to the array:

var stuff, index;
stuff = ['a', 'b', 'c'];
stuff.name = "Foo";
for (index in stuff) {
    display("stuff[" + index + "] = " + stuff[index]);
}

That's a perfectly legitimate and reasonable thing to do, we've added a name property to the array. Now what gets shown? Right! Four things:

stuff[name] = Foo
stuff[0] = a
stuff[1] = b
stuff[2] = c

Because for..in loops through all of the property names of the object, not just the "indexes."

And this is one of the places we get into differences in the order of property names. Some implementations will display the above, with the name property followed by the numeric properties; other implementations will do it the other way around, with name at the end. In fact, in theory, name could be in the middle. In fact, in theory even the numeric indexes might not be done in order (there's nothing in the spec saying that they have to be), but in practice I've never seen an implementation that did the numeric indexes out of order. Regardless, it's not guaranteeed.

Okay, but as long as we don't put other properties on our array, we can use for..in as though it did just the indexes, right? Wrong. That's where we get into the complexities of the real world, because for..in doesn't list only the object's own properties, it lists all of the properties of its prototype, and its prototype's prototype, etc. So that means not only not adding anything to your specific array instance, but not adding anything to Array.prototype either — and it turns out that it's really handy to add to Array.prototype sometimes, handy enough that lots of libraries, plug-ins, and other bits of code you'll find around do it.

Consider the Array#indexOf method. This handy method finds an element in an array and returns its index. But although some implementations have had it for years, others haven't, and it was only added to the specification as of ECMAScript 5th edition (e.g., at the end of 2009). So there are still implementations that don't have it.

But what if you want to write code that assumes it's there? You could write your own function in a procedural programming way and use that instead, passing the array and the target element in as parameters. Your function could then call the indexOf method if it's there, or do a manual search otherwise. But that not only introduces overhead (an extra function call), it also isn't very object-oriented — and JavaScript is object-oriented down to its bones. So the usual solution (and in fact the one suggested by Mozilla) is to just add it if it's missing:

if (!Array.prototype.indexOf) {
    Array.prototype.indexOf = function(searchElement, index) {
        // ...
    };
}

Now you can use indexOf on any array instance, since the instances inherit the function from the prototype. Augmenting types like this is very common in JavaScript circles.

But the thing is, it creates an enumerable property, and so if we do that, then even our first code snippet at the top starts failing:

var stuff, index;
stuff = ['a', 'b', 'c'];
for (index in stuff) { // <== WRONG
    display("stuff[" + index + "] = " + stuff[index]);
}

Outputs:

stuff[0] = a
stuff[1] = b
stuff[2] = c
stuff[indexOf] = function...

Note that the function (which is, after all, just another property like any other property) now shows up in our loop.

You may be wondering why we don't see all the other functions from the Array prototype in our loops (splice, join, etc.). The answer (you guessed it!) is that they're all defined by the spec as non-enumerable properties.

So can we add our function as a non-enumerable property? As of this writing, no, but it probably won't be long before we can. The new ECMAScript 5th edition specification defines a way for us to add properties to objects with the "enumerable" flag set to false, so they don't show up in for..in loops. But as of this writing, the mechanism has only been implemented and deployed in Google's V8 JavaScript engine (which is used in their Chromium and Chrome browsers). While the other implementers aren't far behind (both Mozilla and Microsoft have it in betas of their next releases), it's not there yet. And of course, once the mechanism is widely implemented, then everyone doing it the old way has to change their code to do it the new way.

So in short: Non-enumerable properties are not going to address this issue for you any time soon. And besides, it's not really an issue. Looping over property names (not indexes!) is what for..in is for!

Now someone out there is probably thinking "But hey, this doesn't only apply to arrays! What if someone adds something to Object.prototype, won't that show up in my for..in loops, even on just normal objects?" Yes it will, but fortunately, a de-facto standard has arisen in the JavaScript community: It's okay to augment the prototypes of specific types, but not okay to augment the Object prototype. Those few projects that have done so have been swiftly beaten into submission.

Options for Looping through Array Indexes

So okay, for..in is really cool and helpful for other things, but if we can't use it to loop over the indexes of an array, what should we do instead?

There are several options:

1. Use a boring old-fashioned loop:

Sometimes, the old ways are the best:

// Boring old fashioned loop (a classic!)
var stuff, index;
stuff = ['a', 'b', 'c'];
for (index = 0; index < stuff.length; ++index) {
    display("stuff[" + index + "] = " + stuff[index]);
}

Perfectly clear to anyone who's ever programmed in any language syntactically derived from B (so C, C++, D, C#, Java, JavaScript, PHP, and about 30 others). Straight-forward.

Now of course, there are some variations on that. For when the length of the array won't change, there's the cache-the-length variant:

// Boring old fashioned loop, caching length of unchanging array
var stuff, index, length;
stuff = ['a', 'b', 'c'];
for (index = 0, length = stuff.length; index < length; ++index) {
    display("stuff[" + index + "] = " + stuff[index]);
}

Or for when order doesn't matter, there's the count-backward version to take advantage of any "greater-than-or-equal-to-zero" optimization that the language implementation may have:

// Boring old fashioned loop, counting backward
var stuff, index;
stuff = ['a', 'b', 'c'];
for (index = stuff.length - 1; index >= 0; --index) {
    display("stuff[" + index + "] = " + stuff[index]);
}

Before anyone gets into any performance wars over the above, though, some observations:

  1. Unless you actually see a performance problem, it doesn't matter what you use, use what makes sense in relation to the code you're writing. (Don't optimize prematurely.)
  2. Which one you use won't matter from a performance perspective unless the array is huge, in which case you probably have other things to worry about first.
  3. JavaScript is not C/C++/C#/Java/whatever. The source-level optimizations that make sense in C/C++/C#/Java/whatever do not necessarily make sense in JavaScript. Don't assume they do.
  4. (This is a biggie.) The optimizations that improve performance on one implementation (say, in IE) may make performance worse in other implementations (say, in Firefox).

2. Using for..in — Correctly

But the old ways aren't always best. By their very nature, JavaScript's arrays are sparse — that is, an array can have a length of 10,000 but only have two entries in it:

stuff = [];
stuff[0] = "zero";
stuff[9999] = "nine thousand nine hundred and ninety-nine";
display(stuff.length); // shows 10000

That means that if you use a boring old-fashioned loop, you may work a lot harder than you have to, because you'll execute the body of your loop whether the array has an entry there or not:

var stuff, index;
stuff = [];
stuff[0] = "zero";
stuff[9999] = "nine thousand nine hundred and ninety-nine";
stuff.name = "foo";
for (index = 0; index < stuff.length; ++index) {
    display("stuff[" + index + "] = " + stuff[index]);
}

Although that loop correctly avoids outputting the name property, it will display 10,000 lines of output, with 9,998 of them ending with "= undefined" because the array doesn't have an entry there. Now, maybe that's what you want, maybe not (depends on what the array is for).

When you don't want to do that, you can use for..in — but use it correctly:

var stuff, index;
stuff = [];
stuff[0] = "zero";
stuff[9999] = "nine thousand nine hundred and ninety-nine";
stuff.name = "foo";
for (index in stuff) {
    if (stuff.hasOwnProperty(index)  &&    // These are explained
        /^0$|^[1-9]\d*$/.test(index) &&    // and then hidden
        index <= 4294967294                // away below
       ) {
        display("stuff[" + index + "] = " + stuff[index]);
    }
}

Now we only loop three times (once for each property, the indexes and name), and we only output two lines, because the if statement's condition only evaluates true for the kinds of property names we call "array indexes."

So what's that condition doing? Two things:

  1. It uses the hasOwnProperty function that's built into all objects to tell us whether the property is on the object itself (returns true), rather than being inherited from the object's prototype (returns false). This deals with any enumerable properties that might have been added to Array.prototype.
  2. Second, it determines if the property name looks like an "array index" as defined in the specification: The name is (by definition) a string, and we use a regular expression to prove that it's in the correct standardized base-10 form (either "0" on its own, or a non-zero digit followed by zero or more digits), then we check that it's in range (0 <= index <= 2^32-2 [which is 4,294,967,294]). If the property name passes those tests, it's an array index. You may be wondering about that magic number, 2^32 - 2. Why that number? It's because the spec says that the length property will be in the range 0 to 2^32 - 1 (inclusive), and as length is one higher than the highest array index, the highest array index is 2^32 - 2. (17 July 2013: Many thanks to RobG for pointing out that my test in earlier versions of this article wasn't quite right.)

Now, we're not going to want to type that every time this comes up, so here's a function we can use that applies the correct test:

function arrayHasOwnIndex(array, prop) {
    return array.hasOwnProperty(prop) && /^0$|^[1-9]\d*$/.test(prop) && prop <= 4294967294; // 2^32 - 2
}

Which we'd use in the loop like so:

for (index in stuff) {
    if (arrayHasOwnIndex(stuff, index)) {
        display("stuff[" + index + "] = " + stuff[index]);
    }
}

If you're into extending built-in prototypes, you could even adapt that to put on Array.prototype and then call it like stuff.hasOwnIndex(index), but beware of conflicts, and use Object.defineProperty to make it non-enumerable if you can (e.g., in ES5-enabled environments).

3. Use the New forEach Function

The new ECMAScript 5th edition specification adds a function to arrays called forEach which calls a callback for each element in the array, where an element is an entry in the array whose name is an array "index" according to our earlier definition. It also guarantees that it will call them in order (according to the numeric value of their property name), lowest to highest. The function is called only for entries that actually exist. So:

var stuff;
stuff = [];
stuff[0] = "zero";
stuff[9999] = "nine thousand nine hundred and ninety-nine";
stuff.name = "foo";
stuff.forEach(function(value, index) {
    display("stuff[" + index + "] = " + value);
});

Like our for..in example earlier, this displays two lines. (Unlike our earlier for..in example, it guarantees the order — although again, I've never seen an implementation of JavaScript where the array "indexes" weren't handled in ascending order by for..in, it's just that the spec doesn't guarantee it.)

(The new spec defines several other handy Array functions aside from just forEach, it's worth taking a look.)

So should you use an old-fashioned loop, for..in correctly, or the new forEach (or one of the other functions like it?)? It depends entirely on what you're doing. For me, most of the time, a boring old-fashioned loop does the trick unless I'm using a sparse array for something, in which case I use for..in. But it's totally up to you. The main take-away point is that a for..in used incorrectly will bite you.

Happy Coding!

4 comments:

  1. I just wanna say THANK YOU! for all the knowledge you give away for free. I've read a few of your posts and you've already helped me become a much better developer. You're very smart and learned and you do neophyte coders like me a huge service with your blog. Thanks again.

    ReplyDelete
  2. @David: Gosh, thanks! I'm glad to have helped!

    Best,

    -- T.J.

    ReplyDelete
  3. Good stuff, but the test String(Number(index)) === index seems inappropriate, testing the actual conditions required seems like a better idea: /^0$|^[1-9]\d*$/. Also, the largest index allowed is 2^32 - 1, or 4294967295, so consider:

    stuff.hasOwnProperty(index) && /^0$|^[1-9]\d*$/.test(index) && index < 4294967296

    ReplyDelete
  4. @RobG: Thank you! Good point. `String(Number(index)) === index` is one of those "good enough" tests, but you're quite right it's wrong, and in a couple of different ways. (Consider properties with the names "2.5" or "00", for instance.) I've corrected it. Just to pick nits, your test has an OBOE in it (indexes can't be 2^32-1, so it would have to be < 4294967295, not < 4294967296).

    ReplyDelete