Thursday, 27 September 2012

Quick note on RegExp#lastIndex

Just a quick note on RegExp#lastIndex: It's misnamed. It's not the "last" index of anything, it's the index of the next character in the string that will be looked at by the regex instance's exec function (if the regex has the global flag and exec is used on a string that's long enough). It's 0 on freshly-created instances. Just useful to remember, if you ever need to set it explicitly (which you're allowed to do), think in terms of "next" rather than "last".

Of course, you hardly ever need to set it explicitly. The only reason I've ever done so was to work around a bug an issue in some JavaScript engines which are pretty outdated now (such as the one in Firefox 3.6, which virtually no one uses anymore). For anyone who doesn't already know about it, that issue was that some engines reused RegExp instances defined by literals in unexpected ways, which is a problem because RegExp instances have state (the aforementioned lastIndex). On those engines, you get surprising results from code like this:

function test(str, expect) {
    var re = /./g,
        m = re.exec(str);
    
    if (m) {
        console.log("Matched: " + m[0] +
                    ", expected: " + expect);
    }
    else {
        console.log("No match, expected " +
                    (expect === null ? "no match" : expect));
    }
}

test("one", "o");
test("two", "t");

If you run that code on most engines, you get the expected results. But on some engines, instead of matching "t" on the second call to test, the regex matches "w" because the instance defined by the literal gets reused (the same instance is used by both calls to test), and of course after the first call its lastIndex is 1. The workaround was to explicitly set re.lastIndex = 0; before calling exec when you were starting a new set of matches, even if it looked like the instance was fresh.

Apparently there was some ambiguity in the spec (gosh) which makes this an interpretation rather than an outright bug, although I suspect I'm not the only one thinking it's pretty clear that two separate instances should be used. It was cleared up in the ES5 spec and Firefox's engine has been doing this right for a while now (since Firefox 4, I think). I expect any other older engines that had this odd interpretation have probably been fixed as well.

No comments: