Thursday, 1 March 2012

Match everything...except!

Micro-post:

I was truly shocked to find today that in JavaScript regular expressions, . (the decimal point) doesn't do what I thought it did. I thought . meant "match any character." You too? Yeah. But it doesn't. Specifically, . doesn't match line terminators (that is, \r, \n, \u2028, and \u2029). From Section 15.10.2.8 of the ECMAScript 5 specification:

The production Atom :: . evaluates as follows:

  1. Let A be the set of all characters except LineTerminator.
  2. Call CharacterSetMatcher(A, false) and return its Matcher result.

...which if you spend really quite a long time looking tells you that . matches anything but line terminators.
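
If you'd rather check than read spec-ese, here's a quick sketch you can paste into a console. The variable names are just mine for illustration; it tests . against each of the four line terminators listed above, then against a couple of ordinary characters.

```javascript
// The four line terminators that "." refuses to match.
const lineTerminators = ["\n", "\r", "\u2028", "\u2029"];

// /^.$/ needs the whole string to be exactly one character that "." accepts.
const dotMatches = lineTerminators.map((ch) => /^.$/.test(ch));
console.log(dotMatches); // [false, false, false, false]

// Anything else is fine, including other whitespace such as a tab.
const tabMatches = /^.$/.test("\t");
const letterMatches = /^.$/.test("a");
console.log(tabMatches, letterMatches); // true true
```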

Maybe I'm just parading my ignorance here, but I would have thought that absent the "multiline" flag or something, . matched everything. Nope. If you want to do that, use [\s\S] (i.e., everything that either is or isn't whitespace).
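
Here's a small sketch of the difference, using a made-up two-line string. It also shows that the multiline flag makes no difference to . at all (that flag only changes what ^ and $ match).

```javascript
// A hypothetical two-line string to match against.
const text = "first line\nsecond line";

// "." stops at the line break, so this only grabs the first line.
const dotMatch = text.match(/.*/)[0];
console.log(dotMatch); // "first line"

// [\s\S] includes line terminators, so this grabs the whole string.
const anyMatch = text.match(/[\s\S]*/)[0];
console.log(anyMatch === text); // true

// The multiline flag doesn't help: "." still stops at the line break.
const multilineMatch = text.match(/.*/m)[0];
console.log(multilineMatch); // "first line"
```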

Happy coding!

2 comments:

Petr 'PePa' Pavel said...

I won't pretend I knew, but with my experience of PHP/PCRE, I would have suspected that new lines could be treated differently.

And there's more:
http://php.net/manual/en/reference.pcre.pattern.modifiers.php

I wonder if, for example, the PCRE_MULTILINE switch isn't another potential trap in JavaScript regular expressions. I'm mixing PHP and JS, but you get the point, right?

T.J. Crowder said...

That's what I meant about "absent the 'multiline' flag or something" (JavaScript's multiline flag is pretty much like PHP's and Perl's).