Another modest example of how Wordpress is total rubbish

Posted by Jacques Chester on Sunday, November 16, 2008

Nicholas Gruen received an email today from a distraught reader who couldn’t sign up because his name is Irish and, consequently, his email address contains an apostrophe.

Apostrophes are perfectly legal characters in email addresses. But Wordpress, for reasons known only to the bozos who write it, doesn’t use any well-tested or well-known email address validator. They rolled their own, quite painfully incomplete, “validator”.

Tucked away in wp-includes/formatting.php is a function is_email(). It looks like this:


function is_email($user_email) {
	$chars = "/^([a-z0-9+_]|\\-|\\.)+@(([a-z0-9_]|\\-)+\\.)+[a-z]{2,6}\$/i";
	if (strpos($user_email, '@') !== false && strpos($user_email, '.') !== false) {
		if (preg_match($chars, $user_email)) {
			return true;
		} else {
			return false;
		}
	} else {
		return false;
	}
}

The important line is this:

$chars = "/^([a-z0-9+_]|\\-|\\.)+@(([a-z0-9_]|\\-)+\\.)+[a-z]{2,6}\$/i";

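In it we see a regular expression encoding of what the Wordpress team thinks is a legitimate email address. It’s laughably incomplete: the part before the @ may contain only letters, digits, plus signs, underscores, hyphens and dots, so an address with an apostrophe is rejected.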
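To see the failure concretely, here is a minimal sketch that runs the quoted pattern against two addresses. The wrapper name old_is_email() and both sample addresses are made up for illustration; only the pattern itself is taken from the function above, and the strpos() pre-checks are left out because the pattern already requires an @ and a dot.

```php
<?php
// Hypothetical wrapper around the pattern quoted from is_email();
// the name old_is_email() is an assumption made for this sketch.
function old_is_email($user_email) {
	$chars = "/^([a-z0-9+_]|\\-|\\.)+@(([a-z0-9_]|\\-)+\\.)+[a-z]{2,6}\$/i";
	// preg_match() returns 1 when the pattern matches and 0 when it does not.
	return preg_match($chars, $user_email) === 1;
}

// Illustrative addresses, not real ones.
var_dump(old_is_email("jane.doe@example.com"));   // bool(true)
var_dump(old_is_email("pat.o'hara@example.com")); // bool(false): the apostrophe is not in the character class
```

The second address is syntactically legal, but nothing in the first character class matches an apostrophe, so the match fails before the @ is ever reached.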

It turns out that the regular expression to fully check a legitimate email address looks like this:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
\t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*”(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
\t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]

That’s from a Perl module called Mail::RFC822::Address, which fully validates to the standard (which is subtle and complex [1]), rather than just making something up.

[1] How complex? Very complex, actually. A commenter at Reddit points out that this regular expression was generated programmatically, and even it can’t account for cases where an email address has more than six levels of nested parentheses. But I’d back it against Wordpress’s function any day. Another commenter at Reddit points out that email addresses are not technically parseable using regular expressions at all; another Perl module, RFC::RFC822::Address, uses a recursive parser to check email addresses.

And this is part of what separates PHP projects, like Wordpress, from the world of sane programming. PHP as a language deters composition and reuse because it lacks things like a sensible module system or namespaces, and so every project winds up reinventing the same damn wheels over and over again. Wordpress is a particularly bad example, as it seems to be highly allergic to including anyone else’s code. The life cycle of a Wordpress software nightmare goes like this:

  1. Problem X emerges. It has already been solved by Y. Wordpress implementors are told about Y but decide to write Z instead.
  2. It turns out that Z does not conform to the standard, or has bugs, or overlooks a lot of corner cases. More patches are released every time Wordpress is updated, but soon the Wordpress designers get bored and go back to rewriting the admin interface again.
  3. Z languishes. Either the standard gets updated or it is found to be riddled with exploitable bugs.
  4. Years after first being told to use the already mature, proved, tested, bugfixed and available-the-entire-time Y, Wordpress developers integrate Y. It’s big news.
  5. Problem A emerges. It has already been solved by B…

Of course we are stuck with it. Wordpress is the most recent example of the triumph of Worse-is-Better software and I for one am used to its warts. Like an abused spouse I am too afraid to go elsewhere in case I have to go through another cycle of pain.

But seriously. This glitch has been on the bugtracker in two different entries for more than a year. Both are marked to be fixed in 2.9, which is more than six months away, when it would take about 30 seconds with Google to do a better job. So if you have a Celtic surname like O’Reilly, O’Malley or O’Hannesey, you might just be out O’luck. And if you use Wordpress, you’re shit out of luck no matter what your name is.
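For comparison, a minimal sketch of leaning on something already written instead of a home-grown pattern. It assumes PHP 5.2 or later, where the bundled filter extension provides filter_var() and FILTER_VALIDATE_EMAIL. This is one option picked for illustration, not the fix proposed in the bugtracker entries, and it makes no claim to full RFC 822 coverage. The helper name and the sample inputs are hypothetical.

```php
<?php
// Hypothetical helper; the name looks_like_email() is an assumption for this sketch.
function looks_like_email($address) {
	// filter_var() returns the filtered string on success and false on failure.
	return filter_var($address, FILTER_VALIDATE_EMAIL) !== false;
}

// Illustrative inputs, not real addresses.
var_dump(looks_like_email("pat.o'hara@example.com"));
var_dump(looks_like_email("not an address"));
```

The apostrophe is among the characters the built-in filter accepts before the @, so the first call should report true, while the second input has no @ at all and is rejected.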



This entry was posted on Sunday, November 16th, 2008 at 9:47 PM and filed under Geeky Musings, IT and Internet.

6 Responses to “Another modest example of how Wordpress is total rubbish”

  1. Tel_ said:

    People are (somewhat rightly) nervous about the dreaded apostrophe because it is a common way to insert SQL code from an input box.

    There are, of course, well-known workarounds for preventing SQL insertion. Perl’s DBI/DBD system knows how to do it properly. I think it’s probably a fair comment to say that overall Perl has much better designed libraries than PHP. Perl is also quite fast for WWW processing but only if you embed it into your webserver (e.g. use Apache’s mod_perl).

  2. Jacques Chester said:

    Tel;

    That’s another example. The Wordpress project has consistently refused to use an existing PHP DBI-alike (there are several), choosing instead to roll their own. It’s following the lifecycle I outlined.

    Naturally every plugin is quite used to having total access to any and all database tables and can insert its data any old way it likes.

  3. Tel_ said:

    Someone asking for the ire of Éire :

    http://www.regular-expressions.info/email.html

    Another trade-off is that my regex only allows English letters, digits and a few special symbols. The main reason is that I don’t trust all my email software to be able to handle much else. Even though John.O’Hara@theoharas.com is a syntactically valid email address, there’s a risk that some software will misinterpret the apostrophe as a delimiting quote. E.g. blindly inserting this email address into a SQL will cause it to fail if strings are delimited with single quotes. And of course, it’s been many years already that domain names can include non-English characters. Most software and even domain name registrars, however, still stick to the 37 characters they’re used to.

    He has a point though: does Mail::RFC822::Address handle the full UTF-8 DNS options? Should we test Chinese email addresses inside a Chinese domain? Full language support does substantially increase the complexity of the software. Only my opinion, but ASCII was one of the main reasons English-speaking nations got a huge head start in computing education over the Asian nations (and personally, I’m glad to keep that edge as long as possible).

  4. conrad said:

    I think some APL programmers would be impressed by some of that code.

  5. Smiley said:

    which is subtle and complex

    Whatever happened to the KISS principle? It’s been a little while since I’ve used regular expressions, but aren’t all those \r, \n and \t placeholders for carriage returns, new lines and horizontal tabs? The complexity is insane.

    The fact that

    this regular expression was generated programmatically

    suggests that the system has been over-engineered.

  6. JM said:

    “suggests that the system has been over-engineered.”

    Not really. There are two things at work here:

    1.) the internet email addressing standard
    2.) the use of regular expressions to implement the standard.

    Second first. regex’s (as fondly known) are complex and arcane even for simple things. Basically they are a tiny programming language, but one that lacks a lot of basic facilities - loops and selection - that otherwise have to be simulated.

    In short, if you have anything more than moderately complex, regex’s are not the most understandable implementation and should be avoided unless you really, really need the performance.

    As to the internet email addressing standard. Sendmail - which is the mail server that pretty much established the standard - was developed at a time when there was a great deal of uncertainty about which way the addressing standard would evolve. At the time X.400 (an OSI standard) was all the rage, but it is just about backwards from what we know and love today (but you can see remnants of it in Active Directory/LDAP) and was in competition with the looser standards that evolved in small Unix environments.

    As a result, Sendmail is extremely flexible and tolerant of how email addresses are formed and can be configured to handle just about anything.

    The result was that a formulation of the email address standard ended up incorporating a lot of unnecessary complexity largely facilitated by the Sendmail implementation that attempted (and unfortunately succeeded) to allow any arbitrary format you wanted.

    So short answer:

    complex problem + impoverished implementation “language” = WTF is that???

    Hence the long regex Jacques has posted.

    But …. meh. What’s done is done.
