Name Validation Regex for People's Names | NYC PHP Developer

Web development is a funny thing. Developers can spend a lot of time crafting and polishing complicated interactions in JavaScript. Yet they can fail spectacularly on simpler things, like writing decent HTML or allowing people to spell their own names correctly in forms. When your customers cannot correctly write their names in your form, it’s not the customer’s fault – it’s yours. They know how to spell their names. It’s the name validation regex built into your form that has failed them. When people encounter such situations, they have to remove the special characters from their names, which usually means any character not in the English alphabet.

If you tell the truth, you don’t have to remember anything.

— Mark Twain

When people have to alter the spelling of their name to pass the form validation, they have to remember how they misspelled their name for that site. It’s likely that they’ll have to misspell their name differently on another site, because it has a different set of validation rules.

Getting Started

When starting to explore this topic, I wanted to get some real world data, so I could use it for my unit tests. So I did what any developer would do – I asked my Twitter followers.

Q: Have you ever been unable to type your name correctly into a website form, because their validation didn’t like your names’ “special” characters? I need some data for a project. Just reply to this tweet with your name, like so

givenName=”andrew”
surname=”woods”

#webdev

Source: My Tweet

I didn’t receive as many responses with name data as I hoped. So, in the meantime, I’ll have to create some test data for the permutations that I know I want to test for. My plan is to start with that, and adapt as, hopefully, more people contribute name examples. A couple of people did refer me to an article about assumptions that developers make about people’s names. It’s worth reading. At some point, you have to make decisions to be able to write your code. As you get data, update your code to handle it. We all start somewhere. If you have a customer support team, ask them for data about customer issues dealing with name problems.

Validation Is Necessary

Validation is a necessary step in processing data, especially when collecting data from an unknown source. We want to prevent bad data from entering our system. Spam-bots are a huge problem on the web. There’s also the problem of people deliberately trying to hack our system by injecting SQL. So validation needs to be done. I’ve been looking into how to improve upon what people do to handle name data. Using regular expressions isn’t the problem. It’s developers not making the most of their capabilities.

In this article, we’ll walk through the process of building up a regular expression, and discuss the various decision points along the way. This should help you adapt the name validation regex to meet your specific needs.

Prerequisites

First and last names can be processed separately
No restrictions on length

Separate Processing

Some people think having a single name field is the best strategy. That might be true from a cultural/anthropological perspective. However, I’d argue that many systems have separate fields for first name and last name. We need to work with certain constraints and to prioritize our efforts accordingly. By the way, first name and last name are terrible terms. It’s better to refer to them as given name and surname, respectively. The term family name is also a good substitute for last name. Only in America do we think it’s perfectly normal to say stupid shit like write your last name first. WTF?! but I digress …

So, as I was saying, most systems have separate fields for given name and family name. Since they’re stored that way, it’s best to capture them separately. This is the most flexible solution, programmatically speaking. It’s much easier to keep them separate, and only join the pieces together as needed, than it is to split a full name into pieces. If you were to provide a single field in the form, then you have the headache of trying to figure out where the given name stops and the family name begins. Parsing the full name into separate parts is a problem you don’t want to have to worry about. It’s also one that you can easily avoid by capturing them as separate fields. Allow your customer to tell you what their given name and family name are. Our job is to make sure they can tell us their name as accurately as possible. Just do your best, within the constraints of your environment.

No Restrictions On Length

Names vary greatly across cultures. Some people have a one-word last name, e.g. Woods. Other people have a multi-word last name, e.g. de la Hoya. Some people have a hyphenated last name, e.g. Thorne-Smith. In some cultures, people have multiple first names and multiple last names. That’s OK. We can account for those conditions within our regular expression.

Building the Name Validation Regex

It’s worth remembering that regular expressions are a different beast than other functions. A regex is for examining the structure or format of the data. We aren’t interested in evaluating the contents of the customer’s name. To a regular expression it doesn’t matter if someone’s name is Smith, Moreau, or Castillo. We don’t care if our customers’ surnames are formed by using their maternal ancestors’ surnames 5 generations deep. That’s an issue for nerds of a different variety to worry about. When you get down to it, regular expressions care about one thing: what does the data look like? That’s what we care about.

What types of characters are there?
How many of them are allowed?
What is their format?
Does it look like we expect it to?
Does it appear to be reasonable for our purposes?

As we walk through the process of building the name regular expression, I’ll discuss the varying assumptions so that you can adapt things to your process. This code is in PHP, but most of it carries over to JavaScript, because both regex flavors borrow their syntax from Perl, so you can adapt this to your client-side scripts. Some details do differ: for example, JavaScript writes Unicode escapes as \u00C0 rather than PHP’s \x{00C0}.

Our Starting Point

We need a simple function to do the validation. All we need is a Boolean response. Our initial name validation regex contains something like what most people are currently using.


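function isNameValid($name) {
    // ^ and $ anchor the pattern so the whole name must match, not just a prefix.
    $pattern = '/^\w+$/';
    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    return preg_match($pattern, $name) === 1;
}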

Most web developers dislike working with forms, and regular expressions are considered, by many, to be a dark magic best avoided. Since this problem uses both, it’s no wonder most developers don’t spend much time on it. Let’s examine the expression. In its current form, it’s too simplistic. It says that only word characters are valid. That means no spaces, no apostrophes, and no umlauts, accents, or hyphens. Word characters are limited to the English alphabet, digits, and underscores. There are millions of people whose names will pass this regex. However, there are millions more whose names will fail. So we have some work to do!
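To see the limitation concretely, here is a small sketch that runs a few names through a \w+ check. It assumes the pattern is anchored at both ends (^ and $) so that the whole string has to match; the helper name passesWordCheck and the sample names are made up for this illustration.

```php
<?php
// Hypothetical helper: true when $name consists only of \w characters.
// Without the u modifier, \w covers A-Z, a-z, 0-9 and the underscore.
function passesWordCheck(string $name): bool
{
    return preg_match('/^\w+$/', $name) === 1;
}

// Illustrative sample data, not real customer names.
$samples = ['Woods', 'woods_42', 'Van Halen', "O'Donnell", 'Thorne-Smith'];

foreach ($samples as $name) {
    echo str_pad($name, 14), passesWordCheck($name) ? 'valid' : 'rejected', PHP_EOL;
}
```

Note that the login-style woods_42 is accepted while several perfectly ordinary names are not, which is the opposite of what we want.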

Remove Digits and Underscores

We probably don’t want to allow numbers or underscores. If the name contains either of those, it’s probably a login and not a person’s name. So we need to modify the regex.

    $pattern = '/^[A-Za-z]+$/';

Allow Multiple Words

There’s a whole a lot of people who have more than one word in their last name. For example, the famous rock guitarist Eddie Van Halen comes to mind. Let’s update our regex to allow multiple words.

    $pattern = '/^[A-Za-z]+(\ [A-Za-z]+)*$/';

The \ [A-Za-z]+ within the parentheses is a single space followed by the same character class as above, so each extra word has to be introduced by exactly one space. The surrounding ()* says the contents within can occur 0 or more times, and the closing $ makes sure the match runs to the end of the name. Now the names Woods, Van Halen, and De La Hoya will all match. That’s a good start. However, we can make some more improvements.
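A quick way to sanity-check a multi-word pattern is to run it against a handful of names. The sketch below assumes a variant that is anchored at both ends and in which each additional word is introduced by exactly one space; isMultiWordNameValid is a name invented for this example.

```php
<?php
// Hypothetical helper for illustration only.
function isMultiWordNameValid(string $name): bool
{
    // One word of letters, then zero or more " word" groups, up to the end.
    $pattern = '/^[A-Za-z]+(\ [A-Za-z]+)*$/';
    return preg_match($pattern, $name) === 1;
}

var_dump(isMultiWordNameValid('Woods'));      // bool(true)
var_dump(isMultiWordNameValid('Van Halen'));  // bool(true)
var_dump(isMultiWordNameValid('De La Hoya')); // bool(true)
var_dump(isMultiWordNameValid('Van  Halen')); // bool(false): two spaces in a row
var_dump(isMultiWordNameValid('woods_42'));   // bool(false): underscore and digits
```

Leading or trailing spaces are rejected as well, so it is worth calling trim() on the input before validating it.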

Add Apostrophe and Hyphen Support

Some names have hyphens, some have apostrophes. Let’s add support for these characters now. There probably aren’t any names that start with these characters. Names like O’Donnell or Jones-Smith contain them, but never as the first character. If the apostrophe or hyphen is the first character, something is probably wrong. Let’s update the regex.

    $pattern = '/^[A-Za-z][A-Za-z\'\-]*(\ [A-Za-z][A-Za-z\'\-]*)*$/';

The downside is that these special characters can be duplicated. So, something like O''Donnell, with two apostrophes, can slip through and would be considered valid. In the spirit of Postel’s Law, I think it’s worth it. If it’s really a problem for you, create a separate function to remove duplicated apostrophes and hyphens before getting to this function.
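If duplicated punctuation does matter in your system, the clean-up function mentioned above could look something like this minimal sketch. The name collapseRepeatedPunctuation is hypothetical, and the function only handles the straight apostrophe and the hyphen used in the pattern.

```php
<?php
// Hypothetical pre-processing helper: collapse runs of the same apostrophe
// or hyphen into a single character before the name is validated.
function collapseRepeatedPunctuation(string $name): string
{
    // ([\'\-]) captures one apostrophe or hyphen; \1+ matches repeats of it.
    // preg_replace() returns null on error, so fall back to the input.
    return preg_replace('/([\'\-])\1+/', '$1', $name) ?? $name;
}

echo collapseRepeatedPunctuation("O''Donnell"), PHP_EOL;   // O'Donnell
echo collapseRepeatedPunctuation('Jones--Smith'), PHP_EOL; // Jones-Smith
```

Because the backreference only matches repeats of the same character, a mixed sequence such as an apostrophe followed by a hyphen is left alone.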

Adding Extended ASCII Character Support

There are a number of names in European languages that will fail using our current regular expression: any name that uses an accent, ñ, or an umlaut. Numerous Spanish, French, German, Dutch, and Nordic names fall into this category. To enable support for these names, we need to dig into the Unicode character charts. The Latin-1 Supplement block is 0080-00FF. That’s the full range. However, there are a bunch of control codes, money symbols, punctuation, and math-y bits we’d like to skip. So we’ll shorten the range to 00C0-00FF. (Two math symbols, × at 00D7 and ÷ at 00F7, still sit inside that shorter range; exclude them if you want to be strict.) These characters should be allowed to occur anywhere in the name. We can have names like Kevin ßáçøñ now! I know what you’re thinking! Did I really just spell out the name Bacon using these supplemental Latin characters? Yes, yes I did! I even added a strip of bacon on top of the n just for you :)

    $pattern = '/^[A-Za-z\x{00C0}-\x{00FF}][A-Za-z\x{00C0}-\x{00FF}\'\-]*(\ [A-Za-z\x{00C0}-\x{00FF}][A-Za-z\x{00C0}-\x{00FF}\'\-]*)*$/u';

Note that the regex requires the u after the final slash. That’s a pattern modifier that tells PHP to treat the pattern and subject strings as UTF-8.
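Putting the pieces together, here is a sketch of what a complete validation function might look like. It assumes the pattern is anchored at both ends and that words are separated by single spaces; the function name isLatinNameValid and the $letter and $word building blocks are introduced here only to keep the long pattern readable.

```php
<?php
// Hypothetical helper that assembles the pattern from smaller parts.
function isLatinNameValid(string $name): bool
{
    // Letters: basic Latin plus the 00C0-00FF part of Latin-1 Supplement.
    $letter = 'A-Za-z\x{00C0}-\x{00FF}';
    // A word starts with a letter, then allows letters, apostrophes, hyphens.
    $word = '[' . $letter . '][' . $letter . '\'\-]*';
    // One word, then zero or more space-separated words; u means UTF-8.
    $pattern = '/^' . $word . '(\ ' . $word . ')*$/u';

    // With the u modifier, preg_match() returns false for invalid UTF-8,
    // which this comparison also treats as "not valid".
    return preg_match($pattern, $name) === 1;
}

var_dump(isLatinNameValid('Müller'));     // bool(true)
var_dump(isLatinNameValid('Núñez'));      // bool(true)
var_dump(isLatinNameValid("O'Donnell"));  // bool(true)
var_dump(isLatinNameValid('-Smith'));     // bool(false): starts with a hyphen
var_dump(isLatinNameValid('Łukasz'));     // bool(false): Ł is outside 00C0-00FF
```

The last case shows the limit of this range: letters from other Latin blocks still fail, so treat the range as a starting point to widen as real data comes in.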

Additional Considerations

This isn’t perfect, but it’ll allow a whole lot more of your customers to be able to fill in your forms with the correct spelling of their names. However, keep in mind, this only works for people with English-like characters in their names. This will work pretty well for North America, South America and Western Europe.

Write Unit Tests

You do have unit tests in your application, right? If not, set that up right now, and use this as an opportunity to start creating them in your application. While your regex may seem simple in the beginning, it’ll probably get more complicated over time. You’ll need real world data in your unit tests, to make sure that you don’t break your existing support as your needs grow.
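As a starting point, a table of names and expected results goes a long way. The sketch below uses plain PHP rather than a test framework so it can run on its own; in a real project the same table would fit naturally into whatever test runner you already use. The runNameCases helper and the sample names are assumptions made for this example, and isNameValid here assumes an anchored pattern with single-space word separators.

```php
<?php
function isNameValid(string $name): bool
{
    $pattern = '/^[A-Za-z\x{00C0}-\x{00FF}][A-Za-z\x{00C0}-\x{00FF}\'\-]*'
             . '(\ [A-Za-z\x{00C0}-\x{00FF}][A-Za-z\x{00C0}-\x{00FF}\'\-]*)*$/u';
    return preg_match($pattern, $name) === 1;
}

// name => expected result. Extend this table as real-world data arrives.
$cases = [
    'Woods'        => true,
    'Van Halen'    => true,
    'de la Hoya'   => true,
    'Thorne-Smith' => true,
    "O'Donnell"    => true,
    'Núñez'        => true,
    'woods_42'     => false,
    '-Smith'       => false,
    'Van  Halen'   => false,
    ''             => false,
];

// Hypothetical runner: returns the names whose result differs from expected.
function runNameCases(array $cases): array
{
    $failures = [];
    foreach ($cases as $name => $expected) {
        // Array keys that look like integers are stored as ints, so cast back.
        if (isNameValid((string) $name) !== $expected) {
            $failures[] = $name;
        }
    }
    return $failures;
}

$failures = runNameCases($cases);
echo $failures ? 'FAILED: ' . implode(', ', $failures) : 'All name cases passed', PHP_EOL;
```

Every time a customer reports a name that was rejected, add it to the table first, watch the run fail, and then adjust the pattern.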

Additional Foreign Language Support

To support names in other languages like Chinese, Japanese, Arabic, Hindi, and Hebrew (just to name a few), you should consider creating a separate function for those languages. Trying to support all your desired languages in a single function will create a massive regex that nobody will want to manage.
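One way to keep those functions small is to lean on Unicode script properties, which PHP’s regex engine supports when the u modifier is set. The following is only a rough sketch: the function name isNameValidForScript, the set of scripts, and the rules for each one are assumptions for illustration, and real naming conventions in these languages need more care than this.

```php
<?php
// Hypothetical dispatcher: one small pattern per writing system.
function isNameValidForScript(string $name, string $script): bool
{
    // \p{Han}, \p{Hebrew} and \p{Arabic} are Unicode script properties.
    $patterns = [
        'han'    => '/^\p{Han}+$/u',
        'hebrew' => '/^\p{Hebrew}+(?:[\ \-]\p{Hebrew}+)*$/u',
        'arabic' => '/^\p{Arabic}+(?:\ \p{Arabic}+)*$/u',
    ];

    if (!isset($patterns[$script])) {
        throw new InvalidArgumentException("Unsupported script: $script");
    }

    return preg_match($patterns[$script], $name) === 1;
}

var_dump(isNameValidForScript('王伟', 'han'));    // bool(true)
var_dump(isNameValidForScript('דוד', 'hebrew'));  // bool(true)
var_dump(isNameValidForScript('Wang', 'han'));    // bool(false): Latin letters
```

Each pattern stays short enough to read and test on its own, and adding a script means adding one entry rather than growing a single giant expression.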

Check Length Separately

You’ll notice the regular expression doesn’t restrict the total length of the name. That’s a good thing. That should be done separately for maximum flexibility. Since you’re using some Unicode characters, you should use mb_strlen() instead of strlen() to check the name length, because strlen() counts bytes rather than characters.
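A separate length check might look like the sketch below. The function name isNameLengthOk and the default limits of 1 and 100 are placeholders rather than recommendations; pick limits that match your database column. mb_strlen() comes from the mbstring extension, so the sketch falls back to counting characters with a UTF-8 regex when that extension is not loaded.

```php
<?php
// Hypothetical helper; the default limits are placeholders.
function isNameLengthOk(string $name, int $min = 1, int $max = 100): bool
{
    // Count characters, not bytes. preg_match_all() with /./su returns the
    // number of UTF-8 characters when mbstring is unavailable.
    $length = function_exists('mb_strlen')
        ? mb_strlen($name, 'UTF-8')
        : preg_match_all('/./su', $name);

    return $length >= $min && $length <= $max;
}

echo strlen('Núñez'), PHP_EOL;                      // 7, because ú and ñ take two bytes each
var_dump(isNameLengthOk('Núñez', 1, 5));            // bool(true): five characters
var_dump(isNameLengthOk('Núñez', 1, 4));            // bool(false)
```

A byte-based check would have treated this five-letter name as seven characters long, which is exactly the kind of surprise that makes people misspell their own names.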

Storing Names In The Database

You’ll need to ensure that your database table can handle the data appropriately.

Is your field long enough?
Is it using the right data type?
Will it support all the characters specified in your regular expression?
Did you escape your data or, better, bind it through prepared statements? We wouldn’t want our data to affect the SQL syntax.

Conclusion

I hope that helps. I wanted to share my thoughts with you all, in the hopes it’ll help improve the user experience for more people on the web. While it might not be perfect, it should make things better than what we have now. I’m putting together a library to help simplify things for people. I’d like to create a composer package, to make available on Packagist. Would it be worth it to create a JavaScript version for NPM? I’ll keep you all posted on these things in future blog posts.