In MySQL, searches for things that sound similar can be very important. For example, complicated medical names often have complicated spellings that hardly anyone remembers. If you are searching for a person by last name, the search should be forgiving to the point where if nothing is found, it tries again just to be sure, and if 1000 results are found, it tries to be a bit more strict.

This type of smart search is actually very easy to set up if you are working with MySQL.

Both MySQL and PHP provide built in “sounds like” type functions. In MySQL, there is ‘SOUNDEX(expr1)’, as well as the boolean operator, ‘expr1 SOUNDS LIKE expr2′. In PHP, my favorite has become the ‘metaphone’ function, but there are others such as ’sounds_like’, etc. (See links at the end for more information). Both are good functions and do their jobs, but sometimes the flexibility you need is not there. (Not to mention, if you are using PHP to do the sounds like comparison, then you will have some slow searching.)

Before you use the soundex function, you should look at this short article which explains how soundex works: http://genealogypro.com/articles/soundex.html
Now, realize that MySQL doesn’t follow this exactly. MySQL will not stop after 3 numbers for identifying sounds. It will keep going after the string. So in the example with ‘washington’ they say it should yeild ‘W-252′, MySQL keeps going and ends up with ‘W25235′. (Note that ‘washintonian’ SOUNDS LIKE ‘washington’ will return true because repetitions are not counted (washingtonian -> wshngtnn -> wsngtnn -> wsngtn) but as long as distinct consants go, soundex will return the number for their sound.

When doing a search by last name for ‘cornicova’, you would expect the following to return a positive boolean value:

SELECT * FROM `myTable` WHERE `last_name` SOUNDS LIKE ‘cornicova’;

If you expect someone named ‘kournikova’ to be returned, guess again. You will get ‘cournikova’, but no one whose name starts with a ‘K’. This is the first issue with MySQL’s soundex function. It can do the ’sounds’ fine, but most people dont realize that to ’sound like’ in MySQL means that you must start with the same letter. This is a bit too strict. This is due to keeping the first letter literally as the first character of the soundex string.

In order to remedy this, stripping the first letter is a perfect solution. The following query works as expected:

SELECT * FROM `myTable` WHERE SUBSTR(SOUNDEX(`lastname`), 2) = SUBSTR(SOUNDEX('cornicova'), 2);

Problem solved. But imagine that we dont have someone named ‘kournikova’ but we have ‘kournikovalov’. Our search won’t return this person. This is the second issue of the soundex function. If we want to be a bit more forgiving in the case that the query above yields only a few people, we need to see if any part of ‘kournikovalov’ sounds like ‘cornicova’. Obviously it does, so how do we get MySQL to agree?

Using the INSTR function (see below, similar to the LOCATE function) we can check if the similar sounding string matches up anywhere inside what we’re looking for:

SELECT * FROM `myTable` WHERE INSTR(SOUNDEX(`last_name`), SUBSTR(SOUNDEX('cornicova'), 2)) > 0;

Notice that for the first piece we don’t chop off the first letter. This is simply because we are searching inside the string anywhere, and there is no point of chopping out something that won’t interfere anyway. It just saves some computation time.

This will give you the extra freedom you need to make a good ’sounds like’ search, and is something that MySQL documentation doesn’t mention explicitly. If you have any questions about the functions used above, all of their documentation can be seen here: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html.

The issue still remains with the 000 string. When you have the string ‘Lee’, its soundex representation is L000. ‘BROWNLEE’ has a soundex representation of B654, but obviously part of ‘BROWNLEE’ sounds like ‘LEE’. The trick here is to pad the first character. Soundex of ‘OLEE’ is O-400. Then, if you chop out the zero’s (as they are just there for padding) you are left with as soundex of ‘4′. Obviously, 4 is a part of B654, and this will return a true. The problem with this, is that anything with an L in it is really what you end up with. The other issue is that ‘LEE’ sounds like a part of ‘LEEBOW’ but this wouldn’t show that unless we pad both strings (SOUNDEX(’LEEBOW’) = L100 and there is no 4 in there.) The next solution is to pad BOTH strings. Anything would really work, but O is a good starting character.

As an example, say we want to see if ‘OBER’ sounds like any part of ‘DOBERMAN’. If we did a standard sounds like:

SELECT 'OBER' SOUNDS LIKE 'DOBERMAN'; // returns 0

Trying the next way,

SELECT SUBSTR(SOUNDEX('DOBERMAN'), 2) = SUBSTR(SOUNDEX('OBER'), 2); // returns 0

Try the next way,

SELECT INSTR(SOUNDEX('DOBERMAN'), SUBSTR(SOUNDEX('OBER'), 2)) > 0; // returns 0

Try the final way,

SELECT INSTR(SUBSTR(SOUNDEX('ODOBERMAN'), 2), REPLACE(SUBSTR(SOUNDEX('OOBER'), 2), '0', '')) > 0; // returns 1!!


  1. Matt

    Awesome page - very helpful stuff, cheers!

  2. Tom

    this have saved me a lot of head banging - thank you!

  1. 1 Blog Experiment: Comparing text and SQL - kometbomb

    [...] And yet again, MySQL and other DB’s could have the same feature. And I intentionally left out the phonetic (”sounds like”) comparison features that are very often built-in (AFAIK MySQL and PHP should have those included by default). This article knows MySQL’s “sounds like” operator is quite logically SOUNDS LIKE. [...]



Leave a Comment