mySQL and the grand regexp retardedness with lettercasing

by Dave Dash 19Sep09

I wanted to find a list of Firefox addons that had smushed text in their title. E.g. FireBug or StumbleUpon. The normal porter stemming algorithm that Sphinx uses does not turn "StumbleUpon" into "stumbl upon" as it would with "Stumble Upon". I was hoping for, and unfortunately could not find a method to do a regular expression search/replace using mysql. If I could, I could have Sphinx read "StumbleUpon" as "Stumble Upon" and all would be well (although in theory this would backfire).

So my Plan B was to get a list of common smushed named addons (I'd say camelCase, but camelCase is different from SmushedText). Naturally I used my exceptional skill at regular expressions to concoct this query:

mysql> SELECT name FROM translated_addons WHERE name REGEXP '[a-z][A-Z][a-z]' = 1 LIMIT 10;
+------------------------+
| name                   |
+------------------------+
| Orbit Grey             | 
| Phoenity               | 
| Pinball                | 
| Qute                   | 
| FirefoxModern          | 
| Adblock                | 
| Add Bookmark Here      | 
| All-in-One Gestures    | 
| Bookmarks Synchronizer | 
| Browser Uptime         | 
+------------------------+
10 rows in set (41.28 sec)

Wait... none of these match. I scratched my head for a bit and then thought, oh wait, mysql is case insenstivie maybe it's turning [a-z][A-Z][a-z] into [a-z][a-z][a-z] ― stupid, but consistent with mysql. Then I pulled my other regexp card out of my sleve, character classes:

mysql> SELECT name FROM translated_addons WHERE name REGEXP '[[:lower:]][[:upper:]][[:lower:]]' = 1 LIMIT 10;
+------------------------+
| name                   |
+------------------------+
| Orbit Grey             |
| Phoenity               |
| Pinball                |
| Qute                   |
| FirefoxModern          |
| Adblock                |
| Add Bookmark Here      |
| All-in-One Gestures    |
| Bookmarks Synchronizer |
| Browser Uptime         |
+------------------------+
10 rows in set (12.96 sec)

No difference. Time to pull out the mysql documentation:

REGEXP is not case sensitive, except when used with binary strings.

ORLY?

Case-insenstive regular expressions when looking for [[:upper:]] or [[:lower:]]? Fine... I'll add some syntax to make you work right:

mysql> SELECT DISTINCT name FROM translated_addons WHERE name REGEXP BINARY '[[:lower:]][[:upper:]][[:lower:]]' = 1 LIMIT 10;
+---------------------------+
| name                      |
+---------------------------+
| FirefoxModern             |
| ChatZilla                 |
| ChromEdit                 |
| CuteMenus                 |
| DownloadWith              |
| easyGestures              |
| JavaScript Console Status |
| LinkVisitor               |
| OpenBook                  |
| QuickNote                 |
+---------------------------+
10 rows in set (9.68 sec)

That's more like it!

Unfortunately there's about 2609 addons matching this query and since I can't automatically fix these in mysql, I'll need to do some work:

1.  Create a new table for additional indexable data.
2.  Upon creation of any new addons with names that have SmushedText - store the "un smushed text".
3.  Index this "extras" field in Sphinx.

Bug: 517699


Where am I?

This is a single entry in the weblog.

"mySQL and the grand regexp retardedness with lettercasing" is filed under spindrop. It was published in September 2009.

September 2009
M T W T F S S
« Aug   Nov »
 123456
78910111213
14151617181920
21222324252627
282930  

Tags

&& && &&

need more help

If you found our tutorials and articles to be useful, but are still looking for more hands on help, consider hiring us. Find out more about how Spindrop can help you.

 

3 Responses to “mySQL and the grand regexp retardedness with lettercasing”


  1. 1 Topher Posted September 19th, 2009 - 3:21 pm

    MySQL isn’t inherently case insensitive, certain data types are. VARCHAR for example. CHAR is case sensitive. You could safely convert your existing varchar field to char and then your regex might work better.

  2. 2 J.B. Nicholson-Owens Posted September 21st, 2009 - 3:08 pm

    There is a match in the initial set (with your assumptions from the initial test): “FirefoxModern” should be the only match because it has “xMo” in it. Regular expressions ought to be case-sensitive by default, however, as they are in virtually every other usage.

Who's linking?

  1. 1 AMO Search: Powered by Sphinx at Spindrop Pingback on Sep 30th, 2009
    "[...] Contact Us « mySQL and the grand regexp retardedness ... "

Further Help

If you require more hands on assistance, we do offer affordable hands on support.

Leave a Reply


Comment guidelines: No spamming, no profanity, and no flaming. Inappropriate comments will be deleted outright.