I wanted to find a list of Firefox addons that had smushed text in their title. E.g. FireBug or StumbleUpon. The normal porter stemming algorithm that Sphinx uses does not turn "StumbleUpon" into "stumbl upon" as it would with "Stumble Upon". I was hoping for, and unfortunately could not find a method to do a regular expression search/replace using mysql. If I could, I could have Sphinx read "StumbleUpon" as "Stumble Upon" and all would be well (although in theory this would backfire).
So my Plan B was to get a list of common smushed named addons (I'd say camelCase, but camelCase is different from SmushedText). Naturally I used my exceptional skill at regular expressions to concoct this query:
mysql> SELECT name FROM translated_addons WHERE name REGEXP '[a-z][A-Z][a-z]' = 1 LIMIT 10;
+------------------------+
| name |
+------------------------+
| Orbit Grey |
| Phoenity |
| Pinball |
| Qute |
| FirefoxModern |
| Adblock |
| Add Bookmark Here |
| All-in-One Gestures |
| Bookmarks Synchronizer |
| Browser Uptime |
+------------------------+
10 rows in set (41.28 sec)
Wait... none of these match. I scratched my head for a bit and then thought, oh wait, mysql is case insenstivie maybe it's turning [a-z][A-Z][a-z] into [a-z][a-z][a-z] ― stupid, but consistent with mysql. Then I pulled my other regexp card out of my sleve, character classes:
mysql> SELECT name FROM translated_addons WHERE name REGEXP '[[:lower:]][[:upper:]][[:lower:]]' = 1 LIMIT 10;
+------------------------+
| name |
+------------------------+
| Orbit Grey |
| Phoenity |
| Pinball |
| Qute |
| FirefoxModern |
| Adblock |
| Add Bookmark Here |
| All-in-One Gestures |
| Bookmarks Synchronizer |
| Browser Uptime |
+------------------------+
10 rows in set (12.96 sec)
No difference. Time to pull out the mysql documentation:
REGEXP is not case sensitive, except when used with binary strings.
ORLY?
Case-insenstive regular expressions when looking for [[:upper:]] or [[:lower:]]? Fine... I'll add some syntax to make you work right:
mysql> SELECT DISTINCT name FROM translated_addons WHERE name REGEXP BINARY '[[:lower:]][[:upper:]][[:lower:]]' = 1 LIMIT 10;
+---------------------------+
| name |
+---------------------------+
| FirefoxModern |
| ChatZilla |
| ChromEdit |
| CuteMenus |
| DownloadWith |
| easyGestures |
| JavaScript Console Status |
| LinkVisitor |
| OpenBook |
| QuickNote |
+---------------------------+
10 rows in set (9.68 sec)
That's more like it!
Unfortunately there's about 2609 addons matching this query and since I can't automatically fix these in mysql, I'll need to do some work:
1. Create a new table for additional indexable data.
2. Upon creation of any new addons with names that have SmushedText - store the "un smushed text".
3. Index this "extras" field in Sphinx.
Bug: 517699



MySQL isn’t inherently case insensitive, certain data types are. VARCHAR for example. CHAR is case sensitive. You could safely convert your existing varchar field to char and then your regex might work better.
There is a match in the initial set (with your assumptions from the initial test): “FirefoxModern” should be the only match because it has “xMo” in it. Regular expressions ought to be case-sensitive by default, however, as they are in virtually every other usage.