By Piskvor


2009-06-12 14:08:42 8 Comments

I have a table with ~500k rows; varchar(255) UTF8 column filename contains a file name;

I'm trying to strip out various strange characters out of the filename - thought I'd use a character class: [^a-zA-Z0-9()_ .\-]

Now, is there a function in MySQL that lets you replace through a regular expression? I'm looking for a similar functionality to REPLACE() function - simplified example follows:

SELECT REPLACE('stackowerflow', 'ower', 'over');

Output: "stackoverflow"

/* does something like this exist? */
SELECT X_REG_REPLACE('Stackoverflow','/[A-Zf]/','-'); 

Output: "-tackover-low"

I know about REGEXP/RLIKE, but those only check if there is a match, not what the match is.

(I could do a "SELECT pkey_id,filename FROM foo WHERE filename RLIKE '[^a-zA-Z0-9()_ .\-]'" from a PHP script, do a preg_replace and then "UPDATE foo ... WHERE pkey_id=...", but that looks like a last-resort slow & ugly hack)

11 comments

@Nae 2019-01-24 22:30:36

The one below basically finds the first match from the left and then replaces all occurences of it (tested in ).

Usage:

SELECT REGEX_REPLACE('dis ambiguity', 'dis[[:space:]]*ambiguity', 'disambiguity');

Implementation:

DELIMITER $$
CREATE FUNCTION REGEX_REPLACE(
  var_original VARCHAR(1000),
  var_pattern VARCHAR(1000),
  var_replacement VARCHAR(1000)
  ) RETURNS
    VARCHAR(1000)
  COMMENT 'Based on https://techras.wordpress.com/2011/06/02/regex-replace-for-mysql/'
BEGIN
  DECLARE var_replaced VARCHAR(1000) DEFAULT var_original;
  DECLARE var_leftmost_match VARCHAR(1000) DEFAULT
    REGEX_CAPTURE_LEFTMOST(var_original, var_pattern);
    WHILE var_leftmost_match IS NOT NULL DO
      IF var_replacement <> var_leftmost_match THEN
        SET var_replaced = REPLACE(var_replaced, var_leftmost_match, var_replacement);
        SET var_leftmost_match = REGEX_CAPTURE_LEFTMOST(var_replaced, var_pattern);
        ELSE
          SET var_leftmost_match = NULL;
        END IF;
      END WHILE;
  RETURN var_replaced;
END $$
DELIMITER ;

DELIMITER $$
CREATE FUNCTION REGEX_CAPTURE_LEFTMOST(
  var_original VARCHAR(1000),
  var_pattern VARCHAR(1000)
  ) RETURNS
    VARCHAR(1000)
  COMMENT '
  Captures the leftmost substring that matches the [var_pattern]
  IN [var_original], OR NULL if no match.
  '
BEGIN
  DECLARE var_temp_l VARCHAR(1000);
  DECLARE var_temp_r VARCHAR(1000);
  DECLARE var_left_trim_index INT;
  DECLARE var_right_trim_index INT;
  SET var_left_trim_index = 1;
  SET var_right_trim_index = 1;
  SET var_temp_l = '';
  SET var_temp_r = '';
  WHILE (CHAR_LENGTH(var_original) >= var_left_trim_index) DO
    SET var_temp_l = LEFT(var_original, var_left_trim_index);
    IF var_temp_l REGEXP var_pattern THEN
      WHILE (CHAR_LENGTH(var_temp_l) >= var_right_trim_index) DO
        SET var_temp_r = RIGHT(var_temp_l, var_right_trim_index);
        IF var_temp_r REGEXP var_pattern THEN
          RETURN var_temp_r;
          END IF;
        SET var_right_trim_index = var_right_trim_index + 1;
        END WHILE;
      END IF;
    SET var_left_trim_index = var_left_trim_index + 1;
    END WHILE;
  RETURN NULL;
END $$
DELIMITER ;

@Steve Chambers 2016-07-29 13:49:17

UPDATE 2: A useful set of regex functions including REGEXP_REPLACE have now been provided in MySQL 8.0. This renders reading on unnecessary unless you're constrained to using an earlier version.


UPDATE 1: Have now made this into a blog post: http://stevettt.blogspot.co.uk/2018/02/a-mysql-regular-expression-replace.html


The following expands upon the function provided by Rasika Godawatte but trawls through all necessary substrings rather than just testing single characters:

-- ------------------------------------------------------------------------------------
-- USAGE
-- ------------------------------------------------------------------------------------
-- SELECT reg_replace(<subject>,
--                    <pattern>,
--                    <replacement>,
--                    <greedy>,
--                    <minMatchLen>,
--                    <maxMatchLen>);
-- where:
-- <subject> is the string to look in for doing the replacements
-- <pattern> is the regular expression to match against
-- <replacement> is the replacement string
-- <greedy> is TRUE for greedy matching or FALSE for non-greedy matching
-- <minMatchLen> specifies the minimum match length
-- <maxMatchLen> specifies the maximum match length
-- (minMatchLen and maxMatchLen are used to improve efficiency but are
--  optional and can be set to 0 or NULL if not known/required)
-- Example:
-- SELECT reg_replace(txt, '^[Tt][^ ]* ', 'a', TRUE, 2, 0) FROM tbl;
DROP FUNCTION IF EXISTS reg_replace;
DELIMITER //
CREATE FUNCTION reg_replace(subject VARCHAR(21845), pattern VARCHAR(21845),
  replacement VARCHAR(21845), greedy BOOLEAN, minMatchLen INT, maxMatchLen INT)
RETURNS VARCHAR(21845) DETERMINISTIC BEGIN 
  DECLARE result, subStr, usePattern VARCHAR(21845); 
  DECLARE startPos, prevStartPos, startInc, len, lenInc INT;
  IF subject REGEXP pattern THEN
    SET result = '';
    -- Sanitize input parameter values
    SET minMatchLen = IF(minMatchLen < 1, 1, minMatchLen);
    SET maxMatchLen = IF(maxMatchLen < 1 OR maxMatchLen > CHAR_LENGTH(subject),
                         CHAR_LENGTH(subject), maxMatchLen);
    -- Set the pattern to use to match an entire string rather than part of a string
    SET usePattern = IF (LEFT(pattern, 1) = '^', pattern, CONCAT('^', pattern));
    SET usePattern = IF (RIGHT(pattern, 1) = '$', usePattern, CONCAT(usePattern, '$'));
    -- Set start position to 1 if pattern starts with ^ or doesn't end with $.
    IF LEFT(pattern, 1) = '^' OR RIGHT(pattern, 1) <> '$' THEN
      SET startPos = 1, startInc = 1;
    -- Otherwise (i.e. pattern ends with $ but doesn't start with ^): Set start pos
    -- to the min or max match length from the end (depending on "greedy" flag).
    ELSEIF greedy THEN
      SET startPos = CHAR_LENGTH(subject) - maxMatchLen + 1, startInc = 1;
    ELSE
      SET startPos = CHAR_LENGTH(subject) - minMatchLen + 1, startInc = -1;
    END IF;
    WHILE startPos >= 1 AND startPos <= CHAR_LENGTH(subject)
      AND startPos + minMatchLen - 1 <= CHAR_LENGTH(subject)
      AND !(LEFT(pattern, 1) = '^' AND startPos <> 1)
      AND !(RIGHT(pattern, 1) = '$'
            AND startPos + maxMatchLen - 1 < CHAR_LENGTH(subject)) DO
      -- Set start length to maximum if matching greedily or pattern ends with $.
      -- Otherwise set starting length to the minimum match length.
      IF greedy OR RIGHT(pattern, 1) = '$' THEN
        SET len = LEAST(CHAR_LENGTH(subject) - startPos + 1, maxMatchLen), lenInc = -1;
      ELSE
        SET len = minMatchLen, lenInc = 1;
      END IF;
      SET prevStartPos = startPos;
      lenLoop: WHILE len >= 1 AND len <= maxMatchLen
                 AND startPos + len - 1 <= CHAR_LENGTH(subject)
                 AND !(RIGHT(pattern, 1) = '$' 
                       AND startPos + len - 1 <> CHAR_LENGTH(subject)) DO
        SET subStr = SUBSTRING(subject, startPos, len);
        IF subStr REGEXP usePattern THEN
          SET result = IF(startInc = 1,
                          CONCAT(result, replacement), CONCAT(replacement, result));
          SET startPos = startPos + startInc * len;
          LEAVE lenLoop;
        END IF;
        SET len = len + lenInc;
      END WHILE;
      IF (startPos = prevStartPos) THEN
        SET result = IF(startInc = 1, CONCAT(result, SUBSTRING(subject, startPos, 1)),
                        CONCAT(SUBSTRING(subject, startPos, 1), result));
        SET startPos = startPos + startInc;
      END IF;
    END WHILE;
    IF startInc = 1 AND startPos <= CHAR_LENGTH(subject) THEN
      SET result = CONCAT(result, RIGHT(subject, CHAR_LENGTH(subject) + 1 - startPos));
    ELSEIF startInc = -1 AND startPos >= 1 THEN
      SET result = CONCAT(LEFT(subject, startPos), result);
    END IF;
  ELSE
    SET result = subject;
  END IF;
  RETURN result;
END//
DELIMITER ;

Demo

Rextester Demo

Limitations

  1. This method is of course going to take a while when the subject string is large. Update: Have now added minimum and maximum match length parameters for improved efficiency when these are known (zero = unknown/unlimited).
  2. It won't allow substitution of backreferences (e.g. \1, \2 etc.) to replace capturing groups. If this functionality is needed, please see this answer which attempts to provide a workaround by updating the function to allow a secondary find and replace within each found match (at the expense of increased complexity).
  3. If ^and/or $ is used in the pattern, they must be at the very start and very end respectively - e.g. patterns such as (^start|end$) are not supported.
  4. There is a "greedy" flag to specify whether the overall matching should be greedy or non-greedy. Combining greedy and lazy matching within a single regular expression (e.g. a.*?b.*) is not supported.

Usage Examples

The function has been used to answer the following StackOverflow questions:

@OscarR 2018-08-03 11:47:16

Wow! Is perfect!

@Lukasz Szozda 2018-04-19 16:07:04

MySQL 8.0+ you could use natively REGEXP_REPLACE.

12.5.2 Regular Expressions:

REGEXP_REPLACE(expr, pat, repl[, pos[, occurrence[, match_type]]])

Replaces occurrences in the string expr that match the regular expression specified by the pattern pat with the replacement string repl, and returns the resulting string. If expr, pat, or repl is NULL, the return value is NULL.

and Regular expression support:

Previously, MySQL used the Henry Spencer regular expression library to support regular expression operators (REGEXP, RLIKE).

Regular expression support has been reimplemented using International Components for Unicode (ICU), which provides full Unicode support and is multibyte safe. The REGEXP_LIKE() function performs regular expression matching in the manner of the REGEXP and RLIKE operators, which now are synonyms for that function. In addition, the REGEXP_INSTR(), REGEXP_REPLACE(), and REGEXP_SUBSTR() functions are available to find match positions and perform substring substitution and extraction, respectively.

SELECT REGEXP_REPLACE('Stackoverflow','[A-Zf]','-',1,0,'c'); 
-- Output:
-tackover-low

DBFiddle Demo

@Jeremy Stein 2009-06-12 14:16:44

No.

But if you have access to your server, you could use a user defined function (UDF) like mysql-udf-regexp.

EDIT: MySQL 8.0+ you could use natively REGEXP_REPLACE. More in answer above

@Piskvor 2009-06-12 15:34:28

REGEXP_REPLACE as a User Defined Function? Looks promising, will look into it. Thanks!

@Lathan 2010-07-26 14:08:27

Mysql does not have that feature built-in. I'm told that Oracle has that (no help for you though)

@lkraav 2012-02-20 01:44:19

Unfortunately mysql-udf-regexp doesn't seem to have support for multibyte characters. regexp_replace('äöõü', 'ä', '') returns a long numeric string instead of real text.

@Earth Engine 2013-03-01 00:48:38

@lkraav Really? that thing sounds like a security leak since that "long numeric string" looks like some buffer overrun.

@lkraav 2013-03-01 00:51:03

Well this was a year ago, but I'm pretty sure I identified the issue correctly at the time. No idea if they've made any progress, since I haven't touched this tool since.

@Brad 2013-03-20 20:53:59

MySQL itself does not support multi-byte characters with its RegEx features.

@Jonathan 2013-12-05 23:58:44

Windows users: The UDF Library linked here doesn't seem to have good windows support. The windows installation method outlined did not work well for me.

@gillyspy 2014-02-09 19:46:50

@lkraav you should try out the lib_mysqludf_preg library below as it works great. This the verbose version as it returns a blob by default and I don't know if you have a multibyte charset as your default: select cast( T.R as char) COLLATE utf8_unicode_ci from (select preg_replace('/ä/', '', 'öõüä') R ) T

@Ryan Ward 2012-02-26 19:52:15

My brute force method to get this to work was just:

  1. Dump the table - mysqldump -u user -p database table > dump.sql
  2. Find and replace a couple patterns - find /path/to/dump.sql -type f -exec sed -i 's/old_string/new_string/g' {} \;, There are obviously other perl regeular expressions you could perform on the file as well.
  3. Import the table - mysqlimport -u user -p database table < dump.sql

If you want to make sure the string isn't elsewhere in your dataset, run a few regular expressions to make sure they all occur in a similar environment. It's also not that tough to create a backup before you run a replace, in case you accidentally destroy something that loses depth of information.

@Piskvor 2012-02-27 05:33:13

Okay, that should work, too; I didn't consider an offline replace. Nice out-of-the-box thinking there!

@speshak 2012-03-23 16:17:00

Seems strange to me that you'd use find like that, I would shorten the command to sed -i 's/old_string/new_string/g' /path/to/dump.sql

@Moshe L 2012-05-04 09:28:26

can work if the the replace can't broke the SQL itself.

@Raul Luna 2014-05-15 15:50:02

Very risky, and unpractical with big data sets, or with referential integrity in place: for remove the data and then insert it again you will have to turn referential integrity off, leaving in practice your database off also.

@eggmatters 2015-06-09 16:56:10

Having used this method in the past, I aggre with Raul, this is very risky. You need to be absolutely certain as well, that your string is not elswhere in your dataset.

@Ryan Ward 2018-04-19 22:38:08

Years late to the answer @speshak but the reason I chose to access the file like this was because I originally very nervous for the same reasons as mentioned above. At the time it seemed like separating the "find the file" part from the "replace" part would make the code easier to read before I submitted it

@CharlesM 2018-08-23 09:36:03

One more confirmation: I enjoyed this method, until I started to get errors about the number of columns not maching the expected value ( = broken the whole db / table).

@rasika godawatte 2011-06-02 15:16:20

I recently wrote a MySQL function to replace strings using regular expressions. You could find my post at the following location:

http://techras.wordpress.com/2011/06/02/regex-replace-for-mysql/

Here is the function code:

DELIMITER $$

CREATE FUNCTION  `regex_replace`(pattern VARCHAR(1000),replacement VARCHAR(1000),original VARCHAR(1000))
RETURNS VARCHAR(1000)
DETERMINISTIC
BEGIN 
 DECLARE temp VARCHAR(1000); 
 DECLARE ch VARCHAR(1); 
 DECLARE i INT;
 SET i = 1;
 SET temp = '';
 IF original REGEXP pattern THEN 
  loop_label: LOOP 
   IF i>CHAR_LENGTH(original) THEN
    LEAVE loop_label;  
   END IF;
   SET ch = SUBSTRING(original,i,1);
   IF NOT ch REGEXP pattern THEN
    SET temp = CONCAT(temp,ch);
   ELSE
    SET temp = CONCAT(temp,replacement);
   END IF;
   SET i=i+1;
  END LOOP;
 ELSE
  SET temp = original;
 END IF;
 RETURN temp;
END$$

DELIMITER ;

Example execution:

mysql> select regex_replace('[^a-zA-Z0-9\-]','','2my test3_text-to. check \\ my- sql (regular) ,expressions ._,');

@Jay Taylor 2012-01-05 21:24:12

It also only works on single characters..

@Jason 2012-02-06 23:15:25

I'll just reinforce the above point: this function replaces characters that match a single-character expression. It says above that it is used "to repalce strings using regular expressions", and that can be a little misleading. It does its job, but it's not the job being asked for. (Not a complaint - it is just to save leading people down the wrong path)

@phobie 2015-11-17 09:38:36

It would be more helpful to actually include code in you answer instead of posting a naked link.

@Izzy 2016-04-02 18:33:58

Nice – but unfortunately doesn't deal with references like select regex_replace('.*(abc).*','\1','noabcde') (returns 'noabcde', not 'abc').

@Izzy 2016-04-02 18:35:08

@phobie someone else did that in this answer – just as a reference in case the link dies ;)

@Steve Chambers 2016-08-01 16:02:47

I've modified this method to attempt to address some of the limitations mentioned above and more. Please see this answer.

@Benvorth 2014-10-03 13:11:16

Use MariaDB instead. It has a function

REGEXP_REPLACE(col, regexp, replace)

See MariaDB docs and PCRE Regular expression enhancements

Note that you can use regexp grouping as well (I found that very useful):

SELECT REGEXP_REPLACE("stackoverflow", "(stack)(over)(flow)", '\\2 - \\1 - \\3')

returns

over - stack - flow

@Piskvor 2014-10-03 14:09:03

Niiiice! Even more so because we have already migrated to it for unrelated reasons. Thanks for the tip :)

@Nick 2014-10-07 17:02:43

this is from mariadb 10

@Jeff Widman 2014-12-16 07:29:30

Benni--thanks for pointing this out, but I'm a little confused on the actual implementation... mind chiming in on my question over here? stackoverflow.com/questions/27498929/… cc @Piskvor

@Josiah 2016-08-11 12:01:27

For the next time I need it, here's syntax for changing a whole column: UPDATE table SET Name = REGEXP_REPLACE(Name, "-2$", "\\1") This removes -2 from abcxyz-2 from a whole column at once.

@David Baucum 2017-11-29 22:15:03

Changing an entire platform is hardly a realistic solution.

@Benvorth 2017-11-30 06:15:49

@DavidBaucum MariaDB is a drop-in replacement for MySQL. So it is no "change of platform" but more like choosing a different airline for the same trip

@Lukasz Szozda 2018-04-19 16:10:08

@Jay Patel 2014-12-19 05:07:35

we solve this problem without using regex this query replace only exact match string.

update employee set
employee_firstname = 
trim(REPLACE(concat(" ",employee_firstname," "),' jay ',' abc '))

Example:

emp_id employee_firstname

1 jay

2 jay ajay

3 jay

After executing query result:

emp_id employee_firstname

1 abc

2 abc ajay

3 abc

@James Drummond 2015-12-25 13:53:04

I have no idea why this answer had no votes but this is works perfectly.

@codecowboy 2016-03-04 12:31:54

@yellowmelon what are the two pairs of double quotes for?

@Slam 2016-04-21 20:29:12

He's padding the employeename with spaces before and after. This allows him to search-replace for (space)employeename(space), which avoids catching the employeename "jay" if its part of a larger string "ajay." Then he trims the spaces out when done.

@Sean the Bean 2017-10-05 13:32:41

Looks like a pretty solid workaround for word replacement!

@user3796869 2014-12-01 06:37:45

We can use IF condition in SELECT query as below:

Suppose that for anything with "ABC","ABC1","ABC2","ABC3",..., we want to replace with "ABC" then using REGEXP and IF() condition in the SELECT query, we can achieve this.

Syntax:

SELECT IF(column_name REGEXP 'ABC[0-9]$','ABC',column_name)
FROM table1 
WHERE column_name LIKE 'ABC%';

Example:

SELECT IF('ABC1' REGEXP 'ABC[0-9]$','ABC','ABC1');

@Piskvor 2014-12-01 08:38:48

Hello, thank you for the suggestion. I have been trying something similar, but the performance on my data sets has been unsatisfactory. For smallish sets, this may be viable.

@dotancohen 2013-11-13 14:51:37

I'm happy to report that since this question was asked, now there is a satisfactory answer! Take a look at this terrific package:

https://github.com/mysqludf/lib_mysqludf_preg

Sample SQL:

SELECT PREG_REPLACE('/(.*?)(fox)/' , 'dog' , 'the quick brown fox' ) AS demo;

I found the package from this blog post as linked on this question.

@codecowboy 2016-03-04 12:29:07

how would you update a value in a table?

@kidata 2016-10-18 09:31:45

@dotancohen that would be also my question!

@Eddie B 2012-09-28 03:09:29

You 'can' do it ... but it's not very wise ... this is about as daring as I'll try ... as far as full RegEx support your much better off using perl or the like.

UPDATE db.tbl
SET column = 
CASE 
WHEN column REGEXP '[[:<:]]WORD_TO_REPLACE[[:>:]]' 
THEN REPLACE(column,'WORD_TO_REPLACE','REPLACEMENT')
END 
WHERE column REGEXP '[[:<:]]WORD_TO_REPLACE[[:>:]]'

@Ryan Shillington 2012-10-03 17:14:00

No, that won't work. Imagine your column contains 'asdfWORD_TO_REPLACE WORD_TO_REPLACE". Your method would result in 'asdfREPLACEMENT REPLACEMENT" where the correct answer would be "asdfWORD_TO_REPLACE REPLACEMENT".

@Eddie B 2012-10-03 17:21:59

@Ryan ... that's exactly why I stated that it wasn't very wise ... in the use case you provide this would most definitely fail. In short it's a bad idea to use 'regex-like' structure. Even worse ... if you drop the where clause all your values will be NULL ...

@Eddie B 2012-10-10 23:33:14

Actually Ryan in this case you're incorrect as the markers will only find matches for the zero-length word 'boundaries' so only words with boundaries before and after the word would match ... It's still a bad idea though ...

@Eddie B 2013-03-05 01:35:29

Err ... I'm losing some points here ... Please explain why I'm getting downvoted when I clearly stated this as "NOT VERY WISE"

@Ryan Shillington 2013-04-29 17:46:27

I didn't down-vote you, but any answer that says "This is a terrible idea and you definitely should not do this, but here it is: ..." is probably going to get down voted :-).

@Eddie B 2013-04-29 18:16:10

@RyanShillington "Sigh ... It would have wiser to simply not answer this one :-)"

@jmilloy 2013-09-26 14:32:19

I guess there's a difference between "not wise" and "incorrect".

@Eddie B 2013-10-04 18:55:31

@jmilloy ;-) As such with 'Should and Shall...'

Related Questions

Sponsored Content

46 Answered Questions

[SOLVED] How to replace all occurrences of a string in JavaScript

15 Answered Questions

[SOLVED] How do you access the matched groups in a JavaScript regular expression?

  • 2009-01-11 07:21:20
  • nickf
  • 628194 View
  • 1129 Score
  • 15 Answer
  • Tags:   javascript regex

15 Answered Questions

[SOLVED] How to get a list of MySQL user accounts

  • 2009-07-16 03:23:53
  • burntsugar
  • 1384712 View
  • 1269 Score
  • 15 Answer
  • Tags:   mysql mysql5

77 Answered Questions

[SOLVED] How to validate an email address in JavaScript?

35 Answered Questions

[SOLVED] Should I use the datetime or timestamp data type in MySQL?

70 Answered Questions

28 Answered Questions

[SOLVED] How can I prevent SQL injection in PHP?

27 Answered Questions

18 Answered Questions

[SOLVED] How do you use a variable in a regular expression?

  • 2009-01-30 00:11:05
  • JC Grubbs
  • 577909 View
  • 1079 Score
  • 18 Answer
  • Tags:   javascript regex

7 Answered Questions

[SOLVED] Is there a regular expression to detect a valid regular expression?

  • 2008-10-05 17:07:35
  • psytek
  • 98987 View
  • 647 Score
  • 7 Answer
  • Tags:   regex

Sponsored Content