By Docstero


2010-05-29 09:53:10 8 Comments

Is there a function in PHP that can decode Unicode escape sequences like "\u00ed" to "í" and all other similar occurrences?

I found a similar question here, but it doesn't seem to work.

7 comments

@Rabin Lama Dong 2017-07-06 11:11:37

PHP 7+

As of PHP 7, you can use the Unicode codepoint escape syntax to do this.

echo "\u{00ed}"; outputs í.
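One caveat worth illustrating: the `\u{...}` syntax is resolved by the PHP parser, so it applies only to double-quoted (or heredoc) string literals in source code, not to escape sequences that arrive in data at runtime. The sketch below contrasts the two cases; the variable names are illustrative only.

```php
<?php
// Literal in source code: the parser converts the escape to UTF-8 bytes.
$literal = "\u{00ed}";

// Runtime data (simulated with single quotes): the six characters
// backslash, u, 0, 0, e, d stay exactly as they are.
$runtime = '\u00ed';

var_dump($literal === "\xC3\xAD"); // bool(true): UTF-8 encoding of í
var_dump(strlen($runtime));        // int(6): still an undecoded escape
```

For strings read from a file, a database, or a request, one of the decoding approaches in the other answers is still needed.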

@Gus 2018-11-06 18:56:12

Thanks! Much simpler than the other answers

@orel 2017-01-24 23:04:10

This fixes JSON values whose Unicode escapes have lost their backslash: it adds \ before each uXXXX inside the quoted values.

$item = preg_replace_callback('/"(.+?)":"(u.+?)",/', function ($matches) {
    $matches[2] = preg_replace('/(u)/', '\u', $matches[2]);
    $matches[2] = preg_replace('/(")/', '"', $matches[2]);
    $matches[2] = json_decode('"' . $matches[2] . '"');
    return '"' . $matches[1] . '":"' . $matches[2] . '",';
}, $item);

@Nemo Noman 2015-03-06 13:41:16

This is a sledgehammer approach to replacing raw UNICODE with HTML. I haven't seen any other place to put this solution, but I assume others have had this problem.

Apply this str_replace function to the RAW JSON, before doing anything else.

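function unicode2html($str){
    $i=65535;
    while($i>0){
        // Zero-pad to four hex digits so that 237 becomes \u00ed, not \ued
        $hex=sprintf('%04x',$i);
        // Case-insensitive, because escapes may use upper-case hex digits
        $str=str_ireplace("\\u$hex","&#$i;",$str);
        $i--;
    }
    return $str;
}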

This won't take as long as you think, and it will replace any \uXXXX escape (code points up to U+FFFF) with the corresponding HTML entity.

Of course this can be reduced if you know the unicode types that are being returned in the JSON.

For example, my code was getting lots of arrows and dingbat Unicode. These are between 8448 and 11263. So my production code looks like:

$i=11263;
while($i>8448){
    ...etc...

You can look up the blocks of Unicode by type here: http://unicode-table.com/en/ If you know you're translating Arabic or Telugu or whatever, you can just replace those codes, not all 65,535.

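You could apply this same sledgehammer to plain decoding into UTF-8. Note that chr() only produces a single byte, so use mb_chr() (PHP 7.2+) for code points above 127:

 $str=str_ireplace("\\u$hex",mb_chr($i,'UTF-8'),$str);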
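As a lighter alternative to looping over every code point, a single regular-expression pass can convert each escape it actually finds. This is only a sketch: the function name `unicode2html_regex` is hypothetical, and surrogate pairs (for example emoji) would come out as two separate entities, not one.

```php
<?php
// Hypothetical helper: convert every \uXXXX escape to a decimal HTML entity.
function unicode2html_regex(string $str): string
{
    // The pattern matches a literal backslash, "u", and four hex digits.
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($m) {
        // hexdec() turns the captured hex digits into the code point number.
        return '&#' . hexdec($m[1]) . ';';
    }, $str);
}

echo unicode2html_regex('\u2192 next'); // &#8594; next
```

Because the work is proportional to the length of the input and not to the size of the code point range, there is no need to narrow the range by Unicode block.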

@masakielastic 2015-01-15 23:49:08

$str = '\u0063\u0061\u0074'.'\ud83d\ude38';
$str2 = '\u0063\u0061\u0074'.'\ud83d';

// U+1F638
var_dump(
    "cat\xF0\x9F\x98\xB8" === escape_sequence_decode($str),
    "cat\xEF\xBF\xBD" === escape_sequence_decode($str2)
);

function escape_sequence_decode($str) {

    // [U+D800 - U+DBFF][U+DC00 - U+DFFF]|[U+0000 - U+FFFF]
    $regex = '/\\\u([dD][89abAB][\da-fA-F]{2})\\\u([dD][c-fC-F][\da-fA-F]{2})
              |\\\u([\da-fA-F]{4})/sx';

    return preg_replace_callback($regex, function($matches) {

        if (isset($matches[3])) {
            $cp = hexdec($matches[3]);
        } else {
            $lead = hexdec($matches[1]);
            $trail = hexdec($matches[2]);

            // http://unicode.org/faq/utf_bom.html#utf16-4
            $cp = ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00;
        }

        // https://tools.ietf.org/html/rfc3629#section-3
        // Characters between U+D800 and U+DFFF are not allowed in UTF-8
        if ($cp > 0xD7FF && 0xE000 > $cp) {
            $cp = 0xFFFD;
        }

        // https://github.com/php/php-src/blob/php-5.6.4/ext/standard/html.c#L471
        // php_utf32_utf8(unsigned char *buf, unsigned k)

        if ($cp < 0x80) {
            return chr($cp);
        } else if ($cp < 0xA0) {
            return chr(0xC0 | $cp >> 6).chr(0x80 | $cp & 0x3F);
        }

        return html_entity_decode('&#'.$cp.';');
    }, $str);
}

@Gumbo 2010-05-29 10:06:35

Try this:

$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}, $str);

In case it's UTF-16 based C/C++/Java/Json-style:

$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UTF-16BE');
}, $str);

@Docstero 2010-05-29 10:31:07

Where do I put "\u00ed"?

@Gumbo 2010-05-29 10:42:21

@Docstero: The regular expression will match any sequence of \u followed by four hexadecimal digits.

@Docstero 2010-05-29 10:48:05

Warning: preg_replace_callback() [function.preg-replace-callback]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1

@Artefacto 2011-11-18 10:45:20

This function cannot deal with supplementary characters as they cannot be represented in UCS-2.
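One possible way around this limitation, sketched under the assumption that the mbstring extension is available, is to match a whole run of consecutive escapes and convert the run as UTF-16BE, so that a surrogate pair is decoded as a unit. The function name `unicode_unescape_utf16` is hypothetical.

```php
<?php
// Hypothetical helper: decode runs of \uXXXX escapes, keeping surrogate pairs together.
function unicode_unescape_utf16(string $str): string
{
    // Match one or more consecutive \uXXXX escapes as a single run.
    return preg_replace_callback('/(?:\\\\u[0-9a-fA-F]{4})+/', function ($match) {
        // Drop the "\u" prefixes, leaving a contiguous string of hex digits.
        $hex = str_replace('\\u', '', $match[0]);
        // Pack the hex digits into bytes and reinterpret them as UTF-16BE.
        return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
    }, $str);
}

var_dump(unicode_unescape_utf16('\ud83d\ude0a') === "\xF0\x9F\x98\x8A"); // bool(true)
```

An unpaired surrogate is still invalid input; how mbstring substitutes it depends on its configuration.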

@MrFusion 2013-01-09 22:27:29

I wrapped this into a one-parameter function to make it more convenient:

<?php
function unicode_js_unescape($str) {
    return preg_replace_callback('/\\\\u([0-9a-f]{4})/i', function($match) {
        return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
    }, $str);
}
?>

@DougW 2013-02-01 19:16:36

@MrFusion - Just FYI, since a lot of people may be interested in using this to fix up json_encode output from before the JSON_UNESCAPED_UNICODE option became available in 5.4: your anonymous function will only work in 5.3+. So there's a pretty small window of versions where it would work and be useful for that specific problem.

@DougW 2013-02-01 19:22:58

You could of course use 'create_function', but that would be using eval, which I'm sure nobody here would ever do.

@Muhammad Babar 2014-08-17 08:03:38

Gumbo, you are just great. I have been struggling with this problem for hours.

@Erroid 2015-04-17 01:36:30

This is nice, but on older PHP I get a T_FUNCTION error because of the function inside the function call. Is there a way to fix it?

@Demodave 2015-05-18 14:46:57

@gumbo How do you call or use this function?

@ChristoKiwi 2016-10-13 04:11:03

This helps so much! A shame it doesn't capture supplementary characters with the new iOS 10 emoticons, but it's damn close!

@Nico Westerdale 2017-07-03 19:14:54

The json_decode approach below works far better: it is clear, concise, and fast.

@Tom Andersen 2017-09-26 00:32:49

I found my way here because I had \u00ed in my output. It turned out I was viewing the output through json_encode(), which by default escapes non-ASCII characters as \uXXXX, so use json_encode($theDict, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);
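To make the difference concrete, here is a minimal sketch comparing the default json_encode() output with the JSON_UNESCAPED_UNICODE variant (available since PHP 5.4). The array contents are made up for illustration.

```php
<?php
// "\xC3\xAD" is the UTF-8 byte sequence for í, spelled out so the
// example does not depend on the encoding of the source file.
$data = ['t' => "\xC3\xAD"];

$escaped  = json_encode($data);                         // {"t":"\u00ed"}
$readable = json_encode($data, JSON_UNESCAPED_UNICODE); // {"t":"í"}

echo $escaped, "\n", $readable, "\n";
```

Both strings are valid JSON and decode to the same value; the flag only changes how the characters are written.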

@jianyong 2014-05-10 04:11:08

There is also a solution:
http://www.welefen.com/php-unicode-to-utf8.html

function entity2utf8onechar($unicode_c){
    $unicode_c_val = intval($unicode_c);
    $f=0x80; // 10000000
    $str = "";
    // U-00000000 - U-0000007F:   0xxxxxxx
    if($unicode_c_val <= 0x7F){
        $str = chr($unicode_c_val);
    }
    //U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
    else if($unicode_c_val >= 0x80 && $unicode_c_val <= 0x7FF){
        $h=0xC0; // 11000000
        $c1 = $unicode_c_val >> 6 | $h;
        $c2 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2);
    }
    //U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x800 && $unicode_c_val <= 0xFFFF){
        $h=0xE0; // 11100000
        $c1 = $unicode_c_val >> 12 | $h;
        $c2 = (($unicode_c_val & 0xFC0) >> 6) | $f;
        $c3 = ($unicode_c_val & 0x3F) | $f;
        $str=chr($c1).chr($c2).chr($c3);
    }
    //U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x10000 && $unicode_c_val <= 0x1FFFFF){
        $h=0xF0; // 11110000
        $c1 = $unicode_c_val >> 18 | $h;
        $c2 = (($unicode_c_val & 0x3F000) >>12) | $f;
        $c3 = (($unicode_c_val & 0xFC0) >>6) | $f;
        $c4 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4);
    }
    //U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x200000 && $unicode_c_val <= 0x3FFFFFF){
        $h=0xF8; // 11111000
        $c1 = $unicode_c_val >> 24 | $h;
        $c2 = (($unicode_c_val & 0xFC0000)>>18) | $f;
        $c3 = (($unicode_c_val & 0x3F000) >>12) | $f;
        $c4 = (($unicode_c_val & 0xFC0) >>6) | $f;
        $c5 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4).chr($c5);
    }
    //U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x4000000 && $unicode_c_val <= 0x7FFFFFFF){
        $h=0xFC; // 11111100
        $c1 = $unicode_c_val >> 30 | $h;
        $c2 = (($unicode_c_val & 0x3F000000)>>24) | $f;
        $c3 = (($unicode_c_val & 0xFC0000)>>18) | $f;
        $c4 = (($unicode_c_val & 0x3F000) >>12) | $f;
        $c5 = (($unicode_c_val & 0xFC0) >>6) | $f;
        $c6 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4).chr($c5).chr($c6);
    }
    return $str;
}
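function entities2utf8($unicode_c){
    // The /e modifier was removed in PHP 7, so use a callback instead.
    // The captured value is passed through intval(), so match decimal digits only.
    $unicode_c = preg_replace_callback('/&#(\d+);/', function($m){
        return entity2utf8onechar($m[1]);
    }, $unicode_c);
    return $unicode_c;
}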

@2BJ 2011-11-02 13:48:55

print_r(json_decode('{"t":"\u00ed"}')); // -> stdClass Object ( [t] => í )

@deceze 2013-05-15 12:15:01

It doesn't even need the object wrapper: json_decode('"' . $text . '"')

@T.Todua 2016-11-25 08:36:45

Thanks. This seems to be the STANDARD WAY, rather than the accepted answer.

@DynamicDan 2017-10-23 05:38:52

Interestingly, this also works for complex entities like smiley faces... json_decode('{"t":"\uD83D\uDE0A"}') is 😊

@Yvan 2018-10-24 04:12:34

@deceze you should include the fact that $text can include double quotes. So a revised version would be: json_decode('"'.str_replace('"', '\\"', $text).'"'). Thanks for your help :-)
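Putting the two comments above together, a small wrapper might look like the sketch below. The function name `json_unescape` is hypothetical. It assumes the input contains no raw control characters and no double quotes that are already backslash-escaped; in either case json_decode() fails and the helper returns null.

```php
<?php
// Hypothetical helper: decode \uXXXX escapes by treating the text as a JSON string.
function json_unescape(string $text): ?string
{
    // Escape bare double quotes so the text can sit inside a JSON string literal.
    $json = '"' . str_replace('"', '\\"', $text) . '"';
    $decoded = json_decode($json);
    // json_decode() returns null when the input is not valid JSON.
    return is_string($decoded) ? $decoded : null;
}

var_dump(json_unescape('\u00ed') === "\xC3\xAD"); // bool(true)
```

Note that json_decode() also interprets the other JSON escapes, such as \n and \\, so this is only appropriate when the input really follows JSON string syntax.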
