By hamdiakoguz


2008-11-18 22:49:27 8 Comments

I have a Unicode string like "Tanım" which is somehow encoded as "Tan%u0131m". How can I convert this encoded string back to the original Unicode? Apparently urllib.unquote does not support Unicode.

5 comments

@Martijn Pieters 2019-03-07 14:41:41

You have a URL using a non-standard encoding scheme, rejected by standards bodies but still being produced by some encoders. The Python urllib.parse.unquote() function can't handle these.

Luckily, creating your own decoder is not that hard. %uhhhh entries are meant to be UTF-16 code units here, so we need to take surrogate pairs into account. I've also seen %hh escapes mixed in, for added confusion.

With that in mind, here is a decoder which works in both Python 2 and Python 3, provided you pass in a str object in Python 3 (Python 2 cares less):

try:
    # Python 3
    from urllib.parse import unquote
    unichr = chr
except ImportError:
    # Python 2
    from urllib import unquote

def unquote_unicode(string, _cache={}):
    string = unquote(string)  # handle two-digit %hh components first
    parts = string.split(u'%u')
    if len(parts) == 1:
        return string
    r = [parts[0]]
    append = r.append
    for part in parts[1:]:
        try:
            digits = part[:4].lower()
            if len(digits) < 4:
                raise ValueError
            ch = _cache.get(digits)
            if ch is None:
                ch = _cache[digits] = unichr(int(digits, 16))
            if (
                len(r) > 1 and  # guard: nothing precedes a leading escape
                not r[-1] and
                u'\uDC00' <= ch <= u'\uDFFF' and
                u'\uD800' <= r[-2] <= u'\uDBFF'
            ):
                # UTF-16 surrogate pair, replace with single non-BMP codepoint
                r[-2] = (r[-2] + ch).encode(
                    'utf-16', 'surrogatepass').decode('utf-16')
            else:
                append(ch)
            append(part[4:])
        except ValueError:
            append(u'%u')
            append(part)
    return u''.join(r)

The function is heavily inspired by the current standard-library implementation.

Demo:

>>> print(unquote_unicode('Tan%u0131m'))
Tanım
>>> print(unquote_unicode('%u05D0%u05D9%u05DA%20%u05DE%u05DE%u05D9%u05E8%u05D9%u05DD%20%u05D0%u05EA%20%u05D4%u05D8%u05E7%u05E1%u05D8%20%u05D4%u05D6%u05D4'))
איך ממירים את הטקסט הזה
>>> print(unquote_unicode('%ud83c%udfd6'))  # surrogate pair
🏖
>>> print(unquote_unicode('%ufoobar%u666'))  # incomplete
%ufoobar%u666

The function works on Python 2 (tested on 2.4 - 2.7) and Python 3 (tested on 3.3 - 3.8).
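
As a rough illustration of what "taking surrogate pairs into account" means, here is a minimal Python 3 sketch of the arithmetic that maps a high/low UTF-16 surrogate pair to a single code point. The helper name combine_surrogates is hypothetical and is not part of the decoder above, which reaches the same result through an encode/decode round trip.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one code point (sketch)."""
    hi, lo = ord(high), ord(low)
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))

# The pair from the '%ud83c%udfd6' demo above
beach = combine_surrogates('\ud83c', '\udfd6')
```

Here beach is the single non-BMP character U+1F3D6, the same character the demo prints.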

@Jermaine 2008-12-16 03:13:58

There is a bug in the above version where it sometimes fails when the string contains both ASCII-encoded and Unicode-encoded characters. I think it's specifically when there are characters from the upper 128 range like '\xab' in addition to Unicode.

E.g. "%5B%AB%u03E1%BB%5D" causes this error.

I found that if you decode the Unicode escapes first, the problem goes away:

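from urllib import unquote  # Python 2

def unquote_u(source):
    result = source
    if '%u' in result:
        # decode the %uhhhh escapes first, then the remaining %hh escapes
        result = result.replace('%u', '\\u').decode('unicode_escape')
    result = unquote(result)
    return result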

@wberry 2011-09-20 18:05:43

\xab is not a character but a byte. In effect your example "string" contains both bytes and characters, which is not valid as a single string in any language I know of.

@Martijn Pieters 2019-03-07 15:28:51

What would "%5B%AB%u03E1%BB%5D" decode as? 0x5B 0xAB and 0xBB 0x5D are hardly valid UTF-8 sequences.

@Martijn Pieters 2019-03-07 15:29:49

@wberry: I've seen real-life cases (a Java library somewhere) that encodes some ASCII codepoints like spaces to %hh sequences, and anything over 0x7F to %uhhhh sequences. Terrible, but parsable.

@Ali Afshar 2008-11-18 23:32:49

This will do it if you absolutely have to have this (I really do agree with the cries of "non-standard"):

from urllib import unquote

def unquote_u(source):
    result = unquote(source)
    if '%u' in result:
        result = result.replace('%u','\\u').decode('unicode_escape')
    return result

print unquote_u('Tan%u0131m')

> Tanım

@Aaron Maenpaa 2008-11-18 23:44:07

A slightly pathological case, but: unquote_u('Tan%25u0131m') --> u'Tan\u0131m' rather than 'Tan%u0131m' like it should. Just a reminder of why you probably don't want to write a decoder unless you really need it.

@Ali Afshar 2008-11-18 23:48:41

I totally agree. Which is why I really was not keen to offer an actual solution. These things are never so straightforward. The O.P. might have been desperate though, and I think this complements your excellent answer.

@Martijn Pieters 2019-03-07 15:26:21

This only works for Python 2, unfortunately, which is rapidly approaching its end-of-life. The use of unicode_escape makes it a little harder to correct for Python 3 use (you'd need to encode to utf-8 first), and this version does not handle surrogate pairs. The intent of the %uhhhh escape format was to encode UTF-16 code units, so for non-BMP sequences (such as a large number of emoji) you'd get an invalid string on anything but a UCS-2 Python 2 build.
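
One way to sidestep the double-decoding problem raised in the comments above ('Tan%25u0131m') is to decode %uhhhh and %hh escapes in a single pass, so the output of one step is never re-scanned by the other. The following is only a Python 3 sketch: the name unquote_single_pass is hypothetical, it assumes each %hh escape stands for one code point below U+0100 rather than a UTF-8 byte, and it does not merge surrogate pairs.

```python
import re

# One pattern for both escape forms: group 1 is %uhhhh, group 2 is %hh.
_ESCAPE = re.compile(r'%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})')

def unquote_single_pass(text):
    """Decode %uhhhh and %hh escapes in one left-to-right pass (sketch)."""
    def replace(match):
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))
    return _ESCAPE.sub(replace, text)

plain = unquote_single_pass('Tan%u0131m')
escaped_percent = unquote_single_pass('Tan%25u0131m')
```

Because the '%' produced by %25 is never scanned again, escaped_percent keeps the literal text '%u0131' instead of turning it into 'ı'.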

@Aaron Maenpaa 2008-11-18 23:22:44

%uXXXX is a non-standard encoding scheme that has been rejected by the W3C, despite the fact that an implementation continues to live on in JavaScript land.

The more common technique seems to be to UTF-8 encode the string and then % escape the resulting bytes using %XX. This scheme is supported by urllib.unquote:

>>> urllib2.unquote("%0a")
'\n'

Unfortunately, if you really need to support %uXXXX, you will probably have to roll your own decoder. Otherwise, it is likely to be far preferable to simply UTF-8 encode your Unicode string and then % escape the resulting bytes.

A more complete example:

>>> u"Tanım"
u'Tan\u0131m'
>>> url = urllib.quote(u"Tanım".encode('utf8'))
>>> urllib.unquote(url).decode('utf8')
u'Tan\u0131m'
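
For readers on Python 3, the same round trip might look like the sketch below. It assumes only the standard library: urllib.parse.quote and urllib.parse.unquote both use UTF-8 by default, so the explicit encode/decode steps from the Python 2 example are not needed. The variable names are illustrative.

```python
from urllib.parse import quote, unquote

# quote() encodes the str as UTF-8, then percent-escapes the bytes
encoded = quote("Tanım")
# unquote() reverses both steps, decoding the bytes as UTF-8
decoded = unquote(encoded)
```

Here encoded is 'Tan%C4%B1m' (the two UTF-8 bytes of 'ı'), and decoded is the original string again.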

@jamtoday 2009-09-07 00:30:47

'urllib2.unquote' should be 'urllib.unquote'

@wberry 2011-09-20 18:13:48

Interesting that a URI is a percent-encoded byte-string, rather than a character-string.

@Francisco Costa 2014-02-21 18:49:23

@jamtoday not necessarily; in Python 2.7.5+ you can use urllib2.unquote. Just try print(dir(urllib2)).

@Emily 2017-01-25 10:23:24

urllib.unquote(url.encode('utf-8')) worked for me instead

@Akin Hwan 2019-08-01 15:10:52

Is it bad practice to do something like unquote(urlencode())?

@Markus Jarderot 2008-11-18 23:22:24

import re

def unquote(text):
    def unicode_unquoter(match):
        return unichr(int(match.group(1),16))
    return re.sub(r'%u([0-9a-fA-F]{4})',unicode_unquoter,text)

@Martijn Pieters 2019-03-07 15:25:02

This only works for Python 2, unfortunately, which is rapidly approaching its end-of-life. It's not hard to make this Python 2 and 3 compatible (try: unichr, except NameError: unichr = chr), but this version does not handle surrogate pairs. The intent of the %uhhhh escape format was to encode UTF-16 code units, so for non-BMP sequences (such as a large number of emoji) you'd get an invalid string on anything but a UCS-2 Python 2 build.
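
A possible Python 3 adaptation of the regex approach that also merges surrogate pairs is sketched below. The function name unquote_u_regex is hypothetical. It relies on the 'surrogatepass' error handler, which the utf-16 codecs accept on Python 3.4 and later, to round-trip the intermediate string through UTF-16 so that adjacent high/low surrogates become one code point. Lone surrogates are left in place rather than rejected, and %hh escapes are not handled.

```python
import re

_U_ESCAPE = re.compile(r'%u([0-9a-fA-F]{4})')

def unquote_u_regex(text):
    """Decode %uhhhh escapes and merge UTF-16 surrogate pairs (sketch)."""
    # Step 1: turn each escape into the matching UTF-16 code unit.
    raw = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16 so paired surrogates are combined.
    data = raw.encode('utf-16-le', 'surrogatepass')
    return data.decode('utf-16-le', 'surrogatepass')

word = unquote_u_regex('Tan%u0131m')
emoji = unquote_u_regex('%ud83c%udfd6')
```

With this, emoji is the single character U+1F3D6 rather than two separate surrogates.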
