By MiniQuark


2009-02-05 21:10:40 8 Comments

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the Web an elegant way to do this in Java:

  1. convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
  2. remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU or is this possible with just the python standard library? And what about python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.

8 comments

@Piotr Migdal 2018-01-30 00:27:58

gensim.utils.deaccent(text) from Gensim - topic modelling for humans:

>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
'Sef chomutovskych komunistu dostal postou bily prasek'

Another solution is unidecode.

Note that the suggested solution with unicodedata typically removes accents only from some characters (e.g. it turns 'ł' into '', rather than into 'l').
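To make the difference concrete, here is a small Python 3 sketch comparing the two unicodedata-based variants discussed on this page on a Polish word. The helper names drop_non_ascii and drop_combining_marks are made up for this illustration and do not come from any library.

```python
import unicodedata

def drop_non_ascii(text):
    # Decompose, then silently discard every non-ASCII code point.
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def drop_combining_marks(text):
    # Decompose, then discard only the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

word = "\u0142\u00f3d\u017a"  # "łódź"
print(drop_non_ascii(word))        # odz  (the "ł" is lost entirely)
print(drop_combining_marks(word))  # łodz (the "ł" survives unchanged)
```

Neither variant produces 'l', because 'ł' has no Unicode decomposition into a base letter plus a mark.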

@lcieslak 2019-06-10 08:13:39

deaccent still gives ł instead of l.

@hexaJer 2015-07-24 10:08:14

Actually, I work on a project that must be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user input.

Thanks to you, I have created this function that works wonders.

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

Result:

>>> text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
'montreal_uber_1289_mere_francoise_noel_889'
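If Python 2 support is not needed, the same idea can probably be written more compactly. The following is only a sketch for Python 3, where every str is already Unicode, so the unicode() fallback disappears; it keeps the function names from the answer above.

```python
import re
import unicodedata

def strip_accents(text):
    # NFD splits "é" into "e" + a combining mark; the ASCII round trip drops the mark.
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def text_to_id(text):
    text = strip_accents(text.lower())
    text = re.sub(r" +", "_", text)           # runs of spaces become one underscore
    return re.sub(r"[^0-9a-z_-]", "", text)   # drop everything else that is not ID-safe

print(text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889"))
# montreal_uber_1289_mere_francoise_noel_889
```

As with the original, characters without a decomposition (such as 'ł' or 'ø') are dropped rather than mapped to a base letter.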

@hexaJer 2015-07-24 10:13:12

unicode string with python3 : stackoverflow.com/a/6812069/1569144

@Daniel Reis 2016-03-18 15:56:04

With Py2.7, passing an already-unicode string raises an error at text = unicode(text, 'utf-8'). A workaround for that was to add except TypeError: pass

@Christian Oudard 2010-04-13 21:21:14

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.

Example:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga' and is of type 'str'

@Paul McMillan 2010-04-13 21:29:24

Yeah, this is a better solution than simply stripping the accents. It provides much more useful transliterations for the languages that have conventions for writing words in ASCII.

@Eric O Lebigot 2011-09-17 14:56:34

Seems to work well with Chinese, but the transformation of the French name "François" unfortunately gives "FranASSois", which is not very good, compared to the more natural "Francois".

@kolinko 2012-03-31 18:15:10

depends what you're trying to achieve. for example I'm doing a search right now, and I don't want to transliterate greek/russian/chinese, I just want to replace "ą/ę/ś/ć" with "a/e/s/c"

@Karl Bartel 2012-04-30 09:38:42

@EOL unidecode works great for strings like "François", if you pass unicode objects to it. It looks like you tried with a plain byte string.

@Mathieu 2013-03-03 06:13:12

@EOL It looks like the "C cédille" is now handled properly. So, as far as I tested unidecode, which isn't much, I now consider it gives very good results.

@Mikhail Korobov 2014-02-23 22:27:52

Note that unidecode >= 0.04.10 (Dec 2012) is GPL. Use earlier versions or check github.com/kmike/text-unidecode if you need a more permissive license and can stand a slightly worse implementation.

@chhantyal 2015-01-07 13:33:36

Doesn't seem to work with German eg. Ö => O Where it should be Oe

@Liam 2015-11-29 19:12:15

how to use it with variables?

@Antti Haapala 2016-08-20 06:07:00

@chhantyal the Ö => OE is quite German-specific. In Finnish, some words like ääliö would render completely unrecognizable aeaelioe; it is simply more correct to omit the diaeresis than to add the e, though pronunciation of the accented letter is pretty much on par with the German umlaut.
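A sketch of what a language-specific step could look like, for the case where German conventions are explicitly wanted. GERMAN_TABLE and the two function names are hypothetical; the table is a hand-written assumption and is not part of unidecode or the standard library.

```python
import unicodedata

# Hypothetical German-only table; other languages need different rules.
GERMAN_TABLE = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def strip_marks(text):
    # Generic fallback: decompose and drop combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def to_ascii_german(text):
    # Compose first (NFC) so precomposed table keys match, then apply the
    # German rules, then strip whatever marks are left.
    composed = unicodedata.normalize("NFC", text)
    return strip_marks(composed.translate(GERMAN_TABLE))

print(to_ascii_german("Öl für die Straße"))  # Oel fuer die Strasse
print(strip_marks("ääliö"))                  # aalio
print(to_ascii_german("ääliö"))              # aeaelioe
```

The last two lines show the Finnish problem described above: the German rule applied to a Finnish word gives the unrecognizable form.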

@Mark Amery 2016-09-15 13:44:47

@EOL You'll be pleased to know that in the latest version of the library, 'François' is mapped to 'Francois' as you'd expect.

@Eric Duminil 2017-04-28 12:02:31

unidecode replaces ° with deg. It does more than just removing accents.

@Drunken Master 2017-05-14 17:00:47

People need to understand that Unicode character decomposition is a language specific mapping, it does not work universally and modules like unidecode are never going to work well with ignoring the locale or language of the input. As to CJK characters, it's a childish assumption that you can take an arbitrary CJK character and 'render' it with ASCII: CJK characters can have multiple readings both in Chinese and Japanese, and the Chinese, Japanese, etc. readings are also going to be different. These modules are a waste of time.

@Mohsin 2018-07-30 11:14:56

What if I'm reading a string from a file? How do I give it as input to the library? I tried something like u+'str', but that gives me a NameError: name 'u' is not defined.
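The u prefix only applies to string literals in source code; text read from a file has to be decoded instead. A minimal sketch, assuming the file is UTF-8 encoded: read_and_strip and strip_marks are hypothetical helper names, and strip_marks is a standard-library stand-in for whichever conversion function is actually used (with unidecode, the call would be unidecode.unidecode(line) as in the answer above).

```python
import io
import unicodedata

def strip_marks(text):
    # Stand-in for the real conversion function.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def read_and_strip(path, encoding="utf-8"):
    # io.open decodes while reading, so every line is already a unicode
    # string (on Python 2 and 3 alike); no u prefix is involved.
    with io.open(path, encoding=encoding) as handle:
        return [strip_marks(line.rstrip("\n")) for line in handle]
```

Calling read_and_strip("names.txt") on a file containing the lines "Málaga" and "François" would return the unaccented lines as a list; the file name is only an example.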

@oefe 2009-02-05 22:17:22

How about this:

import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

This works on greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>> 

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".
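For readers on Python 3, the same function works unchanged without the u prefix; the session above would look roughly like this sketch:

```python
import unicodedata

def strip_accents(s):
    # Keep every code point except nonspacing marks (category "Mn").
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

print(strip_accents("A \u00c0 \u0394 \u038e"))  # A A Δ Υ
print(strip_accents("\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"))  # Ελληνικα
```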

@alexis 2012-04-07 11:25:32

These are not composed characters, unfortunately--even though "ł" is named "LATIN SMALL LETTER L WITH STROKE"! You'll either need to play games with parsing unicodedata.name, or break down and use a look-alike table-- which you'd need for Greek letters anyway (Α is just "GREEK CAPITAL LETTER ALPHA").

@alexis 2014-11-23 00:12:18

@andi, I'm afraid I can't guess what point you want to make. The email exchange reflects what I wrote above: Because the letter "ł" is not an accented letter (and is not treated as one in the Unicode standard), it does not have a decomposition.
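The point about decomposition can be checked directly with the standard library. This short sketch prints the name and canonical decomposition of a few characters; an empty decomposition string means normalization will leave the character alone.

```python
import unicodedata

# "é", "ł", "ø" and Greek capital alpha
for char in ("\u00e9", "\u0142", "\u00f8", "\u0391"):
    print(repr(char), unicodedata.name(char),
          repr(unicodedata.decomposition(char)))

# Only "é" decomposes (into U+0065 followed by U+0301); "ł", "ø" and
# Greek alpha report an empty decomposition.
```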

@lenz 2016-05-16 07:41:48

@alexis (late follow-up): This works perfectly well for Greek as well – eg. "GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA" is normalised into "GREEK CAPITAL LETTER ALPHA" just as expected. Unless you are referring to transliteration (eg. "α" → "a"), which is not the same as "removing accents"...

@alexis 2016-05-16 17:01:56

@lenz, I wasn't talking about removing accents from Greek, but about the "stroke" on the ell. Since it is not a diacritic, changing it to plain ell is the same as changing Greek Alpha to A. If you don't want it, don't do it, but in both cases you're substituting a Latin (near) look-alike.

@Art 2017-03-01 06:53:40

Mostly works nice :) But it doesn't transform ß into ascii ss, for example. I would still use unidecode to avoid accidents.

@o11c 2017-05-05 21:46:49

Should probably use .combining() to check the property directly, rather than only handling .category() == 'Mn', which will mess up
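The two checks are not interchangeable. The sketch below shows one case where they disagree (not necessarily the case the comment above had in mind): DEVANAGARI SIGN ANUSVARA is in category Mn but has combining class 0. The function names are made up for the comparison.

```python
import unicodedata

def strip_by_category(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def strip_by_combining(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

anusvara = "\u0902"  # DEVANAGARI SIGN ANUSVARA
print(unicodedata.category(anusvara), unicodedata.combining(anusvara))  # Mn 0

sample = "\u0915" + anusvara  # DEVANAGARI LETTER KA followed by the anusvara
print(strip_by_category(sample) == "\u0915")   # True: the sign is removed
print(strip_by_combining(sample) == sample)    # True: the sign is kept
```

For Latin text with ordinary accents both variants give the same result.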

@MiniQuark 2009-02-05 21:19:34

I just found this answer on the Web:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.

Edit: this does the trick:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) returns a nonzero value (the character's canonical combining class) if the character c can be combined with the preceding character, that is mainly if it's a diacritic; otherwise it returns 0.
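One side effect of choosing NFKD rather than NFD is that compatibility characters such as ligatures are expanded as well. A small Python 3 sketch of the difference; the form parameter is added here only for the comparison and is not part of the function above.

```python
import unicodedata

def remove_accents(input_str, form="NFKD"):
    decomposed = unicodedata.normalize(form, input_str)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

word = "\ufb01anc\u00e9"  # "ﬁancé", starting with the single-character "fi" ligature
print(remove_accents(word, "NFKD"))  # fiance (ligature expanded to "fi")
print(remove_accents(word, "NFD"))   # ﬁance  (ligature kept as one character)

print(unicodedata.combining("\u0301"))  # 230, COMBINING ACUTE ACCENT
print(unicodedata.combining("e"))       # 0, not a combining character
```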

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"caf\xc3\xa9"  # the UTF-8 bytes of "café" (Python 3 bytes literals must be ASCII); before python 3, simply "café"
unicode_string = byte_string.decode(encoding)
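Putting the pieces together for Python 3, one possible variant accepts either kind of input. This is a sketch, not the answer's original function: the encoding parameter is an addition, and the caller still has to know the real encoding of the bytes.

```python
import unicodedata

def remove_accents(value, encoding="utf-8"):
    # Decode byte strings first; text strings are used as they are.
    if isinstance(value, bytes):
        value = value.decode(encoding)
    nfkd_form = unicodedata.normalize("NFKD", value)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

print(remove_accents("caf\u00e9"))                 # cafe
print(remove_accents(b"caf\xc3\xa9"))              # cafe (UTF-8 bytes)
print(remove_accents(b"caf\xe9", "iso-8859-15"))   # cafe (Latin-9 bytes)
```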

@Jabba 2012-01-08 23:27:38

I had to add 'utf8' to unicode: nkfd_form = unicodedata.normalize('NFKD', unicode(input_str, 'utf8'))

@MestreLion 2012-04-17 23:15:35

@Jabba: , 'utf8' is a "safety net" needed if you are testing input in terminal (which by default does not use unicode). But usually you don't have to add it, since if you're removing accents then input_str is very likely to be utf8 already. It doesn't hurt to be safe, though.

@rbp 2013-06-09 15:40:55

>>> def remove_accents(input_str):
...     nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
...     return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])
...
>>> remove_accents('é')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in remove_accents
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

@MiniQuark 2013-06-11 10:11:42

@rbp: you should pass a unicode string to remove_accents instead of a regular string (u"é" instead of "é"). You passed a regular string to remove_accents, so when trying to convert your string to a unicode string, the default ascii encoding was used. This encoding does not support any byte whose value is >127. When you typed "é" in your shell, your O.S. encoded that, probably with UTF-8 or some Windows Code Page encoding, and that included bytes >127. I'll change my function in order to remove the conversion to unicode: it will bomb more clearly if a non-unicode string is passed.

@rbp 2013-06-12 20:59:05

@MiniQuark that worked perfectly >>> remove_accents(unicode('é'))

@s29 2018-06-08 02:38:41

This answer gave me the best result on a large data set, the only exception is "ð"- unicodedata wouldn't touch it!

@sirex 2015-07-24 11:34:02

Some languages have combining diacritics as language letters and accent diacritics to specify accent.

I think it is safer to specify explicitly which diacritics you want to strip:

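import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    # Turn the Unicode names into the actual combining characters.
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    # Recompose, so that marks which were kept rejoin their base letters.
    return unicodedata.normalize('NFC', ''.join(chars))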
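A usage sketch for Python 3, repeating the function so the example runs on its own. It shows that marks outside the list, here the diaeresis, are left in place.

```python
import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT',
                                   'COMBINING GRAVE ACCENT',
                                   'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))

print(strip_accents("S\u00e3o Paulo, caf\u00e9, na\u00efve"))
# Sao Paulo, cafe, naïve  (tilde and acute removed, diaeresis kept)

# Passing a different tuple of names changes what is stripped:
print(strip_accents("na\u00efve", accents=('COMBINING DIAERESIS',)))  # naive
```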

@aseagram 2013-06-12 15:48:48

In response to @MiniQuark's answer:

I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats. As a test, I created a test.txt file that looked like this:

Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 to get it to work (which I found in a python ticket), as well as incorporate @Jabba's comment:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

The result:

Montreal
uber
12.89
Mere
Francoise
noel
889

(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)

@MiniQuark 2013-06-12 19:52:11

remove_accents was meant to remove accents from a unicode string. In case it's passed a byte-string, it tries to convert it to a unicode string with unicode(input_str). This uses python's default encoding, which is "ascii". Since your file is encoded with UTF-8, this would fail. Lines 2 and 3 change python's default encoding to UTF-8, so then it works, as you found out. Another option is to pass remove_accents a unicode string: remove lines 2 and 3, and on the last line replace element by element.decode("utf-8"). I tested: it works. I'll update my answer to make this clearer.

@aseagram 2013-06-12 20:11:04

Nice edit, good point. (On another note: The real problem I've realised is that my data file is apparently encoded in iso-8859-1, which I can't get to work with this function, unfortunately!)

@MiniQuark 2013-06-13 07:43:28

aseagram: simply replace "utf-8" with "iso-8859-1", and it should work. If you're on windows, then you should probably use "cp1252" instead.
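On Python 3 the decoding can be left to open(), which avoids both the setdefaultencoding hack and the manual decode call. A sketch under the assumption that the encoding of the file is known; read_unaccented_rows is a hypothetical helper name.

```python
import csv
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

def read_unaccented_rows(path, encoding="utf-8"):
    # open() decodes the file, so csv.reader already yields text strings.
    # newline="" is what the csv module documentation recommends.
    with open(path, encoding=encoding, newline="") as handle:
        # skipinitialspace drops the blank after each comma in the sample file.
        reader = csv.reader(handle, skipinitialspace=True)
        return [[remove_accents(cell) for cell in row] for row in reader]
```

For the iso-8859-1 file mentioned above, the call would be read_unaccented_rows('test.txt', encoding='iso-8859-1').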

@PM 2Ring 2018-05-16 13:13:19

BTW, reload(sys); sys.setdefaultencoding("utf-8") is a dubious hack sometimes recommended for Windows systems; see stackoverflow.com/questions/28657010/… for details.

@lenz 2013-03-21 12:39:18

This handles not only accents, but also "strokes" (as in ø etc.):

import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(unicode(char))
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
    return ud.lookup(desc)

This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is very elegant.

There are still special letters that are not handled by this, such as turned and inverted letters, since their unicode name does not contain 'WITH'. It depends on what you want to do anyway. I sometimes needed accent stripping for achieving dictionary sort order.

@janek37 2015-07-09 09:45:41

You should catch the exception if the new symbol doesn't exist. For example there's SQUARE WITH VERTICAL FILL ▥, but there's no SQUARE. (not to mention that this code transforms UMBRELLA WITH RAIN DROPS ☔ into UMBRELLA ☂).
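A possible Python 3 revision that takes these objections into account. This is a sketch rather than the original answer: restricting the lookup to letters and caching the results are added assumptions.

```python
import functools
import unicodedata as ud

@functools.lru_cache(maxsize=None)
def rmdiacritics(char):
    """Return the base letter of char, or char itself if there is none."""
    # Only touch letters, so symbols such as UMBRELLA WITH RAIN DROPS stay as they are.
    if not ud.category(char).startswith("L"):
        return char
    try:
        desc = ud.name(char)
    except ValueError:       # code point without a name
        return char
    base_name, sep, _ = desc.partition(" WITH ")
    if not sep:              # no " WITH " in the name, nothing to remove
        return char
    try:
        return ud.lookup(base_name)
    except KeyError:         # the shortened name does not exist
        return char

print("".join(rmdiacritics(c) for c in "\u0142\u00f3d\u017a, S\u00f8ren"))  # lodz, Soren
```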

@matanster 2018-12-29 14:30:01

This looks elegant in harnessing the semantic descriptions of characters that are available. Do we really need the unicode function call in there with python 3 though? I think a tighter regex in place of the find would avoid all the trouble mentioned in the comment above, and also, memoization would help performance when it's a critical code path.

@lenz 2018-12-29 14:45:19

@matanster no, this is an old answer from the Python-2 era; the unicode typecast is no longer appropriate in Python 3. In any case, in my experience there is no universal, elegant solution to this problem. Depending on the application, any approach has its pros and cons. Quality-oriented tools like unidecode are based on hand-crafted tables. Some resources (tables, algorithms) are provided by Unicode, eg. for collation.
