By Chris Dutrow


2012-06-06 22:44:22 8 Comments

What is a good way to remove all characters outside the ASCII range (ordinal 128 and above) from a string in Python?

I'm using hashlib.sha256 in Python 2.7 and getting this exception:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u200e' in position 13: ordinal not in range(128)

I assume this means that some funky character found its way into the string that I am trying to hash.

Thanks!

3 comments

@Andrew Clark 2012-06-06 23:08:05

Instead of removing those characters, it would be better to use an encoding that hashlib won't choke on, utf-8 for example:

>>> import hashlib
>>> data = u'\u200e'
>>> hashlib.sha256(data.encode('utf-8')).hexdigest()
'e76d0bc0e98b2ad56c38eebda51da277a591043c9bc3f5c5e42cd167abc7393e'

@Nick Craig-Wood 2012-06-06 23:04:15

This is an example of where the changes in Python 3 make an improvement, or at least generate a clearer error message.

Python 2

>>> import hashlib
>>> funky_string=u"You owe me £100"
>>> hashlib.sha256(funky_string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 11: ordinal not in range(128)
>>> hashlib.sha256(funky_string.encode("utf-8")).hexdigest()
'81ebd729153b49aea50f4f510972441b350a802fea19d67da4792b025ab6e68e'
>>> 

Python 3

>>> import hashlib
>>> funky_string="You owe me £100"
>>> hashlib.sha256(funky_string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Unicode-objects must be encoded before hashing
>>> hashlib.sha256(funky_string.encode("utf-8")).hexdigest()
'81ebd729153b49aea50f4f510972441b350a802fea19d67da4792b025ab6e68e'
>>> 

The real problem is that sha256 takes a sequence of bytes, which Python 2 doesn't have a clear concept of. Using .encode("utf-8") is what I'd suggest.
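As a rough sketch of that suggestion, the encoding step can be wrapped in a small helper so callers never pass text straight to sha256. The name `sha256_hex` and its `encoding` parameter are made up for illustration and are not part of hashlib; the code is written for Python 3 and should also behave the same way on Python 2.7, where `bytes` is an alias for `str`.

```python
import hashlib

def sha256_hex(value, encoding="utf-8"):
    """Return the SHA-256 hex digest of text or bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        # Already a byte string: hash it unchanged.
        data = value
    else:
        # Text: turn it into bytes explicitly before hashing.
        data = value.encode(encoding)
    return hashlib.sha256(data).hexdigest()

# U+200E (the character from the question) is the three bytes E2 80 8E in UTF-8.
print(sha256_hex(u"\u200e") == hashlib.sha256(b"\xe2\x80\x8e").hexdigest())
```

Making the encoding an explicit argument keeps the choice visible at the call site, which matters because the same text hashed under a different encoding gives a different digest.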

@Joran Beasley 2012-06-06 22:47:52

new_safe_str = some_string.encode('ascii', 'ignore')

I think that would work.

Or you could do a list comprehension (note the strict comparison: ordinal 128 is already outside ASCII):

"".join([ch for ch in orig_string if ord(ch) < 128])

[edit] However, as others have said, it may be better to figure out how to deal with Unicode in general, unless you really need it encoded as ASCII for some reason.
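For reference, here is a minimal sketch of both stripping approaches side by side, written for Python 3 (it should also run on 2.7 with unicode input). The function names `strip_by_encoding` and `strip_by_filter` are invented for this example.

```python
def strip_by_encoding(text):
    """Drop non-ASCII characters by encoding with errors='ignore'."""
    # encode() silently skips anything it cannot represent in ASCII;
    # decode() turns the surviving bytes back into text.
    return text.encode("ascii", "ignore").decode("ascii")

def strip_by_filter(text):
    """Drop non-ASCII characters by checking each code point."""
    # ASCII covers ordinals 0 through 127, so the test is < 128.
    return "".join(ch for ch in text if ord(ch) < 128)

sample = u"caf\xe9 \u200eok"
print(strip_by_encoding(sample))
print(strip_by_encoding(sample) == strip_by_filter(sample))
```

Both functions should agree on any input; the encode-based version is shorter, while the filter version makes the cutoff explicit.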

@Chris Dutrow 2012-06-07 11:05:57

This is the accepted answer because it is the only one that will work for my use case. It would have been nice to know in advance that the hash function needed some more micro-management to work correctly, but now that several million database entries have secondary keys using the current hash method, I am not in a position to change it.
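A quick way to see how the two approaches relate for stored hashes is to compare their digests directly. This is only a sketch: `digest_stripped` and `digest_utf8` are hypothetical names, and it assumes the existing keys were produced from the same text with nothing else mixed in.

```python
import hashlib

def digest_stripped(text):
    """Hash after dropping non-ASCII characters (the accepted approach)."""
    return hashlib.sha256(text.encode("ascii", "ignore")).hexdigest()

def digest_utf8(text):
    """Hash the UTF-8 encoding of the text (the other answers' approach)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

plain = u"abc"          # ASCII only
marked = u"abc\u200e"   # same text plus the character from the question

# ASCII-only text encodes to the same bytes either way, so the digests match.
print(digest_stripped(plain) == digest_utf8(plain))
# With a non-ASCII character the byte sequences differ, so the digests differ.
print(digest_stripped(marked) == digest_utf8(marked))
# Stripping makes the marked string collide with the plain one.
print(digest_stripped(marked) == digest_stripped(plain))
```

The last comparison is the trade-off to keep in mind: after stripping, two strings that differ only in non-ASCII characters hash to the same key.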
