By Filipe Correia


2012-04-12 08:50:16 8 Comments

I'm dynamically creating python classes, and I know not all characters are valid in this context.

Is there a method somewhere in the class library that I can use to sanitize a random text string, so that I can use it as a class name? Either that or a list of the allowed characters would be a good help.


Addition regarding clashes with identifier names: As @Ignacio pointed out in the answer below, any character that is valid in an identifier is valid in a class name. You can even use a reserved word as a class name without any trouble. But there's a catch: if you do use a reserved word, you won't be able to access the class by name the way you would other (non-dynamically-created) classes, even after doing globals()[my_class.__name__] = my_class. The assignment itself works, but in source code the reserved word always takes precedence over the name.
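A minimal sketch of that catch, assuming the class is created with type() and that keyword.iskeyword() from the standard library is an acceptable way to detect reserved words (the name my_class is only illustrative):

```python
import keyword

# Illustrative only: build a class whose name is the reserved word "class".
my_class = type("class", (object,), {})
globals()[my_class.__name__] = my_class

# The class exists and is reachable through the globals() dictionary...
assert globals()["class"] is my_class

# ...but writing `class()` in source code would be a SyntaxError,
# so it is safer to check for reserved words before choosing a name.
print(keyword.iskeyword(my_class.__name__))  # True
print(keyword.iskeyword("MyClass"))          # False
```

Checking keyword.iskeyword() before creating the class avoids ending up with a class that can only be reached through a dictionary lookup.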

4 comments

@Ghostkeeper 2017-01-10 03:34:32

This is an old question by now, but I'd like to add an answer on how to do this in Python 3 as I've made an implementation.

The allowed characters are documented here: https://docs.python.org/3/reference/lexical_analysis.html#identifiers . They include quite a lot of characters beyond ASCII: connector punctuation such as the underscore, combining marks, and letters and digits from many scripts. Luckily the unicodedata module can help. Here's my implementation, which follows the Python documentation directly:

import unicodedata

def is_valid_name(name):
    if not name: #An empty string is never a valid identifier.
        return False
    if not _is_id_start(name[0]):
        return False
    for character in name[1:]:
        if not _is_id_continue(character):
            return False
    return True #All characters are allowed.

_allowed_id_continue_categories = {"Ll", "Lm", "Lo", "Lt", "Lu", "Mc", "Mn", "Nd", "Nl", "Pc"}
_allowed_id_continue_characters = {"_", "\u00B7", "\u0387", "\u1369", "\u136A", "\u136B", "\u136C", "\u136D", "\u136E", "\u136F", "\u1370", "\u1371", "\u19DA", "\u2118", "\u212E", "\u309B", "\u309C"}
_allowed_id_start_categories = {"Ll", "Lm", "Lo", "Lt", "Lu", "Nl"}
_allowed_id_start_characters = {"_", "\u2118", "\u212E", "\u309B", "\u309C"}

def _is_id_start(character):
    return unicodedata.category(character) in _allowed_id_start_categories or character in _allowed_id_start_characters or unicodedata.category(unicodedata.normalize("NFKC", character)) in _allowed_id_start_categories or unicodedata.normalize("NFKC", character) in _allowed_id_start_characters

def _is_id_continue(character):
    return unicodedata.category(character) in _allowed_id_continue_categories or character in _allowed_id_continue_characters or unicodedata.category(unicodedata.normalize("NFKC", character)) in _allowed_id_continue_categories or unicodedata.normalize("NFKC", character) in _allowed_id_continue_characters

This code is adapted from here under CC0: https://github.com/Ghostkeeper/Luna/blob/d69624cd0dd5648aec2139054fae4d45b634da7e/plugins/data/enumerated/enumerated_type.py#L91 . It has been well tested.
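For a quick check without a hand-written table, Python 3 strings also have a built-in isidentifier() method, and the keyword module can flag reserved words. A minimal sketch combining the two; the helper name is_usable_class_name is made up for this example and is not part of the standard library:

```python
import keyword

def is_usable_class_name(name):
    """Return True if name is a valid identifier and not a reserved word.

    is_usable_class_name is a hypothetical helper, not a standard function.
    """
    return name.isidentifier() and not keyword.iskeyword(name)

print(is_usable_class_name("MyClass"))   # True
print(is_usable_class_name("2ndClass"))  # False: starts with a digit
print(is_usable_class_name("class"))     # False: reserved word
print(is_usable_class_name(""))          # False: empty string
```

The explicit keyword check matters because isidentifier() on its own returns True for reserved words such as "class".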

@steveha 2012-04-12 09:39:38

The thing that makes this interesting is that the first character of an identifier is special. After the first character, the digits '0' through '9' are valid in identifiers, but a digit must not be the first character.

Here's a function that will return a valid identifier given any random string of characters (or an empty string if none of the characters qualify). Here's how it works:

First, we use itr = iter(seq) to get an explicit iterator on the input. Then there is a first loop, which uses the iterator itr to look at characters until it finds a valid first character for an identifier. Then it breaks out of that loop and runs the second loop, using the same iterator (which we named itr) for the second loop. The iterator itr keeps our place for us; the characters the first loop pulled out of the iterator are still gone when the second loop runs.

def gen_valid_identifier(seq):
    # get an iterator
    itr = iter(seq)
    # pull characters until we get a legal one for first in identifier
    for ch in itr:
        if ch == '_' or ch.isalpha():
            yield ch
            break
    # pull remaining characters and yield legal ones for identifier
    for ch in itr:
        if ch == '_' or ch.isalpha() or ch.isdigit():
            yield ch

def sanitize_identifier(name):
    return ''.join(gen_valid_identifier(name))
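A few sample calls show what the sanitizer keeps and drops. The two functions are repeated so the snippet runs on its own; the input strings and the "_" fallback are only illustrative choices:

```python
def gen_valid_identifier(seq):
    itr = iter(seq)
    # skip everything until a legal first character turns up
    for ch in itr:
        if ch == '_' or ch.isalpha():
            yield ch
            break
    # keep the remaining legal characters, digits included
    for ch in itr:
        if ch == '_' or ch.isalpha() or ch.isdigit():
            yield ch

def sanitize_identifier(name):
    return ''.join(gen_valid_identifier(name))

print(sanitize_identifier("123 my-class!"))  # myclass
print(sanitize_identifier("_private 2"))     # _private2
print(repr(sanitize_identifier("!!!")))      # ''

# An all-invalid input yields an empty string, so a caller may want a fallback.
class_name = sanitize_identifier("!!!") or "_"
print(class_name)  # _
```

Note that leading digits are discarded rather than moved, so "123 my-class!" becomes "myclass" and not "myclass123".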

This is a clean and Pythonic way to handle a sequence two different ways. For a problem this simple, we could just have a Boolean variable that indicates whether we have seen the first character yet or not:

def gen_valid_identifier(seq):
    saw_first_char = False
    for ch in seq:
        if not saw_first_char and (ch == '_' or ch.isalpha()):
            saw_first_char = True
            yield ch
        elif saw_first_char and (ch == '_' or ch.isalpha() or ch.isdigit()):
            yield ch

I don't like this version nearly as much as the first version. The special handling for one character is now tangled up in the whole flow of control, and this will be slower than the first version as it has to keep checking the value of saw_first_char constantly. But this is the way you would have to handle the flow of control in most languages! Python's explicit iterator is a nifty feature, and I think it makes this code a lot better.

Looping on an explicit iterator is just as fast as letting Python implicitly get an iterator for you, and the explicit iterator lets us split up the loops that handle the different rules for different parts of the identifier. So the explicit iterator gives us cleaner code that also runs faster. Win/win.
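The way one iterator carries its position from the first loop into the second can be seen in isolation. This is just a small sketch with a made-up input string:

```python
itr = iter("ab12")

head = []
for ch in itr:
    head.append(ch)
    if ch == "b":
        break  # leave the first loop; itr remembers where it stopped

# The second loop continues after "b" instead of starting over.
tail = [ch for ch in itr]

print(head)  # ['a', 'b']
print(tail)  # ['1', '2']

# Looping over the string itself would start from the beginning again.
print(list("ab12"))  # ['a', 'b', '1', '2']
```

Writing for ch in seq: twice would not work here, because each loop would ask the string for a fresh iterator.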

@ArtOfWarfare 2015-02-23 17:30:07

Why do you have the itr = iter(seq) line... wouldn't for ch in seq: have the exact same results, the same if not better performance, and improved readability?

@steveha 2015-02-24 18:45:43

@ArtOfWarfare I have edited the answer to explain.

@ArtOfWarfare 2015-02-24 19:59:45

Huh. I've never seen that done before. I'll keep that design in mind next time I similarly need to handle a before and after portion of an iteration.

@Ignacio Vazquez-Abrams 2012-04-12 08:52:14

Python 2 Language Reference, §2.3, "Identifiers and keywords"

Identifiers (also referred to as names) are described by the following lexical definitions:

identifier ::=  (letter|"_") (letter | digit | "_")*
letter     ::=  lowercase | uppercase
lowercase  ::=  "a"..."z"
uppercase  ::=  "A"..."Z"
digit      ::=  "0"..."9"

Identifiers are unlimited in length. Case is significant.

@void-pointer 2014-01-11 19:20:45

Here is the regular expression used to define valid identifiers: identifier ::= (letter|"_") (letter | digit | "_")*. (Perhaps you would like to add something to this effect to your answer so that users don't have to search the webpage?)

@Qix 2017-05-28 05:38:35

To be pedantic, that's not a regex @void-pointer - it's a grammar.

@sevko 2014-07-02 21:37:05

As per the Python 2 Language Reference, §2.3, "Identifiers and keywords", a valid Python identifier is defined as:

(letter|"_") (letter | digit | "_")*

Or, in regex:

[a-zA-Z_][a-zA-Z0-9_]*
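A minimal sketch of how that pattern might be applied with the re module. The names IDENTIFIER_RE and is_ascii_identifier are invented for this example, and the pattern is ASCII-only, so it rejects non-ASCII letters that Python 3 accepts in identifiers:

```python
import re

# ASCII-only pattern taken from the regex above.
IDENTIFIER_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def is_ascii_identifier(name):
    # fullmatch() requires the whole string to match, not just a prefix
    return IDENTIFIER_RE.fullmatch(name) is not None

print(is_ascii_identifier("my_class1"))  # True
print(is_ascii_identifier("1abc"))       # False: starts with a digit
print(is_ascii_identifier("my-class"))   # False: hyphen is not allowed
print(is_ascii_identifier(""))           # False: empty string
```

Using fullmatch() rather than match() matters here: match() would accept "my-class" because the prefix "my" already satisfies the pattern.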
