By good


2009-10-10 13:10:51 8 Comments

Which characters make a URL invalid?

Are these valid URLs?

  • example.com/file[/].html
  • http://example.com/file[/].html

9 comments

@Mark Amery 2016-04-16 17:17:49

Most of the existing answers here are impractical because they totally ignore the real-world usage of addresses like:

First, a digression into terminology. What are these addresses? Are they valid URLs?

Historically, the answer was "no". According to RFC 3986, from 2005, such addresses are not URIs (and therefore not URLs, since a URL is a type of URI). Per the terminology of the 2005 IETF standards, we should properly call them IRIs (Internationalized Resource Identifiers), as defined in RFC 3987, which are technically not URIs but can be converted to URIs simply by percent-encoding all non-ASCII characters in the IRI.

Per modern spec, the answer is "yes". The WHATWG Living Standard simply classifies everything that would previously be called "URIs" or "IRIs" as "URLs". This aligns the specced terminology with how normal people who haven't read the spec use the word "URL", which was one of the spec's goals.

What characters are allowed under the WHATWG Living Standard?

Per this newer meaning of "URL", what characters are allowed? In many parts of the URL, such as the query string and path, we're allowed to use arbitrary "URL units", which are

URL code points and percent-encoded bytes.

What are "URL code points"?

The URL code points are ASCII alphanumeric, U+0021 (!), U+0024 ($), U+0026 (&), U+0027 ('), U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+002A (*), U+002B (+), U+002C (,), U+002D (-), U+002E (.), U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+003F (?), U+0040 (@), U+005F (_), U+007E (~), and code points in the range U+00A0 to U+10FFFD, inclusive, excluding surrogates and noncharacters.

(Note that the list of "URL code points" doesn't include %, but that %s are allowed in "URL units" if they're part of a percent-encoding sequence.)

The only place I can spot where the spec permits the use of any character that's not in this set is in the host, where IPv6 addresses are enclosed in [ and ] characters. Everywhere else in the URL, either URL units are allowed or some even more restrictive set of characters.
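As a rough illustration of the definition quoted above, here is a minimal Python sketch of a membership test for "URL code points". The function name is my own, and the noncharacter ranges (U+FDD0 to U+FDEF, plus every code point ending in FFFE or FFFF) come from the Unicode standard rather than from the text quoted here. It tests a single character only; it says nothing about whether a whole string is a valid URL.

```python
def is_url_code_point(ch: str) -> bool:
    """Return True if the single character ch is a WHATWG 'URL code point'.

    Hypothetical helper for illustration; not part of any library.
    """
    cp = ord(ch)
    if ch.isascii():
        # ASCII alphanumerics plus the punctuation listed in the spec.
        return ch.isalnum() or ch in "!$&'()*+,-./:;=?@_~"
    if not 0xA0 <= cp <= 0x10FFFD:
        return False
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are excluded.
        return False
    # Noncharacters: U+FDD0..U+FDEF and any code point ending in FFFE or FFFF.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False
    return True


for sample in ["a", "~", "[", "%", " ", "ń"]:
    print(repr(sample), is_url_code_point(sample))
```

Note that "%" is rejected here on purpose: it is only allowed as the start of a percent-encoded byte, which is a separate kind of URL unit.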

What characters were allowed under the old RFCs?

For the sake of history, and since it's not explored fully elsewhere in the answers here, let's examine what was allowed under the older pair of specs.

First of all, we have two types of RFC 3986 reserved characters:

  • :/?#[]@, which are part of the generic syntax for a URI defined in RFC 3986
  • !$&'()*+,;=, which aren't part of the RFC's generic syntax, but are reserved for use as syntactic components of particular URI schemes. For instance, semicolons and commas are used as part of the syntax of data URIs, and & and = are used as part of the ubiquitous ?foo=bar&qux=baz format in query strings (which isn't specified by RFC 3986).

Any of the reserved characters above can be legally used in a URI without encoding, either to serve their syntactic purpose or just as literal characters in data in some places where such use could not be misinterpreted as the character serving its syntactic purpose. (For example, although / has syntactic meaning in a URL, you can use it unencoded in a query string, because it doesn't have meaning in a query string.)

RFC 3986 also specifies some unreserved characters, which can always be used simply to represent data without any encoding:

  • abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~

Finally, the % character itself is allowed for percent-encodings.

That leaves only the following ASCII characters that are forbidden from appearing in a URL:

  • The control characters (chars 0-1F and 7F), including new line, tab, and carriage return.
  • The space character (char 20).
  • "<>\^`{|}

Every other character from ASCII can legally feature in a URL.
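The forbidden set can be derived mechanically from the three lists above. The following is a small sketch in Python (the variable names are mine) that subtracts the reserved characters, the unreserved characters and % from printable ASCII. Besides the punctuation listed above, the result also contains the space character, which RFC 3986 likewise does not allow unencoded.

```python
import string

# RFC 3986 character classes, as listed above.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
ALLOWED = GEN_DELIMS | SUB_DELIMS | UNRESERVED | {"%"}

# Printable ASCII runs from 0x20 (space) through 0x7E (~);
# everything below 0x20, and 0x7F, is a control character.
printable_ascii = {chr(cp) for cp in range(0x20, 0x7F)}

forbidden_printable = "".join(sorted(printable_ascii - ALLOWED))
print(repr(forbidden_printable))
```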

Then RFC 3987 extends that set of unreserved characters with the following Unicode character ranges:

  %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE1000-EFFFD

These block choices from the old spec seem bizarre and arbitrary given the latest Unicode block definitions; this is probably because the blocks have been added to in the decade since RFC 3987 was written.


Finally, it's perhaps worth noting that simply knowing which characters can legally appear in a URL isn't sufficient to recognise whether some given string is a legal URL or not, since some characters are only legal in particular parts of the URL. For example, the reserved characters [ and ] are legal as part of an IPv6 literal host in a URL like http://[1080::8:800:200C:417A]/foo but aren't legal in any other context, so the OP's example of http://example.com/file[/].html is illegal.
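To make the "context matters" point concrete, here is a hedged Python sketch that checks just this one rule: square brackets outside the host. It relies on urllib.parse.urlsplit from the standard library to separate the components; the helper name is my own, and this is nowhere near a full validator (it ignores userinfo, for instance).

```python
from urllib.parse import urlsplit


def brackets_only_in_host(url: str) -> bool:
    """Return False if '[' or ']' appears in the path, query or fragment.

    Illustrative helper only; it checks a single rule from RFC 3986.
    """
    parts = urlsplit(url)
    rest = parts.path + parts.query + parts.fragment
    return "[" not in rest and "]" not in rest


print(brackets_only_in_host("http://[1080::8:800:200C:417A]/foo"))
print(brackets_only_in_host("http://example.com/file[/].html"))
```

The first call reports True because the brackets sit in the host; the second reports False because they sit in the path.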

@JasonM1 2012-11-21 18:50:11

To add some clarification and directly address the question above, there are several classes of characters that cause problems for URLs and URIs.

There are some characters that are disallowed and should never appear in a URL/URI, reserved characters (described below), and other characters that may cause problems in some cases, but are marked as "unwise" or "unsafe". Explanations for why the characters are restricted are clearly spelled out in RFC-1738 (URLs) and RFC-2396 (URIs). Note the newer RFC-3986 (update to RFC-1738) defines the construction of what characters are allowed in a given context but the older spec offers a simpler and more general description of which characters are not allowed with the following rules.

Excluded US-ASCII Characters disallowed within the URI syntax:

   control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
   space       = <US-ASCII coded character 20 hexadecimal>
   delims      = "<" | ">" | "#" | "%" | <">

The character "#" is excluded because it is used to delimit a URI from a fragment identifier. The percent character "%" is excluded because it is used for the encoding of escaped characters. In other words, the "#" and "%" are reserved characters that must be used in a specific context.

Unwise characters are allowed but may cause problems:

   unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Characters that are reserved within a query component and/or have special meaning within a URI/URL:

  reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","

The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax. Characters in the "reserved" set are not reserved in all contexts. The hostname, for example, can contain an optional username so it could be something like ftp://[email protected]/ where the '@' character has special meaning.

Here is an example of a URL that has invalid and unwise characters (e.g. '$', '[', ']') and should be properly encoded:

http://mw1.google.com/mw-earth-vectordb/kml-samples/gp/seattle/gigapxl/$[level]/r$[y]_c$[x].jpg

Some of the character restrictions for URIs/URLs are programming language dependent. For example, the '|' (0x7C) character although only marked as "unwise" in the URI spec will throw a URISyntaxException in the Java java.net.URI constructor so a URL like http://api.google.com/q?exp=a|b is not allowed and must be encoded instead as http://api.google.com/q?exp=a%7Cb if using Java with a URI object instance.
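The example above is about Java, but the encoding step itself looks the same in most languages. As a sketch in Python, urllib.parse.quote from the standard library percent-encodes everything outside the unreserved set when called with safe="" (the wrapper name below is my own):

```python
from urllib.parse import quote


def encode_component(value: str) -> str:
    """Percent-encode one URL component, leaving only unreserved characters."""
    # safe="" means that even "/" is encoded; quote() keeps "/" by default.
    return quote(value, safe="")


print(encode_component("a|b"))       # a%7Cb
print(encode_component("$[level]"))  # %24%5Blevel%5D
```

Encode each component separately and then assemble the URL; encoding a whole URL in one go would also escape the delimiters that give it structure.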

@Bob Stein 2013-07-08 17:45:52

Excellent, thorough answer, the only one to directly answer the actual question. Reserved section may need work, e.g. literal ? is just fine in the query section, but impossible before it, and I don't think @ belongs in any of these lists. Oh, and instead of %25 in the last string, don't you mean %7C?

@JasonM1 2013-07-08 21:15:45

Thanks. Good catch: the %25 was a typo in the example. Added footnote to the "reserved" syntax description directly from RFC-2396.

@Mark Amery 2016-04-16 17:38:40

This answer isn't bad, but there are some confusions and errors. You initially conflate disallowed and reserved characters (very different things), you make too much of the distinction between "unwise" characters and other disallowed characters (dropped in RFC 3986 and syntactically irrelevant even in RFC 2396), and you confusingly present a list of all reserved characters as the list reserved "within a query component".

@JasonM1 2016-04-17 16:33:19

Thanks, didn't mean to group the disallowed and reserved as the same. Updated the answer. IMHO rules in RFC-2396 though older are simpler to understand than the updated rules in 3986. Answer reflects more on which characters might be troublesome in general rather than exactly which context it is allowed or not allowed.

@Mark Amery 2016-04-18 21:14:32

Hmm. This is still a little misleading since the list of characters reserved in the query component is taken from RFC 2396 (published in 1998) and doesn't match the newer RFC 3986 published in 2005. Confusingly, the terminology has changed a little (RFC 2396 says reserved characters are not reserved in some contexts and so don't need escaping; RFC 3986 says reserved characters do not act as delimiters in some contexts and so don't need escaping), but there's also a real change in meaning: ? and /, for instance, can be used unencoded in an RFC 3986 query string but not an RFC-2396 one.

@Philip 2017-01-25 19:23:11

It's notable that Tomcat in recent releases (7.0.73+, 8.0.39+, 8.5.7+) has started rejecting requests with characters from the "unwise" category with HTTP 400 errors: "Invalid character found in the request target. The valid characters are defined in RFC 7230 and RFC 3986"

@WeGoToMars 2017-07-29 19:46:20

% should be accepted in URLs

@JasonM1 2017-07-29 19:54:45

% is a special character in URLs, used for hex-encoding, as in ftp://[email protected]/%2Fetc/motd where %2F is the hexadecimal notation for '/'.

@PickBoy 2017-11-28 13:53:31

There are some misleading points in this answer. RFC 3986 is a replacement for 2396, and it also updates 1738. Also, there are some differences between 3986 and 2396 regarding characters (this answer implies they are the same).

@CraigTP 2009-10-10 13:22:26

All valid characters that can be used in a URI (a URL is a type of URI) are defined in RFC 3986.

All other characters can be used in a URL provided that they are "URL Encoded" first. This involves changing the invalid character for specific "codes" (usually in the form of the percent symbol (%) followed by a hexadecimal number).

This link, HTML URL Encoding Reference, contains a list of the encodings for invalid characters.

@DavidRR 2014-09-17 20:09:44

And for Unicode characters, the Wikipedia article Percent-encoding says the following: "The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values."
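The rule quoted above (keep unreserved characters, convert everything else to UTF-8 bytes, then percent-encode each byte) can be written out by hand. This is only an illustrative Python sketch with a made-up function name; in real code urllib.parse.quote does the same job.

```python
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def percent_encode_utf8(text: str) -> str:
    """Percent-encode every character outside the RFC 3986 unreserved set."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # One "%XX" triplet per UTF-8 byte of the character.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(percent_encode_utf8("budyń"))  # budy%C5%84
print(percent_encode_utf8("你好"))    # %E4%BD%A0%E5%A5%BD
```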

@Gumbo 2009-10-10 13:26:14

In general URIs as defined by RFC 3986 (see Section 2: Characters) may contain any of the following characters:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=

Note that this list doesn't state where in the URI these characters may occur.

Any other character needs to be encoded with the percent-encoding (%hh). Each part of the URI has further restrictions about what characters need to be represented by a percent-encoded word.
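As a sketch of how that character list plus the %hh rule could be checked in code, here is one possible Python version. The pattern and function name are my own; like the list itself, this only checks which characters occur, not where they occur, so passing it does not make a string a well-formed URI.

```python
import re

# Every character must be in the RFC 3986 list above, or be part of a
# "%" followed by exactly two hexadecimal digits.
URI_CHARS = re.compile(
    r"(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*"
)


def has_only_uri_characters(s: str) -> bool:
    """Return True if s consists solely of URI characters and %hh triplets."""
    return URI_CHARS.fullmatch(s) is not None


print(has_only_uri_characters("http://example.com/a%20b"))  # True
print(has_only_uri_characters("http://example.com/a b"))    # False
print(has_only_uri_characters("100%"))                      # False
```

The trailing * rather than + is deliberate, so that the empty string passes too.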

@Eamon Nerbonne 2011-05-31 08:22:37

(of course, the list of characters doesn't state where in the uri they may occur)

@Gumbo 2011-05-31 11:34:56

@Eamon Nerbonne: Yes, this is only the union of the sets of valid characters of all components.

@Leif Wickland 2011-10-07 17:01:54

Here's a regex that will determine if the entire string contains only the characters above: /^[!#$&-;=?-\[\]_a-z~]+$/

@techiferous 2011-12-11 23:26:12

@Leif that regex is missing some characters and doesn't properly escape others. This regex should work better: /(http[A-Za-z0-9\-\._~:\/\?#[]@!\$&'()*\+,;=%]+)/

@Leif Wickland 2011-12-13 19:28:02

@techiferous, Yeah, I forgot to allow "%" escaped characters. It should've looked more like: /^([!#$&-;=?-\[\]_a-z~]|%[0-9a-fA-F]{2})+$/ Was there anything else that you found it should've been accepting? (Just to be clear, that regex only checks if the string contains valid URL characters, not if the string contains a well formed URL.)

@Gumbo 2011-12-14 08:31:24

@LeifWickland That’s much better. But note that an empty string is also a valid URL.

@Timwi 2011-12-16 22:29:47

I don’t think there is any requirement in URLs for % to be followed strictly by two hex digits. That is merely a convention that is used on top of URLs.

@Leif Wickland 2012-01-05 00:00:49

@Timwi RFC 3986 says, "A percent-encoded octet is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing that octet's numeric value." It also says, "Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI." I read that as saying that a "%" may only appear if it is followed by two hex digits. How do you read it?

@Tim 2012-06-20 05:13:27

# marks the start of the fragment, it wouldn't be wise to allow it via a regex

@Weeble 2012-07-02 13:10:59

@LeifWickland It looks like both your regexes omit period ".", solidus "/", at-sign "@", apostrophe "'", parentheses "(" and ")", asterisk "*", plus "+" and comma ",". I think techiferous's regex includes all of them.

@Weeble 2012-07-02 13:18:56

@techiferous Your regex doesn't escape the closing square bracket "]" in the character class. I think you meant /(http[A-Za-z0-9\-\._~:\/\?#[\]@!\$&'()*\+,;=%]+)/ EDIT - (Aha, StackOverflow deleted it. You need to include two for it to appear. I'm not sure whether it might have eaten other characters too...)

@Leif Wickland 2012-07-02 16:57:22

@Weeble My regex included those characters by using ranges. Between '&' and ';' and between '?' and '[' you'll find all those characters you didn't see.

@Markus von Broady 2013-03-27 07:59:08

http://budyń.pl is an example of an address with a character outside the given range of valid characters. And the address works. Funny thing is, it isn't parsed correctly on SO: budyń.pl. I think you should be liberal when parsing URLs (an http:// prefixed string is an obvious link), while being very strict in page naming, URL rewriting etc.

@Gumbo 2013-03-27 11:27:04

@MarkusvonBroady budyń.pl is a so called International Domain Name (IDN) and is actually translated to the punycode xn--budy-e2a.pl. When you enter http://budyń.pl in your browser, it will actually request http://xn--budy-e2a.pl instead.
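For anyone who wants to reproduce that translation, here is a minimal sketch using Python's built-in "idna" codec. Bear in mind this codec implements the older IDNA 2003 rules, so treat it as an illustration rather than what a current browser does internally.

```python
host = "budyń.pl"

# Convert the internationalized host name to its ASCII (Punycode) form.
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)  # xn--budy-e2a.pl

# And back again.
print(ascii_host.encode("ascii").decode("idna"))  # budyń.pl
```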

@Markus von Broady 2013-03-27 14:19:09

@Gumbo I am aware of it, but the translation is done behind the scenes; when creating a forum script, like in situation described by question's author, you want to consider such unusual cases. In fact, I don't understand why a script should check e-mail/URL validity - most regexps I saw were much much more limiting than RFC and therefore causing problems with exotic addresses. I would only check the scheme if it's in a whitelist of http(s), ftp, magnet etc., and put the tag in <> not [], as the latter can be inside a link as well.

@Julian 2013-05-01 16:07:52

This w3c spec from 1994 seems to address the valid characters in more detail: w3.org/Addressing/URL/url-spec.txt See the section titled "BNF for specific URL schemes".

@cnst 2015-02-24 02:46:54

interestingly, this encoding doesn't seem to be mandated for fields like Referer

@Pacerier 2015-03-03 15:49:53

@MarkusvonBroady, To prevent problems due to users' browsers. If you limit it to the "normal limit", there won't be problems.

@Markus von Broady 2015-03-04 18:06:15

@Pacerier The best way to prevent problems is to disable links at all. If person A wants to share http://budyń.pl, and person B's browser can't handle it, you won't help person B by blocking the address - because in both cases person B will not access the Budyń.pl site.

@Pacerier 2015-03-08 14:07:34

@MarkusvonBroady, No no, you should take the input http://budyń.pl and link it to the translated version http://xn--budy-e2a.pl/. That way problem solved.

@Markus von Broady 2015-03-09 14:55:19

@Pacerier My point is, you either: 1. Use regexp to find the link. That way, using the above version, you will never find the http://budyń.pl link and so you will never translate it and it will be parsed as plain text (at least the part after the character not inside regexp formula). --- 2. Use regexp to validate the link. That way, you will not allow a user to type http://budyń.pl, even though it is a valid link that will open in a browser. This is a reason why I tested it here, on SO, to support my point with a real life example:: budyń.pl

@Pacerier 2015-03-11 09:57:50

@MarkusvonBroady, 3. Don't use Regex. Imagine the world doesn't have such a wild creature.

@Markus von Broady 2015-03-12 17:27:51

@Pacerier This is not relevant to the topic.

@244boy 2018-04-10 02:26:28

@Gumbo But where to encode the url?

@brianary 2019-01-14 16:56:46

@Pacerier Regex is extremely powerful and useful. Use the right tool for the job. Excluding a tool because a few vocal programmers are too dumb or lazy to master it is one way to needlessly and poorly reinvent existing solutions.

@relipse 2016-12-26 18:36:55

I came up with a couple of regular expressions for PHP that will convert URLs in text to anchor tags. (First it converts all www. URLs to http://, then it converts all URLs with https?:// to a href=... HTML links.)

$string = preg_replace('/(https?:\/\/)([!#$&-;=?\-\[\]_a-z~%]+)/sim', '<a href="$1$2">$2</a>', preg_replace('/(\s)((www\.)([!#$&-;=?\-\[\]_a-z~%]+))/sim', '$1http://$2', $string) );

@Mark Amery 2018-09-11 14:30:44

-1; beyond the fact that they both involve URLs in some capacity, this has nothing to do with the question that was asked.

@Ciro Santilli 新疆改造中心996ICU六四事件 2014-08-29 14:19:07

Several Unicode character ranges are valid in HTML5, although it might still not be a good idea to use them.

E.g., href docs say http://www.w3.org/TR/html5/links.html#attr-hyperlink-href:

The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.

Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which says it aims to:

Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process.

That document defines URL code points as:

ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.

The term "URL code points" is then used in the statement:

If c is not a URL code point and not "%", parse error.

in several parts of the parsing algorithm, including the scheme, authority, relative path, query and fragment states: so basically the entire URL.

Also, the validator http://validator.w3.org/ passes for URLs like "你好", and does not pass for URLs with characters like spaces "a b"

Of course, as mentioned by Stephen C, it is not just about characters but also about context: you have to understand the entire algorithm. But since the class "URL code points" is used at key points of the algorithm, it gives a good idea of what you can use or not.

See also: Unicode characters in URLs

@Bunyk 2014-02-11 17:57:16

I needed to pick a character for splitting URLs in a string, so I decided to build the list of characters that cannot be found in a URL myself:

>>> from string import ascii_letters, digits, printable
>>> allowed = "-_.~!*'();:@&=+$,/?%#[]" + ascii_letters + digits
>>> ''.join(set(printable).difference(set(allowed)))
'`" <\x0b\n\r\x0c\\\t{^}|>'

So, the possible choices are the newline, tab, space, backslash and "<>{}^|. I guess I'll go with the space or newline. :)

@244boy 2018-04-10 02:27:37

But where to encode the url in Django backend?

@Dominic Sayers 2009-12-03 15:46:05

In your supplementary question you asked if www.example.com/file[/].html is a valid URL.

That URL isn't valid because a URL is a type of URI and a valid URI must have a scheme like http: (see RFC 3986).

If you meant to ask if http://www.example.com/file[/].html is a valid URL then the answer is still no because the square bracket characters aren't valid there.

The square bracket characters are reserved for URLs in this format: http://[2001:db8:85a3::8a2e:370:7334]/foo/bar (i.e. an IPv6 literal instead of a host name)
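Both points (the missing scheme, and brackets belonging to an IPv6 host) can be seen by splitting the two addresses into components. A quick sketch with Python's standard urllib.parse.urlsplit follows; note that urlsplit is a lenient splitter, not a validator, so it will not reject the bracketed path by itself.

```python
from urllib.parse import urlsplit

no_scheme = urlsplit("www.example.com/file[/].html")
ipv6 = urlsplit("http://[2001:db8:85a3::8a2e:370:7334]/foo/bar")

# Without "scheme:" at the front, nothing is recognised as a scheme or a
# host; the whole string ends up in the path.
print(repr(no_scheme.scheme), repr(no_scheme.netloc), no_scheme.path)

# Here the brackets delimit the IPv6 literal and are stripped from .hostname.
print(ipv6.scheme, ipv6.hostname, ipv6.path)
```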

It's worth reading RFC 3986 carefully if you want to understand the issue fully.

@skolima 2011-12-14 08:41:36

After reading the RFC, I'm more inclined to agree with @Stephen C's more detailed explanation.

@Adam Gent 2013-05-16 00:40:42

URLs are not a subset of URIs. The [ and ] characters are not valid in URIs for almost all the parsers I have seen. This has actually screwed me in the real world: stackoverflow.com/questions/11038967/…

@Mark Amery 2016-04-16 17:43:30

@AdamGent URLs very much are a subset of URIs. The only difference between them is whether they describe the location of the resource - which is a semantic distinction, not a syntactic one. If the parsers you've seen that labelled themselves as "URI" parsers treated square brackets differently to those that labelled themselves as "URL" parsers, then that's pure coincidence, not caused by any difference between URLs and URIs.

@Adam Gent 2016-04-18 01:28:42

@Mark Amery it's analogous to saying C++ is a superset of C. That is true for the most part but not entirely, because URL and C are much older and so have to include behavior that is less strict. The problem is that URL parsers will parse things that are not valid URIs... and I mean most of them (frankly I'm so tired of pointing this out across so many languages). It is not coincidence, it's backwards compatibility. Can we agree that the URL spec is older, at least?

@Adam Gent 2016-04-18 01:44:22

@MarkAmery That is from Python, C#, Java and some C libraries the parsers will take Unwise very seriously for URIs and yet be fine with URL libraries. That is there is no flag to ignore Unwise. I'll have to check out what Rust lang (since it is being built for a browser I'm curious what it does) for URLs. Most browsers though will happily pass "[", "]" as well. So in theory just like I said with C/C++ they are sub/super but the reality is not so true. It is highly dependent on interpretation of the spec and semantics of super/subset.

@Erwin Bolwidt 2019-05-10 06:25:02

RFC3986 is crystal clear about this: " A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax. ". No other consideration is necessary; http://example.com/file[/].html is invalid as a URL.

@ChrisR 2009-10-10 13:19:42

Not really an answer to your question, but validating URLs is really a serious p.i.t.a. You're probably better off just validating the domain name and leaving the query part of the URL alone. That is my experience. You could also resort to pinging the URL and seeing if it results in a valid response, but that might be too much for such a simple task.

Regular expressions to detect URLs are abundant, google it :)

@DavidRR 2014-09-17 20:16:18

This answer advises that URL validation is a job not for a regex, but for a language/platform-specific library.
