By Salman

2010-11-20 05:33:32 8 Comments

I am trying to match <input> type “hidden” fields using this pattern:

/<input type="hidden" name="([^"]*?)" value="([^"]*?)" />/

This is sample form data:

<input type="hidden" name="SaveRequired" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input type="hidden" name="__VIEWSTATE3" value="ZVVV91yjY" /><input type="hidden" name="__VIEWSTATE0" value="3" /><input type="hidden" name="__VIEWSTATE" value="" /><input type="hidden" name="__VIEWSTATE" value="" />

But I am not sure that the type, name, and value attributes will always appear in the same order. If the type attribute comes last, the match will fail because in my pattern it’s at the start.

How can I change my pattern so it will match regardless of the positions of the attributes in the <input> tag?

P.S.: By the way I am using the Adobe Air based RegEx Desktop Tool for testing regular expressions.


@tchrist 2010-11-20 19:19:22

Oh Yes You Can Use Regexes to Parse HTML!

For the task you are attempting, regexes are perfectly fine!

It is true that most people underestimate the difficulty of parsing HTML with regular expressions and therefore do so poorly.

But this is not some fundamental flaw related to computational theory. That silliness is parroted a lot around here, but don’t you believe them.

So while it certainly can be done (this posting serves as an existence proof of this incontrovertible fact), that doesn’t mean it should be.

You must decide for yourself whether you’re up to the task of writing what amounts to a dedicated, special-purpose HTML parser out of regexes. Most people are not.

But I am. ☻

General Regex-Based HTML Parsing Solutions

First I’ll show how easy it is to parse arbitrary HTML with regexes. The full program’s at the end of this posting, but the heart of the parser is:

for (;;) {
  given ($html) {
    last                    when (pos || 0) >= length;
    printf "\@%d=",              (pos || 0);
    print  "doctype "   when / \G (?&doctype)  $RX_SUBS  /xgc;
    print  "cdata "     when / \G (?&cdata)    $RX_SUBS  /xgc;
    print  "xml "       when / \G (?&xml)      $RX_SUBS  /xgc;
    print  "xhook "     when / \G (?&xhook)    $RX_SUBS  /xgc;
    print  "script "    when / \G (?&script)   $RX_SUBS  /xgc;
    print  "style "     when / \G (?&style)    $RX_SUBS  /xgc;
    print  "comment "   when / \G (?&comment)  $RX_SUBS  /xgc;
    print  "tag "       when / \G (?&tag)      $RX_SUBS  /xgc;
    print  "untag "     when / \G (?&untag)    $RX_SUBS  /xgc;
    print  "nasty "     when / \G (?&nasty)    $RX_SUBS  /xgc;
    print  "text "      when / \G (?&nontag)   $RX_SUBS  /xgc;
    default {
      die "UNCLASSIFIED: " .
        substr($_, pos || 0, (length > 65) ? 65 : length);
    }
  }
}

See how easy that is to read?

As written, it identifies each piece of HTML and tells where it found that piece. You could easily modify it to do whatever else you want with any given type of piece, or for more particular types than these.
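For readers who don't use Perl, the shape of that loop can be sketched in JavaScript. This is only an illustration under assumptions, not the program from this posting: the four token patterns and the chunk function are made up for the example and are far cruder than the (?&tag)-style subpatterns above. The sticky y flag plays the role of \G here: each pattern must match exactly at the current position, the first one that matches wins, and anything that cannot be classified is an error.

```javascript
// Hypothetical, simplified token patterns (not the answer's $RX_SUBS).
// The "y" (sticky) flag anchors each match at rx.lastIndex, much like \G.
const TOKENS = [
  ["comment", /<!--[\s\S]*?-->/y],
  ["untag",   /<\/[A-Za-z][\w:-]*\s*>/y],
  // A tag may contain quoted strings, and those strings may contain ">" or "<".
  ["tag",     /<[A-Za-z][\w:-]*(?:"[^"]*"|'[^']*'|[^'">])*>/y],
  ["text",    /[^<]+/y],
];

function chunk(html) {
  const out = [];
  let pos = 0;
  while (pos < html.length) {
    let matched = false;
    for (const [kind, rx] of TOKENS) {
      rx.lastIndex = pos;               // try this pattern at the current position only
      const m = rx.exec(html);
      if (m) {
        out.push({ kind, at: pos, text: m[0] });
        pos = rx.lastIndex;             // continue right after the match
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new Error("UNCLASSIFIED: " + html.slice(pos, pos + 65));
    }
  }
  return out;
}

for (const t of chunk('<p class="a>b">hi</p><!-- x -->')) {
  console.log("@" + t.at + "=" + t.kind);
}
```

Because quoted attribute values are consumed as a unit, a > or < inside quotes does not end the tag early. A stray < in text is not covered by any of these four patterns, so the sketch throws on it.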

I have no failing test cases (left :): I’ve successfully run this code on more than 100,000 HTML files — every single one I could quickly and easily get my hands on. Beyond those, I’ve also run it on files specifically constructed to break naïve parsers.

This is not a naïve parser.

Oh, I’m sure it isn’t perfect, but I haven’t managed to break it yet. I figure that even if something did, the fix would be easy to fit in because of the program’s clear structure. Even regex-heavy programs should have structure.

Now that that’s out of the way, let me address the OP’s question.

Demo of Solving the OP’s Task Using Regexes

The little html_input_rx program I include below produces the following output, so that you can see that parsing HTML with regexes works just fine for what you wish to do:

% html_input_rx,_Apparel,_Computers,_Books,_DVDs_\&_more.htm 
input tag #1 at character 9955:
       class => "searchSelect"
          id => "twotabsearchtextbox"
        name => "field-keywords"
        size => "50"
       style => "width:100%; background-color: #FFF;"
       title => "Search for"
        type => "text"
       value => ""

input tag #2 at character 10335:
         alt => "Go"
         src => ""
        type => "image"

Parse Input Tags, See No Evil Input

Here’s the source for the program that produced the output above.

#!/usr/bin/env perl
# html_input_rx - pull out all <input> tags from (X)HTML src
#                  via simple regex processing
# Tom Christiansen <[email protected]>
# Sat Nov 20 10:17:31 MST 2010

use 5.012;

use strict;
use autodie;
use warnings FATAL => "all";    
use subs qw{
    input descape dequote
use open        ":std",
          IN => ":bytes",
         OUT => ":utf8";    
use Encode qw< encode decode >;




until eof(); sub parse_input_tags {
    my $_ = shift();
    our($Input_Tag_Rx, $Pull_Attr_Rx);
    my $count = 0;
    while (/$Input_Tag_Rx/pig) {
        my $input_tag = $+{TAG};
        my $place     = pos() - length ${^MATCH};
        printf "input tag #%d at character %d:\n", ++$count, $place;
        my %attr = ();
        while ($input_tag =~ /$Pull_Attr_Rx/g) {
            my ($name, $value) = @+{ qw< NAME VALUE > };
            $value = dequote($value);
            if (exists $attr{$name}) {
                printf "Discarding dup attr value '%s' on %s attr\n",
                    $attr{$name} // "<undef>", $name;
            }
            $attr{$name} = $value;
        }
        for my $name (sort keys %attr) {
            printf "  %10s => ", $name;
            my $value = descape $attr{$name};
            my  @Q; given ($value) {
                @Q = qw[  " "  ]  when !/'/ && !/"/;
                @Q = qw[  " "  ]  when  /'/ && !/"/;
                @Q = qw[  ' '  ]  when !/'/ &&  /"/;
                @Q = qw[ q( )  ]  when  /'/ &&  /"/;
                default { die "NOTREACHED" }
            }
            say $Q[0], $value, $Q[1];
        }
        print "\n";
    }
}


sub dequote {
    my $_ = $_[0];
        (?<quote>   ["']      )
          (?s: (?! \k<quote> ) . ) * 
    return $_;

sub descape {
    my $string = $_[0];
    for my $_ ($string) {
            (?<! % )
            % ( \p{Hex_Digit} {2} )
            chr hex $1;
            & \043 
            ( [0-9]+ )
            (?: ; 
              | (?= [^0-9] )
            chr     $1;
            & \043 x
            ( \p{ASCII_HexDigit} + )
            (?: ; 
              | (?= \P{ASCII_HexDigit} )
            chr hex $1;

    return $string;

sub input { 
    our ($RX_SUBS, $Meta_Tag_Rx);
    my $_ = do { local $/; <> };  
    my $encoding = "iso-8859-1";  # web default; wish we had the HTTP headers :(
    while (/$Meta_Tag_Rx/gi) {
        my $meta = $+{META};
        next unless $meta =~ m{             $RX_SUBS
            (?= http-equiv ) 
            (?= (?&quote)? content-type )
        next unless $meta =~ m{             $RX_SUBS
            (?= content ) (?&name) 
            (?<CONTENT>   (?&value)    )
        next unless $+{CONTENT} =~ m{       $RX_SUBS
            (?= charset ) (?&name) 
            (?<CHARSET>   (?&value)    )
        if (lc $encoding ne lc $+{CHARSET}) {
            say "[RESETTING ENCODING $encoding => $+{CHARSET}]";
            $encoding = $+{CHARSET};
    return decode($encoding, $_);

sub see_no_evil {
    my $_ = shift();

    s{ <!    DOCTYPE  .*?         > }{}sx; 
    s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

    s{ <script> .*?  </script> }{}gsix; 
    s{ <!--     .*?        --> }{}gsx;

    return $_;

sub load_patterns { 

    our $RX_SUBS = qr{ (?(DEFINE)
        (?<nv_pair>         (?&name) (?&equals) (?&value)         ) 
        (?<name>            \b (?=  \pL ) [\w\-] + (?<= \pL ) \b  )
        (?<equals>          (?&might_white)  = (?&might_white)    )
        (?<value>           (?&quoted_value) | (?&unquoted_value) )
        (?<unwhite_chunk>   (?: (?! > ) \S ) +                    )
        (?<unquoted_value>  [\w\-] *                              )
        (?<might_white>     \s *                                  )
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
        (?<start_tag>  < (?&might_white) )
            (?: (?&html_end_tag) 
              | (?&xhtml_end_tag) 
        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )
    ) }six; 

    our $Meta_Tag_Rx = qr{                          $RX_SUBS 
            (?&start_tag) meta \b
                (?&might_white) (?&nv_pair) 
            ) +

    our $Pull_Attr_Rx = qr{                         $RX_SUBS
        (?<NAME>  (?&name)      )
        (?<VALUE> (?&value)     )

    our $Input_Tag_Rx = qr{                         $RX_SUBS 

        (?<TAG> (?&input_tag) )



                ) *

                (?&might_white) = (?&might_white) 
                  | (?&unquoted_value)

                (?: (?&optional_attribute)
                  | (?&standard_attribute)
                  | (?&event_attribute)
            # for LEGAL parse only, comment out next line 
                  | (?&illegal_attribute)

            (?<illegal_attribute>  (?&name) )

            (?<required_attribute> (?#no required attributes) )

              | (?&deprecated_attribute)

            # NB: The white space in string literals 
            #     below DOES NOT COUNT!   It's just 
            #     there for legibility.

                | alt
                | bottom
                | check box
                | checked
                | disabled
                | file
                | hidden
                | image
                | max length
                | middle
                | name
                | password
                | radio
                | read only
                | reset
                | right
                | size
                | src
                | submit
                | text
                | top
                | type
                | value


                  access key
                | class
                | dir
                | ltr
                | id
                | lang
                | style
                | tab index
                | title
                | xml:lang

                  on blur
                | on change
                | on click
                | on dbl   click
                | on focus
                | on mouse down
                | on mouse move
                | on mouse out
                | on mouse over
                | on mouse up
                | on key   down
                | on key   press
                | on key   up
                | on select



        || die "can't close stdout: $!";

There you go! Nothing to it! :)

Only you can judge whether your skill with regexes is up to any particular parsing task. Everyone’s level of skill is different, and every new task is different. For jobs where you have a well-defined input set, regexes are obviously the right choice, because it is trivial to put some together when you have a restricted subset of HTML to deal with. Even regex beginners should be able to handle those jobs with regexes. Anything else is overkill.

However, once the HTML starts becoming less nailed down, once it starts to ramify in ways you cannot predict but which are perfectly legal, once you have to match more different sorts of things or with more complex dependencies, you will eventually reach a point where you have to work harder to effect a solution that uses regexes than you would have to using a parsing class. Where that break-even point falls depends again on your own comfort level with regexes.

So What Should I Do?

I’m not going to tell you what you must do or what you cannot do. I think that’s Wrong. I just want to present you with possibilities, open your eyes a bit. You get to choose what you want to do and how you want to do it. There are no absolutes, and nobody else knows your own situation as well as you yourself do. If something seems like it’s too much work, well, maybe it is. Programming should be fun, you know. If it isn’t, you may be doing it wrong.

One can look at my html_input_rx program in any number of valid ways. One such is that you indeed can parse HTML with regular expressions. But another is that it is much, much, much harder than almost anyone ever thinks it is. This can easily lead to the conclusion that my program is a testament to what you should not do, because it really is too hard.

I won’t disagree with that. Certainly if everything I do in my program doesn’t make sense to you after some study, then you should not be attempting to use regexes for this kind of task. For specific HTML, regexes are great, but for generic HTML, they’re tantamount to madness. I use parsing classes all the time, especially if it’s HTML I haven’t generated myself.

Regexes optimal for small HTML parsing problems, pessimal for large ones

Even if my program is taken as illustrative of why you should not use regexes for parsing general HTML — which is OK, because I kinda meant for it to be that ☺ — it still should be an eye-opener so more people break the terribly common and nasty, nasty habit of writing unreadable, unstructured, and unmaintainable patterns.

Patterns do not have to be ugly, and they do not have to be hard. If you create ugly patterns, it is a reflection on you, not them.

Phenomenally Exquisite Regex Language

I’ve been asked to point out that my proffered solution to your problem has been written in Perl. Are you surprised? Did you not notice? Is this revelation a bombshell?

It is true that not all other tools and programming languages are quite as convenient, expressive, and powerful when it comes to regexes as Perl is. There’s a big spectrum out there, with some being more suitable than others. In general, the languages that have expressed regexes as part of the core language instead of as a library are easier to work with. I’ve done nothing with regexes that you couldn’t do in, say, PCRE, although you would structure the program differently if you were using C.

Eventually other languages will catch up with where Perl is now in terms of regexes. I say this because back when Perl started, nobody else had anything like Perl’s regexes. Say anything you like, but this is where Perl clearly won: everybody copied Perl’s regexes albeit at varying stages of their development. Perl pioneered almost (not quite all, but almost) everything that you have come to rely on in modern patterns today, no matter what tool or language you use. So eventually the others will catch up.

But they’ll only catch up to where Perl was sometime in the past, just as it is now. Everything advances. In regexes if nothing else, where Perl leads, others follow. Where will Perl be once everybody else finally catches up to where Perl is now? I have no idea, but I know we too will have moved. Probably we’ll be closer to Perl₆’s style of crafting patterns.

If you like that kind of thing but would like to use it in Perl₅, you might be interested in Damian Conway’s wonderful Regexp::Grammars module. It’s completely awesome, and makes what I’ve done here in my program seem just as primitive as mine makes the patterns that people cram together without whitespace or alphabetic identifiers. Check it out!

Simple HTML Chunker

Here is the complete source to the parser I showed the centerpiece from at the beginning of this posting.

I am not suggesting that you should use this over a rigorously tested parsing class. But I am tired of people pretending that nobody can parse HTML with regexes just because they can’t. You clearly can, and this program is proof of that assertion.

Sure, it isn’t easy, but it is possible!

And trying to do so is a terrible waste of time, because good parsing classes exist which you should use for this task. The right answer to people trying to parse arbitrary HTML is not that it is impossible. That is a facile and disingenuous answer. The correct and honest answer is that they shouldn’t attempt it because it is too much of a bother to figure out from scratch; they should not break their back striving to reïnvent a wheel that works perfectly well.

On the other hand, HTML that falls within a predictable subset is ultra-easy to parse with regexes. It’s no wonder people try to use them, because for small problems, toy problems perhaps, nothing could be easier. That’s why it’s so important to distinguish the two tasks (specific vs generic), as these do not necessarily demand the same approach.

I hope in the future here to see a more fair and honest treatment of questions about HTML and regexes.

Here’s my HTML lexer. It doesn’t try to do a validating parse; it just identifies the lexical elements. You might think of it more as an HTML chunker than an HTML parser. It isn’t very forgiving of broken HTML, although it makes some very small allowances in that direction.

Even if you never parse full HTML yourself (and why should you? it’s a solved problem!), this program has lots of cool regex bits that I believe a lot of people can learn a lot from. Enjoy!

#!/usr/bin/env perl
# chunk_HTML - a regex-based HTML chunker
# Tom Christiansen <[email protected]>
#   Sun Nov 21 19:16:02 MST 2010

use 5.012;

use strict;
use autodie;
use warnings qw< FATAL all >;
use open     qw< IN :bytes OUT :utf8 :std >;

  $| = 1;
  lex_html(my $page = slurpy());

sub lex_html {
    our $RX_SUBS;                                        ###############
    my  $html = shift();                                 # Am I...     #
    for (;;) {                                           # forgiven? :)#
        given ($html) {                                  ###############
            last                when (pos || 0) >= length;
            printf "\@%d=",          (pos || 0);
            print  "doctype "   when / \G (?&doctype)  $RX_SUBS  /xgc;
            print  "cdata "     when / \G (?&cdata)    $RX_SUBS  /xgc;
            print  "xml "       when / \G (?&xml)      $RX_SUBS  /xgc;
            print  "xhook "     when / \G (?&xhook)    $RX_SUBS  /xgc;
            print  "script "    when / \G (?&script)   $RX_SUBS  /xgc;
            print  "style "     when / \G (?&style)    $RX_SUBS  /xgc;
            print  "comment "   when / \G (?&comment)  $RX_SUBS  /xgc;
            print  "tag "       when / \G (?&tag)      $RX_SUBS  /xgc;
            print  "untag "     when / \G (?&untag)    $RX_SUBS  /xgc;
            print  "nasty "     when / \G (?&nasty)    $RX_SUBS  /xgc;
            print  "text "      when / \G (?&nontag)   $RX_SUBS  /xgc;
            default {
                die "UNCLASSIFIED: " .
                  substr($_, pos || 0, (length > 65) ? 65 : length);
            }
        }
    }
    say ".";
}
# Return correctly decoded contents of next complete
# file slurped in from the <ARGV> stream.
sub slurpy {
    our ($RX_SUBS, $Meta_Tag_Rx);
    my $_ = do { local $/; <ARGV> };   # read all input

    return unless length;

    use Encode   qw< decode >;

    my $bom = "";
    given ($_) {
        $bom = "UTF-32LE" when / ^ \xFf \xFe \0   \0   /x;  # LE
        $bom = "UTF-32BE" when / ^ \0   \0   \xFe \xFf /x;  #   BE
        $bom = "UTF-16LE" when / ^ \xFf \xFe           /x;  # le
        $bom = "UTF-16BE" when / ^ \xFe \xFf           /x;  #   be
        $bom = "UTF-8"    when / ^ \xEF \xBB \xBF      /x;  # st00pid
    }
    if ($bom) {
        say "[BOM $bom]";
        s/^...// if $bom eq "UTF-8";                        # st00pid

        # Must use UTF-(16|32) w/o -[BL]E to strip BOM.
        $bom =~ s/-[LB]E//;

        return decode($bom, $_);

        # if BOM found, don't fall through to look
        #  for embedded encoding spec
    }

    # Latin1 is web default if not otherwise specified.
    # No way to do this correctly if it was overridden
    # in the HTTP header, since we assume stream contains
    # HTML only, not also the HTTP header.
    my $encoding = "iso-8859-1";
    while (/ (?&xml) $RX_SUBS /pgx) {
        my $xml = ${^MATCH};
        next unless $xml =~ m{              $RX_SUBS
            (?= encoding )  (?&name)
                            (?&quote) ?
            (?<ENCODING>    (?&value)       )
        if (lc $encoding ne lc $+{ENCODING}) {
            say "[XML ENCODING $encoding => $+{ENCODING}]";
            $encoding = $+{ENCODING};

    while (/$Meta_Tag_Rx/gi) {
        my $meta = $+{META};

        next unless $meta =~ m{             $RX_SUBS
            (?= http-equiv )    (?&name)
            (?= (?&quote)? content-type )

        next unless $meta =~ m{             $RX_SUBS
            (?= content )       (?&name)
            (?<CONTENT>         (?&value)    )

        next unless $+{CONTENT} =~ m{       $RX_SUBS
            (?= charset )       (?&name)
            (?<CHARSET>         (?&value)    )

        if (lc $encoding ne lc $+{CHARSET}) {
            say "[HTTP-EQUIV ENCODING $encoding => $+{CHARSET}]";
            $encoding = $+{CHARSET};

    return decode($encoding, $_);
# Make sure this function is called
# as soon as source unit has been compiled.
UNITCHECK { load_rxsubs() }

# useful regex subroutines for HTML parsing
sub load_rxsubs {

    our $RX_SUBS = qr{

        (?<WS> \s *  )

        (?<any_nv_pair>     (?&name) (?&equals) (?&value)         )
        (?<name>            \b (?=  \pL ) [\w:\-] +  \b           )
        (?<equals>          (?&WS)  = (?&WS)    )
        (?<value>           (?&quoted_value) | (?&unquoted_value) )
        (?<unwhite_chunk>   (?: (?! > ) \S ) +                    )

        (?<unquoted_value>  [\w:\-] *                             )

        (?<any_quote>  ["']      )

            (?<quote>   (?&any_quote)  )
            (?: (?! \k<quote> ) . ) *

        (?<start_tag>       < (?&WS)      )
        (?<html_end_tag>      >           )
        (?<xhtml_end_tag>   / >           )
            (?: (?&html_end_tag)
              | (?&xhtml_end_tag) )

            ) *

        (?<untag> </ (?&name) > )

        # starts like a tag, but has screwed up quotes inside it

        (?<nontag>    [^<] +            )

        (?<string> (?&quoted_value)     )
        (?<word>   (?&name)             )

                # please don't feed me nonHTML
                ### (?&WS) HTML
            [^>]* >

        (?<cdata>   <!\[CDATA\[     .*?     \]\]    > )
        (?<script>  (?= <script ) (?&tag)   .*?     </script> )
        (?<style>   (?= <style  ) (?&tag)   .*?     </style> )
        (?<comment> <!--            .*?           --> )

            < \? xml
            ) *
            \? >

        (?<xhook> < \? .*? \? > )



    our $Meta_Tag_Rx = qr{                          $RX_SUBS
            (?&start_tag) meta \b
                (?&WS) (?&any_nv_pair)
            ) +


# nobody *ever* remembers to do this!
END { close STDOUT }

@Salman 2010-11-20 20:18:01

Two highlights from your comment: "I use parsing classes all the time, especially if it’s HTML I haven’t generated myself." and "Patterns do not have to be ugly, and they do not have to be hard. If you create ugly patterns, it is a reflection on you, not them." I totally agree with what you have said, so I am re-evaluating the problem. Thanks a lot for such a detailed answer.

@tchrist 2010-11-20 21:58:00

@Salman: glad to help. My second point is the more important one. I really wish people would stop writing their regex using nothing but %@#¡%^¿›±·€!≤#.*& punctuation and all scrunched together w/o any whitespace for breathing room or comments for understanding. Probably most important of all is applying problem-decomposition using top-down programming and meaningful alphabetically-named identifiers. It really changes everything, doesn’t it? ¡ƎƨɐƎ⅂d — ƨuɹəʇʇɐd λlƃnɟ əɹoɯ oИ

@Bill Ruppert 2011-02-17 14:20:19

For those who don't know, I thought I would mention that Tom is the co-author of "Programming Perl" (aka the Camel book) and one of the top Perl authorities. If you doubt that this is the real Tom Christiansen, go back and read the post.

@ridgerunner 2011-03-22 19:39:15

@Tom - I've studied (at length) Friedl's MRE3, but this post clearly demonstrates to me that I have a long, long way to go (to truly "know" regex - in the Neo: "I know kung-fu!" sense). Is there an equally well written and accurate book/resource which you could recommend that might help take me to the next level? And thanks for the excellent post! +1

@brian d foy 2011-07-06 19:23:17

This is the weird region between traditional Perl regexes and Perl 6 rules. Tom's really writing a grammar, although the match operator understands it. :)

@Steve Steiner 2011-07-08 13:45:13

To sum up: RegEx's are misnamed. I think it's a shame, but it won't change. Compatible 'RegEx' engines are not allowed to reject non-regular languages. They therefore cannot be implemented correctly with only Finite State Machines. The powerful concepts around computational classes do not apply. Use of RegEx's does not ensure O(n) execution time. The advantages of RegEx's are terse syntax and the implied domain of character recognition. To me, this is a slow moving train wreck, impossible to look away, but with horrible consequences unfolding.

@NullUserException 2011-09-02 14:32:13

@Justin for(;;) is only 7 characters.

@Jonathan M 2011-11-17 23:47:44

@tchrist, will this correctly parse something like <input value=" output<input " type="hidden">?

@tchrist 2011-11-18 16:25:39

@Jonathan M: Yes, of course it will. It would be broken, stupid, and wrong otherwise — just like most people’s approach. But not mine. :)

@Qtax 2011-12-27 15:02:23

@tchrist, this never answers OPs original question. And is parsing the proper term here? Afaics the regex are doing tokenizing/lexical analysis, but the final parsing done with Perl code, not the regex themselves.

@Mike Clark 2012-03-02 23:56:09

@tchrist Very impressive. You are obviously a highly skilled and talented Perl programmer, and extremely knowledgeable about modern regular expressions. I would point out, though, that what you have written is not really a regular expression (modern, regular, or otherwise), but rather a Perl program that uses regular expressions heavily. Does your post really support the claim that regular expressions can parse HTML correctly? Or is it more like evidence that Perl can parse HTML correctly? Either way, nice work!

@Ben Lee 2012-11-14 21:44:11

I think the modifier "epic" is heavily over-used these days, but if I had to pick and choose, I think this answer deserves it!

@aliteralmind 2014-04-14 13:39:42

This answer has been added to the Stack Overflow Regular Expressions FAQ, under "General information > When not to use Regex".

@Nitin9791 2016-03-13 14:36:18

Suppose your HTML content is stored in a string html. To get every input whose type is hidden, you can use this regular expression:

var regex = /(<input.*?type\s?=\s?["']hidden["'].*?>)/g;

The regex above finds <input followed by any number of characters until it reaches type="hidden" or type='hidden', followed by any number of characters until it reaches >.

The /g flag tells the regular expression to find every substring that matches the given pattern.
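A quick way to try this out is sketched below; the sample strings are made up for illustration. One caveat is worth seeing for yourself: because .*? may run past a >, a non-hidden input that comes before a hidden one gets swallowed into the same match. The second pattern is a variation that is not part of the original answer. It keeps the match inside a single tag by using [^>] in place of the dot, and it still assumes that no attribute value contains >.

```javascript
// The pattern from the answer above.
const regex = /(<input.*?type\s?=\s?["']hidden["'].*?>)/g;

// Made-up sample: the hidden inputs carry their attributes in different orders.
const html =
  '<input type="hidden" name="a" value="1" />' +
  '<input name="b" value="2" type="hidden" />' +
  '<input type="text" name="c" value="3" />';

const found = html.match(regex) || [];
console.log(found); // both hidden inputs, regardless of attribute order

// Caveat: a text input placed BEFORE a hidden one is pulled into the match.
const tricky = '<input type="text" name="c" /><input type="hidden" name="d" />';
const looseHit = (tricky.match(regex) || [])[0];

// Variation (an assumption, not from the answer): stay inside one tag with [^>].
const tighter = /<input\b[^>]*?\btype\s*=\s*["']hidden["'][^>]*>/gi;
const tightHit = (tricky.match(tighter) || [])[0];

console.log(looseHit); // spans both tags
console.log(tightHit); // only the hidden input
```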

@HTML5 developer 2015-03-11 06:30:40

I would like to use **DOMDocument** to extract the html code.

$dom = new DOMDocument();
$dom->loadHTML($input);
$x = new DOMXPath($dom);
$results = $x->evaluate('//input[@type="hidden"]');

foreach ($results as $item) {
    print_r($item->getAttribute('value'));
}
BTW, you can test it here; it shows the result in real time. Some rules about Regexp: Reader.

@Suamere 2013-09-18 21:26:36

While I love the contents of the rest of these answers, they didn't really answer the question directly or as correctly. Even Platinum's answer was overly complicated, and also less efficient. So I was forced to put this.

I'm a huge proponent of Regex, when used correctly. But because of stigma (and performance), I always state that well-formed XML or HTML should use an XML Parser. And even better performance would be string-parsing, though there's a line between readability if that gets too out-of-hand. However, that isn't the question. The question is how to match a hidden-type input tag. The answer is:


Depending on your flavor, the only regex option you'd need to include is the ignorecase option.

@Ilmari Karonen 2014-01-18 08:10:38

<input type='hidden' name='Oh, <really>?' value='Try a real HTML parser instead.'>

@Suamere 2014-01-30 16:22:47

Your example is self-closing. Should end with /> . Also, while the chances of having a > in the name field are almost none, it is indeed possible for there to be a > in an action handle. E.G.: An inline javascript call on the OnClick property. That being said, I have an XML parser for those, but also have a Regex for those where the document I'm given is too messed up for XML parsers to handle, but a Regex can. In addition, this isn't what the question was. You'll never run into these situations with a hidden input, and my answer is the best. Ya, <really>!.

@Ilmari Karonen 2014-01-30 17:43:20

/> is an XML-ism; it's not required in any version of HTML, except for XHTML (which never really gained much traction, and has been all but superseded by HTML5). And you're right that there's a lot of messy not-really-valid HTML out there, but a good HTML (not XML) parser should be able to cope with most of it; if they don't, most likely neither will browsers.

@Suamere 2014-01-30 19:18:38

If the only parsing or searching you need is a single hit to return a collection of hidden input fields, this regex would be perfect. Using either the .NET XML Document class(es), or referencing a third-party XML/HTML Parser just to call one method would be overkill when Regex is built in. And you're right that a website so messed up that a good HTML parser couldn't handle it probably isn't even something a dev would be looking at. But my company is handed-off millions of pages a month that are concatenated and jacked in many ways such that sometimes (not always), Regex is the best option.

@Suamere 2014-01-30 19:20:18

Only point being that we're not sure of the entire company reason this dev wants this answer. But it's what he asked for.

@Shamshirsaz.Navid 2013-03-31 20:37:38

you can try this :

<[A-Za-z ="/_0-9+]*>

and for closer result you can try this :

<[ ]*input[ ]+type="hidden"[ ]*name=[A-Za-z ="_0-9+]*[ ]*[/]*>

you can test your regex pattern here

these patterns are good for this:

<input type="hidden" name="SaveRequired" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input type="hidden" name="__VIEWSTATE3" value="ZVVV91yjY" />

and for a random order of type, name and value you can use this :

<[ ]*input[ ]*[A-Za-z ="_0-9+/]*>


<[ ]*input[ ]*[A-Za-z ="_0-9+/]*[ ]*[/]>

on this :

<input  name="SaveRequired" type="hidden" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input  name="__VIEWSTATE3" type="hidden" value="ZVVV91yjY" />


by the way i think you want something like this :

<[ ]*input(([ ]*type="hidden"[ ]*name=[A-Za-z0-9_+"]*[ ]*value=[A-Za-z0-9_+"]*[ ]*)+)[ ]*/>|<[ ]*input(([ ]*type="hidden"[ ]*value=[A-Za-z0-9_+"]*[ ]*name=[A-Za-z0-9_+"]*[ ]*)+)[ ]*/>|<[ ]*input(([ ]*name=[A-Za-z0-9_+"]*[ ]*type="hidden"[ ]*value=[A-Za-z0-9_+"]*[ ]*)+)[ ]*/>|<[ ]*input(([ ]*value=[A-Za-z0-9_+"]*[ ]*type="hidden"[ ]*name=[A-Za-z0-9_+"]*[ ]*)+)[ ]*/>|<[ ]*input(([ ]*name=[A-Za-z0-9_+"]*[ ]*value=[A-Za-z0-9_+"]*[ ]*type="hidden"[ ]*)+)[ ]*/>|<[ ]*input(([ ]*value=[A-Za-z0-9_+"]*[ ]*name=[A-Za-z0-9_+"]*[ ]*type="hidden"[ ]*)+)[ ]*/>

it's not good, but it works anyway.


@David 2011-07-10 01:22:35

In the spirit of Tom Christiansen's lexer solution, here's a link to Robert Cameron's seemingly forgotten 1998 article, REX: XML Shallow Parsing with Regular Expressions.


The syntax of XML is simple enough that it is possible to parse an XML document into a list of its markup and text items using a single regular expression. Such a shallow parse of an XML document can be very useful for the construction of a variety of lightweight XML processing tools. However, complex regular expressions can be difficult to construct and even more difficult to read. Using a form of literate programming for regular expressions, this paper documents a set of XML shallow parsing expressions that can be used a basis for simple, correct, efficient, robust and language-independent XML shallow parsing. Complete shallow parser implementations of less than 50 lines each in Perl, JavaScript and Lex/Flex are also given.

If you enjoy reading about regular expressions, Cameron's paper is fascinating. His writing is concise, thorough, and very detailed. He's not simply showing you how to construct the REX regular expression but also an approach for building up any complex regex from smaller parts.

I've been using the REX regular expression on and off for 10 years to solve the sort of problem the initial poster asked about (how do I match this particular tag but not some other very similar tag?). I've found the regex he developed to be completely reliable.

REX is particularly useful when you're focusing on lexical details of a document -- for example, when transforming one kind of text document (e.g., plain text, XML, SGML, HTML) into another, where the document may not be valid, well formed, or even parsable for most of the transformation. It lets you target islands of markup anywhere within a document without disturbing the rest of the document.
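To give a feel for what a shallow parse produces, here is a deliberately simplified Python sketch. It is not Cameron's REX expression, which handles far more of the XML grammar. It assumes a markup item is either a comment or anything from `<` to the next `>`, and the names `SHALLOW` and `shallow_parse` are invented for this example.

```python
import re

# Assumption: a toy stand-in for REX, not the real expression.
# Markup item: a comment, or "<" up to the next ">".
# Text item: a run of characters containing no "<".
SHALLOW = re.compile(r"<!--.*?-->|<[^>]*>|[^<]+", re.DOTALL)


def shallow_parse(doc):
    """Split a document into a flat list of markup and text items."""
    return SHALLOW.findall(doc)


doc = (
    "<p>Hello <b>world</b></p>"
    "<!-- <i>note</i> -->"
    '<input type="hidden" name="a" value="1" />'
)
items = shallow_parse(doc)

# Pick out one kind of tag without disturbing anything else.
hidden = [i for i in items if i.startswith("<input") and 'type="hidden"' in i]
print(items)
print(hidden)
```

For this input, joining the items back together reproduces the document exactly. That is the property that makes it possible to target islands of markup and leave the rest untouched.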

@Platinum Azure 2010-11-20 06:17:31

Contrary to all the answers here, for what you're trying to do regex is a perfectly valid solution. This is because you are NOT trying to match balanced tags-- THAT would be impossible with regex! But you are only matching what's in one tag, and that's perfectly regular.

Here's the problem, though. You can't do it with just one regex... you need to do one match to capture an <input> tag, then do further processing on that. Note that this will only work if none of the attribute values have a > character in them, so it's not perfect, but it should suffice for sane inputs.

Here's some Perl (pseudo)code to show you what I mean:

my $html = readLargeInputFile();   # placeholder: any function that returns the HTML as one string

my @input_tags = $html =~ m{
    (
        <input                      # Starts with "<input"
        (?=[^>]*?type="hidden")     # Use lookahead to make sure that type="hidden"
        [^>]+                       # Grab the rest of the tag...
        />                          # ...except for the />, which is grabbed here
    )
}xg;

# Now each member of @input_tags is something like <input type="hidden" name="SaveRequired" value="False" />

foreach my $input_tag (@input_tags) {
  my $hash_ref = {};
  # Now extract each of the fields one at a time.

  ($hash_ref->{"name"}) = $input_tag =~ /name="([^"]*)"/;
  ($hash_ref->{"value"}) = $input_tag =~ /value="([^"]*)"/;

  # Put $hash_ref in a list or something, or otherwise process it
}

The basic principle here is, don't try to do too much with one regular expression. As you noticed, regular expressions enforce a certain amount of order. So what you need to do instead is to first match the CONTEXT of what you're trying to extract, then do submatching on the data you want.
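For readers not using Perl, the same two-step idea can be sketched in Python. This is only an illustration, with the same limitation noted above (no `>` inside attribute values). The names `TAG`, `NAME`, `VALUE` and `hidden_fields` are invented for the example.

```python
import re

# Step 1: match the CONTEXT, i.e. a whole <input ... /> tag whose text
# contains type="hidden" somewhere before the closing ">".
TAG = re.compile(r'<input(?=[^>]*?type="hidden")[^>]+/>')

# Step 2: submatch the individual attributes inside one tag.
NAME = re.compile(r'name="([^"]*)"')
VALUE = re.compile(r'value="([^"]*)"')


def hidden_fields(html):
    """Return a list of {'name': ..., 'value': ...} dicts, one per hidden input."""
    fields = []
    for tag in TAG.findall(html):
        name = NAME.search(tag)
        value = VALUE.search(tag)
        fields.append({
            "name": name.group(1) if name else None,
            "value": value.group(1) if value else None,
        })
    return fields


sample = (
    '<input  name="SaveRequired" type="hidden" value="False" />'
    '<input value="0351118MK" type="hidden" name="__VIEWSTATE2" />'
    '<input type="text" name="shown" value="x" />'
)
result = hidden_fields(sample)
print(result)
```

Because each attribute gets its own search within the tag, the order of the attributes no longer matters.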

EDIT: However, I will agree that in general, using an HTML parser is probably easier and better and you really should consider redesigning your code or re-examining your objectives. :-) But I had to post this answer as a counter to the knee-jerk reaction that parsing any subset of HTML is impossible: HTML and XML are both irregular when you consider the entire specification, but the specification of a tag is decently regular, certainly within the power of PCRE.

@tchrist 2010-11-20 20:02:53

Not contrary to all the answers here. :)

@Salman 2010-11-20 20:23:46

great stuff, i am going to implement this in php, so let me try this

@tchrist 2010-11-20 20:28:33

@Salman: PHP is one of the luckier ones, since it uses PCRE. You shouldn’t have any trouble. That said, I do find meder’s solution extremely attractive.

@Platinum Azure 2010-11-20 21:51:12

@tchrist: Your answer wasn't here when I posted mine. ;-)

@tchrist 2010-11-20 21:53:49

yah well — for some reason it took me longer to type than yours did. I think my keyboard must need greasing. :)

@soulmerge 2011-07-08 11:16:15

<input type="hidden" name="question" value="<Are you really really sure about this?>"/>

@Ross Snyder 2011-07-08 12:45:18

That's invalid HTML - it should be value="&lt;Are you really sure about this?&gt;" If the place he's scraping does a poor job escaping things like this, then he'll need a more sophisticated solution - but if they do it right (and if he has control over it, he should make sure it's right) then he's fine.

@Daniel Ribeiro 2011-07-08 14:29:58

Obligatory link to the best SO answer on the subject (possibly best SO answer period):…

@Yuhong Bao 2011-07-08 20:56:29

That reminds me that as it happens, Mosaic and Netscape 1.x actually did terminate a HTML tag at the first > regardless of whether it was in an attribute value, allowing typos like <missing quotes="allowed>. To make things worse, the main competitor, libwww, did not decode HTML entities inside attribute values back then.

@Platinum Azure 2011-07-08 20:58:44

@Yuhong Bao: I think those problems would probably plague all of the good HTML parsers too, if both of those cases are considered "valid" HTML. :-)

@JDB still remembers Monica 2013-02-04 17:05:39

@DanielRibeiro - Except it isn't an answer. It only exists because enough people have found it funny over the years to prevent it from being deleted.

@Ilmari Karonen 2014-01-18 08:01:33

@RossSnyder: No it's not. Besides, a bigger problem for this attempt at parsing HTML with regexps would be, say, <!-- <input type="hidden" name="this is not an input tag" value="this is just a comment" /> -->. (And yes, that's valid HTML too.)

@Ilmari Karonen 2014-01-18 08:18:27

Or, for that matter, <input value="How can I change my pattern so it will match regardless of the positions of the attributes in the <input> tag?" name=question type=hidden />. (Yes, that's valid too.)

@Beni Cherniavsky-Paskin 2015-11-18 23:01:41

It could also be contained within single-quoted attributes, <script> tags, CSS comments, XHTML CDATA sections and perhaps others... The right question is not whether it's technically possible to write a correct regex for limited use, as much as how complicated it is to do correctly — compared to just using a proven library.

@hanshenrik 2019-07-21 09:45:03

that code will extract the wrong value if value contains any html-encoded characters, for example if the HTML is <input value="foo&lt;bar" /> then the extracted value should be foo<bar, but this code will end up with foo&lt;bar instead.. a proper HTML parser, however, would end up with foo<bar

@Platinum Azure 2019-07-21 14:25:08

@hanshenrik You're not wrong, but I think one could argue that substring extraction and conversion of escape sequences are two different problems. Conversion of escape sequences is pretty trivial and can also be done with regular expressions.

@hanshenrik 2019-07-21 14:55:03

Conversion of escape sequences is pretty trivial and can also be done with regular expressions - oh god, that would be a monster in regex, and if one exist, i bet it's written by a generator, not by human hands. there's at least 1511 translations it needs to know about, probably more. do you have an actual regex-implementation of this you can link to?
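For what it's worth, the two positions above can meet in the middle: one short regex finds the references, and a lookup table supplies the translations. Below is a minimal Python sketch, assuming the standard library's `html.entities.html5` table as that lookup. The names `ENTITY` and `decode_entities` are invented here, and the sketch skips the spec's special cases, such as references without a trailing semicolon.

```python
import html
import re
from html.entities import html5  # maps names such as "lt;" to their characters

# One pattern covers hex, decimal and named references ending in ";".
ENTITY = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace character references in text; unknown ones are left untouched."""
    def replace(match):
        ref = match.group(1)
        if ref.startswith("#"):
            digits = ref[1:]
            code = int(digits[1:], 16) if digits[0] in "xX" else int(digits)
            # Leave references beyond the Unicode range as they were.
            return chr(code) if code <= 0x10FFFF else match.group(0)
        return html5.get(ref + ";", match.group(0))
    return ENTITY.sub(replace, text)


decoded = decode_entities("foo&lt;bar")
print(decoded)
```

In practice, calling `html.unescape` from the standard library does the same job and also handles the edge cases this sketch ignores.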

@meder omuraliev 2010-11-20 19:36:30

  1. You can write a novel like tchrist did
  2. You can use a DOM library, load the HTML and use xpath and just use //input[@type="hidden"]. Or if you don't want to use xpath, just get all inputs and filter which ones are hidden with getAttribute.

I prefer #2.


$d = new DOMDocument();
$d->loadHTML('
    <input type="hidden" name="blah" value="hide yo kids">
    <input type="text" name="blah" value="hide yo kids">
    <input type="hidden" name="blah" value="hide yo wife">
');
$x = new DOMXPath($d);
$inputs = $x->evaluate('//input[@type="hidden"]');

foreach ( $inputs as $input ) {
    echo $input->getAttribute('value'), '<br>';
}

Output:

hide yo kids<br>hide yo wife<br>
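The same parser-based approach works outside PHP. As a rough sketch, here it is with Python's standard-library `html.parser`. The class and function names (`HiddenInputCollector`, `hidden_inputs`) are invented for the example, and it compares `type` case-sensitively for simplicity.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect (name, value) pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs in any order,
        # with character references in the values already decoded.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") == "hidden":
            self.fields.append((attributes.get("name"), attributes.get("value")))


def hidden_inputs(markup):
    collector = HiddenInputCollector()
    collector.feed(markup)
    collector.close()
    return collector.fields


markup = (
    '<input type="hidden" name="blah" value="hide yo kids">'
    '<input type="text" name="blah" value="hide yo kids">'
    '<input value="hide yo wife" name="blah" type="hidden" />'
)
fields = hidden_inputs(markup)
print(fields)
```

Attribute order, commented-out tags and escaped characters in values are all handled by the parser, so none of them need special treatment in the calling code.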

@tchrist 2010-11-20 19:38:04

That was kinda my point, actually. I wanted to show how hard it is.

@tchrist 2010-11-20 19:48:07

Very good stuff there. I had really hoped people would show how much easier it is using a parsing class, so thanks! I just wanted a working example of the extreme trouble you have to go through to do it from scratch using regexes. I sure hope most people conclude to use prefab parsers on generic HTML instead of rolling their own. Regexes are still great for simple HTML they made themselves, though, because that gets rid of 99.98% of the complexity.

@the_yellow_logo 2014-08-31 10:25:00

What would be nice after reading those 2 very interesting approaches would be comparing the speed/memory usage/CPU of one approach against another (i.e. regex-based VS parsing class).

@Dennis98 2016-09-20 13:42:52

@Avt'W Yeah, not that you should go write a 'novel' if Regexes happen to be faster, but in fact it really would be just interesting to know. :) But my guess already is, that a parser takes less resources, too..

@Thorbjørn Ravn Andersen 2017-07-22 07:38:47

This is actually why XPath was invented in the first place!
