By Meraqp


2018-08-29 03:53:31 8 Comments

I have a webpage. If I look at the "view-source" of the page, I find multiple instances of the following statement:

<td class="my_class" itemprop="main_item">statement 1</td>
<td class="my_class" itemprop="main_item">statement 2</td>
<td class="my_class" itemprop="main_item">statement 3</td>

I want to extract data like this:

statement 1
statement 2
statement 3

To accomplish this, I have written a method "GetContent" which takes a URL as a parameter and copies the entire page source into a C# string.

private string GetContent(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Timeout = 100000; // milliseconds

    // Dispose the response and the reader once the body has been read.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader respStream = new StreamReader(response.GetResponseStream()))
    {
        return respStream.ReadToEnd();
    }
}

Now I want to create a method "GetMyList" which will extract the list I want. I am looking for a regex that can serve this purpose. Any help is highly appreciated.

2 comments

@dontbyteme 2018-08-29 04:38:04

Hossein's answer is pretty much the solution (and I would recommend using a parser if you have the option), but a regular expression with non-capturing groups (?:...) would give you the extracted data, statement 1 or statement 2, as you need it:

IEnumerable<string> GetMyList(string str)
{
    foreach(Match m in Regex.Matches(str, @"(?:<td.*?>)(.*?)(?:<\/td>)"))
        yield return m.Groups[1].Value;
}

See the explanation at regex101 for a more detailed description.

@JohnyL 2018-08-29 05:12:26

You've got a typo: it should be ?: instead of :?

@Hossein 2018-08-29 04:03:50

Using the Html Agility Pack, this would be really easy:

  HtmlDocument doc = new HtmlDocument();
  doc.LoadHtml(html);
  //var nodes = doc.DocumentNode.SelectNodes("//td//text()");
  var nodes = doc.DocumentNode.SelectNodes("//td[@itemprop=\"main_item\"]//text()");
  var list = new List<string>();
  // SelectNodes returns null when nothing matches, so guard the loop.
  if (nodes != null)
  {
      foreach (var m in nodes)
      {
          list.Add(m.InnerText);
      }
  }

But if you want a regex, try this:

  string regularExpressionPattern1 = @"<td.*?>(.*?)<\/td>";
  // Singleline lets "." match newlines, so cells spanning several lines are found too.
  Regex regex = new Regex(regularExpressionPattern1, RegexOptions.Singleline);
  MatchCollection collection = regex.Matches(html.ToString());
  var list = new List<string>();
  foreach (Match m in collection)
  {
      list.Add(m.Groups[1].Value);
  }

@Meraqp 2018-08-29 04:07:24

Thanks for the answer. I think this will pick up every <td> in the string. I want only the <td> elements that have the attribute itemprop="main_item".

@Hossein 2018-08-29 04:14:27

@Meraqp So use the Html Agility Pack, like this: var nodes = doc.DocumentNode.SelectNodes("//td[@itemprop=\"main_item\"]//text()");
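
If a regex is still preferred, the pattern can be narrowed so that it only matches cells carrying itemprop="main_item". The following is a minimal sketch, not a definitive implementation: the class name TdExtractor is hypothetical, and the pattern assumes the attribute value is written in double quotes and the tags are lowercase, exactly as in the question's markup. A parser remains the more robust choice for real-world HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for this illustration.
static class TdExtractor
{
    // Matches a <td ...> tag whose attribute list contains itemprop="main_item"
    // (double-quoted, as in the question) and captures the cell content in group 1.
    private static readonly Regex MainItemCell = new Regex(
        @"<td\b[^>]*\bitemprop\s*=\s*""main_item""[^>]*>(.*?)</td>",
        RegexOptions.Singleline);

    public static List<string> GetMyList(string html)
    {
        var list = new List<string>();
        foreach (Match m in MainItemCell.Matches(html))
        {
            // Trim surrounding whitespace from the captured cell text.
            list.Add(m.Groups[1].Value.Trim());
        }
        return list;
    }
}
```

Calling TdExtractor.GetMyList(GetContent(url)) would then return only the text of the matching cells. Because [^>]* cannot cross a closing angle bracket, a <td> without the attribute is skipped rather than merged into a later match.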
