By Zephyr


2019-12-02 22:04:00 8 Comments

I am trying to write a simple scraper tool that will extract a specific URL from a webpage. The page has many URLs, but I want to get the one that ends with a specific set of characters.

For example, if somewhere in the page source there is a URL that looks like this:

source: "https://www.website.com/dog.pdf"

I want to return https://www.website.com/dog.pdf without the quotes. If there is more than one match, I only want to return the first one.

So the regex should extract everything after source: up to and including .pdf, but without the surrounding quotes.

--

I've looked at other questions, but most answers refuse to provide a RegEx and instead say to use startswith() and endswith(). But since the page source could be massive, I'm worried about performance. I am new to Python, though, and perhaps I'm just not understanding how to use those methods.

1 comments

@Ruslan Saiko 2019-12-02 22:44:39

Here you go

import re
example = '''
    source: "https://www.website.com/dog.pdf"
    source: "https://www.website.com/cat.pdf"
'''
pattern = r'"(?P<url>.+?)"'
m = re.search(pattern, example)
url = m.group('url') # result is https://www.website.com/dog.pdf

Update:

To return the first link in double-quotes, the regular expression will look like this:

pattern = r'"(?P<url>https?://.+?)"'

If you need to find the first link in double-quotes that ends with .pdf, then the regular expression will be like this:

pattern = r'"(?P<url>https?://.+?\.pdf)"'
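
Putting the pieces together, here is a minimal sketch of a helper that returns the first .pdf URL or None. It is not part of the original answer: the name first_pdf_url is hypothetical, and the pattern assumes every link appears as source: "..." as in the question. It uses [^"\s] instead of . so that a match cannot run past a closing quote when several URLs share one line.

```python
import re

# Assumption: each link in the page source is written as  source: "..."
# [^"\s]+? keeps the match inside a single pair of double quotes.
PDF_SOURCE = re.compile(r'source:\s*"(?P<url>https?://[^"\s]+?\.pdf)"')


def first_pdf_url(page_source):
    """Return the first quoted .pdf URL that follows 'source:', or None."""
    m = PDF_SOURCE.search(page_source)
    return m.group('url') if m else None


example = '''
    source: "https://www.website.com/page.html" source: "https://www.website.com/dog.pdf"
    source: "https://www.website.com/cat.pdf"
'''

print(first_pdf_url(example))          # https://www.website.com/dog.pdf
print(first_pdf_url('no links here'))  # None
```

re.search stops at the first match, so the rest of a large page is not scanned once a link is found. Compiling the pattern once is only a convenience if the helper is called on many pages.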

@Toto 2019-12-03 09:58:58

And what do you get if there is more than one URL on a single line?

@Ruslan Saiko 2019-12-03 10:37:26

Updated the answer: added a lazy quantifier.
