By Jonathan Lyon


2009-09-17 14:43:10

Is it possible to find all the pages and links on ANY given website? I'd like to enter a URL and produce a directory tree of all links from that site.

I've looked at HTTrack but that downloads the whole site and I simply need the directory tree.


@Hank Gay 2009-09-17 14:51:42

Check out linkchecker—it will crawl the site (while obeying robots.txt) and generate a report. From there, you can script up a solution for creating the directory tree.
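The "script up a solution" step could look something like the following. This is a minimal Python sketch, assuming you have already exported the crawled URLs as a flat list (the example URLs here are made up); it nests the URL paths into a dict-of-dicts and prints them as an indented tree:

```python
# Sketch: turn a flat list of crawled URLs (e.g. exported from a
# crawler's report) into a nested directory tree. Sample URLs are
# placeholders, not real crawl output.
from urllib.parse import urlparse

def build_tree(urls):
    """Nest URL path segments into a dict-of-dicts, keyed by host."""
    tree = {}
    for url in urls:
        parsed = urlparse(url)
        node = tree.setdefault(parsed.netloc, {})
        for part in [p for p in parsed.path.split('/') if p]:
            node = node.setdefault(part, {})
    return tree

def print_tree(node, indent=0):
    """Print the nested dict as an indented directory listing."""
    for name, child in sorted(node.items()):
        print('  ' * indent + name)
        print_tree(child, indent + 1)

urls = [
    'http://example.com/news/uk',
    'http://example.com/news/world',
    'http://example.com/sport',
]
print_tree(build_tree(urls))
```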

@Jonathan Lyon 2009-09-17 15:08:05

thank you so much Hank! Perfect - exactly what I needed. Very much appreciated.

@Mateng 2011-11-14 20:42:56

A nice tool. I was using "Xenu Link Sleuth" before. Linkchecker is far more verbose.

@Alan Coromano 2013-07-30 17:15:50

How do I do that myself? And what if there is no robots.txt on a website?

@Hank Gay 2013-07-31 15:14:02

@MariusKavansky How do you manually crawl a website? Or how do you build a crawler? I'm not sure I understand your question. If there is no robots.txt file, that just means you can crawl to your heart's content.

@Adi Fatol 2013-11-26 22:28:49

And this is available in Ubuntu's repository (it actually works on Windows/Mac/Linux).

@Arash Saidi 2014-11-29 19:24:07

Such a great little program!

@ApexFred 2015-11-05 10:33:53

Hi guys, linkchecker has not worked for me. When I scan the site it only returns a very small report of broken links. It says that it checked thousands of links, but I can't see where those are reported. Using version 9.3. Can you please help?

@Pandya 2018-10-09 09:25:39

how to send output to file with --out or -o?

@ElectroBit 2015-01-05 22:03:52

If you have the developer console (JavaScript) in your browser, you can type this code in:

urls = document.querySelectorAll('a'); for (url of urls) console.log(url.href);

Shortened:

for(u of $$('a'))console.log(u.href)

@Pacerier 2015-02-25 00:56:13

What about "Javascript-ed" urls?

@ElectroBit 2015-04-03 20:53:48

Like what? What do you mean?

@Pacerier 2015-04-06 13:45:53

I mean a link done using Javascript. Your solution wouldn't show it.

@zipzit 2016-05-28 17:32:18

@ElectroBit I really like it, but I'm not sure what I'm looking at. What is the $$ operator? Or is that just an arbitrary function name, same as n=ABC('a')? I'm not understanding how urls gets all the 'a'-tagged elements. Can you explain? I'm assuming it's not jQuery. What prototype library function are we talking about?

@ElectroBit 2016-05-28 17:54:13

@zipzit In a handful of browsers, $$() is basically shorthand for document.querySelectorAll(). More info at this link: developer.mozilla.org/en-US/docs/Web/API/Document/…

@Lothar 2017-12-05 21:03:31

There is no complete computable solution for traversing JavaScripted URLs, beyond some very rudimentary attempts. At least this tip works with the DOM and not the HTML source.

@user4318981 2014-12-03 07:42:45

function getalllinks($url) {
    $links = array();
    $content = '';
    if ($fp = fopen($url, 'r')) {
        while ($line = fread($fp, 1024)) {
            $content .= $line;
        }
        fclose($fp);
    }
    if (strlen($content) > 10) {
        $startPos = 0;
        // Scan for each <a ...> tag and pull out the quoted href value.
        while (($spos = strpos($content, '<a ', $startPos)) !== false) {
            $spos = strpos($content, 'href', $spos);
            if ($spos === false) break;
            $spos = strpos($content, '"', $spos);
            if ($spos === false) break;
            $spos++;
            $epos = strpos($content, '"', $spos);
            if ($epos === false) break;
            $startPos = $epos;
            $link = substr($content, $spos, $epos - $spos);
            if (strpos($link, 'http://') !== false) $links[] = $link;
        }
    }
    return $links;
}

Try this code.

@Kevin Brown 2015-03-06 00:12:06

While this answer is probably correct and useful, it is preferred if you include some explanation along with it to explain how it helps to solve the problem. This becomes especially useful in the future, if there is a change (possibly unrelated) that causes it to stop working and users need to understand how it once worked.

@ElectroBit 2015-05-03 18:29:40

Eh, it's a little long.

@JamesH 2015-06-26 12:30:11

Completely unnecessary to parse the HTML in this manner in PHP. PHP does have the ability to understand the DOM: php.net/manual/en/class.domdocument.php

@Mohamm6d 2016-10-04 13:59:36

It worked for me, thanks.
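JamesH's point is about PHP's DOMDocument, but it applies in any language: feed the page to a real HTML parser instead of scanning for substrings. A minimal illustration with Python's standard-library html.parser (the sample HTML snippet here is made up):

```python
# Extract hrefs with a real HTML parser rather than strpos()-style
# string scanning; uses only the Python standard library.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

html = '<p><a href="http://example.com/a">A</a> <a href="/b">B</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['http://example.com/a', '/b']
```

Unlike the substring scan above, this handles attribute order, extra whitespace, and single-quoted attributes for free.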

@John Magnolia 2012-03-23 08:43:29

Or you could use Google to display all the pages it has indexed for this domain. E.g: site:www.bbc.co.uk

@Mbarry 2013-04-01 22:57:58

But if you use extra search operators in Google such as site: or intitle:, you'll be restricted to about 700 entries, even if the top of the results page claims far more, e.g. "About 87,300 results (0.73 seconds)".

@Pacerier 2015-04-06 13:46:24

@Mbarry, And how do you know that?

@Zon 2016-04-07 15:23:04

It is easy to verify. Try paging 30-50 pages ahead through the search results and you will soon hit the end, instead of the thousands of results claimed for "site:www.bbc.co.uk".

@Lothar 2017-12-05 21:01:21

Even on normal searches, Google now returns no more than about 400 results.

@mizubasho 2009-09-17 15:17:47

If this is a programming question, then I would suggest you write your own regular expression to parse all the retrieved content. Target tags are IMG and A for standard HTML. For Java,

final String openingTags = "(<a [^>]*href=['\"]?|<img[^> ]* src=['\"]?)";

this, along with the Pattern and Matcher classes, should detect the beginning of the tags. Add the LINK tag if you also want CSS.

However, it is not as easy as you may have initially thought. Many web pages are not well-formed. Extracting all the links that a human being can "recognize" is really difficult if you need to take all the irregular cases into account.

Good luck!

@dimo414 2013-05-29 05:47:10

No no no no, don't parse HTML with regex, it makes Baby Jesus cry!
