By Jonathan Lyon

2009-09-17 14:43:10 8 Comments

Is it possible to find all the pages and links on ANY given website? I'd like to enter a URL and produce a directory tree of all the links from that site.

I've looked at HTTrack but that downloads the whole site and I simply need the directory tree.


@Hank Gay 2009-09-17 14:51:42

Check out linkchecker—it will crawl the site (while obeying robots.txt) and generate a report. From there, you can script up a solution for creating the directory tree.
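Turning a flat list of crawled URLs into a directory tree is a small scripting job. A minimal Python sketch (the URL list here is hypothetical; in practice you would feed it the URLs from linkchecker's report):

```python
from urllib.parse import urlparse

def build_tree(urls):
    """Fold a flat list of URLs into a nested dict keyed by path segment."""
    tree = {}
    for url in urls:
        node = tree
        for part in [p for p in urlparse(url).path.split('/') if p]:
            node = node.setdefault(part, {})
    return tree

def print_tree(node, indent=0):
    """Print the nested dict as an indented directory tree."""
    for name, child in sorted(node.items()):
        print('  ' * indent + name)
        print_tree(child, indent + 1)

# Hypothetical crawl output:
urls = [
    'https://example.com/docs/intro.html',
    'https://example.com/docs/api/index.html',
    'https://example.com/blog/2009/hello.html',
]
print_tree(build_tree(urls))
```

This ignores query strings and fragments; whether you want those as separate tree nodes depends on the site.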

@Jonathan Lyon 2009-09-17 15:08:05

thank you so much Hank! Perfect - exactly what I needed. Very much appreciated.

@Mateng 2011-11-14 20:42:56

A nice tool. I was using "Xenu's Link Sleuth" before. Linkchecker is far more verbose.

@Alan Coromano 2013-07-30 17:15:50

how do I do that myself? and what if there is no robots.txt in a web site?

@Hank Gay 2013-07-31 15:14:02

@MariusKavansky How do you manually crawl a website? Or how do you build a crawler? I'm not sure I understand your question. If there is no robots.txt file, that just means you can crawl to your heart's content.

@Adi Fatol 2013-11-26 22:28:49

And this is available in Ubuntu's repository (it actually works on Windows/Mac/Linux).

@Arash Saidi 2014-11-29 19:24:07

Such a great little program!

@ApexFred 2015-11-05 10:33:53

Hi guys, linkchecker hasn't worked for me. When I scan a site it only returns a very small report of broken links. It says it checked thousands of links, but I can't see where those are reported. I'm using version 9.3 — can you please help?

@Pandya 2018-10-09 09:25:39

How do I send the output to a file with --out or -o?

@ElectroBit 2015-01-05 22:03:52

If you have the developer console (JavaScript) open in your browser, you can type this code in:

urls = document.querySelectorAll('a'); urls.forEach(url => console.log(url.href));

Or, shortened:

$$('a').forEach(u => console.log(u.href))

@Pacerier 2015-02-25 00:56:13

What about "JavaScript-ed" URLs?

@ElectroBit 2015-04-03 20:53:48

Like what? What do you mean?

@Pacerier 2015-04-06 13:45:53

I mean a link created using JavaScript. Your solution wouldn't show it.

@zipzit 2016-05-28 17:32:18

@ElectroBit I really like it, but I'm not sure what I'm looking at? What is the $$ operator? Or is that just an arbitrary function name, same as n=ABC('a')? I'm not understanding how urls gets all the 'a' tagged elements. Can you explain? I'm assuming it's not jQuery. What prototype library function are we talking about?

@ElectroBit 2016-05-28 17:54:13

@zipzit In a handful of browsers, $$() is basically shorthand for document.querySelectorAll(). More info at this link:…

@Lothar 2017-12-05 21:03:31

There is no complete computable solution to traversing JavaScript-generated URLs beyond some very rudimentary attempts. At least this tip works with the DOM and not the HTML source.

@user4318981 2014-12-03 07:42:45

function getalllinks($url) {
    $links = array();
    if ($fp = fopen($url, 'r')) {
        // Read the whole document first, then scan it once.
        $content = '';
        while ($line = fread($fp, 1024)) {
            $content .= $line;
        }
        fclose($fp);
        $startPos = 0;
        // Walk through every '<a ' tag and pull out the quoted href value.
        while (($spos = strpos($content, '<a ', $startPos)) !== false) {
            $spos = strpos($content, 'href', $spos);
            if ($spos === false) break;
            $spos = strpos($content, '"', $spos);
            if ($spos === false) break;
            $spos++;
            $epos = strpos($content, '"', $spos);
            if ($epos === false) break;
            $startPos = $epos + 1;
            $link = substr($content, $spos, $epos - $spos);
            if (strpos($link, 'http://') !== false) $links[] = $link;
        }
    }
    return $links;
}

Try this code.

@Kevin Brown 2015-03-06 00:12:06

While this answer is probably correct and useful, it is preferred if you include some explanation along with it to explain how it helps to solve the problem. This becomes especially useful in the future, if there is a change (possibly unrelated) that causes it to stop working and users need to understand how it once worked.

@ElectroBit 2015-05-03 18:29:40

Eh, it's a little long.

@JamesH 2015-06-26 12:30:11

Completely unnecessary to parse the HTML in this manner in PHP. PHP does have the ability to understand the DOM!

@Mohamm6d 2016-10-04 13:59:36

It worked for me, thanks.

@mizubasho 2009-09-17 15:17:47

If this is a programming question, then I would suggest you write your own regular expression to parse all the retrieved content. Target tags are IMG and A for standard HTML. For Java,

final String openingTags = "(<a [^>]*href=['\"]?|<img[^> ]* src=['\"]?)";

this along with Pattern and Matcher classes should detect the beginning of the tags. Add LINK tag if you also want CSS.
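As a loose sketch of the same idea (the answer's pattern targets Java's Pattern/Matcher classes; this is a Python transliteration, and the capture group is my addition so that findall() returns the URL values themselves):

```python
import re

# Rough Python equivalent of the Java openingTags pattern above, with a
# capture group added to pull out the href/src value (a sketch only --
# real-world HTML will defeat this, as the caveat below notes).
LINK_RE = re.compile(r"""<(?:a\s[^>]*href|img\s[^>]*src)=['"]?([^'"\s>]+)""",
                     re.IGNORECASE)

html = '<p><a href="/docs/">Docs</a> <img src="logo.png"></p>'
print(LINK_RE.findall(html))  # → ['/docs/', 'logo.png']
```

Note this only handles quoted or simple unquoted attribute values, and will miss attributes in a different order or split across lines.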

However, it is not as easy as you may have initially thought. Many web pages are not well-formed. Extracting all the links that a human being can "recognize" is really difficult if you need to take all the irregular markup into account.

Good luck!

@dimo414 2013-05-29 05:47:10

No no no no, don't parse HTML with regex, it makes Baby Jesus cry!
