Friday, December 7, 2007

Checking for Bad Links

A slight modification of this last script lets you check an entire document hierarchy (your own or someone else's) for bad links. Script I.3.3 traverses a document and checks each of its http:, ftp:, and gopher: links to see whether there's a response at the other end. Links that point to sub-documents are fetched and traversed as before, so you can check your whole site this way.

  % find_bad_links http://prego/apache-1.2/
checking http://prego/apache-1.2/...
checking http://prego/apache-1.2/manual/...
checking http://prego/apache-1.2/manual/misc/footer.html...
checking http://prego/apache-1.2/manual/misc/header.html...
checking http://prego/apache-1.2/manual/misc/nopgp.html...
checking http://www.yahoo.com/Science/Mathematics/Security_and_Encryption/...
checking http://www.eff.org/pub/EFF/Policy/Crypto/...
checking http://www.quadralay.com/www/Crypt/Crypt.html...
checking http://www.law.indiana.edu/law/iclu.html...
checking http://bong.com/~brian...
checking http://prego/apache-1.2/manual/cgi_path.html...
checking http://www.ics.uci.edu/pub/ietf/http/...
.
.
.
BAD LINKS:
manual/misc/known_bugs.html : http://www.apache.org/dist/patches/apply_to_1.2b6/
manual/misc/fin_wait_2.html : http://www.freebsd.org/
manual/misc/fin_wait_2.html : http://www.ncr.com/
manual/misc/compat_notes.html : http://www.eit.com/
manual/misc/howto.html : http://www.zyzzyva.com/robots/alert/
manual/misc/perf.html : http://www.software.hp.com/internet/perf/tuning.html
manual/misc/perf.html : http://www.qosina.com/~awm/apache/linux-tcp.html
manual/misc/perf.html : http://www.sun.com/sun-on-net/Sun.Internet.Solutions/performance/
manual/misc/perf.html : http://www.sun.com/solaris/products/siss/
manual/misc/nopgp.html : http://www.yahoo.com/Science/Mathematics/Security_and_Encryption/

152 documents checked
11 bad links
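
Before looking at the full listing, it may help to see the core fetch-and-extract pattern in isolation. The short sketch below is not part of Script I.3.3 (the file name is just for illustration); it fetches a single page with LWP::UserAgent, hands the content to HTML::LinkExtor, and prints each link as an absolute URL. That is the step the full script applies recursively.

 #!/usr/local/bin/perl

 # File: list_links.pl (illustrative sketch only)

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

my $url = shift or die "Usage: list_links.pl <url>\n";
my $ua  = LWP::UserAgent->new;
my $doc = $ua->request(HTTP::Request->new(GET => $url));
die "Couldn't fetch $url: ", $doc->status_line, "\n" unless $doc->is_success;

# Each link comes back as [tag, attribute => URL, ...]; resolve every
# URL against the document's base so relative links become absolute.
my $base = $doc->base;
foreach my $link (HTML::LinkExtor->new->parse($doc->content)->links) {
    my ($tag, %attrs) = @$link;
    print url($_, $base)->abs, "\n" foreach values %attrs;
}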

----------------------Script I.3.3 find_bad_links.pl--------------------

 #!/usr/local/bin/perl

 # File: find_bad_links.pl

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

# URL schemes we are willing to test
%CAN_HANDLE = ('http'   => 1,
               'gopher' => 1,
               # 'ftp' => 1,   # timeout problems?
              );
%OUTCOME = ();   # cache of URL => success/failure
$CHECKED = 0;    # number of documents checked
@BAD     = ();   # list of "document : bad link" strings

my $URL = shift;

$UA     = LWP::UserAgent->new;
$PARSER = HTML::LinkExtor->new();
$TOP    = $UA->request(HTTP::Request->new(HEAD => $URL));
$BASE   = $TOP->base;

check_links(URI::URL->new($TOP->request->url));
if (@BAD) {
    print "\nBAD LINKS:\n";
    print join("\n", @BAD), "\n\n";
}
print "$CHECKED documents checked\n", scalar(@BAD), " bad links\n";

# Check one URL; if it's a child of the starting document, fetch it,
# extract its links, and check each of them recursively.
sub check_links {
    my $url = shift;
    my $fixed_url = $url;
    $fixed_url =~ s/\#.+$//;        # strip the fragment, if any

    return 1 unless $CAN_HANDLE{$url->scheme};

    # return the cached outcome if we've seen this URL before
    return $OUTCOME{$fixed_url} if exists $OUTCOME{$fixed_url};

    print STDERR "checking $fixed_url...\n";
    $CHECKED++;

    my $rel   = $url->rel($BASE) || 'index.html';
    my $child = is_child($BASE, $url);
    $UA->timeout(5);
    # GET child documents so we can parse them; HEAD is enough for outside links
    my $doc = $UA->request(HTTP::Request->new(($child ? 'GET' : 'HEAD') => $url));
    $OUTCOME{$fixed_url} = $doc->is_success;

    return $OUTCOME{$fixed_url}
        unless $child && $doc->header('Content-type') eq 'text/html';

    # Follow HTML documents
    my $base = $doc->base;

    # pull out the links and call ourselves recursively
    my @links = $PARSER->parse($doc->content)->links;
    my @hrefs = map { url($_->[2], $base)->abs } @links;

    foreach (@hrefs) {
        next if check_links($_);
        push(@BAD, "$rel : $_");
    }
    1;
}

# A URL is a "child" of the base if its relative form doesn't
# climb back out of the base's directory tree.
sub is_child {
    my ($base, $url) = @_;
    my $rel = $url->rel($base);
    return ($rel ne $url) && ($rel !~ m!^[/.]!);
}
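
One caveat about the listing above: the Content-type comparison is an exact string match against "text/html", so a server that appends a charset parameter (for example "text/html; charset=ISO-8859-1") will not have its links followed. If that bites you, a small change, sketched here as a variant rather than part of the original script, is to use HTTP::Response's content_type method, which strips any parameters before the comparison:

    # more tolerant variant of the check inside check_links();
    # content_type() returns just the media type, minus parameters
    return $OUTCOME{$fixed_url}
        unless $child && $doc->content_type eq 'text/html';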
