lvrfy: An HTML Link Verifier
Version 1.6
28 November 1995
lvrfy is a script that verifies all the internal links in HTML pages
on your server. Its operation is rather simple: it starts with one
page, extracts all the links (including inline images), and then
recursively checks each one.
Its greatest shortcoming is that it is slow. It averaged 7.5 seconds
per file on our server; with 1584 files, that adds up to over 3 hours.
I was told that it can process 10000 links to 4000 pages in about 1.5
hours on a Sparc 1000 with dual 75MHz CPUs.
This is a regular shell script. Just make it executable, and you're
ready to go. It assumes that the following programs are in your path:
sed, awk, csh, touch, and rm.
Note the copyright. If you make major changes, or wish to distribute a
modified version, you need to contact me.
Command-Line Execution
lvrfy startURL fromURL OKfile BADfile OFFSITEfile
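For example, a first run over a hypothetical server might look like
this (the starting URL and output file names here are made up for
illustration):

    ./lvrfy /~crow/ placeholder /tmp/lvrfy.ok /tmp/lvrfy.bad /tmp/lvrfy.offsite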
The parameters are:
startURL
- The page to verify. This is a partial URL, in that the server
name and protocol are not listed. If you wanted to start with
"http://www.cs.dartmouth.edu/~crow/" you would use "/~crow/" as the
first parameter.
fromURL
- This is used in the recursive call. If you specify a valid page
for startURL, any placeholder text will do.
OKfile
- This is the file in which pages that have been successfully found
are listed. Any pages listed in this file are assumed to have already
been processed. Experienced users may use this file to prune portions
of the server's pages from verification. Inexperienced users should
ensure that this file does not initially exist, to ensure a complete
scan of the document tree. You may not specify /dev/null for this parameter.
- This file may also be used to find files that are not reachable from
the starting page, by comparing it with the output of "find
ROOT -name '*.*' -print". Sort the two files, grep out user
directories, and then 'diff' them (see the first sketch after this
parameter list). Symbolic links may confuse the results.
BADfile
- This is the file in which broken links are recorded. If it
already exists, the results will be appended, so in general it should
not initially exist. Note that upon completion, some entries may be
duplicated if there are multiple copies of the bad link. You may
specify /dev/null for this parameter, though that would be silly.
OFFSITEfile
- This is the file in which HTTP links to other servers are
recorded. No verification of these links is attempted, though it
would be trivial to 'awk' this file into a script that runs lynx to
attempt loading each page (see the second sketch after this list).
Note that entries are not necessarily
unique; the number of entries for a URL indicates the number of links
to it. You may specify /dev/null for this parameter.
Customization
You will probably want to customize several variables within the
script.
- 'SERVER' specifies the name of the server that you are using.
This is helpful in detecting links to the same server, even if they
specify the full URL.
- 'SLASH' is the server's root directory.
- 'PUBLIC' is the name of the subdirectory within a user's account for
/~uid/ links. You can intentionally set this wrong to stop the
program from checking personal pages, though you'll have to grep
some garbage out of the bad link file.
- 'TMP' is the directory where temporary files are saved. It
shouldn't need too much space.
- 'MAXNEST' is the maximum nesting level for the recursion. If you
have trouble with the script filling up the process table, use a
smaller number.
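For reference, the settings near the top of the script might look
something like this (these values are examples for a hypothetical
site, not the script's shipped defaults):

    SERVER=www.cs.dartmouth.edu    # this server's hostname
    SLASH=/usr/local/www/docs      # the server's root directory
    PUBLIC=public_html             # subdirectory used for /~uid/ links
    TMP=/tmp                       # where temporary files are saved
    MAXNEST=6                      # maximum recursion depth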
Results
If everything works correctly, then the script will not produce any
output to stdout or stderr, but you should watch for potential errors.
The output files may contain the following types of entries:
OKfile
/usr/moosilauke/www/docs/index.html /~crow/
/usr/moosilauke/www/docs/images/dartgreeting3a.gif /
/usr/toe/grad/crow/public_html/lego/empire.html /~crow/
These entries reflect the pages /, /images/dartgreeting3a.gif, and
/~crow/lego/empire.html, given the correct values for the 'SLASH' and
'PUBLIC' variables. The second field on each line is the URL of the
page from which that page was first found.
BADfile
Link to non-existant page /~samr/CS68.95W/Homework/ from /~samr/CS68.95W/
Link to non-existant page /TR/Search/ from /tr/home.html
Link to unreadable page /~cowen/schedule.html from /~cowen/homepage.html
Link to unreadable page /~cowen/schedule.html from /~cowen/
Link to server-generated index page /~rajendra/News/oldnews/ from /~rajendra/News/News.html
Here we have five bad links. The first one really is bad. The
second one is due to the failure of lvrfy to recognize that /TR/ is an
aliased directory, so the link is valid. The third entry indicates
that the file wasn't readable by the user who ran lvrfy (you should
run it as the same user that your server executes as, if possible)--in
this case it was intentional. The fourth entry indicates that
'homepage.html' is the same as 'index.html,' thanks to the magic of
symbolic links. The fifth entry is more of a warning than an error.
OFFSITEfile
http://www.dartmouth.edu/ /usr/windsor2/www/docs/phd_program.html
http://www.house.gov/ /u/crow/public_html/~crow/index.html
http://legowww.homepages.com/text/FAQ.html /u/crow/public_html/~crow/index.html
These are just standard URLs and the file in which they were found.
How it Works
The code is rather hard to read--it really should have been a single
perl program, but I haven't learned perl yet. Instead, it mostly uses
sed to parse each HTML document and extract all the links; the result
is eventually converted into a shell script. This produces a
depth-first search, leaving behind a sleeping process for each level
in the search.
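To give the flavor of the approach, a minimal sed pipeline in the same
spirit (an illustration only, not the script's actual sed program)
might pull HREF and SRC values out of a page like so:

    # Crude: prints at most one HREF and one SRC per input line,
    # so it misses multiple links on a line; the real parser does more.
    sed -n -e 's/.*[Hh][Rr][Ee][Ff]="\([^"]*\)".*/\1/p' \
           -e 's/.*[Ss][Rr][Cc]="\([^"]*\)".*/\1/p' page.html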
The recursion is halted if the nesting level gets too large. The
pending checks are then processed by the top level. I believe that
setting the maximum nesting level to 1 would result in a breadth-first
search. I haven't studied the performance impact of doing this.
Known Bugs
Despite the problems listed below, I've found the script quite useful.
- Doesn't handle tags in comments correctly. It may fault on:
<!-- <LI>foo -->, or otherwise get confused.
- Doesn't handle unclosed tags, and may seg. fault if it
finds them.
- May leave files in /tmp when it doesn't complete successfully.
- Doesn't recognize ScriptAlias directories, so links to aliased
files will be reported as bad.
- Symbolic links to directories may cause an infinite recursion
when combined with restrictive directory permissions which
prevent 'pwd' from working.
- Certain pathological file or directory names may confuse
it, but these should be quite rare.
Warning: This script isn't secure, and shouldn't be run as
root.
I'm not sure whether it is possible for a carefully constructed
pathological case to misdirect the script, causing unexpected or
dangerous side effects.
Created by Preston Crow