I made some modifications to goFetch to get it to write to my validate.db file. It works on my site.
1. Everywhere the script says $db_spider_id_file_name, I replaced it with $db_links_id_file_name.
2. Everywhere the script says $db_spider_name, I replaced it with $db_valid_name.
Ok, so now when it spiders a page it reads how many links there are and updates the validate.db file.
But it isn't that simple. My database is pipe-delimited, so I had to change:

print SPIDER "$ID%%$mytitle%%$myurl%%$mydescrip%%$mykeywords%%$mysize%%$lastupd\n";

to:

print SPIDER "$ID|$mytitle|$myurl|$mydescrip|$mykeywords|$mysize|$lastupd\n";

You also have to match how many fields your database has. I have 17 fields, so I had to change it to:

print SPIDER "$ID|$mytitle|$myurl|$date||$mydescrip|Name Here|your\@email.com|||||||||$date\n";

Also note that $lastupd is changed to $date to produce the right date format. You also have to change:
use HTTP::Date;
$lastupd = time2str($res->last_modified);

to:

$date = &get_date;

(Note: I still get one problem with this, though. The first two links always have the infamous 1969 date!)
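I can only guess at the exact cause of the 1969 dates, but December 31, 1969 is usually the Unix epoch (a timestamp of 0 or undef) rendered in a US timezone, which is what you get when a server sends no usable Last-Modified header. A defensive guard like this, under that assumption, falls back to the current time (safe_timestamp is my own name, not part of the script):

```perl
use strict;
use warnings;

# Hypothetical guard: a zero or undefined timestamp formats as the
# Unix epoch -- Dec 31 1969 in US timezones -- which is the likely
# source of the "1969" dates. Fall back to the current time instead.
sub safe_timestamp {
    my ($t) = @_;
    return ($t && $t > 0) ? $t : time();
}

# In the spider you would guard the value before formatting it, e.g.:
#   my $ts = safe_timestamp($res->last_modified);
```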
C. Now you have to eliminate some characters from the title and description fields.
Beneath:
# Update the counter.
open (ID, ">$db_links_id_file_name") or &cgierr("error in get_defaults. unable to open id file: $db_links_id_file_name. Reason: $!");
flock(ID, 2) unless (!$db_use_flock);
print ID $ID; # update counter.
close ID; # automatically removes file lock
open (SPIDER, ">>$db_valid_name") or &cgierr ("Can't open for output counter file. Reason: $!");
if ($db_use_flock) { flock (SPIDER, 2) or &cgierr ("Can't get file lock. Reason: $!"); }

I added this:
$mydescrip =~ tr/|\n//d;
$mytitle =~ tr/|\n//d;

This removes the | character and line breaks from the title and description. Now it should enter everything into validate.db in the right slots, with no pipes or line breaks to screw up the fields.
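The two fixes above can be rolled into one small helper: strip the delimiter characters and line breaks from every field, pad the list out to the database's field count, and join with pipes. This is a sketch, not part of the script; make_record is my own name, and 17 is my field count from above, so adjust it to match your own .db file.

```perl
use strict;
use warnings;

# Hypothetical helper: sanitize each field, pad to the full field
# count, and join with the pipe delimiter. Set $NUM_FIELDS to match
# your own database layout.
my $NUM_FIELDS = 17;

sub make_record {
    my (@fields) = @_;
    push @fields, '' while @fields < $NUM_FIELDS;
    for (@fields) {
        $_ = '' unless defined $_;
        tr/|\n\r//d;    # delete delimiter chars and line breaks
    }
    return join('|', @fields) . "\n";
}

# Usage would then look something like:
#   print SPIDER make_record($ID, $mytitle, $myurl, $date, '', $mydescrip);
```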
Now I need a page to spider. The way I do it is with a shareware program called UrlSearch. I find a page with links I would like to add, save it to my hard drive, open it in UrlSearch, and eliminate the irrelevant links. Then I save it, upload it to my server, and spider it there. I usually only do 10 or 15 at a time because I still have to validate them to check the title and description. It is not often that both the title and description turn out OK: too many people do not have a description meta tag, and when the script builds a description from the page content it picks up JavaScript and other things it doesn't understand.
Sometimes it is easier to use the bookmarklet tool to add links because you can highlight your description on the page.
Too much rambling now. Anyway, I have it working and am now adding 20 or 30 relevant links to my site each week this way.
Mike
http://www.sweepstalk.com