Gossamer Forum
Home : Products : DBMan : Installation :

importing existing file into DBMAN format

Quote Reply
importing existing file into DBMAN format
This isn't as much a DBMAN question as it is a perl question, but I need to take my existing file, which is a multi-line HTML record based file, and convert it to the flat file format of DBMAN.

Does anyone have a quick and dirty (or clean!) solution to this?

One day I hope to be as good at perl as many of you obviously are.
TIA

Jamie


------------------
James A. (Jamie) Dennis
*Certified AIX Administrator
*Certified Solaris Administrator
*Member - SAGE, USENIX
Quote Reply
Re: importing existing file into DBMAN format In reply to
You might be able to do a "search-and-replace" in a text editor, depending on how the html is set up.

I started with something similar and ended up editing more than 500 records by hand. Not fun, but you only have to do it once!

Wish I could have been of more help. Maybe someone else can give you more hope. Smile

------------------
JPD
Quote Reply
Re: importing existing file into DBMAN format In reply to
If you post what a typical record likes, I'm sure 4 or 5 lines of perl could do the conversion for you!

Cheers,

Alex
Quote Reply
Re: importing existing file into DBMAN format In reply to
Here is a copy of one of the records:

<H3><FONT COLOR=BLUE><A NAME=Dennis>James A Dennis</font></H3>
E-mail address: <<a HRef="mailto:jdennis@netset.com"><b>jdennis@netset.com</
a>></b><br>
Home Page Address:<a HRef="http://www.netset.com/~jdennis">http://www.netset.co
m/~jdennis</a><br>
Address:<b> 6208 Wrothston Drive, Columbus, OH USA 43228</b><BR>
Employer:<b> Dennis Consulting Services, Inc.</b><br>
Occupation:<b> Senior Consultant/President</b><br>
Comments:<b> Married 17 years to my gorgeous wife Darla, 2 sons, Ryan age 14, J
ared age 11.
</b><br>
Visited here via the CHS Alumni Page on: Thu Aug 13 10:52:55 EDT 1998
<HR>

<H3><FONT COLOR=BLUE><A NAME=Dennis>James A Dennis</font></H3>
E-mail address: <<a HRef="mailto:jdennis@netset.com"><b>jdennis@netset.com</
a>></b><br>
Home Page Address:<a HRef="http://www.netset.com/~jdennis">http://www.netset.co
m/~jdennis</a><br>
No Street address given:<b>City: Columbus, OH USA 43228</b><BR>
Employer:<b> Dennis Consulting Services</b><br>
Occupation:<b> President</b><br>
Comments:<b> Recently went into consulting full time after a few years of doing
it on the side. Affiliated with Great Northern Consulting Services
of Westerville, OH, the leading reseller for Sun Microsystems in the
Midwest.
</b><br>
Visited here via the CHS Alumni Page on: Tue Jul 21 23:04:34 EDT 1998
<HR>

As you can see, each record begins with the <H3> line, and ends with the <HR>

Thanks

Any help appreciated.

Jamie


------------------
James A. (Jamie) Dennis
*Certified AIX Administrator
*Certified Solaris Administrator
*Member - SAGE, USENIX
Quote Reply
Re: importing existing file into DBMAN format In reply to
Get arachnophilia.

http://www.arachnoid.com/...p/ftp/arach_full.zip .
It's 1.6mb but worth the download. It's a great little text/html editor that is so powerful, when put together with Dreamweaver, they make a killer HTML developer setup.
Some features that I know:
syntax coloring
auto correct indentation for programs (perl, c++)
MOST IMPORTANTLY: SEARCH AND REPLACE across multiple files.
And a few more goodies I'm not aware of.

Anyway, to get back to your problem:

In excel, or arachnophilia or any other text editor with replace capabilities: search and replace the following -


<H3><FONT COLOR=BLUE><A NAME= #remove (replace with nothing)

</font></H3>E-mail address: <<a HRef="mailto: #replace with | (a pipe sign)

"><b> #remove

</a>></b><br>Home Page Address:<a HRef=" #replace with |

"> #remove

</a><br>Address:<b> #replace with |

</b><BR>Employer:<b> #replace with |

</b><br>Occupation:<b> #replace with |

</b><br>Comments:<b> #replace with |

</b><br>Visited here via the #replace with |

on: #replace with |

<HR> #replace with "|\n" (in arach, you can specify \n and it will insert a linebreak.)

at last, take care of the only > sign that was left in there near <a name=# stuff.
> # replace with |

So now, you have the following data:
Dennis|James A Dennis|jdennis@netset.com|jdennis@netset.com|http://www.netset.com/~jdennis|http://www.netset.com/~jdennis|6208 Wrothston Drive, Columbus, OH USA 43228|Dennis Consulting Services, Inc.|Married 17 years to my gorgeous wife Darla, 2 sons, Ryan age 14, J ared age 11.|CHS Alumni Page|Thu Aug 13 10:52:55 EDT 1998|

Then, it's a matter of opening it up in a spreadsheet program and removing the extra EMAIL and URL entries as a whole column. Remember, there are two entries for each email and url. Remove one. Do this after all your data has been recompiled. Then just issue command to delete whole column.

And the final heading would be:
AName|Name|Email|Homepage|Address|Employer|Occupation|Comments|Referer|Date_and_Time|

Now if you want the address parsed up furhter more into city, state, zip, etc. That may be asking a little too much. But after analyzing if all the addresses follow the same format (same amount of commas, same country, etc), you can pretty search and replace your way to get there, too.

Good luck.
Quote Reply
Re: importing existing file into DBMAN format In reply to
Get arachnophilia.

http://www.arachnoid.com/...p/ftp/arach_full.zip .
It's 1.6mb but worth the download. It's a great little text/html editor that is so powerful, when put together with Dreamweaver, they make a killer HTML developer setup.
Some features that I know:
syntax coloring
auto correct indentation for programs (perl, c++)
MOST IMPORTANTLY: SEARCH AND REPLACE across multiple files.
And a few more goodies I'm not aware of.

Anyway, to get back to your problem:

In excel, or arachnophilia or any other text editor with replace capabilities: search and replace the following -


<H3><FONT COLOR=BLUE><A NAME= #remove (replace with nothing)

</font></H3>E-mail address: <<a HRef="mailto: #replace with | (a pipe sign)

"><b> #remove

</a>></b><br>Home Page Address:<a HRef=" #replace with |

"> #remove

</a><br>Address:<b> #replace with |

</b><BR>Employer:<b> #replace with |

</b><br>Occupation:<b> #replace with |

</b><br>Comments:<b> #replace with |

</b><br>Visited here via the #replace with |

on: #replace with |

<HR> #replace with "|\n" (in arach, you can specify \n and it will insert a linebreak.)

at last, take care of the only > sign that was left in there near <a name=# stuff.
> # replace with |

So now, you have the following data:
Dennis|James A Dennis|jdennis@netset.com|jdennis@netset.com|http://www.netset.com/~jdennis|http://www.netset.com/~jdennis|6208 Wrothston Drive, Columbus, OH USA 43228|Dennis Consulting Services, Inc.|Married 17 years to my gorgeous wife Darla, 2 sons, Ryan age 14, J ared age 11.|CHS Alumni Page|Thu Aug 13 10:52:55 EDT 1998|

Then, it's a matter of opening it up in a spreadsheet program and removing the extra EMAIL and URL entries as a whole column. Remember, there are two entries for each email and url. Remove one. Do this after all your data has been recompiled. Then just issue command to delete whole column.

And the final heading would be:
AName|Name|Email|Homepage|Address|Employer|Occupation|Comments|Referer|Date_and_Time|

Now if you want the address parsed up furhter more into city, state, zip, etc. That may be asking a little too much. But after analyzing if all the addresses follow the same format (same amount of commas, same country, etc), you can pretty search and replace your way to get there, too.

Good luck.
Quote Reply
Re: importing existing file into DBMAN format In reply to
This is a good thought; however, I would like to do it with an awk, perl, or sed script, something which can be done non-interactively, as there are about 50 of these files.

Maybe I can create an excel macro to do it?



------------------
James A. (Jamie) Dennis
*Certified AIX Administrator
*Certified Solaris Administrator
*Member - SAGE, USENIX
Quote Reply
Re: importing existing file into DBMAN format In reply to
This is a good thought; however, I would like to do it with an awk, perl, or sed script, something which can be done non-interactively, as there are about 50 of these files.

Maybe I can create an excel macro to do it?



------------------
James A. (Jamie) Dennis
*Certified AIX Administrator
*Certified Solaris Administrator
*Member - SAGE, USENIX
Quote Reply
Re: importing existing file into DBMAN format In reply to
How about:

open (IN, "<data.txt") or die $!;
open (OUT, ">out.db") or die $!;
$/ = "\n\n"; # Get data by paragraphs.
while (<IN> ) {
$data = $_;
$count++;
$data =~ s/\n//g;
($name) = $data =~ m,<H3><FONT COLOR=BLUE><A NAME=.+?>(.+?)</font></H3>,m;
($email) = $data =~ m,E-mail address: <<a HRef="mailto Frown.+?)">,m;
($url) = $data =~ m,Home Page Address:<a HRef="(.+?)">,m;
($addr) = $data =~ m,Address:<b>(.+?)</b><BR>,m;
$addr &#0124; &#0124; (($addr) = $data =~ m,No Street address given:<b>(.+?)</b><BR>,m);
($employ) = $data =~ m,Employer:<b>(.+?)</b><br>,m;
($occup) = $data =~ m,Occupation:<b>(.+?)</b><br>,m;
($comment) = $data =~ m,Comments:<b>(.+?)</b><br>,m;
($visit, $time) = $data =~ m,Visited here via the (.+?) on: (.+?)\s*<HR>,m;
$line = "$count|$name|$email|$url|$addr|$employ|$occup|$comment|$visit|$time";
$line =~ s/\n/``/g;
print OUT $line, "\n";
}
close IN;
close OUT;

Works for the two you gave..

Hope that helps,

Alex
Quote Reply
Re: importing existing file into DBMAN format In reply to
Replace the smiley with a : followed by a (.

Cheers,

Alex
Quote Reply
Re: importing existing file into DBMAN format In reply to
Alex

When at first I tried, perl bombed out with this error:

root@nadia->perl extractor
syntax error at extractor line 13, near "| |"
Execution of extractor aborted due to compilation errors.

So, in checking, it appeared that the pipe signs were separated by a space, so I removed the space and tried again. This time it bombed with this error:

Use of uninitialized value at extractor line 18, <IN> chunk 1.
Use of uninitialized value at extractor line 18, <IN> chunk 1.
Use of uninitialized value at extractor line 18, <IN> chunk 2.
Use of uninitialized value at extractor line 18, <IN> chunk 2.
Use of uninitialized value at extractor line 18, <IN> chunk 2.
Use of uninitialized value at extractor line 18, <IN> chunk 2.
Use of uninitialized value at extractor line 18, <IN> chunk 2.
Use of uninitialized value at extractor line 18, <IN> chunk 2.
Use of uninitialized value at extractor line 18, <IN> chunk 2.
Use of uninitialized value at extractor line 18, <IN> chunk 3.
Use of uninitialized value at extractor line 18, <IN> chunk 4.
Use of uninitialized value at extractor line 18, <IN> chunk 4.
Use of uninitialized value at extractor line 18, <IN> chunk 4.
Use of uninitialized value at extractor line 18, <IN> chunk 4.
Use of uninitialized value at extractor line 18, <IN> chunk 5.
Use of uninitialized value at extractor line 18, <IN> chunk 5.
Use of uninitialized value at extractor line 18, <IN> chunk 5.
Use of uninitialized value at extractor line 18, <IN> chunk 5.
Use of uninitialized value at extractor line 18, <IN> chunk 5.
Use of uninitialized value at extractor line 18, <IN> chunk 5.
Use of uninitialized value at extractor line 18, <IN> chunk 5.
Use of uninitialized value at extractor line 18, <IN> chunk 5.
Use of uninitialized value at extractor line 18, <IN> chunk 5.


Line 18 is this one:

$line = "$count|$name|$email|$url|$addr|$employ|$occup|$comment|$visit|$time";

It looks to me like these are all defined.

Here is exactly how extractor looks now:

#!/usr/local/bin/perl -w
open (IN, "<data.txt") or die $!;
open (OUT, ">out.db") or die $!;
$/ = "\n\n"; # Get data by paragraphs.
while (<IN> ) {
$data = $_;
$count++;
$data =~ s/\n//g;
($name) = $data =~ m,<H3><FONT COLOR=BLUE><A NAME=.+?>(.+?)</font></H3>,m;
($email) = $data =~ m,E-mail address: <<a HRef="mailto Frown.+?)">,m;
($url) = $data =~ m,Home Page Address:<a HRef="(.+?)">,m;
($addr) = $data =~ m,Address:<b>(.+?)</b><BR>,m;
$addr &#0124; &#0124; (($addr) = $data =~ m,No Street address given:<b>(.+?)</b><BR>,m);
($employ) = $data =~ m,Employer:<b>(.+?)</b><br>,m;
($occup) = $data =~ m,Occupation:<b>(.+?)</b><br>,m;
($comment) = $data =~ m,Comments:<b>(.+?)</b><br>,m;
($visit, $time) = $data =~ m,Visited here via the (.+?) on: (.+?)\s*<HR>,m;
$line = "$count|$name|$email|$url|$addr|$employ|$occup|$comment|$visit|$time";
$line =~ s/\n/``/g;
print OUT $line, "\n";
}
close IN;
close OUT;


Thanks for all the help!!!

Jamie
Quote Reply
Re: importing existing file into DBMAN format In reply to
Hi Jamie,

The unitilized error is ok. It means you are running perl with -w (show warnings). When the warning is displayed, it just means that the script couldn't parse (perhaps it was empty) one of the fields, and when I print it, it's still undefined.

Did the conversion work otherwise? As for the space in the pipes, UBB has a nasty habit of doing that. Frown

Cheers,

Alex
Quote Reply
Re: importing existing file into DBMAN format In reply to
Alex

When I first ran it, it did not work at all, just made a file made of pipe signs. I re-copied the code this morning and tried again, and this time it "mostly" worked, although it didn't get some of the records.

I think the biggest problem I am going to have is inconsistency in the data from when I originally started the page to when I finished the production version. It seems that some of the files are formatted incorrectly, although I thought I had gone back and fixed them all.

Oh, well. Maybe I'll run this script to create the new files, then go back and validate the data. If it loads at least 75% correctly (not the fault of your code for sure!), then I can process the remaining.

Thanks for all your help!

Jamie