Gossamer Forum: General: Perl Programming: Need help with parsing

Jun 13, 2002, 4:24 PM

AlexBGD

Novice (7 posts)

Post #1 of 16

Need help with parsing

I need an example of how to parse the visible text of a link.
<a href="www.something.com">I NEED THIS PART</a>
I'm just starting to learn Perl, and I really need help on this one.
Thanks!

Jun 13, 2002, 5:19 PM

Wil

Veteran / Moderator (4108 posts)

Post #2 of 16

Re: [AlexBGD] Need help with parsing

use HTML::Parser;

One of the examples which comes with this module does exactly what you are asking for.

Some might suggest a regex for you. If you are unfamiliar with the input, that is, if the HTML code has been written by someone else, then be sure to use this module. It will catch all those nasty little things you won't expect.

Cheers

- wil
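
For readers who want to see roughly what the module route looks like, here is a minimal sketch using the HTML::Parser version 3 event handlers. It is not the example shipped with the module. The helper name link_texts, and the choice to return undef when HTML::Parser is not installed, are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collect the visible text of
# every <a href=...> element. Returns an array ref, or undef when
# HTML::Parser is not installed.
sub link_texts {
    my ($html) = @_;
    return undef unless eval { require HTML::Parser; 1 };

    my @texts;          # finished link texts
    my $in_link = 0;    # true while inside an <a href> element
    my $buffer  = '';   # text collected for the current link

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Tag names and attribute names arrive lowercased.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' && defined $attr->{href}) {
                $in_link = 1;
                $buffer  = '';
            }
        }, 'tagname, attr' ],
        # 'dtext' is the text with entities such as &amp; decoded.
        text_h => [ sub { $buffer .= $_[0] if $in_link }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'a' && $in_link) {
                push @texts, $buffer;
                $in_link = 0;
            }
        }, 'tagname' ],
    );

    $parser->parse($html);
    $parser->eof;
    return \@texts;
}

my $texts = link_texts('<a href="www.something.com">I NEED THIS PART</a>');
print defined($texts) ? join("\n", @$texts) . "\n" : "HTML::Parser not installed\n";
```

Because the parser tracks tags rather than characters, line breaks or extra attributes inside the link should not need any special handling here.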

Jun 14, 2002, 1:53 AM

Paul

Veteran (19537 posts)

Post #3 of 16

Re: [AlexBGD] Need help with parsing

Wil is obviously trying to discourage people from using suggestions other than his own ;)

....however this regex will do the trick without the overhead of the module, or the inconvenience of having to install it if you don't have it.

Code:

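#!/usr/bin/perl
use strict;
use warnings;

my $your_html = '<a href="www.something.com">I NEED THIS PART</a>';

# Replace the whole link with its visible text (the non-greedy capture in $1).
$your_html =~ s|<a.*?href=[^>]+>(.*?)</a>|$1|i;

print "Content-type: text/html\n\n";
print $your_html;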

Quote:

If you are unfamiliar with the input, that is, if the HTML code has been written by someone else, then be sure to use this module. It will catch all those nasty little things you won't expect.

Show me a properly formatted hyperlink that my regex above won't catch and then I'll start recommending HTML::Parser.

Last edited by:

Paul: Jun 14, 2002, 1:58 AM

Jun 14, 2002, 3:19 AM

Wil

Veteran / Moderator (4108 posts)

Post #4 of 16

Re: [Paul] Need help with parsing

How about if your HTML wraps over more than one line? You haven't put any flags in your regex to indicate a match over a line break.

What happens when I want to put a TARGET=_blank in my link?

What about if I want to put a NAME=foo in my link?

Case proven and closed. ;-)

Shouldn't that be a\s+href anyway?

A few comments on your choice of regex, though...

Tsk. Tsk... Death to .* !! (!!!) Use a negated character class like you do in the first part of your regex. .* is evil, evil and greedy. You should try and avoid that at all costs.

- wil

Jun 14, 2002, 3:34 AM

Paul

Veteran (19537 posts)

Post #5 of 16

Re: [Wil] Need help with parsing

Ah you make me laugh..

>>
How about if your HTML wraps over more than one line?
<<

Um $your_html is a string so it makes no difference.

>>
What happens when I want to put a TARGET=_blank in my link?
<<

Nothing, it still works :P

>>
What about if I want to put a NAME=foo in my link?
<<

Nothing, it still works.

>>
Case proven and closed. ;-)
<<

Umm hehe, don't think so ;)

>>
Shouldn't that be a\s+href anyway?
<<

No.

>>
Tsk. Tsk... Death to .* !! (!!!) Use a negated character class like you do in the first part of your regex. .* is evil, evil and greedy. You should try and avoid that at all costs.
<<

I used the _non_ greedy code... .*? :P

Last edited by:

Paul: Jun 14, 2002, 4:20 AM
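
One way to settle this kind of back-and-forth is to run the regex from post #3 against a few inputs. The sketch below does that. The helper names link_text and strip_link and all of the sample links are made up for illustration; only the pattern itself comes from the thread.

```perl
use strict;
use warnings;

# Return the link text captured by the pattern from post #3,
# or undef when the pattern does not match at all.
sub link_text {
    my ($html) = @_;
    return $html =~ m|<a.*?href=[^>]+>(.*?)</a>|i ? $1 : undef;
}

# Apply the substitution from post #3 and return the modified string.
sub strip_link {
    my ($html) = @_;
    $html =~ s|<a.*?href=[^>]+>(.*?)</a>|$1|i;
    return $html;
}

# Made-up inputs covering the cases raised in posts #4 and #5.
my @cases = (
    [ 'plain link'        => '<a href="x.html">one</a>' ],
    [ 'TARGET after href' => '<a href="x.html" TARGET=_blank>two</a>' ],
    [ 'NAME before href'  => '<a NAME=foo href="x.html">three</a>' ],
    [ 'text on two lines' => "<a href=\"x.html\">four\nlines</a>" ],
    [ 'anchor, then link' => '<a name="top">anchor</a> <a href="x.html">five</a>' ],
);

for my $case (@cases) {
    my ($label, $html) = @$case;
    my $text = link_text($html);
    printf "%-18s %s\n", $label, defined $text ? "'$text'" : 'no match';
}
```

With these inputs the TARGET and NAME variants match, the link whose text contains a newline does not, and in the last case the lazy .*? runs from the first <a all the way to the later href=, so strip_link would throw away the named anchor along with the tags.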

Jun 14, 2002, 4:22 AM

Wil

Veteran / Moderator (4108 posts)

Post #6 of 16

Re: [Paul] Need help with parsing

Unless you are using the /s modifier, the .* does not match a newline (\n). A negated character class will happily match a newline if you don't include \n in the class.

Death to death to .*

Death to .*?

:-)

Use strict; should jump up and slap you for using any combination of .*.
Just being 'lazy' and throwing a question mark after it certainly works better than the greedy dot-star combination, though.

With your not-so-greedy .*? version, the regex engine is forced to stop after every character it matches to see if the rest of the regex will match (kind of like a lookahead, but with subtle differences. See Mastering Regular Expressions, Second Edition, page 226 for the mechanics of this - and if you ain't got this book - buy it!!). If you are forced to iterate over the regex, these "tracking" issues can have a substantial performance hit on your program.

- wil
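
The newline behaviour is easy to check with a string that holds a line break. This is only a small sketch; the helper name first_capture and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return what the first capture group matched, or undef.
sub first_capture {
    my ($text, $regex) = @_;
    return $text =~ $regex ? $1 : undef;
}

my $text = "first line\nsecond line";   # an ordinary string holding a newline

# Without /s the dot stops at the newline.
printf "dot:         %d chars\n", length(first_capture($text, qr/(.*)/));

# With /s the dot matches the newline as well.
printf "dot with /s: %d chars\n", length(first_capture($text, qr/(.*)/s));

# A negated class matches the newline because \n is not excluded from it.
printf "negated:     %d chars\n", length(first_capture($text, qr/([^"]*)/));
```

The first capture should stop after "first line" (10 characters), while the other two take the whole 22-character string. It makes no difference whether the text came from a file or was typed into a string: what matters is whether the string contains a newline.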

Jun 14, 2002, 4:42 AM

Paul

Veteran (19537 posts)

Post #7 of 16

Re: [Wil] Need help with parsing

You need to apply what you are reading in your perl book to real situations rather than taking things at face value. The perl book will be giving you solutions to examples in the book which can be very different from real situations.

For example, your perl book has told you that you need /s to match over newlines and now you seem to think you need it in all situations but you must look at what has happened to the code before this point and adapt your thinking. If the code is already in a string then /s becomes redundant, as in my code above. The /s isn't needed in that example as the code firstly has no new lines, but secondly is already in a string.

Quote:

Death to death to .*

Death to .*?

:-)

That just seems like a naive comment or one from someone who just wants to diss my code for the sake of it without fully understanding what it does. Using .*? can be very useful and will probably end up executing quicker than trying to match every "a href" tag possibility, like TARGET=, CLASS= etc.

I'm still waiting for you to show me a hyperlink that my regex won't match :P

Anyway I was only hanging around whilst something more interesting came up, and it has so I'm off out...byeee :)

Last edited by:

Paul: Jun 14, 2002, 4:44 AM

Jun 14, 2002, 9:48 AM

Wil

Veteran / Moderator (4108 posts)

Post #8 of 16

Re: [Paul] Need help with parsing

Any Perl programmer worth his or her salt will curse you for using .*(?) in your code. They will. I will bet money that they will.

You want an example? OK. Consider the following. This is adapted from a much longer article from an ex-co-worker who once tried to explain the problem to me. Hope you find it useful.

Code:
$myvar =~ /"(.*)"/;

The intent of this is to capture whatever is inside the quotes to $1 (backreferencing). However, this fails if $myvar is something like (yes, I know it's a ridiculous example):

Code:
# Keep the string on one line: without /s the dot will not match a newline.
$myvar = qq(Wil said "hi.  It's me," and Paul replied "get lost, I know I'm doing this right. Well, maybe?");
$myvar =~ /"(.*)"/;
print $1;

We might expect the final line to print hi. It's me, but it won't. This is because the star quantifier is greedy. It will attempt the largest match that can possibly satisfy the regular expression. What would be printed is:

Code:
hi.  It's me," and Paul replied "get lost, I know I'm doing this right. Well, maybe?

To solve this, you may add a question mark after the quantifier (.*?) which makes the quantifier match as little as possible. This is called lazy or non-greedy matching. Merely adding that little question mark will cause the code to print the hi. It's me, that we were looking for:

Code:
$myvar =~ /"(.*?)"/;
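
To see the two quantifiers side by side, here is a small sketch that runs both patterns over the sample sentence, assuming the sentence sits on a single line. The helper name captured is made up for illustration.

```perl
use strict;
use warnings;

# The sample sentence from the post, kept on one line (an assumption:
# with a newline in the middle the dot would stop there without /s).
my $myvar = q{Wil said "hi.  It's me," and Paul replied "get lost, I know I'm doing this right. Well, maybe?"};

# Hypothetical helper: return what the first capture group grabbed, or undef.
sub captured {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $1 : undef;
}

# Greedy: runs from the first quote to the last quote in the string.
print "greedy: ", captured($myvar, qr/"(.*)"/), "\n";

# Lazy: stops at the first closing quote it can find.
print "lazy:   ", captured($myvar, qr/"(.*?)"/), "\n";
```

The greedy line should print everything between the first and last quote marks, and the lazy line just hi.  It's me,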

So we're all set, right? Wrong. This certainly works better than the greedy dot-star combination. It keeps trying to match until it finds the first quote mark and then tries to get the smallest match that satisfies the regex. Sounds fine, right? Well, no.
There are two problems with both the greedy and lazy version of the dot star: imprecision and tracking.

The imprecision is obvious. The example above shows the imprecision with the greedy version of .*, but the lazy version is more subtle. What happens if you were trying to extract questions in quotes without the trailing question mark? You might think that something like /"(.*?)\?"/ would do the trick. Unfortunately, we get almost the same result as the greedy version above (everything from the first quote up to the final question mark), because lazy matching doesn't guarantee the smallest match. It guarantees the smallest match from the first place that a match could possibly begin.
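
A short sketch of that lazy "question" pattern, again assuming the sample sentence is on one line. The helper name lazy_question is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: try to pull a quoted question out of $string
# with the lazy pattern, dropping the trailing question mark.
sub lazy_question {
    my ($string) = @_;
    return $string =~ /"(.*?)\?"/ ? $1 : undef;
}

my $myvar = q{Wil said "hi.  It's me," and Paul replied "get lost, I know I'm doing this right. Well, maybe?"};

# The match starts at the FIRST quote, then grows until ?" can follow,
# so the capture swallows the first quotation as well.
print lazy_question($myvar), "\n";
```

The capture starts at hi. and only ends at maybe, because the engine commits to the first opening quote and stretches the lazy part as far as it has to.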

Tracking is another problem with both of them. With the greedy version of dot star, the dot star gobbles up the entire string. The regex engine is then forced to backtrack from the end of the string to find the longest possible match that will satisfy the regex. With the lazy version, the regex engine is forced to stop after every character it matches to see if the rest of the regex will match (kind of like a lookahead, but with subtle differences. See Mastering Regular Expressions, Second Edition, page 226 for the mechanics of this). With a one-shot regex, these issues are not usually much of a performance hit (though it can be nasty on a complicated regex with no possible match), but if you are forced to iterate over the regex, these "tracking" issues can have quite a performance hit on your program.

The solution is to use a negated character class:

Code:
$myvar =~ /"([^"]*)"/;

What's going on there? When a caret "^" is the first character in a character class, it's telling the regex to match anything except what is in the character class. In this case, it is telling it to keep matching anything (including newline) that is not a quote. If there is a possible match, there is no tracking, there's no ambiguity, there's just a straight match. The above regex will match the first quote, capture everything that's not a quote to $1, and then match the end quote. It's fast and precise.
Unfortunately, it's a little more complicated with the "questions in quotes" example. Here's one solution:

Code:
$myvar =~ /"((?:[^?"]|\?[^"])*)\?"/;

Yes, it's more work. Yes, it's harder to read. But it works and doesn't have the problems of the dot star.
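
Both negated-class patterns can be tried on the sample sentence with the sketch below. The helper names and the extra "really??" input are made up for illustration, and the sentence is assumed to be on one line.

```perl
use strict;
use warnings;

# Hypothetical helper: first quoted chunk, using the negated class.
sub first_quoted {
    my ($string) = @_;
    return $string =~ /"([^"]*)"/ ? $1 : undef;
}

# Hypothetical helper: first quoted question, without its trailing "?".
sub quoted_question {
    my ($string) = @_;
    return $string =~ /"((?:[^?"]|\?[^"])*)\?"/ ? $1 : undef;
}

my $myvar = q{Wil said "hi.  It's me," and Paul replied "get lost, I know I'm doing this right. Well, maybe?"};

print "quoted:   ", first_quoted($myvar), "\n";
print "question: ", quoted_question($myvar), "\n";

# Caveat of this particular pattern: a doubled question mark before the
# closing quote is consumed by \?[^"] and the match fails.
my $double = quoted_question(q{She asked "really??"});
print "double:   ", (defined $double ? $double : 'no match'), "\n";
```

The first call returns hi.  It's me, and the second returns the question that starts at get lost and ends at maybe. The last case is a reminder that hand-built alternations have edge cases of their own.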

- wil

Jun 14, 2002, 9:50 AM

Wil

Veteran / Moderator (4108 posts)

Post #9 of 16

Re: [Paul] Need help with parsing

Some HTML your regex will stumble on? No problem.

Code:
<HTML> 
<HEAD> 
<TITLE></TITLE> 
</HEAD> 

<BODY> 

<A HREF="gopher://i_can_peek_on_IE_users_hard_drive/">Whoa! Mr IE user 
why don't you click on this <magical> link!</A> 

</BODY>
</HTML>

- wil
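
Here is a sketch of what the regex from post #3 does with a link that spans two lines like the one above, with and without the /s modifier. The helper name first_link_text is made up, and the page is cut down to the BODY part.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture of $regex in $html, or undef.
sub first_link_text {
    my ($html, $regex) = @_;
    return $html =~ $regex ? $1 : undef;
}

# A cut-down copy of the page above (the link still spans two lines).
my $page = <<'HTML';
<BODY>
<A HREF="gopher://i_can_peek_on_IE_users_hard_drive/">Whoa! Mr IE user
why don't you click on this <magical> link!</A>
</BODY>
HTML

# The pattern from post #3, first as posted, then with /s added.
my $without_s = first_link_text($page, qr|<a.*?href=[^>]+>(.*?)</a>|i);
my $with_s    = first_link_text($page, qr|<a.*?href=[^>]+>(.*?)</a>|is);

print "without /s: ", (defined $without_s ? $without_s : 'no match'), "\n";
print "with /s:    ", (defined $with_s    ? $with_s    : 'no match'), "\n";
```

As posted, the pattern finds nothing here because the dot cannot cross the line break inside the link text. With /s it captures both lines, including the stray <magical> tag.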

Jun 14, 2002, 10:21 AM

Paul

Veteran (19537 posts)

Post #10 of 16

Re: [Wil] Need help with parsing

....all I need to do is read that code from a html file....

http://213.106.15.150/cgi-bin/test.cgi
http://213.106.15.150/cgi-bin/test.cgi?code=1

The vital factor you are forgetting is that the code was never written to work on html files or over multiple lines in the first place. Let's look at the original question:

Quote:

I need an example on how to parse visible text of the link.
<a href="www.something.com">I NEED THIS PART</a>

The code I provided will do as asked and will work as expected.

You are trying to make my regex fail by manipulating the original question. That's like asking for some code to do subtraction and then complaining when it doesn't do multiplication.

If the original question had been how to parse URLs from an HTML file, then I wouldn't have given that same code..duh.

If you want to argue about using .*? then try studying the regex to see what it actually does and then you'll see why I use .*?

Perhaps you're thinking "um you could use ([^<]+) instead of (.*?)" ...well the answer to that is no you couldn't.

>>
Any Perl programmer worth his or her salt will curse you for using .*(?) in your code. They will. I will bet money that they will.
<<

Yeah using .* is not always good but there are times when .*? is appropriate and this is one of them. Sigh.

Last edited by:

Paul: Jun 14, 2002, 10:25 AM
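
The point about ([^<]+) can be checked with a link whose text contains another tag. This is a sketch with a made-up sample link and a made-up helper name, capture_with.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture of $regex in $html, or undef.
sub capture_with {
    my ($html, $regex) = @_;
    return $html =~ $regex ? $1 : undef;
}

# Made-up link whose visible text contains a nested <b> element.
my $link = '<a href="x.html">click <b>here</b></a>';

# Lazy dot: keeps going until it reaches the closing </a>.
my $lazy = capture_with($link, qr|<a.*?href=[^>]+>(.*?)</a>|i);

# Negated class: stops at the first "<", which belongs to <b>, not </a>.
my $negated = capture_with($link, qr|<a.*?href=[^>]+>([^<]+)</a>|i);

print "lazy:    ", (defined $lazy    ? $lazy    : 'no match'), "\n";
print "negated: ", (defined $negated ? $negated : 'no match'), "\n";
```

The lazy version returns click <b>here</b>, while the negated class cannot get past the nested tag and fails. A single-character negated class only works when the terminator is a single character that never appears in the text.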

Jun 14, 2002, 10:29 AM

Wil

Veteran / Moderator (4108 posts)

Post #11 of 16

Re: [Paul] Need help with parsing

Appropriate = lazy. My point was, with a little more effort, there are better, more efficient ways.

The question asked to extract the text out of a link, where the poster gave an example link.

The module originally suggested by myself would have accounted for any nasty surprises without making any modifications to code. As long as it was valid syntax, it would have picked it up.. and that was my point. It's more flexible.

- wil

Jun 14, 2002, 11:27 AM

Paul

Veteran (19537 posts)

Post #12 of 16

Re: [Wil] Need help with parsing

Well I was trying to force myself not to reply because I'm hitting my head against a brick wall but I just found an example of code (written by Jagerman) in GForum:

if ($key =~ /^cookie-(.*)/) {

$$text =~ s{\[\s*url\s*\](\s*)(.*?)(\s*)\[\s*/url\s*\]}

$$text =~ s{\[email\](\s*)(.*?)(\s*)\[/email\]}

$$text =~ s{\[\s*ima?ge?\s*\]\s*(.*?)\s*\[\s*/ima?ge?\s*\]}

So now you are probably going to tell me that it's ok for him to do it or is he "lazy" too?

.*? _IS_ appropriate if you know when to use it.

I rest my case. Mess with the best, die like the rest. Good night.

Last edited by:

Paul: Jun 14, 2002, 11:28 AM
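
The snippets quoted above are cut off before their replacement parts, so the sketch below is only an illustration of the same idea, not the GForum code. The replacement text, the simplified pattern without the \s* parts, the sample input and the helper names are all made up.

```perl
use strict;
use warnings;

# Lazy version: each [url]...[/url] pair is handled on its own.
sub expand_urls_lazy {
    my ($text) = @_;
    $text =~ s{\[url\](.*?)\[/url\]}{<a href="$1">$1</a>}g;
    return $text;
}

# Greedy version: (.*) runs from the first [url] to the LAST [/url].
sub expand_urls_greedy {
    my ($text) = @_;
    $text =~ s{\[url\](.*)\[/url\]}{<a href="$1">$1</a>}g;
    return $text;
}

my $post = 'See [url]http://a.example/[/url] and [url]http://b.example/[/url]';

print expand_urls_lazy($post),   "\n";
print expand_urls_greedy($post), "\n";
```

The closing delimiter here is a multi-character string, which a single negated character class cannot express directly, so a lazy quantifier is a common way to stop at the nearest [/url]. The lazy call produces two separate links; the greedy one merges everything between the first [url] and the last [/url] into one.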

Jun 15, 2002, 3:53 AM

Wil

Veteran / Moderator (4108 posts)

Post #13 of 16

Re: [Paul] Need help with parsing

Shouldn't that be?

$$text =~ s{\[\s*im(?:a)g(?:e)\s*\]\s*(.*?)\s*\[\s*/im(?:a)g(?:e)\s*\]}

I'm just replying to your post in the Test forum now...

- wil

Jun 25, 2002, 8:19 AM

AlexBGD

Novice (7 posts)

Post #14 of 16

Re: [Wil] Need help with parsing

I just needed some tips on parsing, I had no intention to start a war between you guys :) You are both great and you helped me a lot, really! Sorry I couldn't have replied sooner, I had some problems with my PC.
Once again, thanks a lot!

Jun 25, 2002, 8:36 AM

Wil

Veteran / Moderator (4108 posts)

Post #15 of 16

Re: [AlexBGD] Need help with parsing

Hi Alex

Glad you got some useful information.

There's no war going on here! Just me and Paul arguing about how to do things. We tend to bicker about these things a lot. I think it's all part of our learning processes. We give each other criticism and at the end of the day we make each other think in different ways, which is a good thing, IMO.

Cheers. :-)

- wil

Jun 25, 2002, 9:12 AM

Paul

Veteran (19537 posts)

Post #16 of 16

Re: [AlexBGD] Need help with parsing

Yes Wil is right...they aren't "real" arguments...just a potent mixture of sarcasm, egos and unparalleled perl knowledge (on my part).