Products: Gossamer Links: Discussions: Internal indexing / _tokenize: Edit Log

Here is the list of edits for this post
Internal indexing / _tokenize
Hi,

While looking into how to index our site, I came across the _tokenize function.
As far as I can see, hyphenated words like white-wine are not split, because the hyphen is part of the character class used in the split pattern. That means a search for wine does not find such a record, because wine on its own is not indexed for it.
I cannot see a reason for this at the moment and would be interested in feedback. It would also be interesting to know whether the internal indexing process could be changed or overridden with a plugin.

Code:
sub _tokenize {
#--------------------------------------------------------------------------------
# takes a string and chops it up into little bits
    my $self = shift;
    my $text = shift;
    my ( @words, $i, %rejected, $word, $code );

# split on any non-word (includes accents) characters
    @words = split /[^\w\x80-\xFF\-]+/, lc $text;
    $self->debug_dumper( "Words: ", \@words ) if ($self->{_debug});

# drop all words that are too small, etc.
    $i = 0;
    while ( $i <= $#words ) {
        $word = $words[ $i ];
        if ((exists $self->{stopwords}{$word} and ($code = 'STOPWORD')) or
            (length($word) < $self->{min_word_size} and $code = 'TOOSMALL' ) or
            (length($word) > $self->{max_word_size} and $code = 'TOOBIG')) {
            splice( @words, $i, 1 );
            $rejected{$word} = $self->{'rejections'}->{$code};
        }
        else {
            $i++; # Words ok.
        }
    }
    $self->debug_dumper( "Accepted Words: ", \@words ) if ($self->{_debug});
    $self->debug_dumper( "Rejected Words: ", \%rejected ) if ($self->{_debug});

    return ( \@words, \%rejected );
}
from /cgi-bin/admin/GT/SQL/Search/Base/Common.pm
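
To show what I mean, here is a small standalone sketch. It is not part of Gossamer Links: tokenize_keep_hyphen and tokenize_with_parts are hypothetical names I made up for illustration. The first function reuses the split pattern from _tokenize, the second is just one possible way to index the compound as well as its parts. Stopword and length checks are left out.

```perl
use strict;
use warnings;

# Same character class as the split in _tokenize:
# the hyphen counts as part of a word, so "white-wine" stays whole.
sub tokenize_keep_hyphen {
    my ($text) = @_;
    return grep { length } split /[^\w\x80-\xFF\-]+/, lc $text;
}

# Hypothetical variant (an assumption, not GT code):
# keep the compound and additionally emit its hyphen-separated parts.
sub tokenize_with_parts {
    my ($text) = @_;
    my @words;
    for my $word ( tokenize_keep_hyphen($text) ) {
        push @words, $word;
        my @parts = grep { length } split /-+/, $word;
        push @words, @parts if @parts > 1;
    }
    return @words;
}

print join( ", ", tokenize_keep_hyphen("A dry white-wine") ), "\n";
# prints: a, dry, white-wine
print join( ", ", tokenize_with_parts("A dry white-wine") ), "\n";
# prints: a, dry, white-wine, white, wine
```

With the second variant a search for wine would match, while the exact term white-wine would still be indexed too. Whether something like this can be hooked in without editing Common.pm is exactly what I do not know.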

Thanks

Niko

Last edited by:

el noe: Jul 30, 2009, 2:28 AM

Edit Log: