
Hi,

While looking into how our site gets indexed, I came across the _tokenize function.
As far as I can see, hyphenated words like white-wine are not split, because the split pattern treats the hyphen as part of a word. That means a search for wine cannot find them, because wine on its own won't be indexed.
I cannot see a reason for this at the moment and would be interested in feedback. It would also be interesting to know whether the internal indexing process can be changed or overridden with a plugin.

Code:
sub _tokenize { 
#-------------------------------------------------------------------------------- 
# takes a strings and chops it up into little bits 
    my $self    = shift; 
    my $text    = shift; 
    my ( @words, $i, %rejected, $word, $code ); 

# split on any non-word (includes accents) characters 
    @words = split /[^\w\x80-\xFF\-]+/, lc $text; 
    $self->debug_dumper( "Words: ", \@words ) if ($self->{_debug}); 

# drop all words that are too small, etc. 
    $i = 0; 
    while ( $i <= $#words ) { 
        $word = $words[ $i ]; 
        if ((exists $self->{stopwords}{$word}   and ($code = 'STOPWORD')) or 
            (length($word) < $self->{min_word_size} and $code = 'TOOSMALL' )  or 
            (length($word) > $self->{max_word_size} and $code = 'TOOBIG')) { 
                splice( @words, $i, 1 ); 
                $rejected{$word}    = $self->{'rejections'}->{$code}; 
        } 
        else { 
            $i++;   # Words ok. 
        } 
    } 
    $self->debug_dumper( "Accepted Words: ", \@words  )   if ($self->{_debug}); 
    $self->debug_dumper( "Rejected Words: ", \%rejected ) if ($self->{_debug}); 

    return ( \@words, \%rejected ); 
}

from /cgi-bin/admin/GT/SQL/Search/Base/Common.pm
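
To make the behaviour easier to see outside the indexer, here is a minimal standalone sketch that applies the same split pattern to a sample string. The helper name split_words and the sample text are made up for illustration, and the sketch drops empty fields itself instead of relying on min_word_size as _tokenize does.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of GT): uses the same character class as
# the split in _tokenize, so "-" counts as a word character.
sub split_words {
    my ($text) = @_;
    # Assumption: drop empty fields here; _tokenize leaves that to its
    # min_word_size check.
    return grep { length $_ } split /[^\w\x80-\xFF\-]+/, lc $text;
}

my @words = split_words("A dry white-wine, please");
print join( "|", @words ), "\n";    # prints: a|dry|white-wine|please
```

The hyphenated word survives as a single token, so neither white nor wine ends up in the word list on its own.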

Thanks

Niko

Hi,

I am not really happy with this solution, but it might be interesting for someone. I wanted white-wine, white and wine all to be searchable words with INTERNAL indexing, so I modified /cgi-bin/admin/GT/SQL/Search/Base/Common.pm

after:

Code:
# drop all words that are too small, etc.

I added:

Code:
# add the parts of hyphenated words as extra words
    $i = 0;
    while ( $i <= $#words ) {
        $word = $words[ $i++ ];
        next unless $word =~ /-/;
        # the appended parts contain no hyphen, so the loop terminates
        push @words, split( /-/, $word );
    }
# /add the parts of hyphenated words as extra words
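
For anyone who wants to try the idea outside Common.pm first, the same expansion can be sketched as a standalone function. The name expand_hyphenated is hypothetical, and unlike the patch above this version returns a new list and skips empty parts (for example from "-wine"), so it is a variation on the patch and not a drop-in replacement.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the input words followed by the parts of
# every hyphenated word. Empty parts (e.g. from "-wine" or "a--b") are
# skipped, which the in-place patch does not do.
sub expand_hyphenated {
    my @words = @_;
    my @parts;
    foreach my $word (@words) {
        next unless $word =~ /-/;
        push @parts, grep { length $_ } split /-/, $word;
    }
    return ( @words, @parts );
}

print join( "|", expand_hyphenated(qw(dry white-wine please)) ), "\n";
# prints: dry|white-wine|please|white|wine
```

Note that a part can show up twice in the result if the text already contains it as a separate word. Whether that matters for the indexer is something I have not checked.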

I don't know whether this has unwanted side effects in routines other than search, so use it at your own risk.
Maybe someday somebody from GT can make something out of it.

Regards

Niko

Internal indexing / _tokenize
