testing tokenizers
Posted by Onur Turgay (Guest)
on 12.04.2006 14:19
Hi,
is there a way to test tokenizers? I mean, I want to feed in an input
stream and see the output tokens.

Also, is there a way to see the index tokens of an indexed document,
i.e. which words in the document are used to index it?

Thanks in advance
Onur
Re: testing tokenizers
Posted by David Balmain (Guest)
on 18.04.2006 03:59
Hey Onur, just got back from a trip around Japan. You've probably
already worked out the answer to this question, but here is how I test
tokenizers:

    require 'ferret'
    $stdin.each do |line|
      stk = Ferret::Analysis::StandardTokenizer.new(line)
      while tk = stk.next()
        puts "    <#{tk.text}> from #{tk.start_offset} to #{tk.end_offset}"
      end
    end

And I run it like this:

    ruby -r rubygems tz_tester.rb < file_to_tokenize.txt

You can just change the tokenizer to whatever tokenizer you want to
test.

Hope that helps,
Dave
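
Following up on Dave's point about swapping in a different tokenizer: one way to make that easier is to pass the tokenizer class in as a parameter. The sketch below is only an illustration. StubWhitespaceTokenizer, StubToken and dump_tokens are made-up names, not part of Ferret; the stub just mimics the new(text) / next() interface and the text, start_offset and end_offset accessors that Dave's script uses, so the example runs without Ferret installed.

```ruby
# Hypothetical stand-in for a tokenizer (NOT part of Ferret). It mimics the
# interface used in the script above: new(text), then next() returning a
# token, or nil once the input is exhausted.
StubToken = Struct.new(:text, :start_offset, :end_offset)

class StubWhitespaceTokenizer
  def initialize(text)
    @tokens = []
    # Collect every run of non-whitespace along with its character offsets.
    text.scan(/\S+/) do
      m = Regexp.last_match
      @tokens << StubToken.new(m[0], m.begin(0), m.end(0))
    end
  end

  def next
    @tokens.shift
  end
end

# Drain any tokenizer class that follows the new(text)/next protocol and
# return one formatted line per token.
def dump_tokens(tokenizer_class, line)
  stream = tokenizer_class.new(line)
  lines = []
  while (tk = stream.next)
    lines << "    <#{tk.text}> from #{tk.start_offset} to #{tk.end_offset}"
  end
  lines
end

puts dump_tokens(StubWhitespaceTokenizer, "hello big world")
```

With Ferret loaded, the same helper should presumably work by passing Ferret::Analysis::StandardTokenizer (or another tokenizer class) instead of the stub, assuming its tokens expose the same accessors as in Dave's script.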