Hi, is there a way to test tokenizers? I mean, I want to give a tokenizer an input stream and see the output tokens. Also, is there a way to see the index tokens of an indexed document, i.e. which words in the document are used to index it? Thanks in advance, Onur
testing tokenizers
on 12.04.2006 14:19
Re: testing tokenizers
on 18.04.2006 03:59
Hey Onur, just got back from a trip around Japan. You've probably
already worked out the answer to this question but here is how I test
tokenizers:
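require 'ferret'

# Tokenize each line read from standard input and print every token
# together with its start and end offsets.
$stdin.each do |line|
  stk = Ferret::Analysis::StandardTokenizer.new(line)
  while (tk = stk.next)
    puts " <#{tk.text}> from #{tk.start_offset} to #{tk.end_offset}"
  end
end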
And I run it like this:
ruby -r rubygems tz_tester.rb < file_to_tokenize.txt
You can just change the tokenizer to whatever tokenizer you want to
test.
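If you want to swap tokenizers without editing the loop each time, one option is to pass the tokenizer class in as an argument. Here is a rough sketch of that idea. Note that WhitespaceStandIn, StandInToken and token_report are names I made up for illustration, not part of Ferret; the stand-in only mimics the next/text/start_offset/end_offset interface used in the script above so the example runs without Ferret installed, and its offsets are byte offsets.

```ruby
require 'strscan'

# Hypothetical stand-in token, NOT part of Ferret. It only mirrors the
# text/start_offset/end_offset readers used in the script above.
StandInToken = Struct.new(:text, :start_offset, :end_offset)

# Hypothetical stand-in tokenizer, NOT part of Ferret: splits on whitespace.
class WhitespaceStandIn
  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  # Returns the next token, or nil once the input is exhausted.
  def next
    @scanner.skip(/\s+/)
    text = @scanner.scan(/\S+/)
    return nil unless text
    # StringScanner#pos is a byte position, so these are byte offsets.
    StandInToken.new(text, @scanner.pos - text.bytesize, @scanner.pos)
  end
end

# Builds one report line per token for any tokenizer class that takes the
# input string in its constructor and responds to #next as above.
def token_report(tokenizer_class, line)
  tokenizer = tokenizer_class.new(line)
  report = []
  while (tk = tokenizer.next)
    report << " <#{tk.text}> from #{tk.start_offset} to #{tk.end_offset}"
  end
  report
end

puts token_report(WhitespaceStandIn, "hello ferret world")
```

With Ferret loaded, passing Ferret::Analysis::StandardTokenizer instead of the stand-in should work the same way, assuming its tokens expose the same readers as in the script above.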
Hope that helps,
Dave