I have some text that has been pre-analyzed (tokenized, stemmed, stopword
filtered, and the term frequency has been counted). So, for the text
"Example text has twice the word twice", I have "(example, 1), (text, 1),
(twice, 2), (word, 1)".
How can I add this to Elasticsearch?
I could just repeat the tokens and have them re-analyzed, or I could create
my own analyzer for this format, or I could try to directly access the term
vectors.
I assume that The Right Way would be the last option, but I don't know how
to do it. Any ideas?
If you index text the regular way, you get extra metadata such as token
positions. If you care only about term frequencies as you outlined, you
might want basic term queries rather than the regular text search. If you do
intend to run text searches, you'll need a matching analyzer on the query side.
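To make the difference concrete, here is a rough sketch of the two query bodies as plain Python dicts. The field name "body" and the helper names are made up for illustration, and I'm assuming the field was indexed with the built-in whitespace analyzer; adjust to whatever you actually use.

```python
# Sketch only: builds query bodies as dicts, does not talk to a cluster.
# "body" is a hypothetical field name, not something from your setup.

def term_query(field, term):
    """Exact lookup of one already-analyzed term; no query-time analysis."""
    return {"query": {"term": {field: term}}}


def match_query(field, text, analyzer="whitespace"):
    """Full-text query that is analyzed at query time.

    The analyzer named here has to split the text the same way the
    indexed tokens were produced, otherwise the terms will not line up.
    """
    return {"query": {"match": {field: {"query": text, "analyzer": analyzer}}}}


exact = term_query("body", "twice")
full_text = match_query("body", "twice word")
```

The term query expects you to pass the stemmed form yourself, which fits if your own pipeline already does the stemming.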
Without fully understanding your use case, I suspect you're better off
writing your own tokenizer and not applying any fancy filters to it.
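If a custom tokenizer is more than you need, a cheaper variant of the same idea might look like the sketch below: repeat each term according to its count and index the result with the built-in whitespace analyzer, which only splits on whitespace and applies no filters. The field name "body" and the helper name are assumptions of mine, and the exact mapping layout depends on your Elasticsearch version (older releases use a "string" field under a type name), so treat the mapping dict as a rough shape rather than something to paste in.

```python
# Sketch only: turns pre-counted terms back into a string that a
# whitespace-only analyzer will split into exactly the same terms.

def expand_terms(term_counts):
    """Repeat each term `count` times, separated by single spaces."""
    tokens = []
    for term, count in term_counts:
        if not term or any(ch.isspace() for ch in term):
            # A term containing whitespace would be split at index time.
            raise ValueError("term must be non-empty and free of whitespace: %r" % (term,))
        tokens.extend([term] * count)
    return " ".join(tokens)


term_counts = [("example", 1), ("text", 1), ("twice", 2), ("word", 1)]
document = {"body": expand_terms(term_counts)}

# Hypothetical mapping for the assumed "body" field; the layout shown
# is the typeless form used by recent versions.
index_settings = {
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "whitespace"}
        }
    }
}
```

Term frequencies come out right this way, but the positions are artificial, so phrase and proximity queries on that field won't mean anything.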
On Wednesday, May 1, 2013 6:08:53 AM UTC-7, Cristiano Lima wrote:
Hi All,