Hello,
I'm looking for tips on how to recreate something like Google's Ngram Viewer
(https://books.google.com/ngrams) with elasticsearch. I have a text corpus
of under 500 MB for which this kind of tool would be very valuable.
I've had some success with the shingle token filter
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html)
and the date histogram aggregation
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html),
but the results are not ideal: I'd like to get a histogram of word/phrase
frequencies, not a histogram of how many documents the word/phrase occurs
in.
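
A setup of this kind might look roughly like the sketch below, written as Python dicts that would be sent as JSON request bodies. The names `phrase_shingles`, `shingle_analyzer`, `text.shingles`, `published` and the example phrase are placeholders, not a real mapping, and the `interval` key is the older spelling (newer releases use `calendar_interval`). Each bucket's `doc_count` in the response is the number of matching documents, which is exactly the limitation described above.

```python
import json

# Index settings: a custom analyzer that emits 2- and 3-word shingles
# in addition to single words. All names here are hypothetical.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "phrase_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                    "output_unigrams": True,
                }
            },
            "analyzer": {
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "phrase_shingles"],
                }
            },
        }
    }
}

# Search body: match one shingle and bucket the matching documents by year.
# "text.shingles" and "published" are assumed field names.
search_body = {
    "size": 0,
    "query": {"term": {"text.shingles": "white whale"}},
    "aggs": {
        "per_year": {
            "date_histogram": {"field": "published", "interval": "year"}
        }
    },
}

print(json.dumps(search_body, indent=2))
```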
It looks like what I need is some kind of combination of shingles, term
vectors
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-termvectors.html),
and the date histogram aggregation, but I'm not sure how to proceed. I can
improve my current approach by breaking the corpus into smaller pieces, i.e.
indexing paragraphs as documents instead of chapters, so that document
counts come closer to occurrence counts. But what I really want is a
"shingle frequency date histogram".
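
To make the difference concrete, here is a minimal sketch in plain Python (no elasticsearch involved) that computes both histograms for a toy corpus. The function names, the whitespace tokenization, and the sample data are all made up for illustration; the point is only the shape of the two outputs.

```python
from collections import Counter


def shingles(tokens, size):
    """Return all word n-grams (shingles) of the given size."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def phrase_histograms(docs, phrase):
    """Per year: number of documents containing the phrase, and total occurrences."""
    size = len(phrase.split())
    doc_counts = Counter()
    occurrence_counts = Counter()
    for year, text in docs:
        # Naive lowercase + whitespace tokenization, for illustration only.
        n = shingles(text.lower().split(), size).count(phrase)
        if n:
            doc_counts[year] += 1          # what the date histogram reports
            occurrence_counts[year] += n   # what a "shingle frequency" histogram would report
    return dict(doc_counts), dict(occurrence_counts)


# Hypothetical corpus of (year, text) pairs.
docs = [
    (1900, "the white whale and the white whale again"),
    (1900, "no whales here"),
    (1901, "the white whale"),
]

doc_hist, occ_hist = phrase_histograms(docs, "white whale")
print(doc_hist)  # {1900: 1, 1901: 1}
print(occ_hist)  # {1900: 2, 1901: 1}
```

The first dict corresponds to what the date histogram gives today; the second is the kind of result I am after.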
Is this something that can be accomplished with elasticsearch?
Jari