Hi all,
We've got an internal Java library that allows us to do keyword extraction
that seems like a great thing to turn into an integrated elasticsearch
function.
Ultimately, I want to be able to access the result of this library from
search results/etc, but I wanted to do a sanity check to make sure my
approach was right - or if I should be looking at doing a custom analyzer
or something instead.
Given a string field, the type would become a multi-field, with {name} and
keywords/phrases as subfields. A plugin would handle the keywords subfield,
run the string through the library, and return a list of strings, like:
"my_data":"Jack and Jill went up the hill, Jack fell down and bumped his
crown, and Jill came tumbling after."
"my_data.keywords":["Jack", "Jack fell"]
That's a trivial example, of course, and the algorithm is more complex than
the standard stopword filtering.
Ultimately, I want to be able to expose the my_data.keywords field as an
actual list like above, so that we can use it in other things like facets
down the line.
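
To make the shape I have in mind concrete, here is a minimal plain-Java sketch
of the enrichment step. It assumes nothing about the Elasticsearch plugin API:
the class name KeywordFieldSketch, the method withKeywords, and the
Function<String, List<String>> extractor signature are all hypothetical
stand-ins for our internal library, and the extractor in main just returns the
example list from above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class KeywordFieldSketch {

    // Returns a copy of the source document with "<field>.keywords" added.
    // The extractor stands in for the internal library (hypothetical signature).
    public static Map<String, Object> withKeywords(
            Map<String, Object> source,
            String field,
            Function<String, List<String>> extractor) {
        Map<String, Object> enriched = new LinkedHashMap<>(source);
        Object value = source.get(field);
        if (value instanceof String) {
            // Copy the result so later changes by the extractor cannot leak in.
            List<String> keywords = new ArrayList<>(extractor.apply((String) value));
            enriched.put(field + ".keywords", keywords);
        }
        return enriched;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("my_data", "Jack and Jill went up the hill, Jack fell down and "
                + "bumped his crown, and Jill came tumbling after.");
        // Placeholder extractor: returns the example keywords, not a real algorithm.
        Function<String, List<String>> extractor =
                text -> Arrays.asList("Jack", "Jack fell");
        System.out.println(withKeywords(doc, "my_data", extractor));
    }
}
```

The point is only that the keywords end up as a real list next to the original
string, rather than as tokens hidden inside the index.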
So is a custom type plugin the right way to go here, or should I be looking
at developing a more complex analyzer/tokenizer/stopword combo?
An analyzer plugin is the right thing. Adding the recognized/extracted
terms needs access to the ES mapping service. There are a few plugins out
there that work in this manner, for example the attachment mapper plugin.
Another is the lang-detect plugin, which adds the recognized language(s) as
a keyword code to a neighboring field for filtering or faceting.
I also developed a similar plugin built on recognition techniques. It can
recognize ISBNs or other standard numbers in a text, and it injects extra
tokens into the token stream to identify these numbers.
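
The token-injection idea can be sketched in plain Java. This is not the
plugin's code and it does not use the Lucene TokenFilter API. It is a
simplified simulation under stated assumptions: whitespace tokenization, a
loose ISBN-13 pattern with no checksum validation, and a hypothetical "isbn:"
prefix for the injected token. The injected token reuses the position of the
original token, which is roughly what a position increment of zero does in a
real token stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TokenInjectionSketch {

    // A term plus its position; injected tokens share the original's position.
    public static final class Token {
        public final String term;
        public final int position;

        public Token(String term, int position) {
            this.term = term;
            this.position = position;
        }

        @Override
        public String toString() {
            return term + "@" + position;
        }
    }

    // Loose ISBN-13 shape: 978/979 prefix, ten more digits, optional hyphens.
    // Assumption for illustration only; the checksum is not verified.
    private static final Pattern ISBN13 = Pattern.compile("97[89](-?\\d){10}");

    public static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        int position = 0;
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields no tokens
            }
            out.add(new Token(raw, position));
            if (ISBN13.matcher(raw).matches()) {
                // Inject a normalized token at the same position as the original.
                out.add(new Token("isbn:" + raw.replace("-", ""), position));
            }
            position++;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("see 978-3-16-148410-0 now"));
    }
}
```

For keyword extraction the same pattern would apply, with the extracted
phrases taking the place of the normalized number.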