Plugin development guidance

Hi all,
We've got an internal Java library for keyword extraction that seems like
a great candidate for turning into an integrated Elasticsearch function.
Ultimately, I want to be able to access the output of this library from
search results etc., but I wanted to sanity-check my approach first, in
case I should be writing a custom analyzer or something else instead.

Given a string field, the type would become a multi-field, with {name} and
keywords/phrases as subfields. A plugin would handle this keywords field,
run the string through the library, and return a list of strings, like:
"my_data":"Jack and Jill went up the hill, Jack fell down and bumped his crown, and Jill came tumbling after."
"my_data.keywords":["Jack", "Jack fell"]
That's a trivial example, of course, and the algorithm is more complex than
the standard stopword filtering.
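
To make the intended shape concrete, here is a minimal plain-Java sketch. It is not Elasticsearch plugin code: KeywordExtractor is a hypothetical stand-in for the internal library, and the toy extractor (capitalised words that occur at least twice) is an assumption for illustration only, so its output differs from the example above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KeywordFieldSketch {

    // Hypothetical stand-in for the internal keyword-extraction library.
    interface KeywordExtractor {
        List<String> extract(String text);
    }

    // Toy extractor (assumption, illustration only): keeps capitalised words
    // that occur at least twice, in order of first appearance.
    static final KeywordExtractor TOY = text -> {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : text.split("[^A-Za-z]+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<String> keywords = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= 2) {
                keywords.add(entry.getKey());
            }
        }
        return keywords;
    };

    // Builds the source field plus a "<field>.keywords" list next to it.
    static Map<String, Object> withKeywords(String field, String text,
                                            KeywordExtractor extractor) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put(field, text);
        doc.put(field + ".keywords", extractor.extract(text));
        return doc;
    }

    public static void main(String[] args) {
        String text = "Jack and Jill went up the hill, Jack fell down and bumped his "
                + "crown, and Jill came tumbling after.";
        Map<String, Object> doc = withKeywords("my_data", text, TOY);
        System.out.println(doc.get("my_data.keywords")); // prints [Jack, Jill]
    }
}
```

Keeping the keywords as a real list next to the source string is what would let them be used for facets later on.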

Ultimately, I want to be able to expose the my_data.keywords field as an
actual list like above, so that we can use it in other things like facets
down the line.

So is a custom type plugin the right way to go here, or should I be looking
at developing a more complex analyzer/tokenizer/stopword combo?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/44d3ba64-f0b3-4727-9a49-745a2167d34d%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

An analyzer plugin is the right thing. Adding the recognized/extracted
terms needs access to the ES mapping service. There are a few plugins out there
which work in this manner, for example, the attachment mapper plugin.
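
As a rough illustration of the analyzer route: an analyzer turns a field's text into a stream of tokens, and a plugin can add extracted terms to that stream. The sketch below models this in plain Java only. It does not use the Lucene or Elasticsearch APIs; the Token class, the analyze method, and the fixed keyword set are all hypothetical, and placing the extra token at the same position as the word it came from is an assumption about one reasonable design.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TokenInjectionSketch {

    // Simplified token: a term, its position in the text, and a type label.
    static final class Token {
        final String term;
        final int position;
        final String type; // "word" for ordinary tokens, "keyword" for added ones

        Token(String term, int position, String type) {
            this.term = term;
            this.position = position;
            this.type = type;
        }

        @Override
        public String toString() {
            return term + "@" + position + "/" + type;
        }
    }

    // Emits one lowercased "word" token per word and, for words found in the
    // (hypothetical) keyword set, an extra "keyword" token at the same position.
    static List<Token> analyze(String text, Set<String> keywords) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        for (String word : text.split("[^A-Za-z]+")) {
            if (word.isEmpty()) {
                continue;
            }
            tokens.add(new Token(word.toLowerCase(Locale.ROOT), position, "word"));
            if (keywords.contains(word)) {
                tokens.add(new Token(word, position, "keyword"));
            }
            position++;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("Jack"));
        for (Token token : analyze("Jack fell down", keywords)) {
            System.out.println(token);
        }
    }
}
```

In a real plugin the keyword set would come from the extraction library rather than a fixed set, and the tokens would be produced through the analysis chain instead of a hand-written loop.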

The langdetect plugin is another example. It adds the recognized language(s)
as a keyword code into a neighbor field for filtering or faceting:
GitHub - jprante/elasticsearch-langdetect: A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector

Also, I developed a similar plugin that works with recognition techniques.
It can recognize ISBNs or other standard numbers in a text, and it injects
extra tokens into the token stream to identify these numbers:
GitHub - jprante/elasticsearch-analysis-standardnumber: Analyze standard numbers like ARK, DOI, EAN, GTIN, IBAN, ISAN, ISBN, ISMN, ISNI, ISSN, ISTC, ISWC, ORCID, PPN, SICI, UPC, ZDB with Elasticsearch

Jörg

Great, thanks Jörg!
I'll start fiddling around with the langdetect plugin to see if I can get
it going with our library.
