TokenStream implementation in ElasticSearch


(thinusp) #1

In Lucene I am able to inject my own stream of tokens, should I choose to
do so:

GenericTokenStream customToken = new GenericTokenStream(...);
doc.add(new Field("tags", customToken));

where GenericTokenStream is a class extending TokenStream, with the
incrementToken() method implemented to provide Lucene with a way of
traversing the list of tokens. This allowed me to generate tokens using a
simple method, such as adding tags to a source, and then save them as a
separately searchable stream. Of course, these types of fields are not
analyzed, as I provide the tokens myself, with attributes such as the
OffsetAttribute and CharTermAttribute already set.
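
For readers unfamiliar with the contract described here, this is a minimal plain-Java model of it, kept free of Lucene so that it compiles on its own. ReplayTokens and its nested Token class are hypothetical names, not Lucene types. In a real implementation the class would extend TokenStream and copy each token into CharTermAttribute and OffsetAttribute rather than exposing current().

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Plain-Java model of a stream that replays tokens built elsewhere (not a Lucene class). */
final class ReplayTokens {

    /** One pre-built token: term text plus character offsets. Hypothetical type. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    private final List<Token> tokens;
    private Iterator<Token> cursor;
    private Token current; // stands in for the attribute state Lucene would hold

    ReplayTokens(List<Token> tokens) {
        this.tokens = new ArrayList<>(tokens); // defensive copy
        this.cursor = this.tokens.iterator();
    }

    /** Mirrors incrementToken(): advance to the next token, return false when exhausted. */
    boolean incrementToken() {
        if (!cursor.hasNext()) {
            current = null;
            return false;
        }
        current = cursor.next();
        return true;
    }

    /** The token the last successful incrementToken() call moved to, or null. */
    Token current() {
        return current;
    }

    /** Mirrors reset(): rewind so the same tokens can be consumed again. */
    void reset() {
        cursor = tokens.iterator();
        current = null;
    }
}
```

No analysis happens anywhere in this class: the terms and offsets are stored exactly as supplied, which is the property the field relies on.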

What I'd like to know then is how you would do the same kind of thing with
ElasticSearch. What I'd like to be able to do is add a source to the
index, where a certain field, say 'content', is indexed using whichever
analyzer (snowball, for example), but where another field is specified
through the token stream. I realise full well that this is probably not
possible at the moment, so advice on how to make this work would also be
greatly appreciated.

I have considered using an array to store the terms, in other words having
a field populated by an array of strings, but this would not do as I
cannot set the attributes for each array element. I literally need to save
the tokens directly.

Thank you kindly.

  • Thinus

(Shay Banon) #2

Heya,

It's certainly possible. The simplest way is to build your own plugin that
provides your own Analyzer implementation. You can give it your own name,
and then configure the relevant field in the mapping to use that analyzer.
Here is an example of an analysis plugin:
https://github.com/elasticsearch/elasticsearch-analysis-icu.


(thinusp) #3

Thanks, I did exactly that. I created an Analyzer that can parse the
relevant information from an XML-formatted string that I pass in the normal
way. It seems to work quite well so far. Of course you have to then make
sure the search_analyzer is not the same as the index_analyzer...

Perhaps (in the very looooong future) it would be nice to have some native
way of doing that internally, where ES can worry about the formatting (into
JSON or XML or whatever) and provide you with a client-side interface for
setting the attributes directly. But for now, this will suffice.

Thanks again.


(Shay Banon) #4

What do you mean by ES doing the formatting? I'm not sure I get it (and
I'm obviously very open to future improvements to ES). I thought you were
doing custom tokenization?


(thinusp) #5

Shay,

Yeah - so let me quickly explain what I did and why. I had to somehow
inject tokens of custom design into the database remotely. The requirement
was to bypass the whole analysis stage and simply specify which tokens, with
which offsets etc., had to be saved. With Lucene that is simply a matter of
saving a TokenStream; with ES it was not so simple. So what I ended up doing
was creating an XML format with which I specify the details, such as the
token list and the offsets to be saved, and a custom analyser, packaged as a
plugin, to parse that XML on the server and create the relevant tokens to be
presented to the ES core.
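
The parsing step can be sketched roughly as follows. The thread does not show the actual XML format, so the layout assumed here (a <tokens> root with <token term="..." start="..." end="..."/> children) is invented for illustration, as are the TokenXmlParser and ParsedToken names. The sketch uses only the JDK's DOM parser and stops at producing a token list; feeding that list into a TokenStream inside the plugin is not shown.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Parses a hypothetical token-list XML payload into (term, start, end) triples. */
final class TokenXmlParser {

    /** One token as described by the payload. Hypothetical type. */
    static final class ParsedToken {
        final String term;
        final int start;
        final int end;

        ParsedToken(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the tokens in document order; throws IllegalArgumentException on bad input. */
    static List<ParsedToken> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Field values come from remote clients, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            NodeList nodes = doc.getElementsByTagName("token");
            List<ParsedToken> tokens = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                tokens.add(new ParsedToken(
                        e.getAttribute("term"),
                        Integer.parseInt(e.getAttribute("start")),
                        Integer.parseInt(e.getAttribute("end"))));
            }
            return tokens;
        } catch (Exception ex) {
            // Covers malformed XML as well as missing or non-numeric offsets.
            throw new IllegalArgumentException("Invalid token payload", ex);
        }
    }
}
```

Because the analyzer receives the whole field value as a string, anything that is not valid XML in this layout is rejected outright rather than being tokenized in some fallback way.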

My question is then whether it would not make a nice feature to allow such
a capability at the client side as part of ES directly. My thinking is
that some API should make it simple to specify a list of tokens that has to
be injected as is, with all the relevant parameters pre-defined, and then
ES will take care of the background detail on making that happen, whether
by the system described above (or similar), or some completely new
abstraction that's more efficient.
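
To make the idea concrete, a client-side helper along these lines is roughly what I have in mind. This is only a sketch: TokenPayloadBuilder is a made-up name, nothing like it exists in ES, and the XML layout it emits (a <tokens> root with <token term="..." start="..." end="..."/> children) is an illustrative assumption. The resulting string would be sent as the field value in the normal way.

```java
/** Builds a hypothetical XML payload describing pre-defined tokens (not an ES API). */
final class TokenPayloadBuilder {

    private final StringBuilder body = new StringBuilder();

    /** Appends one token; offsets are character positions in the original source. */
    TokenPayloadBuilder add(String term, int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("Invalid offsets: " + start + ".." + end);
        }
        body.append("<token term=\"").append(escape(term))
            .append("\" start=\"").append(start)
            .append("\" end=\"").append(end)
            .append("\"/>");
        return this;
    }

    /** Returns the complete payload to be used as the field value. */
    String build() {
        return "<tokens>" + body + "</tokens>";
    }

    /** Escapes the characters that are unsafe inside a double-quoted XML attribute. */
    private static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```

For example, new TokenPayloadBuilder().add("red", 0, 3).add("car", 4, 7).build() yields a single string that the caller indexes like any other text field, so the client never has to assemble the markup by hand.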

By the way, all of this was done through the TransportClient. I suspect it
would be much simpler on a local client?

The implementation was simple enough, so thanks I suppose for that! :)

Thinus


--
Thinus Prinsloo
E-mail: thinus.prinsloo@gmail.com
Cell: +27 82 339 2226


(Shay Banon) #6

Heya,

I see, it's certainly possible, though I wonder how common it is :). We
can create a custom type, for example, that accepts a predefined format of
tokens and data associated with each one.


