Indexing custom Lucene documents

Hi all!

I am in the planning phase for a search application and still have to
decide which search engine to use. At first I was leaning towards Solr,
but then I was pointed to Elasticsearch. And I must admit, it caught
my attention.

From what I have seen so far, this is a great and easy-to-use search
engine. But I have a use case that I'm not quite sure goes along well
with ES.

I am doing analysis of text documents using the UIMA framework. That is,
I have a text and quite a lot of annotations. For example, my analysis
would mark names in a text as 'person', cities, lakes, etc. as 'location',
and so on. It also does some more complicated things (detecting
relations and the like). I already use a UIMA component named Lucas
(Lucene CAS indexer) to create a Lucene index from my annotated texts.
Now, of course, I don't want to bother with using Lucene directly, but
would rather go through a full-fledged search engine.

In Solr I managed to write some additional classes which take a CAS and
use Lucas to build a Lucene document. This document is then handed to the
normal Solr indexing process and everything works fine.
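
Schematically, what the Lucas step leaves me with is an ordinary Lucene
Document along these lines (a rough sketch against the Lucene 3.x API;
the field names and values are made-up examples, not actual Lucas output):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class LucasOutputSketch {
        // Roughly the kind of document the CAS-to-Lucene step produces:
        // the original text plus fields derived from the UIMA annotations.
        public static Document buildExample(String originalText) {
            Document doc = new Document();
            doc.add(new Field("text", originalText,
                    Field.Store.YES, Field.Index.ANALYZED));
            // Entity annotations become their own searchable fields.
            doc.add(new Field("person", "John Smith",
                    Field.Store.NO, Field.Index.NOT_ANALYZED));
            doc.add(new Field("location", "Lake Constance",
                    Field.Store.NO, Field.Index.NOT_ANALYZED));
            return doc;
        }
    }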

Finally, my question: can I do something similar with Elasticsearch? In
the simplest case I already have a Lucene Document object that I'd just
like to hand to the search engine, which should do the rest for me. With
Elasticsearch in particular I don't know where to start. Will this work
with the schemaless approach? Will routing still work?
I'd really appreciate it if you could point me to a mechanism that would
let me add such capabilities. I think I saw a plugin mechanism for ES;
could this be the way to go?

Another possibility would be to convert the format UIMA gives me into
something Elasticsearch understands out of the box. But I don't think
you can express something like Lucene's position increment in JSON, right?

Thanks for your help!

Erik

I have investigated this issue further and read the Elasticsearch guide.

What I intend to do is basically to pass all analysis information directly to the search engine. I don't really need a tokenizer within ES, for instance, as I do (a very specialized) tokenization beforehand. So I already know my tokens 'token1 token2 token3' before index time. Additionally, for each token I know some metadata, for example the lemma, part of speech, or entity class. You could represent this information in a format like this:

'token1_lemma1_PoS1_entity1 token2_lemma2_PoS2_entity2 token3_lemma3_PoS3_entity3'

I'd like to index all the metadata as Lucene tokens with a position increment of zero.
The issue is that the original text is not recoverable once it is written down in this manner. I cannot determine the correct position offsets, so highlighting won't work, even if I stored the original text in another field.
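
To make the idea concrete, here is a minimal sketch of the kind of token filter I have in mind, written against the Lucene 3.x attribute API (the class name, the '_' separator, and the assumption of whitespace pre-tokenization are mine, just for illustration):

    import java.io.IOException;
    import java.util.LinkedList;
    import java.util.Queue;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.util.AttributeSource;

    // Splits tokens of the form "surface_lemma_PoS_entity" and re-emits
    // lemma, PoS tag and entity class as extra tokens at the same position
    // as the surface form (position increment 0), much like a synonym filter.
    public final class AnnotationExpandingFilter extends TokenFilter {

        private final CharTermAttribute termAtt =
                addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncrAtt =
                addAttribute(PositionIncrementAttribute.class);

        private final Queue<String> pending = new LinkedList<String>();
        private AttributeSource.State savedState;

        public AnnotationExpandingFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!pending.isEmpty()) {
                // Emit a stacked annotation token at the surface token's position.
                restoreState(savedState);
                termAtt.setEmpty().append(pending.poll());
                posIncrAtt.setPositionIncrement(0);
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            String[] parts = termAtt.toString().split("_");
            termAtt.setEmpty().append(parts[0]); // surface form keeps its position
            for (int i = 1; i < parts.length; i++) {
                pending.add(parts[i]);
            }
            savedState = captureState();
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pending.clear();
            savedState = null;
        }
    }

This would get the annotations into the index at the right positions; the problem remains that the offsets carried by these tokens refer to the rewritten text, not to the original one.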

It is perfectly possible that I am missing something. Do any of you have an idea how to represent my data in a format which ES could understand without losing information?

Best regards,

Erik

On 14.02.2011 at 17:11, Erik Fäßler wrote:


Not sure I completely followed your question, but you can create the JSON you want to index however you like, and then write your own analyzer that will be applied to (some of) the fields based on a format you define, and process them into the tokens to be indexed.
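
Roughly something like this, as a sketch against plain Lucene 3.x, reusing a stacking filter like the one sketched in your earlier mail (how the analyzer then gets registered with elasticsearch through the plugin mechanism I'll leave out here):

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.util.Version;

    // Whitespace-tokenizes the pre-analyzed "surface_lemma_PoS_entity" text,
    // then stacks the annotation parts at position increment 0.
    public final class PreAnalyzedFormatAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new AnnotationExpandingFilter(
                    new WhitespaceTokenizer(Version.LUCENE_31, reader));
        }
    }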
On Wednesday, February 16, 2011 at 12:17 AM, Erik Fäßler wrote:


I guess I didn't describe my problem clearly enough, apologies!

I think I have one core question: can I specify several fields in ES
with the same name? The reasons why I would need that follow:

I am aware that I can write whatever JSON I want and that I can use my
own custom analyzers.

My point is that the actual analysis of my documents is done before
indexing. I use quite sophisticated components for all kinds of natural
language processing (NLP). So I don't want to use the Lucene analyzers
for tokenization, lemmatization, PoS tagging, etc.

My question was whether any of you see a possibility to formulate all
this information in JSON. If I write the additional information
(e.g. lemmas) right into the document text (for instance as
token_lemma token_lemma token_lemma ...), I would be able to split off
the lemmas with a custom analyzer in ES, but it wouldn't make any sense
to store that field, as its text isn't my original document text but an
annotated form of it.

So, the ES-specific question: is it possible in ES to create several
fields with the same name? Lucene allows this: you can have one field
called "text" which is not stored but indexed; I would use this field
for indexing the information above. Additionally, I could have another
field, also called "text", which isn't indexed but stored. This is where
my original text would go. In the end, searching would work fine and so
would highlighting (although one has to take great care with the
position offsets when delivering a custom tokenization).
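
In plain Lucene terms, what I mean is this (a minimal sketch):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class SameNameFieldsSketch {
        // Two Field instances with the same name: one carries the annotated
        // token text (indexed, not stored), the other the original text
        // (stored, not indexed).
        public static Document build(String annotatedText, String originalText) {
            Document doc = new Document();
            doc.add(new Field("text", annotatedText,
                    Field.Store.NO, Field.Index.ANALYZED));
            doc.add(new Field("text", originalText,
                    Field.Store.YES, Field.Index.NO));
            return doc;
        }
    }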

Thanks for your help!

Best regards,

 Erik

On 16.02.2011 at 03:10, Shay Banon wrote:


No, you can't have fields with the same name but different mapping options. But you can do it with different names, no?
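
For example, a mapping along these lines (a sketch in the current 0.x mapping syntax; the type name "doc" and the analyzer name "my_preanalyzed_format" are placeholders for whatever you define and register):

    {
        "mappings": {
            "doc": {
                "properties": {
                    "text": {
                        "type": "string",
                        "index": "no",
                        "store": "yes"
                    },
                    "text_terms": {
                        "type": "string",
                        "index": "analyzed",
                        "store": "no",
                        "analyzer": "my_preanalyzed_format"
                    }
                }
            }
        }
    }

Here "text" is stored but not indexed, and "text_terms" is indexed with your custom analyzer but not stored.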
On Wednesday, February 16, 2011 at 10:37 AM, Erik Fäßler wrote:


Would highlighting still work if I have an indexed field "text_terms"
with all my annotation information and a field "text" which only holds
the original text? I would search on "text_terms" but would like to
retrieve a highlighted portion of the "text" value.
Does this work?

On 17.02.2011 at 02:05, Shay Banon wrote:
