JSON parsing and Payload injection


(Aurélien-3) #1

Hi ESearchers,

I'm looking into Elasticsearch in order to attach payload information
to each word. One interesting idea would be to use the JSON to inject
objects holding the data and the payload together.

For example, my document would be:
{
"content": [ { "data": "keyword1", "payload": "2.0" },
             { "data": "keyword2", "payload": "3.0" } ]
}

However, to inject the payload, I need access to several fields of the
object at once.

As I understand from the documentation and code, objects are dynamically
parsed and treated as nested fields, so the request parser will run my
analyzer on data and payload independently.
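
To make the concern concrete, here is a rough Python sketch of how an object field ends up as separate dotted field paths. It is only an illustration of the behaviour described above, not Elasticsearch code: the `flatten` helper is hypothetical, and it assumes the two objects are sent as a JSON array under "content".

```python
from collections import defaultdict

def flatten(doc):
    """Collect leaf values under dotted field paths (illustrative helper)."""
    fields = defaultdict(list)

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into objects, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            # Arrays do not add a path segment; every item shares the path.
            for item in value:
                walk(item, path)
        else:
            fields[path].append(value)

    walk(doc, "")
    return dict(fields)

doc = {"content": [{"data": "keyword1", "payload": "2.0"},
                   {"data": "keyword2", "payload": "3.0"}]}
flat = flatten(doc)
print(flat)
```

Each path would get its own analysis pass, so an analyzer working on content.data never sees the values collected under content.payload.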

Is there a way to customize this behaviour so that the analyzer (and
tokenizer) has access to a whole chunk of JSON?

This question was asked before, and "kimchy" suggested
(https://groups.google.com/d/topic/elasticsearch/cO8J5i39cUE/discussion)
using scripts. I don't really understand how scripts could solve this.

Any comments or ideas to dig into are welcome.

Thanks, by the way, for the awesome product. It has a great future ahead,
I'm sure.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Luca Cavanna) #2

Hi,
what Shay meant in that thread is that there is no easy way to read
payloads, so you would need to write a native script (Java), where you
have access to the doc id and the Lucene reader and can read the payload
information yourself from within the script.

It's now possible to plug in a custom similarity
(http://www.elasticsearch.org/guide/reference/index-modules/similarity/)
per field, which is usually how you read payloads and use them to score
documents depending on your domain. Have a look at
https://github.com/tlrx/elasticsearch-custom-similarity-provider for an
example of a custom similarity provider.
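
As a language-neutral illustration of what such a similarity does with a payload, here is a small Python sketch. It is not the Lucene or Elasticsearch API: `decode_float` and `boosted_score` are hypothetical names, and it assumes the payload was stored as a 4-byte big-endian float, which as far as I know is the layout Lucene's float payload encoding produces.

```python
import struct

def decode_float(payload):
    """Decode a 4-byte big-endian IEEE 754 float (assumed payload layout)."""
    return struct.unpack(">f", payload)[0]

def boosted_score(base_score, payload):
    """Multiply a base score by the payload value; no payload means factor 1.0."""
    factor = decode_float(payload) if payload else 1.0
    return base_score * factor

stored = struct.pack(">f", 2.0)      # what the index would hold for "keyword1|2.0"
print(boosted_score(0.5, stored))    # payload doubles the base score
print(boosted_score(0.5, None))      # terms without a payload are left unchanged
```

Multiplying is just one possible choice; how the payload should influence the score depends on your domain.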

Not sure what your use case is, but your question relates more to the
indexing process, the part that stores the payloads in the index. You would
need a custom token filter that does something similar to what Lucene's
DelimitedPayloadTokenFilter
(http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html)
does, but you need to specify the payload in the same field and use a
proper delimiter (e.g. "keyword1|2.0").

I'm not aware of any other way to index payloads that would let you pass
them in as a separate field.
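
In practice this means the client would turn the data/payload objects into a single delimited string before indexing. A minimal Python sketch of that conversion, plus a rough imitation of what a delimited payload filter then does with each token, could look like this. The helper names are hypothetical, the real filter is Java code inside Lucene, and the float encoding (4 bytes, big-endian) is my assumption about the stored layout.

```python
import struct

def to_delimited(items, delimiter="|"):
    """Join data/payload objects into one whitespace-separated field value."""
    return " ".join(f"{item['data']}{delimiter}{item['payload']}" for item in items)

def split_payloads(text, delimiter="|"):
    """Whitespace-tokenize, then split each token at the first delimiter."""
    result = []
    for token in text.split():
        term, sep, raw = token.partition(delimiter)
        # Tokens without a delimiter simply carry no payload.
        payload = struct.pack(">f", float(raw)) if sep else None
        result.append((term, payload))
    return result

items = [{"data": "keyword1", "payload": "2.0"},
         {"data": "keyword2", "payload": "3.0"}]
field_value = to_delimited(items)
tokens = split_payloads(field_value)
print(field_value)
print(tokens)
```

The delimiter must be a character that cannot occur in the keywords themselves, otherwise the term would be cut at the wrong place.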

Hope this helps
Luca

On Tuesday, September 3, 2013 3:06:51 PM UTC+2, Aurélien wrote:



(Aurélien-3) #3

Thanks a lot.

Well, I went with registering DelimitedPayloadTokenFilter; for the moment
it seems simpler. I'll try anyway to find some time to experiment with
native scripts.
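
For anyone following along, the index settings for this approach might look roughly like the sketch below, built as a Python dict and serialized to JSON. The filter and analyzer names (`payload_delimiter`, `payload_analyzer`) are made up for the example, and the `delimited_payload_filter` type name assumes an Elasticsearch version that ships this filter built in; if yours does not, substitute the type name under which you registered the filter yourself.

```python
import json

settings = {
    "settings": {
        "analysis": {
            "filter": {
                "payload_delimiter": {                    # hypothetical filter name
                    "type": "delimited_payload_filter",   # assumed built-in type name
                    "delimiter": "|",
                    "encoding": "float",
                }
            },
            "analyzer": {
                "payload_analyzer": {                     # hypothetical analyzer name
                    "type": "custom",
                    # Whitespace tokenizer keeps "keyword1|2.0" as one token.
                    "tokenizer": "whitespace",
                    "filter": ["payload_delimiter"],
                }
            },
        }
    }
}

body = json.dumps(settings, indent=2)
print(body)
```

A whitespace tokenizer is used here because a tokenizer that splits on punctuation could break the token at the delimiter before the filter ever sees it.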

On Tuesday, September 3, 2013 18:24:50 UTC+3, Luca Cavanna wrote:


