How to use _mget in elasticsearch-hadoop


(Jonathan Spooner) #1

Is it possible to use the _mget function from the EsSpark class?

I need to update a large number of documents in Elasticsearch, and I'm thinking the best method is to collect every document with _mget. Example:

POST /devices-v1/device/_mget
{
    "ids" : ["abc", "zzz", "ffff3"]
}

(James Baiera) #2

Hello!

ES-Hadoop (and EsSpark) focuses primarily on the scroll and bulk APIs (along with the nodes API for service discovery). It does not support the _mget endpoint at this time. If you have a large number of updates to perform, I suggest setting es.write.operation to update, in combination with the es.mapping.id property to specify which field is your id field.
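As a rough sketch of that configuration (the index name, the `deviceId` field, and the sample data are assumptions for illustration; this needs a running cluster and the elasticsearch-hadoop dependency on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

// Hypothetical setup; "deviceId" and "devices-v1/device" are illustrative names.
val sc = new SparkContext(new SparkConf().setAppName("device-updates"))

val updates = sc.makeRDD(Seq(
  Map("deviceId" -> "abc", "status" -> "active"),
  Map("deviceId" -> "zzz", "status" -> "retired")
))

updates.saveToEs("devices-v1/device", Map(
  "es.write.operation" -> "update",  // issue bulk update ops instead of index ops
  "es.mapping.id"      -> "deviceId" // take each document's _id from this field
))
```

With these two settings the connector issues partial updates against existing documents instead of overwriting them wholesale.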

For more info please see the documentation pages about connector configuration.

Hope this helps!


(Jonathan Spooner) #3

The problem is that my document has an array of objects. If I have a job that adds a new object, it will clobber the previous objects in that array. I was thinking I could get all of the original objects, add my new object, and resubmit the result via the update API.

My other idea is to have a Groovy script manage this array.

My other option is to use the parent-child relationship in ES, but I may have too many documents for a fast search.


(James Baiera) #4

It seems like you're attempting to emulate a map-style object within your document. While a parent-child relationship between documents is the easiest way to reason about this sort of write-and-update pattern, you are indeed correct that it negatively impacts search query performance. Parent-child relationships are better suited to situations where child documents outnumber their parents by a very large margin.

Groovy may be the best route for preserving your search performance here. My recommendation is to parameterize the script so that it can be compiled once and cached between indexing calls (which, judging by your gist, is the approach you have already outlined; well done).
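A minimal sketch of such a parameterized update (the index, document id, `demographics` field, and params values are assumptions for illustration):

```json
POST /devices-v1/device/abc/_update
{
    "script" : {
        "lang" : "groovy",
        "inline" : "if (!ctx._source.demographics.contains(entry)) { ctx._source.demographics += entry }",
        "params" : {
            "entry" : { "source" : "ESPN 8, the Ocho", "label" : "worstgolfers", "age" : 26 }
        }
    }
}
```

Because only the params change between calls, Elasticsearch compiles the script once and serves subsequent updates from the script cache.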

Another recommendation, if you have not already done so, is to map your demographics field as a nested field. Whether this is worthwhile depends on the types of queries you expect to execute. With a plain object array, the fields of separate objects are flattened and matched independently, so if you're looking for all "worstgolfers" reported to be age 26 by some source ("ESPN 8, the Ocho"), you may find that documents erroneously match this query depending on the mixture of entries in their demographics field. More on that here.
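For reference, such a mapping might look like this (index and type names carried over from the earlier example; the field name is an assumption):

```json
PUT /devices-v1/_mapping/device
{
    "properties" : {
        "demographics" : { "type" : "nested" }
    }
}
```

With a nested mapping, a nested query matches all criteria within a single object in the array rather than across different objects.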

Hope this helps!

