How to use _mget in elasticsearch-hadoop


(Jonathan Spooner) #1

Is it possible to use the _mget function from the EsSpark class?

I need to update a large number of documents in Elasticsearch, and I'm thinking the best method is to collect every document with _mget. Example:

POST /devices-v1/device/_mget
{
    "ids" : ["abc", "zzz", "ffff3"]
}

(James Baiera) #2

Hello!

ES-Hadoop (and EsSpark) focuses primarily on the scroll and bulk APIs (along with the nodes API for service discovery). It does not support the _mget endpoint at this time. If you have a large number of updates to perform, I suggest setting es.write.operation to update, in combination with the es.mapping.id property to specify which field is your id field.
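As a rough sketch of that configuration (the index name, the `deviceId` field, and the sample data are assumptions for illustration; this needs a running cluster and the elasticsearch-hadoop dependency on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

// Hypothetical setup; "deviceId" and "devices-v1/device" are illustrative names.
val sc = new SparkContext(new SparkConf().setAppName("device-updates"))

val updates = sc.makeRDD(Seq(
  Map("deviceId" -> "abc", "status" -> "active"),
  Map("deviceId" -> "zzz", "status" -> "retired")
))

updates.saveToEs("devices-v1/device", Map(
  "es.write.operation" -> "update",  // issue bulk update ops instead of index ops
  "es.mapping.id"      -> "deviceId" // take each document's _id from this field
))
```

With these two settings the connector issues partial updates against existing documents instead of overwriting them wholesale.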

For more info please see the documentation pages about connector configuration.

Hope this helps!


(Jonathan Spooner) #3

The problem is that my document has an array of objects. If I have a job that adds a new object, it will clobber the previous objects in that array. I was thinking I could get all of the original objects, add my new object, and resubmit the result via the update API.

My other idea is to have a Groovy script manage this array.

My other option is to use the parent-child relationship in ES, but I may have too many documents for a fast search.


(James Baiera) #4

It seems like you're attempting to emulate a map-style object within your document. While a parent-child relationship between documents is the easiest way to reason about this sort of write-and-update pattern, you are indeed correct that it negatively impacts search query performance. Parent-child relationships are better suited to situations where child documents outnumber their parents by a very large margin.

Groovy may be the best route for preserving your search performance here. My recommendation is to parameterize the script so that it can be compiled once and cached between indexing calls (which, judging by your gist, is the approach you have already outlined; well done).
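A minimal sketch of such a parameterized update (the index, document id, `demographics` field, and params values are assumptions for illustration):

```json
POST /devices-v1/device/abc/_update
{
    "script" : {
        "lang" : "groovy",
        "inline" : "if (!ctx._source.demographics.contains(entry)) { ctx._source.demographics += entry }",
        "params" : {
            "entry" : { "source" : "ESPN 8, the Ocho", "label" : "worstgolfers", "age" : 26 }
        }
    }
}
```

Because only the params change between calls, Elasticsearch compiles the script once and serves subsequent updates from the script cache.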

Another recommendation, if you have not already done so, is to map your demographics field as a nested field. Whether this is worthwhile depends on the types of queries you expect to execute. With a plain object array, the fields of separate objects are flattened and matched independently, so if you're looking for all "worstgolfers" reported to be age 26 by some source ("ESPN 8, the Ocho"), you may find that documents erroneously match this query depending on the mixture of entries in their demographics field. More on that here.
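For reference, such a mapping might look like this (index and type names carried over from the earlier example; the field name is an assumption):

```json
PUT /devices-v1/_mapping/device
{
    "properties" : {
        "demographics" : { "type" : "nested" }
    }
}
```

With a nested mapping, a nested query matches all criteria within a single object in the array rather than across different objects.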

Hope this helps!

