Searching inner hits in ES datastore

Hi ES community,

Searching and retrieving inner hits (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-inner-hits.html) is extremely slow!

Any tips to expedite the search?

Thanks!
Parul

Is this happening for parent-child or nested searches?

@NerdSec this is a nested search.

Can you add some more details, because this is a very generic question:

  • Elasticsearch version
  • Specific query you are running, maybe even a (simplified) mapping
  • What slow means specifically (in general, in comparison to other queries,...)

@xeraa

Elasticsearch version

curl -XGET 'localhost:9200'
{
  "status" : 200,
  "name" : "Gaea",
  "cluster_name" : "v1-cluster",
  "version" : {
    "number" : "1.7.6",
    "build_hash" : "c730b59357f8ebc555286794dcd90b3411f517c9",
    "build_timestamp" : "2016-11-18T15:21:16Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

Specific query you are running, maybe even a (simplified) mapping

  • The query runs against a biological dataset.

ES query
peak_results = es.search(body=query,
                         index=chromosome,
                         doc_type=assembly,
                         size=99999)

About the get_query function that builds the peak query

query = get_query(start, end, within_region=region_inside_status)

def get_query(start, end, with_inner_hits=True, within_region=True):
    """Return the peak query."""
    query = {
        'query': {
            'filtered': {
                'filter': {
                    'nested': {
                        'path': 'location',
                        'filter': {
                            'bool': {
                                'should': []
                            }
                        }
                    }
                },
                '_cache': True,
            }
        },
        '_source': False,
    }
    search_ranges = {
        'inside_range': {
            'start': start,
            'end': end
        },
        'range_inside': {
            'start': end,
            'end': start
        },
        'overlap_start_range': {
            'start': start,
            'end': start
        },
        'overlap_end_range': {
            'start': end,
            'end': end
        }
    }
    for key, value in search_ranges.items():
        query['query']['filtered']['filter']['nested']['filter']['bool']['should'].append(get_bool_query(value['start'], value['end']))
    if with_inner_hits:
        query['query']['filtered']['filter']['nested']['inner_hits'] = {'size': 99999}
    return query

What slow means specifically (in general, in comparison to other queries,...)

To explain the slowness in our use case, here is an overview of the mapping. The query is executed against the nested object location:

'location': {
    'type': 'nested',
    'properties': {
        'start': {
            'type': 'long'
        },
        'end': {
            'type': 'long'
        },
        'state': {
            'type': 'string'
        },
        'val': {
            'type': 'string'
        }
    }
}

The wider the range between the start and end parameters, the longer the query takes.

For example, start = 1 and end = 100 takes 1 second to return the location information, whereas start = 1 and end = 10000 takes 60 seconds.
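To put numbers on "slow", it may help to compare the time Elasticsearch itself reports (the `took` field, in milliseconds) with the wall-clock time seen by the client, along with how much data actually comes back. Below is a minimal sketch assuming Python 3; `measure_search` and its `run_search` argument are hypothetical names, where `run_search(start, end)` stands for whatever function wraps the `es.search` call above and returns the raw response dict.

```python
import json
import time


def measure_search(run_search, start, end):
    """Time one search and summarise how much data it returned.

    run_search is a hypothetical callable (start, end) -> response dict.
    """
    t0 = time.perf_counter()
    response = run_search(start, end)
    wall_seconds = time.perf_counter() - t0

    hits = response.get('hits', {}).get('hits', [])
    inner_total = 0
    for hit in hits:
        # inner_hits is keyed by name (by default the nested path)
        for inner in hit.get('inner_hits', {}).values():
            inner_total += len(inner.get('hits', {}).get('hits', []))

    return {
        'took_ms': response.get('took'),          # time reported by Elasticsearch
        'wall_seconds': wall_seconds,             # time seen by the client
        'top_level_hits': len(hits),
        'inner_hits_returned': inner_total,
        'response_bytes': len(json.dumps(response)),
    }
```

If `took_ms` stays small while `wall_seconds` grows with the range, most of the time is probably going into serialising and transferring the inner hits rather than into the search itself.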

Let me know what you think.

Best Regards,
Parul

The bad news is that your Elasticsearch version is ancient: it hasn't been supported for quite a while, and it lacks some very helpful tools for finding performance issues, such as the profile API.

The good news is that inner hits should have improved quite a bit in more recent versions. We have a lot of performance benchmarks and the one you'll be most interested in is probably major versions of nested — especially about inner hits at the end:


You'll have to upgrade sooner or later, but this will probably require quite a lot of work: rewriting queries and (remote) reindexing the data.
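To give a rough idea of what rewriting the query involves: in current versions the `filtered` query and standalone filters are gone, so the nested filter becomes a `nested` query with a `bool` query in filter context. The sketch below is only an illustration. The function name `build_modern_nested_query` is hypothetical, and it assumes the intent is to match every location that overlaps the range [start, end], which may not be exactly what `get_bool_query` does.

```python
def build_modern_nested_query(start, end, inner_hits_size=100):
    """Sketch of the nested query in the post-1.x query DSL.

    Assumption: a location matches when it overlaps [start, end].
    """
    return {
        'query': {
            'nested': {
                'path': 'location',
                'query': {
                    'bool': {
                        # filter context: no scoring, results can be cached
                        'filter': [
                            {'range': {'location.start': {'lte': end}}},
                            {'range': {'location.end': {'gte': start}}},
                        ]
                    }
                },
                'inner_hits': {'size': inner_hits_size},
            }
        },
        '_source': False,
    }
```

As far as I know, recent versions also cap the number of inner hits per parent document at 100 by default (the `index.max_inner_result_window` index setting), so a value like 99999 would need that setting raised or a different retrieval strategy.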

Not sure about quick wins. The for loop that builds the should criteria looks dangerous to me. Also, {'size': 99999} for inner hits could be an issue: do you really need that much data? What's the total size of the response document? But even tweaks here won't save you from the upgrade in the long run.
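One possible tweak on 1.7, sketched under an assumption: if the four should clauses are meant to match any location that overlaps [start, end], the same condition can be written as a single pair of range filters (location.start <= end and location.end >= start), and the inner hits size can be made a parameter instead of a fixed 99999. `build_overlap_query` is a hypothetical name, and since `get_bool_query` isn't shown, this may not be equivalent to the original filter.

```python
def build_overlap_query(start, end, inner_hits_size=100):
    """Sketch of an ES 1.x nested filter for locations overlapping [start, end].

    Assumption: two intervals overlap when each one starts before the other ends.
    """
    overlap_filter = {
        'bool': {
            'must': [
                {'range': {'location.start': {'lte': end}}},
                {'range': {'location.end': {'gte': start}}},
            ]
        }
    }
    return {
        'query': {
            'filtered': {
                'filter': {
                    'nested': {
                        'path': 'location',
                        'filter': overlap_filter,
                        # keep this as small as the use case allows
                        'inner_hits': {'size': inner_hits_size},
                    }
                }
            }
        },
        '_source': False,
    }
```

It would be called the same way as before, e.g. `es.search(body=build_overlap_query(1, 10000), index=chromosome, doc_type=assembly)`. Whether this is actually faster would need measuring.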

@xeraa Thanks a lot! We will keep the community posted on our implementation and speed improvements.
