Why is IDs query slow?

I have an index of ~9M docs, of which ~20K are "programs". This IDs query (with 20 IDs) takes 600ms on a local server, running on an MBP with an SSD. If I halve the number of IDs, the query latency is cut in half too. The same query takes 2000ms on an EC2 instance. It seems like this should execute in << 10ms locally, given that we're checking _uid and the entire type has only 20K entries...

Should I care about unassigned_shards (probably from deleted indices)?

GET /admin/programs/_search
    {
      "query": {
        "filtered": {
          "filter": {
            "ids": {
              "values": [
                18536, 18537, 18538, 18539, 18552,
                18553, 18554, 18555, 18556, 18557,
                18558, 18559, 18560, 18561, 18562,
                18563, 18564, 18565, 18652, 18653
              ]
            }
          }
        }
      }
    }

GET /_cluster/health
    {
      "cluster_name": "ecr-elasticsearch-local",
      "status": "yellow",
      "timed_out": false,
      "number_of_nodes": 1,
      "number_of_data_nodes": 1,
      "active_primary_shards": 75,
      "active_shards": 75,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 74
    }

Should I care about unassigned_shards (probably from deleted indices)?

This is because you have configured the indices to have one replica, but you only have a single-node cluster, so no replicas can actually be allocated. This should be fixed, but it's not resulting in reduced performance or anything.

Is that the initial run of the query, or every run?

Both cases

This should be pretty quick thanks to filter caching. Are you getting/monitoring any cache evictions?

More queried IDs = more matching hits to retrieve = more disk access

Can you provide the number of hits you retrieve in each case?

Another idea to help separate the query vs retrieve costs:

try a match_all query and get size 10 then size 20 hits, and contrast the response times with your ID-based query for the same volumes. This should help isolate the costs of the ID query.
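
A minimal sketch of that comparison, assuming the same /admin/programs endpoint as above:

GET /admin/programs/_search
    {
      "query": { "match_all": {} },
      "size": 10
    }

GET /admin/programs/_search
    {
      "query": { "match_all": {} },
      "size": 20
    }

If these scale with size the same way the IDs query does, the cost is in retrieving hits, not in the ID lookup itself.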

I wonder what the query performance would be if you tried a terms filter on these IDs.
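
Something along these lines, assuming the numeric id property in your mapping holds the same values as the doc IDs (value list abbreviated):

GET /admin/programs/_search
    {
      "query": {
        "filtered": {
          "filter": {
            "terms": {
              "id": [18536, 18537, 18538, 18539]
            }
          }
        }
      }
    }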

Thanks for all the suggestions thus far, folks. I think the issue lies in the cost of doing source filtering on matching docs, either via _source= or fields=. The challenge now is figuring out the best way to minimize that cost.

Let me step back:

  • /programs has many properties, among them id, name, and areas. The latter is an array and is large: on the order of 40K entries for some docs
  • I am interested in retrieving only the id, name (and a few other small properties) of docs matching the IDs query. I achieve that via source filtering, i.e. _source: ["id", "name"] (see the sketch after this list)
  • source filtering appears to be quite expensive and (AFAIK) does not benefit from filter caching, even if that were employed, since caches, as I understand them, store references to whole docs as values (not post-source-filter contents, right?)
  • if I omit source (via _source: false), the query latency drops from 600ms to 4ms
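
For reference, the source-filtered query looks roughly like this (same ids filter as before, value list abbreviated):

GET /admin/programs/_search
    {
      "_source": ["id", "name"],
      "query": {
        "filtered": {
          "filter": {
            "ids": {
              "values": [18536, 18537, 18538]
            }
          }
        }
      }
    }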

If I am right, I am surprised that source filtering is so expensive in this case, given that I am filtering by exact property names (as opposed to wildcards), and we're only filtering 20 docs. Is the bulk of the cost in constructing a key-value map from _source for each doc before applying the source filter?

Is there any way to lower the cost of source filtering? Or do I need to create a separate type mapping that contains only the subset of programs' properties relevant to this query?

Thanks,
-nikita

What if you declare the few key fields you need as stored in the mappings and then omit _source but include those fields in your query? This should get them from the index rather than from _source.

Sorry, Alex, not sure I follow. In _mapping I have:

      "id": {
          "type": "long",
          "fields": {
             "str": {
                "type": "string",
                "index": "not_analyzed"
             }
          }
       }

and the query contains "_source": "id". What should I change, and to what?

Take a look here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fields.html
For stored fields it should be much faster than filtering a large _source, I think.
I am not sure whether it is deprecated/removed in 2.x, though.

yes, I think you're right. two things I didn't realize:

  • fields is not the same as source filtering via _source (rather, it falls back on the latter when the field in question is not stored separately)
  • individual fields in a doc are not stored (just indexed) by default; I need to amend the property mapping with "store": true (sketch below)
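
If I understand the docs correctly, the amendment would look something like this (a sketch only; store can't be toggled on an existing field in place, so this implies recreating the mapping and reindexing, and the existing sub-fields like id.str would need to be repeated):

PUT /admin/_mapping/programs
    {
      "properties": {
        "id":   { "type": "long",   "store": true },
        "name": { "type": "string", "store": true }
      }
    }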

I will update my type mapping and retry the query. Thanks!

No problem. Unfortunately, it will require reindexing if they are not currently stored.

With the id and name fields mapped as stored (store: true), query latency is now 90ms vs 600ms before. So it's definitely better (and I think CPU load is predictably lower), but not as good as returning _id only:

  1. fields: [] - 4ms
  2. fields: [id, name] - 90ms

...when retrieving 20 hits. #2 still seems a bit high. Is there anything else I can do to bring its latency closer to the no-fields scenario?
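
For reference, case #2 is roughly this request (value list abbreviated):

GET /admin/programs/_search
    {
      "fields": ["id", "name"],
      "query": {
        "filtered": {
          "filter": {
            "ids": {
              "values": [18536, 18537, 18538]
            }
          }
        }
      }
    }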

I did notice that the latency of the 2nd query:

  • scales linearly with the number of hits
  • does not differ between id-only and id + name.

Have you reindexed your huge dataset that quickly to test, and are you positive that id and name are now stored fields and Elasticsearch is not falling back to extracting them from _source?

Not sure what else could be done, but I am not an expert in this and rarely use fields: [].
An empty fields: [] should return _id and _type, and it runs 20× faster than [id, name]. I assume just _id would not be sufficient for your purpose?

Also, is it 90ms every time, or is a repeat query faster?

The linear-with-hits behaviour comes down to disk access at the end of the day.

If you have the RAM available and can load the name and id field values into field data, you can use aggs to make this lookup an in-memory access. You'd need a root terms agg for the id and a nested terms agg for the associated name. Set size=0 on the query request to avoid a network round trip returning matching docs, as all the info you need is in the aggs tree.
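
A rough sketch of that shape, assuming name is not_analyzed (or has a not_analyzed sub-field to aggregate on; an analyzed field would bucket individual tokens rather than whole names):

GET /admin/programs/_search
    {
      "size": 0,
      "query": {
        "filtered": {
          "filter": {
            "ids": {
              "values": [18536, 18537, 18538]
            }
          }
        }
      },
      "aggs": {
        "by_id": {
          "terms": { "field": "id", "size": 20 },
          "aggs": {
            "by_name": {
              "terms": { "field": "name", "size": 1 }
            }
          }
        }
      }
    }

Each id bucket then carries a single name bucket, and the whole lookup is served from field data in memory rather than from disk.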

Actually, I ended up getting far better performance by excluding the huge areas[] from _source at index time (see "excluding fields from source" in the reference docs) and using source filtering in the query. Latency of that query was 4ms (equal to that of the no-source query). Recall that storing individual fields (and using fields in the query) yielded 90ms latency.
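
For anyone landing here later, the change is along these lines (a sketch, set at index creation; existing docs must be reindexed for the exclusion to apply, and areas remains searchable since it is still indexed, just no longer kept in _source):

PUT /admin
    {
      "mappings": {
        "programs": {
          "_source": {
            "excludes": ["areas"]
          }
        }
      }
    }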

I would still like to understand why I am paying a latency penalty when getting a stored field if that doc contains another field that is huge.

I.e., given a doc type with a (stored) name and a huge areas[], a query with fields: "name" will take multiple times longer if areas[] is part of _source than if it is not. Why is that? I was assuming that by querying for stored fields only, I am bypassing _source deserialization, and thus the size/presence of areas[] is irrelevant... Would appreciate any feedback.

Thanks again, all, especially Alex and Mark.

-nikita