Why is IDs query slow?

I have an index of ~9M docs, of which ~20K are "programs". This IDs query (with 20 IDs) takes 600ms on a local server, running on an MBP with an SSD. If I halve the number of IDs, the query latency is cut in half too. The same query takes 2000ms on an EC2 instance. It seems like this should execute in << 10ms locally, given that we're checking _uid and the entire type has only 20K entries...

Should I care about unassigned_shards (probably from deleted indices)?

GET /admin/programs/_search
    {
      "query": {
        "filtered": {
          "filter": {
            "ids": {
              "values": [
                18536, 18537, 18538, 18539, 18552,
                18553, 18554, 18555, 18556, 18557,
                18558, 18559, 18560, 18561, 18562,
                18563, 18564, 18565, 18652, 18653
              ]
            }
          }
        }
      }
    }

GET /_cluster/health
    {
      "cluster_name": "ecr-elasticsearch-local",
      "status": "yellow",
      "timed_out": false,
      "number_of_nodes": 1,
      "number_of_data_nodes": 1,
      "active_primary_shards": 75,
      "active_shards": 75,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 74
    }

Should I care about unassigned_shards (probably from deleted indices)?

This is because you have configured the indices to have one replica, but you only have a single-node cluster, so no replicas can actually be allocated. This should be fixed, but it's not resulting in reduced performance or anything.

Is that the initial run of the query, or every run?

Both cases

This should be pretty quick thanks to filter caching. Are you getting/monitoring any cache evictions?

More queried IDs = more matching hits to retrieve = more disk access

Can you provide the number of hits you retrieve in each case?

Another idea to help separate the query vs retrieve costs:

try a match_all query and get size 10 then size 20 hits, and contrast the response times with your ID-based query for the same volumes. This should help isolate the costs of the ID query.
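
A minimal sketch of that comparison, assuming the same /admin/programs endpoint as above:

GET /admin/programs/_search
    {
      "query": { "match_all": {} },
      "size": 10
    }

GET /admin/programs/_search
    {
      "query": { "match_all": {} },
      "size": 20
    }

If these scale with size the same way the IDs query does, the cost is in retrieving hits, not in the ID lookup itself.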

I wonder what the query performance would be if you tried a terms filter on these IDs.
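
Something along these lines, assuming the numeric id property in your mapping holds the same values as the doc IDs (value list abbreviated):

GET /admin/programs/_search
    {
      "query": {
        "filtered": {
          "filter": {
            "terms": {
              "id": [18536, 18537, 18538, 18539]
            }
          }
        }
      }
    }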

Thanks for all the suggestions thus far, folks. I think the issue lies in the cost of doing source filtering on matching docs, either via _source= or fields=. The challenge now is figuring out the best way to minimize that cost.

Let me step back:

  • /programs has many properties, among them id, name, and areas. The latter is an array and is large: on the order of 40K entries for some docs
  • I am interested in retrieving only the id, name (and a few other small properties) of docs matching the IDs query. I achieve that via source filtering, i.e. _source: ["id", "name"] (see the sketch after this list)
  • source filtering appears to be quite expensive and (AFAIK) does not benefit from filter caching, even if that were employed, since caches, as I understand them, store references to whole docs as values (not post-source-filter contents, right?)
  • if I omit source (via _source: false), the query latency drops from 600ms to 4ms
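
For reference, the source-filtered query looks roughly like this (same ids filter as before, value list abbreviated):

GET /admin/programs/_search
    {
      "_source": ["id", "name"],
      "query": {
        "filtered": {
          "filter": {
            "ids": {
              "values": [18536, 18537, 18538]
            }
          }
        }
      }
    }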

If I am right, I am surprised that source filtering is so expensive in this case, given that I am filtering by exact property names (as opposed to wildcards), and we're only filtering 20 docs. Is the bulk of the cost in constructing a key-value map from _source for each doc before applying the source filter?

Is there any way to lower the cost of source filtering? Or do I need to create a separate type mapping that contains only the subset of programs' properties relevant to this query?

Thanks,
-nikita

What if you declare the few key fields you need as stored in the mappings and then omit _source but include those fields in your query? This should get them from the index rather than from _source.

Sorry, Alex, not sure I follow. In _mapping I have:

      "id": {
          "type": "long",
          "fields": {
             "str": {
                "type": "string",
                "index": "not_analyzed"
             }
          }
       }

and the query contains "_source": "id". What should I change, and to what?

Take a look here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fields.html
For stored fields it should be much faster than filtering a large _source, I think.
I am not sure whether it is deprecated/removed in 2.x, though.

yes, I think you're right. two things I didn't realize:

  • fields is not the same as source filtering via _source (rather, it falls back on the latter when the field in question is not stored separately)
  • individual fields in a doc are not stored (just indexed) by default; I need to amend the property mapping with "store": true (sketch below)
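
If I understand the docs correctly, the amendment would look something like this (a sketch only; store can't be toggled on an existing field in place, so this implies recreating the mapping and reindexing, and the existing sub-fields like id.str would need to be repeated):

PUT /admin/_mapping/programs
    {
      "properties": {
        "id":   { "type": "long",   "store": true },
        "name": { "type": "string", "store": true }
      }
    }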

I will update my type mapping and retry the query. Thanks!

No problem. Unfortunately, it will require reindexing if they are not currently stored.

With the id and name fields mapped as stored (store: true), query latency is now 90ms vs 600ms before. So it's definitely better (and I think CPU load is predictably lower), but not as good as returning _id only:

  1. fields: [] - 4ms
  2. fields: [id, name] - 90ms

...when retrieving 20 hits. #2 still seems a bit high. Is there anything else I can do to bring its latency closer to the no-fields scenario?
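
For reference, case #2 is roughly this request (value list abbreviated):

GET /admin/programs/_search
    {
      "fields": ["id", "name"],
      "query": {
        "filtered": {
          "filter": {
            "ids": {
              "values": [18536, 18537, 18538]
            }
          }
        }
      }
    }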

I did notice that the latency of the 2nd query:

  • scales linearly with the number of hits
  • does not differ between id-only and id + name.

Have you reindexed your huge dataset that quickly to test, and are you positive that id and name are now stored fields and Elasticsearch is not falling back to extracting them from _source?

Not sure what else could be done, but I am not an expert in this and rarely use fields: [].
An empty fields: [] should return _id and _type, and it runs 20× faster than [id, name]. I assume just _id would not be sufficient for your purpose?

Also, is it 90ms every time, or is a repeat query faster?

The linear-with-hits behaviour comes down to disk access at the end of the day.

If you have the RAM available and can load the name and id field values into field data, you can use aggs to make this lookup an in-memory access. You'd need a root terms agg for the id and a nested terms agg for the associated name. Set size=0 on the query request to avoid a network round trip returning matching docs, as all the info you need is in the aggs tree.
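
A rough sketch of that shape, assuming name is not_analyzed (or has a not_analyzed sub-field to aggregate on; an analyzed field would bucket individual tokens rather than whole names):

GET /admin/programs/_search
    {
      "size": 0,
      "query": {
        "filtered": {
          "filter": {
            "ids": {
              "values": [18536, 18537, 18538]
            }
          }
        }
      },
      "aggs": {
        "by_id": {
          "terms": { "field": "id", "size": 20 },
          "aggs": {
            "by_name": {
              "terms": { "field": "name", "size": 1 }
            }
          }
        }
      }
    }

Each id bucket then carries a single name bucket, and the whole lookup is served from field data in memory rather than from disk.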

Actually, I ended up getting far better performance by excluding the huge areas[] from _source at index time (see "excluding fields from source" in the reference docs) and using source filtering in the query. Latency of that query was 4ms (equal to that of the no-source query). Recall that storing individual fields (and using fields in the query) yielded 90ms latency.
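
For anyone landing here later, the change is along these lines (a sketch, set at index creation; existing docs must be reindexed for the exclusion to apply, and areas remains searchable since it is still indexed, just no longer kept in _source):

PUT /admin
    {
      "mappings": {
        "programs": {
          "_source": {
            "excludes": ["areas"]
          }
        }
      }
    }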

I would still like to understand why I am paying a latency penalty when getting a stored field if that doc contains another field that is huge.

I.e., given a doc type with a (stored) name and a huge areas[], a query with fields: "name" will take multiple times longer if areas[] is part of _source than if it is not. Why is that? I was assuming that by querying for stored fields only, I am bypassing _source deserialization, and thus the size/presence of areas[] is irrelevant... Would appreciate any feedback.

Thanks again, all, especially Alex and Mark.

-nikita