Performance problems when searching in-memory index with 15M documents

I have an index (ES 2.2) with 1 shard and 15.6M documents (it will go up to 100M in the future), and my searches on it take ~500ms, which is too long in my opinion.

I don't store the _source and I don't store fields; I just want the document IDs from ES.
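
For illustration, the mapping disables _source roughly like this (a simplified sketch, not my real mapping; the type name and field types are placeholders):

{
  "mappings":{
    "product":{
      "_source":{ "enabled":false },
      "properties":{
        "program_id":{ "type":"long" },
        "name":{ "type":"string" },
        "description":{ "type":"string" },
        "keywords":{ "type":"string" },
        "manual_keywords":{ "type":"string" }
      }
    }
  }
}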

I have 1 node with 64GB of RAM and an Intel Xeon E5-1630 v3 that is only running ES plus nginx as a proxy. I changed the heap to 30GB and optimized the index down to 1 segment (~3.7GB).

The query uses: term filters, function_score (weight functions), and a multi_match on 4 fields with different weights.

Query body: https://paste.ee/p/0WU9B
Query profile: https://paste.ee/p/admIL
Node stats: http://pastebin.com/Cu6FPdEQ
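
To give a rough idea without opening the paste, the query is shaped approximately like this (heavily simplified; the IDs, weights, and field names here are only illustrative, the exact body is in the paste above):

{
  "query":{
    "bool":{
      "filter":{
        "terms":{
          "program_id":[
            ... bunch of numbers ...
          ]
        }
      },
      "must":{
        "function_score":{
          "query":{
            "multi_match":{
              "fields":[ "manual_keywords^1000", "name^2.0", "description^1.0", "keywords^1.0" ],
              "query":"black shoes"
            }
          },
          "functions":[
            { "filter":{ "terms":{ "program_id":[ 181, 182 ] } }, "weight":0.5 },
            { "filter":{ "terms":{ "program_id":[ 138, 700 ] } }, "weight":2.0 },
            { "filter":{ "terms":{ "program_id":[ 222, 235 ] } }, "weight":3.0 }
          ]
        }
      }
    }
  }
}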

I currently put everything in 1 shard since it's hard to choose a good _routing value. I could derive it from a "program_id" field, which I usually filter on, but one program has 5M docs while the other programs range from 1K to 2.5M, so the distribution of documents could make some shards much bigger than others, and that may be a problem.

My wish is to make this as fast as possible, since in my best-case scenario I will have thousands of requests/second. That will of course require a lot of replicas, but I need the single query to be fast first.

Thank You!

What's the mapping look like?

Also, how are you doing an in-memory search?

@warkolm
Mapping: http://pastebin.com/2pb6zgN5

By "in-memory" I mean that I'm repeating the same query, and the index is smaller than the RAM left free after the heap, so it should be cached by the filesystem cache, since nothing else is running on the box (except nginx as a proxy).

A few notes/questions/thoughts:

  • Can you re-paste the profile output as raw JSON? The debug format it's in right now is very hard to read.

  • Since you are on 2.2, I would actually decrease the size of the heap. You don't need the full 30GB anymore (unlike the old recommendations). The reason we used to recommend such large heaps was field data, which was heap-resident. Now that 2.0+ uses doc values, you should be able to use a smaller heap, which will leave the OS filesystem cache more memory for caching segments. I'd probably go with a 4-8GB heap now.

  • Do you know how many docs match this query? The more documents that match, the slower queries tend to get.

  • Do you need the unnormalized weighting that a function_score provides? If not, you could move those functions into a regular query and probably speed things up. The profile output shows that the majority of the time is spent in the FiltersFunctionScoreQuery. For example, you could get a similar (but not identical) query like so:

{
  "query":{
    "bool":{
      "should":[
        {
          "terms":{
            "program_id":[
              181,
              182
            ],
            "boost":0.5
          }
        },
        {
          "terms":{
            "program_id":[
              138,
              700
            ],
            "boost":2.0
          }
        },
        {
          "terms":{
            "program_id":[
              222,
              235
            ],
            "boost":3.0
          }
        },
        {
          "multi_match":{
            "fields":[
              "manual_keywords^1000",
              "name^2.0",
              "description^1.0",
              "keywords^1.0"
            ],
            "query":"black shoes"
          }
        }
      ],
      "filter":{
        "terms":{
          "program_id":[
            ... bunch of numbers ...
          ]
        }
      }
    }
  }
}
  • Remember that latency is not the same thing as throughput. A query may have a minimum latency of 500ms, but the server can still handle 10,000 requests per second (just making up numbers, but you get the idea). Latency represents the time required for an action to move through the "pipe". Throughput is how many actions can be in the pipe at the same time (and queuing delay is the average time an action has to sit in the pipe). See http://www.futurechips.org/thoughts-for-researchers/clarifying-throughput-vs-latency.html and http://blog.flux7.com/blogs/benchmarks/littles-law for more details.

    Throughput and latency are also affected by parallelism at the cluster level. For example, adding more machines and replicas tends to increase the min latency slightly, since you usually need an extra network hop. But in exchange for a bit of latency, you often double your throughput because you now have an entire second machine servicing read requests.

So basically, I wouldn't worry too much about trying to optimize a single query in such an artificial situation as this. Do your benchmarking on datasets/clusters/indices that look like your final product, not on an artificial setup with a single shard. Performance can change significantly depending on the number of shards, segments, parallel queries, queue occupancy, etc. Trying to benchmark now is probably just wasted time, imo.

@polyfractal

1. Query profile in plain JSON (before it was a Python pretty-printed dict): http://pastebin.com/7MRkUFDM
2. Yeah, I know, though I don't think it will help much in this case, since even with the 30GB heap I still have ~30GB left for the filesystem cache and the index is only ~4GB. I will make the change anyway.
3. This query matches 2.2M docs. I know that the fewer documents it has to search, the faster it gets; I'm trying to remove as many as possible with filtering and such.
4. I don't need the unnormalized weighting, I just need to add more boosting, and it looks like this works. Can this query also be made to work in ES 1.7?
5. Latency vs throughput: yes, I understand, but I NEED the query to complete in hopefully <50ms, because otherwise I lose the opportunity. I will then do app-level caching, so I never call ES again until much later, when the data changes.
6. This is a much nicer scenario than production. I actually was in production and had problems with ES at ~30 searches/second (I was getting timeouts).

So what do I need to do to get even lower latency? That's my main problem.
I think:

  1. Should I add more filtering, like filtering on the "shoes" category or something, so it searches even fewer docs?
  2. Should I split the index into several shards, so that a single search is done in parallel?
  3. Anything else I may be missing?

Thank You!

Ah, for 1.7 you'll have to re-arrange it a bit. You'll have to use a filtered query and split the components between query and filter clauses.

Note that the performance characteristics on 1.7 will be quite different: 2.0+ uses a new version of Lucene, which has new optimizations and a new method of caching filters.
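
Roughly something like this (a sketch only, I haven't run it against 1.7, and the IDs/boosts are the same illustrative ones as above):

{
  "query":{
    "filtered":{
      "query":{
        "bool":{
          "should":[
            { "terms":{ "program_id":[ 181, 182 ], "boost":0.5 } },
            { "terms":{ "program_id":[ 138, 700 ], "boost":2.0 } },
            { "terms":{ "program_id":[ 222, 235 ], "boost":3.0 } },
            {
              "multi_match":{
                "fields":[ "manual_keywords^1000", "name^2.0", "description^1.0", "keywords^1.0" ],
                "query":"black shoes"
              }
            }
          ]
        }
      },
      "filter":{
        "terms":{
          "program_id":[
            ... bunch of numbers ...
          ]
        }
      }
    }
  }
}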

Yes, more exclusivity will make searches run faster. Since your search matches 2.2M docs, it needs to score each of those in turn. For example, if it takes just 20 nanoseconds to score a single document, it will take 44ms to score all 2.2M (and that's ignoring all the other overhead like network latency, disk seeks, cache misses, pointer indirection, etc.).

Making the query match fewer documents will drastically improve latency. Obviously there is only so much you can do, but any little bit helps.
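
For example, assuming you had something like a category field to filter on (a hypothetical field name here), you could add it alongside the program_id filter so fewer documents ever reach the scoring phase:

{
  "query":{
    "bool":{
      "must":{
        "multi_match":{
          "fields":[ "manual_keywords^1000", "name^2.0", "description^1.0", "keywords^1.0" ],
          "query":"black shoes"
        }
      },
      "filter":[
        { "terms":{ "program_id":[ ... bunch of numbers ... ] } },
        { "term":{ "category":"shoes" } }
      ]
    }
  }
}

Filters don't score documents and can be cached, so the cheapest document is the one that never makes it into the scoring loop.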

Potentially, although in practice this only tends to help if you have multiple machines. If you have a single machine, then multiple primary shards may decrease latency (since you can use several threads at once), but it will come at a direct expense of throughput on that single machine. It also adds some overhead because you now need to merge results from multiple shards, so it isn't a clear-cut case.
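
If you do want to experiment with that, note that the number of primary shards is fixed at index-creation time, so you'd create a new index with settings like these (the counts are just an example) and reindex your data into it:

{
  "settings":{
    "number_of_shards":4,
    "number_of_replicas":1
  }
}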