TermsFilter performance as index grows

Hi,
we're dealing with a performance issue on Elasticsearch 1.5.2 (we also had it on 1.5.1 and earlier versions).

Our ES is a single-node instance on a Xeon with 12 HT cores and 64 GB of RAM; we plan to set up an ES cluster later this year. We set both $ES_MIN_MEM and $ES_MAX_MEM to 32g. We also made these changes to the default configuration file:

threadpool.search.type: fixed
threadpool.search.size: 24
threadpool.search.queue_size: 200000
bootstrap.mlockall: true
http.cors.enabled: true
script.disable_dynamic: false

We started our tests on an index with one shard, no replicas, and a single mapping with 400k docs (each with 10 subdocs) like this one:

{"user_id":201512,"chain_id":7735,"task_id":8856,"time_slices":[{"slice_end_time":4603,"slice_start_time":4505,"slice_time_effort":10},{"slice_end_time":6942,"slice_start_time":8835,"slice_time_effort":10},{"slice_end_time":8679,"slice_start_time":6006,"slice_time_effort":10},{"slice_end_time":7701,"slice_start_time":2305,"slice_time_effort":10},{"slice_end_time":5978,"slice_start_time":1504,"slice_time_effort":10},{"slice_end_time":8576,"slice_start_time":2713,"slice_time_effort":10},{"slice_end_time":6037,"slice_start_time":1160,"slice_time_effort":10},{"slice_end_time":8141,"slice_start_time":8894,"slice_time_effort":10},{"slice_end_time":6278,"slice_start_time":7509,"slice_time_effort":10},{"slice_end_time":3711,"slice_start_time":350,"slice_time_effort":10}]}

Each doc is indexed with an _id that is the concatenation of task_id and chain_id; each of the 400k docs has a different "user_id" value.
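
For reference, we index roughly like this (a simplified sketch of our code against the 1.x Java API; index and type names are changed):

IndexResponse response = client.prepareIndex("tasks", "task", task_id + "_" + chain_id)
        .setSource(jsonDoc) // the JSON document shown above, as a String
        .execute()
        .actionGet();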

Our target is to retrieve documents related to a set of users.

We cannot use a MultiGet to retrieve the docs related to a set of users (since the _id of the docs is not the user_id), so we use a TermsFilter. We could of course change the structure of the docs, but having the chain_id and task_id as the _id makes our life a lot simpler. We are testing this call because we'd like performance to stay stable as the number of docs in the index grows.
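
(For comparison, a MultiGet needs the _ids up front; roughly, with the 1.x Java API, where index, type and ids are placeholders:

MultiGetResponse mget = client.prepareMultiGet()
        .add("tasks", "task", "8856_7735", "8857_7735") // _ids built as task_id + "_" + chain_id
        .execute()
        .actionGet();

but we have no way to derive those _ids from a user_id.)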

We tested a TermsFilter on the "user_id" field with 20 different random values, wrapped in a ConstantScoreQuery: we sent 200k requests to ES in chunks of 4k/sec. The result was about 1400 responses/sec.

Then we grew the index by doubling the number of docs, reaching 800k (keeping user_ids unique); with the same TermsFilter on the "user_id" field with 20 random values, again 200k requests in chunks of 4k/sec, ES still managed about 1400 resp/sec.

Then we doubled the number of docs again, reaching 1.6M, and the same test led to about 1200 resp/sec.

We then grew the index a lot more, reaching 16M documents; we scaled the test down to a total of 6k requests in chunks of 300/sec (still with 20 terms in the filter), and ES managed about 200 resp/sec. We built another index with the same number of documents but with 5 shards, and the results were better: about 400 resp/sec.

We noticed that if, for example, the total running time of the test is 20 seconds, during the first 17 secs we receive only a few responses (at most a few hundred), and in the remaining 3 secs we receive the missing thousands all together, like a flush of accumulated responses. The higher the number of requests, the longer the wait before the flush; the flush duration seems proportional to the amount of data requested, so I guess it is a function of the bandwidth between the client and the server (we hit the 12 MB/sec limit of our 100 Mb network).
How is this behaviour explained? Shouldn't ES be giving back responses as soon as the requests arrive?

We'd like to have stable performance: should we add nodes to the cluster once we reach a certain number of docs in the index, and keep doing so as the number of docs grows? Is using more than one shard per node a good idea?

Thanks,
Andrea

I'm curious what your query looks like. Are you able to share that?

Also, it sounds like you have tried increasing the number of shards and have seen some improvement. Have you tried adding more nodes and increasing the number of replicas? Since you are doing lots of reads, I really think having more nodes and replicas would improve performance. However, it's tricky to say at what document count you should add more nodes, because that may differ based on the type and volume of the data you are indexing. Just as you have experimented on one node, try adding more nodes (maybe start with some virtual machines if you cannot provision another physical one) and replicas, and scale out that way.
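
For example, once extra nodes have joined, replicas can be increased dynamically; a sketch with the 1.x Java API (the index name "tasks" is a placeholder):

client.admin().indices().prepareUpdateSettings("tasks")
        .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.number_of_replicas", 1) // one copy of each shard on another node
                .build())
        .execute()
        .actionGet();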

sure:

ConstantScoreQueryBuilder query = QueryBuilders.constantScoreQuery(FilterBuilders.termsFilter("user_id", user_ids));

where user_ids is a String array.
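
We execute it roughly like this (index name changed):

SearchResponse response = client.prepareSearch("tasks")
        .setQuery(query)
        .setSize(20) // at most one doc per user_id
        .execute()
        .actionGet();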

I'll give it a try with more nodes to see how it behaves.

Thanks!

Also, just a thought: because you are looking up document ids via the terms filter, it may not be a good candidate for caching. One idea worth trying is a bool filter with two must clauses, looking up the chain_id and task_id respectively. Are there more chain_ids than task_ids? Putting the more selective one as the first must clause helps, and structuring it like this may make use of the filter cache, whereas the way you are doing it now you aren't taking advantage of it.
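
Something along these lines (just a sketch; the field values are made up):

BoolFilterBuilder filter = FilterBuilders.boolFilter()
        .must(FilterBuilders.termFilter("chain_id", 7735))      // the more selective field first
        .must(FilterBuilders.termsFilter("user_id", user_ids)); // then the user_id lookup
ConstantScoreQueryBuilder query = QueryBuilders.constantScoreQuery(filter);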

I hope that made sense.

Sarwar

we need to retrieve docs from that index both by the id (chain+task) and by filtering on user_id. We decided to use chain + task as the _id because it simplifies the business logic of the application; but we also need to retrieve docs by user_id, and that's why we're trying to understand the limits of the TermsFilter.
We have two choices:

  1. index the document using user_id
  2. index the document using chain + task

With choice 1 we can retrieve documents by user_id easily and with good performance using a MultiGet; since we rarely access docs via chain+task, retrieving those by filtering is not an issue; the drawback is more complex business logic.
With choice 2 we simplify the business logic, but to retrieve docs by user_id we have to use a TermsFilter, which is not giving us good performance.

We'd like to use choice 2 to simplify the business logic, but if we can't guarantee good performance we'll have to fall back on choice 1.

Thanks,
Andrea

Right, so I mistook user_ids for document ids. But even then, it's almost the same, since you said each of the documents has a unique user_id. My suggestion still stands: use a bool filter with two must clauses, where the first must clause narrows down the data somewhat (either by chain_id or task_id) and the second must clause is your lookup of the user_ids.

Sorry, I don't get your point. Why should I use a bool filter on chain and task when my user_ids are unique? The filter on user_id will return at most one doc per id, so I don't see the reason to bool filter on other values to narrow the search space. Or am I missing something?

Thanks,
Andrea

Narrowing the search space can make use of caching: that way, the term lookup can hit something that was already looked up instead of going against the entire index. It was just a suggestion to try.

Some info on filter caching:
https://www.elastic.co/guide/en/elasticsearch/guide/current/filter-caching.html
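
In the 1.x Java API a filter can also be marked cacheable explicitly; for example, the narrowing clause from the earlier sketch could be reused across requests (a sketch; the value is made up):

FilterBuilder narrowing = FilterBuilders.termFilter("chain_id", 7735)
        .cache(true); // this clause repeats across requests, so its bitset can be reused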

You might want to try a slightly smaller heap size, too. 32 GB is the mark where the JVM switches from compressed object pointers (compressed oops) to ordinary 64-bit pointers, so a 32g heap can actually hold fewer objects than one set just below the threshold (e.g. $ES_MIN_MEM and $ES_MAX_MEM at 31g).

Also, you could experiment with using doc_values on that user_id field and perhaps shrink your heap size considerably.
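
A sketch of what that mapping change might look like (the type name is hypothetical, and this assumes user_id is a long):

{
  "task": {
    "properties": {
      "user_id": { "type": "long", "doc_values": true }
    }
  }
}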