Use case and infrastructure questions and doubts

Hi,

I would like to share my use case and issues to see if I can get some good advice and recommendations.

#### Use case:

My product is a search engine based on ES. It indexes constantly, every day, into the same index. Sadly we cannot avoid this, because we don't know when documents expire or change until we receive the new ones, so we perform index, update, and delete operations. This creates a lot of deleted documents in the index.

Our search performance is degraded by all the indexing operations; if we stop indexing, searches speed up noticeably.

#### Current infrastructure and state:

Elasticsearch: 1.5.2
Cluster with 4 nodes
Each node:
  Total memory: 60 GB
  Heap size: 24.9 GB
1 index
1 primary shard
3 replicas
~6 million documents
~25 GB store size
Disk space used: 15%
Documents deleted: 37-40%
Refresh interval: 120
HTTP connection rate: 7/second

#### Questions and doubts:

  • We currently use a custom _id. We know it is not ideal according to this post.
    We chose that _id to keep our infrastructure and code simpler (we have to know the _id to perform update and delete operations later). If we change it to one of the recommended ID types, would the improved indexing time also improve search performance? By how much?

  • If I have 5 nodes and I set up one client to index with a connection string of 3 nodes, and another client to search with a connection string of the other 2 nodes, are we going to see any improvement in search performance? Or is it the same, with the load balanced across all the nodes?

  • Is it possible and useful to have 5 nodes, 3 for writing and 2 for reading? Would only the read nodes be faster? Or is this nonsense?

  • Is there any way to know which node answered a request?

Thank you in advance!

If we change it to one of the recommended ID types, would the improved indexing time also improve search performance? By how much?

Hard to say, but I doubt it would make a big difference to be honest.

If I have 5 nodes and I set up one client to index with a connection string of 3 nodes, and another client to search with a connection string of the other 2 nodes, are we going to see any improvement in search performance?

I'm not sure I understand this? "Connection string"? Do you mean routing certain queries to certain nodes? Something else?

Is it possible and useful to have 5 nodes, 3 for writing and 2 for reading? Would only the read nodes be faster? Or is this nonsense?

Nope, not really. ES doesn't partition data into "read" and "write" nodes; the responsibilities are shared by all nodes. Think of a cluster as a homogeneous pool of resources, for the most part.

Is there any way to know which node answered a request?

You can use the Search Shards API to see what shards/indices a search would hit, but it'll likely change each time you execute it due to picking different replicas.
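
For example (assuming an index named myindex; adjust to yours), this shows the candidate nodes and shard copies for a request:

    # List the shards (primary and replicas) a search on this index could hit
    curl -XGET 'localhost:9200/myindex/_search_shards?pretty'

The response contains a nodes section and a shards section, so you can see which replica on which node could serve the query, but not which one actually did.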

In general I think your problem is the full replication (1 shard, 3 replicas). You're essentially mirroring the data to all nodes. That means for every document you index, you are paying the cost of indexing four times (once on each node). At search time, only one node can be used to execute the query since there is only a single shard.

I think you'd be better off with more primary shards. Perhaps 2 primaries, 2 replicas. Or 4 primaries, 1 replica.
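
As a sketch (the index name here is hypothetical), a 2-primary, 2-replica index would be created like this. Keep in mind that number_of_shards can only be set at index creation, so changing it means reindexing into a new index:

    # Create a new index with 2 primary shards, each with 2 replicas
    curl -XPUT 'localhost:9200/myindex_v2' -d '{
        "settings": {
            "number_of_shards": 2,
            "number_of_replicas": 2
        }
    }'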

I think there may be other problems, as 7 requests/second is not large at all, and neither is a 25 GB index. What's your indexing rate?
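
One rough way to measure it, in case it helps: sample the indexing counters from the node stats API twice and divide the delta by the interval:

    # Per-node indexing counters; check indices.indexing.index_total
    curl -XGET 'localhost:9200/_nodes/stats/indices?pretty'

The difference in index_total between two samples, divided by the seconds elapsed, gives your documents-indexed-per-second.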

Hi. Thank you for your quick response!

Hard to say, but I doubt it would make a big difference to be honest.

Great, good to know.

I'm not sure I understand this? "Connection string"? Do you mean routing certain queries to certain nodes? Something else?

I mean that when I create the ES client, I configure the hosts array of the client differently for each: one with 2 hosts and the other with 3.

I think there may be other problems, as 7 requests/second is not large at all, and neither is a 25 GB index. What's your indexing rate?

Sorry, I made a mistake. The HTTP connection rate is 70/second.

What's your indexing rate?

I got this information from ES:
If there is a more exact way to get the indexing rate from ES, let me know; I would appreciate it.

In general I think your problem is the full replication (1 shard, 3 replicas). You're essentially mirroring the data to all nodes. That means for every document you index, you are paying the cost of indexing four times (once on each node). At search time, only one node can be used to execute the query since there is only a single shard.

I think you'd be better off with more primary shards. Perhaps 2 primaries, 2 replicas. Or 4 primaries, 1 replica.

I see your point. I am going to be more specific so you can better understand the use case.

We perform a lot of terms aggregations in our searches.

We can't show multiple results from the same company, so we aggregate on that field.

In addition, we pick which document to return from each term bucket with a Groovy script (to get the best-ranked document, the script just returns the document score and the aggregation then sorts the buckets).

This way, we get the best-scored document from each company.

The query is something like this:

    {
        "size": 0,
        "sort": ["_score"],
        "query": {
            "filtered": {
                "query": {...},
                "filter": {...}
            }
        },
        "aggs": {
            "agg_distinct": {
                "terms": {
                    "field": "company",
                    "size": 20,
                    "order": [ { "top_score": "desc" }, { "_term": "asc" } ]
                },
                "aggs": {
                    "top_score": {
                        "max": {
                            "script_file": "script-score"
                        }
                    },
                    "agg_top_hits": {
                        "top_hits": {
                            "size": 1
                        }
                    }
                }
            }
        }
    }

And here come my questions about sharding:

  • What advantage do I gain by sharding the index when all the data fits in one shard?
  • How do aggregations perform as the number of shards increases? Do they get worse, or does it not matter?

I am aware of the poor performance of Groovy scripts. I have implemented a native script plugin in Java before, but I ran a few tests and the results were disappointing. I have to admit, though, that the tests were pretty basic.

  • Do you think that the native plugin is worth it?

Any suggestions would be great. Thanks a lot for answering.

First off, if you're seeing a big hit from re-indexing, then addressing the indexing load will help more in the short term.

Try to bulk your updates if you can. That makes a really huge difference.
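
Something along these lines (index name, type, IDs, and fields are all made up) batches index, update, and delete operations into one request:

    # One bulk request carrying an index, an update, and a delete action
    curl -XPOST 'localhost:9200/_bulk' -d '
    { "index":  { "_index": "myindex", "_type": "doc", "_id": "1" } }
    { "company": "acme", "title": "new doc" }
    { "update": { "_index": "myindex", "_type": "doc", "_id": "2" } }
    { "doc": { "title": "changed title" } }
    { "delete": { "_index": "myindex", "_type": "doc", "_id": "3" } }
    '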

What advantage do I gain by sharding the index when all the data fits in one shard?

For an isolated query, only a single thread/core can search a shard, so having more shards lets you utilize more cores. (With many concurrent queries that doesn't actually make things faster.)
Another benefit of extra shards is that you can spread your indexing load. A 3-shard, 1-replica setup over 4 nodes would roughly double your indexing throughput (or 4 shards and 1 replica if you have 5 nodes).
Also, given that you're seeing 70 queries/sec, that spreading should decrease your overall search latency.

How do aggregations perform as the number of shards increases? Do they get worse, or does it not matter?

Aggregations would perform better, because now you can have multiple machines work on the query.

Yep.

In a single-primary situation, a query can only be executed by a single node (on a single thread). So the latency of that query is entirely dependent on the power of that machine, and as you grow your data, the latency becomes linearly worse. That's exactly why clustered services exist: you can only grow "vertically" so much before the machine isn't powerful enough.

Now, if you split the index into multiple primary shards, that means a single query will be executing in parallel across multiple nodes. Each node has a smaller chunk of work to do, so the latency of that chunk is lower. And since all the chunks are running in parallel, the overall latency of the query goes down.

The tradeoff is theoretically lower overall throughput. E.g. you can probably get more queries-per-second in the monolithic, fully-replicated case because there is less overhead and coordination. But your average- and worst-case latency will be far worse (sometimes drastically worse, e.g. your 95th percentile could be seconds instead of ms, because a query gets stuck in a long queue waiting to process).

Now, you can definitely take it too far. At some point, too many shards becomes a bad thing. E.g. you'll see clusters with hundreds of indices and thousands of shards...on just a handful of nodes. That's not ideal either :slight_smile:

Aggregations would follow similar performance characteristics, as they operate very similarly to search.

I mean that when I create the ES client, I configure the hosts array of the client differently for each: one with 2 hosts and the other with 3.

Ah, I see. The answer then is "no". Internally, ES will round-robin search requests to evenly distribute the load. So even if all your traffic goes through a single node, it will be spread across the cluster (although you'll put extra load on that "coordinator", which isn't great). It's best to evenly round-robin your requests across the nodes so the coordination is distributed too.


I wonder, however, if there's a way you can shift your data to make search and indexing faster. I agree with your assessment that all the updates are causing a lot of merge pressure on your index. All those deleted docs need to be merged out (which eats IO/CPU) and while they are sitting soft-deleted, they impact search performance.
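
If the deleted-docs ratio stays that high, one option (it's I/O-heavy, so run it off-peak; index name is a placeholder) is to explicitly expunge deletes with the 1.x optimize API:

    # Merge out soft-deleted documents without a full optimize
    curl -XPOST 'localhost:9200/myindex/_optimize?only_expunge_deletes=true'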

Looking at your query, it seems you're using it like a SQL query more than an ES-style search. E.g. using a terms agg to find the top-scoring doc for each company, then physically retrieving it via top_hits. That's alright, but certainly not super-fast. top_hits has to execute extra fetches to retrieve those docs (in addition to the fetches already done by the query), so it'll slow things down.

Could you perhaps restructure your data into a parent/child or nested scheme, so that company results are implicitly grouped together? Or alternatively, just retrieve the top results and iterate over them in your application to select the top-scoring doc for each company? Or run a pre-query to find the top companies, and a secondary msearch to get the top doc for each company as a single, individual search? A sketch of that last idea follows below.
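
As a rough sketch (the match query and company values here are invented; you'd reuse your real query in each body), the follow-up would be a single _msearch with one small, size-1 search per top company:

    # One header/body pair per company; each returns that company's top-scoring doc
    curl -XGET 'localhost:9200/myindex/_msearch' -d '
    { }
    { "size": 1, "query": { "filtered": { "query": { "match": { "title": "user query" } }, "filter": { "term": { "company": "acme" } } } } }
    { }
    { "size": 1, "query": { "filtered": { "query": { "match": { "title": "user query" } }, "filter": { "term": { "company": "globex" } } } } }
    '

Since results sort by _score by default, size 1 gives you the best document per company, and the whole thing is one round trip instead of N separate searches.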