Elasticsearch getByID and put going latent

Hi,

We have a use case where we need to create around 4000 different indices in a cluster. Every index has a different mapping. We were benchmarking Elasticsearch 6.6.1 and observed the following.

We first created a single index, indexed around 32k documents, and ran getByID for the stored documents. The 99.9% latency was under 100ms.

We then created another 1500 indices (each with one replica and without any data) and ran the benchmark again against the earlier index. The 99.9% latency shot up to 700-800ms.

When we delete these 1500 empty indices, the latency goes back to under 100ms.
Note: our cluster contains 12 data nodes, each with a 1TB disk and 16GB of memory allocated to Elasticsearch.

Does the number of indices in a cluster affect the getByID call in any way?

I don't think it should.
Could you share the exact request you are sending?

What other activity do you have at the same time, if any?

No other activity is going on. We are using the code below.

// Simple get-by-id with the transport client: GetRequest(index, type, id)
GetRequest getRequest = new GetRequest("indexName", "indexName", "id");
GetResponse getResponse = client.get(getRequest).actionGet();

What is the size of the document you are getting? Do you query the same document every time or different documents? Could they differ in size?

Document size is less than 10KB. I am fetching different documents (chosen at random) from the ~30k indexed there.

Could you try to get only the same exact document on every run?

I'd like to make sure that the number of indices in the cluster does not affect the response time of getting the same document.

It'd be great if you could share a bit more about your test scenario, with code or pseudocode that explains exactly what you are running.

Are you testing that with Rally by any chance? Copying @danielmitterdorfer who might also have some ideas.

Sure, I'll fetch the same document and get back with the results.

I have written a Dropwizard service and am using the TransportClient to fetch the document by ID. I am using our own load-generation tool to hit this getByID API at 5000 QPS and then plotting the latency.
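For reference, the client is built roughly like this (a sketch; the cluster name and host below are placeholders, not our real values):

import java.net.InetAddress;
import java.net.UnknownHostException;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

// Builds a TransportClient against a 6.x cluster on the transport port (9300).
static TransportClient buildClient() throws UnknownHostException {
    Settings settings = Settings.builder()
            .put("cluster.name", "my-cluster")   // placeholder cluster name
            .build();
    return new PreBuiltTransportClient(settings)
            .addTransportAddress(new TransportAddress(InetAddress.getByName("es-node-1"), 9300));
}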

Our use-case is as follows:
We need to store information about products. There are around 4000 different categories of products, so we intend to create 4000 indices, one per category, each with the mappings that category requires.
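Creating each category index with its mapping looks roughly like this with the transport client (a sketch; the index name and field definitions here are made up for illustration):

// Sketch: one index per product category, each with its own mapping.
// XContentType is org.elasticsearch.common.xcontent.XContentType.
String index = "products-electronics";
String mapping = "{\"properties\":{\"name\":{\"type\":\"text\"},\"price\":{\"type\":\"double\"}}}";
client.admin().indices().prepareCreate(index)
        .setSettings(Settings.builder()
                .put("index.number_of_shards", 1)
                .put("index.number_of_replicas", 1))
        .addMapping(index, mapping, XContentType.JSON)
        .get();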

In the current scenario, I first created a single index, indexed around 32,000 documents, and ran getByID benchmarks on it. The 99.9% latency was around 100ms. Then I created another 1500 indices and put mappings on them as well. I didn't index any documents into those 1500 indices and re-ran the benchmark against the earlier index. The latency for the same getByID calls increased to around 600-700ms.

I then deleted the 1500 indices and the latency came back to around 100ms. I am now adding 100 indices every half hour to check at what point the latency increases. I have attached a screenshot of the latency graph.

We just observed another data point.

  1. We deleted and re-created the 1500 indices, but this time we didn't put mappings on them. We then put the mapping on one test index, indexed around 30k documents again, and ran the benchmark on that test index. The results were fine: the 99.9% latency stayed around 100ms.
  2. We then put the mappings on the remaining 1500 indices (without indexing any documents into them; the mapping call is sketched just after this list) and the latency for the test index shot up to 600-700ms again.
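The mapping call from step 2 is essentially a put-mapping on an empty, already-created index, roughly like this (a sketch; the index name and field definitions are placeholders):

// Sketch: add a mapping to an existing empty index (names and fields are placeholders).
client.admin().indices().preparePutMapping("category-index-0001")
        .setType("category-index-0001")
        .setSource("{\"properties\":{\"name\":{\"type\":\"text\"}}}", XContentType.JSON)
        .get();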

Sharing the size of the cluster state:

  1. With one index and its mapping set => 248KB
  2. With 1500 indices (without mappings) => 3.7MB
  3. With 1500 indices with mappings set => 114MB

Does the size of the cluster_state have a role to play in normal getByID and index (put) latencies? Please note that the data nodes have 39GB of RAM in total, out of which 16GB of heap space has been assigned to Elasticsearch:

elastic+ 9033 14.9 43.1 23075216 17777420 ? SLsl Mar04 284:36 /usr/bin/java -Xms16g -Xmx16g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -Des.allow_insecure_settings=true -XX:+HeapDumpOnOutOfMemoryError -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=deb -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet
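For what it's worth, the cluster state can also be inspected through the same transport client, roughly like this (a sketch showing how one could check how many indices the state tracks; not necessarily how the sizes above were measured):

// Sketch: fetch the cluster state and report how many indices it contains.
ClusterStateResponse stateResponse = client.admin().cluster().prepareState().get();
ClusterState state = stateResponse.getState();
System.out.println("indices in cluster state: " + state.getMetaData().indices().size());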

@dadoonet: We ran the benchmark with just one document as well. Results are the same. As you increase the number of indices along with their mappings, the latencies start to take a hit.

It seems like it really is because of the cluster state. Any further input here?

The title says that you are running getById and put, while your description states that you are only fetching data and not performing any updates. Which one is correct? Can you share the exact queries you are running? Are you specifying a single index name in your requests or an index pattern?

When we ran benchmarks for put, those were going latent too (hence we included it in the title). In the current runs we are just running getById requests and nothing else.

I am specifying the exact indexName in the request.

Like I mentioned previously, I am running a Dropwizard service and am using the transport client. Pasting the code below:

@Timed
@ExceptionMetered
public byte[] getById(String indexName, String id) {
    // GetRequest(index, type, id) - the index name is reused as the type name here
    GetRequest getRequest = new GetRequest(indexName, indexName, id);
    // getSourceAsBytes() returns null if the document is missing or has no source
    return client.get(getRequest).actionGet().getSourceAsBytes();
}

I am hitting the above API with the test index name and choosing an ID at random from the list of 32k documents we have indexed.

Could you switch to the Rest Client instead?
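Switching would look roughly like this with the high-level REST client (a sketch assuming a 6.x cluster reachable on port 9200; the host is a placeholder and indexName/id are the same variables as in the method above):

// Sketch: the same get-by-id through the high-level REST client.
// get(...) declares IOException, which is left unhandled here for brevity.
RestHighLevelClient restClient = new RestHighLevelClient(
        RestClient.builder(new HttpHost("es-node-1", 9200, "http")));
GetRequest getRequest = new GetRequest(indexName, indexName, id);
byte[] source = restClient.get(getRequest, RequestOptions.DEFAULT).getSourceAsBytes();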

How does query latency vary by concurrency level? If you start with fewer concurrent queries and gradually increase it, is the pattern the same?

Switching to the REST client also didn't make any difference.

We slowly ramped up the concurrency and the results are the same. As soon as we delete the other indices, or don't put their mappings, the latencies come down.

Please advise what more can be done here.

Can you share your full code with the REST client, please?
Could you also try with the low-level one?
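With the low-level client, the same lookup is just a raw HTTP GET, roughly like this (a sketch; the host is a placeholder and error handling is omitted):

// Sketch: the same get-by-id through the low-level REST client.
RestClient lowLevelClient = RestClient.builder(new HttpHost("es-node-1", 9200, "http")).build();
Request request = new Request("GET", "/" + indexName + "/" + indexName + "/" + id);
Response response = lowLevelClient.performRequest(request);   // throws IOException
String body = EntityUtils.toString(response.getEntity());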
