Timeouts, GC overhead and plenty of other beginner errors

Hi, I'm super new to ES and I have some problems. Actually, quite a lot.

My setup is described at the end; the only thing that matters for understanding my problems is that I have two different servers which share some of the problems, but not all of them. I'll call them "local" and "distant".

Problems

First

First, on both servers a lot of time is spent in GC. Here are two log lines (the first from local, the second from distant):

[INFO ][o.e.m.j.JvmGcMonitorService] [SBii_Wb] [gc][182] overhead, spent [330ms] collecting in the last [1s]
[INFO ][o.e.m.j.JvmGcMonitorService] [SBii_Wb] [gc][205] overhead, spent [393ms] collecting in the last [1.3s]

I don't understand why this happens, even though I've googled quite a lot about it.
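
In case it's relevant, this is roughly how I check heap usage and GC counts (assuming a default local install listening on port 9200; I'm not sure it's the best way to monitor this):

    # heap_used_percent and the jvm.gc.collectors section are what I look at
    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'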

Second

My second problem is that I get timed out by ES. Both Kibana and my program time out and I don't understand why. It happens on both machines, probably for different reasons.

Third

On my local setup, RAM usage sometimes goes up to 100%. When that happens, ES just crashes. Can I do something to prevent this?

Fourth (and last)

On both the local and distant servers, I have the following warning:

[WARN ][o.e.b.BootstrapChecks    ] [SBii_Wb] max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

On the distant one, I also have:

[WARN ][o.e.b.BootstrapChecks    ] [SBii_Wb] max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]

As you'll read in my setup, I'm not root. Can I change those settings without being root?

Setup

Here is my setup. I can't be root, so I'm using the tar.gz version of ES. I have about 540k documents in a single index with a single primary shard (as far as I understand, at least). I'd like to send ES up to 200 queries/s, and if possible up to maybe 1k/s. I have the resources (RAM and CPU) to do so, but I'm having trouble with ES. All the documents together only weigh 1.4 GB. One more detail: I actually have more indices, since I'm using Kibana and I'm testing some parameters of the search engine (changing k1 and b of the BM25 model). My JVM heap size is set to 16 GB. If you need any more info, just ask.
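
For reference, the heap is set the usual way in config/jvm.options:

    # config/jvm.options (tar.gz install), heap fixed at 16 GB
    -Xms16g
    -Xmx16g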

Thanks for reading and answering if you can.
Cheers

Hi, I'm bumping this :slight_smile:

Hi @Ricocotam,

A few initial comments and questions:

  1. You have all docs in one primary shard. Does it have any replicas? If not, you are only using one node for the searches.
  2. Is the scenario that causes the problems running all those queries? If so, it can be beneficial to find the limit by measuring how many queries/s you can handle on a single node and then scaling the environment accordingly.
  3. The actual query matters; maybe it can be tuned, or a different structure could help.
  4. When you get timed out, what response do you get? A 429 would be expected if overloaded.
  5. How much RAM do the machines have?
  6. When Elasticsearch crashes, it could be that it was killed by the Linux OOM killer (if you are running Linux). Examining /var/log/messages or kern.log could give insight into that. Otherwise, the Elasticsearch log file is likely to contain information on what happened.
  7. Whether you can change the file descriptor limit without root depends on the setup. There are typically hard and soft limits set for a user; the user cannot exceed the hard limit, but can increase the soft limit up to the hard limit (see the example after this list).
  8. Increasing vm.max_map_count requires root.
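
For point 7, these commands (run as the user that starts Elasticsearch, in the same shell, before starting it) show the current limits and raise the soft limit; whether a permanent change is possible without root depends on how the limits are configured on your system:

    ulimit -Hn        # hard limit on open file descriptors
    ulimit -Sn        # current soft limit
    ulimit -n 65536   # raise the soft limit for this shell; only works up to the hard limit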

Hi @HenningAndersen, thanks for your answer

  1. I have no replicas. From what I read, I have too little data (only 1.4 GB) to need more than one shard. But I might not be understanding correctly how to use them.
  2. Looking at the logs right now, I see the overhead info popping up every 2 hours or so (when the server is in use), but the counter increases by a lot more than just +1 each time. I monitored the CPU usage and I'm far from what I have available (only using about half), and the heap is only at about 10% of what's available to the JVM, which itself has only a third of what's available on the server.
  3. All the queries are very similar and I'm requesting exactly the same thing (just the ids of the documents; see the example after this list). It would be very strange if a particular query changed anything. I'm using the Robust2004 dataset from the TREC conference; it's pretty standard in the IR field.
  4. I figured out the timeout issue: my network was overloaded, so the server didn't answer. I fixed it. Thanks for the suggestion though.
  5. I gave 96 GB to the JVM (just in case) and I have 256 GB available in total.
  6. If it ever happens again I'll look into this. It's probably the OOM killer that's to blame.
  7. Are these values critical for ES to search well? I'm only interested in searching as fast as possible. I'm currently handling about 200 queries/second and it works pretty well, but I'd like to scale up to 500/s. I have the hardware, but some settings might need tuning. Are those values a problem?
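
To give an idea for point 3, every query looks roughly like this (the index name, field name and query text here are placeholders, not my real ones); since I only need the ids, I turn off the _source:

    curl -s -H 'Content-Type: application/json' 'http://localhost:9200/robust2004/_search?pretty' -d '
    {
      "size": 100,
      "_source": false,
      "query": {
        "match": { "text": "international organized crime" }
      }
    }'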

Again, thanks for your answer, I really appreciate it.

Hi @Ricocotam,

  1. I may have misunderstood your setup initially as a two-server setup. I now think it is a single-server setup (but you have two different environments). If that is true, adding replicas or splitting into multiple shards will not help. But if you scale out, adding replicas (or splitting the index into multiple shards) can help search performance (see the first example after this list).
  2. With 96 GB of heap, the overhead may be artificial. I would need a GC log to figure out whether this is problematic or not (the second example after this list shows one way to enable GC logging).
  3. Not having seen the query, it is hard to judge whether it runs optimally or could be tuned. I would recommend spending some effort here, since the speed-up that can be achieved is sometimes massive.
  4. Great.
  5. Normally going beyond 30GB is not recommended (though I cannot rule out that special workloads could benefit from that large a heap). Elasticsearch will use the rest of the memory for file caching anyway.
  6. That depends on whether you hit those limits or not. With one or only a few shards, my best guess is that you do not. I am not sure whether this would affect performance or result in errors.
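
On point 1, for the record, if you do add nodes later, adding a replica is just a dynamic index settings update (the index name here is a placeholder):

    curl -s -H 'Content-Type: application/json' -X PUT 'http://localhost:9200/robust2004/_settings' -d '
    {
      "index": { "number_of_replicas": 1 }
    }'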
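
On point 2, if GC logging is not already enabled in your config/jvm.options, a line along these lines turns it on for JDK 9 and later (recent Elasticsearch versions ship with a similar default; adjust the path as needed):

    # config/jvm.options: unified JVM logging of GC events, with rotated log files
    -Xlog:gc*,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m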

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.