Timeout, GC overhead and plenty of other beginner errors

Ricocotam · April 10, 2019, 7:33am

Hi, I'm super new to ES and I have some problems. Actually, quite a lot.

My setup is at the end, the only thing that matters to understand my problems is that I have two different servers which share some problems but not all. I'll call them "local" and "distant".

Problems

First

First, on both servers I spend a lot of time in GC. Here are two log lines (first local, second distant):

[INFO ][o.e.m.j.JvmGcMonitorService] [SBii_Wb] [gc][182] overhead, spent [330ms] collecting in the last [1s]
[INFO ][o.e.m.j.JvmGcMonitorService] [SBii_Wb] [gc][205] overhead, spent [393ms] collecting in the last [1.3s]

I don't understand why this happens, even though I googled it quite a lot.
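As a rough way to read these lines: each one reports how much of the last interval was spent in garbage collection. The sketch below only does that arithmetic on the two lines quoted above. The function name and the regular expression are illustrative, and it assumes the durations are given in ms or s, as they are here.

```python
import re

# Matches the "spent [X] collecting in the last [Y]" part of a
# JvmGcMonitorService overhead line. Only ms and s units are handled
# (an assumption based on the two lines quoted in this post).
_GC_RE = re.compile(
    r"spent \[(?P<gc>[\d.]+)(?P<gc_unit>ms|s)\] "
    r"collecting in the last \[(?P<win>[\d.]+)(?P<win_unit>ms|s)\]"
)
_TO_MS = {"ms": 1.0, "s": 1000.0}


def gc_overhead_fraction(line):
    """Return GC time divided by elapsed time for one log line, or None."""
    m = _GC_RE.search(line)
    if m is None:
        return None
    gc_ms = float(m.group("gc")) * _TO_MS[m.group("gc_unit")]
    win_ms = float(m.group("win")) * _TO_MS[m.group("win_unit")]
    return gc_ms / win_ms


LOG_LINES = [
    "[gc][182] overhead, spent [330ms] collecting in the last [1s]",
    "[gc][205] overhead, spent [393ms] collecting in the last [1.3s]",
]

for log_line in LOG_LINES:
    print(f"{gc_overhead_fraction(log_line):.1%} of the interval spent in GC")
```

For the two lines above this works out to roughly 33% and 30% of the interval.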

Second

My second problem is that requests to ES time out. Both Kibana and my program get timeouts and I don't understand why. It happens on both machines, probably for different reasons.

Third

On my local setup, RAM usage sometimes goes up to 100%. When that happens, ES just crashes. Can I do something to prevent this?

Fourth (and last)

On both the local and the distant server, I have the following warning:

[WARN ][o.e.b.BootstrapChecks    ] [SBii_Wb] max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

On the distant one, I also have:

[WARN ][o.e.b.BootstrapChecks    ] [SBii_Wb] max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]

As you'll read in my setup, I'm not root. Can I change those settings without being root?

Setup

Here is my setup. I can't be root, so I'm using the tar.gz version of ES. I have about 540k documents in a single index with a single primary shard (as far as I understand). I'd like to query ES at up to 200 queries/s and, if possible, up to maybe 1k queries/s. I have the resources (RAM and CPU) to do so, but I'm having trouble with ES. All the documents together weigh only 1.4GB. One more detail: I actually have more indices, because I'm using Kibana and I'm testing some parameters of the search engine (changing k1 and b in the BM25 model). My JVM heap size is set to 16GB. If you need any more info, just ask.
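For context, this is roughly what an index with custom BM25 parameters looks like. It is a minimal sketch, not the exact configuration used here: the field name `text`, the similarity name `custom_bm25`, and the helper function are made up for illustration, and the typeless `mappings` layout assumes ES 7.x (6.x expects a mapping type name under `mappings`).

```python
import json


def bm25_index_body(k1, b, field="text", similarity_name="custom_bm25"):
    """Build a create-index request body with a custom BM25 similarity.

    `field` and `similarity_name` are hypothetical names for illustration.
    """
    return {
        "settings": {
            "index": {
                "similarity": {
                    similarity_name: {"type": "BM25", "k1": k1, "b": b}
                }
            }
        },
        "mappings": {
            "properties": {
                # The text field is scored with the similarity defined above.
                field: {"type": "text", "similarity": similarity_name}
            }
        },
    }


# One body per (k1, b) pair to compare; each would go to its own index.
for k1, b in [(1.2, 0.75), (0.9, 0.4)]:
    print(json.dumps(bm25_index_body(k1, b), indent=2))
```

Each body would be sent as the payload of a create-index request, one index per parameter pair.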

Thanks for reading and answering if you can.
Cheers

Ricocotam · April 17, 2019, 8:04am

Hi, I'm bumping this.

HenningAndersen · April 17, 2019, 3:27pm

Hi @Ricocotam,

A few initial comments and questions:

You have all docs in one primary shard. Does it have any replicas? If not, you would only be using one node for the searches.
Is the scenario that causes the problems the one where you run all the queries? If so, it can be beneficial to search for the limit by finding out how many queries/s you can handle on a single node and then scaling the environment accordingly.
The actual query matters; maybe it can be tuned, or a different structure can help.
When you get timed out, what response do you get? A 429 would be expected if overloaded.
How much RAM do the machines have?
When Elasticsearch crashes, it could be that it was killed by the Linux OOM killer (if running Linux). Examining /var/log/messages or kern.log could give insight into that. Otherwise, the Elasticsearch log file is likely to contain information on what happened.
Whether you can change file descriptors without root depends on the setup. There are typically hard and soft limits set for a user; the user cannot exceed the hard limit, but can increase the soft limit up to the hard limit.
Increasing vm.max_map_count requires root.
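The soft/hard distinction can be checked without root. The sketch below reads the current file descriptor limits of the calling process and the current vm.max_map_count, and says whether the soft limit could be raised far enough. The helper names are illustrative, the thresholds are the ones from the bootstrap warnings in this thread, and the /proc path assumes Linux.

```python
import resource

REQUIRED_NOFILE = 65536      # from the bootstrap warning in this thread
REQUIRED_MAP_COUNT = 262144  # from the bootstrap warning in this thread


def _meets(limit, required):
    # RLIM_INFINITY means "no limit", so it always satisfies the requirement.
    return limit == resource.RLIM_INFINITY or limit >= required


def classify_nofile(soft, hard, required=REQUIRED_NOFILE):
    """Say whether the file descriptor limit is fine, fixable, or blocked."""
    if _meets(soft, required):
        return "ok"
    if _meets(hard, required):
        return "raise soft limit (no root needed)"
    return "hard limit too low (needs an administrator)"


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Return the current vm.max_map_count, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("nofile soft/hard:", soft, hard, "->", classify_nofile(soft, hard))
    print("vm.max_map_count:", read_max_map_count(),
          "(required:", REQUIRED_MAP_COUNT, ")")
```

A raised soft limit only applies to the process that raises it and its children, so it has to be set in the shell that launches Elasticsearch (for example with `ulimit -n`) rather than afterwards.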

Ricocotam · April 18, 2019, 6:35am

Hi @HenningAndersen, thanks for your answer

I have no replica. From what I read, I have too little data (only 1.4 GB) to need more than one shard. But I might not understand correctly how to use these.
Looking at the logs right now, I see the overhead message popping up every 2 hours or so (when in use), but the value increments by a lot more than just +1. I monitored the CPU usage and I'm far from what I have available (only using about half). Only about 10% of the RAM available to the JVM is used, and the JVM itself is given only about a third of what's available on the server.
All the queries are very similar and I'm requesting the exact same thing (just the IDs of the documents). It would be very strange for one query to change something. I'm using the Robust2004 dataset from the TREC conference. It's pretty standard in the IR field.
I figured out the timeout issue: my network was overloaded, so the server didn't answer. I fixed it. Thanks for the suggestion though.
I gave 96GB to the JVM (you never know) and I have 256 GB available in total.
If it ever happens again I'll look into this. It's probably the OOM killer that's to blame.
Are these values critical for ES to work well for searching? I'm only interested in searching as fast as possible. I'm currently handling about 200 queries/second and it works pretty well, but I'd like to scale up to 500/s. I have the power, but some settings might be better. Are those values a problem?
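One way to look for the single-node limit suggested earlier in the thread is to run a fixed batch of queries at increasing concurrency and watch where queries/second stops growing. This is a minimal sketch: `send_query` is a hypothetical callable that issues one search request and returns when the response arrives; nothing here is specific to any Elasticsearch client.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(send_query, n_requests, workers):
    """Run n_requests calls of send_query on a thread pool; report queries/s."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")

    def timed_call(_):
        start = time.perf_counter()
        send_query()  # any exception raised here propagates to the caller
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(timed_call, range(n_requests)))
    wall = time.perf_counter() - wall_start

    return {
        "completed": len(latencies),
        "qps": len(latencies) / wall if wall > 0 else float("inf"),
        "median_latency_s": latencies[len(latencies) // 2],
    }


if __name__ == "__main__":
    # Stand-in for a real search call: sleeps 5 ms instead of hitting ES.
    def fake_query():
        time.sleep(0.005)

    for workers in (1, 4, 16):
        print(workers, measure_throughput(fake_query, 200, workers))
```

With a real `send_query`, the point where adding workers no longer raises `qps` (or where latency climbs sharply) gives a rough estimate of what one node can sustain.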

Again, thanks for your answer, I really appreciate it.

HenningAndersen · May 3, 2019, 3:56pm