Enterprise Search takes forever to start up and rejects all incoming requests

Hello, I'm having a problem with my on-premises Docker ELK stack running both Elasticsearch and Enterprise Search, version 7.14.0.

When I start up my containers, Elasticsearch comes up quickly, but Enterprise Search takes forever to start. In the meantime, I get a 503 error for every user request. If I look at the logs, those queries take more than 100 seconds, when they normally take 500-800 ms.

I can only wait until after midnight, when my users log out, or shut down my whole website so Enterprise Search can get back on its feet. After that, everything works fine. If I recreate my containers under my usual user load, it won't start.

In the CPU stats below, you can see that Enterprise Search is taking all the juice.
In my configuration, Elasticsearch has 6 GB of RAM and Enterprise Search has 4 GB. I tried giving both 6 GB, with the same results.
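
For reference, the memory settings boil down to roughly the following (a simplified sketch, not my exact commands; the flags and container names are just illustrative):

```
# Simplified sketch of the memory settings (illustrative, not my exact setup).
# Elasticsearch: 6 GB JVM heap via ES_JAVA_OPTS.
docker run -d --name elasticsearch \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms6g -Xmx6g" \
  docker.elastic.co/elasticsearch/elasticsearch:7.14.0

# Enterprise Search: 4 GB JVM heap via JAVA_OPTS
# (other required settings, e.g. encryption keys and Elasticsearch credentials, omitted here).
docker run -d --name enterprise-search \
  -e "elasticsearch.host=http://elasticsearch:9200" \
  -e "JAVA_OPTS=-Xms4g -Xmx4g" \
  docker.elastic.co/enterprise-search/enterprise-search:7.14.0
```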

Can you please help me troubleshoot this situation? Every time I have to upgrade the ELK stack or do maintenance on the server, I have the same problem.

Thank you in advance
Gabriel

Hi @gpribi! I'm sorry to hear you're experiencing issues. We're continually working on ways to improve the product and more specifically the start-up time. Look out for improvements in future releases.

With that said, the guide Upgrading self-managed deployments (Elastic Enterprise Search Documentation [7.14] | Elastic) provides some instructions around doing in-place upgrades that may be useful during scheduled maintenance as well.

It's worth noting that Elastic Cloud will handle this type of issue for you automatically.

So we can attempt to reproduce the issue, would you mind sharing what your usual user load looks like?

Thanks!
Brian

Hello @Brian_McGue! Thanks for your reply.
I forgot to mention that this delay happens every time I restart the ELK containers on my server, not only when upgrading ELK itself.

In this case, the problem started at 20:30 and was resolved at 21:50, only after I shut down the webserver for a couple of minutes to bring the requests down to zero.

The CPU shown in "docker stats" is very variable even under normal conditions. I'm sharing the Kibana metrics from 20:00 to 23:00 so you can see the situation before, during, and after the startup problem. I hope this information is what you need.

[Kibana metrics screenshots, 20:00-23:00]

If you need any more information, just tell me.
Thank you!

Hello, Gabriel,

Sorry you're having issues with your deployment.

First, I wanted to mention something: Enterprise Search does not do anything special after starting, and certainly nothing that would account for the 1.5 hours of CPU crunching we're seeing on the graphs. So I suspect something external is pushing it over the edge.

One very suspicious item in the data you've presented is a message from the Enterprise Search crawler about the cluster being in read-only mode. Did you put the cluster in read-only mode? If so, why?
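
For what it's worth, you can check for (and clear) a cluster-wide write block with something like the commands below; the host and credentials are placeholders. I believe Enterprise Search also exposes its own read-only flag through its internal API, so that one is worth checking too:

```
# Check the current cluster settings for read-only blocks (host/credentials are placeholders).
curl -s -u elastic:changeme "http://localhost:9200/_cluster/settings?flat_settings=true&pretty"

# Clear a cluster-wide read-only block if one was left enabled.
curl -s -u elastic:changeme -X PUT "http://localhost:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -d '{"persistent": {"cluster.blocks.read_only": null, "cluster.blocks.read_only_allow_delete": null}}'

# Enterprise Search's own read-only flag (endpoint as I recall it; adjust host/credentials).
curl -s -u enterprise_search:password "http://localhost:3002/api/ent/v1/internal/read_only_mode"
```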

To limit the confusion here, I'd recommend you do the following:

  1. Describe your current setup (number of CPU cores, amount of RAM given to both Elasticsearch and Enterprise Search, data size in Elasticsearch).
  2. Describe what you're trying to do (clearly it is not just a restart, given the read-only mode message).
  3. Outline the steps you're taking and what you're seeing as a result.
  4. Finally, I'd recommend looking at the Enterprise Search logs for information on what it is doing - if it takes forever to start, let's make sure we know what it is saying while doing that (if it is an upgrade from an old version, there may be data migrations involved, which would take time and a bunch of CPU, etc.). See the sketch after this list for one way to follow the logs.
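
For that last point, something along these lines is usually enough to see what it's doing while it starts (the container name and log path are guesses based on your description):

```
# Follow the Enterprise Search container output during startup (container name is a guess).
docker logs -f --since 30m enterprise-search

# The application log files inside the container can also help
# (the path may differ depending on the image/version).
docker exec -it enterprise-search ls /usr/share/enterprise-search/log
```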

Let me know if I can clarify any of the requests above.

Hello Oleksiy, I'll try to answer all your questions as clearly as possible.

Regarding the read-only mode: that's because this time I was upgrading the ELK stack, and turning read-only mode on is one of the recommended steps. However, this can probably be discarded as the root cause, since I've had the same startup issues when upgrading Docker or restarting the server, without any changes to the ELK images.

My setup is the following:

  • AWS EC2 m5a.xlarge instance (4 CPU cores, 16 GB RAM, 50 GB SSD)
  • Elasticsearch has 6 GB of RAM. Enterprise Search had 4 GB when I took the original screenshots; during the hang I changed it to 6 GB to see if it would manage to start, but it didn't work. I've kept it at 6 GB for now.
  • Ubuntu 18.04.5 LTS
  • Single node cluster
  • On the same EC2 instance I only have Elasticsearch, Enterprise Search, and Metricbeat, all of them at v7.14.0

What am I trying to do? Restart the ELK stack for maintenance purposes.

Steps: if I just stop the containers and restart them, the issue occurs. I haven't tried restarting only Enterprise Search yet.
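
To be concrete, the restart amounts to roughly this (written as docker-compose-style commands for illustration; the service names are placeholders, not my exact setup):

```
# Rough sketch of my maintenance restart (service names are placeholders).
docker-compose stop enterprise-search elasticsearch
docker-compose up -d elasticsearch enterprise-search

# Restarting only Enterprise Search (not tried yet) would be something like:
docker-compose restart enterprise-search
```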

During the issue, the Enterprise Search logs look like the one I attached at the beginning: it receives some requests but takes far too long to respond, and all the other requests get 50X errors. Even when I try to access Enterprise Search's backend, I get the same error.
I've checked for migrations in this particular case and they had previously run OK. After that, the server goes "online" and starts up with the delays I've described.

I guess the evidence points to Enterprise Search, since its CPU consumption is huge, much higher than Elasticsearch's. Maybe it's not able to handle a normal flow of requests during startup. Remember that the only way I've managed to get this thing back is to shut down my webserver, or wait until 1 AM when the users log out. After that, without intervention, it works fine.
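
In script form, my manual workaround amounts to something like the sketch below; the webserver container name and the health endpoint are assumptions on my part, not what I actually run:

```
#!/usr/bin/env bash
# Sketch of the manual workaround: stop incoming traffic, wait for Enterprise Search,
# then bring the webserver back. Names and the health endpoint are illustrative assumptions.
set -euo pipefail

docker stop webserver   # bring user requests down to zero

# Poll until Enterprise Search answers (health endpoint/credentials are assumptions).
until curl -sf -u elastic:changeme "http://localhost:3002/api/ent/v1/internal/health" > /dev/null; do
  echo "Enterprise Search not ready yet, waiting..."
  sleep 30
done

docker start webserver  # restore user traffic
```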

Thank you for your help!