Using CirrusSearch in MediaWiki, recently getting error "We could not complete your search due to a temporary problem"

For the past month or so, CirrusSearch has been failing suddenly and at random with the message "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later."

Our wiki has used Elasticsearch via CirrusSearch for a good while now with no issues, but recently our traffic has slowly been increasing, and that is when the trouble with search began. Restarting our VPS would solve the issue, but over time search would eventually go down again. As wiki traffic gradually increased, so did the frequency of the error, up to the point where search would go down daily.

Thinking it might be a memory use issue, I created a custom.options file in elasticsearch/jvm.options.d with the settings

-Xms3g

-Xmx3g

Nothing changed at first, as I didn't restart Elasticsearch, but the next morning search was down per usual, so I rebooted the VPS to get it working again. This time, that didn't solve the problem. The message "An error has occurred while searching" was still appearing. I deleted the custom.options file I had created, and rebooted the VPS again. Still this didn't solve the problem.
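As a side note on the heap settings above: the JVM only reads these options when Elasticsearch starts, so a restart of the service is needed before they take effect. Below is a minimal Python sketch for sanity-checking such a file. It assumes plain one-flag-per-line entries like the two shown, and it applies the commonly documented guidance that -Xms and -Xmx should be equal and no more than half of total RAM. The function names are made up for illustration, and on a box that also runs the wiki itself the sensible ceiling is probably lower than 50%.

```python
import re

# Multipliers for the size suffixes the JVM accepts on -Xms/-Xmx.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_heap_flags(options_text):
    """Return e.g. {'Xms': bytes, 'Xmx': bytes} for the heap flags found.

    Only handles plain lines such as '-Xms3g'; comments and blank lines
    are skipped, anything else is ignored.
    """
    found = {}
    for raw in options_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = re.fullmatch(r"-(Xms|Xmx)(\d+)([kKmMgG]?)", line)
        if match:
            flag, number, unit = match.groups()
            found[flag] = int(number) * _UNITS.get(unit.lower(), 1)
    return found


def check_heap(options_text, total_ram_bytes):
    """Return a list of human-readable problems (empty if none were found)."""
    flags = parse_heap_flags(options_text)
    problems = []
    if "Xms" not in flags or "Xmx" not in flags:
        problems.append("both -Xms and -Xmx should be set")
    elif flags["Xms"] != flags["Xmx"]:
        problems.append("-Xms and -Xmx should be equal")
    if flags.get("Xmx", 0) > total_ram_bytes // 2:
        problems.append("-Xmx is more than half of total RAM")
    return problems


if __name__ == "__main__":
    # The settings from the post, checked against the 8 GB container.
    print(check_heap("-Xms3g\n-Xmx3g\n", 8 * 1024 ** 3))
```

An empty list only means the file passes these two rough checks; it says nothing about whether 3 GB is the right size for this particular index and traffic.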

To avoid not having any search function at all, we're now using the default mediawiki search. I checked the elasticsearch.log, but it didn't seem to have anything useful. It did have the message "Native controller process has stopped - no new native processes can be started," but there were no other error messages or an explanation as to why search stopped. I'm not even sure if that's an error or that's just when I disabled CirrusSearch because it had already stopped working anyway and was showing the "error has occurred while searching" message.

Our wiki runs on a shared VPS with 4 cores, and our container has 8 GB of guaranteed RAM. Elasticsearch runs on this same machine (there is no separate server just for it). That has been enough for Elasticsearch to function with the default settings up until recently.

I have noticed that with CirrusSearch disabled, our server's total memory use is very low. Just enabling CirrusSearch makes it jump to over 65%, and as time passes that number will slowly creep higher to around 80-83% before search then goes down.

I would much rather have CirrusSearch back again than use the default search, so does anyone know what I should do to solve this issue and stop search giving nothing but error messages?

I don't think there are many CirrusSearch experts here unfortunately, so you might need to ask for help on a more MediaWiki-focussed forum. We're going to need some more Elasticsearch-specific details for sure. For starters, what version of Elasticsearch are you using? Also, "An error has occurred while searching" is not particularly useful, nor is it an Elasticsearch error message; you'd need to share the exact error that Elasticsearch emits before we can offer much advice. The error may well not be in the Elasticsearch logs. It's more likely sent to the client (CirrusSearch) in an HTTP response, so you'd need to get the detailed error from its logs somehow.
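One way to get at both the version and the raw error without going through CirrusSearch is to talk to Elasticsearch over HTTP directly. The Python sketch below assumes the node listens on http://localhost:9200 with no authentication or TLS, which may not match your setup. The helper names are invented for this example, and the error body in the demo is an illustrative sample rather than output from your server.

```python
import json
import urllib.error
import urllib.request


def fetch_json(url, timeout=5):
    """GET a URL and decode the JSON body, including on HTTP error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # Elasticsearch puts its detailed error in the body of 4xx/5xx replies.
        return json.load(err)


def version_of(root_info):
    """Extract the version string from the response to GET /."""
    return root_info["version"]["number"]


def describe_error(body):
    """Summarize the 'error' field of a response, or return None if absent."""
    err = body.get("error")
    if err is None:
        return None
    if isinstance(err, str):
        return err
    return f"{err.get('type')}: {err.get('reason')}"


if __name__ == "__main__":
    # Illustrative sample only; a real reply would come from fetch_json(...).
    sample = {
        "error": {"type": "circuit_breaking_exception",
                  "reason": "[parent] Data too large"},
        "status": 429,
    }
    print(describe_error(sample))
    # Against a live node (assumed address):
    # print(version_of(fetch_json("http://localhost:9200/")))
```

If the connection is refused outright, that already tells you the Elasticsearch process is not running, which is a different problem from Elasticsearch returning an error.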

The message "Native controller process has stopped - no new native processes can be started" is not an unusual message when Elasticsearch is shutting down gracefully.

I was on the mediawiki extension page for CirrusSearch asking about this issue first, and it was suggested I try here.

After completely uninstalling and reinstalling elasticsearch I was able to get it working again, but again this morning I ran into the same search error.

I again checked elasticsearch's error logs and found only "Native controller process has stopped - no new native processes can be started." I think elasticsearch is turning itself off on purpose when memory use gets too high. It always follows the same pattern of the server's total memory use slowly climbing higher and higher before suddenly search stops working.

What settings in elasticsearch can be changed to make sure it uses less memory and doesn't hit whatever limits it seems to be reaching?

Unclear, it depends what limit it's reaching. Until we know that, it's hard to offer any concrete advice. Can you work with the CirrusSearch folks to help you get the exact error response that Elasticsearch is sending back? It's not the "Native controller process has stopped - no new native processes can be started" one; at least, that doesn't in itself explain searches failing.

Increasing memory usage isn't in itself a useful symptom either. That's normal on Linux at least (see e.g. https://www.linuxatemyram.com/ for more details).
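To make the point about Linux memory accounting concrete, here is a rough Python sketch that compares "used" memory counted two ways from /proc/meminfo: total minus MemFree (which counts page cache as used) versus total minus MemAvailable (which does not). It assumes a kernel that reports MemAvailable, and that /proc/meminfo inside your container reflects the container rather than the host, which is not guaranteed on every VPS platform. The function names and the sample numbers are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_summary(meminfo):
    """Return used-memory percentages computed two different ways."""
    total = meminfo["MemTotal"]
    return {
        # Counts page cache as "used"; this figure creeps up over time.
        "used_pct_excluding_free": round(100 * (total - meminfo["MemFree"]) / total, 1),
        # Treats reclaimable cache as available; closer to real pressure.
        "used_pct_excluding_available": round(100 * (total - meminfo["MemAvailable"]) / total, 1),
    }


if __name__ == "__main__":
    # Made-up numbers for an 8 GB-class machine, not measurements.
    sample = (
        "MemTotal:        8000000 kB\n"
        "MemFree:         1400000 kB\n"
        "MemAvailable:    5000000 kB\n"
    )
    print(memory_summary(parse_meminfo(sample)))
    # On the server itself:
    # with open("/proc/meminfo") as fh:
    #     print(memory_summary(parse_meminfo(fh.read())))
```

If a control panel reports something like the first number, a slow climb toward 80% can simply be the page cache filling up, which is harmless.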

Sorry, meant to add about this bit: "I think elasticsearch is turning itself off on purpose when memory use gets too high."

That's not something that Elasticsearch does. It'll return errors indicating high memory usage, and typically you'd also see log messages about it too, but it will keep on running until something external kills it. Is something external killing it? Maybe the OOM killer? Check your kernel logs (e.g. dmesg). If it is that, you'll either need to remove any other memory consumers on the same machine, or else reduce the ES heap size with -Xmx and -Xms.
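For the kernel log check, a small Python sketch like the one below can pull out OOM-killer victims. It assumes the usual "Killed process <pid> (<name>)" wording that Linux kernels print, and that you are allowed to read the kernel log at all, which may not be the case inside a container on a shared host. The function name and the sample lines are illustrative, not taken from your system.

```python
import re

# Matches the victim line the kernel logs when the OOM killer fires.
_KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log_text):
    """Return a list of (pid, process_name) for each OOM kill in the log."""
    kills = []
    for line in kernel_log_text.splitlines():
        match = _KILLED.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


if __name__ == "__main__":
    # Illustrative sample lines only.
    sample = (
        "[12345.678901] Out of memory: Killed process 2345 (java) "
        "total-vm:6291456kB, anon-rss:3145728kB, file-rss:0kB, shmem-rss:0kB\n"
        "[12350.000000] eth0: link up\n"
    )
    print(find_oom_kills(sample))
    # On the server itself (may need root):
    # import subprocess
    # log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    # print(find_oom_kills(log))
```

Elasticsearch runs as a Java process, so a victim named "java" around the time search went down would point strongly at the OOM killer.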

Well no matter where I look, I can't find any helpful error logs for either elasticsearch or cirrussearch. But I did adjust the -Xmx and -Xms levels, as that was what I was trying to do to begin with. This time I figured out how to check the starting numbers, and after a few days of dropping them lower and lower, eventually the search function no longer stopped working overnight.

The only problem is that the lowering of heap size seems to have only bought more time, as now over a few days the memory use keeps slowly increasing while never going back down again. I've found that stopping elasticsearch and then starting it again "resets" those memory levels, but I wish I knew what was going on to make the memory use just keep on rising like it does.

Even in the kernel logs? Strange if so.

What exactly do you mean by "memory use" here? Some measures of memory use are supposed to behave like this.
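One measure that is specific to Elasticsearch is the JVM heap usage reported by the node stats API (GET _nodes/stats/jvm). Here is a hedged Python sketch of reading it. It assumes the node is reachable at http://localhost:9200 without authentication, the function name is made up, and the sample response is trimmed and illustrative.

```python
import json
import urllib.request


def heap_usage(nodes_stats):
    """Map each node name to its heap_used_percent from a node stats reply."""
    return {
        node["name"]: node["jvm"]["mem"]["heap_used_percent"]
        for node in nodes_stats["nodes"].values()
    }


if __name__ == "__main__":
    # Trimmed, illustrative response; real replies contain many more fields.
    sample = {
        "nodes": {
            "abc123": {
                "name": "node-1",
                "jvm": {"mem": {"heap_used_percent": 61}},
            }
        }
    }
    print(heap_usage(sample))
    # Against a live node (assumed address):
    # url = "http://localhost:9200/_nodes/stats/jvm"
    # with urllib.request.urlopen(url, timeout=5) as resp:
    #     print(heap_usage(json.load(resp)))
```

Heap usage normally rises and falls as the garbage collector runs. That is a different thing from the server-wide figure, which also includes page cache and every other process on the machine.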