Blocked threads on read index calls

Hi,
A portal built on Liferay uses Elasticsearch as its search engine, with the following configuration:
Elasticsearch 7.16.2, single node, indices of around 250MB
Windows Server 2019 64-bit, 8 vCPUs, 32GB RAM, 100GB SSD
OpenJDK 8, 16GB min and max heap

We need input on the following:

Search requests against the index are taking around 120 seconds, with threads blocked in the code below for more than 60 seconds. Thread details are attached, and the logs are at gist:0e134a7cc9a38dd4a21b7ee4ea1e8fd2 · GitHub



org.apache.lucene.util.compress.LZ4.decompress(LZ4.java:112)
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:56)
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:315)
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:159)
sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:717)

POST /liferay-20097/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true

We tried disabling the antivirus on the data folder, but it didn't help.

Regards,
Madhu

What JDK exactly are you using? None of the stock ones seem to have any way to be BLOCKED at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:717):

$ for tag in $(git tag | grep jdk8 | sort); do git show $tag:./jdk/src/share/classes/sun/nio/ch/FileChannelImpl.java | sed -ne '717p'; done | sort | uniq

        private static final NativeDispatcher nd = new FileDispatcherImpl();
        }

Could you share a complete thread dump from the time of the blockage, captured using jstack? The screenshots and summary you've shared don't have enough detail to be useful.

It's Azul Zulu OpenJDK 8, will take thread dumps and share. Thanks for your response

That doesn't narrow it down much - what version exactly is it?

Hi David,
It's JDK build 1.8.0_302-b08.
The corresponding Azul version is 8.56.0.21-CA-win64.

Also, these are the thread dumps from when the issue occurred today. The ones under /03012022/Before Restart/ were taken while the issue was occurring; we then restarted ES and took thread dumps again for reference, under /03012022/After Restart/.

Regards,
Madhu

Your threads are all busy in the fetch phase, and the JDK apparently serialises reads to the same file on Windows:

This means if one thread is slow reading then all the other threads will have to wait, and indeed your dumps all capture a thread at sun.nio.ch.FileDispatcherImpl.pread0(Native Method) which means it's waiting for the OS to respond to a read request. You will need to investigate the behaviour of that thread further to determine if it's completely stuck on a single request or if it's actually making progress just very slowly. In any case if the OS responds slowly to reads then Elasticsearch won't perform very well.
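One way to approach the "stuck or just slow" question is to take two jstack dumps a few seconds apart and see which threads sit at the native read with an identical stack in both. The following is a minimal sketch of that comparison, not an established tool: the class name `DumpCompare` and its methods are made up for illustration, and it assumes the usual HotSpot jstack text layout (a quoted thread name on the header line, followed by `at ...` frames). An identical stack in two dumps is only a hint, since a busy thread can legitimately be caught in the same frames twice.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: compares two jstack dumps passed as file paths.
final class DumpCompare {

    // Native read frame mentioned in the thread; a thread here is waiting on the OS.
    static final String READ_FRAME = "sun.nio.ch.FileDispatcherImpl.pread0";

    /** Maps thread name to its stack text for threads whose top frame contains `frame`. */
    static Map<String, String> threadsAt(String dump, String frame) {
        Map<String, String> found = new LinkedHashMap<>();
        String name = null;
        List<String> frames = new ArrayList<>();
        for (String raw : dump.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // A quoted name starts a new thread entry; store the previous one first.
                record(found, name, frames, frame);
                int end = line.indexOf('"', 1);
                name = end > 0 ? line.substring(1, end) : line;
                frames = new ArrayList<>();
            } else if (name != null && line.startsWith("at ")) {
                frames.add(line);
            }
        }
        record(found, name, frames, frame);
        return found;
    }

    private static void record(Map<String, String> found, String name,
                               List<String> frames, String frame) {
        if (name != null && !frames.isEmpty() && frames.get(0).contains(frame)) {
            found.put(name, String.join("\n", frames));
        }
    }

    /** Names of threads that are at `frame` with the same stack in both dumps. */
    static List<String> unchanged(String before, String after, String frame) {
        Map<String, String> first = threadsAt(before, frame);
        Map<String, String> second = threadsAt(after, frame);
        List<String> same = new ArrayList<>();
        for (Map.Entry<String, String> e : first.entrySet()) {
            if (e.getValue().equals(second.get(e.getKey()))) {
                same.add(e.getKey());
            }
        }
        return same;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java DumpCompare <dump1.txt> <dump2.txt>");
            return;
        }
        String before = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String after = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        for (String thread : unchanged(before, after, READ_FRAME)) {
            System.out.println("same stack in both dumps: " + thread);
        }
    }
}
```

If the same thread shows up across many dumps taken over a minute or more, that points towards a single very slow OS read; if the thread at `pread0` keeps changing, the reads are completing, only slowly.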

I haven't found the source that corresponds with your chosen JDK but it's worth noting that the recommended JDK is the bundled one since that's the one that gets all the testing and it's much easier to answer this kind of question when using the recommended setup. I don't think it will make much difference in this case but it would still be worth ruling out that the problem is specific to your unusual JDK choice.

Hi @DavidTurner

Thanks for your response. This is Anurag (Madhu's colleague)

I just had a quick question. Do you think changing the OS from Windows to Linux can help? If so, which version of Linux would you recommend?

Thanks again for your quick responses and sharing your input.

Regards,
Anurag

Mmmaybe. I mean it shouldn't be necessary, Windows is fully supported, but Linux is definitely much more common and I personally know a lot more about debugging performance issues there than on Windows. The support matrix shows all the different supported flavours, it's up to you to choose one that you're comfortable administering.

I have opened Improve concurrency of reads on Windows · Issue #82184 · elastic/elasticsearch · GitHub to get thoughts from the rest of the team on this Windows-specific behaviour.
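For anyone who wants to observe the behaviour on their own JDK and OS, here is a rough sketch of a probe that has several threads issue positional reads against one shared `FileChannel` and reports the wall-clock time. It is an illustration under assumptions rather than a benchmark: `SharedChannelReadProbe`, `timeReads` and the parameter values are invented for this example, and the result depends heavily on whether the file is already in the OS page cache.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical probe: many threads, one FileChannel, positional reads only.
final class SharedChannelReadProbe {

    /** Returns {elapsedNanos, totalBytesRead} for the whole run. */
    static long[] timeReads(Path file, int threads, int readsPerThread, int blockSize)
            throws Exception {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            final long size = Math.max(1, channel.size());
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                List<Callable<Long>> tasks = new ArrayList<>();
                for (int t = 0; t < threads; t++) {
                    final int seed = t;
                    tasks.add(() -> {
                        ByteBuffer buf = ByteBuffer.allocate(blockSize);
                        long bytes = 0;
                        for (int i = 0; i < readsPerThread; i++) {
                            // Spread the offsets so threads touch different parts of the file.
                            long pos = Math.floorMod((long) (seed * 31 + i) * blockSize, size);
                            buf.clear();
                            int n = channel.read(buf, pos); // positional read on the shared channel
                            if (n > 0) {
                                bytes += n;
                            }
                        }
                        return bytes;
                    });
                }
                long start = System.nanoTime();
                long total = 0;
                for (Future<Long> f : pool.invokeAll(tasks)) {
                    total += f.get();
                }
                return new long[] {System.nanoTime() - start, total};
            } finally {
                pool.shutdownNow();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]);
        for (int threads : new int[] {1, 8}) {
            long[] r = timeReads(file, threads, 10_000, 4096);
            System.out.printf("%d thread(s): %d ms, %d bytes%n",
                    threads, r[0] / 1_000_000, r[1]);
        }
    }
}
```

Each thread does the same number of reads, so with eight threads the total work is eight times larger. If the elapsed time also grows roughly eightfold, the reads are probably not overlapping; if it grows much less, they are being served concurrently.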

Changing from the external JDK to the bundled one didn't help; we are getting the same blocked threads again.

Repeating my earlier message with my suggestion for your next steps:
