Blocked threads on read index calls

Madhu_Yadav · December 31, 2021, 4:58am

Hi,
A portal built on Liferay uses Elasticsearch as its search engine, with the following configuration:
Elasticsearch 7.16.2, single node, indices of around 250 MB
Windows Server 2019 64-bit, 8 vCPU, 32 GB RAM, 100 GB SSD
OpenJDK 8, 16 GB min and max heap

We need input on the following:

Index search requests are taking around 120 seconds, with threads blocked at the stack below for more than 60 seconds. Thread details are attached, and the logs are at gist:0e134a7cc9a38dd4a21b7ee4ea1e8fd2 (GitHub).

org.apache.lucene.util.compress.LZ4.decompress(LZ4.java:112)
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:56)
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:315)
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:159)
sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:717)

POST /liferay-20097/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true

We tried disabling the antivirus on the data folder, but it didn't help.

Regards,
Madhu

DavidTurner · December 31, 2021, 10:58am

What JDK exactly are you using? None of the stock ones seem to have any way to be BLOCKED at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:717):

$ for tag in $(git tag | grep jdk8 | sort); do git show $tag:./jdk/src/share/classes/sun/nio/ch/FileChannelImpl.java | sed -ne '717p'; done | sort | uniq

        private static final NativeDispatcher nd = new FileDispatcherImpl();
        }

Could you share a complete thread dump from the time of the blockage, captured using jstack? The screenshots and summary you've shared don't have enough detail to be useful.

Madhu_Yadav · December 31, 2021, 12:40pm

It's Azul Zulu OpenJDK 8. We will take thread dumps and share them. Thanks for your response.

DavidTurner · December 31, 2021, 1:19pm

That doesn't narrow it down much - what version exactly is it?

Madhu_Yadav · January 3, 2022, 10:47am

Hi David,
It's JDK build 1.8.0_302-b08.
The corresponding Azul version is 8.56.0.21-CA-win64.

Also, these are the thread dumps from when the issue occurred today. The ones in /03012022/Before Restart/ were taken while the issue was occurring; after that we restarted ES and took thread dumps again for reference, in /03012022/After Restart/.

Regards,
Madhu

DavidTurner · January 3, 2022, 3:03pm

Your threads are all busy in the fetch phase, and the JDK apparently serialises reads to the same file on Windows:

github.com/openjdk/jdk/blob/3a1fca3adf3111a966cb62d926b95acc89b7fe97/src/java.base/share/classes/sun/nio/ch/FileChannelImpl.java#L819-L825

          if (nd.needsPositionLock()) {
              synchronized (positionLock) {
                  return readInternal(dst, position);
              }
          } else {
              return readInternal(dst, position);
          }

github.com/openjdk/jdk/blob/3a1fca3adf3111a966cb62d926b95acc89b7fe97/src/java.base/windows/classes/sun/nio/ch/FileDispatcherImpl.java#L47-L49

          boolean needsPositionLock() {
              return true;
          }
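
Read together, the two excerpts suggest that positional reads on a shared channel take turns on Windows. The following is a toy model of that branching, not JDK code: `PositionLockModel`, `peakConcurrentReads` and the sleep standing in for a slow OS read are all made up for illustration. It only shows how many "reads" can overlap with and without the lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of the quoted read path. Hypothetical class, not part of the JDK. */
public class PositionLockModel {
    private final Object positionLock = new Object();
    private final boolean needsPositionLock; // true mirrors the Windows dispatcher excerpt
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger maxInFlight = new AtomicInteger();

    PositionLockModel(boolean needsPositionLock) {
        this.needsPositionLock = needsPositionLock;
    }

    /** Same branching shape as the FileChannelImpl excerpt. */
    void read(long slowReadMillis) {
        if (needsPositionLock) {
            synchronized (positionLock) {
                readInternal(slowReadMillis);
            }
        } else {
            readInternal(slowReadMillis);
        }
    }

    /** Stand-in for a slow OS read: sleeps and records how many reads overlap. */
    private void readInternal(long slowReadMillis) {
        int now = inFlight.incrementAndGet();
        maxInFlight.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(slowReadMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            inFlight.decrementAndGet();
        }
    }

    /** Starts one read per thread and returns the peak number of overlapping reads. */
    public static int peakConcurrentReads(boolean needsPositionLock, int threads, long slowReadMillis) {
        PositionLockModel channel = new PositionLockModel(needsPositionLock);
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> channel.read(slowReadMillis));
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return channel.maxInFlight.get();
    }

    public static void main(String[] args) {
        System.out.println("with position lock:    " + peakConcurrentReads(true, 4, 200));
        System.out.println("without position lock: " + peakConcurrentReads(false, 4, 200));
    }
}
```

With the lock the peak is always 1, so a single slow read holds up every other reader of the same file. Without it the peak is usually the thread count, though that depends on how the threads happen to be scheduled.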

This means if one thread is slow reading then all the other threads will have to wait, and indeed your dumps all capture a thread at sun.nio.ch.FileDispatcherImpl.pread0(Native Method) which means it's waiting for the OS to respond to a read request. You will need to investigate the behaviour of that thread further to determine if it's completely stuck on a single request or if it's actually making progress just very slowly. In any case if the OS responds slowly to reads then Elasticsearch won't perform very well.
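
To follow up on that thread, a small helper can list which thread is inside the native read and which threads are queued behind it in each jstack dump. This is a rough sketch: `ReadStallSummary` is a hypothetical name, the frame strings are the ones quoted in this topic, and the parsing assumes the usual HotSpot layout (one block per thread, starting with the quoted thread name, blocks separated by blank lines).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: summarises one jstack dump around the file-read stall. */
public class ReadStallSummary {
    /** Threads with a pread0 frame, i.e. waiting for the OS to answer a read. */
    public final List<String> inNativeRead = new ArrayList<>();
    /** BLOCKED threads with a FileChannelImpl.read frame, i.e. queued behind the lock. */
    public final List<String> blockedOnRead = new ArrayList<>();

    public static ReadStallSummary parse(String dump) {
        ReadStallSummary summary = new ReadStallSummary();
        // Assumed layout: thread blocks are separated by blank lines.
        for (String block : dump.split("\\R\\s*\\R")) {
            String trimmed = block.trim();
            if (!trimmed.startsWith("\"")) {
                continue; // dump header, lock listings, JNI reference counts, ...
            }
            int endQuote = trimmed.indexOf('"', 1);
            if (endQuote < 0) {
                continue;
            }
            String name = trimmed.substring(1, endQuote);
            if (trimmed.contains("sun.nio.ch.FileDispatcherImpl.pread0(")) {
                summary.inNativeRead.add(name);
            } else if (trimmed.contains("java.lang.Thread.State: BLOCKED")
                    && trimmed.contains("sun.nio.ch.FileChannelImpl.read(")) {
                summary.blockedOnRead.add(name);
            }
        }
        return summary;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ReadStallSummary <jstack-dump-file>...");
            return;
        }
        for (String path : args) {
            String dump = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
            ReadStallSummary s = parse(dump);
            System.out.println(path + ": in pread0 = " + s.inNativeRead
                    + ", blocked on read = " + s.blockedOnRead.size());
        }
    }
}
```

Running it over several dumps taken a few seconds apart shows whether the same thread stays in `pread0`. That alone does not prove it is stuck on one request, since a slow thread can re-enter `pread0` between dumps, so OS-level disk metrics are still needed to tell the two cases apart.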

I haven't found the source that corresponds with your chosen JDK but it's worth noting that the recommended JDK is the bundled one since that's the one that gets all the testing and it's much easier to answer this kind of question when using the recommended setup. I don't think it will make much difference in this case but it would still be worth ruling out that the problem is specific to your unusual JDK choice.

Anurag_Mittal · January 3, 2022, 4:00pm

Hi @DavidTurner

Thanks for your response. This is Anurag (Madhu's colleague)

I just had a quick question. Do you think changing the OS from Windows to Linux can help? If so, which version of Linux would you recommend?

Thanks again for your quick responses and sharing your input.

Regards,
Anurag

DavidTurner · January 3, 2022, 4:08pm

Mmmaybe. I mean it shouldn't be necessary, Windows is fully supported, but Linux is definitely much more common and I personally know a lot more about debugging performance issues there than on Windows. The support matrix shows all the different supported flavours, it's up to you to choose one that you're comfortable administering.

DavidTurner · January 4, 2022, 9:22am

I have opened "Improve concurrency of reads on Windows" (elastic/elasticsearch issue #82184) to get thoughts from the rest of the team on this Windows-specific behaviour.

Madhu_Yadav · January 6, 2022, 7:33am

Changing from the external JDK to the bundled one didn't help; we are getting the same blocked threads again.

DavidTurner · January 6, 2022, 8:56am

Repeating my earlier message with my suggestion for your next steps:

system · February 3, 2022, 8:57am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
