I was indexing a few hundred product builds, each with about a gigabyte of data, on a single machine. I don't mind searches being a bit slow (e.g. up to 30 seconds), and I don't mind indexing taking about 10 minutes per build. But near the end of the indexing process, a few builds took around 3 hours. So I googled and found this GitHub issue.
It states that indexing does in fact become slower the more indices you have, which I found peculiar. Do you guys have more information on why this is?
As was mentioned in the GH issue, can you provide more information about your cluster infrastructure?
How many indices have you created, and how large are they?
Most of the documents are very similar. If I had known that this number of indices would be a challenge, I could have made one index with all the documents. I'm tracking a sliding window of builds, so I wanted to make it easy to garbage-collect a build by just removing its index.
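For readers following along, here is a minimal sketch of the one-index-per-build cleanup described above. It assumes index names sort chronologically (the `build-001` style names are made up for illustration) and only works out which indices have fallen out of the window; it does not talk to the cluster.

```python
def indices_to_delete(build_indices, window_size):
    """Return the index names that fall outside the sliding window, oldest first.

    Assumption: index names sort chronologically, e.g. "build-001", "build-002".
    """
    if window_size < 0:
        raise ValueError("window_size must be non-negative")
    ordered = sorted(build_indices)
    excess = len(ordered) - window_size
    # Keep the newest `window_size` indices; everything before them is stale.
    return ordered[:excess] if excess > 0 else []


# Hypothetical index names, with a window of the two most recent builds.
stale = indices_to_delete(["build-003", "build-001", "build-002"], 2)
```

Each name returned would then be dropped with a `DELETE /<index name>` request, which is what makes the per-build layout convenient for garbage collection.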
I'm also worried about how slow pagination is. Sometimes I don't want to "scan" or "scroll" but I just want result numbers 15-30. A search that takes 150ms for results 0-15 can take me 30 seconds for results 15-30 (out of 46 total results). But perhaps that's a separate topic.
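To make the pagination case concrete, this is roughly what asking for results 15-30 looks like with the `from` and `size` search parameters. It is only a sketch of the request body: the `match_all` query is a placeholder assumption, not the query from the slow search, and the helper name is made up.

```python
import json


def page_body(page, page_size, query):
    """Build a search request body for one page of results using from/size.

    `page` is zero-based, so page 1 with page_size 15 covers results 15-30.
    """
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {"from": page * page_size, "size": page_size, "query": query}


# Placeholder query; substitute the real one.
body = page_body(1, 15, {"match_all": {}})
print(json.dumps(body))
```

The resulting JSON is what would be sent as the body of a `_search` request.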
Let me know if more information is needed. Thank you for your help!
It would also be great if you could get the output of GET /_nodes/hot_threads?threads=1000 on the master while you see slow index creation. It will help us see what it is waiting on. It would also be good if you could set your logging to DEBUG level and share the logs.
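If it is easier to script than to run by hand, a minimal sketch for capturing that hot threads output could look like this. The base URL `http://localhost:9200` and the output file name are assumptions; adjust them for your node. Only the endpoint and the `threads=1000` parameter come from the request above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, threads=1000):
    """Build the URL for GET /_nodes/hot_threads with the given thread count."""
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{urlencode({'threads': threads})}"


def fetch_hot_threads(base_url, threads=1000, timeout=30):
    """Fetch the plain-text hot threads report from the node at base_url."""
    with urlopen(hot_threads_url(base_url, threads), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def save_hot_threads(base_url, path):
    """Write one hot threads snapshot to a file so it can be shared."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(fetch_hot_threads(base_url))
```

Calling `save_hot_threads("http://localhost:9200", "hot_threads.txt")` while an index creation is hanging would give a snapshot to attach to the thread.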
Also, can you describe in more detail what exactly is slow? Is it the time for an index creation API call to come back? Is it the time for the index to become yellow? Is it indexing slowness?