Hi,
I'm trying to extract a list of matching documents for simple single-word queries, and I only need the document _id's. I'm running Elasticsearch 6.4.3 on a machine with 64GB RAM (I've given Elasticsearch 31GB). My data set is currently 166 million documents, and Kibana shows it using 51GB of disk space across 17 primary shards with 0 replica shards. I have only a single node at present, and JVM heap usage is showing as 2.6GB of 31GB.
I have a lot of servers writing data to my Elasticsearch instance (maybe around 1,000 writes per second currently), and only one trying to read from it.
If I run single-word queries in Kibana I usually get results back in much less than a second, although occasionally it takes 10 seconds or Kibana times out after 30 seconds. It's hard to diagnose because there seems to be caching involved: repeating the same query is usually very fast.
I'm new to Elasticsearch and Lucene, but my understanding is that the inverted index is essentially a list of matching documents for each indexed term. If that's true, I really just need the contents of the inverted index fairly directly. Even though searches (usually) run quickly, the scroll API is unusably slow at present: for a search with 100,000 hits it averages around 10 seconds per 1,000 results.
I've tried queries like this:
POST http://server/index/_search?scroll=3m
{
  "query": { "query_string": { "query": "fantastic" } },
  "size": 1000,
  "_source": false
}
and
POST http://server/index/_search?scroll=3m
{
  "query": { "bool": { "filter": { "query_string": { "query": "fantastic" } } } },
  "sort": ["_doc"],
  "size": 1000,
  "_source": false
}
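In both cases I page through the results with the usual follow-up scroll requests, roughly like this (the placeholder stands in for the scroll_id returned by the previous response):

POST http://server/_search/scroll
{
  "scroll": "3m",
  "scroll_id": "<scroll_id from previous response>"
}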
Performance varies, but it's generally around 10 seconds per 1,000 results either way. I was hoping to retrieve 100,000 document _id's in just a few seconds, and I'm wondering whether that's possible with Elasticsearch. It seems like I must have a grossly misconfigured setup, or perhaps my queries are making the system do a lot more work than just pulling _id's out of the inverted index.
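One thing I was planning to try, though I haven't tested it yet, is splitting the scroll into slices so that two (or more) clients can pull pages in parallel. A rough sketch, reusing the filtered query from above:

POST http://server/index/_search?scroll=3m
{
  "slice": { "id": 0, "max": 2 },
  "query": { "bool": { "filter": { "query_string": { "query": "fantastic" } } } },
  "sort": ["_doc"],
  "size": 1000,
  "_source": false
}

with a second client running the same request with "id": 1, each slice keeping its own scroll. I don't know whether that would address the underlying slowness or just spread it across more requests, though.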
I know that I can scale up by adding more nodes, but that will become prohibitively expensive pretty quickly. Elasticsearch seems to be handling the write load well, so I'm hoping there's a way to get all matching document _id's without adding a large number of nodes (although I understand that a single node perhaps isn't designed to handle heavy concurrent read and write loads).
Thanks in advance for any advice!
Jeremy