Extract _id's of matching documents


I'm trying to extract a list of matching documents for simple single-word queries, and I only need the document _id's. I'm running Elasticsearch 6.4.3 on a machine with 64GB RAM (I've given Elasticsearch 31GB). My data set is currently 166 million documents and Kibana is showing 51GB of disk space and 17 primary shards, 0 replica shards. I have only a single node at present. It's showing the JVM heap using 2.6 / 31GB.

I have a lot of servers writing data to my Elasticsearch instance (maybe around 1,000 writes per second curently), and only one trying to read from it.

If I run single-word queries in Kibana I usually get results back in much less than a second, although occasionally it takes 10 seconds or Kibana times out after 30 seconds. It's hard to diagnose because there seems to be caching and repeated attempts at a query are usually very fast.

I'm new to Elasticsearch & Lucene but my understanding is that the inverted index is essentially a list of matching documents for each indexed keyword. If that's true, I really just need the contents of the inverted index fairly directly. Even though searches (usually) run quickly, the scroll API is unusably slow at present. For a search with 100,000 hits it might take an average of 10 seconds to produce every 1,000 results.

I've tried queries like this:
http://server/index/_search?scroll=3m, content {query: {query_string: {query: "fantastic"}}, size: 1000, _source: false}
http://server/index/_search?scroll=3m, content {query: {bool: {filter: {query_string: {query: "fantastic"}}}}, sort: ['_doc'], size: 1000, _source: false}

and performance varies but it's generally around 10 seconds for every 1,000 results. I was hoping to retrieve 100,000 document _id's in just a few seconds. I'm wondering if this is possible with Elasticsearch. It seems like i must have a grossly misconfigured setup or perhaps my queries are making the system do a lot more work than just pull _id's out of the inverted index.

I know that I can scale up by adding more nodes, but that will become prohibitively expensive pretty quickly. Elasticsearch seems to be handling the write load well, so I'm hoping there's a way to get all matching document _id's without creating a large number of nodes (although I understand if a single node isn't designed to handle concurrent read and write loads perhaps).

Thanks in advance for any advice!

Hi Jeremy,

poor performance usually has a lot of different factors. But the first thing that looks not ideal here is the number of primary shards:

For 51 GB of data you will probably not need more than 2 primary shards. You have 17.
As a first step I'd recommend to reduce the number of primary shards.

The easiest way would be to create a new index with 2 primary shards and use the _reindex API to move all data from your old indices to the new one.

If you're using time-based indices you should change the setting of number of shards to 1 for the future.

Let me know if that helps.

Kind regards,

Hi Saskia,

Thanks for the reply. I checked and realized I have 17 indices (15 from Kibana), and each of them has only a single primary shard. Sorry for the misunderstanding on my part.

I'm really just wondering what kind of general performance expectation there is for a scroll through all hits. So if I have 150 million documents and a single-word query matching 1 million of them, is the general expectation that it'll take seconds, minutes or hours to retrieve all matching document IDs?

I know performance depends on a huge number of factors, I was just hoping that retrieving matching doc IDs could be fast since that's essentially what's in the inverted index (I think anyway!).


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.