Extract _ids of matching documents


#1

Hi,

I'm trying to extract a list of matching documents for simple single-word queries, and I only need the document _ids. I'm running Elasticsearch 6.4.3 on a machine with 64 GB of RAM (I've given Elasticsearch 31 GB). My data set is currently 166 million documents, and Kibana is showing 51 GB of disk space, 17 primary shards, and 0 replica shards. I have only a single node at present, and JVM heap usage is sitting at 2.6 / 31 GB.

I have a lot of servers writing data to my Elasticsearch instance (maybe around 1,000 writes per second currently), and only one client trying to read from it.

If I run single-word queries in Kibana I usually get results back in much less than a second, although occasionally it takes 10 seconds or Kibana times out after 30 seconds. It's hard to diagnose because there seems to be caching, and repeated attempts at the same query are usually very fast.

I'm new to Elasticsearch & Lucene, but my understanding is that the inverted index is essentially a list of matching documents for each indexed keyword. If that's true, I really just need the contents of the inverted index fairly directly. Even though searches (usually) run quickly, the scroll API is unusably slow at present: for a search with 100,000 hits it takes an average of around 10 seconds per 1,000 results.

I've tried queries like this:
POST http://server/index/_search?scroll=3m
{
  "query": { "query_string": { "query": "fantastic" } },
  "size": 1000,
  "_source": false
}

and

POST http://server/index/_search?scroll=3m
{
  "query": { "bool": { "filter": { "query_string": { "query": "fantastic" } } } },
  "sort": ["_doc"],
  "size": 1000,
  "_source": false
}
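Subsequent pages then come from the scroll endpoint using the _scroll_id returned by the previous response, along these lines (the scroll_id value below is just a placeholder):

POST http://server/_search/scroll
{
  "scroll": "3m",
  "scroll_id": "<_scroll_id from the previous response>"
}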

Performance varies, but it's generally around 10 seconds for every 1,000 results. I was hoping to retrieve 100,000 document _ids in just a few seconds, and I'm wondering if that's possible with Elasticsearch. It seems like I must have a grossly misconfigured setup, or perhaps my queries are making the system do a lot more work than just pulling _ids out of the inverted index.

I know that I can scale up by adding more nodes, but that will become prohibitively expensive pretty quickly. Elasticsearch seems to be handling the write load well, so I'm hoping there's a way to get all matching document _ids without adding a large number of nodes (though I understand a single node may simply not be designed to handle concurrent read and write loads).

Thanks in advance for any advice!
Jeremy


#2

Hi Jeremy,

Poor performance usually comes down to a lot of different factors, but the first thing that doesn't look ideal here is the number of primary shards:

For 51 GB of data you will probably not need more than 2 primary shards. You have 17.
As a first step I'd recommend reducing the number of primary shards.

The easiest way would be to create a new index with 2 primary shards and use the _reindex API to move all data from your old indices to the new one.
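Something along these lines should do it (the index names here are only placeholders for your actual index names):

PUT new-index
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  }
}

POST _reindex
{
  "source": { "index": "old-index" },
  "dest": { "index": "new-index" }
}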

If you're using time-based indices, you should change the number_of_shards setting to 1 for future indices.
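For example, with an index template (the template name and index pattern are just placeholders for whatever naming scheme you use):

PUT _template/single_shard_template
{
  "index_patterns": ["myindex-*"],
  "settings": {
    "number_of_shards": 1
  }
}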

Let me know if that helps.

Kind regards,
Saskia


#3

Hi Saskia,

Thanks for the reply. I checked and realized I have 17 indices (15 from Kibana), and each of them has only a single primary shard. Sorry for the misunderstanding on my part.

I'm really just wondering what kind of general performance expectation there is for a scroll through all hits. So if I have 150 million documents and a single-word query matching 1 million of them, is the general expectation that it'll take seconds, minutes or hours to retrieve all matching document IDs?

I know performance depends on a huge number of factors; I was just hoping that retrieving matching doc IDs could be fast, since that's essentially what's in the inverted index (I think, anyway!).

Regards,
Jeremy