Searching 1M documents from 10M documents

no_jihun · December 8, 2015, 5:59am

Hello.

I would appreciate your advice.

I want to query an index that holds 10 million documents. Each document is 1 KB to 5 KB in size.
I expect the number of matching documents to range from 10k at minimum to 1 million at maximum.

The query is quite simple, something like:

/myindex/mytype/_search
{
  "query": {
    "bool": {
      "should": {
        "query_string": {
          "query": "baby car house star giant computer"
        }
      }
    }
  }
}

If common keywords are used, the hit count increases (~1M).
(e.g. query: "man morning car house go result")

If specific keywords are used, the hit count is small (1k~).
(e.g. "offensive knife terror")

My final goal is to get the _id of every document that matches.
(I am not sure yet whether I will use min_score.)

If I set a large fetch size (about 1M) to get the whole result at once, ES will run into OOM trouble.
On the other hand, if I set a small fetch size and paginate through the result, ES will hit the deep pagination problem: https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html
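
To put a number on the deep pagination cost: per the pagination page linked above, every shard builds its own sorted list of from + size entries, and the coordinating node then sorts all of them. A minimal Java sketch of that arithmetic, assuming 5 primary shards (an assumption, since the shard count of this index is not stated) and a hypothetical helper name:

```java
public class DeepPagingCost {
    // Entries the coordinating node must sort for one from/size page:
    // each shard returns its own top (from + size) entries.
    static long entriesSorted(int shards, long from, long size) {
        return shards * (from + size);
    }

    public static void main(String[] args) {
        int shards = 5; // assumption: the shard count is not given in the post
        // Last page of a 1M-hit result set fetched 3000 hits at a time.
        System.out.println(entriesSorted(shards, 997_000, 3_000));
    }
}
```

With these assumed numbers, the last page alone makes the coordinating node sort 5,000,000 entries to return 3000 of them.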

What would be a good way to retrieve such a large result set for a query?

When I use SCAN, it doesn't provide score values.

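import java.util.HashSet;
import java.util.Set;
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

// client (an Elasticsearch Client) and queryString are defined elsewhere.
Set<String> uuidSet = new HashSet<String>();
QueryBuilder queryBuilder = QueryBuilders.queryStringQuery(queryString);
SearchRequestBuilder builder = client.prepareSearch("usertext");
builder.setTypes("usertext");
builder.setQuery(queryBuilder);
// SCAN skips scoring and sorting, so hit.getScore() carries no relevance score.
builder.setSearchType(SearchType.SCAN);
builder.setSize(3000); // with SCAN, size applies per shard
builder.setScroll(new TimeValue(1000)); // scroll keep-alive in milliseconds
SearchResponse response = builder.execute().actionGet();
int addCount = 0;
while (true) {
	// Collect the _id of every hit on the current page.
	for (SearchHit hit : response.getHits()) {
		addCount++;
		uuidSet.add(hit.getId());
		System.out.println(addCount + "," + uuidSet.size() + ", " + hit.getScore());
	}
	// Fetch the next page; an empty page means the scroll is exhausted.
	response = client.prepareSearchScroll(response.getScrollId()).setScroll(new TimeValue(10000)).execute().actionGet();
	if (response.getHits().getHits().length == 0) {
		break;
	}
}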

nik9000 · December 8, 2015, 7:04pm

Scroll doesn't require scan. It should work if you just remove:

builder.setSearchType(SearchType.SCAN);

It won't be as efficient, but it should work.
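
One thing to keep in mind once SCAN is removed: the first response already contains scored hits, so it has to be consumed before the next page is requested. A minimal, self-contained sketch of that loop shape follows. Hit and PageSource are hypothetical in-memory stand-ins for the search and scroll calls, not part of the Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ScrollLoopSketch {
    // Hypothetical stand-in for a search hit: only the id and score matter here.
    static final class Hit {
        final String id;
        final float score;
        Hit(String id, float score) { this.id = id; this.score = score; }
    }

    // Hypothetical stand-in for the search + scroll calls: each call to
    // nextPage() returns the next batch, and an empty batch means "done".
    static final class PageSource {
        private final Iterator<List<Hit>> pages;
        PageSource(List<List<Hit>> pages) { this.pages = pages.iterator(); }
        List<Hit> nextPage() {
            return pages.hasNext() ? pages.next() : new ArrayList<Hit>();
        }
    }

    // Consume the current page first, then ask for the next one.
    static List<String> collectIds(PageSource source) {
        List<String> ids = new ArrayList<String>();
        List<Hit> page = source.nextPage();   // plays the role of the initial search
        while (!page.isEmpty()) {
            for (Hit hit : page) {
                ids.add(hit.id);              // hit.score is available here as well
            }
            page = source.nextPage();         // plays the role of the scroll request
        }
        return ids;
    }

    public static void main(String[] args) {
        List<List<Hit>> pages = new ArrayList<List<Hit>>();
        pages.add(Arrays.asList(new Hit("a", 2.5f), new Hit("b", 1.7f)));
        pages.add(Arrays.asList(new Hit("c", 0.9f)));
        System.out.println(collectIds(new PageSource(pages)));
    }
}
```

The loop in the question already has this shape, so removing the SCAN line should be the only change needed there.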

You may also want to think about Elasticsearch's aggregations. They were created to answer questions about a large result set without having to pull back the whole thing. They get to take advantage of the columnar storage format for doc_values, making them quite efficient. They can be plugged together in lots of ways and you can write an Elasticsearch plugin to define more of them.

I believe the scroll timeout in your code is pretty small. You'll probably want to make it a bit higher. Usually I think of these timeouts in minutes. Once the scroll context has timed out it's gone and you'll have to restart the process.
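
For scale: the single-argument TimeValue constructor takes milliseconds, so the values in the posted code are 1 second and 10 seconds. A small sketch using only java.util.concurrent.TimeUnit to compute the millisecond value for a keep-alive given in minutes. The 5-minute figure and the helper name are illustrative assumptions, not a recommendation.

```java
import java.util.concurrent.TimeUnit;

public class ScrollKeepAlive {
    // Milliseconds for a keep-alive given in minutes.
    static long minutesToMillis(long minutes) {
        return TimeUnit.MINUTES.toMillis(minutes);
    }

    public static void main(String[] args) {
        // The first value used in the question, 1000 ms, expressed in seconds.
        System.out.println(TimeUnit.MILLISECONDS.toSeconds(1000));
        // A 5-minute keep-alive (arbitrary illustrative choice) in milliseconds.
        System.out.println(minutesToMillis(5));
    }
}
```

The result can be passed to new TimeValue(...); the client also offers TimeValue.timeValueMinutes(...) for the same purpose.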
