Searching 1M documents from 10M documents

Hello.

I would appreciate your advice.

I want to query an index that has 10 million documents. Each document's size is 1k to 5k.
I expect the result to contain at least 10k documents and at most 1 million.

The query is quite simple, something like this:

/myindex/mytype/_search
{
  "query": {
    "bool": {
      "should": {
        "query_string": {
          "query": "baby car house star giant computer"
        }
      }
    }
  }
}

If common keywords are used, the hit count will increase (~1M)
(e.g. query: "man morning car house go result").

If specific keywords are used, the hit count will be small (1k~)
(e.g. "offensive knife terror").

My final goal is to get the _id of every document that matches.
(I am not sure yet whether I will use min_score.)

In that case, if I set a large fetch size (about 1M) to get the result at once, ES will run into OOM trouble.
On the other hand, if I set a small fetch size and paginate through the result, ES will hit the deep pagination problem: https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html

What is a good way to get a large result set from a query?

When I use SCAN, it doesn't provide a score value.

QueryBuilder queryBuilder = QueryBuilders.queryStringQuery(queryString);
SearchRequestBuilder builder = client.prepareSearch("usertext");
builder.setTypes("usertext");
builder.setQuery(queryBuilder);
builder.setSearchType(SearchType.SCAN);
builder.setSize(3000);
builder.setScroll(new TimeValue(1000));
SearchResponse response = builder.execute().actionGet();
int addCount = 0;
while (true) {
	for (SearchHit hit : response.getHits()) {
		addCount++;
		uuidSet.add(hit.getId());
		System.out.println(addCount + "," + uuidSet.size() + ", " + hit.getScore());
	}
	response = client.prepareSearchScroll(response.getScrollId()).setScroll(new TimeValue(10000)).execute().actionGet();
	if (response.getHits().getHits().length == 0) {
		break;
	}
}

Scroll doesn't require scan. It should work if you just remove:

builder.setSearchType(SearchType.SCAN);

It won't be as efficient, but it should work.
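
To make the shape of that loop concrete, here is a minimal sketch of draining a scroll batch by batch while keeping only each _id and its score. It does not talk to Elasticsearch: `ScrollCursor`, `Hit`, and `inMemory` are hypothetical stand-ins I made up for illustration, and in real code the cursor would wrap the `prepareSearchScroll` calls from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScrollDrain {
    /** One hit: the document _id and its relevance score. */
    static final class Hit {
        final String id;
        final float score;
        Hit(String id, float score) { this.id = id; this.score = score; }
    }

    /** Hypothetical stand-in for a scroll: each call returns the next batch, empty when exhausted. */
    interface ScrollCursor {
        List<Hit> nextBatch();
    }

    /** In-memory cursor used only to exercise the loop (assumption: not a real ES client). */
    static ScrollCursor inMemory(final List<Hit> all, final int batchSize) {
        return new ScrollCursor() {
            private int pos = 0;
            public List<Hit> nextBatch() {
                int end = Math.min(pos + batchSize, all.size());
                List<Hit> batch = new ArrayList<>(all.subList(pos, end));
                pos = end;
                return batch;
            }
        };
    }

    /** Drain the cursor, retaining only _id -> score instead of whole documents. */
    static Map<String, Float> drain(ScrollCursor cursor) {
        Map<String, Float> scores = new LinkedHashMap<>();
        List<Hit> batch = cursor.nextBatch();
        while (!batch.isEmpty()) {          // an empty batch means the scroll is finished
            for (Hit hit : batch) {
                scores.put(hit.id, hit.score);
            }
            batch = cursor.nextBatch();
        }
        return scores;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("a", 2.5f), new Hit("b", 1.7f), new Hit("c", 0.4f));
        Map<String, Float> scores = drain(inMemory(hits, 2));
        System.out.println(scores.size() + " ids collected");
    }
}
```

Only one batch is held at a time, so the client's memory use is driven by the id set you keep rather than by the full documents.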

You may also want to think about Elasticsearch's aggregations. They were created to answer questions about a large result set without having to pull back the whole thing. They get to take advantage of the columnar storage format for doc_values, making them quite efficient. They can be plugged together in lots of ways and you can write an Elasticsearch plugin to define more of them.

I believe the scroll time value (new TimeValue(1000)) is pretty small. You'll probably want to make it a bit higher. Usually I think of these timeouts in minutes. Once the scroll context has timed out, it's gone and you have to restart the process.
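
For a sense of scale: the single-argument `TimeValue` constructor takes milliseconds, so the 1000 in the question is one second. The sketch below uses only the JDK's `TimeUnit` to compare that with a keep-alive expressed in minutes. The two-minute figure and the `keepAliveMillis` helper are just assumptions for illustration, not a recommendation from the Elasticsearch docs.

```java
import java.util.concurrent.TimeUnit;

public class ScrollTimeout {
    /** Convert a keep-alive given in minutes to milliseconds. */
    static long keepAliveMillis(long minutes) {
        return TimeUnit.MINUTES.toMillis(minutes);
    }

    public static void main(String[] args) {
        long original = 1000L;               // the value passed to new TimeValue(1000)
        long suggested = keepAliveMillis(2); // hypothetical choice: two minutes
        System.out.println(TimeUnit.MILLISECONDS.toSeconds(original) + " s vs "
                + TimeUnit.MILLISECONDS.toSeconds(suggested) + " s");
    }
}
```

The keep-alive only has to cover the time between two consecutive scroll requests, since each request renews it.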
