Searching for 1M documents out of 10M documents


I should like to have the benefit of your advice.

I want to run a query against an index that has 10 million documents. Each document is 1k~5k in size.
I expect the result count to be somewhere between 10k (minimum) and 1 million (maximum).

The query is quite simple, something like:

  "query": {
    "bool": {
      "should": {
        "query_string": {
          "query": "baby car house star giant computer"
        }
      }
    }
  }

If common keywords are used, the hit count will be large (~1M)
(e.g. query: "man morning car house go result").

If specific keywords are used, the hit count will be small (~1k)
(e.g. "offensive knife terror").

My final goal is to get the _id of every document that matches the query.
(I'm not sure whether I will use min_score.)

In that case, if I set a large fetch size (about 1M) to get the whole result at once, ES will run into OOM trouble.
On the other hand, if I set a small fetch size and paginate through the results, ES will hit the deep-pagination problem.

What would be a good way to retrieve a large result set from a query?

When I use SCAN, it doesn't provide score values.

QueryBuilder queryBuilder = QueryBuilders.queryStringQuery(queryString);
SearchRequestBuilder builder = client.prepareSearch("usertext")
        .setQuery(queryBuilder)
        .setScroll(new TimeValue(1000));
SearchResponse response = builder.execute().actionGet();
Set<String> uuidSet = new HashSet<>();
int addCount = 0;
while (true) {
	for (SearchHit hit : response.getHits()) {
		uuidSet.add(hit.getId());
		addCount++;
		System.out.println(addCount + ", " + uuidSet.size() + ", " + hit.getScore());
	}
	response = client.prepareSearchScroll(response.getScrollId())
	        .setScroll(new TimeValue(10000))
	        .execute().actionGet();
	if (response.getHits().getHits().length == 0) {
		break;
	}
}

Scroll doesn't require scan. It should work if you just remove it.


It won't be as efficient, but it should work.

You may also want to think about Elasticsearch's aggregations. They were created to answer questions about a large result set without having to pull back the whole thing. They take advantage of the columnar storage format of doc_values, which makes them quite efficient. They can be plugged together in lots of ways, and you can write an Elasticsearch plugin to define more of them.
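For instance, if what you ultimately need is a count or a per-term breakdown rather than every _id, a request along these lines avoids fetching any documents at all. This is only a sketch: the field name "text" is a placeholder for whatever field your query_string actually targets, and the terms aggregation needs doc_values (or fielddata) enabled on that field:

  POST /usertext/_search
  {
    "size": 0,
    "query": {
      "query_string": { "query": "baby car house star giant computer" }
    },
    "aggs": {
      "matched_terms": {
        "terms": { "field": "text" }
      }
    }
  }

With "size": 0 Elasticsearch returns only the total hit count and the aggregation buckets, so the response stays small no matter how many documents match.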

I believe this time value is pretty small. You'll probably want to make it a bit higher; usually I think of these timeouts in minutes. Once the scroll context has timed out, it's gone and you'll have to restart the process.
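As a minimal sketch, assuming the same builder and client as in your code, that would mean something like:

  // Keep-alive per batch; the plain TimeValue(long) constructor is milliseconds,
  // so 1000 means one second. Using the minutes helper makes the intent clearer.
  builder.setScroll(TimeValue.timeValueMinutes(2));
  ...
  response = client.prepareSearchScroll(response.getScrollId())
          .setScroll(TimeValue.timeValueMinutes(2))
          .execute().actionGet();

The keep-alive only needs to cover the time to process one batch before asking for the next, not the whole scan, so a couple of minutes is usually plenty.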
