Fastest way to extract _id + fielddata

Hello,

I have a cluster running on 3 nodes, each with 64GB RAM and 3TB of SSD storage. There's one index with the following properties:

  • 20 shards
  • 1 replica
  • 8.3 billion documents
  • 2.96TB in primary shards alone (5.92TB including replicas)

All servers are running ES 1.4.4 and JVM 1.8.0_31. ES_HEAP_SIZE is set to 32g on all of them.
Under normal conditions the cluster performs very well, both when indexing and when searching. Heap usage stays around 70%, and queries rarely take more than 3 seconds; most return in under 1 second.

Here is the mapping of the index (simplified for brevity, there are more fields, but they are not relevant):

{
    "articles":{
       "mappings":{
          "article":{
             "_all":{
                "enabled":false
             },
             "_source":{
                "enabled":false
             },
             "properties":{
                "content":{
                   "type":"string",
                   "norms":{
                      "enabled":false
                   }
                },
                "url":{
                   "type":"string",
                   "index":"not_analyzed"
                }
             }
          }
       },
       "settings":{
          "index":{
             "refresh_interval":"30s",
             "number_of_shards":"20",
             "analysis":{
                "analyzer":{
                   "default":{
                      "filter":[
                         "icu_folding",
                         "icu_normalizer"
                      ],
                      "type":"custom",
                      "tokenizer":"icu_tokenizer"
                   }
                }
             },
             "number_of_replicas":"1"
          }
       }
    }
}

Due to a problem with one of my source databases, I need to extract around 3 billion documents in order to match each document id (_id) with its url. I have _source disabled, but I know I can use fielddata_fields to read the value back from fielddata, and since url is not_analyzed this is perfectly fine for me. The only problem is that on this particular index fielddata_fields turns out to be a huge memory killer.

This is what I'm trying to do:

curl -XGET "http://es:9200/articles/article/_search/?pretty" -d '{"fielddata_fields": ["url"], "query" : {"terms" : {"_id": ["8433552111"]}}}'

or (it doesn't seem to make any difference):

curl -XGET "http://es:9200/articles/article/_search/?pretty" -d '{"fielddata_fields": ["url"], "query" : {"ids" : {"values" : ["8433552111"]}}}'

The result:

{
  "took" : 3616,
  "timed_out" : false,
  "_shards" : {
    "total" : 20,
    "successful" : 20,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "articles",
      "_type" : "article",
      "_id" : "8433552111",
      "_score" : 1.0,
      "fields" : {
        "url" : [ "http://www.reddit.com/r/offbeat/comments/4lfp4x/man_stands_outside_nfl_stadium_hoping_for_a/" ]
      }
    } ]
  }
}

The result is exactly what I need, but this query kills my cluster pretty quickly. Even when requesting just a couple of documents, heap usage grows to 99% and at some point the nodes become unresponsive.

The logs are full of entries like this:

[2016-05-28 03:15:15,813][WARN ][indices.breaker          ] [node3] [FIELDDATA] New used memory 21556768399 [20gb] from field [url] would be larger than configured breaker: 20521628467 [19.1gb], breaking

I've tried increasing indices.breaker.fielddata.limit, but this only makes the cluster even more unstable. I'm now looking for any solution to get this data as quickly as possible. It is a one-time effort, so ANY solution will be welcome.

Thank you!

Hey there.

I'll just share my thoughts; maybe they will help you.

About fielddata_fields:

It’s important to understand that using the fielddata_fields parameter will cause the terms for that field to be loaded to memory (cached), which will result in more memory consumption.

I think that is exactly what's happening in your case. When the query executes, all urls get loaded into the heap, and since the urls are presumably mostly unique, that's a lot of them.

Put differently, I think you are loading every url into memory when you only need the ones matching your ids.
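
By the way, you can watch how much heap the url fielddata is actually taking on each node while you experiment; if I remember correctly the cat API for this already exists in 1.4:

curl -XGET "http://es:9200/_cat/fielddata?v&fields=url"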

One approach would be to copy the index data files to a new cluster, although 3TB is a lot of data. Once there, you could delete by query all documents that don't match the desired ids, just to reduce the amount of fielddata that has to be loaded.
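
If you go that route, the delete by query API is still available in 1.x. A rough sketch (the host stands for the detached copy and the ids are made up; with billions of wanted ids you would more likely delete the documents you know you don't need, in batches):

curl -XDELETE "http://es-copy:9200/articles/article/_query" -d '{
  "query": {
    "ids": { "values": ["1111111111", "2222222222"] }
  }
}'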

You could also increase the heap size. However, you would need to take into account that going above 32GB loses compressed object pointers, so roughly 8GB is wasted; 48GB, for example, behaves more like 40GB. Additionally, GC performance degrades with a bigger heap. But since it's a one-time thing on a detached cluster, that doesn't really matter.
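
On that detached cluster it would just be a matter of raising ES_HEAP_SIZE before starting the node; for example (the exact startup command depends on how ES is installed):

export ES_HEAP_SIZE=48g   # effectively closer to 40g once compressed oops are gone
bin/elasticsearch -d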

Good luck and if you manage to solve it, I would love to hear how you did it.

Thanks for the reply! It seems it's my fault for not enabling doc_values on the url field :sweat: and now I'm being punished, because reindexing is not an option: I no longer have about 1/3 of the source documents anywhere, and that is exactly what I'm trying to (at least partially) restore. Too bad doc_values weren't the default back in 1.4. Your solution sounds reasonable and I think I will go for it.
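
For the record, on a new index it would only have taken mapping url with doc_values enabled, something like:

"url":{
   "type":"string",
   "index":"not_analyzed",
   "doc_values":true
}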

So there's absolutely no way to generate doc_values at this point?

I'm not aware of a way to generate doc_values without reindexing, and reindexing requires either the original data or being able to read the data back out of your current index. So you seem stuck here.

I was afraid of that answer. I'm now extracting other data (two integer fields with doc_values enabled) and it's going well, but for the url field we will most probably go with your solution.
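
For reference, extracting those doc_values-backed fields is basically just a scan/scroll dump, roughly like this (field names changed here; with search_type=scan the size is per shard and the first response only returns the scroll id):

curl -XGET "http://es:9200/articles/article/_search?search_type=scan&scroll=5m&size=1000" -d '{"fielddata_fields": ["int_field_a", "int_field_b"], "query": {"match_all": {}}}'

curl -XGET "http://es:9200/_search/scroll?scroll=5m" -d 'SCROLL_ID_FROM_PREVIOUS_RESPONSE'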

Instead of using snapshots, would it be possible to add a 4th server to the cluster with cluster.routing.allocation.enable set to none, increase the number of replicas from 1 to 2, manually allocate all of the new replica shards to that node, wait for them to finish initializing, then shut it down, change its cluster name and start it again? Would it then work as a standalone cluster with the whole index intact?
The reasons I want to do this are that 1) we haven't used snapshots before, 2) I think it will be faster, and 3) I won't need another ~3TB of temporary storage for the snapshot.
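
Roughly, this is what I have in mind, untested so far (the node name is just a placeholder):

# 1) stop automatic shard allocation
curl -XPUT "http://es:9200/_cluster/settings" -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'

# 2) add the 4th node, then bump the replica count
curl -XPUT "http://es:9200/articles/_settings" -d '{"number_of_replicas": 2}'

# 3) manually place each new replica on the new node (repeat for shards 0..19)
curl -XPOST "http://es:9200/_cluster/reroute" -d '{"commands": [{"allocate": {"index": "articles", "shard": 0, "node": "node4"}}]}'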

I was also considering your solution of adding a node, and it seems legit to me. However, I would suggest you try all the steps involved in a dev environment, or disable allocation and try them with a new, small index first. You should think about where the new replicas will get spawned and initialized; they require disk and memory, and spawning them on the current nodes might cause problems. I do believe that changing the name of the cluster should detach it, but I would still check that in a dev environment to avoid surprises or instability.

At this point I don't think I have more experience than you do. I would come up with a list of steps and try them out first in a safe way, to make sure they don't produce surprises.