Hello,
I have a cluster running on 3 nodes, each with 64GB RAM and 3TB of SSD storage. There's one index with the following properties:
- 20 shards
- 1 replica
- 8.3 billion documents
- 2.96TB only in primary shards (5.92TB with replicas)
All servers are running ES 1.4.4 and JVM 1.8.0_31. ES_HEAP_SIZE is set to 32g on all of them.
Under normal conditions the cluster performs very well, both when indexing and when searching. Heap usage sits at around 70%, queries rarely take more than 3 seconds, and most return in under 1 second.
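For reference, heap and fielddata usage per node can be checked with the standard stats APIs (these are generic calls, nothing specific to my setup):
curl -XGET "http://es:9200/_nodes/stats/jvm?pretty"
curl -XGET "http://es:9200/_cat/fielddata?v&fields=url"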
Here is the mapping of the index (simplified for brevity; there are more fields, but they are not relevant):
{
  "articles": {
    "mappings": {
      "article": {
        "_all": {
          "enabled": false
        },
        "_source": {
          "enabled": false
        },
        "properties": {
          "content": {
            "type": "string",
            "norms": {
              "enabled": false
            }
          },
          "url": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    },
    "settings": {
      "index": {
        "refresh_interval": "30s",
        "number_of_shards": "20",
        "analysis": {
          "analyzer": {
            "default": {
              "filter": [
                "icu_folding",
                "icu_normalizer"
              ],
              "type": "custom",
              "tokenizer": "icu_tokenizer"
            }
          }
        },
        "number_of_replicas": "1"
      }
    }
  }
}
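As an aside, the default analyzer above only affects analyzed fields like content; url is not_analyzed and bypasses it entirely. If it helps to see what it does, the _analyze API run against the index uses that default analyzer (the sample text here is arbitrary) and returns the folded, lowercased tokens:
curl -XGET "http://es:9200/articles/_analyze?pretty" -d 'Crème Brûlée'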
Due to a problem with one of my source databases, I need a way to extract around 3 billion documents in order to match the document id (_id) with url. I have _source disabled, but I know that I can use fielddata_fields to get the fielddata, and since url is not_analyzed this is perfectly fine for me. The only problem is that on this particular index fielddata_fields seems to be a huge memory killer.
This is what I'm trying to do:
curl -XGET "http://es:9200/articles/article/_search/?pretty" -d '{"fielddata_fields": ["url"], "query" : {"terms" : {"_id": ["8433552111"]}}}'
or (it doesn't seem to make any difference):
curl -XGET "http://es:9200/articles/article/_search/?pretty" -d '{"fielddata_fields": ["url"], "query" : {"ids" : {"values" : ["8433552111"]}}}'
The result:
{
  "took" : 3616,
  "timed_out" : false,
  "_shards" : {
    "total" : 20,
    "successful" : 20,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "articles",
      "_type" : "article",
      "_id" : "8433552111",
      "_score" : 1.0,
      "fields" : {
        "url" : [ "http://www.reddit.com/r/offbeat/comments/4lfp4x/man_stands_outside_nfl_stadium_hoping_for_a/" ]
      }
    } ]
  }
}
The result is great, but the problem is that this query is killing my cluster pretty quickly. Even when requesting just a couple of documents, heap usage grows to 99% and at some point the nodes become unresponsive.
The logs are full of entries like this:
[2016-05-28 03:15:15,813][WARN ][indices.breaker ] [node3] [FIELDDATA] New used memory 21556768399 [20gb] from field [url] would be larger than configured breaker: 20521628467 [19.1gb], breaking
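For what it's worth, the fielddata cache can be cleared to free that memory again (this is a standard 1.x API, nothing custom), but that obviously doesn't solve the extraction problem:
curl -XPOST "http://es:9200/articles/_cache/clear?fielddata=true"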
I've tried increasing indices.breaker.fielddata.limit, but this only makes the cluster even more unstable. I'm now looking for any way to get this data out as quickly as possible. It is a one-time effort, so ANY solution will be welcome.
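In case it matters, indices.breaker.fielddata.limit is a dynamic setting, so raising it looks roughly like this (the 85% value is only an example, not necessarily what I used):
curl -XPUT "http://es:9200/_cluster/settings" -d '{"transient": {"indices.breaker.fielddata.limit": "85%"}}'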
Thank you!