I'm using the Terms Facet with Elasticsearch V0.20.2. The server has 8 x
Intel Xeon E5-2680 v2 processors and 15GB of memory.
My Terms Facet queries work well as long as the number of documents in the
index is small (e.g. fewer than 20,000). Once the index grows into the
hundreds of thousands or millions of documents, my Terms Facets never
return results. Watching the server, I initially see a few Java processes
using a lot of CPU, but within a few seconds this drops to half a dozen
processes each using ~2% CPU. I never see memory usage increase on the
server as a result of these queries. When these queries fail to return
results, they also sometimes seem to "freeze" Elasticsearch, and I often
have to restart the ES server or even reboot the physical server to get ES
back online for other simple queries.
The fields I'm trying to facet exist for nearly every document and can have
anywhere from 0 to hundreds of different values across the dataset. All
values are text strings, and I'm using a custom analyzer that reduces them
to lowercase. I realize that increasing the number of potential values in
a field will dramatically increase the resources needed for the Terms Facet
query. Even so, I would expect some of the smaller fields to work fine at
scale with millions of documents.
Questions:
1.) My test field (industries) can have no more than 32 unique values.
Each document could have none or all 32 values. Each value can be from 10
to 100 characters of text. This Terms Facet never returns a result at
scale. Any thoughts on what is happening? Is my setup flawed?
2.) Will I ever be able to run a facet on a field that can have millions of
unique text values? I have some data analysis cases like this where I'd
like to use Elasticsearch faceting.
3.) Would reducing the fields I'm faceting on to integers (and then
translating back to text outside ES) make a big difference in performance
and required resources?
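To make question 3 concrete, here is a minimal sketch of the translation I have in mind. The helper names (build_vocab, encode, decode_facet) are hypothetical and not part of Elasticsearch, and the facet entry shape ("term" / "count") is what I assume a terms facet on an integer field would return.

```python
# Hypothetical helpers for question 3: store integer IDs in ES and
# translate them back to text on the client. None of this is an ES API.

def build_vocab(values):
    """Assign an integer ID to each distinct lowercased term, in first-seen order."""
    vocab = {}
    for value in values:
        key = value.lower()
        if key not in vocab:
            vocab[key] = len(vocab)
    return vocab


def encode(terms, vocab):
    """Convert a document's text terms to the integer IDs that would be indexed."""
    return [vocab[term.lower()] for term in terms]


def decode_facet(facet_entries, vocab):
    """Translate integer facet entries back to text terms.

    Assumes each entry looks like {"term": <int>, "count": <int>}.
    """
    reverse = {term_id: term for term, term_id in vocab.items()}
    return [
        {"term": reverse[entry["term"]], "count": entry["count"]}
        for entry in facet_entries
    ]
```

The vocabulary would have to be stored somewhere outside ES and kept stable across reindexing, which is part of why I'm asking whether the gain is worth it.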
Test Query:
curl -X POST "http://remote_host:9200/companies/company/_search?pretty=true" -d '
{
"query" : {
"match_all" : { }
},
"facets" : {
"industries" : {
"terms" : {
"field" : "industries.term.keyword_lowercase",
"size" : 100
}
}
},
"size" : 0
}
'
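For anyone who wants to try the same query against other fields, a rough Python equivalent of the curl call might look like the sketch below. The function names and the 30-second client-side timeout are my own assumptions; the URL, index, type and request body mirror the curl test. The timeout only stops the client from waiting and does not cancel any work on the server.

```python
import json
import urllib.request

# Host, index and type are taken from the curl test query.
SEARCH_URL = "http://remote_host:9200/companies/company/_search"


def build_facet_body(field, size=100):
    """Build the same match_all + terms facet request body as the curl test."""
    facet_name = field.split(".")[0]  # e.g. "industries"
    return {
        "query": {"match_all": {}},
        "facets": {facet_name: {"terms": {"field": field, "size": size}}},
        "size": 0,
    }


def run_facet(field, size=100, timeout_seconds=30):
    """POST the facet query and give up client-side after timeout_seconds."""
    data = json.dumps(build_facet_body(field, size)).encode("utf-8")
    request = urllib.request.Request(
        SEARCH_URL,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling run_facet("industries.term.keyword_lowercase") should send the same request as the curl command, and swapping in another mapped field such as "name.keyword_lowercase" makes it easy to compare fields of different cardinality.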
Index Configuration:
{
"index" : {
"number_of_shards" : 5,
"number_of_replicas" : 1,
"analysis" : {
"analyzer" : {
"default" : {
"tokenizer" : "standard",
"filter" : ["standard", "word_delimiter", "lowercase", "stop"]
},
"html_strip" : {
"tokenizer" : "standard",
"filter" : ["standard", "word_delimiter", "lowercase", "stop"],
"char_filter" : "html_strip"
},
"keyword_lowercase" : {
"tokenizer" : "keyword",
"filter" : "lowercase"
}
}
}
}
}
Company Document Mapping:
** I've removed irrelevant fields
{
"company" : {
"type" : "object",
"include_in_all" : false,
"path" : "full",
"dynamic" : "strict",
"properties" : {
"name" : {
"type" : "multi_field",
"fields" : {
"name" : { "type" : "string", "index" : "analyzed", "include_in_all" : "true", "boost" : 10.0 },
"keyword_lowercase" : { "type" : "string", "index" : "analyzed", "analyzer" : "keyword_lowercase", "include_in_all" : "false" }
}
},
"description" : { "type" : "string", "index" : "analyzed", "include_in_all" : "true", "boost" : 6.0 },
"industries" : {
"type" : "nested",
"include_in_root" : true,
"properties" : {
"term" : {
"type" : "multi_field",
"fields" : {
"term" : { "type" : "string", "index" : "analyzed", "include_in_all" : true, "boost" : 3.0 },
"keyword_lowercase" : { "type" : "string", "index" : "analyzed", "analyzer" : "keyword_lowercase" }
}
},
"description" : { "type" : "string", "index" : "analyzed", "include_in_all" : true },
"score" : { "type" : "integer" },
"verified" : { "type" : "boolean" }
}
}
}
}
}