Terms Facet never returns a response when the number of documents increases past tens of thousands


(tbrianjones) #1

I'm using the Terms Facet with Elasticsearch V0.20.2. The server has 8 x
Intel Xeon E5-2680 v2 processors and 15GB of memory.

My Terms Facet queries work great as long as the number of documents in the
index is small (e.g., fewer than 20,000). Once the index grows into the
hundreds of thousands or millions of documents, my Terms Facets never return
results. Watching the server, I initially see a few Java processes using a
lot of CPU, but within a few seconds this drops to a half dozen processes
each using ~2% CPU. I never see memory usage increase on the server as a
result of these queries. When these queries fail to return results, they
also sometimes seem to "freeze" Elasticsearch, and I often have to restart
the ES server or even reboot the physical server to get ES back online for
other simple queries.

The fields I'm trying to facet exist for nearly every document and can have
anywhere from 0 to hundreds of different values across the dataset. All
values are text strings, and I'm using a custom analyzer that lowercases
them. I realize that increasing the number of potential values in a field
will dramatically increase the resources needed for the Terms Facet query.
Even so, I would expect some of the smaller fields to work fine at scale
with millions of documents.

Questions:

1.) My test field (industries) can have no more than 32 unique values.
Each document can have anywhere from none to all 32 values. Each value is
10 to 100 characters of text. This Terms Facet never returns a result at
scale. Any thoughts on what is happening? Is my setup flawed?

2.) Will I ever be able to run a facet on a field that can have millions of
unique text values? I have some data analysis cases like this where I'd
like to use Elasticsearch faceting.

3.) Would reducing the fields I'm faceting on to integers (and then
translating back to text outside ES) make a big difference in performance
and required resources?

Test Query:

curl -X POST "http://remote_host:9200/companies/company/_search?pretty=true" -d '
{
  "query" : {
    "match_all" : { }
  },
  "facets" : {
    "industries" : {
      "terms" : {
        "field" : "industries.term.keyword_lowercase",
        "size" : 100
      }
    }
  },
  "size" : 0
}
'

Index Configuration:

{
  "index" : {
    "number_of_shards" : 5,
    "number_of_replicas" : 1,
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "standard",
          "filter" : ["standard", "word_delimiter", "lowercase", "stop"]
        },
        "html_strip" : {
          "tokenizer" : "standard",
          "filter" : ["standard", "word_delimiter", "lowercase", "stop"],
          "char_filter" : "html_strip"
        },
        "keyword_lowercase" : {
          "tokenizer" : "keyword",
          "filter" : "lowercase"
        }
      }
    }
  }
}

Company Document Mapping:

(I've removed irrelevant fields.)

{
  "company" : {
    "type" : "object",
    "include_in_all" : false,
    "path" : "full",
    "dynamic" : "strict",
    "properties" : {
      "name" : {
        "type" : "multi_field",
        "fields" : {
          "name" : { "type" : "string", "index" : "analyzed", "include_in_all" : "true", "boost" : 10.0 },
          "keyword_lowercase" : { "type" : "string", "index" : "analyzed", "analyzer" : "keyword_lowercase", "include_in_all" : "false" }
        }
      },
      "description" : { "type" : "string", "index" : "analyzed", "include_in_all" : "true", "boost" : 6.0 },
      "industries" : {
        "type" : "nested",
        "include_in_root" : true,
        "properties" : {
          "term" : {
            "type" : "multi_field",
            "fields" : {
              "term" : { "type" : "string", "index" : "analyzed", "include_in_all" : true, "boost" : 3.0 },
              "keyword_lowercase" : { "type" : "string", "index" : "analyzed", "analyzer" : "keyword_lowercase" }
            }
          },
          "description" : { "type" : "string", "index" : "analyzed", "include_in_all" : true },
          "score" : { "type" : "integer" },
          "verified" : { "type" : "boolean" }
        }
      }
    }
  }
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3e608b31-8569-49d3-b9fa-20d3a1e4a597%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Alexander Reelsen) #2

Hey,

Can you test with a more recent version of Elasticsearch first? There have
been some dramatic improvements to faceting.
Also, you should explain your setup a bit more. Faceting can need a lot of
memory with lots of documents because it uses so-called fielddata, so you
should configure and monitor Elasticsearch appropriately.

See
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html#field-data

--Alex

(In reply to Brian Jones's original post of Wed, Dec 18, 2013 at 10:51 PM.)


(Ivan Brusic) #3

I completely agree about upgrading. Elasticsearch 0.90 introduced numerous
memory improvements, including one issue that directly affects you. With
the previous versions (0.20 and prior), high cardinality fields, such as
your industry field, would use inefficient data structures to load the
faceted values. Your situation would be greatly improved with 0.90+. Easily.

In terms of your last point, I would use numerical values whenever
possible. My taxonomy is known in advance, so I do client-side lookups for
numerical values. I do not have statistics for how much of an improvement
it is, but you can do the simple math of how much can be saved. In Java,
ints are 4 bytes, while each character in UTF-8 (assuming you are using
Unicode) takes 1 to 4 bytes. It all adds up.
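
The "simple math" can be sketched roughly as below. This is only a back-of-the-envelope comparison of raw value sizes, not a model of how Elasticsearch actually lays out fielddata in memory; the sample labels and the helper names (`utf8_size`, `raw_payload_bytes`) are made up for illustration.

```python
INT_BYTES = 4  # size of a Java int, as stated in the post


def utf8_size(text):
    """Number of bytes needed to encode `text` as UTF-8."""
    return len(text.encode("utf-8"))


def raw_payload_bytes(labels):
    """Return (bytes if stored as UTF-8 strings, bytes if stored as 4-byte ints).

    Counts only the raw values themselves; any per-object or per-array
    overhead inside the JVM is ignored.
    """
    as_text = sum(utf8_size(label) for label in labels)
    as_ints = INT_BYTES * len(labels)
    return as_text, as_ints


# Hypothetical industry labels, not taken from the thread.
labels = ["computer software", "financial services", "biotechnology"]
text_bytes, int_bytes = raw_payload_bytes(labels)
print("as text:", text_bytes, "bytes; as ints:", int_bytes, "bytes")
```

With values of 10 to 100 characters each, as described in the original question, the gap per stored value would be correspondingly larger than in this small sample.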

Cheers,

Ivan

(In reply to Alexander Reelsen's message of Thu, Dec 19, 2013 at 7:00 AM.)


(tbrianjones) #4

I built a fresh server with Elasticsearch V0.90.9 and all my problems went
away without making a single change to my index or code. It's also running
on a much smaller server (CPU and memory) than my previous install, with no
problems.

The new install even loaded, from Amazon S3, an index that was originally
created by a server running 0.20.x.

This is great. Elasticsearch impresses and surprises me again.

I will probably still investigate converting the text I'm faceting on to
integer IDs that I'll convert back to strings for users on the app end of
things. It sounds like this will further increase performance.
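
A minimal sketch of what that app-side translation might look like, assuming the taxonomy is known in advance (as in Ivan's setup). The taxonomy, the counts, and the helper names are all hypothetical, and the facet result is assumed to be a dict with a "terms" list of {"term": ..., "count": ...} entries.

```python
# Hypothetical taxonomy: integer ID -> display label (not from the thread).
INDUSTRY_LABELS = {
    1: "computer software",
    2: "financial services",
    3: "biotechnology",
}
# Reverse lookup used when indexing documents.
INDUSTRY_IDS = {label: industry_id for industry_id, label in INDUSTRY_LABELS.items()}


def encode_industries(labels):
    """Translate text labels to integer IDs before indexing a document."""
    return [INDUSTRY_IDS[label.lower()] for label in labels]


def decode_terms_facet(facet):
    """Translate a terms facet keyed by integer ID back to (label, count) pairs."""
    return [
        (INDUSTRY_LABELS.get(entry["term"], "unknown (%s)" % entry["term"]), entry["count"])
        for entry in facet["terms"]
    ]


# Made-up facet result in the assumed shape, with IDs as terms.
sample_facet = {"terms": [{"term": 2, "count": 1204}, {"term": 1, "count": 987}]}
print(encode_industries(["Computer Software", "Biotechnology"]))
print(decode_terms_facet(sample_facet))
```

Keeping the lookup table in the application means the index only ever sees the integer field, at the cost of having to keep the table in sync with whatever IDs have already been indexed.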

(In reply to Alexander Reelsen's message of Thursday, December 19, 2013 7:00:13 AM UTC-8.)

