OutOfMemoryError when using faceted search

Hi,

I keep getting an OutOfMemoryError when trying to do a faceted search
with a large number of unique facet values. Here is the setup:

I index about 2M Twitter messages and extract names (of people,
organizations, etc.) from them. More than 1M unique names are
extracted that way. I index each "name" as a non-analyzed string.

When I run a faceted search to get the 10 most frequent names with
their counts, the server runs out of memory. I currently have
ES_MIN_MEM=2048M and the same for ES_MAX_MEM. The out-of-memory error
happens even if I restrict the query to run on a small subset of the
Twitter documents.

Could anybody explain why this is happening, and whether there is any
way to deal with it?

Here are the details.

1. Create the soc-media-index:

curl -XPUT 'http://localhost:9200/soc-media-index/'

2. Add a mapping for indexing a nested named entity within a soc-media document:

curl -XPUT 'http://localhost:9200/soc-media-index/soc-media/_mapping' -d '
{
  "soc-media" : {
    "properties" : {
      "entity" : {
        "type" : "nested",
        "include_in_parent" : true,
        "properties" : {
          "name" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "nametype" : {
            "type" : "string",
            "index" : "not_analyzed"
          }
        }
      }
    }
  }
}
'
3. Index 2M+ Twitter messages.
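
A hypothetical sample document for this step (the "message" field and
all the values are made up for illustration; only "entity", "name",
and "nametype" come from the mapping above):

curl -XPUT 'http://localhost:9200/soc-media-index/soc-media/1' -d '
{
  "message" : "Met Jane Doe from Acme Corp today",
  "entity" : [
    { "name" : "Jane Doe", "nametype" : "person" },
    { "name" : "Acme Corp", "nametype" : "organization" }
  ]
}
'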

4. Run a faceted query to get the top 10 people:

{
  "query" : {
    "nested" : {
      "_scope" : "my_scope",
      "path" : "entity",
      "query" : {
        "term" : { "nametype" : "person" }
      }
    }
  },
  "facets" : {
    "entities" : {
      "terms" : { "field" : "entity.name" },
      "scope" : "my_scope"
    }
  }
}

5. Get an out-of-memory error.

Narrowing the query scope to cover only a few documents still leads to
the out-of-memory error (which is counter-intuitive to me). I
currently run locally with 1 shard / 1 replica.

I would appreciate any help.

Thank you

Zmicier

All those values need to be loaded into memory in order to do faceting, regardless of the number of hits you ask for (facets are computed across all matching docs). You will need to increase the memory used / add more servers. I was hoping to optimize the memory usage of facets for 0.19, but it slipped, though it's one of the first things on the list to do for 0.20.
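
As a very rough, purely illustrative calculation (the byte figures are
assumptions, not measurements): 1M+ unique names at a few dozen bytes
per loaded string is already tens of MB for the values alone, and the
field cache also keeps per-document entries for all ~2M documents
(more when a document carries several entities), so a 2 GB heap that
is also serving indexing and other caches fills up quickly. Until the
facet memory work lands, the practical workaround is a bigger heap,
for example (4096M is just an example value, use whatever the machine
allows; run from the Elasticsearch install directory):

export ES_MIN_MEM=4096M
export ES_MAX_MEM=4096M
./bin/elasticsearch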

Sounds great, looking forward to 0.20.

Thank you, yes, memory optimization in faceting would be invaluable
for me.

Would it help solve my problem if I combined name and type into a
single field per type, such as "person_name", and did faceting on that
combined field? That would reduce the number of unique values per
field. So the query above would become:

{
  "facets" : {
    "entities" : {
      "terms" : { "field" : "person_name" }
    }
  }
}
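
Each document would then carry one field per entity type, along these
lines (the field names and values here are purely hypothetical):

{
  "message" : "...",
  "person_name" : ["Jane Doe"],
  "organization_name" : ["Acme Corp"]
}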

But if I apply such an approach across the board in my system, it will
lead to a huge number of artificial fields. Is Elasticsearch scalable
enough to deal with thousands (or even tens of thousands) of different
fields within a single index?

No, it won't help much...

Hmm... as a fellow search/IR expert, I was wondering if that is the whole story...

From what I understand of ES, each shard is a complete index. If you're faceting on a field, this means you're not loading the lexicon into memory just once, but once per shard. (I sort of assume here that Lucene doesn't do this for every segment as well, which would be terrible.)

I also looked up how we do faceting in our own (distributed) search engine. It works a bit differently: we run the search, take the top-K documents, grab the term IDs from the direct index, calculate a score for each term (we do scoring, not just counting) based on the document score (i.e. the search score) and the TF/DF of the term, have each server send back its top N facets, and do a distributed map-reduce on the resulting terms to retrieve the final N facets.

While I know this is quite different from the 'count' facets that ES supports, it is much more scalable and produces quite good scored facets.

Hope this helps,

Stefan.