OutOfMemoryError when using faceted search

Hi,

I keep getting an OutOfMemoryError when trying to do a faceted search
with a large number of unique facet values. Here is the setup:

I index about 2M Twitter messages and extract names (of people,
organizations, etc.) from them. More than 1M unique names are
extracted that way. I index each "name" as a non-analyzed string.

When I run a faceted search to get the 10 most frequent names with
their counts, the server runs out of memory. I currently have
ES_MIN_MEM=2048M and the same for ES_MAX_MEM. The out-of-memory error
happens even if I restrict the query to run on a small subset of the
Twitter documents.

Could anybody explain why this is happening, and whether there is any
way to deal with it?

Here are the details.

1. Create the soc-media-index:

curl -XPUT 'http://localhost:9200/soc-media-index/'

2. Add a mapping for indexing a nested named entity within a soc-media document:

curl -XPUT 'http://localhost:9200/soc-media-index/soc-media/_mapping' -d '
{
  "soc-media" : {
    "properties" : {
      "entity" : {
        "type" : "nested",
        "include_in_parent" : true,
        "properties" : {
          "name" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "nametype" : {
            "type" : "string",
            "index" : "not_analyzed"
          }
        }
      }
    }
  }
}
'
3. Index 2M+ Twitter messages.
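
A hypothetical sample document for this step (the "message" field and
all the values are made up for illustration; only "entity", "name",
and "nametype" come from the mapping above):

curl -XPUT 'http://localhost:9200/soc-media-index/soc-media/1' -d '
{
  "message" : "Met Jane Doe from Acme Corp today",
  "entity" : [
    { "name" : "Jane Doe", "nametype" : "person" },
    { "name" : "Acme Corp", "nametype" : "organization" }
  ]
}
'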

4. Run a faceted query to get the top 10 people:

{
  "query" : {
    "nested" : {
      "_scope" : "my_scope",
      "path" : "entity",
      "query" : {
        "term" : { "nametype" : "person" }
      }
    }
  },
  "facets" : {
    "entities" : {
      "terms" : { "field" : "entity.name" },
      "scope" : "my_scope"
    }
  }
}

5. Get an out-of-memory error.

Narrowing the query scope to cover only a few documents still leads to
the out-of-memory error (which is counter-intuitive to me). I
currently run locally with 1 shard / 1 replica.

I would appreciate any help.

Thank you

Zmicier

All those values need to be loaded into memory in order to do faceting, regardless of the number of hits you ask for (facets are computed across all matching docs). You will need to increase the memory used / add more servers. I was hoping to optimize the memory usage of facets for 0.19, but it slipped, though it's one of the first things on the list to do for 0.20.
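
As a very rough, purely illustrative calculation (the byte figures are
assumptions, not measurements): 1M+ unique names at a few dozen bytes
per loaded string is already tens of MB for the values alone, and the
field cache also keeps per-document entries for all ~2M documents
(more when a document carries several entities), so a 2 GB heap that
is also serving indexing and other caches fills up quickly. Until the
facet memory work lands, the practical workaround is a bigger heap,
for example (4096M is just an example value, use whatever the machine
allows; run from the Elasticsearch install directory):

export ES_MIN_MEM=4096M
export ES_MAX_MEM=4096M
./bin/elasticsearch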

Sounds great, looking forward to 0.20.

Thank you, yes, memory optimization in faceting would be invaluable
for me.

Would it help solve my problem if I combined name and type into a
single field per type, such as "person_name", and did faceting on that
combined field? That would reduce the number of unique values per
field. So the query above would become:

{
  "facets" : {
    "entities" : {
      "terms" : { "field" : "person_name" }
    }
  }
}
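
Each document would then carry one field per entity type, along these
lines (the field names and values here are purely hypothetical):

{
  "message" : "...",
  "person_name" : ["Jane Doe"],
  "organization_name" : ["Acme Corp"]
}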

But if I apply such an approach across the board in my system, it will
lead to a huge number of artificial fields. Is Elasticsearch scalable
enough to deal with thousands (or even tens of thousands) of different
fields within a single index?

No, it won't help much...

Hmm... as a fellow search/IR expert, I was wondering if that is the whole story...

From what I understand of ES, each shard is a complete index. If you're faceting on a field, this means you're not loading the lexicon into memory just once, but once per shard. (I sort of assume here that Lucene doesn't do this for every segment as well, which would be terrible.)

I also looked up how we do faceting in our own (distributed) search engine. It works a bit differently: we run the search, take the top-K documents, grab the term IDs from the direct index, calculate a score for each term (we do scoring, not just counting) based on the document score (i.e. the search score) and the TF/DF of the term, have each server send back its top N facets, and do a distributed map-reduce on the resulting terms to retrieve the final N facets.

While I know this is quite different from the 'count' facets that ES supports, it is much more scalable and produces quite good scored facets.

Hope this helps,

Stefan.