OutOfMemoryException while trying to get all distinct terms for a field

Hi!

I need to get all distinct terms for a list of fields on a cluster of
six nodes with 3TB of data spread over more than 1000 indices. I use the
following query to get all terms for certain fields:

{
  "query" : {
    "match_all" : { }
  },
  "facets" : {
    "facility" : {
      "terms" : { "field" : "facility", "size" : 300, "order" : "term" }
    },
    "severity" : {
      "terms" : { "field" : "severity", "size" : 300, "order" : "term" }
    },
    "hostname" : {
      "terms" : { "field" : "hostname", "size" : 300, "order" : "term" }
    },
    "timestamp" : {
      "terms" : { "field" : "timestamp", "size" : 300, "order" : "term" }
    },
    "mainCategory" : {
      "terms" : { "field" : "mainCategory", "size" : 300, "order" : "term" }
    },
    "subCategory" : {
      "terms" : { "field" : "subCategory", "size" : 300, "order" : "term" }
    },
    "country" : {
      "terms" : { "field" : "country", "size" : 300, "order" : "term" }
    }
  }
}

We get OutOfMemoryExceptions (with a 4GB heap size) when we issue this query.
How should I change the query to avoid the problem? The Elasticsearch
version is 0.17.8. I know that Lucene has a feature to enumerate all
distinct terms for a field without reading the whole index. How can I
do that with Elasticsearch?
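
For reference, the Lucene feature mentioned above is term enumeration: Lucene stores the distinct terms of each field in sorted order, and a TermEnum can walk them without scanning any documents. Below is only a rough sketch of doing that against a single shard's index on disk with the Lucene 3.x API (the Lucene line this Elasticsearch version builds on); the shard path and field name are placeholders, and this bypasses Elasticsearch entirely:

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class DistinctTerms {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to one shard's Lucene index inside the Elasticsearch data directory.
        File shardDir = new File("/var/data/elasticsearch/nodes/0/indices/logs-2011.10.25/0/index");
        IndexReader reader = IndexReader.open(FSDirectory.open(shardDir), true); // open read-only

        String field = "facility";
        // terms() returns an enumeration positioned at the first term >= ("facility", "").
        // Terms are stored sorted, so we can walk them without touching any documents.
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)) {
                    break; // we have run past the last term of this field
                }
                // docFreq() is the number of documents in this shard that contain the term.
                System.out.println(t.text() + "\t" + terms.docFreq());
            } while (terms.next());
        } finally {
            terms.close();
            reader.close();
        }
    }
}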

For now we have a fixed list for the values - but that is not an
"elastic" solution... :wink:

CU
Thomas

You are using the terms facet, which causes all values of the faceted fields to be loaded into memory.
There is a way to enumerate terms in Lucene, but it's not exposed in Elasticsearch.

On Tue, Oct 25, 2011 at 5:04 PM, Thomas Peuss thomas.peuss@nterra.com wrote:

Hello Shay!

On 26 Oct., 05:02, Shay Banon kim...@gmail.com wrote:

You are using the terms facet, which causes all values of the faceted fields to be loaded into memory.
There is a way to enumerate terms in Lucene, but it's not exposed in Elasticsearch.

So what is the way to go in Elasticsearch then?

CU
Thomas

You can use Luke to analyze the Lucene index.

http://www.getopt.org/luke/

It calculates term counts and percentages.

On Wed, Oct 26, 2011 at 9:54 AM, Thomas Peuss thomas.peuss@nterra.com wrote:

Hello Shay!

So what is the way to go in Elasticsearch then?

Building an API for terms can be a solution. It gets tricky with the
distributed aspect; there used to be one, but it was removed because of the
overhead of supporting it.
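
To illustrate why the distributed aspect is tricky: every shard only knows its own terms and document frequencies, so a terms API (or a client doing this by hand) has to merge the per-shard results into one distinct, sorted list and sum the counts. A minimal sketch of that merge step, assuming you can obtain a term-to-docFreq map from each primary shard (for example with the per-shard enumeration sketched earlier); the example data is made up:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MergeShardTerms {

    // Merge per-shard (term -> docFreq) listings into one globally sorted, distinct list.
    // Only primary shards should contribute, otherwise replicas double the counts.
    public static Map<String, Long> merge(List<Map<String, Long>> perShard) {
        Map<String, Long> global = new TreeMap<String, Long>(); // TreeMap keeps terms sorted
        for (Map<String, Long> shard : perShard) {
            for (Map.Entry<String, Long> entry : shard.entrySet()) {
                Long current = global.get(entry.getKey());
                global.put(entry.getKey(), (current == null ? 0L : current) + entry.getValue());
            }
        }
        return global;
    }

    public static void main(String[] args) {
        Map<String, Long> shard0 = new TreeMap<String, Long>();
        shard0.put("auth", 120L);
        shard0.put("kernel", 300L);

        Map<String, Long> shard1 = new TreeMap<String, Long>();
        shard1.put("kernel", 150L);
        shard1.put("mail", 80L);

        // "kernel" appears in both shards: its counts are summed; the key set is the distinct term list.
        System.out.println(merge(Arrays.asList(shard0, shard1))); // {auth=120, kernel=450, mail=80}
    }
}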

Hi!

On 26 Oct., 15:08, phobos182 phobos...@gmail.com wrote:

You can use Luke to analyze the Lucene index.

http://www.getopt.org/luke/

It calculates term counts and percentages.

I know Luke, but it is not an option. We need to get the terms without
generating a terms list by hand from over one thousand indices...

CU
Thomas

Hi Shay!

On 26 Oct., 22:38, Shay Banon kim...@gmail.com wrote:

So what is the way to go in Elasticsearch then?

Building an API for terms can be a solution. It gets tricky with the
distributed aspect; there used to be one, but it was removed because of the
overhead of supporting it.

IMHO it would be OK if such an API delivered a non-distinct list, if that
is the problem. We could handle the deduplication in our application, and
the list of distinct terms is very limited in our case (which is part of
the problem, because many documents share the same terms). We don't want to
compile such a list by hand; it would be outdated by the time we finished
compiling it. :wink:

CU
Thomas