OutOfMemoryException while trying to get all distinct terms for a field

Hi!

I need to get all distinct terms for a list of fields on a cluster of
six nodes with 3TB of data spread over more than 1000 indices. I use the
following query to get all terms for certain fields:

{
  "query" : {
    "match_all" : { }
  },
  "facets" : {
    "facility" : {
      "terms" : { "field" : "facility", "size" : 300, "order" : "term" }
    },
    "severity" : {
      "terms" : { "field" : "severity", "size" : 300, "order" : "term" }
    },
    "hostname" : {
      "terms" : { "field" : "hostname", "size" : 300, "order" : "term" }
    },
    "timestamp" : {
      "terms" : { "field" : "timestamp", "size" : 300, "order" : "term" }
    },
    "mainCategory" : {
      "terms" : { "field" : "mainCategory", "size" : 300, "order" : "term" }
    },
    "subCategory" : {
      "terms" : { "field" : "subCategory", "size" : 300, "order" : "term" }
    },
    "country" : {
      "terms" : { "field" : "country", "size" : 300, "order" : "term" }
    }
  }
}

We get OutOfMemoryExceptions (with a 4GB heap size) when we issue this query.
How should I change the query to avoid the problem? The Elasticsearch
version is 0.17.8. I know that Lucene has a feature to enumerate all
distinct terms for a field without reading the whole index. How can I
do that with Elasticsearch?
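
For reference, the Lucene feature mentioned above is term enumeration: Lucene stores the distinct terms of each field in sorted order, and a TermEnum can walk them without scanning any documents. Below is only a rough sketch of doing that against a single shard's index on disk with the Lucene 3.x API (the Lucene line this Elasticsearch version builds on); the shard path and field name are placeholders, and this bypasses Elasticsearch entirely:

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class DistinctTerms {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to one shard's Lucene index inside the Elasticsearch data directory.
        File shardDir = new File("/var/data/elasticsearch/nodes/0/indices/logs-2011.10.25/0/index");
        IndexReader reader = IndexReader.open(FSDirectory.open(shardDir), true); // open read-only

        String field = "facility";
        // terms() returns an enumeration positioned at the first term >= ("facility", "").
        // Terms are stored sorted, so we can walk them without touching any documents.
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)) {
                    break; // we have run past the last term of this field
                }
                // docFreq() is the number of documents in this shard that contain the term.
                System.out.println(t.text() + "\t" + terms.docFreq());
            } while (terms.next());
        } finally {
            terms.close();
            reader.close();
        }
    }
}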

For now we have a fixed list for the values - but that is not an
"elastic" solution... :wink:

CU
Thomas

You are using the terms facet, which causes all values of the faceted fields to be loaded into memory.
There is a way to enumerate terms in Lucene, but it's not exposed in Elasticsearch.

On Tue, Oct 25, 2011 at 5:04 PM, Thomas Peuss thomas.peuss@nterra.com wrote:

Hello Shay!

On 26 Oct., 05:02, Shay Banon kim...@gmail.com wrote:

You are using the terms facet, which causes all values of the faceted fields to be loaded into memory.
There is a way to enumerate terms in Lucene, but it's not exposed in Elasticsearch.

So what is the way to go in Elasticsearch then?

CU
Thomas

You can use Luke to analyze the Lucene index.

http://www.getopt.org/luke/

It calculates term counts and percentages.

On Wed, Oct 26, 2011 at 9:54 AM, Thomas Peuss thomas.peuss@nterra.com wrote:

Hello Shay!

So what is the way to go in Elasticsearch then?

Building an API for terms can be a solution. It gets tricky with the
distributed aspect; there used to be one, but it was removed because of the
overhead of supporting it.
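
To illustrate why the distributed aspect is tricky: every shard only knows its own terms and document frequencies, so a terms API (or a client doing this by hand) has to merge the per-shard results into one distinct, sorted list and sum the counts. A minimal sketch of that merge step, assuming you can obtain a term-to-docFreq map from each primary shard (for example with the per-shard enumeration sketched earlier); the example data is made up:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MergeShardTerms {

    // Merge per-shard (term -> docFreq) listings into one globally sorted, distinct list.
    // Only primary shards should contribute, otherwise replicas double the counts.
    public static Map<String, Long> merge(List<Map<String, Long>> perShard) {
        Map<String, Long> global = new TreeMap<String, Long>(); // TreeMap keeps terms sorted
        for (Map<String, Long> shard : perShard) {
            for (Map.Entry<String, Long> entry : shard.entrySet()) {
                Long current = global.get(entry.getKey());
                global.put(entry.getKey(), (current == null ? 0L : current) + entry.getValue());
            }
        }
        return global;
    }

    public static void main(String[] args) {
        Map<String, Long> shard0 = new TreeMap<String, Long>();
        shard0.put("auth", 120L);
        shard0.put("kernel", 300L);

        Map<String, Long> shard1 = new TreeMap<String, Long>();
        shard1.put("kernel", 150L);
        shard1.put("mail", 80L);

        // "kernel" appears in both shards: its counts are summed; the key set is the distinct term list.
        System.out.println(merge(Arrays.asList(shard0, shard1))); // {auth=120, kernel=450, mail=80}
    }
}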

Hi!

On 26 Oct., 15:08, phobos182 phobos...@gmail.com wrote:

You can use Luke to analyze the Lucene index.

http://www.getopt.org/luke/

It calculates term counts and percentages.

I know Luke, but it is not an option. We need to get the terms without
generating a terms list by hand from over one thousand indices...

CU
Thomas

Hi Shay!

On 26 Oct., 22:38, Shay Banon kim...@gmail.com wrote:

So what is the way to go in Elasticsearch then?

Building an API for terms can be a solution. It gets tricky with the
distributed aspect; there used to be one, but it was removed because of the
overhead of supporting it.

IMHO it would be OK if such an API delivered a non-distinct list, if that
is the problem. We could handle the deduplication in our application, and
the list of distinct terms is very limited in our case (which is part of
the problem, because many documents share the same terms). We don't want to
compile such a list by hand; it would be outdated by the time we finished
compiling it. :wink:

CU
Thomas