Advice on memory consumption

Hi, List

We are trying to use Elasticsearch in a data analytics use case where
we have around 7 million transaction records (growing by around 1
million per month), and we are pulling aggregation results out of
them.

In our case, we use facets heavily together with filters/queries. Our
documents are a complex type with arrays and nested objects inside
those arrays, which lets us store certain one-to-many relationships.
We have done a few performance tests and the results turned out to be
very good.

{
  "id": 123,
  "transaction_time": "2011-01-01T01:01:01",
  "attributes": [
    { "attribute_id": 101, "attribute_value": "something1" },
    { "attribute_id": 102, "attribute_value": "something2" },
    { "attribute_id": 103, "attribute_value": "something3" },
    { "attribute_id": 104, "attribute_value": "something4" }
  ],
  "hierarchies": [
    { "role": "agent", "name": "Bill" },
    { "role": "manager", "name": "Shelly" }
  ]
}
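
For reference, the mapping we have been experimenting with looks
roughly like this (simplified and from memory; I am assuming the
nested type is the right way to keep each role/name pair together):

{
  "transaction": {
    "properties": {
      "id": { "type": "long" },
      "transaction_time": { "type": "date" },
      "attributes": {
        "type": "nested",
        "properties": {
          "attribute_id": { "type": "integer" },
          "attribute_value": { "type": "string", "index": "not_analyzed" }
        }
      },
      "hierarchies": {
        "type": "nested",
        "properties": {
          "role": { "type": "string", "index": "not_analyzed" },
          "name": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }
}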

For documents like this, if we run a lot of facets on nested fields
such as hierarchies[0].role, would that require more memory than
faceting on simple fields like the time or the id?

From the facet documentation I am almost certain that it would, but
are there any recommendations, or has someone who has already done
this got some advice for us?

For example, some discussions say that in order to retrieve the
values, ES has to load the nested object values of each document into
memory. If the data set has 10 million records, how much memory are we
talking about? 2G, 4G, 10G? 100G?

If it is around 10G we can handle that, but if it grows to something
like 100G, it becomes very hard for us to justify doing this kind of
calculation with ES. So we would like someone to share their
experience so that we don't run into trouble later.

Can you give an example of how you use a facet on nested fields? Are you using scripting? Do you use nested mappings?

On Thursday, February 9, 2012 at 1:26 AM, Bill Lee wrote:


Bill,

Since you asked for other people's experiences, and our project
involves a lot of aggregation via faceting, here's our data point (for
what it's worth):

I seem to get ~125MB-250MB of usage per million nested string fields,
per replica/original copy (mean value lengths between 32 and 64
bytes). For example, I've just run a facet across 2M nested string
fields on a system with 3 nodes (45 shards) and 1 replica, and it's
using 750MB in total across the nodes.
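
For concreteness, the kind of facet I mean, phrased in terms of your
hierarchies example, is something like the following (assuming a
nested mapping and, if I remember the syntax right, the facet-level
nested option rather than scripting):

{
  "size": 0,
  "query": { "match_all": {} },
  "facets": {
    "roles": {
      "terms": { "field": "hierarchies.role", "size": 20 },
      "nested": "hierarchies"
    }
  }
}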

This is based on checking the field data cache as reported by bigdesk.
There are of course other memory requirements, e.g. the index itself.

This seems to scale approximately linearly (disclaimer: this is based
on very few data points, ~5M being the largest!).
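
Very roughly applying that rate to your numbers (back-of-envelope
only): ~7-10 million documents with two hierarchy entries each is on
the order of 15-20 million nested string values, so at ~125-250MB per
million that would suggest somewhere around 2-5GB of field data per
copy, plus the same again per replica - i.e. much closer to your 10G
scenario than to 100G. Obviously measure it on your own data before
relying on that.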

Non-nested multi-valued fields behave differently and are to be
avoided for large datasets unless you have tight bounds on the max
array size. (See the discussion in
https://groups.google.com/group/elasticsearch/browse_thread/thread/31d87c84dd387367/3324ed6bda200a9a#3324ed6bda200a9a)