Problem with faceting on a high-cardinality field

Hi,

Following is the problem case. I have an index with 35,000 docs, and I want to
facet on a particular high-cardinality field (≈ 100) on this index.
I have an associated facet filter, which should always narrow this index down
to some 200 documents, over which I want my facet to run.

Applying the same query/filter in a separate search request to retrieve those
200 docs takes around 10 ms.
Running the facet with a facet_filter (with the same condition) on the same
index gives either a heap-space error or a query timeout after 60 secs.
Initially I thought the high cardinality was causing the problem, but when I
separated those 200 docs out into their own index and ran the facet on that
particular field, the facet results came back within 5 ms.

My assumption is that the facet_filter first selects the matching documents,
and only the field values for those docs are loaded into memory for faceting.
Is that assumption correct? If so, where is the problem? And if not, what is
the workaround?

BTW, I'm on ES 0.19.9.
My filter condition is an AND of a RANGE filter and a TERM filter.
The ES master node is assigned 1 GB and the data node 4 GB; this combination
has been working well on our production servers for quite a long time.
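For reference, the two requests being compared might look roughly like the following, expressed here as Python dicts (all field names and bounds are hypothetical, since the actual mapping isn't shown):

```python
import json

# Hypothetical filter: an AND of a range filter and a term filter,
# mirroring the condition described above (all names are made up).
doc_filter = {
    "and": [
        {"range": {"created_on": {"from": "2013-01-01", "to": "2013-02-01"}}},
        {"term": {"status": "active"}},
    ]
}

# 1) Plain filtered search -- the fast (~10 ms) request.
search_body = {
    "query": {"filtered": {"query": {"match_all": {}}, "filter": doc_filter}},
    "size": 0,
}

# 2) Terms facet with the same condition as a facet_filter --
#    the request that blows up on the full index.
facet_body = {
    "query": {"match_all": {}},
    "facets": {
        "by_field": {
            "terms": {"field": "my_high_card_field", "size": 15},
            "facet_filter": doc_filter,
        }
    },
}

print(json.dumps(facet_body, indent=2))
```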

Thanks in advance,
-- Sujoy.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hey Sujoy,
We had some problems like that as well, and for us it actually helped
a lot to use an "execution_hint" for faceting, e.g.:

"facets" : {
"company" : {
"terms" : {
"field" : "current_company",
"size" : 15,
"execution_hint":"map"
}
}
}

Maybe it helps you as well...

Leonardo Menezes
(+34) 688907766
http://lmenezes.com

On Wed, Feb 27, 2013 at 3:27 PM, Sujoy Sett sujoysett@gmail.com wrote:


Hi Leonardo,

Tried that, but things haven't changed noticeably. Is there any documentation
on "execution_hint"? I probably noticed it in an earlier post on this forum,
though I'm not sure.

Any thoughts?

Thanks,
-- Sujoy

On Wednesday, February 27, 2013 8:07:46 PM UTC+5:30, Leonardo Menezes wrote:


I am completely clueless about the root cause of this issue; it's
pretty weird. I can only think that the filters are not being applied
and the facet is considering the whole data set.
Can the mapping cause this kind of issue?

Please respond, this is a rather critical issue for us.

Thanks in advance

On Thursday, February 28, 2013 12:20:44 PM UTC+5:30, Sujoy Sett wrote:


Hiya


My assumption is that the facet_filter first selects the matching
documents, and only the field values for those docs are loaded into memory
for faceting.

That assumption isn't correct. The field values are loaded for all docs
in the index. And if the field has multiple values, then (in ES < 0.90)
it creates a matrix of number_of_docs * max_number_of_values.

I'm guessing that some of your docs have a large number of values in that
field, hence the memory usage. It also explains why, when you index those
200 docs into a separate index, your heap usage doesn't explode.
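A back-of-the-envelope sketch of that matrix cost (the per-entry size and the per-doc value counts below are assumptions, purely for illustration):

```python
def fielddata_matrix_bytes(num_docs, max_values_per_doc, bytes_per_entry=4):
    """Rough size of the per-field cache described above: one slot per
    (doc, value-position) pair, regardless of how many docs actually
    match the facet filter. bytes_per_entry is an assumption."""
    return num_docs * max_values_per_doc * bytes_per_entry

# Whole index: 35,000 docs; suppose the worst doc has 5,000 values (assumed).
whole_index = fielddata_matrix_bytes(35_000, 5_000)   # 700,000,000 bytes

# Separate index with only the 200 filtered docs, say max 50 values each.
small_index = fielddata_matrix_bytes(200, 50)         # 40,000 bytes

print(whole_index, "bytes vs", small_index, "bytes")
```

That ~700 MB versus ~40 KB gap is consistent with the heap errors on the full index and the 5 ms facet on the 200-doc index.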

clint


Thanks Clint.

So we've got the problem. But is there any workaround to achieve the same
result? Would upgrading to 0.20 help in any way?

-- Sujoy.

On Friday, March 1, 2013 5:32:55 PM UTC+5:30, Clinton Gormley wrote:


Hey Clint

Thanks for your response.
Is it the total number_of_docs in the index, or the number_of_docs that
match the filter criteria?

Regards
Jagdeep



Not quite: it is the total number_of_docs in the index.
(Actually, the cache is per segment, so it is the total number of docs
per segment. But segments get merged into bigger segments, which
suddenly increases the problem.)
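To illustrate the per-segment point with made-up numbers (segment sizes, value counts, and entry size are all assumptions; this is a sketch of the cost model, not ES internals):

```python
def segment_cache_bytes(docs, max_values, bytes_per_entry=4):
    # Each segment keeps its own matrix, sized by ITS doc count and
    # ITS max values-per-doc for the field (entry size is an assumption).
    return docs * max_values * bytes_per_entry

# Before a merge: the pathological doc only inflates its own segment.
before = (
    segment_cache_bytes(10_000, 10)
    + segment_cache_bytes(10_000, 10)
    + segment_cache_bytes(15_000, 1_000)  # segment holding the outlier doc
)

# After merging into one big segment, every doc slot pays for the
# outlier's max value count.
after = segment_cache_bytes(35_000, 1_000)

print(before, "->", after)
```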

clint


On Sun, 2013-03-03 at 22:42 -0800, Sujoy Sett wrote:

Thanks Clint.

So we've got the problem. But is there any workaround to achieve the same
result? Would upgrading to 0.20 help in any way?

No, although the next version of ES (0.90+) will help this problem.

For the moment, what about keeping those docs in a separate index?

clint



Thanks Clint.

We have a combination of 3-4 filters decided at run-time to find the
necessary subset of data; I guess it won't be easy for us to partition the
data considering all those filters :( ...

Had this document subset been a static one, a separate index could have
worked easily.

-- Sujoy.

On Monday, March 4, 2013 4:13:08 PM UTC+5:30, Clinton Gormley wrote:


On Mon, 2013-03-04 at 05:25 -0800, Sujoy Sett wrote:


You may find that only 3% of your docs have a high number of values for
that particular field. Those are the ones you want to move to a
separate index.

E.g. if you have 100 docs with 2 values in a field, plus 1 doc with 1,000
values, then you get a matrix of 101 * 1000.
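In numbers (pure arithmetic over the matrix described above; the split below is illustrative):

```python
def matrix_cells(num_docs, max_values_per_doc):
    # Cells in the number_of_docs * max_number_of_values matrix.
    return num_docs * max_values_per_doc

# Everything in one index: 100 docs with 2 values plus 1 doc with 1,000.
combined = matrix_cells(101, 1_000)                     # 101,000 cells

# Outlier moved to its own index: each matrix stays proportionate.
split = matrix_cells(100, 2) + matrix_cells(1, 1_000)   # 1,200 cells

print(combined, "cells vs", split, "cells")
```

So moving the single outlier doc out shrinks the combined cache by roughly two orders of magnitude.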

clint


Hi,

Upgrading to 0.90.0 Beta helped in local dev.
Currently checking other dependencies for a full update.

Will also consider the data partitioning option.

Thanks very much,
-- Sujoy.

On Monday, March 4, 2013 7:01:11 PM UTC+5:30, Clinton Gormley wrote:


Hey Sujoy,
Is it possible to share some metrics on how much it improved, memory- and
response-time-wise? Thanks,

Leonardo Menezes

http://es.linkedin.com/in/leonardomenezess
http://twitter.com/leonardomenezes

On Mon, Mar 4, 2013 at 3:36 PM, Sujoy Sett sujoysett@gmail.com wrote:

Hi,

Upgrading to 0.90.0 Beta helped in local dev.
Currently checking other dependencies for a full update.

Will also consider the data partitioning option.

Thanks very much,
-- Sujoy.

On Monday, March 4, 2013 7:01:11 PM UTC+5:30, Clinton Gormley wrote:

On Mon, 2013-03-04 at 05:25 -0800, Sujoy Sett wrote:

Thanks Clint.

We have a combination of 3-4 filters decided upon at run-time to find
the necessary subset of data; I guess it won't be easy for us to
partition the data considering all those filters :( ...

Had this document subset been a static one, a separate index could
have worked easily.

You may find that only 3% of your docs have got high numbers of values
for a particular field. Those are the ones you want to move to a
separate index.

eg if you have 100 docs with 2 values in a field, and 1 doc with 1000
values, then you get a matrix of 100 * 1000.

clint

-- Sujoy.

On Monday, March 4, 2013 4:13:08 PM UTC+5:30, Clinton Gormley wrote:
On Sun, 2013-03-03 at 22:42 -0800, Sujoy Sett wrote:
> Thanks Clint.
>
>
> So we got the problem. But is there any way-around to
achieve the
> same?
> Would upgrading to 0.20 be helpful in any way for this?

    No, although the next version of ES (0.90+) will help this
    problem.

    For the moment, what about keeping those docs in a separate
    index?

    clint
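To make Clint's point concrete, here is a back-of-envelope sketch (plain Python arithmetic, not actual ES internals) of how the cell count of that matrix blows up when a single outlier document carries many values:

```python
def fielddata_cells(values_per_doc):
    """ES < 0.90 field data for a multi-valued field: a matrix of
    number_of_docs * max_number_of_values cells, so it is sized by the
    single document with the MOST values, not the average."""
    return len(values_per_doc) * max(values_per_doc)

# 100 docs with 2 values each: a small 100 * 2 matrix.
print(fielddata_cells([2] * 100))           # 200 cells

# Add one doc with 1000 values: every row now gets 1000 columns.
print(fielddata_cells([2] * 100 + [1000]))  # 101000 cells
```

That 500x jump from one document is why moving just the few high-value-count docs into a separate index tames the heap.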



Thanks for the explanation, it's a real help. Will try the approach you
have suggested.


Hi Leonardo,

Sharing a rough estimate of the metrics.

Response time on 0.19.2 as well as 0.19.9 on a 4GB node (production
environment) for a facet query (described in an earlier post) was around 50-60
seconds. Multiple parallel queries were causing heap-space shortage errors.

Response time on 0.90.0 on a 1GB node (dev environment) for the same facet
query is around 2-3 seconds. Haven't tried parallel execution of queries
yet; no heap-space exception so far.

-- Sujoy.

On Monday, March 4, 2013 8:35:04 PM UTC+5:30, Leonardo Menezes wrote:

Hey Sujoy,
is it possible to have some metrics on how it improved? Memory and
response time wise. thanks,

Leonardo Menezes

http://es.linkedin.com/in/leonardomenezess
http://twitter.com/leonardomenezes

wrote:

Sorry if this has already been posted somewhere, but is there an
(approximate) statement of theoretical memory usage for multi-field
(string) facets anywhere? (I'd be particularly interested in a comparison
vs memory usage for facets on nested children for reasons mentioned below)

I had to turn facets off on most of my platforms a few months ago because
we had insufficient memory (even using nested facets and putting documents
containing the highest X% cardinality in a separate index) - I replaced
them with manual calculations on a subset of the data.

Obviously reverting back to using facets would be fantastic, but since it's
a reasonable amount of effort to jump to 0.90, which isn't scheduled for a
few months yet, it would be really helpful to be able to estimate what the
new memory usage per shard would be (eg given X documents containing an
array of (average) size Yavg (Z unique values across the shard), each
element being average size T bytes).

eg the old version was something like (X*Ymax + Z*T)*2*64B, where obviously
the X*Ymax term rapidly became the dominating factor

I started going through the new code in some spare time to see if I could do
it, but all those software engineering tricks made it tricky on a phone UI :)
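For what it's worth, reading the old formula as (X*Ymax + Z*T)*2*64B (the asterisks appear to have been eaten by the list renderer), a quick sketch with made-up inputs shows how the X*Ymax term dominates. The constants and variable names are taken at face value from the formula above, not measured from ES itself:

```python
def old_facet_fielddata_bytes(x_docs, y_max, z_unique, t_bytes):
    """Pre-0.90 per-shard estimate, taking the quoted formula at face
    value: (X*Ymax + Z*T) * 2 * 64 bytes.  Illustrative only."""
    return (x_docs * y_max + z_unique * t_bytes) * 2 * 64

# Hypothetical inputs in the spirit of this thread: 35,000 docs where the
# worst doc has 100 values, and 5,000 unique terms averaging 10 bytes.
est = old_facet_fielddata_bytes(35_000, 100, 5_000, 10)
print(f"{est / 1024 / 1024:.0f} MiB per shard")

# The X*Ymax term alone is 3,500,000 of the 3,550,000 cells, so the
# per-doc value count, not the term dictionary, drives the estimate.
```

Whatever the exact constants, the shape of the formula explains why pulling a handful of high-cardinality docs out of the index helps so much.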

