Facet performance degrading on large indexes

Hi,
I've been testing facet performance over the last week, and at first it seemed like the perfect solution for my needs. Now I'm not so sure.

I have 2 indexes, index1 at 50 GB and index2 at 160 GB (both with the mapping below).
Each user can have up to 1000 different products (distinct term values). Each term can reach a count of up to 4M.
index1 has 15M users and index2 has 65M users (main documents).
According to elasticsearch-head:
index1 has a total of docs: 189550849 (276425354)
index2 has a total of docs: 1452655206 (1840196363)

Running the query below on index1 takes ~5 seconds to return, which is not that fast to begin with, but that's for the first run before caching. On the other hand, running it on index2 takes up to 30 seconds, which is above my SLA.

I also tried using warmers (see below), but I think that because the dates in the facet_filter keep changing, the cache is not that helpful.

I'm using 8 m3.2xlarge nodes in my cluster (8 cores, 30 GB RAM each); 20 GB are allocated to ES.

I can tell from BigDesk that not all nodes are participating in the facet calculation, and that CPU usage on the participating nodes is not that high (up to 30%), but it stays there for a relatively long time.

I think that my biggest challenge here is to find the right warmer queries for this task, but I will try anything that can make my queries go faster.

Is the index too big for what I'm trying to do?
Is the document/sub-document count too big for it?
Does the productCode field have too many distinct term values for the calculation to run in a reasonable time?

Update: forgot to mention I'm using v0.90.2.

Thanks in advance,

query example:

{
  "size": 0,
  "facets": {
    "tags": {
      "terms": {
        "field": "productCode",
        "size": 1000,
        "regex": "PRODUCT\\d+"
      },
      "nested": "products",
      "facet_filter": {
        "range": {
          "products.time": {
            "from": "2013-12-01",
            "to": "2013-12-31",
            "include_lower": true,
            "include_upper": true
          }
        }
      }
    }
  }
}

mapping:

{
  "user": {
    "_ttl": {
      "enabled": true
    },
    "properties": {
      "products": {
        "type": "nested",
        "properties": {
          "time": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "productCode": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}

warmer:
{
  "query": { "match_all": {} },
  "size": 0,
  "facets": {
    "tags": {
      "terms": {
        "field": "productCode",
        "size": 1000
      },
      "nested": "products"
    }
  }
}
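Since the warmer above has no date filter, it never warms the filter cache for the range filters that the real queries use. One option (a sketch, untested on 0.90.x; the warmer name `products_month` and the use of index1 are made up for illustration) is to register a warmer that includes the date-range facet_filter with dates rounded to fixed day boundaries. If the live queries also round their from/to values the same way, the cached range filter is reused instead of recomputed for every slightly different timestamp. Registered via the Put Warmer API (`PUT /index1/_warmer/products_month`):

```json
{
  "query": { "match_all": {} },
  "size": 0,
  "facets": {
    "tags": {
      "terms": {
        "field": "productCode",
        "size": 1000
      },
      "nested": "products",
      "facet_filter": {
        "range": {
          "products.time": {
            "from": "2013-12-01",
            "to": "2013-12-31"
          }
        }
      }
    }
  }
}
```

The warmer would need re-registering as the date window moves, so this only helps if the query window changes infrequently (e.g. once per day).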

Oreno,

A few suggestions worth trying (for experimentation):

  1. Can you move the facet_filter criteria up to the query level - i.e.
    remove it from the facet_filter and apply it as a nested filter in the
    top-level query?

  2. Find a way to eliminate the regex condition in the terms facet. Is there
    a way to express it (as a filter) in the query part of your overall
    request?

Please try the above and report back if there is any improvement.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8d2ad90f-02e4-40a0-8e6b-d3fc075fe39b%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Binh,
thanks for the input.

1. I have to put the filter at the facet level because I want the facets to count only products that appear within specific dates. If I filter only in the query, then I will also be counting products that are outside the range, just because they exist in the documents returned by the main query.

2. I don't think the regex is the problem here. From what I've read, the facet loads the entire field into memory even if it is filtered or a specific term is explicitly requested.
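For what it's worth, the two ideas are not mutually exclusive: one can keep the facet_filter so the counts stay date-correct, and additionally narrow the parent document set at the query level so the facet iterates over fewer documents. A sketch (untested, reusing the dates and fields from the original example):

```json
{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {
        "nested": {
          "path": "products",
          "filter": {
            "range": {
              "products.time": {
                "from": "2013-12-01",
                "to": "2013-12-31"
              }
            }
          }
        }
      }
    }
  },
  "facets": {
    "tags": {
      "terms": {
        "field": "productCode",
        "size": 1000,
        "regex": "PRODUCT\\d+"
      },
      "nested": "products",
      "facet_filter": {
        "range": {
          "products.time": {
            "from": "2013-12-01",
            "to": "2013-12-31"
          }
        }
      }
    }
  }
}
```

The query-level nested filter drops users with no purchases in the window before the facet phase runs; the facet_filter then still excludes each remaining user's out-of-range purchases from the counts.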

If you have any other suggestions I will be more than happy to try.

Thanks!

Oreno,

  1. I'm not sure what you mean. The facet section of the query will only
    execute against the results narrowed down by your query section. So if
    your query section narrowed the results down to, say, 10 documents, the
    facet section will only facet against those 10 documents. I have found
    that for very large data sets, narrowing things down in the query part
    generally works better than the facet_filter. But this requires
    experimentation to see if it works for you. Give it a try anyway - I'm
    also curious to know if it will make a difference for you.

  2. The regex will slow things down a bit. Depending on how many terms you
    have, it may be insignificant or it may be a lot. Again, try it with and
    without the regex and see if it makes any difference.

But I'd definitely try #1 above first to see if it has any effect.

If it doesn't do anything significant, then I'd start looking at better
disks (like SSD) and/or increasing the shard/node count to distribute that
load out some more.


1. What I meant is that even within those 10 docs that came out of the main query, I only want to count products whose date falls in the specified range, like "count how many users bought PRODUCT001 between date x and date y", but with that same calculation executed for every product. What I don't want is for PRODUCT001's count to be incremented because another returned doc contains it, but at irrelevant dates.
If I don't have a filter in the facet, it will just count all appearances matching the regex in those 10 docs and ignore the dates at that point.
Hope I made it clear.

Also, the mapping I had to set up for the facets is actually making the main requests pretty expensive, and those are the kind of queries ES is normally great at.

  2. Tried it - unfortunately there's no performance improvement.

*In my older mapping, a single request came in under a second, but I had to do more than 500 of these requests in a single bulk request (facets were not suitable for my calculation in that mapping), which sent CPU usage through the roof, took a long time, and made it hard to support multiple users.

So now I've moved to facets, and I guess I've hit some limitation here as well.

Thanks,

Oreno,

Just to clarify, I'm talking about a query like the one below (note the
filter part at the query level). Does it not meet your requirement, or am I
still confused?

{
  "query": {
    "filtered": {
      "query": { whatever },
      "filter": {
        "nested": {
          "path": "products",
          "filter": {
            "range": {
              "products.time": {
                "from": "2013-12-01",
                "to": "2013-12-31",
                "include_lower": true,
                "include_upper": true
              }
            }
          },
          "_cache": true
        }
      }
    }
  },
  "facets": {
    "tags": {
      "terms": {
        "field": "productCode",
        "size": 1000
      },
      "nested": "products"
    }
  }
}
