Memory issues with facets


(slushi) #1

I am trying to use facets to implement autocomplete behavior (suggesting
completions of the search term as the user types), but I have been having
memory issues. I have a single-node index with 2 shards, each with about
450K documents and 150K terms in the field being used for autocomplete.
After starting the server, I ran my facet search, which looks something
like this:

{
  "size" : 0,
  "query" : {
    "prefix" : { "f" : "foo" }
  },
  "facets" : {
    "autocomplete" : {
      "terms" : {
        "field" : "f",
        "regex" : "foo.*",
        "regex_flags" : "DOTALL",
        "size" : 5
      }
    }
  }
}

The BigDesk cache graph showed the field cache jumping up to 800,000,000.
This seems high; these are basically tags, so they should be short strings.
I saw this thread, https://groups.google.com/forum/?fromgroups#!topic/elasticsearch/Mdh8hN04c2c,
which is probably relevant, as the field is indeed multivalued. I will try
switching to nested documents and see if there is any improvement. A few
additional questions:

  1. The docs say that by default field cache entries never expire. I
    assume the cache is still kept coherent with the index, so if docs are
    updated/deleted/expired, the relevant cache entries are evicted?
  2. Does the number of shards per node increase field cache memory usage?
    I.e., would memory usage go down if I used one shard rather than 2?
  3. I have a field dedicated to autocomplete; would it be better to
    simply split that field out into its own index rather than use nested
    docs?
  4. Does this approach (facets for autocomplete) make sense given the
    size of the field? I was using Solr before, which had a way (the terms
    component, http://wiki.apache.org/solr/TermsComponent) to directly
    access the index terms; that seems much more efficient than what I am
    doing here. Would it make more sense to build a separate index for
    autocomplete by periodically mining the top terms from the main index
    and then doing "normal" prefix queries to get suggestions?
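The separate-index idea in point 4 can be sketched in a few lines of plain Python. This is only an illustration of the ranking logic (the function names and data shapes are made up for the example, not Elasticsearch APIs): mine term frequencies from the main index, then answer prefix lookups against the mined counts.

```python
from collections import Counter

def build_suggestions(tag_lists):
    """Count tag frequency across all documents, standing in for
    periodically mining top terms from the main index."""
    return Counter(tag for tags in tag_lists for tag in tags)

def suggest(counts, prefix, size=5):
    """Return the most common terms matching the prefix, which is what a
    'normal' prefix query over a dedicated suggestion index would give."""
    matches = [(term, n) for term, n in counts.items() if term.startswith(prefix)]
    matches.sort(key=lambda t: (-t[1], t[0]))  # most frequent first, then alphabetical
    return [term for term, _ in matches[:size]]
```

With counts mined from documents tagged ["food", "football"], ["food"], and ["bar"], suggest(counts, "foo") ranks "food" ahead of "football" because it occurs more often.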

Thanks!


(Shay Banon) #2

Using facets for autocomplete usually does not make sense; search the
mailing list for how to do it with edge n-grams.
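For readers unfamiliar with the suggestion: an edge n-gram analyzer indexes every leading substring of a term, so autocomplete becomes a cheap exact-term match at query time instead of a regex-filtered facet. A minimal sketch of what such a token filter emits (this is a plain-Python approximation, not the Elasticsearch filter itself):

```python
def edge_ngrams(term, min_gram=1, max_gram=20):
    """Emit the leading substrings of a term, roughly what an edge n-gram
    token filter produces at index time."""
    top = min(len(term), max_gram)
    return [term[:i] for i in range(min_gram, top + 1)]
```

Index these as tokens; then a plain term query on whatever the user has typed so far hits the precomputed grams, so no regex or facet work is needed per keystroke.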


(slushi) #3

I was planning to try n-grams as well. One issue I am unsure about is how
to boost popular terms in the n-gram index. The nice thing about the Solr
TermsComponent or facets is that I can suggest common terms as the user
types instead of just the "closest" term. If I build an autocompletion
index while I am building my main index, it's not clear to me how to
easily boost terms according to their popularity in the index.
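One way to reconcile the two (sketched here in plain Python under the assumption that the autocomplete index stores one entry per distinct term, with a count field maintained during indexing; the names are hypothetical): keep edge n-gram lookups for matching, but rank the candidates by their corpus frequency rather than by closeness alone.

```python
from collections import Counter, defaultdict

def index_with_popularity(docs):
    """Build an edge-n-gram -> terms lookup plus per-term counts, carrying
    index popularity into a separate autocomplete structure."""
    counts = Counter()
    grams = defaultdict(set)
    for tags in docs:
        for term in tags:
            counts[term] += 1
            for i in range(1, len(term) + 1):
                grams[term[:i]].add(term)
    return grams, counts

def suggest(grams, counts, typed, size=5):
    """Look up candidate terms by the typed prefix and rank them by how
    often they occur in the corpus."""
    candidates = grams.get(typed, set())
    return sorted(candidates, key=lambda t: (-counts[t], t))[:size]
```

In Elasticsearch terms, the count could live as a numeric field on each term document and feed a score boost, so the matching stays in the n-gram field while popularity drives the ordering.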



(Ludovic Fleury) #4

Hey, I'm new to ES, but I have the same need: I have a list of places,
each with a city, and I want to provide an autocomplete feature over the
facet values. I can't see how to do that without facets (since it's an
aggregation of the unique values of a field in my index type). Any hints?
Thanks



(system) #5