10 million tweets in an index with 3 shards (I know I should use one
index per day ... I'll do that later ;))
feeding only ~50-100 tweets per second (the Twitter search API is the
limit now), but after ~400 tweets I'm doing a hard refresh. That is
necessary so that I can always search (for retweets, duplicates etc.)
before indexing.
ES is started via ** on a 64-bit JVM; the OS still has ~7 GB.
I would like to create warm-up queries like it is possible in Solr.
Can I implement this in Elasticsearch? Is there an internal event
which tells Elasticsearch that a new Lucene searcher is or should be
used?
What other optimizations regarding the JVM, cache or index settings could
I try?
I also noticed that ES caches queries, so in case you need certain
things very often you could just send the queries and then they should
be cached from then on.
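For illustration, such a cache-priming request might look like this (a sketch of a 0.x-era search body; the index name `twindex` and the field `user` are made up), sent e.g. as `POST /twindex/_search` right after a refresh so later user-facing queries hit warm caches:

```json
{
  "size": 0,
  "query": { "match_all": {} },
  "facets": {
    "users": { "terms": { "field": "user" } }
  }
}
```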
Yes, exactly. Normally you shouldn't do that because it decreases
indexing speed, which in my case is acceptable.
Yes, that is one option. But my problem is the timing. When should I
send those queries? Every second?
I would like to make sure that users see response times of less than a
second (although even this is a bit too much).
I must be doing something wrong. I'll investigate whether the massive use
of facets is the problematic part of the query.
BTW: I'm already using a relatively high number (20s) for the
refresh_interval to relax the real-time requirement.
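For reference, that setting can be changed at runtime via the update index settings API (a sketch; `twindex` is a made-up index name), e.g. `PUT /twindex/_settings` with:

```json
{
  "index": { "refresh_interval": "20s" }
}
```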
Are those requests/searches similar somehow? You could try to see
what the top N queries are and then cache them within the app,
refreshing this cache every N seconds/minutes etc., and instead
return only the cached values.
In case you also use a lot of faceting you might have a look at the
current trunk - I saw Shay did some improvements regarding facet
speed there.
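Such an app-side cache of the top-N queries could be sketched like this (assumptions: Python, and `fetch` stands in for whatever actually calls ES; nothing here is ES-specific):

```python
import time

class QueryCache:
    """App-side cache for frequent queries, refreshed lazily at most
    every `ttl` seconds. `fetch` is a stand-in for the real search
    call against the index."""

    def __init__(self, fetch, ttl=30.0, clock=time.monotonic):
        self.fetch = fetch          # callable: query -> result
        self.ttl = ttl              # seconds a cached result stays fresh
        self.clock = clock          # injectable for testing
        self._cache = {}            # query -> (timestamp, result)

    def get(self, query):
        now = self.clock()
        hit = self._cache.get(query)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]           # still fresh: serve the cached result
        result = self.fetch(query)  # stale or missing: re-run the search
        self._cache[query] = (now, result)
        return result
```

With this, users always get an answer from the cache in well under a second, and the real search cost is paid at most once per TTL per query.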
There isn't an option to have warm-up queries when a new reader is created in ES currently. This one is tricky to get right (you want to do it only when you really want to), but it's certainly planned.
On Friday, April 1, 2011 at 12:08 AM, Karussell wrote:
Hi,
simple question, but I know it depends on everything.
Does ES depend on the order of filters for performance, or is
the order not important? I only know this behaviour from Solr, where
even the order of the filters was important.
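To make the question concrete: in a 0.x-style `filtered` query with an `and` filter, the clauses appear in some order, and my question is whether putting the cheaper or more selective clause first matters (the field names `lang` and `created_at` are made up):

```json
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "and": [
          { "term": { "lang": "en" } },
          { "range": { "created_at": { "from": "2011-04-01" } } }
        ]
      }
    }
  }
}
```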
Shay, thanks for the information!
Do you know any other trick for how I could improve query speed while
indexing at the same time?
Or should I go with polling some default queries every 30 seconds or so?
And assuming my 10 million tweet index is split up by time - say into 3
indices - this won't really improve query speed without adding hardware,
right?
(Except when I know beforehand which indices the query should go to ...
hmm, I'll think about that.)
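That last idea could be sketched like this (assuming a made-up `tweets-YYYY-MM-DD` naming scheme for daily indices): derive the index names from the query's date range and search only those, e.g. via `/tweets-2011-04-08,tweets-2011-04-09/_search`:

```python
from datetime import date, timedelta

def indices_for_range(start, end, prefix="tweets-"):
    """Build the list of daily index names covering [start, end],
    inclusive (a sketch; the 'tweets-YYYY-MM-DD' naming is made up)."""
    names = []
    d = start
    while d <= end:
        names.append(prefix + d.isoformat())
        d += timedelta(days=1)
    return names
```

Queries without a date constraint would still have to fan out to all indices, so this only helps when the date range is known up front.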
Yes, decreasing the merge factor will increase query speed, since fewer segments mean less work when searching a single shard, but it will make indexing slower / more expensive.
Another option to improve search speed is to reduce the term index interval and play with the term index divisor. Lowering the term interval will improve search performance, but will cause more memory to be used (which you can control with the divisor). Both can also be set dynamically, though the term interval only applies to newly indexed docs.
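For reference, the knobs mentioned here should correspond to index settings roughly like the following (a sketch; the values are arbitrary examples, not recommendations, and the exact setting names may differ between versions):

```json
{
  "index": {
    "merge.policy.merge_factor": 5,
    "term_index_interval": 64,
    "term_index_divisor": 2
  }
}
```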
On Saturday, April 9, 2011 at 5:38 PM, Karussell wrote:
I'll investigate the index settings a bit more now: