Improve query time

Hi!

We currently have 150 million documents in an index, with 10 shards, 3 replicas, and 3 machines (8 cores, 24GB RAM each). Search works, but it is rather slow: query times measure in seconds, often more than 5. I've been playing with several queries:
curl -XGET 'http://188.165.222.156:9200/tweets/_search' -d '{
  "query": {
    "filtered": {
      "query": {"match_all": {}},
      "filter": {"term": {"text": "house"}}
    }
  },
  "sort": {"created_at": {"order": "desc"}}
}'
and
curl -XGET 'http://188.165.222.156:9200/tweets/_search' -d '{
  "query": {"text": {"text": "house"}},
  "sort": {"created_at": {"order": "desc"}}
}'
I have had no luck making these queries performant. I also tried search_type=query_then_fetch, with mixed results.

Any tips?

With kind regards,

Hi,

do not use "sort"

Jörg


First, have you allocated enough memory to ES? When sorting, it needs to load the created_at values into memory; I would suggest allocating something like 12GB to it. Also, why do you need 10 shards with 3 replicas on 3 machines? You end up with 4 copies of each shard, where you really want, I guess, 3, so that each machine holds a copy? In that case, set index.number_of_replicas to 2.

Also, in this case, why do you need 10 shards, especially if you want a copy on each node? Fewer shards means less memory being used (right now you have 10 shards allocated on each node).

Also, after you are done with bulk indexing the 150 million documents, I suggest you optimize the index; it will reduce its memory requirements. Optimize it down to ~5 segments (max_num_segments in the optimize API). Note, this will be a heavy operation.

Last, the initial search requests will take some time, since they need to load the values of created_at into memory. Make sure when testing to warm things up (both loading the values and "warming" the JVM).
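
For the replica and optimize suggestions above, the commands would look something like this (a sketch against a 0.19-era cluster, reusing the host and index name from the original post):

# drop to 2 replicas so each of the 3 machines holds exactly one copy of each shard
curl -XPUT 'http://188.165.222.156:9200/tweets/_settings' -d '{
  "index": {"number_of_replicas": 2}
}'

# after bulk indexing is done, merge down to ~5 segments (heavy on IO, run once)
curl -XPOST 'http://188.165.222.156:9200/tweets/_optimize?max_num_segments=5'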


Unfortunately we do need the results ordered by time. Unless there is another way of doing it?


Hi,

On Monday, 5 March 2012 16:36:06 UTC+1, kimchy wrote:

First, have you allocated enough memory to ES? When sorting, it needs to load the created_at values into memory; I would suggest allocating something like 12GB to it. Also, why do you need 10 shards with 3 replicas on 3 machines? You end up with 4 copies of each shard, where you really want, I guess, 3, so that each machine holds a copy? In that case, set index.number_of_replicas to 2.

I was in error; we actually have 2 replicas. We also allocated 13GB of JVM heap.

Also, in this case, why do you need 10 shards, especially if you want a copy on each node? Fewer shards means less memory being used (right now you have 10 shards allocated on each node).

We chose 10 shards because we expect to scale up to 1 billion items. We guessed we would need more than 3 machines for that.

Also, after you are done with bulk indexing the 150 million documents, I suggest you optimize the index; it will reduce its memory requirements. Optimize it down to ~5 segments (max_num_segments in the optimize API). Note, this will be a heavy operation.

Excellent, we gave that a try. Performance is now around 1000ms +/- 500ms.

Last, the initial search requests will take some time, since they need to load the values of created_at into memory. Make sure when testing to warm things up (both loading the values and "warming" the JVM).

What do you mean exactly by warming up the JVM? Just do 10 random searches?
Or is there some more rigorous way of doing this?


Regarding warming up, just issue something like 100-1000 search requests with the sorting, and only then start to make your measurements.
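
Something as crude as this loop would do (a sketch; the word list is made up, and the query mirrors the one from the original post):

# fire a few hundred sorted searches to load the created_at values and warm the JVM
for word in house car tree water music; do
  for i in $(seq 1 100); do
    curl -s -XGET 'http://188.165.222.156:9200/tweets/_search' -d '{
      "query": {"text": {"text": "'"$word"'"}},
      "sort": {"created_at": {"order": "desc"}}
    }' > /dev/null
  done
done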

What is the load that you are running on the system? I mean, how do you measure the search latency? How many concurrent searches do you execute? When you run the load test, do you see CPU maxing out, or other resources?


Do you have a fixed date when your data starts, and are you looking for something like document freshness? Then you should consider dividing the creation-time period from then until now into a series of buckets, and giving static boost values to the documents in each bucket accordingly. Buckets could be sized by year / month / week / day periods; it depends on the requirements of your app. It is awfully slow (and resource intensive, as Shay described) to sort all docs over and over again, even the aged, outdated ones, and at too fine a granularity such as seconds, when you probably only want to retrieve a handful of the most recently indexed docs.

Jörg


Again with the excellent questions! :wink:

We are now running 850 requests, 20 concurrent (with Ruby Typhoeus). This
was used to warm up the cache. Before running the tests we cleared the
cache.

We measure search latency via CURLINFO_TOTAL_TIME (from libcurl's curl_easy_getinfo()).
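
For reference, the same timing is available from the curl command line (a quick sanity check, not our actual Typhoeus harness):

# print the total request time in seconds, discarding the response body
curl -o /dev/null -s -w '%{time_total}\n' -XGET 'http://188.165.222.156:9200/tweets/_search' -d '{
  "query": {"text": {"text": "house"}},
  "sort": {"created_at": {"order": "desc"}}
}'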

After warming up the cache we ran 10,000 searches. During these searches we monitored Bigdesk and 'top'. Whenever we were not maxing out the disks we were maxing out the CPUs. Memory seems absolutely fine. The average response time with a hot cache was around 600ms; 300ms would make our boss happy (and not only him).


That is exactly what we are looking for!
I, unfortunately, don't know what 'buckets' are. Do you mean
http://www.elasticsearch.org/guide/reference/api/search/facets/date-histogram-facet.html?
And how would 'static boosting' work? I'd imagine this is done at query time, not index time, right?

Thank you in advance. +1


Based on the initial queries you've posted, it looks like you're dealing with time-based data. I think you can get much better performance if you segment the data by time manually instead of auto-sharding as you do now. With your current configuration, every search goes to all 10 shards.

For example, you could create an index per week or month. If you only want the latest data, you'd query just the index that holds it (instead of all 10 shards). Given that you're exhausting CPU and disk IO, I'd expect a significant improvement in performance with this approach.
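
Something along these lines (a sketch; the per-month index names are hypothetical):

# write each document into the index for its month
curl -XPUT 'http://188.165.222.156:9200/tweets-2012-03/tweet/1' -d '{
  "text": "a house by the river",
  "created_at": "2012-03-05T15:59:07"
}'

# query only the freshest index instead of fanning out to all 10 shards
curl -XGET 'http://188.165.222.156:9200/tweets-2012-03/_search' -d '{
  "query": {"text": {"text": "house"}},
  "sort": {"created_at": {"order": "desc"}}
}'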

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype


Can I guarantee the number of returned results? Say the most recent index (N) holds Q matching documents, but I'd like to have Q+P (P>0) results. Will ES automatically query index N+1 (etc.), or do I need to re-search for the remainder?


There is no way to guarantee the number of returned results as far as I know; you'd have to repeat the search against additional indices, as you said. Another option may be to use the count query (which is much faster) first to get the number of results, and then run the query itself against only the necessary indices.
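
For example (a sketch; if I remember correctly, the count API in this version takes the query itself as the request body):

# cheap check: how many matches does the latest index hold?
curl -XGET 'http://188.165.222.156:9200/tweets-2012-03/_count' -d '{
  "text": {"text": "house"}
}'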

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype


As was suggested, segmenting the indices by time can help, especially with the ability to grow the data more easily. It does require extra work on your end: for example, when searching, start with the latest index, and if you don't get enough results, move on to the previous date indices. The nice thing is that this can work OK usability-wise, since you can already display the (possibly partial) results from the first index to the user while you go and fetch additional ones.

But let's stick with the current config and see what we can do; I would love to see some more data. Note, the best thing is to take measurements that do not include cache warmup time, as it brings too much noise into the system. It would also be great if you could run this on 0.19.

  1. Are the 20 concurrent requests running from the same box? Maybe that box is saturated?
  2. Can you also measure the total_time returned in the search response? That is the time spent in the ES cluster; I would love to know the time without the client communication overhead (for example, are you using persistent connections / keep-alive or not?).
  3. Can you execute the search without the sorting? How long does that take (both curl time and total_time, as above)? Sorting is mainly CPU intensive; let's verify the overhead of sorting in your case.
  4. 20 concurrent searches on a 10-shard index cause 200 concurrent requests executed on the cluster of 3 nodes, which might be taxing the CPU (we can verify with #3). You can try to bound the number of concurrent search requests executed on a node by configuring the thread pool for search; see the thread pool module documentation and the sketch below.
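
For point 4, something like this on each node (a sketch; the pool size and queue length are invented, and the exact setting names should be double-checked against the docs for your version):

# bound concurrent searches per node with a fixed search thread pool,
# then restart the node
cat >> config/elasticsearch.yml <<'EOF'
threadpool:
  search:
    type: fixed
    size: 24
    queue_size: 100
EOF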


The method is to assign bucket numbers to date periods.

Imagine you have a start date, and you put the timestamps of the documents into buckets organized by week. That's a basic technique: discretization of timestamps. The first week's bucket would be numbered 0, the second week's 1, and so on up to the current week. Then scale the week bucket number by a viable boost factor, and assign that boost value to the document. I call this a "static boost" because it is not a boost value that is dynamically computed or applied at query time. By assigning the static boost, you let the search engine do a "presorting": the most recent docs are pushed up in the search results at index time, and no longer at query time with "sort".

If you compare sorting 150 million document timestamps on every query, probably at second resolution, against a few hundred week numbers assigned once at index time with no extra sort, you get an idea of how many resources can be saved. If the result is too coarse, you can try day buckets, hour buckets, etc.
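
Concretely, it could look like this (a sketch; the _boost field must be enabled in the mapping, and the start date and scaling factor are invented):

# week bucket for a tweet from 2012-03-05, counting from a 2010-01-01 start date:
# 794 days elapsed -> floor(794 / 7) = week bucket 113 -> boost 1 + 113 * 0.01 = 2.13
curl -XPUT 'http://188.165.222.156:9200/tweets/tweet/1' -d '{
  "_boost": 2.13,
  "text": "a house by the river",
  "created_at": "2012-03-05T15:59:07"
}'

Queries can then drop the "sort" and rely on relevance scoring, with fresher weeks scoring higher.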

Jörg


Before any of this, I ran an optimize command.

  1. The 'box' the requests are running on is my laptop (i7, 4 cores, 4GB RAM). The load is about 0.9, CPU at 12%, no noticeable disk activity, 30Kb/s down, 4Kb/s up. I'd say the machine is fine.

  2. I've used the 'took' field of the responses, assuming it is in milliseconds; is this the total_time you are referring to? I also changed the way we run the tests slightly. First I cleared the caches with a POST request to _cache/clear. Then each run selects 500 words at random from a list of 58,000. I ran two runs (2x500 words) and discarded the first result. The results are:
     5017.185 ms for curl time.
     4960.146 ms for 'took' time.
     Most of the time is spent on the ES cluster. I'm not sure if I'm using keep-alive connections; I think so, as curl uses them by default.

  3. Without the sort:
     3993.570 ms for curl time.
     3951.846 ms for 'took' time.
     I don't think sorting is the problem, strangely enough.

  4. I need to look into this more closely. The same goes for the upgrade from 0.18.7 to 0.19.

One other thing we noticed was that one of our nodes had a really, really high load (40+, then maxed CPU, then maxed IO); we had to switch IPs in order to run the tests.

You might have noticed that performance is now worse than previously reported. This has to do with the new testing strategy: I decided to switch when I noticed wild swings in performance. These results seem more on the mark, but unfortunately also far from what we want.

A last thing I noticed is that performance got better over time, even with clearing the caches. I'm guessing the JVM has something to do with that?

With kind regards,


Have you tried using a range filter on created_at? I think the filter would limit the number of docs your queries are applied to (I'm not sure the term filter does). According to the docs, range filters are cached. When executing a query on 3/8, you could add something equivalent to this pseudo-code: created_at:[3/2 TO 3/9]. If that doesn't give enough results, you can then query with a filter of created_at:[2/24 TO 3/1], and so on. Your initial range could be tuned so that it usually returns enough results. You could use the same range throughout the day and leverage the cache. If the first queries of the day take some time, you could even warm the filter up with a few queries shortly after midnight.
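
In query form, it might look something like this (a sketch; I believe the range filter of this version takes from/to, and the dates are only illustrative):

curl -XGET 'http://188.165.222.156:9200/tweets/_search' -d '{
  "query": {
    "filtered": {
      "query": {"text": {"text": "house"}},
      "filter": {
        "range": {"created_at": {"from": "2012-03-02", "to": "2012-03-09"}}
      }
    }
  },
  "sort": {"created_at": {"order": "desc"}}
}'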

Keep in mind that I have not been using ES for long and have not tried this technique, but I have found that a similar technique using filter queries worked well for Solr.

Regards,
Mark