Best way to sort by a rapidly changing field?

bob_2 · November 26, 2012, 11:38am

What is the best way to sort by a rapidly changing field like the number of
votes? There seem to be two options:

1. Delete and reindex the object with the field changed. How many
   updates per second can an Elasticsearch node usually support? Will this
   cause a lot of merging and thus performance issues?
2. Store the field in MySQL. However, searches would then require fetching
   the entire result set, and with a couple thousand items the results
   returned are huge. Is there a way to return nothing but the _id?
   Even if I sort by timestamp, it appends a sort: [1353929425990], making
   this method even more unsuitable.

{
_index: foo
_type: bar
_id: 1
_score: 1
}

What is the best solution?

I heard that Reddit uses IndexTank which stores fast changing fields
in memory so it can be quickly changed without reindexing. Does
Elasticsearch have something like this?

--

radu_gheorghe · November 26, 2012, 1:28pm

Hello Bob,

I think that with Elasticsearch you'd have to update the value of the field
by deleting and reindexing the document. For convenience, you can use the
Update API, which does the same thing underneath.

I'm not sure how you can solve this problem by using an external store. As
far as I know, if you want to sort on a field with ES, in practice there's
no getting away from at least storing the field in ES.
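
For illustration only: a minimal sketch of a search body that sorts on a `votes` field stored in ES and asks for hits without the document source. The helper name `build_sorted_search` and the `size` default are made up for this example, and `"_source": false` comes from later Elasticsearch releases than the one discussed in this thread, so check what your version supports.

```python
import json


def build_sorted_search(query, sort_field="votes", size=2000):
    """Build a search body that sorts hits by a numeric field indexed in ES.

    Hypothetical helper: the field name and size are assumptions.
    """
    return {
        "query": query,
        # Highest vote count first.
        "sort": [{sort_field: {"order": "desc"}}],
        "size": size,
        # Assumption: supported by later ES releases; omits the stored document.
        "_source": False,
    }


body = build_sorted_search({"match_all": {}})
print(json.dumps(body, indent=2))
```

Each hit still carries its metadata (`_index`, `_id`, `_score`) and the sort values, so this trims the response rather than reducing it to bare ids.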

As for updating ES documents, performance ultimately depends on both your
hardware and what your documents look like. But there are some things to
consider:

- Frequent deletes and inserts will cause a lot of merging, like you said,
  which will stress IO.
- You can store indices in memory if that's an option for you.
- If you use the Update API, you will have to update documents one by one.
  There are a couple of open issues which aim to fix that, but they're
  unresolved for now:
  - Issue #1985, using bulk API with update (using scripts):
    https://github.com/elasticsearch/elasticsearch/issues/1985
  - Issue #1607, Update API: update by query:
    https://github.com/elasticsearch/elasticsearch/issues/1607

So you might be better off if you can use the bulk API for deleting and
reindexing. Or, instead of bulk deletes, you could use delete by query.
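
As a rough sketch of the bulk approach: the bulk API takes newline-delimited JSON, one action line followed by one source line per document, and an `index` action for an existing `_id` replaces that document. The helper name `bulk_index_body` and the sample documents are hypothetical, and `_type` is only included because the thread's examples use it.

```python
import json


def bulk_index_body(docs, index, doc_type=None):
    """Serialize (doc_id, source) pairs into a newline-delimited bulk body.

    Hypothetical helper for illustration; it only builds the payload and
    does not talk to a cluster.
    """
    lines = []
    for doc_id, source in docs:
        meta = {"_index": index, "_id": doc_id}
        if doc_type is not None:
            meta["_type"] = doc_type
        # Action line, then the full replacement document.
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


payload = bulk_index_body(
    [("1", {"votes": 6}), ("2", {"votes": 11})],
    index="foo",
    doc_type="bar",
)
print(payload)
```

The resulting string would be sent as the body of a `_bulk` request. Each source line has to contain the whole document, not just the changed field.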

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

bob_2 · November 26, 2012, 10:03pm

The problem can be solved by using an external store because databases
like MySQL don't need to reindex when you update a field. However,
when you complete a search query, the entire document list matching a
query must be passed to MySQL so the set can be properly sorted.
Unfortunately, Elasticsearch seems to add a lot of unnecessary
metadata, which may make this method unbearably slow.
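
A minimal sketch of the external-store idea described above, assuming the search has already returned the matching ids and the vote counts have been fetched from the database. The dictionary stands in for a MySQL lookup, and the function name `sort_ids_by_votes` is made up for this example.

```python
def sort_ids_by_votes(hit_ids, votes_by_id):
    """Order ES hit ids by vote count, highest first.

    Ids missing from the external store are treated as having 0 votes.
    """
    return sorted(
        hit_ids,
        key=lambda doc_id: votes_by_id.get(doc_id, 0),
        reverse=True,
    )


# Stand-in for the result of a MySQL query keyed by document id.
votes = {"1": 5, "2": 42, "3": 17}
ranked = sort_ids_by_votes(["1", "2", "3", "4"], votes)
print(ranked)
```

The sort itself is cheap for a couple thousand ids. The cost in this approach is transferring every matching hit out of ES before sorting.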

Is there any benefit to storing the document instead of manually
deleting and indexing a new copy? Would the increased size affect the
performance?

How many indexes per second would be possible per node? 10? 100? 1000?

Are there any tips for minimizing the impact of merges? I would like
to avoid long GC pauses.

--

jprante · November 26, 2012, 10:41pm

Hi Bob,

Can you describe the size of the challenge a little? How many
numbers do you want to insert per second? How many sorts do you want to
perform per second? Are your numbers only discrete numbers (integers)? How
long is the sorted list you want to generate per query?

There are some options to consider:

- Simple field caching of numbers. ES can hold field data completely in
  RAM, so under certain circumstances you can sort very fast. Be prepared to
  have lots of RAM.
- Manipulate the score values to create a custom scoring algorithm that
  produces your sort order.
- Wait for ES 0.21. Lucene 4 comes with support for column stride fields.
  The new DocValues fields are suitable for frequently changing values like
  click feedback or user ratings. See this great presentation by Simon
  Willnauer:
  http://de.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
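
To make the score-manipulation option more concrete, here is a hedged sketch of a query body that replaces the relevance score with a function of a `votes` field, so that sorting by score orders hits by votes. It uses the `function_score` query with `field_value_factor`, which comes from later Elasticsearch releases than the ones discussed in this thread. The helper name `votes_scored_query` and the field name are assumptions.

```python
import json


def votes_scored_query(filter_query, field="votes"):
    """Build a query whose score is derived from a numeric field.

    Hypothetical helper; function_score is an assumption about the ES
    version in use and is not available in 2012-era releases.
    """
    return {
        "query": {
            "function_score": {
                "query": filter_query,
                "field_value_factor": {
                    "field": field,
                    # log1p is monotonic, so the ranking follows the vote count.
                    "modifier": "log1p",
                    # Documents without the field score as if they had 0 votes.
                    "missing": 0,
                },
                # Ignore the text relevance score and use the function value.
                "boost_mode": "replace",
            }
        }
    }


body = votes_scored_query({"match_all": {}})
print(json.dumps(body, indent=2))
```

This still reads the value from the index, so it changes how the order is computed, not how often the document has to be reindexed.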

Best regards,

Jörg

--

bob_2 · November 27, 2012, 11:02am

Is 50 updates a second reasonable? I would like to have as many
updates as possible, but it is impossible to tell how much traffic I
will eventually get.

I just want to sort on 1 field, number of votes, after the query has
filtered some documents. I would estimate the number of documents
returned to be anywhere from 1 to 2000.

The fields are all integers and longs with 5-10 nested objects. The
document might look like this:

{
  "votes": 5,
  "items": [
    {"a": 1},
    {"b": 2},
    {"c": 3}
  ]
}

The column stride fields sound nice. When is ES 0.21 coming out?

--

radu_gheorghe · November 27, 2012, 12:50pm

Hello Bob,

On Tue, Nov 27, 2012 at 12:03 AM, Bob bottiger10@gmail.com wrote:

> The problem can be solved by using an external store because databases
> like MySQL don't need to reindex when you update a field. However,
> when you complete a search query, the entire document list matching a
> query must be passed to MySQL so the set can be properly sorted.
> Unfortunately, Elasticsearch seems to add a lot of unnecessary
> metadata, which may make this method unbearably slow.

Oh, I see. I thought you wanted to sort in ES while keeping data in an
external data store.

> Is there any benefit to storing the document instead of manually
> deleting and indexing a new copy? Would the increased size affect the
> performance?

I don't think I understand your question. Could you rephrase it? Do you mean
inserting a new doc without deleting the old one?

> How many indexes per second would be possible per node? 10? 100? 1000?

That depends on quite a lot of factors, like:

- node hardware
- size of docs
- mapping (whether you index everything or just certain fields, analyzers,
  etc.)
- number of shards
- whether you use bulk or not, and what the bulk size is

I'd suggest you start with a subset of your data and do a little
performance run on a test machine to get an idea. If you really need some
numbers to start with, here's a simple test I did on my laptop:

https://gist.github.com/radu-gheorghe/4072200

$ cat inserter.py
import pyes
import sys

conn = pyes.ES('127.0.0.1:9200', bulk_size=int(sys.argv[2]))

for i in range(int(sys.argv[1])):
  conn.index({"name":"Joe Tester", "parsedtext":"Joe Testere nice guy", "uuid":"11111", "position":1}, "test-index", "test-type", bulk=True)

$ time python inserter.py 1000 1

[Gist truncated here; see the link above for the full output.]

ES settings were default, except I was using just one shard for my index.
The laptop itself is pretty average, and you can see docs are small. But
the important result here is the big difference between inserting one doc
at a time (500 indexes/sec) and inserting 1000 at a time (6500 indexes/sec).

> Are there any tips for minimizing the impact of merges? I would like
> to avoid long GC pauses.

AFAIK merges don't have anything to do with GC. You can use a monitoring tool
like Bigdesk[0] or our own SPM for Elasticsearch[1] to monitor how GC
behaves and whether you would want to tune something there.

Regarding merges, you might want to look at store level throttling[2] to
make sure your IO won't be suffocated by merges. Also, I'd suggest looking
at merge policies[3] to see whether the default policy fits your needs or you
need to tweak something there.

[0] Bigdesk (GitHub, lukas-vlcek/bigdesk): live charts and statistics for an Elasticsearch cluster
[1] SPM, Elasticsearch Monitoring
[2] Elasticsearch documentation, store level throttling
[3] Elasticsearch documentation, merge policies

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--
