Best way to sort by a rapidly changing field?

What is the best way to sort by a quickly changing field like number of
votes? There seem to be 2 options.

  1. Delete and reindex the object with the field changed. How many
    updates a second can an Elasticsearch node usually support? Will this
    cause a lot of merging and thus performance issues?

  2. Store the field in MySQL. However, searches would require fetching
    the entire dataset, and with a couple thousand items the results
    returned are huge. Is there a way to disable everything but the _id?
    If I sort by timestamp, it even appends a sort: [1353929425990], making
    this method even more unsuitable.

{
_index: foo
_type: bar
_id: 1
_score: 1
}

What is the best solution?

I heard that Reddit uses IndexTank which stores fast changing fields
in memory so it can be quickly changed without reindexing. Does
Elasticsearch have something like this?

--

Hello Bob,

I think that with Elasticsearch you'd have to update the value of the field
by deleting and reindexing the document. For convenience, you can use the
Update API, which does the same thing underneath.
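
A minimal sketch of what such an update request could look like, assuming
a recent Elasticsearch version (the endpoint layout and script language
were different in the 0.20 era). The helper name build_vote_update and the
index name "foo" are only illustrative, and the code builds the request
without sending it:

```python
import json


def build_vote_update(doc_id, increment=1, index="foo"):
    """Return (path, body) for a scripted update that bumps the votes field.

    Assumption: the POST /<index>/_update/<id> endpoint with a Painless
    script, as found in recent Elasticsearch versions.
    """
    path = f"/{index}/_update/{doc_id}"
    body = {
        "script": {
            "lang": "painless",
            # The increment is passed as a parameter so the script text
            # stays identical across calls.
            "source": "ctx._source.votes += params.count",
            "params": {"count": increment},
        }
    }
    return path, json.dumps(body)


path, body = build_vote_update(1, increment=3)
print("POST", path)
print(body)
```

Even with a script, the document is still reindexed internally, so this
saves a round trip rather than the indexing cost.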

I'm not sure how you can solve this problem by using an external store. As
far as I know, if you want to sort on a field with ES, in practice there's
no getting away from at least storing the field in ES.

As for updating ES documents, performance depends in the end on both your
hardware and what your documents look like. But there are some things to
consider:

So you might be better off if you can use the bulk API for deleting and
reindexing.

Or, instead of bulk deletes, you could use delete by query.
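
As a rough illustration of the bulk approach, here is a sketch that builds
the newline-delimited body the _bulk endpoint expects. The function name
build_bulk_body and the index name "foo" are made up for the example.
Indexing a document under an existing _id replaces the old version, so
this sketch sends only "index" actions and no separate deletes:

```python
import json


def build_bulk_body(docs, index="foo"):
    """Serialize (doc_id, source) pairs as a newline-delimited bulk payload."""
    lines = []
    for doc_id, source in docs:
        # Action line: says what to do and to which document.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        # Source line: the full new version of the document.
        lines.append(json.dumps(source))
    # Every line, including the last one, must be terminated by a newline.
    return "".join(line + "\n" for line in lines)


body = build_bulk_body([(1, {"votes": 6}), (2, {"votes": 11})])
print(body, end="")
```

The resulting string would be sent as the body of a POST to /_bulk.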

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

On Mon, Nov 26, 2012 at 1:38 PM, Bob bottiger10@gmail.com wrote:

What is the best way to sort by a quickly changing field like number of
votes? [...]

--

The problem can be solved by using an external store because databases
like MySQL don't need to reindex when you update a field. However,
when you complete a search query, the entire document list matching a
query must be passed to MySQL so the set can be properly sorted.
Unfortunately, Elasticsearch seems to add a lot of unnecessary
metadata, which may make this method unbearably slow.
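
To make the idea concrete, here is a small sketch of the external-store
sort. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the
table name items, its columns, and the function sort_ids_by_votes are all
hypothetical. The ids passed in represent whatever the search returned:

```python
import sqlite3


def sort_ids_by_votes(conn, ids):
    """Return the given ids ordered by vote count, highest first."""
    if not ids:
        return []
    # One placeholder per id; very long id lists may need to be chunked,
    # since databases limit the number of bound parameters.
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id FROM items WHERE id IN ({placeholders}) "
        "ORDER BY votes DESC, id ASC",
        list(ids),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, votes INTEGER NOT NULL)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, 5), (2, 42), (3, 17), (4, 99)])

# Pretend the search matched documents 1, 2 and 3.
print(sort_ids_by_votes(conn, [1, 2, 3]))  # [2, 3, 1]
```

Updating a vote count is then a plain UPDATE on one row, with no
reindexing on the search side.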

Is there any benefit to storing the document instead of manually
deleting and indexing a new copy? Would the increased size affect the
performance?

How many indexes per second would be possible per node? 10? 100? 1000?

Are there any tips for minimizing the impact of merges? I would like
to avoid long GC pauses.

On Nov 26, 5:28 am, Radu Gheorghe radu.gheor...@sematext.com wrote:

Hello Bob,

I think that with Elasticsearch you'd have to update the value of the field
by deleting and reindexing the document. [...]

--

Hi Bob,

Can you describe a little bit what the size of the challenge is? How many
numbers do you want to insert per second? How many sorts do you want to
perform per second? Are your numbers only discrete numbers (integers)? How
long is the sorted list you want to generate per query?

There are some options to consider:

  • simple field caching of numbers. ES can hold field data completely in
    RAM, so under certain circumstances you can sort very fast. Be prepared
    to have lots of RAM.

  • manipulate the score values, to create a custom scoring algorithm for
    obtaining your sorting order

  • wait for ES 0.21 ... Lucene 4 comes with support for column stride
    fields. The new DocValue fields are suitable for frequently changing
    values like click feedback or user ratings. See this great presentation
    by Simon Willnauer:
    http://de.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
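
For the first option, sorting directly on the numeric field, a search
request could be built roughly as follows. This is only a sketch assuming
a recent Elasticsearch version; the function name build_sorted_search is
illustrative, and "votes" is the field name from this thread. Setting
"_source" to false asks for hits without the document body, so each hit
carries little more than its metadata and sort value:

```python
import json


def build_sorted_search(filter_query, size=2000):
    """Build a search body that sorts matching documents by votes, descending."""
    body = {
        "query": filter_query,
        # Sort on the numeric field instead of the relevance score.
        "sort": [{"votes": {"order": "desc"}}],
        # Do not return the stored document, only hit metadata.
        "_source": False,
        "size": size,
    }
    return json.dumps(body)


print(build_sorted_search({"match_all": {}}, size=10))
```

The resulting string would be sent as the body of a search request
against the index.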

Best regards,

Jörg

--

Is 50 updates a second reasonable? I would like to have as many
updates as possible, but it is impossible to tell how much traffic I
will eventually get.

I just want to sort on 1 field, number of votes, after the query has
filtered some documents. I would estimate the number of documents
returned to be anywhere from 1 to 2000.

The fields are all integers and longs with 5-10 nested objects. The
document might look like this:

{
votes: 5,
items:
[
{'a': 1},
{'b': 2},
{'c': 3}
]
}

The column stride fields sound nice. When is ES 0.21 coming out?

On Nov 26, 2:41 pm, Jörg Prante joergpra...@gmail.com wrote:

Hi Bob,

Can you describe a little bit what the size of the challenge is? [...]

--

Hello Bob,

On Tue, Nov 27, 2012 at 12:03 AM, Bob bottiger10@gmail.com wrote:

The problem can be solved by using an external store because databases
like MySQL don't need to reindex when you update a field. However,
when you complete a search query, the entire document list matching a
query must be passed to MySQL so the set can be properly sorted.
Unfortunately, Elasticsearch seems to add a lot of unnecessary
metadata, which may make this method unbearably slow.

Oh, I see. I thought you wanted to sort in ES while keeping data in an
external data store.

Is there any benefit to storing the document instead of manually
deleting and indexing a new copy? Would the increased size affect the
performance?

I don't think I understand your question, could you rephrase? Do you mean
to insert a new doc but not delete the old one, or?

How many indexes per second would be possible per node? 10? 100? 1000?

That depends on quite a lot of factors, like:

  • node hardware
  • size of docs
  • mapping (whether you index everything or just certain fields, analyzers,
    etc)
  • number of shards
  • whether you use bulk or not, and what the bulk size is

I'd suggest you start with a subset of your data and do a little
performance run on a test machine to get an idea. If you really need some
numbers to start with, here's a simple test I did on my laptop:

ES settings were default, except I was using just one shard for my index.
The laptop itself is pretty average, and you can see docs are small. But
the important result here is the big difference between inserting one doc
at a time (500 indexes/sec) and inserting 1000 at a time (6500 indexes/sec).
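
One way to take advantage of that difference is to buffer vote changes in
the application and send them in batches. The sketch below is only an
illustration: the class name VoteBuffer and the flush callback are
hypothetical, and coalescing several votes for the same document into one
pending update is an extra assumption, not something the test above
measured:

```python
from collections import Counter


class VoteBuffer:
    """Accumulate vote increments per document and hand them off in batches."""

    def __init__(self, flush, max_pending=1000):
        self._flush = flush              # callable receiving {doc_id: increment}
        self._max_pending = max_pending  # distinct documents held before flushing
        self._pending = Counter()

    def add(self, doc_id, count=1):
        # Repeated votes for the same document collapse into one entry.
        self._pending[doc_id] += count
        if len(self._pending) >= self._max_pending:
            self.flush()

    def flush(self):
        if self._pending:
            batch = dict(self._pending)
            self._pending.clear()
            self._flush(batch)


sent = []
buf = VoteBuffer(sent.append, max_pending=2)
buf.add("a")
buf.add("a")     # still one pending document, nothing sent yet
buf.add("b", 3)  # second distinct document triggers a flush
print(sent)      # [{'a': 2, 'b': 3}]
```

In a real setup the flush callback would build and send a bulk request,
and a timer would also call flush() so that quiet periods do not delay
updates indefinitely.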

Are there any tips for minimizing the impact of merges? I would like
to avoid long GC pauses.

AFAIK merges don't have anything to do with GC. You can use a monitoring
tool like Bigdesk[0] or our own SPM for Elasticsearch[1] to monitor how GC
behaves and whether you would want to tune something there.

Regarding merges, you might want to look at store level throttling[2] to
make sure your IO won't be suffocated by merges. Also, I'd suggest looking
at merge policies[3] to see whether the default policy fits your needs or
you need to tweak something there.

[0] Bigdesk (GitHub, lukas-vlcek/bigdesk): live charts and statistics for an Elasticsearch cluster
[1] SPM: Elasticsearch monitoring
[2] Elasticsearch reference: store level throttling
[3] Elasticsearch reference: merge policies

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--