What is the best way to sort by a quickly changing field, such as the number of
votes? There seem to be two options:
1. Delete and reindex the object with the field changed. How many
updates per second can an Elasticsearch node usually support? Will this
cause a lot of merging and thus performance issues?
2. Store the field in MySQL. However, searches would then require fetching
the entire result set, and with a couple of thousand items the results
returned are huge. Is there a way to disable everything but the _id?
Even if I only sort by timestamp, it appends a sort: [1353929425990] to each
hit, making this method even more unsuitable.
{
_index: foo
_type: bar
_id: 1
_score: 1
}
What is the best solution?
I heard that Reddit uses IndexTank which stores fast changing fields
in memory so it can be quickly changed without reindexing. Does
Elasticsearch have something like this?
I think that with Elasticsearch you'd have to update the value of the field
by deleting and reindexing the document. For convenience, you can use the
Update API, which does the same thing underneath.
I'm not sure how you can solve this problem by using an external store. As
far as I know, if you want to sort on a field with ES, in practice there's
no getting away from at least storing the field in ES.
As for updating ES documents, performance depends in the end on both your
hardware and what your documents look like. But there are some things to
consider:
frequent deletes and inserts will cause a lot of merging, like you said,
which will stress IO
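To make the Update API suggestion concrete, here is a minimal sketch that only builds the request path and body for a scripted increment; it does not talk to a cluster. The field name `votes`, the helper name `build_vote_update`, and the `count` parameter are assumptions for illustration, and the top-level `script`/`params` shape is the one used by older releases (newer releases nest the script in an object and drop the type from the path).

```python
import json


def build_vote_update(index, doc_type, doc_id, increment=1):
    """Return (path, body) for a scripted update that bumps a vote counter.

    Hypothetical helper: the field name "votes" is an assumption,
    and the request shape follows the older Update API.
    """
    path = "/{}/{}/{}/_update".format(index, doc_type, doc_id)
    body = {
        # The script runs against the stored source; ES still reindexes
        # the whole document underneath.
        "script": "ctx._source.votes += count",
        "params": {"count": increment},
    }
    return path, json.dumps(body)


path, body = build_vote_update("foo", "bar", 1)
print("POST", path)
print(body)
```

This saves a round trip compared with fetching, modifying and reindexing the document yourself, but the cost in terms of segment merges is the same.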
The problem can be solved by using an external store because databases
like MySQL don't need to reindex when you update a field. However,
when you complete a search query, the entire document list matching a
query must be passed to MySQL so the set can be properly sorted.
Unfortunately, Elasticsearch seems to add a lot of unnecessary
metadata, which may make this method unbearably slow.
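A rough sketch of the sorting step in that approach, assuming the matching ids have already been collected from the search hits and the vote counts have been read from MySQL into a dict. Both `hit_ids` and `votes_by_id` are hypothetical names, and no database or search call is shown.

```python
def sort_ids_by_votes(hit_ids, votes_by_id):
    """Order document ids by externally stored vote counts, highest first.

    Ids missing from the external store are treated as having 0 votes.
    """
    return sorted(
        hit_ids,
        key=lambda doc_id: votes_by_id.get(doc_id, 0),
        reverse=True,
    )


hit_ids = ["1", "2", "3"]               # ids taken from the ES hits
votes_by_id = {"1": 5, "2": 12}         # counts fetched from the external store
print(sort_ids_by_votes(hit_ids, votes_by_id))
```

The sort itself is cheap for a couple of thousand ids; the expensive part is shipping every matching hit out of the search engine first.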
Is there any benefit to storing the document instead of manually
deleting and indexing a new copy? Would the increased size affect the
performance?
How many indexes per second would be possible per node? 10? 100? 1000?
Are there any tips for minimizing the impact of merges? I would like
to avoid long GC pauses.
Can you describe a little what the size of the challenge is? How many
numbers do you want to insert per second? How many sorts do you want to
perform per second? Are your numbers only discrete numbers (integers)? How
long is the sorted list you want to generate per query?
There are some options to consider:
simple field caching of numbers. ES can hold field data completely in
RAM; under certain circumstances, you can then sort very fast. Be prepared
to have lots of RAM.
manipulate the score values, to create a custom scoring algorithm for
obtaining your sorting order
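For the first option, a plain sort on the numeric field inside ES might look like the sketch below. It only builds the search request body. The field names `votes` and `category` and the helper name are assumptions for illustration, not taken from any real mapping.

```python
import json


def build_vote_sorted_search(filter_field, filter_value, size=10):
    """Build a search body that filters on one field and sorts by votes.

    Hypothetical helper: "votes" is an assumed numeric field name.
    """
    body = {
        "query": {"term": {filter_field: filter_value}},
        # Sorting on a numeric field loads its values into memory,
        # which is where the RAM requirement comes from.
        "sort": [{"votes": {"order": "desc"}}],
        "size": size,
    }
    return json.dumps(body)


print(build_vote_sorted_search("category", "news", size=50))
```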
Is 50 updates a second reasonable? I would like to have as many
updates as possible, but it is impossible to tell how much traffic I
will eventually get.
I just want to sort on 1 field, number of votes, after the query has
filtered some documents. I would estimate the number of documents
returned to be anywhere from 1 to 2000.
The fields are all integers and longs with 5-10 nested objects. The
document might look like this:
A third option is to wait for ES 0.21: Lucene 4 comes with support for
column stride fields. The new DocValue fields are suitable for frequently
changing values like click feedback or user ratings. See this great
presentation by Simon Willnauer:
http://de.slideshare.net/lucenerevolution/willnauer-simon-doc-values-...
> The problem can be solved by using an external store because databases
> like MySQL don't need to reindex when you update a field. However,
> when you complete a search query, the entire document list matching a
> query must be passed to MySQL so the set can be properly sorted.
> Unfortunately, Elasticsearch seems to add a lot of unnecessary
> metadata, which may make this method unbearably slow.
Oh, I see. I thought you wanted to sort in ES while keeping data in an
external data store.
> Is there any benefit to storing the document instead of manually
> deleting and indexing a new copy? Would the increased size affect the
> performance?
I don't think I understand your question, could you rephrase? Do you mean
to insert a new doc but not delete the old one, or?
> How many indexes per second would be possible per node? 10? 100? 1000?
That depends on quite a lot of factors, like:
node hardware
size of docs
mapping (whether you index everything or just certain fields, analyzers,
etc)
number of shards
whether you use bulk or not, and what the bulk size is
I'd suggest you start with a subset of your data and do a little
performance run on a test machine to get an idea. If you really need some
numbers to start with, here's a simple test I did on my laptop:
ES settings were default, except I was using just one shard for my index.
The laptop itself is pretty average, and you can see docs are small. But
the important result here is the big difference between inserting one doc
at a time (500 indexes/sec) and inserting 1000 at a time (6500 indexes/sec).
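To illustrate the bulk side of that comparison, here is a minimal sketch that assembles a bulk request body: one action line followed by one source line per document, each terminated by a newline. The helper name and the sample documents are assumptions, and the `_type` entry in the action line applies to older releases only. The sketch does not send anything and is not the test referred to above.

```python
import json


def build_bulk_body(index, doc_type, docs):
    """Build a newline-delimited bulk body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        # Action line: tells ES to index the document that follows.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


docs = [(1, {"votes": 3}), (2, {"votes": 7})]
print(build_bulk_body("foo", "bar", docs))
```

Batching many vote updates into one such request amortizes the per-request overhead, which is likely where most of the difference between the two numbers comes from.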
> Are there any tips for minimizing the impact of merges? I would like
> to avoid long GC pauses.
AFAIK merges don't have anything to do with GC. You can use a monitoring tool
like Bigdesk[0] or our own SPM for Elasticsearch[1] to monitor how GC
behaves and whether you would want to tune something there.
Regarding merges, you might want to look at store level throttling[2], to
make sure your IO won't be suffocated by merges. Also, I'd suggest looking
at merge policies[3] to see whether the default policy fits your needs or
you need to tweak something there.