We were stoked when we found out about the updating feature in the recent
0.19.0rc2 release. We have been eagerly experimenting with it but are
disappointed by its performance. Hopefully you can tell us we are doing
something wrong.
We roughly use this model: https://gist.github.com/1751349. Starting from a
clean index it takes 7 seconds to index 1000 documents (ok-ish). After
indexing 3 million documents performance degrades to 30 seconds per 1000
documents (prohibitively slow). We expect to insert 500 million documents
plus 4 million a day.
Our approach to inserting documents is as follows:
We first try to update a document; if that returns an error, we create it
instead.
The resulting documents can contain hundreds and possibly thousands of
'interactions', growing the document size to about 3 MB.
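For readers following along, the update-then-create flow described above might look roughly like this. It is only a sketch: `InMemoryIndex`, `DocumentMissing` and `upsert_interaction` are hypothetical names, and the in-memory class merely stands in for the real index so the control flow can be run without a cluster.

```python
class DocumentMissing(Exception):
    """Raised by the stand-in index when an update targets an unknown id."""


class InMemoryIndex:
    """Hypothetical stand-in for the index; this is not the Elasticsearch client."""

    def __init__(self):
        self.docs = {}

    def update(self, doc_id, interaction):
        # Mimics an update call that fails when the document does not exist.
        if doc_id not in self.docs:
            raise DocumentMissing(doc_id)
        self.docs[doc_id]["interactions"].append(interaction)

    def create(self, doc_id, interaction):
        self.docs[doc_id] = {"interactions": [interaction]}


def upsert_interaction(index, doc_id, interaction):
    """Try an update first; fall back to create when the document is missing."""
    try:
        index.update(doc_id, interaction)
        return "updated"
    except DocumentMissing:
        index.create(doc_id, interaction)
        return "created"


index = InMemoryIndex()
first = upsert_interaction(index, "doc-1", {"content": "tree"})
second = upsert_interaction(index, "doc-1", {"content": "house"})
```

Every successful update makes the stored document one interaction longer, which is why the per-update cost can be expected to grow with document size.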
My guess is that it simply gets slower because you are indexing bigger documents with more interactions. The update API still reindexes the whole document. You might turn things around and index each interaction as its own document.
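A minimal sketch of that idea, assuming the document keeps its interactions in a list under an `interactions` key (the key name, the `parent_id` field and the helper `split_into_interaction_docs` are illustrative assumptions, not taken from the Gist):

```python
def split_into_interaction_docs(parent_id, document):
    """Split one large document into a small parent plus one doc per interaction."""
    # The parent keeps every field except the (potentially huge) interaction list.
    parent = {key: value for key, value in document.items() if key != "interactions"}
    # Each interaction becomes its own small document that points back at the parent.
    children = [
        {"parent_id": parent_id, **interaction}
        for interaction in document.get("interactions", [])
    ]
    return parent, children


big_doc = {
    "url": "http://example.com/page",
    "interactions": [{"content": "tree"}, {"content": "house"}],
}
parent, children = split_into_interaction_docs("doc-1", big_doc)
```

With this layout a new interaction is a small, independent index operation, and the existing data never has to be reindexed.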
(In reply to haarts's message of Tuesday, February 14, 2012 at 12:16 PM.)
You could also shard or split the index, which will improve indexing
speed, or tune Lucene options for the indexing process only (e.g.
increase the merge factor).
Have you also thought about another model, e.g. feeding interactions
instead of documents? That way you avoid updating, but it would require
more search logic.
That is what I thought as well.
You pointed out in another reply that the parent/child functionality
might be what I was looking for. I've looked into it and have one remaining
question:
I want a query searching for 'tree AND house' to return the parent
that has one child containing 'tree' and another child containing 'house'.
Based on your Gist https://gist.github.com/758398, here is my Gist:
https://gist.github.com/1835953.
I considered another model as well, especially a parent/child model, so as
to prevent reindexing the entire document.
But I haven't been able to get a particular kind of search working with
this. Imagine a particular parent having two children. One child has the
content 'tree' and the other 'house'. I require the search 'tree AND house'
to return this parent. A concrete example can be found here:
https://gist.github.com/1835953.
Is that even possible?
Yes, it will work. Use the has_child filter / query to match on the children and get back the parents.
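As a rough sketch of what such a request body could look like for 'tree AND house': one has_child clause per term, combined in a bool query so that both must match, though possibly on different children. The child type `interaction`, the field `content` and the helper name `parents_with_all_terms` are assumptions made for illustration; adjust them to the actual mapping.

```python
import json


def parents_with_all_terms(child_type, field, terms):
    """Build a search body matching parents that have a child for every term."""
    return {
        "query": {
            "bool": {
                # One has_child clause per term: each term may be satisfied by a
                # different child, but every clause must match for the parent.
                "must": [
                    {
                        "has_child": {
                            "type": child_type,
                            "query": {"term": {field: term}},
                        }
                    }
                    for term in terms
                ]
            }
        }
    }


query = parents_with_all_terms("interaction", "content", ["tree", "house"])
print(json.dumps(query, indent=2))
```

Putting both terms into a single has_child clause would instead require one child to contain both words, which is not what the question asks for.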
On Wednesday, February 15, 2012 at 4:17 PM, haarts wrote:
Ah. I will dig into these options. Thanks!