I'm currently working on implementing Elasticsearch on a Django-based REST
API. I hope to be able to search through roughly 5 million documents, but
I've struggled to find an answer to a question I've had from the beginning: how many fields is too many for a single indexed object?
My setup has 512MB of storage and 4GB of memory, 1 shard, and 2 nodes.
I want to be able to sort/filter on about 30 different fields for that
single model, but only search on 5-6. Is 30 fields too many?
I have a dev environment set up with roughly 30,000 documents and the same
number of fields, and updates and queries are taking significantly longer
than I had hoped. Updating a single document is taking 4-5 seconds, and
searching for a 5-character string is taking 3-4 seconds.
Is there hope that this is a configuration problem, or should I reconsider
how many fields I'm using? Thanks in advance.
> I'm currently working on implementing Elasticsearch on a Django-based REST
> API. [...] I want to be able to sort/filter on about 30 different fields
> for that single model, but only search on 5-6. Is 30 fields too many?
We run with about 20 fields and have no trouble. We have lots more data and
lots more machine than you do, but I don't see why it wouldn't scale down.
> I have a dev environment set up with roughly 30,000 documents and the
> same number of fields, and updates and queries are taking significantly
> longer than I had hoped. Updating a single document is taking 4-5
> seconds, and searching for a 5-character string is taking 3-4 seconds.
Something is up, yeah. It's hard to figure out what might be up from
reading this, though. Some questions that are normal to ask here (a sketch
of pulling most of these numbers over HTTP follows the list):

- Can you post an example document (as a gist or pastebin or whatever)?
- Can you post an example query?
- How much heap are you giving Elasticsearch?
- How much disk are the 30,000 documents taking up (/var/lib/elasticsearch)?
- What version are you using?
- Do you see IO during the query (iostat -dmx 3 10)?
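
Not from the thread itself: a minimal sketch of pulling most of those
numbers from the HTTP API, assuming a node reachable on localhost:9200 and
the Python requests library.

    # Sketch only: assumes a node on localhost:9200 and the requests library.
    import requests

    BASE = "http://localhost:9200"

    # Index sizes and doc counts -- answers the "how much disk" question.
    print(requests.get(BASE + "/_cat/indices?v").text)

    # Configured vs. used JVM heap on each node.
    for node in requests.get(BASE + "/_nodes/stats/jvm").json()["nodes"].values():
        mem = node["jvm"]["mem"]
        print(node["name"], mem["heap_used_in_bytes"], mem["heap_max_in_bytes"])

    # Server version.
    print(requests.get(BASE).json()["version"]["number"])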
The time an update takes depends on the specifics of the update operation,
the (potentially massive) scripting overhead, the refresh operation, and
the related segment-merge activity.
The number of fields does not matter.
My application has 5000 fields. I avoid updates at all costs; indexing a
new document is faster.
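
To make the update-versus-new-document distinction concrete, here is a
hedged sketch against the 1.x HTTP API; the index, type, and field names
are invented for illustration:

    # Illustration only -- "myindex", "mytype", and the fields are made up.
    import requests

    DOC = "http://localhost:9200/myindex/mytype/42"

    # Partial update: Elasticsearch fetches the existing source, applies the
    # change, tombstones the old version, and re-indexes the whole document.
    requests.post(DOC + "/_update", json={"doc": {"status": "active"}})

    # Indexing a brand-new document skips the fetch step entirely, which is
    # why indexing fresh documents beats updating in place.
    requests.put("http://localhost:9200/myindex/mytype/43",
                 json={"status": "active", "created_at": "2015-04-16"})

Either way, the old version is only reclaimed at merge time, which is the
segment-merge cost mentioned above.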
> My application has 5000 fields. I avoid updates at all costs; indexing a
> new document is faster.
We can't do that in our application, and so have to eat the load for
updates. As far as I can see, the biggest cost is in segment merges and the
overhead of tombstoned entries before they are merged out. Still, with
30,000 documents none of that is likely to be a big deal.
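
An aside, not from the original reply: those tombstones are visible through
the _cat API as docs.deleted, for example:

    import requests

    # docs.deleted counts update/delete tombstones awaiting a segment merge.
    print(requests.get(
        "http://localhost:9200/_cat/indices?v&h=index,docs.count,docs.deleted"
    ).text)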
- I interact with my search engine through django-haystack
  (http://django-haystack.readthedocs.org/en/latest/). A query may look like
  `SearchQuerySet().filter(document='mitch').order_by('created_at')[:100]`
  -- so essentially getting the first 100 documents that have `mitch` in
  them, ordered by the field `created_at`.
- Each node has 247.5MB of heap allocated, judging by my hosting service's
  dashboard.
- The documents/fields take up roughly 30MB on disk.
- Using Elasticsearch version 1.4.2, but could very easily upgrade.
- I'm hosting with found.no, so I don't have access to a command line to
  run that, unfortunately.
- I haven't found any options in found.no to disable swapping, so I would
  assume they have it off by default? I could be wrong though.

Thanks for your reply.
> I interact with my search engine through django-haystack
> (http://django-haystack.readthedocs.org/en/latest/). A query may look like
> `SearchQuerySet().filter(document='mitch').order_by('created_at')[:100]`.

This is pretty useless to me. You'd need to show me a fully expanded
version in JSON. It's always best when talking about things like this to do
it with curl posting JSON, because curl is our common language.
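
For illustration only -- this is a guess at the expansion, not actual
haystack output, and the index name is an assumption -- the SearchQuerySet
call quoted above corresponds to roughly this:

    # Rough hand-expansion of SearchQuerySet().filter(document='mitch')
    #   .order_by('created_at')[:100]; "haystack" as the index name is a guess.
    import requests

    query = {
        "query": {"match": {"document": "mitch"}},
        "sort": [{"created_at": {"order": "asc"}}],
        "size": 100,
    }
    print(requests.post("http://localhost:9200/haystack/_search",
                        json=query).text)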
> Each node has 247.5MB of heap allocated, judging by my hosting service's
> dashboard.
Sorry, what is the value of the -Xmx parameter you used to run
Elasticsearch? The amount of heap actually in use at any given moment isn't
usually useful.
> The documents/fields take up roughly 30MB on disk.
Searches over that much data should be instant.
> Using Elasticsearch version 1.4.2, but could very easily upgrade.
It's cool.
> I'm hosting with found.no, so I don't have access to a command line to
> run that, unfortunately.
> I haven't found any options in found.no to disable swapping, so I would
> assume they have it off by default? I could be wrong though.
I think you should take this up with found.no. Maybe try running
Elasticsearch locally and comparing. In general, 30MB of index should be
instantly searchable unless Elasticsearch is swapped out, or the Linux disk
cache is cold and the IO subsystem is amazingly garbage. Or something else
weird like that.
The last question that I forgot to ask was what your mapping is.
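
To close the loop on the original question, a hedged sketch (field names
invented, not the real mapping) of a 1.x mapping that fits the stated goal:
a few analyzed fields for full-text search, with the sort/filter-only
fields kept not_analyzed:

    # Hypothetical mapping -- "myindex", "mytype", and fields are invented.
    import requests

    mapping = {
        "mytype": {
            "properties": {
                # Full-text searchable (analyzed is the 1.x string default).
                "document":   {"type": "string"},
                # Sort/filter only: dates and not_analyzed strings are cheap.
                "created_at": {"type": "date"},
                "status":     {"type": "string", "index": "not_analyzed"},
            }
        }
    }
    requests.put("http://localhost:9200/myindex/_mapping/mytype",
                 json=mapping)

In 1.x, not_analyzed fields skip tokenization, so filters and sorts operate
on the exact stored value.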