I have a single shard with A (parent) and B (child) documents.
At first, "has_child" queries were great.
Now that my database has several million records and about 20GB of data, the queries take a lot of time.
Regular queries work fine.
I read that there needs to be an initial loading of data into memory for has_child/has_parent queries to work, so the first query should take longer than the following ones.
However, the first query takes 30 minutes or more, and sometimes fails miserably.
What can I do to help this?
I tried increasing ES_HEAP_SIZE, ES_MIN_MEM and ES_MAX_MEM, all to the same value. Is that wise?
Should I only set ES_HEAP_SIZE and leave the others unspecified? (There seems to be some confusion in the documentation as to the difference between the three.)
If my machine has 2GB of memory, what should I set this value to? 1500m?
Will adding shards help? (This means reindexing, right? I can't just define more shards.)
If I have some known queries that I keep repeating, is there a way to have them indexed, or to run some kind of map-reduce periodically?
Thanks!
The has_parent / has_child queries rely on an in-memory id cache to run performantly. You need to have enough memory available to accommodate this id cache. You can see in the node stats API how much heap space the id_cache is taking up.
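For example, assuming a node reachable on localhost (the id_cache entry appears under the indices section of the response):

curl -XGET 'http://localhost:9200/_cluster/nodes/stats?indices=true&pretty=true'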
Can you tell us a bit more about your ES setup (how many nodes, and how many indices and primary/replica shards per index)?
2GB per machine isn't that much, and I think in your case it is the cause of your problem with the has_parent and has_child queries. By default an index has 5 primary shards, which means you can simply add more machines, and that will spread the memory usage across more machines.
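To verify how the shards are distributed across nodes, the cluster health API can report at shard level, e.g.:

curl -XGET 'http://localhost:9200/_cluster/health?level=shards&pretty=true'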
The id cache is loaded when the first has_child / has_parent query is executed and then reused for subsequent search requests. You can use an index warmer with a has_parent / has_child query to preload the id cache before actual search requests are executed:
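A minimal sketch of registering such a warmer (the index name my_index and the warmer name warmer_1 are placeholders; the child type B is taken from your setup):

curl -XPUT 'http://localhost:9200/my_index/_warmer/warmer_1' -d '{
  "query": {
    "has_child": {
      "type": "B",
      "query": { "match_all": {} }
    }
  }
}'

The warmer runs whenever a new searcher is opened, so the id cache is already loaded before real search requests hit the shard.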
I have one node with one shard and no replication (I was planning on expanding it as I understand ES better...).
I'm using an Amazon m1.medium or m1.large machine for this.
Looking at the id_cache size, it's zero, both before and during the query I try to run:
id_cache_size: 0b
(BTW, what is the id_cache size, and how does it differ from the heap size? Or is it the same?)
So, what you suggest is adding more machines?
What I mainly feel is missing is my understanding of how to monitor the cluster and understand what the problem is.
I've attached a printout of querying ...:9200/_cluster/nodes/stats?process=true&os=true&fs=true&network=true