Has_child / has_parent queries for a large DB

I have a single shard with A (parent) and B (child) documents.
At first, "has_child" queries were great.

Now that my database has several million records and about 20GB of data, the queries take a lot of time.
Regular queries work fine.

I read that there needs to be an initial loading of data to memory for has_child/has_parent queries to work, and so the first time should take more than the next ones.

However, the first query takes 30 minutes or more, and sometimes fails miserably.

What can I do to help this?

  1. I tried increasing ES_HEAP_SIZE, ES_MIN_MEM and ES_MAX_MEM, all to the same value. Is that wise?
    Should I set only ES_HEAP_SIZE and leave the others unspecified? (There seems to be some confusion in the documentation as to the difference between the three.)

    If my machine has 2G memory, what should I set this value to, 1500m?

  2. Will adding shards help? (This means reindexing, right? I can't just define more shards.)

  3. If I have some known queries that I keep repeating, is there a way to have them indexed or run some kind of map-reduce periodically?

If not, what would you recommend I should do?

The has_parent / has_child queries rely on an in-memory id cache to run
performantly. You need to have enough memory available to accommodate this
id cache. In the node stats API you can see how much heap space the
id_cache is taking up.
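
As a rough sketch of that check: the snippet below pulls the id cache size per node out of a node stats response. It assumes the response layout seen in the stats printout posted in this thread (nodes -> node id -> indices -> cache -> id_cache_size_in_bytes) and a node reachable at localhost:9200. The host and the helper names id_cache_bytes and fetch_stats are illustrative, not part of ES.

```python
import json
from urllib.request import urlopen

# Assumed endpoint: the node stats path used in this thread, on an assumed host.
STATS_URL = "http://localhost:9200/_cluster/nodes/stats"


def id_cache_bytes(stats):
    """Return {node name: id cache size in bytes} from a parsed stats response."""
    sizes = {}
    for node_id, node in stats.get("nodes", {}).items():
        cache = node.get("indices", {}).get("cache", {})
        # Fall back to the node id when the node has no name, and to 0
        # when the cache section does not report an id cache size.
        sizes[node.get("name", node_id)] = cache.get("id_cache_size_in_bytes", 0)
    return sizes


def fetch_stats(url=STATS_URL):
    """Fetch and parse the node stats JSON (needs a running node)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling id_cache_bytes(fetch_stats()) before and after the first has_child query should show whether the cache was actually loaded and how much heap it occupies.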

Can you tell a bit more about your ES setup (how many nodes, how many
indices and primary/replica shards per index)?
2GB per machine isn't much, and I think in your case it is the cause of
your problem with the has_parent and has_child queries. By default an
index has 5 primary shards, which means you can simply add more machines
and the memory usage will be spread across them.

The id cache is loaded when the first has_child / has_parent query is
executed and then reused for subsequent search requests. You can use a
warmer with a has_parent / has_child query to preload the id cache, before
actual search requests are executed:
http://www.elasticsearch.org/guide/reference/api/admin-indices-warmers/
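
A minimal sketch of registering such a warmer, assuming the warmers API form PUT /{index}/_warmer/{name} with a search body, as described on the page above. The index name, child type, warmer name and host are placeholders, and warmer_request / register_warmer are hypothetical helper names.

```python
import json
from urllib.request import Request, urlopen


def warmer_request(index, child_type, name="id_cache_warmer",
                   host="http://localhost:9200"):
    """Build (url, body) for a warmer that runs a has_child query."""
    body = {
        "query": {
            "has_child": {
                "type": child_type,
                # match_all is enough here: running any has_child query
                # is what loads the id cache.
                "query": {"match_all": {}},
            }
        }
    }
    url = f"{host}/{index}/_warmer/{name}"
    return url, json.dumps(body).encode("utf-8")


def register_warmer(index, child_type):
    """PUT the warmer to the cluster (needs a running node)."""
    url, body = warmer_request(index, child_type)
    request = Request(url, data=body, method="PUT",
                      headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For the setup described here that would be something like register_warmer("myindex", "B"), with "myindex" standing in for the real index name.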

Martijn

On 7 May 2013 16:20, eranid eranid@gmail.com wrote:

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/has-child-has-parent-queries-for-a-large-DB-tp4034383.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
Kind regards,

Martijn van Groningen


Thanks!
I have one node with one shard and no replication (I was planning on expanding it as I understand ES better...)
I'm using an Amazon m1.medium or m1.large machine for this.

Looking at the id_cache size, it's zero, both before and during the query I try to run:
id_cache_size: 0b

(By the way, what is the id_cache size, and how does it differ from the heap size? Or is it the same?)

So, what you suggest is adding more machines?

What I mainly feel is missing is an understanding of how to monitor the cluster and work out what the problem is.

I attach a printout of querying with ...:9200/_cluster/nodes/stats?process=true&os=true&fs=true&network=true

{
  cluster_name: elasticsearch
  nodes: {
    rvPqcyaxRkKCqTHs3T2qBA: {
      timestamp: 1368014136488
      name: Rainbow
      transport_address: inet[ip-10-137-40-231.ec2.internal/10.137.40.231:9300]
      hostname: ip-10-137-40-231
      attributes: { aws_availability_zone: us-east-1c }
      indices: {
        store: {
          size: 26.4gb
          size_in_bytes: 28389539108
          throttle_time: 0s
          throttle_time_in_millis: 0
        }
        docs: {
          count: 16795436
          deleted: 260670
        }
        indexing: {
          index_total: 3478423
          index_time: 1.4h
          index_time_in_millis: 5346619
          index_current: 3
          delete_total: 210748
          delete_time: 4.9m
          delete_time_in_millis: 298083
          delete_current: 0
        }
        get: {
          total: 263490
          time: 1.6m
          time_in_millis: 96962
          exists_total: 263353
          exists_time: 1.6m
          exists_time_in_millis: 96944
          missing_total: 137
          missing_time: 18ms
          missing_time_in_millis: 18
          current: 0
        }
        search: {
          query_total: 2494237
          query_time: 8.9h
          query_time_in_millis: 32390632
          query_current: 4
          fetch_total: 2494236
          fetch_time: 14.5m
          fetch_time_in_millis: 872967
          fetch_current: 1
        }
        cache: {
          field_evictions: 0
          field_size: 189mb
          field_size_in_bytes: 198210516
          filter_count: 10
          filter_evictions: 0
          filter_size: 13.4mb
          filter_size_in_bytes: 14136296
          bloom_size: 28.8mb
          bloom_size_in_bytes: 30230528
          id_cache_size: 0b
          id_cache_size_in_bytes: 0
        }
        merges: {
          current: 2
          current_docs: 22961
          current_size: 19.4mb
          current_size_in_bytes: 20388239
          total: 11509
          total_time: 3h
          total_time_in_millis: 10914669
          total_docs: 40842823
          total_size: 44.6gb
          total_size_in_bytes: 47969339439
        }
        refresh: {
          total: 83971
          total_time: 46.7m
          total_time_in_millis: 2805647
        }
        flush: {
          total: 925
          total_time: 19.8m
          total_time_in_millis: 1190485
        }
      }
      os: {
        timestamp: 1368014136489
        uptime: 18 hours, 7 minutes and 52 seconds
        uptime_in_millis: 65272000
        load_average: [
          2.92
          3.38
          2.83
        ]
        cpu: {
          sys: 1
          user: 97
          idle: 0
        }
        mem: {
          free: 607.7mb
          free_in_bytes: 637239296
          used: 3gb
          used_in_bytes: 3295408128
          free_percent: 59
          used_percent: 40
          actual_free: 2.1gb
          actual_free_in_bytes: 2331492352
          actual_used: 1.4gb
          actual_used_in_bytes: 1601155072
        }
        swap: {
          used: 0b
          used_in_bytes: 0
          free: 0b
          free_in_bytes: 0
        }
      }
      process: {
        timestamp: 1368014136490
        open_file_descriptors: 1139
        cpu: {
          percent: 99
          sys: 1 hour, 35 minutes, 40 seconds and 410 milliseconds
          sys_in_millis: 5740410
          user: 14 hours, 29 minutes, 20 seconds and 440 milliseconds
          user_in_millis: 52160440
          total: 16 hours, 5 minutes and 850 milliseconds
          total_in_millis: 57900850
        }
        mem: {
          resident: 1.3gb
          resident_in_bytes: 1451016192
          share: 5.4mb
          share_in_bytes: 5746688
          total_virtual: 2.1gb
          total_virtual_in_bytes: 2266382336
        }
      }
      network: {
        tcp: {
          active_opens: 50
          passive_opens: 2971419
          curr_estab: 181
          in_segs: 20468767
          out_segs: 19617991
          retrans_segs: 6901
          estab_resets: 15
          attempt_fails: 0
          in_errs: 9
          out_rsts: 106
        }
      }
      fs: {
        timestamp: 1368014136506
        data: [
          {
            path: /var/lib/elasticsearch/elasticsearch/nodes/0
            mount: /
            dev: /dev/xvda1
            total: 98.4gb
            total_in_bytes: 105689415680
            free: 65.6gb
            free_in_bytes: 70445109248
            available: 60.6gb
            available_in_bytes: 65077608448
            disk_reads: 487455
            disk_writes: 1402229
            disk_read_size: 10.9gb
            disk_read_size_in_bytes: 11780604928
            disk_write_size: 45.3gb
            disk_write_size_in_bytes: 48698900480
          }
        ]
      }
    }
  }
}