Hi,
Following is the structure of the documents I am indexing in ES:
{
"key1": "val1",
"key2": "val2",
...
"keyN": "valN",
"array1": [
,
],
"array2": [
{
"key1": "val1",
"key2": "val2",
"key3": "val3",
"key4": "val4"
}
]
}
Document size is usually about 800 bytes, but it can range from 600 to 1500
bytes depending on the number of entries we have in the arrays (anywhere
between 1 and 20). There is also a URL field whose value can be small or
large depending on user input.
The way I am using Elasticsearch is that I do a large number of writes,
1000 - 1300 per second, using the UPDATE API with UPSERT instead of indexing
directly. The reason is that I sometimes receive documents that have already
been indexed. So I generate the document ID (DOC_ID) by taking an md5 of
some unique values in my doc and always do an UPDATE with UPSERT, so that
the doc gets indexed if it does not exist, and the incoming copy gets
dropped if it does.
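To make the first type of request concrete, here is a minimal sketch in Python of how the DOC_ID and the request body could be built. It only constructs the path and JSON; it does not talk to a cluster. The field names (user_id, url), the "|" separator, and the no-op script are assumptions for illustration, not the exact request used in production.

```python
import hashlib
import json


def make_doc_id(doc, unique_keys):
    """Derive a deterministic ID from the values that make a doc unique."""
    # Assumption: "|" never occurs inside the values themselves.
    raw = "|".join(str(doc[k]) for k in unique_keys)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def build_upsert_request(index, doc_type, doc, unique_keys):
    """Return (path, json_body) for an UPDATE with UPSERT."""
    doc_id = make_doc_id(doc, unique_keys)
    path = "/{}/{}/{}/_update".format(index, doc_type, doc_id)
    body = {
        # If the doc already exists, the script turns the update into a no-op.
        "script": 'ctx.op = "none"',
        # If the doc does not exist, this is what gets indexed.
        "upsert": doc,
    }
    return path, json.dumps(body)


# Hypothetical document; only the shape matters here.
example = {"user_id": 42, "url": "http://example.com/a", "array1": [], "array2": []}
path, body = build_upsert_request("index", "doc_type", example, ["user_id", "url"])
```

Because the ID depends only on the chosen unique values, sending the same document twice always targets the same DOC_ID.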
Out of these 1000 - 1300 UPDATES per second, about 80% are of the type
described above.
In the remaining 20%, I update docs that have already been indexed (in the
way described above). This is done using an UPDATE script like so:
POST /index/doc_type/DOC_ID/_update
{
"script": "if (ctx._source.array1.contains(some_val)) { ctx.op = \"none\" } else { ctx._source.array1 += some_val; ctx._source.array2 += some_doc }",
"params": {
"some_val": ,
"some_doc": {
"key1": "val1",
"key2": "val2"
}
}
}
So in this second type of request, if a document DOC_ID has already been
indexed I update it; otherwise I do nothing. When the document is found,
the script in the UPDATE request above performs a further check before
deciding whether to modify it.
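The intended behaviour of that script can be modelled locally on a plain dict. This is only a sketch of the logic in Python, not code that runs inside ES; the return labels ("missing", "none", "index") are illustrative names for the three outcomes.

```python
def apply_update(source, some_val, some_doc):
    """Mimic the update script on a plain dict and report what happened."""
    if source is None:
        # Document not indexed yet: this request type writes nothing.
        return "missing"
    if some_val in source["array1"]:
        # Value already recorded: skip the write entirely.
        return "none"
    # Otherwise append to both arrays, which causes a reindex of the doc.
    source["array1"].append(some_val)
    source["array2"].append(some_doc)
    return "index"


doc = {"array1": ["a"], "array2": [{"key1": "val1"}]}
new_entry = {"key1": "val1", "key2": "val2"}
first = apply_update(doc, "b", new_entry)   # value is new, arrays grow
second = apply_update(doc, "b", new_entry)  # value now present, no-op
```

Note that every request which does not end in a no-op rewrites the whole document, since an update reindexes the full source.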
I am not doing a large number of queries right now since the application is
still being developed, but that could go up to 100 - 200 requests per
second. The queries are counts (i.e. we don't need the docs, just the
numbers) with a lot of term facets.
This is pretty much my scenario.
Now coming to the problem: the issue I am seeing is that the write
throughput was high when I started the cluster, but over time it slowed down
to about 50 writes (UPDATES in my case) per second.
There is just one index with 32 shards and 1 replica each. I have a 3 node
cluster with 32 GB of RAM, out of which I have committed 16 GB to JVM heap
(MLOCKALL set to true as suggested to prevent paging out). Each node has 32
cores. We are using normal SATA II hard disks and not SSDs. I have also
installed ElasticSearchHQ plugin for monitoring.
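As a rough back-of-the-envelope check on these numbers, here is a small Python calculation. It assumes shard copies are balanced evenly across the nodes and uses the typical 800-byte document size; both are simplifications.

```python
import math

primary_shards = 32
replicas = 1
nodes = 3

# Every primary has one replica, so the cluster holds twice as many copies.
shard_copies = primary_shards * (1 + replicas)            # 64
# With even balancing, the busiest node holds this many shard copies.
max_copies_per_node = math.ceil(shard_copies / nodes)     # 22

writes_per_sec = (1000, 1300)
typical_doc_bytes = 800
# Raw source bytes arriving per second, before replication and indexing overhead.
raw_bytes_per_sec = tuple(w * typical_doc_bytes for w in writes_per_sec)  # (800000, 1040000)
```

So each node carries around 21 to 22 shard copies, while the raw incoming data is only on the order of 1 MB per second.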
Following are the index settings:
{
"index": {
"settings": {
"index.refresh_interval": "5s",
"index.version.created": "900799",
"index.number_of_replicas": "1",
"index.uuid": "4dFGm17qSF6Qe8lc20rgbw",
"index.number_of_shards": "32"
}
}
}
Some observations:
- After some time, 96% of the JVM heap is always in use by ES.
- GC has been taking a lot of time (5 sec to 18 sec).
- After GC, the drop in memory usage is negligible (for example, 15.7 GB to 15.6 GB).
- Index refresh time is large (on the order of hundreds of milliseconds).
The above observations did not change when I stopped writing to and querying
ES for more than 12 hours (I stopped the writer at night and checked HQ
the next morning).
I am struggling to understand how ES is using the JVM such that heap
usage does not drop at all even when nothing is being done. Why is GC so
slow? The node configuration is pretty decent, and I think I have applied
at least all the commonly suggested settings (mlockall set to true,
appropriate heap size with enough memory left for the disk cache, ulimit set
to unlimited).
What else should I be looking at? Or is there more information I can provide
that would help you diagnose this?
Any help will be greatly appreciated!
Vaidik Kapoor
vaidikkapoor.info