I think I have found the cause of the long indexing time.
I turned on debug logging on the ES servers and found that they keep
printing out the "mapping", which should be the cause of the slowdown.
Here is the mapping:
(be careful, it is very large)
I have 2 "dynamic" fields, attributeAssociatesMap and gridAssociatesMap,
which are Java maps. These 2 fields in fact grow depending on the input
data.
For attributeAssociatesMap, if the entries contain different values of
"attribute id", multiple keys are generated,
e.g. 97/name/1, 98/name/1, etc.
The same logic applies to gridAssociatesMap.
Each entry carries its own set of attribute ids and grid ids, so the
mapping keeps changing and grows continuously.
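To illustrate the growth (a sketch, not my actual Java code; field and
key names are taken from the examples above): every distinct
"attributeId/name/version" key ever seen becomes a brand-new field in
the index mapping, so the mapping size tracks the number of distinct
keys, not the number of documents.

```python
# Sketch: simulate which fields Elasticsearch dynamic mapping would
# create for documents carrying a map with id-based keys.

def mapping_fields(documents):
    """Collect the set of field names dynamic mapping would create."""
    fields = set()
    for doc in documents:
        for key in doc["attributeAssociatesMap"]:
            fields.add("attributeAssociatesMap." + key)
    return fields

docs = [
    {"attributeAssociatesMap": {"97/name/1": "foo"}},
    {"attributeAssociatesMap": {"98/name/1": "bar"}},
    {"attributeAssociatesMap": {"97/name/1": "baz", "99/name/2": "qux"}},
]

# Three documents already produce three distinct mapping fields;
# every new attribute id in the data adds another one.
print(sorted(mapping_fields(docs)))
# → ['attributeAssociatesMap.97/name/1',
#    'attributeAssociatesMap.98/name/1',
#    'attributeAssociatesMap.99/name/2']
```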
From observing the log, I guess the following happens during indexing:
- the client submits an index request to an ES node
- if the index request introduces new fields, the mapping of the
index changes
- the changed mapping "updates" the existing records in the
indexes, which is why my index requests take longer and longer (from
several minutes to over 30 minutes) as the index size grows
Do you think this map-based mapping is the cause of the problem?
The reason I use a map is that I can directly index the field values
without needing to aggregate them all into one field with a custom
separator (and so I do not need to escape the separator either).
If this map-based mapping is not appropriate, is there any other
suggestion?
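One alternative shape I am considering (a sketch under my own
assumptions; the field names attributeAssociates/attrId/etc. are
illustrative, not anything from the existing index): store the map as a
list of key/value objects under one fixed field, so the mapping has a
constant set of fields no matter how many attribute ids appear in the
data.

```python
# Sketch: convert the id-keyed map into a fixed-schema list of objects.

def to_fixed_schema(attribute_map):
    """Turn {"97/name/1": v, ...} into a list of key/value objects."""
    entries = []
    for key, value in sorted(attribute_map.items()):
        attr_id, name, version = key.split("/")
        entries.append({
            "attrId": attr_id,
            "attrName": name,
            "attrVersion": version,
            "value": value,
        })
    return {"attributeAssociates": entries}

doc = to_fixed_schema({"97/name/1": "foo", "98/name/1": "bar"})
# The mapping now only ever contains attributeAssociates.attrId,
# .attrName, .attrVersion and .value, regardless of the data.
```

With this shape, queries that must match attrId and value together on
the same entry would presumably need a nested-object mapping rather
than the default flattening, which is a trade-off to check.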
On Thu, Jun 14, 2012 at 10:59 AM, Yiu Wing TSANG email@example.com wrote:
I observe this "higher cpu" pattern across the 2 nodes (say the "master
node" is the one holding all the active shards, and the "slave node" is
the one holding all the replica shards):
When an index request comes in, the master node's CPU shows no real
change, but the slave node's CPU usage jumps from 0.x% to 10%.
Even after the index request has finished and returned to the client,
the slave node's CPU usage stays elevated for some time, maybe 1 to 2
minutes; it varies.
Somehow this jump in CPU usage after each index request cannot
settle before the next index request arrives, and the slave node's CPU
usage climbs to 1x%; by then I can observe that an index request (just a
bulk index with 100 items) needs several minutes to complete.
Things seem to be better for index requests if I shut down the slave
node, i.e. only the master node is doing indexing. But some time later,
the index requests still need a few minutes to finish even though each
request only indexes 100 items. Normally an index request of 100 items
completes well below 1 second.
My index is around 110.7gb / docs: 2725536. I use a bulk request for
these 100 items, and I turn off the refresh interval during bulk
indexing.
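For reference, the way I turn refresh off is via the index settings
update API (exact endpoint details may differ between versions, so
treat this as a sketch): before the bulk run I PUT this to
/<index>/_settings:

```json
{"index": {"refresh_interval": "-1"}}
```

and afterwards I restore it, e.g.:

```json
{"index": {"refresh_interval": "1s"}}
```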
So, is the indexing performance random, or what factors does it depend
on?
On Wed, Jun 13, 2012 at 4:29 PM, Yiu Wing TSANG firstname.lastname@example.org wrote:
I have set up a 2-node cluster, but 1 node is showing high CPU usage
and high GC counts while the other looks normal.
Let me describe the basic details of the setup:
- 2 separate and "identical" machines, 32G ram
- each machine hosts an elasticsearch instance with the same setup;
they communicate via multicast
- index setup is 5 shards + 1 replica (i.e. the default setup)
I have another Java process running on a separate machine, which keeps
submitting bulk index requests to this 2-node cluster (the index load
is not high, just around 1000~2000 requests every 5 minutes).
We don't have any search requests on the 2-node cluster, i.e. it is
only doing bulk index requests.
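For context, this is roughly the shape of payload our indexer sends to
the _bulk endpoint (a sketch, not our actual Java code; the index name
"myindex", type "item" and field names are made up): one action line
plus one source line per document, newline-delimited.

```python
import json

def build_bulk_body(docs, index="myindex", doc_type="item"):
    """Build an NDJSON _bulk body from (id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        # action line describing where the next source line goes
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        # source line with the document itself
        lines.append(json.dumps(source))
    # the bulk API requires a trailing newline after the last line
    return "\n".join(lines) + "\n"

body = build_bulk_body([(1, {"title": "a"}), (2, {"title": "b"})])
```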
According to "elasticsearch head", node 1 is hosting all 5 "active"
shards while node 2 is hosting all 5 "non-active" shards. The index
size is now about 4.3gb / docs: 233774.
According to "elasticsearch bigdesk", node 2 has much higher CPU usage,
much higher GC numbers, and "higher" load overall. I have attached
screenshots of the 2 nodes (1_.jpg refers to node 1, 2_.jpg refers to
node 2).
And this seems to make our bulk index requests "unstable": sometimes
they finish indexing 1000 records within 1 second, but sometimes they
take over 5 minutes, which seems random and independent of the number
of records in the bulk index request.
Another thing that seems strange to me is that the "doc counts" of the
2 nodes are not the same; is that normal?
Do you have any hints on troubleshooting the cause of the above?