I am setting up a data analytics project that runs an ELK pipeline
over a large volume of Apache log files. One of the requirements is to
continuously update the session ID of every web transaction in the
Apache logs. The updates are performed with the Java Client API v2.1.1.
I have run into the following problems during testing and debugging,
and I hope someone with similar experience can give me some clues.
Processor: Intel Core i7-4790 3.6GHz
Disk Space: 465GB on
ELK Spec: logstash-2.1.0, elasticsearch-2.1.1 running on a single-node machine and
Client API: SearchScroll, Batch API and partial update API
Amount of Data: 500 million records
Elasticsearch running slowly
It took about 10 days to load 500 million
log records into Elasticsearch 2.1.1. Does that figure seem normal?
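For reference, the figure above works out to an average of roughly 579 records per second. A quick sketch of the arithmetic, where the class and method names are only illustrative and the load is assumed to have run continuously for the full 10 days:

```java
public class IngestRate {
    // Average records per second for a load that took the given number of days.
    static double recordsPerSecond(long records, double days) {
        double seconds = days * 24 * 60 * 60;
        return records / seconds;
    }

    public static void main(String[] args) {
        // 500 million records over 10 days, as reported above.
        double rate = recordsPerSecond(500_000_000L, 10);
        System.out.printf("%.0f records/s%n", rate); // prints "579 records/s"
    }
}
```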
Failure of Batch API
The data updates are submitted through the batch API. However, no matter
what values I use for the bulk size and the timeout/retry interval,
the last batch in a series of update transactions always fails
with a NoNodeAvailableException.
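To make the batching pattern concrete, here is a minimal, library-independent sketch. It is not the Elasticsearch client API: `UpdateBatcher` and the `sender` callback are hypothetical stand-ins for whatever call actually submits a batch. The sketch only illustrates that the final, partially filled batch has to be flushed explicitly, and it assumes (without confirmation) that this last flush, relative to when the client is closed, is the step worth checking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class UpdateBatcher<T> {
    private final int batchSize;
    // Hypothetical stand-in for the real call that submits one batch.
    private final Consumer<List<T>> sender;
    private final List<T> buffer = new ArrayList<>();

    public UpdateBatcher(int batchSize, Consumer<List<T>> sender) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sender = sender;
    }

    // Buffer one update and send a batch as soon as the buffer is full.
    public void add(T update) {
        buffer.add(update);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is buffered, including a final partial batch.
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sender.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        UpdateBatcher<String> batcher =
                new UpdateBatcher<>(3, batch -> sizes.add(batch.size()));
        for (int i = 0; i < 7; i++) {
            batcher.add("doc-" + i);
        }
        batcher.flush(); // the last, partial batch is sent here
        System.out.println(sizes); // prints [3, 3, 1]
    }
}
```

In a real client the equivalent of `flush()` would need to complete before the connection is closed; whether that ordering explains the exception above is an open question.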
Random Access Denied Errors
If the partial update is invoked directly,
without the batch API, a random error like the following appears. It may
occur once every few million update transactions.
08:59:06,303][DEBUG][action.admin.indices.stats] [Johnny Storm] [indices:monitor/stats] failed to execute
operation for shard [[akamai_access_log_2015.05], node[MD8IEMRHRlyCqT3MSuwjhw],
[P], v, s[STARTED], a[id=vN3wl3mwS-bq9UNJsqxNA]]
BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: ElasticsearchException[failed to refresh store stats];
The exceptions above look like a synchronization issue between
data transactions. However, without sufficient insight into the Elasticsearch
implementation, I feel I am in the dark and have no idea how to get rid
of them. I hope these descriptions give some idea of what the actual problems are. I am happy to provide further details if needed.