Logstash Reindex Strategy

Created a Gist documenting my approach to reindexing using Logstash. Hope it helps.

Reindexing an Elasticsearch index can be a pain when you have limited resources and need the cluster to keep serving requests at the same time. Hence it is advisable to size up the document count and break the job down into chunks based on time. Look to Kibana: the breakdown is already done for you as you perform your search. Just open the request and the aggregation query is there. Using this, you can tally your document count per time bucket to verify your progress.
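
For example, a date histogram aggregation along these lines gives per-bucket document counts that you can later tally against the target. This is only a sketch: the index name, field and interval are assumptions, and newer Elasticsearch versions expect calendar_interval instead of interval.

    # Hypothetical source index; adjust the index name, field and interval to your data.
    curl -s 'http://localhost:9200/source-index/_search?pretty' \
      -H 'Content-Type: application/json' -d '
    {
      "size": 0,
      "aggs": {
        "docs_over_time": {
          "date_histogram": { "field": "@timestamp", "interval": "1d" }
        }
      }
    }'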

I need to do this because of resource constraints. The Logstash input plugin sometimes hits an error and restarts, and when it restarts the query gets executed again. With logstash-input-elasticsearch, it starts a new search; any previous scroll ID is discarded. This is something you do not want happening: you can end up with more documents in the target than in the source (i.e. corruption). Breaking the job down into chunks limits the corruption and makes remediation easier, and the script automates executing the Logstash configs one after another. Doing it manually would be costly in terms of time.

So the strategy is like this:
1) Create a Logstash config template with {START} and {END} tags, which we will replace with the actual time values using the sed command.
2) Create an input.data file with two values per line: the START and END epoch times.
3) The script will loop through the input, create the actual Logstash config file, and execute it (sketches of the template and driver script follow below).

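As a rough illustration, the template could look like the following. This is a sketch only: the hosts, index names and @timestamp field are assumptions, option names can differ slightly between plugin versions, and {START}/{END} are the placeholders that sed will replace with epoch millisecond values.

    # reindex-template.conf -- {START} and {END} get replaced by sed
    input {
      elasticsearch {
        hosts   => ["localhost:9200"]     # assumed source cluster
        index   => "source-index"         # hypothetical index name
        size    => 500                    # scroll batch size
        scroll  => "5m"
        docinfo => true                   # keep _id so the target gets the same IDs
        query   => '{ "query": { "range": { "@timestamp": { "gte": {START}, "lt": {END}, "format": "epoch_millis" } } } }'
      }
    }
    output {
      elasticsearch {
        hosts       => ["localhost:9200"] # assumed target cluster
        index       => "target-index"     # hypothetical index name
        document_id => "%{[@metadata][_id]}"
      }
    }
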
In my experience, with approximately 1GB of memory you should be processing roughly 30-50K documents per iteration.
Dependencies: Logstash (preferably on the PATH), Cygwin (for Windows), sed.
Assumption: everything happens in the current directory.
Lastly, use a diff tool to compare the source and target aggregation results to verify the process.
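
A minimal sketch of the driver script, assuming the template above is saved as reindex-template.conf and input.data holds one "START END" epoch pair per line (the file names are assumptions, and older Logstash versions may need "logstash agent -f" instead of "logstash -f"):

    #!/bin/bash
    # Loop over the time chunks and run Logstash once per chunk.
    # input.data: one "START END" pair per line, in epoch milliseconds.
    while read -r START END; do
      [ -z "$START" ] && continue                 # skip blank lines
      CONF="reindex-${START}-${END}.conf"
      # Substitute the placeholders in the template with the actual values.
      sed -e "s/{START}/${START}/g" -e "s/{END}/${END}/g" reindex-template.conf > "$CONF"
      echo "Reindexing chunk ${START} -> ${END} using ${CONF}"
      logstash -f "$CONF"                         # assumes logstash is on the PATH
    done < input.data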

Interesting idea!

If the plugin did not throw unrecoverable errors like "Error: Unable to establish loopback connection" and restart, I would not need to do this. It is also rather frustrating to have it report a network connection issue after successfully processing a couple of tens of thousands of documents. It is hard to determine whether it is a client-side or a server-side issue, i.e. whether the server rejected the connection or the client failed to connect.

I monitored the client side with JConsole while it ran and it seemed OK; heap usage was only around 300MB of the default 1GB allocated.

I would take a look at these if I were managing it or debugging the issue, since we need to figure out which side is the culprit first (a sketch of the node stats call follows the list):

  1. "open_file_descriptors" : ,
    "max_file_descriptors" :
  2. "heap_used_percent" : 31,
  3. CPU
  4. search" : "open_contexts"
  5. "thread_pool" ,
  6. "http" : "current_open" 58, ( I will be interested to know how this co-relates to the size of in the logstash configuration. )

Please give some example of the config file source code.