Update API timeout

I've written a Logstash pipeline that reads files from a directory in read mode. The logs I am processing come from a POS system, and the directory structure is such that the store name and the till number are part of the path. I parse the path with a grok filter and tag each log line with the store name and till number.

Directory structure : StoreName/TillNumber/*

A small human error occurred while creating the directory structure.

Actual Store Name and Till Number
Directory Structure : 91500/005/*

Current
Directory Structure : S91500/T005/*

The storeName currently indexed with each log is S91500; it should be 91500.
The tillNumber currently indexed with each log is T005; it should be 005.

The data has already been indexed and is almost 4.5 GB. Every log carries two wrong fields, store name and till number, so I need to update those fields in every document in the index.

I have written a query that uses the Update By Query API:

POST marstons-logs*/_update_by_query?conflicts=proceed
{
  "script": {
    "source": "ctx._source.storeName = '91500'",
    "lang": "painless"
  },
  "query": {
    "match": {
      "storeName": "S91500"
    }
  }
}

But running it on my test data, which is barely 50 MB, gives me a timeout. My actual data is 4.5 GB, and I am worried that if the query times out there, the data will end up inconsistent.
Is there a better way to update the fields?
Reindexing might be an alternative, but I cannot consider it at the moment.

Can you clarify what you mean by "gives me a timeout"? It might be that the update by query is actually running fine and the HTTP client you are using is simply timing out, which does not stop the update.

You can start the update in the background and use the Tasks API to get information about the execution of the API, see https://www.elastic.co/guide/en/elasticsearch/reference/6.4/docs-update-by-query.html#_url_parameters_2
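As a rough sketch of what that could look like with the request from the question: adding `wait_for_completion=false` makes Elasticsearch return a task ID immediately instead of holding the HTTP connection open. The `<task_id>` below is a placeholder for whatever ID the first call returns, not a real value.

```
# Start the update in the background; the response has the form {"task": "<node_id>:<number>"}
POST marstons-logs*/_update_by_query?conflicts=proceed&wait_for_completion=false
{
  "script": {
    "source": "ctx._source.storeName = '91500'",
    "lang": "painless"
  },
  "query": {
    "match": {
      "storeName": "S91500"
    }
  }
}

# Check progress; replace <task_id> with the value returned above
GET _tasks/<task_id>
```

The task status reports counters such as `total`, `updated` and `version_conflicts`, so progress can be followed without depending on the original HTTP call staying open.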

@spinscale

I'm getting the following exception:

{
  "statusCode": 504,
  "error": "Gateway Time-out",
  "message": "Client request timeout"
}

As written above, this is no indication that the update by query has not run. Have you checked your data?
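One way to check, sketched here with the index pattern and field value taken from the question, is to count how many documents still carry the old value. This assumes `storeName` is searchable with a `match` query in the same way as in the original request.

```
# Count documents that still have the wrong store name
GET marstons-logs*/_count
{
  "query": {
    "match": {
      "storeName": "S91500"
    }
  }
}
```

If the count keeps dropping between calls, the update is still running; once it reaches 0, no matching documents are left to fix.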

I'm not sure you follow. I need to know why there is a gateway timeout error. I have checked my test data and it is fine, but before executing the query on 5 GB of data I must be certain the timeout issue won't occur, otherwise the data might become inconsistent.

Because the HTTP client timeout of Kibana is lower than the total execution time of the update by query statement. That's why I wrote that you should not rely on the HTTP call returning, but should start the update by query in the background right away, as linked above. Again, the update by query call will continue to run, even if you receive this message.
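If a request has already been sent and answered with the 504, the Tasks API can still show whether it is running. A minimal sketch, where `<task_id>` is again a placeholder for an ID taken from the listing:

```
# List all running update-by-query (and other *byquery) tasks with their status
GET _tasks?detailed=true&actions=*byquery

# Optional: cancel a running task using an ID from the listing above
POST _tasks/<task_id>/_cancel
```

Cancelling stops further batches but does not roll back documents that were already updated, so rerunning the same update by query afterwards picks up the remaining ones.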

Hope this helps!
