Check for parsing errors before indexing

I use a Python script to send AWS CloudTrail logs to Elasticsearch. It works most of the time, but occasionally I get a parsing error like this:

('1 document(s) failed to index.', [{'index': {'_index': 'cloudtrail-2019.03.27', '_type': 'record',
'_id': '169deb77-d3f0-4964-8f98-79e64a6923c8', 'status': 400,
'error': {'type': 'mapper_parsing_exception', 'reason': 'failed to parse [apiVersion]',
'caused_by': {'type': 'illegal_argument_exception', 'reason': 'Invalid format: "2018_11_05" is malformed at "_11_05"'}}

Just one of these is enough to break the whole index and make all my other indices inaccessible.
How can I prevent this from happening? Is it possible to check for parsing errors and skip the offending records before indexing? Or perhaps change the date format of that particular field?
Here's a snippet of my Python code:

import gzip
import json
import datetime
import traceback
from io import BytesIO

# the boto3 S3 client (s3), Elasticsearch client (es) and logger are assumed
# to be set up at module scope

def lambda_handler(event, context):
    logger.info('Event: ' + json.dumps(event, indent=2))
    s3Bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    try:
        # fetch and decompress the CloudTrail log file that triggered the event
        response = s3.get_object(Bucket=s3Bucket, Key=key)
        content = gzip.GzipFile(fileobj=BytesIO(response['Body'].read())).read()
        # index each CloudTrail record into a daily index
        for record in json.loads(content)['Records']:
            recordJson = json.dumps(record)
            indexName = 'cloudtrail-' + datetime.datetime.now().strftime("%Y.%m.%d")
            res = es.index(index=indexName, doc_type='record', id=record['eventID'], body=recordJson)
            logger.info(res)
        return True
    except Exception as e:
        logger.error('Something went wrong: ' + str(e))
        traceback.print_exc()
        return False
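
One idea I had is to catch indexing errors per record instead of letting one bad document abort the whole batch. This is just a rough sketch of what I mean, reusing the same es, logger and content as in the snippet above, and assuming the Python client raises elasticsearch.exceptions.RequestError for a 400 mapping error:

from elasticsearch.exceptions import RequestError

failed = []
for record in json.loads(content)['Records']:
    indexName = 'cloudtrail-' + datetime.datetime.now().strftime("%Y.%m.%d")
    try:
        res = es.index(index=indexName, doc_type='record',
                       id=record['eventID'], body=json.dumps(record))
        logger.info(res)
    except RequestError as e:
        # a mapper_parsing_exception on one record no longer aborts the loop
        failed.append(record['eventID'])
        logger.error('Skipping record %s: %s' % (record['eventID'], e))
if failed:
    logger.warning('%d record(s) skipped due to mapping errors' % len(failed))

Would that be a reasonable approach, or is there a better way to handle this on the cluster side?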

That seems pretty odd; can you elaborate?

I created an index pattern for cloudtrail-* and could search with it, but once I hit one of the errors above, some shards start to fail. I get a notification such as:

3 of 50 shards failed

When I try to reindex a corrupted index I get the same parsing error as above, and all cloudtrail indices become inaccessible.
I deleted all cloudtrail indices yesterday, so I can't reproduce this at the moment, but if I re-enable the Lambda function I will probably hit it again within a few days, as it seems to be somewhat random.

That should not happen; only the bad document should be rejected. Perhaps your logs can share more details, though.
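
In the meantime, the caused_by in your error suggests that dynamic date detection mapped apiVersion as a date in that day's index, so any later value that isn't a parseable date (like "2018_11_05") gets rejected. One way around that is to pin the field's type with an index template. Here's a rough sketch using the Python client, assuming a 6.x cluster where put_template and a single record mapping type are valid:

# the template only applies to indices created after it is installed
es.indices.put_template(name='cloudtrail', body={
    'index_patterns': ['cloudtrail-*'],
    'mappings': {
        'record': {
            'date_detection': False,            # stop guessing dates for string fields
            'properties': {
                'apiVersion': {'type': 'keyword'}
            }
        }
    }
})

You would need to delete or reindex the existing cloudtrail-* indices for the new mapping to take effect.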

Should that happen automatically, or do I need to add anything to my code to drop corrupt records?
As I said, I deleted all the indices yesterday, since I couldn't test my code changes while no indices were viewable.

I re-enabled the Lambda function; let's see how long it keeps working.
