Ingest Pipeline with Grok Processor cannot index all files


In Python, I created an ingest pipeline with a grok processor to look for names in the file.filename field, like so:

```python
body = {
    "description": "name_search_test pipeline",
    "processors": [
        {
            "grok": {
                "field": "file.filename",
                "patterns": ["%{NAME:first_name}"],
                "pattern_definitions": {"NAME": fn},
            }
        }
    ],
}
p.put_pipeline(id="name_search_test", body=body)
```

fn is a string of about 20,000 names separated by pipes, like so: "Karen|Mary|Jon|Susan"
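A pattern string like fn can be built from a list of names. A minimal sketch (the names below are placeholders for the real ~20,000-name list; `re.escape` is used so that any regex metacharacters inside a name stay literal):

```python
import re

# Hypothetical sample; the real list has about 20,000 names.
names = ["Karen", "Mary", "Jon", "Susan"]

# Escape each name so characters like "." or "'" are treated literally,
# then join with "|" to form one alternation for the NAME pattern definition.
fn = "|".join(re.escape(n) for n in names)
print(fn)  # Karen|Mary|Jon|Susan
```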

When I run fscrawler with the pipeline specified in my YAML settings, only 2 of about 200 files were indexed. The two files that were indexed each had a name found in the file.filename field.

Why can't fscrawler index all the files? I also tried setting ignore_failure to true in the grok processor, but still not all files were indexed.

My questions are:

  1. How can I make sure I can index all the files with this processor?
  2. If a name is not found in the file.filename field, how do I still get a new first_name field but with nothing as its value?

Thank you!

I was able to find an answer to my question #2. If I use the on_failure parameter, I can create a first_name field with the value "NO NAME FOUND" when no name is found.
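The approach above can be sketched as a pipeline body with a per-processor on_failure handler. This is a hedged reconstruction based on the thread: the field names and the "NO NAME FOUND" fallback come from the posts, while the short fn value here is a stand-in for the full name list.

```python
fn = "Karen|Mary|Jon|Susan"  # stand-in for the full ~20,000-name string

body = {
    "description": "name_search_test pipeline",
    "processors": [
        {
            "grok": {
                "field": "file.filename",
                "patterns": ["%{NAME:first_name}"],
                "pattern_definitions": {"NAME": fn},
                # If the grok pattern does not match, fall back to a set
                # processor so the document still gets a first_name field.
                "on_failure": [
                    {"set": {"field": "first_name", "value": "NO NAME FOUND"}}
                ],
            }
        }
    ],
}
# As in the original post, register it with the ingest client:
# p.put_pipeline(id="name_search_test", body=body)
```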

However, I am still not able to index all of my files. So, question #1 is still open.

Could you share a reproduction script which uses the _simulate ingest API with a document that did not work with the pipeline?

To get that document, you can either start fscrawler with the --trace option, or remove the pipeline configuration and index the document that did not work. Then retrieve it with Kibana and use it as a doc input to the simulate API.
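As an illustration, such a simulate request could be built roughly like this. This is a sketch: the `file.filename` value in `_source` is a made-up placeholder, and in practice you would paste in the real failing document retrieved from Kibana.

```python
import json

# Build a _simulate request body: the pipeline definition plus one test document.
simulate_body = {
    "pipeline": {
        "description": "name_search_test pipeline",
        "processors": [
            {
                "grok": {
                    "field": "file.filename",
                    "patterns": ["%{NAME:first_name}"],
                    "pattern_definitions": {"NAME": "Karen|Mary|Jon|Susan"},
                }
            }
        ],
    },
    "docs": [
        # Placeholder document; replace _source with the real failing doc.
        {"_source": {"file": {"filename": "report_Karen_2020.pdf"}}}
    ],
}

# This body would be POSTed to _ingest/pipeline/_simulate, e.g. with the
# Python client: es.ingest.simulate(body=simulate_body)
print(json.dumps(simulate_body, indent=2))
```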

Hi David,

Thanks for your response. I cannot share the simulation results, but I chose a document that was not indexed with the original pipeline, ran it through the simulate API, and it worked: a first_name was found in the file.filename field.

I also realized today that when I run fscrawler on PDF or Excel files, not all of them are indexed, but when the files are all converted to txt, everything is indexed. Do you know why not all of my files are being indexed?



Two options:

  • They are not indexed because of an issue when executing the pipeline. In that case, you need to debug why it does not work, by doing what I mentioned previously.
  • They are not read at all by FSCrawler, because FSCrawler thinks they have not been modified since the last run. In that case, you can touch the document to change its date, or start FSCrawler with the --restart option.
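For the second option, bumping a file's modification date can also be done from Python. A small sketch (it demonstrates on a temporary file; in practice you would point it at the file FSCrawler skipped):

```python
import tempfile
import time
from pathlib import Path

# Demo file in the temp directory; replace with the path FSCrawler skipped.
doc = Path(tempfile.gettempdir()) / "fscrawler_touch_demo.txt"
doc.write_text("sample")

# Update the file's access/modification times to "now" (like the Unix
# touch command) so FSCrawler treats it as changed on the next run.
doc.touch()
```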

Hi David,

Thanks for your response!

I think the issue was that I was running out of space on my disk. I freed up some space and the pipeline was able to index everything.

Thanks for your help anyway.


