Ingest Pipeline with Grok Processor cannot index all files

Hello,

In Python, I created an ingest pipeline with a grok processor to look for names in the file.filename field, like so:
body = { "description" : "name_search_test pipeline", "processors" : [ { "grok" : { "field" : "file.filename", "patterns" : ["%{NAME:first_name}" ], "pattern_definitions" : { "NAME" : fn } } } ] } p.put_pipeline(id="name_search_test",body=body)

fn is a string of about 20,000 names separated by pipes, like so: "Karen|Mary|Jon|Susan"
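For context, fn could be built by joining the names into one big grok alternation. A minimal sketch, assuming the names sit in a plain-text file with one name per line (names.txt is a hypothetical file name):

    # Join ~20,000 names into a single grok alternation pattern
    with open("names.txt") as f:
        names = [line.strip() for line in f if line.strip()]
    fn = "|".join(names)  # e.g. "Karen|Mary|Jon|Susan"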

When I run fscrawler with the pipeline specified in my YAML settings, only 2 out of about 200 files were indexed. The two files that were indexed each had a name found in the file.filename field.

I was wondering why fscrawler could not index all the files. I also tried setting "ignore_failure": true on the grok processor, but again, not all files could be indexed.

My questions are:

  1. How can I make sure I can index all the files with this processor?
  2. If a name is not found in the file.filename field, how do I still get a new first_name field, just with an empty value?

Thank you!

I was able to find an answer to my question #2: using the on_failure parameter, I can create a first_name field with the value "NO NAME FOUND" when no name is found.
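For reference, the processor now looks roughly like this; the on_failure handler runs a set processor to supply the fallback value (a sketch, not my exact pipeline):

    body = {
        "description": "name_search_test pipeline",
        "processors": [
            {
                "grok": {
                    "field": "file.filename",
                    "patterns": ["%{NAME:first_name}"],
                    "pattern_definitions": {"NAME": fn},
                    # If grok finds no name, fall back to a set processor
                    "on_failure": [
                        {"set": {"field": "first_name", "value": "NO NAME FOUND"}}
                    ],
                }
            }
        ],
    }
    p.put_pipeline(id="name_search_test", body=body)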

However, I am still not able to index all of my files. So, question #1 is still open.

Could you share a reproduction script which uses the _simulate ingest API with a document that did not work with the pipeline?

To get that document, you can either start fscrawler with the --trace option or remove the pipeline configuration and index the document which did not work. Then get it back with Kibana and put it as a doc input in the simulate API.
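In Python, that looks something like this, assuming the same client object p from your first post (the file.filename value below is a placeholder; substitute the real document):

    # Run one failing document through the pipeline without indexing it.
    # verbose=True reports the outcome of each processor separately.
    result = p.simulate(
        id="name_search_test",
        body={"docs": [{"_source": {"file": {"filename": "some-file.pdf"}}}]},
        verbose=True,
    )
    print(result)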

Hi David,

Thanks for your response. Although I cannot share the results of the simulation with a document that did not work with the pipeline, I can say that the simulation successfully found a first_name in the file.filename field.

I chose a document that was not indexed by the original pipeline and ran it through a simulation, and it worked.

I also realized today that when I ran fscrawler on files that were PDF or Excel, not all of them were indexed, but when the files were all converted to txt, everything was indexed. Do you know why not all of my files are getting indexed?

Thanks,

Karen

Two options:

  • they are not indexed because of an issue when executing the pipeline. To debug that, you need to do what I mentioned previously with the _simulate API.
  • they are not read at all by FSCrawler because FSCrawler thinks they have not been modified since the last run. In that case you can touch the documents to change their modification dates (see the sketch after this list), or start FSCrawler with the --restart option.
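If you go the touch route, here is a minimal sketch, assuming the crawled files live under /path/to/docs (a hypothetical directory):

    from pathlib import Path

    # Bump the modification time of every file so FSCrawler
    # treats it as changed and reads it again on the next run.
    for path in Path("/path/to/docs").rglob("*"):
        if path.is_file():
            path.touch()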

Hi David,

Thanks for your response!

I think the issue was that I was running out of space on my disk. I freed up some space and the pipeline was able to index everything.

Thanks for your help anyway.

Best,

Karen
