FSCrawler - Ingest pipeline error

Hi,

I'm using FSCrawler 2.7 with Elasticsearch and Kibana 7.1.1.

I have a single document from which I want to extract two values and add them as separate fields.

I have the following simulation request:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "parse multiple patterns",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["additionalfield1: (?<additionalfield1>([^,]*))additionalfield2: (?<additionalfield2>([^,]*))"]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "This is a document with a lengthy text it contains a number of paragraphs and at the end Ill add some markers that indicate additional information I'd like to pull out and add as additional fields. This is the end of the actual document with additional information being added prior to the closing bracket of the RTF.\nadditionalfield1: this is information associated with additionalfield1\nadditionalfield2: information associated with additionalfield2"
      }
    }
  ]
}

That simulation gives the result I'm after:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "additionalfield2" : "information associated with additionalfield2",
          "additionalfield1" : "this is information associated with additionalfield1\n",
          "message" : """
This is a document with a lengthy text it contains a number of paragraphs and at the end Ill add some markers that indicate additional information I'd like to pull out and add as additional fields. This is the end of the actual document with additional information being added prior to the closing bracket of the RTF.
additionalfield1: this is information associated with additionalfield1
additionalfield2: information associated with additionalfield2
"""
        },
        "_ingest" : {
          "timestamp" : "2019-12-03T03:16:50.505Z"
        }
      }
    }
  ]
}

If I then create the pipeline as

PUT _ingest/pipeline/test_pipeline_id
{
  "description": "parse multiple patterns",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["additionalfield1: (?<additionalfield1>([^,]*))additionalfield2: (?<additionalfield2>([^,]*))"]
      }
    }
  ]
}

with the FSCrawler settings file containing

pipeline: "test_pipeline_id"

and then run FSCrawler as

fscrawler modtest --loop 1 --restart --debug

I end up with this error:

ElasticsearchException[Elasticsearch exception [type=exception, reason=java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [message] not present as part of path [message]]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=java.lang.IllegalArgumentException: field [message] not present as part of path [message]]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=field [message] not present as part of path [message]]];

I'm sure I'm doing everything wrong, so if a kind soul could explain what and where, and what I need to do instead, that would be very much appreciated. You'd also make it onto Santa's good list, I'm sure!

Thanks heaps

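FSCrawler does not put the extracted text in a message field; it puts it in content (see https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#generated-fields). The grok processor therefore finds no message field on the documents FSCrawler sends, which is exactly what the error is saying.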
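If in doubt about which fields are actually there, one way to check is to look at a document that was indexed without the pipeline. This is only a sketch: it assumes the index is named after the job (modtest), which is the FSCrawler default unless another index name is set in the job settings, and file.filename is just one of the other generated fields picked for illustration.

```
# Assumption: the index is called "modtest" (the job name).
# Returns one document, showing only the extracted text and the file name.
GET modtest/_search
{
  "size": 1,
  "_source": ["content", "file.filename"]
}
```

The extracted text should show up under content, with no message field anywhere in _source.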

Update the pipeline to:

PUT _ingest/pipeline/test_pipeline_id
{
  "description": "parse multiple patterns",
  "processors": [
    {
      "grok": {
        "field": "content",
        "patterns": ["additionalfield1: (?<additionalfield1>([^,]*))additionalfield2: (?<additionalfield2>([^,]*))"]
      }
    }
  ]
}

And that should be good.
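Before re-running FSCrawler, the stored pipeline can be checked with the simulate API by passing it a document shaped like the ones FSCrawler sends, i.e. with the text in content. A minimal sketch: the sample text below is made up for illustration, and a real FSCrawler document carries more fields than this.

```
# Runs the stored pipeline test_pipeline_id against a sample document.
# The content value is a hypothetical stand-in for the extracted file text.
POST _ingest/pipeline/test_pipeline_id/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "Body of the document.\nadditionalfield1: first value\nadditionalfield2: second value"
      }
    }
  ]
}
```

If the pipeline reads from content, the response should list additionalfield1 and additionalfield2 in _source. If it still reads from message, the same "field [message] not present" error should come back here, without having to run a crawl.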

Doh!

Good Lord, thanks heaps for that David, much appreciated. Can't believe I was indexing without the pipeline and checking that it worked, without ever noticing the field was content and not message. Can't thank you enough!
