FSCrawler - Ingest pipeline error

Hi,

I'm using FSCrawler 2.7 and Elasticsearch and Kibana 7.1.1.

I have a single document from which I want to extract two values and add them to the indexed document as two separate fields.

I have the following

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "parse multiple patterns",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["additionalfield1: (?<additionalfield1>([^,]*))additionalfield2: (?<additionalfield2>([^,]*))"]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "This is a document with a lengthy text it contains a number of paragraphs and at the end Ill add some markers that indicate additional information I'd like to pull out and add as additional fields. This is the end of the actual document with additional information being added prior to the closing bracket of the RTF.\nadditionalfield1: this is information associated with additionalfield1\nadditionalfield2: information associated with additionalfield2"
      }
    }
  ]
}

That simulation gives the result I'm after:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "additionalfield2" : "information associated with additionalfield2",
          "additionalfield1" : "this is information associated with additionalfield1\n",
          "message" : """
This is a document with a lengthy text it contains a number of paragraphs and at the end Ill add some markers that indicate additional information I'd like to pull out and add as additional fields. This is the end of the actual document with additional information being added prior to the closing bracket of the RTF.
additionalfield1: this is information associated with additionalfield1
additionalfield2: information associated with additionalfield2
"""
        },
        "_ingest" : {
          "timestamp" : "2019-12-03T03:16:50.505Z"
        }
      }
    }
  ]
}

If I create the pipeline as

PUT _ingest/pipeline/test_pipeline_id
{
  "description": "parse multiple patterns",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["additionalfield1: (?<additionalfield1>([^,]*))additionalfield2: (?<additionalfield2>([^,]*))"]
      }
    }
  ]
}

with the FSCrawler settings file containing

pipeline: "test_pipeline_id"
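For reference, in FSCrawler's _settings.yaml the pipeline option sits under the elasticsearch section; a minimal sketch (the node URL and index name below are placeholders for my setup, not defaults you must use):

```yaml
name: "modtest"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  index: "modtest"
  pipeline: "test_pipeline_id"
```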

and then run fscrawler as

fscrawler modtest --loop 1 --restart --debug

I end up with the following error:

ElasticsearchException[Elasticsearch exception [type=exception, reason=java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [message] not present as part of path [message]]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=java.lang.IllegalArgumentException: field [message] not present as part of path [message]]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=field [message] not present as part of path [message]]];

I'm sure I'm doing everything wrong, so if a kind soul could please explain what and where that is, and what I need to do, that would be very much appreciated. You'd also make it onto Santa's good list, I'm sure!

Thanks heaps

FSCrawler does not put the extracted text in the message field but in the content field (see https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#generated-fields).

Update the pipeline to:

PUT _ingest/pipeline/test_pipeline_id
{
  "description": "parse multiple patterns",
  "processors": [
    {
      "grok": {
        "field": "content",
        "patterns": ["additionalfield1: (?<additionalfield1>([^,]*))additionalfield2: (?<additionalfield2>([^,]*))"]
      }
    }
  ]
}
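To double-check before re-running FSCrawler, you can replay the simulation against the stored pipeline with the text under content instead of message (the sample text here is just a shortened stand-in):

```
POST _ingest/pipeline/test_pipeline_id/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "Some document text.\nadditionalfield1: value one\nadditionalfield2: value two"
      }
    }
  ]
}
```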

And that should be good.

Doh!

Good Lord, thanks heaps for that David, much appreciated. Can't believe I tested the simulation without the pipeline attached to indexing, yet never reflected on the field being content rather than message. Can't thank you enough!