FSCrawler pipeline feature


(panand) #1

Hey, I am unable to use the FSCrawler pipeline feature. I want to use it to replace '_' with '/' in path.real. I am using Kibana and Elasticsearch 6.x and FSCrawler 2.4.
The config is as follows.

    {
      "name" : "test1",
      "fs" : {
        "url" : "/tmp/es",
        "update_rate" : "15m",
        "excludes" : [ "~*" ],
        "json_support" : false,
        "filename_as_id" : false,
        "add_filesize" : true,
        "remove_deleted" : true,
        "add_as_inner_object" : false,
        "store_source" : false,
        "index_content" : true,
        "attributes_support" : false,
        "raw_metadata" : false,
        "xml_support" : false,
        "index_folders" : true,
        "lang_detect" : false,
        "continue_on_error" : false,
        "pdf_ocr" : true,
        "ocr" : {
          "language" : "eng"
        }
      },
      "elasticsearch" : {
        "nodes" : [ {
          "host" : "127.0.0.1",
          "port" : 9200,
          "scheme" : "HTTP"
        } ],
        "pipeline": "pipeline1",
        "bulk_size" : 100,
        "flush_interval" : "5s"
      },
      "rest" : {
        "scheme" : "HTTP",
        "host" : "127.0.0.1",
        "port" : 8080,
        "endpoint" : "fscrawler"
      }
    }

I am creating the pipeline as follows:

    PUT _ingest/pipeline/pipeline1
    {
      "description" : "testing pipeline",
      "processors" : [
        {
          "gsub": {
              "field": "path.real",
              "pattern": "_",
              "replacement": "/"
            }
        }
      ]
    }

I am uploading PDFs using the REST feature of FSCrawler.
Please help.


(David Pilato) #2

Please format your code, logs, or configuration files using the </> icon, as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

This is the icon to use if you are not using markdown format:

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.


(panand) #3

Apologies, I have formatted the query now.


(David Pilato) #4

A few things to check.

First, can you try the latest SNAPSHOT of FSCrawler?
Then, did you simulate your pipeline with the _simulate endpoint to make sure your pattern does what you expect?
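For example, a simulate call for a stored pipeline could look like this (a sketch; the `path.real` value is made up, shaped like the documents FSCrawler sends):

    POST _ingest/pipeline/pipeline1/_simulate
    {
      "docs": [
        {
          "_index": "test1",
          "_type": "doc",
          "_id": "id",
          "_source": {
            "path": {
              "real": "/tmp/es/my_file.pdf"
            }
          }
        }
      ]
    }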

If so, could you share the simulate call and the response?


(panand) #5

Sorry, I won't be able to do that; I don't have the required rights on the system.

I tried simulating even a small "set" processor, but it doesn't seem to be working on Elasticsearch. Please help me if I am missing something; I am very new to Elasticsearch and FSCrawler. Following are the request and response:

    POST _ingest/pipeline/pipeline1/_simulate
    {
      "pipeline": {
        "description": "testing pipeline",
        "processors": [
          {
            "set": {
              "field": "path",
              "value": "bar123"
            }
          }
        ]
      },
      "docs": [
        {
          "_index": "index",
          "_type": "_doc",
          "_id": "id",
          "_source": {
            "path": "test_123"
          }
        }
      ]
    }

Response:

    {
      "docs": [
        {
          "doc": {
            "_index": "index",
            "_type": "_doc",
            "_id": "id",
            "_source": {
              "path": "test_123",
              "content": "bar"
            },
            "_ingest": {
              "timestamp": "2018-06-15T15:07:39.718Z"
            }
          }
        }
      ]
    }

(David Pilato) #6

This is working well:

    POST _ingest/pipeline/_simulate
    {
      "pipeline": {
        "description": "testing pipeline",
        "processors": [
          {
            "gsub": {
              "field": "path",
              "pattern": "_",
              "replacement": "/"
            }
          }
        ]
      },
      "docs": [
        {
          "_index": "index",
          "_type": "_doc",
          "_id": "id",
          "_source": {
            "path": "test_123"
          }
        }
      ]
    }

It gives:

    {
      "docs": [
        {
          "doc": {
            "_index": "index",
            "_type": "_doc",
            "_id": "id",
            "_source": {
              "path": "test/123"
            },
            "_ingest": {
              "timestamp": "2018-06-15T16:22:10.750350Z"
            }
          }
        }
      ]
    }

Can you do it now with a typical document sent by FSCrawler to Elasticsearch?


(panand) #7

Hi,

Thanks for the reply. Yes, it did work with Elasticsearch directly, but I'm afraid not with FSCrawler.
Following is how I am using this:

    PUT _ingest/pipeline/pipeline1
    {
      "description": "testing pipeline",
      "processors": [
        {
          "gsub": {
            "field": "meta.raw.Application-Name",
            "pattern": " ",
            "replacement": "/"
          }
        }
      ]
    }

FSCrawler config:

    {
      "name" : "pipeline_testing",
      "fs" : {
        "url" : "/home/testfolder1",
        "update_rate" : "1m",
        "excludes" : [ "~*" ],
        "json_support" : false,
        "filename_as_id" : false,
        "add_filesize" : true,
        "remove_deleted" : true,
        "add_as_inner_object" : false,
        "store_source" : false,
        "index_content" : true,
        "attributes_support" : false,
        "raw_metadata" : false,
        "xml_support" : false,
        "index_folders" : true,
        "lang_detect" : false,
        "continue_on_error" : false,
        "pdf_ocr" : true,
        "ocr" : {
          "language" : "eng"
        }
      },
      "elasticsearch" : {
        "nodes" : [ {
          "host" : "127.0.0.1",
          "port" : 9200,
          "scheme" : "HTTP"
        } ],
        "pipeline": "pipeline1",
        "bulk_size" : 100,
        "flush_interval" : "5s"
      },
      "rest" : {
        "scheme" : "HTTP",
        "host" : "127.0.0.1",
        "port" : 8080,
        "endpoint" : "fscrawler"
      }
    }

It does not index any documents with this config. Removing the pipeline line indexes the documents, but of course without the pipeline. The same pipeline worked well with _simulate and when Elasticsearch is used directly. Am I missing something here? Any suggestions would help.
Thanks


(David Pilato) #8

Can you share the output of:

    GET pipeline_testing/_search
    {
      "size": 1
    }

(panand) #9

Following is the output of the above query when the pipeline is set in the FSCrawler config:

    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
      }
    }

As soon as I remove the line `"pipeline": "pipeline1",` from the config, it returns the following:

    {
      "took": 0,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 100,
        "max_score": 1,
        "hits": [
          {
            "_index": "pipeline_testing",
            "_type": "doc",
            "_id": "b5db833fe0ee409de6b926d67e63b1",
            "_score": 1,
            "_source": {
              "content": """
    .........
    ........

(David Pilato) #10

Could you share the full output of the last one?


(panand) #11

Apologies, the issue was on my side. I searched for the error in the Elasticsearch logs and found the following. It turned out that I had changed the field name earlier while simulating, to check whether it drills down into the JSON, and forgot to correct it in the actual pipeline. I corrected that and it's working well now.

    [2018-06-28T11:05:19,468][DEBUG][o.e.a.b.TransportBulkAction] [C8-OFwE] failed to execute pipeline [pipeline1] for document [pipeline_testing/doc/bd1cd038f977f76ce54bc1cace51d4]
    org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [raw] not present as part of path [meta.raw.Application-Name]
    	at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:169) ~[elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:42) ~[elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.ingest.PipelineExecutionService$2.doRun(PipelineExecutionService.java:94) [elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
    Caused by: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [raw] not present as part of path [meta.raw.Application-Name]
    	... 11 more
    Caused by: java.lang.IllegalArgumentException: field [raw] not present as part of path [meta.raw.Application-Name]
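For anyone hitting the same error: the pipeline referenced `meta.raw.Application-Name`, but with `"raw_metadata" : false` in the FSCrawler config the `meta.raw` object is never sent, so the gsub processor failed on every document. A pipeline targeting a field FSCrawler actually sends, like the `path.real` one from my first post, works (a sketch of that original goal):

    PUT _ingest/pipeline/pipeline1
    {
      "description" : "testing pipeline",
      "processors" : [
        {
          "gsub": {
            "field": "path.real",
            "pattern": "_",
            "replacement": "/"
          }
        }
      ]
    }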

Thanks for your time. It really helped me.


(system) #12

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.