FSCrawler pipeline feature


(panand) #1

Hey, I am unable to use the FSCrawler pipeline feature. I want to use it to replace '_' with '/' in path.real. I am using Kibana and Elasticsearch 6.x and FSCrawler 2.4.
The config is as follows.

    {
      "name" : "test1",
      "fs" : {
        "url" : "/tmp/es",
        "update_rate" : "15m",
        "excludes" : [ "~*" ],
        "json_support" : false,
        "filename_as_id" : false,
        "add_filesize" : true,
        "remove_deleted" : true,
        "add_as_inner_object" : false,
        "store_source" : false,
        "index_content" : true,
        "attributes_support" : false,
        "raw_metadata" : false,
        "xml_support" : false,
        "index_folders" : true,
        "lang_detect" : false,
        "continue_on_error" : false,
        "pdf_ocr" : true,
        "ocr" : {
          "language" : "eng"
        }
      },
      "elasticsearch" : {
        "nodes" : [ {
          "host" : "127.0.0.1",
          "port" : 9200,
          "scheme" : "HTTP"
        } ],
        "pipeline": "pipeline1",
        "bulk_size" : 100,
        "flush_interval" : "5s"
      },
      "rest" : {
        "scheme" : "HTTP",
        "host" : "127.0.0.1",
        "port" : 8080,
        "endpoint" : "fscrawler"
      }
    }

I am creating the pipeline as follows:

    PUT _ingest/pipeline/pipeline1
    {
      "description" : "testing pipeline",
      "processors" : [
        {
          "gsub": {
              "field": "path.real",
              "pattern": "_",
              "replacement": "/"
            }
        }
      ]
    }

I am uploading PDFs using the REST feature of FSCrawler.
Please help.


(David Pilato) #2

Please format your code, logs, or configuration files using the </> icon, as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

This is the icon to use if you are not using markdown format:

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.


(panand) #3

Apologies, I have formatted the query now.


(David Pilato) #4

A few things to check.

First, can you try the latest SNAPSHOT of FSCrawler?
Then, did you simulate your pipeline with the _simulate endpoint to make sure your pattern does what you expect?
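For example, a simulate call for a stored pipeline could look like this (a sketch; the `path.real` value is made up, shaped like the documents FSCrawler sends):

    POST _ingest/pipeline/pipeline1/_simulate
    {
      "docs": [
        {
          "_index": "test1",
          "_type": "doc",
          "_id": "id",
          "_source": {
            "path": {
              "real": "/tmp/es/my_file.pdf"
            }
          }
        }
      ]
    }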

If so, could you share the simulate call and the response?


(panand) #5

Sorry, I won't be able to do that; I don't have the required rights on the system.

I tried simulating even a small "set" processor, but it doesn't seem to be working on Elasticsearch. Please help me if I am missing something; I am very new to Elasticsearch and FSCrawler. Following are the request and response:

    POST _ingest/pipeline/pipeline1/_simulate
    {
      "pipeline": {
        "description": "testing pipeline",
        "processors": [
          {
            "set": {
              "field": "path",
              "value": "bar123"
            }
          }
        ]
      },
      "docs": [
        {
          "_index": "index",
          "_type": "_doc",
          "_id": "id",
          "_source": {
            "path": "test_123"
          }
        }
      ]
    }

Response:

    {
      "docs": [
        {
          "doc": {
            "_index": "index",
            "_type": "_doc",
            "_id": "id",
            "_source": {
              "path": "test_123",
              "content": "bar"
            },
            "_ingest": {
              "timestamp": "2018-06-15T15:07:39.718Z"
            }
          }
        }
      ]
    }

(David Pilato) #6

This is working well:

    POST _ingest/pipeline/_simulate
    {
      "pipeline": {
        "description": "testing pipeline",
        "processors": [
          {
            "gsub": {
              "field": "path",
              "pattern": "_",
              "replacement": "/"
            }
          }
        ]
      },
      "docs": [
        {
          "_index": "index",
          "_type": "_doc",
          "_id": "id",
          "_source": {
            "path": "test_123"
          }
        }
      ]
    }

It gives:

    {
      "docs": [
        {
          "doc": {
            "_index": "index",
            "_type": "_doc",
            "_id": "id",
            "_source": {
              "path": "test/123"
            },
            "_ingest": {
              "timestamp": "2018-06-15T16:22:10.750350Z"
            }
          }
        }
      ]
    }

Can you do it now with a typical document sent by FSCrawler to Elasticsearch?


(panand) #7

Hi,

Thanks for the reply. Yes, it did work with Elasticsearch directly, but I'm afraid not with FSCrawler.
Following is how I am using this:

    PUT _ingest/pipeline/pipeline1
    {
      "description": "testing pipeline",
      "processors": [
        {
          "gsub": {
            "field": "meta.raw.Application-Name",
            "pattern": " ",
            "replacement": "/"
          }
        }
      ]
    }

FSCrawler config:

    {
      "name" : "pipeline_testing",
      "fs" : {
        "url" : "/home/testfolder1",
        "update_rate" : "1m",
        "excludes" : [ "~*" ],
        "json_support" : false,
        "filename_as_id" : false,
        "add_filesize" : true,
        "remove_deleted" : true,
        "add_as_inner_object" : false,
        "store_source" : false,
        "index_content" : true,
        "attributes_support" : false,
        "raw_metadata" : false,
        "xml_support" : false,
        "index_folders" : true,
        "lang_detect" : false,
        "continue_on_error" : false,
        "pdf_ocr" : true,
        "ocr" : {
          "language" : "eng"
        }
      },
      "elasticsearch" : {
        "nodes" : [ {
          "host" : "127.0.0.1",
          "port" : 9200,
          "scheme" : "HTTP"
        } ],
        "pipeline": "pipeline1",
        "bulk_size" : 100,
        "flush_interval" : "5s"
      },
      "rest" : {
        "scheme" : "HTTP",
        "host" : "127.0.0.1",
        "port" : 8080,
        "endpoint" : "fscrawler"
      }
    }

It does not index any documents with this config. Removing the pipeline line indexes the documents, but of course without the pipeline. The same pipeline worked well with _simulate and when Elasticsearch is used directly. Am I missing something here? Any suggestions would help.
Thanks


(David Pilato) #8

Can you share the output of:

    GET pipeline_testing/_search
    {
      "size": 1
    }

(panand) #9

Following is the output of the above query when the pipeline is set in the FSCrawler config:

    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
      }
    }

As soon as I remove the line `"pipeline": "pipeline1",` from the config, it returns the following:

    {
      "took": 0,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 100,
        "max_score": 1,
        "hits": [
          {
            "_index": "pipeline_testing",
            "_type": "doc",
            "_id": "b5db833fe0ee409de6b926d67e63b1",
            "_score": 1,
            "_source": {
              "content": """
    .........
    ........

(David Pilato) #10

Could you share the full output of the last one?


(panand) #11

Apologies, the issue was on my side. I searched for the error in the Elasticsearch logs and found the following. It turned out that I had changed the field name earlier while simulating, to check whether it drills down into the JSON, and forgot to correct it in the actual pipeline. I corrected that and it's working well now.

    [2018-06-28T11:05:19,468][DEBUG][o.e.a.b.TransportBulkAction] [C8-OFwE] failed to execute pipeline [pipeline1] for document [pipeline_testing/doc/bd1cd038f977f76ce54bc1cace51d4]
    org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [raw] not present as part of path [meta.raw.Application-Name]
    	at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:169) ~[elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:42) ~[elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.ingest.PipelineExecutionService$2.doRun(PipelineExecutionService.java:94) [elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.4.jar:6.2.4]
    	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
    Caused by: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [raw] not present as part of path [meta.raw.Application-Name]
    	... 11 more
    Caused by: java.lang.IllegalArgumentException: field [raw] not present as part of path [meta.raw.Application-Name]
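For anyone hitting the same error: the pipeline referenced `meta.raw.Application-Name`, but with `"raw_metadata" : false` in the FSCrawler config the `meta.raw` object is never sent, so the gsub processor failed on every document. A pipeline targeting a field FSCrawler actually sends, like the `path.real` one from my first post, works (a sketch of that original goal):

    PUT _ingest/pipeline/pipeline1
    {
      "description" : "testing pipeline",
      "processors" : [
        {
          "gsub": {
            "field": "path.real",
            "pattern": "_",
            "replacement": "/"
          }
        }
      ]
    }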

Thanks for your time. It really helped me.


(system) #12

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.