Can I parse text in pdf document before sending it to elasticsearch using FSCrawler

dadoonet · May 24, 2019, 6:46am

Now I understand the question.
So it's not really related to FSCrawler; it's more a general question about how to extract a phone number from text, right?

I mean that FSCrawler is responsible for extracting the text from the PDF.
Once that's done, you can do whatever you want with the extracted text.

Here I'd probably try to use an ingest pipeline to apply a regex to your text. You can tell FSCrawler to use that pipeline later on (see "Elasticsearch settings" in the FSCrawler 2.10-SNAPSHOT documentation).

You could try the Grok processor, maybe (see "Grok processor" in the Elasticsearch Guide [8.11]).

If you have further questions, please provide an example of what you have tried so far, without using FSCrawler. As I said, doing that is not FSCrawler's responsibility. Something like this (but for another use case):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "parse multiple patterns",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{FAVORITE_DOG:pet}", "%{FAVORITE_CAT:pet}"],
          "pattern_definitions": {
            "FAVORITE_DOG": "beagle",
            "FAVORITE_CAT": "burmese"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "I love burmese cats!"
      }
    }
  ]
}
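
For the phone number case, a minimal sketch along the same lines might look like the request below. It is only an illustration under a few assumptions: the extracted text is in the `content` field (which is where FSCrawler puts it), phone numbers look like `555-123-4567`, and the `PHONE` pattern name, the `phone_number` target field and the sample sentence are all made up for the example. Real phone formats vary a lot, so the regex will almost certainly need adjusting.

```
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "illustrative only: hypothetical PHONE pattern, assumes NNN-NNN-NNNN numbers",
    "processors": [
      {
        "grok": {
          "field": "content",
          "patterns": ["%{PHONE:phone_number}"],
          "pattern_definitions": {
            "PHONE": "[0-9]{3}-[0-9]{3}-[0-9]{4}"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "content": "For more information call 555-123-4567 during office hours."
      }
    }
  ]
}
```

If the pattern matches, the simulated document should come back with an extra `phone_number` field next to `content`. Note that the grok processor fails when nothing matches, so for documents without a phone number you would probably want to set `"ignore_failure": true` on the processor before using the pipeline for real.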
