I'm using FSCrawler 2.6 and it works great for indexing PDF documents. However, one issue I am facing is that it puts all the contents of the PDF in the "content" field (I am a newbie in this field). So my question is: is there any way to have a custom mapping for the data in the PDF, e.g. latitude/longitude or a number, or, if not that, line by line (like content.line1, content.line2, content.line3...)?
Parsing text to extract meaningful content (entities) is a difficult thing.
The only option I can see for now is to use an Elasticsearch ingest pipeline.
In FSCrawler you can configure the ingest pipeline name to apply after the text has been extracted. See https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html
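As a rough illustration of that idea, here is a minimal sketch in Python that only builds the two JSON documents involved. Everything specific in it is an assumption for the example: the pipeline id `pdf_fields`, the job name `my_job`, the target field `content_lines`, and above all the idea that the PDF text contains something like "Latitude: 48.85 Longitude: 2.35". The grok pattern would need to be adapted to the real text, and the linked documentation should be checked for the settings format of your FSCrawler version.

```python
import json

# Hypothetical pipeline id, used only for this example.
PIPELINE_ID = "pdf_fields"


def build_pipeline():
    """Build an ingest pipeline body that post-processes the "content" field."""
    return {
        "description": "Extract fields from the text FSCrawler puts in 'content'",
        "processors": [
            {
                # Assumes the text contains e.g. "Latitude: 48.85 Longitude: 2.35".
                "grok": {
                    "field": "content",
                    "patterns": [
                        r"Latitude:\s*%{NUMBER:latitude:float}\s+"
                        r"Longitude:\s*%{NUMBER:longitude:float}"
                    ],
                    # Documents without coordinates are still indexed.
                    "ignore_failure": True,
                }
            },
            {
                # Split the text into an array of lines in a separate field
                # (an array rather than content.line1, content.line2, ...).
                "split": {
                    "field": "content",
                    "separator": r"\r?\n",
                    "target_field": "content_lines",
                    "ignore_failure": True,
                }
            },
        ],
    }


def build_fscrawler_settings(job_name):
    """Build the part of the FSCrawler job settings that names the pipeline."""
    return {"name": job_name, "elasticsearch": {"pipeline": PIPELINE_ID}}


if __name__ == "__main__":
    # Body to send to Elasticsearch with PUT _ingest/pipeline/pdf_fields
    print(json.dumps(build_pipeline(), indent=2))
    # Fragment to merge into the FSCrawler job settings file
    print(json.dumps(build_fscrawler_settings("my_job"), indent=2))
```

The first document would be registered in Elasticsearch with `PUT _ingest/pipeline/pdf_fields`, and the second shows where the pipeline name goes in the job settings. Testing the pipeline against a sample document with the `_simulate` endpoint before crawling is probably worthwhile, since grok patterns rarely match real PDF text on the first try.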