Prepend fields to files that are to be indexed

c95mbq · February 26, 2019, 9:10pm

Hi,

Apologies for what's more of a question about what's possible than any specific problem - I'm an absolute rookie but very much enjoying all the things ELK has to offer out of the box.

Long story short, I recently discovered the brilliant fscrawler plugin, big props to the author. I have a large number of RTF files, and fscrawler let me index and search them in Kibana without spending any considerable time on setup. My problem is that when I extract those documents and write the files, there is additional information that would be valuable to add as searchable fields.

That is, using a single document as an example: it is an RTF that I don't create myself. I simply extract it from a blob and write it to a file, then send it to a location where fscrawler picks it up and does its magic.

When I extract that document from the blob, there is associated information that I could look up and that would be valuable to attach to the document in question. For example, imagine the extracted RTF document is a building report. I don't know the structure of the document, and it can vary a great deal, but I can find additional information by running queries, such as the author of the document, the number of floors of the building, and so on.

How could I go about adding this information, those fields, and make it so that fscrawler adds those as additional fields that are searchable along with the actual contents of the file itself?

Originally I thought perhaps I could add to the filename and use what would be an expected pattern to extract those fields using an ingest node pipeline? That would lead to very lengthy filenames though and also seems a bit silly but I have no idea how one does these things.
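To make the filename idea concrete, here is a minimal Python sketch of encoding fields into a filename and recovering them with a named-group pattern, which is roughly what a pattern-based extraction step would do. The naming convention, the field names (`report_id`, `author`, `floors`) and both helper functions are hypothetical and not something fscrawler defines.

```python
import re
from pathlib import Path

# Hypothetical convention: <report-id>__author=<author>__floors=<n>.rtf
# Values must not contain underscores, since "__" is used as the separator.
FILENAME_PATTERN = re.compile(
    r"^(?P<report_id>[^_]+)__author=(?P<author>[^_]+)__floors=(?P<floors>\d+)\.rtf$"
)


def build_filename(report_id, author, floors):
    """Encode the extra fields into the name of the file to be written."""
    return f"{report_id}__author={author}__floors={floors}.rtf"


def parse_filename(path):
    """Recover the fields from a filename, or return None if it doesn't match."""
    match = FILENAME_PATTERN.match(Path(path).name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["floors"] = int(fields["floors"])  # numeric field, so convert it
    return fields
```

The sketch also shows the drawback mentioned above: every extra field lengthens the filename, and values containing the separator would need escaping.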

Could/would one instead prepend that information to the start of the RTF file and somehow process and remove that information before the document is then processed by fscrawler if that makes sense? That is, say I'd add something like author: blahblah before the start of the actual RTF along the lines of

author: blahblah {\rtf1\ansi\ansicpg1252\uc0\deff0{\fonttbl

is that what people would do to be able to search the document using content and then also a field called author? Are there any "proper" ways of doing this and is someone able to please describe the basic steps?
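For illustration, the prepend-and-strip idea could look like the Python sketch below. It assumes the prefix consists of `key: value` text placed before the first `{\rtf` marker, and that the splitting runs as a separate step before the file reaches the directory fscrawler watches. The function name is hypothetical, and nothing here is an fscrawler feature.

```python
RTF_START = b"{\\rtf"  # every RTF document begins with this marker


def split_prefixed_rtf(data: bytes):
    """Split prepended 'key: value' text from the RTF payload.

    Returns (fields, rtf_bytes). Assumes the prefix is UTF-8 text with one
    'key: value' pair per line and that it never contains the RTF marker.
    """
    start = data.find(RTF_START)
    if start == -1:
        raise ValueError("no RTF header found")
    fields = {}
    for line in data[:start].decode("utf-8").splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that are not 'key: value'
            fields[key.strip()] = value.strip()
    return fields, data[start:]
```

The recovered fields would still need some route into the index alongside the document, so this only covers the "remove that information" half of the question.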

Again, apologies for this being such a generic question and more about seeking help to get an idea of how people achieve these things and if the above is feasible? If so, is it an ingest node pipeline that I would use to create the type field and then remove the text from the file before it's parsed by fscrawler?

Thanks for any suggestions and help you can offer, much appreciated.

c95mbq · February 27, 2019, 2:26am

Just wanted to add that I thought maybe I could abuse some metadata fields and add an {\info} group to the RTF document, but after the file is parsed by fscrawler and indexed, meta.author and meta.title, which I had set, don't seem to come across.

OK, so I've just realized why the metadata didn't come across: I had written it into the RTF file in the wrong location. When I move it, it does come across. That means I could probably borrow some of those metadata fields, as they're not set by the RTF produced in our system, but it feels very dodgy even though it might be the quickest way for someone like me who doesn't really know what they're doing.
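As a rough sketch of the metadata approach, the Python below builds an RTF info group carrying a title and an author, using the standard `\info`, `\title` and `\author` control words. The helper names are hypothetical, values are assumed to be plain ASCII, and where the group must be placed inside the document is deliberately left out, since the thread only establishes that the location matters.

```python
def escape_rtf(text: str) -> str:
    """Escape the characters that are special in RTF: backslash and braces.

    Assumes ASCII input; non-ASCII text needs RTF unicode escapes, which this
    sketch does not attempt, so it raises UnicodeEncodeError instead.
    """
    text.encode("ascii")
    return "".join("\\" + ch if ch in "\\{}" else ch for ch in text)


def build_info_group(author: str, title: str) -> str:
    """Return an RTF info group such as {\\info{\\title ...}{\\author ...}}."""
    return "{\\info{\\title %s}{\\author %s}}" % (
        escape_rtf(title),
        escape_rtf(author),
    )
```

For example, `build_info_group("blahblah", "Building report")` produces `{\info{\title Building report}{\author blahblah}}`.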

