Hi,
Apologies for what's more of a question about what's possible than any specific problem. I'm an absolute rookie but very much enjoying everything ELK has to offer out of the box.
Long story short, I recently discovered the brilliant fscrawler plugin, big props to the author. I have a large number of RTF files, and it allowed me to index them and search them in Kibana very easily, without spending any considerable time on setup. My problem is that when I extract those documents and write the files, there is additional information that would be valuable to add as fields that can then be searched.
Taking a single document as an example: it is an RTF that I don't create myself. I simply extract it from a blob and write it to a file in a location where fscrawler picks it up and does its magic.
At the point where I extract the document from the blob, I can also look up associated information that would be valuable to attach to that document. For example, imagine the extracted RTF is a building report. I don't know the structure of the document, and it can vary a great deal, but by running queries I can find additional information such as the author of the document, the number of floors of the building, and so on.
How could I go about adding this information as extra fields, so that fscrawler indexes them and they are searchable along with the actual contents of the file?
Originally I thought I could perhaps add the information to the filename in a predictable pattern and then extract those fields using an ingest node pipeline. That would lead to very lengthy filenames though, and it also seems a bit silly, but I have no idea how one does these things.
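To make the filename idea a bit more concrete, here is a rough Python sketch of what I have in mind on the extraction side. The `__` and `=` separators, the function names and the example values are all made up by me for illustration, and I am not claiming this is how it should be done. The parsing half only shows that such a pattern could be reversed; in practice that part would presumably have to happen in the ingest pipeline instead.

```python
from pathlib import Path

# Hypothetical naming scheme (my own invention, not anything fscrawler defines):
#   <base>__<key>=<value>__<key>=<value>.rtf
FIELD_SEP = "__"
KV_SEP = "="


def build_filename(base: str, fields: dict[str, str], ext: str = ".rtf") -> str:
    """Encode the extra fields into the filename when writing the file."""
    parts = [base] + [f"{key}{KV_SEP}{value}" for key, value in fields.items()]
    return FIELD_SEP.join(parts) + ext


def parse_filename(filename: str) -> tuple[str, dict[str, str]]:
    """Recover the base name and the extra fields from such a filename."""
    stem = Path(filename).stem  # drop the extension
    base, *pairs = stem.split(FIELD_SEP)
    fields = {}
    for pair in pairs:
        key, sep, value = pair.partition(KV_SEP)
        if sep:  # skip anything that is not a key=value pair
            fields[key] = value
    return base, fields


name = build_filename("report42", {"author": "blahblah", "floors": "12"})
base, fields = parse_filename(name)
```

This only works if the values never contain the separators or characters that are not allowed in filenames, which is part of why the approach feels fragile to me.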
Could one instead prepend that information to the start of the RTF file, and then somehow extract and remove it before the document is processed by fscrawler? That is, say I'd add something like author: blahblah before the start of the actual RTF, along the lines of
author: blahblah {\rtf1\ansi\ansicpg1252\uc0\deff0{\fonttbl
Is that what people would do to be able to search the document by its content and also by a field called author? Are there any "proper" ways of doing this, and is someone able to describe the basic steps?
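In case it helps explain the prepending idea, this is roughly the split I imagine having to happen somewhere before the RTF itself is parsed. It is only a Python sketch under my own assumptions: the header is plain "key: value" text, the real document starts at the first `{\rtf` marker, and `split_header` is a name I made up. I don't know where (or whether) a step like this can be hooked into fscrawler.

```python
# Assumption: the real RTF content starts at the first "{\rtf" marker.
RTF_MARKER = "{\\rtf"


def split_header(raw: str) -> tuple[dict[str, str], str]:
    """Split prepended "key: value" text from the RTF body that follows it."""
    start = raw.find(RTF_MARKER)
    if start == -1:
        raise ValueError("no RTF marker found")
    header, body = raw[:start], raw[start:]
    fields = {}
    for line in header.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that are not "key: value"
            fields[key.strip()] = value.strip()
    return fields, body


raw = "author: blahblah {\\rtf1\\ansi\\ansicpg1252\\uc0\\deff0{\\fonttbl"
fields, body = split_header(raw)
```

After this, `fields` would hold the author and `body` would be the untouched RTF, so the two could in principle be indexed as separate fields.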
Again, apologies for this being such a generic question; it is more about getting an idea of how people achieve these things and whether the above is feasible. If so, is it an ingest node pipeline that I would use to create the extra field (author in the example above) and then remove the text from the file before it's parsed by fscrawler?
Thanks for any suggestions and help you can offer, much appreciated.