Apologies for the poor topic title; it was the best I could do, sad as that is.
I'm really struggling with this, but here's what I'm trying to do. I have a large number of documents in RTF format. The documents would have more value if augmented with some additional information, so I'm injecting that information into each document after the last paragraph. Imagine it looks like this:
This is a document with a lengthy text
it contains a number of paragraphs
and at the end
I'll add some markers that indicate additional information I'd like to pull out and add as additional fields
additionalfield1: this is information associated with additionalfield1
additionalfield2: information associated with additionalfield2
FSCrawler extracts the text from the documents, and I'm guessing I need an ingest pipeline that processes each document before it gets indexed. In the above example, I thought it would make sense to extract what follows additionalfield1: into its own field, and the same for additionalfield2, then remove both lines from the text that gets indexed. (Maybe removing them is a bad idea, though, as the person searching couldn't visually confirm that the information was indeed present in the indexed document.)
First, does that make sense, and would you use multiple patterns to match, one for each field, or is there a better way to do this? Each of these "special" fields can appear only once, always at the start of a line, and only at the end of the actual document. Given that, will matching be dreadfully slow, and is there a way to search bottom-up, so to speak?
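To make the idea concrete, here is a rough Python sketch of what I mean by "one pattern for all fields" and "bottom-up". It is not FSCrawler or pipeline code, just an illustration of the matching logic. The function name, the tail_lines cut-off of 20 and the field list are my own assumptions, not anything from a real API.

```python
import re

# Field names from the example above; the list itself is an assumption.
FIELD_NAMES = ["additionalfield1", "additionalfield2"]

# One pattern for all fields: with re.MULTILINE, ^ and $ anchor at line boundaries.
FIELD_RE = re.compile(
    r"^(?P<name>" + "|".join(map(re.escape, FIELD_NAMES)) + r"):[ \t]*(?P<value>.*)$",
    re.MULTILINE,
)


def extract_trailing_fields(text, tail_lines=20):
    """Return {field: value} found in the last `tail_lines` lines of `text`.

    rsplit works from the end of the string, so the body of a long
    document is never split into lines or run through the regex.
    """
    parts = text.rsplit("\n", tail_lines)
    tail = "\n".join(parts[-tail_lines:])
    return {m.group("name"): m.group("value").strip() for m in FIELD_RE.finditer(tail)}


sample = (
    "This is a document with a lengthy text\n"
    "it contains a number of paragraphs\n"
    "and at the end\n"
    "additionalfield1: this is information associated with additionalfield1\n"
    "additionalfield2: information associated with additionalfield2\n"
)

fields = extract_trailing_fields(sample)
```

Here fields ends up as a dict keyed by field name, which is roughly the shape I'd like the indexed document to have.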
Second, I can't for the life of me create a grok pattern that matches the above. A simple regex like
(?<=additionalfield1:).*$ (in multi-line mode) returns what I'd like to set the field additionalfield1 to, but I can't turn that into a custom grok pattern. Any help with this and the above would be much appreciated.
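For reference, this is the behaviour I'm after, reproduced in Python rather than grok. It compares the lookbehind regex with a named capture group, which I understand is the shape grok expects. Python writes named groups as (?P<name>...), whereas, as far as I can tell, grok's inline syntax is (?<name>...), so treat the second pattern as a sketch to adapt rather than something to paste into a pipeline.

```python
import re

text = (
    "and at the end\n"
    "additionalfield1: this is information associated with additionalfield1\n"
    "additionalfield2: information associated with additionalfield2\n"
)

# Lookbehind form from the post: the whole match is the value
# (including the space that follows the colon).
lookbehind = re.search(r"(?<=additionalfield1:).*$", text, re.MULTILINE)

# Capture-group form: match the literal prefix at the start of a line,
# keep only the named group. [ \t]* skips spaces without crossing a newline.
capture = re.search(
    r"^additionalfield1:[ \t]*(?P<additionalfield1>.*)$", text, re.MULTILINE
)

value_from_lookbehind = lookbehind.group(0).strip()
value_from_capture = capture.group("additionalfield1")
```

Both variables hold the same text, so the named-group version should be a drop-in replacement for the lookbehind one.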