Custom grok pattern to pull field from large text

c95mbq · October 10, 2019, 5:24am

Apologies for the poor topic, it was the best I could do, sad as that is.

I'm really struggling with this, but here's what I'm trying to do. I have a large number of documents in RTF format. The documents would have more value if augmented with some additional information, so I'm injecting that information into each document after the last paragraph. Imagine it looks like this:

This is a document with a lengthy text
it contains a number of paragraphs
and at the end
I'll add some markers that indicate additional information I'd like to pull out and add as additional fields
additionalfield1: this is information associated with additionalfield1
additionalfield2: information associated with additionalfield2

FSCrawler extracts the text from the documents, and I'm guessing I need a processing pipeline that runs before each document gets indexed. In the above example, I thought it would make sense to extract what follows additionalfield1: into its own field, and the same for additionalfield2, before removing both lines from the source that gets indexed (though maybe that's a bad idea, as the person searching couldn't visually confirm that the information was actually present in the indexed document).
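To make the intended transformation concrete, here is a minimal Python sketch of "pull the marker lines into fields, then strip them from the body". It only illustrates the logic outside Elasticsearch; the marker names come from the example above, and the function name `split_markers` is made up for illustration.

```python
import re

# Marker names taken from the example document above.
MARKERS = ("additionalfield1", "additionalfield2")

# One pattern for all markers: name at the start of a line, a colon,
# optional spaces or tabs, then the rest of that line as the value.
MARKER_RE = re.compile(
    r"^(?P<name>" + "|".join(map(re.escape, MARKERS)) + r"):[ \t]*(?P<value>.*)$",
    re.MULTILINE,
)


def split_markers(text):
    """Return (fields, body): extracted marker values and the text without marker lines."""
    fields = {m.group("name"): m.group("value").strip() for m in MARKER_RE.finditer(text)}
    kept = [line for line in text.splitlines() if not MARKER_RE.match(line)]
    return fields, "\n".join(kept)
```

Keeping the marker lines in the indexed body instead is just a matter of ignoring the second return value, which addresses the concern about searchers being able to see the information in the document.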

Firstly, does that make sense, and would you use multiple patterns to match, one for each field, or is there a better way to do this? Each of these "special" fields can appear only once, at the start of a line near the end of the document. Will matching be dreadfully slow, and is there a way to search bottom up, so to speak?
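As an illustration of the "bottom up" idea, the sketch below searches backwards from the end of the text with `str.rfind`, so a marker near the end is located without scanning the paragraphs before it. This is plain Python rather than grok, and the helper name `field_from_end` is hypothetical; whether a grok processor can be made to search this way is a separate question.

```python
def field_from_end(text, name):
    """Return the value after 'name:' on the last line that starts with it, or None."""
    marker = name + ":"
    # rfind searches from the end of the string towards the start.
    pos = text.rfind("\n" + marker)
    if pos == -1:
        # The marker may still sit on the very first line of the text.
        if not text.startswith(marker):
            return None
        start = len(marker)
    else:
        start = pos + 1 + len(marker)
    end = text.find("\n", start)
    if end == -1:
        end = len(text)
    return text[start:end].strip()
```

Because each marker appears at most once, taking the last occurrence and the only occurrence are the same thing here.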

Second, I can't for the life of me create a grok pattern that matches the above. A simple regex like (?<=additionalfield1:).*$ (in multi-line mode) returns what I'd like to set the field additionalfield1 to, but I can't turn that into a custom grok pattern. Any help with this and the above would be much appreciated.
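For comparison, here is a small Python sketch of the lookbehind regex next to a named-group version of the same match. The named-group form is closer in spirit to a grok custom capture, which the Elasticsearch documentation writes as `(?<name>regex)`; Python's `re` module spells it `(?P<name>regex)`. The sample text is the example document from above.

```python
import re

text = (
    "This is a document with a lengthy text\n"
    "it contains a number of paragraphs\n"
    "additionalfield1: this is information associated with additionalfield1\n"
    "additionalfield2: information associated with additionalfield2\n"
)

# The lookbehind from the post. MULTILINE lets $ match at the end of each line,
# so the match stops at the end of the marker line instead of failing.
lookbehind = re.search(r"(?<=additionalfield1:).*$", text, re.MULTILINE)

# The same extraction with a named group: the prefix is matched normally and
# only the part inside the group is captured, with leading blanks skipped.
named = re.search(
    r"^additionalfield1:[ \t]*(?P<additionalfield1>.*)$", text, re.MULTILINE
)

print(repr(lookbehind.group(0)))
print(repr(named.group("additionalfield1")))
```

The lookbehind result keeps the space that follows the colon, while the named group skips it, which is one practical reason to prefer the capture form.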

Thanks heaps

c95mbq · November 3, 2019, 9:29pm

Adding some additional information in hope of a reply.

If I add the below into the Grok Pattern field of the Grok Debugger in Kibana, it does seem to pull out the field as required.

Is that the right syntax to use though?

Second, I can't seem to add a second pattern to grab (in this example) additionalfield2. I've seen examples where multiple patterns are said to be added inside curly brackets, separated by commas, but this results in

"Provided Grok patterns do not match data in the input"

Any help would be much appreciated.
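One possible way to get both fields, sketched here as a Python script that prints an ingest pipeline body: use one grok processor per marker. The reasoning is that the grok processor's `patterns` list is documented as returning on the first expression that matches, so a list of alternatives would not extract both fields from the same text. Everything specific below is an assumption to verify: that the extracted text lives in a field named `content`, that the match is unanchored, and that `GREEDYDATA` stops at the end of the line. It has not been run against a cluster.

```python
import json


def marker_processor(name, source_field="content"):
    """Build one grok processor that captures the rest of the 'name:' line."""
    return {
        "grok": {
            "field": source_field,  # assumed name of the field holding the extracted text
            "patterns": [f"{name}: %{{GREEDYDATA:{name}}}"],
            "ignore_failure": True,  # documents without the marker pass through unchanged
        }
    }


pipeline = {
    "description": "Extract trailing marker lines into their own fields (sketch)",
    "processors": [
        marker_processor("additionalfield1"),
        marker_processor("additionalfield2"),
    ],
}

print(json.dumps(pipeline, indent=2))
```

The printed JSON is meant as the body of a pipeline definition; the Grok Debugger in Kibana takes a single pattern at a time, so each pattern would be checked there separately.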

system · December 1, 2019, 9:29pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
