I realised the regex was wrong, so I made the following change to the mapping as per your example. I called my field "content" since this is the name of the field in my _source. I then re-indexed.
The following query still returns the script in the content, which I did not expect:
As a starter for 10, I think I need to use a 'split' processor. However, I don't think this is complete, because I haven't said what to do with the text once it has been split. How do I expand this to discard the text in the separator? I'm guessing I need to introduce a 'foreach' somewhere in here, since I have no way of knowing how many script blocks there are or where they are in a page:
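To illustrate the split idea outside Elasticsearch, here is a minimal sketch in Python. It assumes the separator regex covers the whole script block (markers included), in which case splitting already throws the separator away and joining the pieces restores the remaining text. The sample string is hypothetical, and Python's `re` module is only a stand-in for the Java regex engine; as far as I know a 'join' processor can play the same role after 'split' in an ingest pipeline, which would make a 'foreach' unnecessary, but I have not verified that here.

```python
import re

# Hypothetical sample text; the <% ... %> blocks stand in for the script sections.
text = "Hello world <% globalheadoffice = true %> from here <% x = 1 %> onwards"

# Splitting on the whole script block drops the separator itself,
# however many blocks there are and wherever they sit in the page.
parts = re.split(r"(?s)<%.*?%>", text)

# Joining the pieces back together gives the text without the script blocks.
cleaned = "".join(parts)

print(parts)
print(cleaned)
```

Note that the surrounding spaces are kept, so each removed block leaves a double space behind.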
I deleted the _status file and re-ran the index. The text still appears in my document. I know my pattern is OK, as I've tried it in a number of online regex test tools.
Basically, the content of some documents may contain text in the form "Hello world <% globalheadoffice = true ..... %> from here". I don't want anything between the <% %> markers, or the markers themselves, to appear in the document, but I do want to see "Hello world from here".
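The requirement can be pinned down with a small sketch of the intended behaviour. This is Python rather than an Elasticsearch pipeline, and the function name `strip_script_blocks` is made up for illustration. It assumes a lazy match from `<%` to the nearest `%>`, followed by a whitespace clean-up that is my own addition so the result reads exactly as "Hello world from here".

```python
import re

# Lazy match from "<%" to the nearest "%>"; DOTALL lets "." span newlines.
SCRIPT_BLOCK = re.compile(r"<%.*?%>", re.DOTALL)

def strip_script_blocks(text: str) -> str:
    """Remove <% ... %> blocks (markers included) and tidy the leftover spaces."""
    without_blocks = SCRIPT_BLOCK.sub("", text)
    # Assumption: collapsing runs of whitespace is acceptable for this content.
    return re.sub(r"\s+", " ", without_blocks).strip()

sample = "Hello world <% globalheadoffice = true ..... %> from here"
print(strip_script_blocks(sample))  # Hello world from here
```

The same pattern with an empty replacement is what a regex-replace step in a pipeline would need to reproduce.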
I think the pipeline is working to a degree. That is to say, some of the text has been removed but not all. If I use www.regextester.com, it replaces everything that I expect. If I use freeformatter.com, it doesn't. So I guess my next question is: what regex parser does Elasticsearch use or conform to?
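One possible explanation for "some but not all", offered as a guess rather than a diagnosis: online testers run different engines (JavaScript, PCRE, Java) with different default flags, and in most of them `.` does not match a newline unless a "dot matches all" flag is set. Elasticsearch is written in Java, and to my knowledge its regex-based processors and filters follow `java.util.regex` syntax. The sketch below uses Python only because the inline `(?s)` flag behaves the same way in both; the sample page is hypothetical.

```python
import re

# Hypothetical page: the first script block spans two lines, the second does not.
page = "Hello world <% globalheadoffice = true\nother = 1 %> from here <% x = 1 %> end"

# Without the flag, "." stops at the newline, so only the single-line block is removed.
single_line_only = re.sub(r"<%.*?%>", "", page)

# With the inline (?s) flag, "." also matches newlines and every block is removed.
all_blocks = re.sub(r"(?s)<%.*?%>", "", page)

print(repr(single_line_only))
print(repr(all_blocks))
```

If the script blocks in the crawled pages contain line breaks, a pattern that passes in one tester and fails in another would be consistent with this difference.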
It is strange. Switching to a Java regex tester, if I change my pattern to (?:..)[^<%]+[^%>](?:..)+g
with my test script, it gets me close but not close enough. Going with this for now, I recreated my pipeline and index and repopulated with FsCrawler, but the text remained the same (i.e. not as per the Java regex test harness).
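A note on why a pattern built from `[^<%]+` may only get close: square brackets define a character class, so `[^<%]` means "any single character that is neither `<` nor `%`", not "anything other than the sequence `<%`". The sketch below uses a simplified class-based pattern of my own (not the exact pattern above) and a hypothetical sample containing a `%` inside the script block, and compares it with a lazy match. As far as I can tell, the trailing `g` also looks like a "global" flag copied from a tester; inside a Java pattern string it would be read as a literal letter g.

```python
import re

# Hypothetical sample with a "%" character inside the script block.
sample = "Hello world <% total = 10 % 3 %> from here"

# [^<%]+ matches a run of characters that are neither "<" nor "%",
# so it stops at the first "%" inside the block and the match fails.
class_based = re.sub(r"<%[^<%]+%>", "", sample)

# A lazy match consumes everything up to the first closing "%>".
lazy = re.sub(r"(?s)<%.*?%>", "", sample)

print(class_based)
print(lazy)
```

Here the class-based pattern leaves the sample untouched, while the lazy pattern removes the whole block.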