How do I assign a second analyzer to my mapping

Hi all,
Here is my index so far:

PUT fullsiteindex
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2,
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "vbscript": {
          "type": "pattern_replace",
          "pattern": "<\\%*\\%>",
          "replacement": ""
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stop",
            "english_stemmer"
          ]
        },
        "scripttoken": {
          "tokenizer": "pattern",
          "filter": ["vbscript"]
        }
      }
    }
  },
  "mappings": {
    "site": {
      "properties": {
        "data": {
          "type": "text",
          "analyzer": "english",
          "search_analyzer": "english"
        }
      }
    }
  }
}

How do I add my scripttoken analyzer to the mapping, since the data field needs both analyzers assigned?

You can do something like:

{
   "settings":{
      "number_of_shards":3,
      "number_of_replicas":2,
      "analysis":{
         "filter":{
            "my_stop":{
               "type":"stop",
               "stopwords":"_english_"
            },
            "english_stemmer":{
               "type":"stemmer",
               "language":"english"
            },
            "vbscript":{
               "type":"pattern_replace",
               "pattern":"<\\%*\\%>",
               "replacement":""
            }
         },
         "analyzer":{
            "english":{
               "tokenizer":"standard",
               "filter":[
                  "lowercase",
                  "my_stop",
                  "english_stemmer"
               ]
            },
            "scripttoken":{
               "tokenizer":"pattern",
               "filter":[
                  "vbscript"
               ]
            }
         }
      }
   },
   "mappings":{
      "site":{
         "properties":{
            "data":{
               "type":"text",
               "analyzer":"english",
               "search_analyzer":"english",
               "fields": {
                 "foo": {
                   "type": "text",
                   "analyzer": "scripttoken"
                 }
               }
            }
         }
      }
   }
}
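
Note that the second analyzer only applies to the sub-field, so you have to query it explicitly. For example ("foo" is just a placeholder name from the example above):

GET fullsiteindex/_search
{
  "query": {
    "match": {
      "data.foo": "my query string"
    }
  }
}

The same value from _source is indexed twice: once into data with the english analyzer, and once into data.foo with the scripttoken analyzer.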

I changed my filter to be:

"vbscript": {
    	              "type" : "pattern_replace",
    	              "pattern": "(?:..)[^<%]+[^%>](?:..)",
    	              "replacement": ""
    	            }

as I realised the regex was wrong. I made the following change to the mapping as per your example; I called my field "content", since that is the name of the field in my _source. I then re-indexed.
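
So the mapping now looks roughly like this (my field "content" in place of "data" from your example):

"content": {
  "type": "text",
  "analyzer": "english",
  "search_analyzer": "english",
  "fields": {
    "foo": {
      "type": "text",
      "analyzer": "scripttoken"
    }
  }
}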

Using the following query still returns the script in the content field, which I did not expect:

GET fullsiteindex/_search
{
  "query": {
    "query_string": {
      "query": "my query string",
      "fields": ["content"]
    }
  },
  "_source": ["content", "file.url", "path.virtual", "meta.title", "file.last_modified"]
}

Elasticsearch does not change the content of the field: analyzers affect only what goes into the inverted index, not the stored _source. That's probably why you are still seeing it.
It should not index that content, though.
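
You can check what actually gets indexed with the _analyze API, for example (sample text made up):

GET fullsiteindex/_analyze
{
  "analyzer": "scripttoken",
  "text": "Hello world <% globalheadoffice = true %> from here"
}

If the filter works, the tokens between <% and %> should be missing from the output.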

If you want to remove that from the source, try the ingest node feature and preprocess the document before it gets indexed.

I'm using FSCrawler to create this index. Can I still use ingest? I'm just looking for a good example now.

I also just ran the following query, where my query string is a variable defined within my script block, and it still returned 2 hits.

GET fullsiteindex/_search
{
  "query": {
    "query_string": {
      "query": "arListHeadOffice",
      "fields": ["content"]
    }
  },
  "_source": ["content", "file.url", "path.virtual", "meta.title", "file.last_modified"]
}

So the filter hasn't worked.

As a starter for ten, I think I need to use a 'split' processor. However, I don't think this is complete, because I haven't said what to do with the text once I've split it out. How do I expand this to discard the text in the separator? I'm guessing I need to introduce a 'foreach' somewhere in here, since I have no way of knowing how many script blocks there are or where they sit in a page:

PUT _ingest/pipeline/removescript
{
  "pipeline": {
    "description": "remove script",
    "processors": [
      {
        "split": {
          "field": "content",
          "separator": "(?:..)[^<%]+[^%>](?:..)"
        }
      }
    ]
  }
}

Maybe gsub would be a better fit? https://www.elastic.co/guide/en/elasticsearch/reference/current/gsub-processor.html

Hi dadoonet, yes, gsub looks like a better fit. I've changed the pipeline to:

PUT _ingest/pipeline/removescript
{
  "pipeline": {
    "description": "remove script",
    "processors": [
      {
        "gsub": {
          "field": "content",
          "pattern": "(?:..)[^<%]+[^%>](?:..)",
          "replacement": ""
        }
      }
    ]
  }
}

but I get the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "parse_exception",
        "reason": "[processors] required property is missing",
        "header": {
          "property_name": "processors"
        }
      }
    ],
    "type": "parse_exception",
    "reason": "[processors] required property is missing",
    "header": {
      "property_name": "processors"
    }
  },
  "status": 400
}

I have finally managed to create the pipeline; it turns out the body must not be wrapped in a "pipeline" object, as "description" and "processors" belong at the top level:

PUT _ingest/pipeline/removescript
{
  "description": "remove script",
  "processors": [
    {
      "gsub": {
        "field": "content",
        "pattern": "(?:..)[^<%]+[^%>](?:..)",
        "replacement": ""
      }
    }
  ]
}
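
As an aside, the simulate API can run the pipeline against a test document without indexing anything, which makes it easier to iterate on the pattern (sample content made up):

POST _ingest/pipeline/removescript/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "Hello world <% globalheadoffice = true %> from here"
      }
    }
  ]
}

The response shows each document as it would look after the gsub processor runs.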

My issue now is that when I try to create the index with:

PUT fullsiteindex?pipeline=removescript {......

it errors with illegal_request: unrecognized parameter [pipeline].

The pipeline must be applied per document:

PUT index/doc/1?pipeline=foo
{
}
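
With your index and pipeline, that would be something like (sample document made up):

PUT fullsiteindex/site/1?pipeline=removescript
{
  "content": "Hello world <% globalheadoffice = true %> from here"
}

Every document indexed with that parameter is passed through the removescript pipeline first.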

I'm using FSCrawler since we are talking about over 1,000 documents. How can I apply the pipeline when indexing this way?

In my FSCrawler _settings file, under elasticsearch, I added a reference to the pipeline:

"elasticsearch" : {
    "nodes" : [ {
      "pipeline" : "removescript",
      "host" : "localhost",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],

I deleted the _status file and re-ran the index. The text still appears in my document. I know my pattern is OK, as I've tried it via a number of online regex test tools.

Basically, in the content of some documents there may be text in the form Hello world <% globalheadoffice = true ..... %> from here, so I don't want anything that is between the <% %> markers, or the markers themselves, to appear. But I do want to see Hello world from here in the document.

Any chance you are using the FSCrawler REST endpoint?

Hi dadoonet,

I think the pipeline is working to a degree; that is to say, some of the text has been removed, but not all. If I use www.regextester.com, it replaces everything that I expect. If I use freeformatter.com, it doesn't, so I guess my next question is: what regex parser does Elasticsearch use/conform to?

The Java one according to https://github.com/elastic/elasticsearch/blob/master/modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/GsubProcessor.java

https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
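
For what it's worth, a common Java-compatible way to match everything from a <% to the next %> is a reluctant quantifier with the DOTALL flag, e.g. (?s)<%.*?%> — the (?s) lets . match newlines, so multi-line script blocks are covered too. As a gsub pattern that would be (untested against your documents, so treat it as a sketch):

PUT _ingest/pipeline/removescript
{
  "description": "remove script",
  "processors": [
    {
      "gsub": {
        "field": "content",
        "pattern": "(?s)<%.*?%>",
        "replacement": ""
      }
    }
  ]
}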

It is strange. Switching to a Java regex parser, if I change my pattern to (?:..)[^<%]+[^%>](?:..)+g, my test script gets me close, but not close enough. Going with this for now, I recreated my pipeline and index and repopulated with FSCrawler, but the text remained the same (i.e. not as per the Java regex test harness).

Not sure where to go from here.

You need a regex expert maybe, which I'm not! :pensive:
