How do I assign a second analyzer to my mapping

Hi all,
Here is my index so far:

PUT fullsiteindex
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2,
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "vbscript": {
          "type": "pattern_replace",
          "pattern": "<\\%*\\%>",
          "replacement": ""
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stop",
            "english_stemmer"
          ]
        },
        "scripttoken": {
          "tokenizer": "pattern",
          "filter": ["vbscript"]
        }
      }
    }
  },
  "mappings": {
    "site": {
      "properties": {
        "data": {
          "type": "text",
          "analyzer": "english",
          "search_analyzer": "english"
        }
      }
    }
  }
}

How do I add my scripttoken analyzer to the mapping, since the data field needs both analyzers assigned?

You can do something like:

{
   "settings":{
      "number_of_shards":3,
      "number_of_replicas":2,
      "analysis":{
         "filter":{
            "my_stop":{
               "type":"stop",
               "stopwords":"_english_"
            },
            "english_stemmer":{
               "type":"stemmer",
               "language":"english"
            },
            "vbscript":{
               "type":"pattern_replace",
               "pattern":"<\\%*\\%>",
               "replacement":""
            }
         },
         "analyzer":{
            "english":{
               "tokenizer":"standard",
               "filter":[
                  "lowercase",
                  "my_stop",
                  "english_stemmer"
               ]
            },
            "scripttoken":{
               "tokenizer":"pattern",
               "filter":[
                  "vbscript"
               ]
            }
         }
      }
   },
   "mappings":{
      "site":{
         "properties":{
            "data":{
               "type":"text",
               "analyzer":"english",
               "search_analyzer":"english",
               "fields": {
                 "foo": {
                   "type": "text",
                   "analyzer": "scripttoken"
                 }
               }
            }
         }
      }
   }
}
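
Note that the second analyzer only applies to the sub-field, so you have to query it explicitly. For example ("foo" is just a placeholder name from the example above):

GET fullsiteindex/_search
{
  "query": {
    "match": {
      "data.foo": "my query string"
    }
  }
}

The same value from _source is indexed twice: once into data with the english analyzer, and once into data.foo with the scripttoken analyzer.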

I changed my filter to be:

"vbscript": {
    	              "type" : "pattern_replace",
    	              "pattern": "(?:..)[^<%]+[^%>](?:..)",
    	              "replacement": ""
    	            }

as I realised the regex was wrong. I made the following change to the mapping as per your example; I called my field "content", since that is the name of the field in my _source. I then re-indexed.
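
So the mapping now looks roughly like this (my field "content" in place of "data" from your example):

"content": {
  "type": "text",
  "analyzer": "english",
  "search_analyzer": "english",
  "fields": {
    "foo": {
      "type": "text",
      "analyzer": "scripttoken"
    }
  }
}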

Using the following query still returns the script in the content field, which I did not expect:

GET fullsiteindex/_search
{
  "query": {
    "query_string": {
      "query": "my query string",
      "fields": ["content"]
    }
  },
  "_source": ["content", "file.url", "path.virtual", "meta.title", "file.last_modified"]
}

Elasticsearch does not change the content of the field: analyzers affect only what goes into the inverted index, not the stored _source. That's probably why you are still seeing it.
It should not index that content, though.
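
You can check what actually gets indexed with the _analyze API, for example (sample text made up):

GET fullsiteindex/_analyze
{
  "analyzer": "scripttoken",
  "text": "Hello world <% globalheadoffice = true %> from here"
}

If the filter works, the tokens between <% and %> should be missing from the output.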

If you want to remove that from the source, try the ingest node feature and preprocess the document before it gets indexed.

I'm using FSCrawler to create this index. Can I still use ingest? I'm just looking for a good example now.

I also just ran the following query, where my query string is a variable defined within my script block, and it still returned 2 hits.

GET fullsiteindex/_search
{
  "query": {
    "query_string": {
      "query": "arListHeadOffice",
      "fields": ["content"]
    }
  },
  "_source": ["content", "file.url", "path.virtual", "meta.title", "file.last_modified"]
}

So the filter hasn't worked.

As a starter for ten, I think I need to use a 'split' processor. However, I don't think this is complete, because I haven't said what to do with the text once I've split it out. How do I expand this to discard the text in the separator? I'm guessing I need to introduce a 'foreach' somewhere in here, since I have no way of knowing how many script blocks there are or where they sit in a page:

PUT _ingest/pipeline/removescript
{
  "pipeline": {
    "description": "remove script",
    "processors": [
      {
        "split": {
          "field": "content",
          "separator": "(?:..)[^<%]+[^%>](?:..)"
        }
      }
    ]
  }
}

Maybe gsub would be a better fit? https://www.elastic.co/guide/en/elasticsearch/reference/current/gsub-processor.html

Hi dadoonet, yes, gsub looks like a better fit. I've changed the pipeline to:

PUT _ingest/pipeline/removescript
{
  "pipeline": {
    "description": "remove script",
    "processors": [
      {
        "gsub": {
          "field": "content",
          "pattern": "(?:..)[^<%]+[^%>](?:..)",
          "replacement": ""
        }
      }
    ]
  }
}

but I get the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "parse_exception",
        "reason": "[processors] required property is missing",
        "header": {
          "property_name": "processors"
        }
      }
    ],
    "type": "parse_exception",
    "reason": "[processors] required property is missing",
    "header": {
      "property_name": "processors"
    }
  },
  "status": 400
}

I have finally managed to create the pipeline; it turns out the body must not be wrapped in a "pipeline" object, as "description" and "processors" belong at the top level:

PUT _ingest/pipeline/removescript
{
  "description": "remove script",
  "processors": [
    {
      "gsub": {
        "field": "content",
        "pattern": "(?:..)[^<%]+[^%>](?:..)",
        "replacement": ""
      }
    }
  ]
}
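
As an aside, the simulate API can run the pipeline against a test document without indexing anything, which makes it easier to iterate on the pattern (sample content made up):

POST _ingest/pipeline/removescript/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "Hello world <% globalheadoffice = true %> from here"
      }
    }
  ]
}

The response shows each document as it would look after the gsub processor runs.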

My issue now is that when I try to create the index with:

PUT fullsiteindex?pipeline=removescript {......

it errors with illegal_request: unrecognized parameter [pipeline].

The pipeline must be applied per document:

PUT index/doc/1?pipeline=foo
{
}
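
With your index and pipeline, that would be something like (sample document made up):

PUT fullsiteindex/site/1?pipeline=removescript
{
  "content": "Hello world <% globalheadoffice = true %> from here"
}

Every document indexed with that parameter is passed through the removescript pipeline first.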

I'm using FSCrawler since we are talking about over 1,000 documents. How can I apply the pipeline when indexing this way?

In my FSCrawler _settings file, under elasticsearch, I added a reference to the pipeline:

"elasticsearch" : {
    "nodes" : [ {
      "pipeline" : "removescript",
      "host" : "localhost",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],

I deleted the _status file and re-ran the index. The text still appears in my document. I know my pattern is OK, as I've tried it via a number of online regex test tools.

Basically, in the content of some documents there may be text in the form Hello world <% globalheadoffice = true ..... %> from here, so I don't want anything that is between the <% %> markers, or the markers themselves, to appear. But I do want to see Hello world from here in the document.

Any chance you are using the FSCrawler REST endpoint?

Hi dadoonet,

I think the pipeline is working to a degree; that is to say, some of the text has been removed, but not all. If I use www.regextester.com, it replaces everything that I expect. If I use freeformatter.com, it doesn't, so I guess my next question is: what regex parser does Elasticsearch use/conform to?

The Java one according to https://github.com/elastic/elasticsearch/blob/master/modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/GsubProcessor.java

https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
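
For what it's worth, a common Java-compatible way to match everything from a <% to the next %> is a reluctant quantifier with the DOTALL flag, e.g. (?s)<%.*?%> — the (?s) lets . match newlines, so multi-line script blocks are covered too. As a gsub pattern that would be (untested against your documents, so treat it as a sketch):

PUT _ingest/pipeline/removescript
{
  "description": "remove script",
  "processors": [
    {
      "gsub": {
        "field": "content",
        "pattern": "(?s)<%.*?%>",
        "replacement": ""
      }
    }
  ]
}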

It is strange. Switching to a Java regex parser, if I change my pattern to (?:..)[^<%]+[^%>](?:..)+g, my test script gets me close, but not close enough. Going with this for now, I recreated my pipeline and index and repopulated with FSCrawler, but the text remained the same (i.e. not as per the Java regex test harness).

Not sure where to go from here.

You need a regex expert maybe, which I'm not! :pensive:
