Sending HTML through REST API for html_strip


(Jennifer Cumming) #1

I'm trying to set up a mapping that uses the html_strip character filter in a custom analyzer to strip HTML tags before indexing. Setting up the mapping was relatively easy, but the only way I can get the filter to work when indexing data is to URL-encode the HTML. This just seems wrong. Surely I should be able to send an HTML string in the request body of my index request?

These are my index settings:

{
    "settings": {
      "index": {
        "analysis": {
          "analyzer": {
            "html_analyzer": {
              "filter": [
                "standard"
              ],
              "char_filter": [
                "html_strip"
              ],
              "type":"custom",
              "tokenizer" : "standard"
            }
          }
        }
      }
    }   
}

And this is my type mapping:

{
  "test_type": {
    "mappings": {
      "test_type": {
        "dynamic": "strict",
        "properties": {
          "createtime": {
            "type": "long"
          },
          "updatetime": {
            "type": "long"
          },
          "title": {
            "type": "string"
          },
          "description": {
            "type": "string"
          },
          "label": {
            "type": "string"
          },
          "notes": {
            "properties": {
              "author": {
                "properties": {
                  "firstname": {
                    "type": "string"
                  },
                  "secondname": {
                    "type": "string"
                  }
                }
              },
              "category": {
                "type": "string"
              },
              "createdate": {
                "type": "date",
                "format": "dateOptionalTime"
              },
              "detail": {
                "type": "string",
                "analyzer": "html_analyzer"
              },
              "type": {
                "type": "string"
              }
            }
          }
        }
      }
    }
  }
}

If I use the _analyze endpoint to test the analyzer, it works:

GET http://localhost:9200/test2/_analyze?analyzer=html_analyzer&text=this+<b>is</b>+a+note

{
  "tokens": [
    {
      "token": "this",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "is",
      "start_offset": 8,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "a",
      "start_offset": 15,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "note",
      "start_offset": 17,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
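
The offsets in this response (8 to 14 for "is") are consistent with the analyzed text being the HTML string from the document below rather than plain text. As a minimal sketch, assuming that is the case, this is roughly what a REST client does to the `text` query parameter before sending it. The host, index and analyzer names are taken from the request above; nothing is actually sent.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumption: the raw text typed into the client is the HTML snippet.
raw_text = "this <b>is</b> a note"

# urlencode percent-encodes reserved characters and turns spaces into '+'.
query = urlencode({"analyzer": "html_analyzer", "text": raw_text})
url = "http://localhost:9200/test2/_analyze?" + query
print(url)

# A server decodes the query string before using it, so the analyzer
# receives the original markup, not the percent-encoded form.
decoded_text = parse_qs(urlsplit(url).query)["text"][0]
print(decoded_text)
```

Percent-encoding here is a property of URLs, not of the analyzer: the encoded form exists only in transit and is undone before the text is analyzed.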

However, if I then index the same string in a document:

PUT http://localhost:9200/test2/test_type/1
{"title": "title1", "_notes": [{"type": "test", "detail": "this <b>is</b> a note"}], "_updatetime": 1438007564786}

And search for 'b':

GET http://localhost:9200/test2/_search?q=b

Then I get a hit on the document, so the tag name appears to have been indexed as a term.

The only way I've managed to get this to work is to send the HTML string URL-encoded, i.e.:

PUT http://localhost:9200/test2/test_type/1
{"title": "title1", "_notes": [{"type": "test", "detail": "This %3Cb%3Eis%3C%5Cb%3E a note"}], "_updatetime": 1438007564800}

(Jennifer Cumming) #2

Just realised that Postman may be URL-encoding the string I'm sending to the _analyze API, so that would make it consistent, but it's still strange that I have to URL-encode the string when it is part of a JSON request body.
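
To illustrate the difference, here is a minimal Python sketch of how the same document serialises as a JSON body. JSON only requires escaping of double quotes, backslashes and control characters, so the angle brackets travel as-is, whereas a percent-encoded string inside JSON is literal text unless something explicitly decodes it. The field names and values are copied from the requests above; nothing is sent to Elasticsearch.

```python
import json
from urllib.parse import unquote

doc = {
    "title": "title1",
    "_notes": [{"type": "test", "detail": "this <b>is</b> a note"}],
    "_updatetime": 1438007564786,
}

# json.dumps leaves '<', '>' and '/' untouched; JSON does not require escaping them.
body = json.dumps(doc)
print(body)

# Round-tripping through a JSON parser returns the markup unchanged.
detail = json.loads(body)["_notes"][0]["detail"]
print(detail)

# A percent-encoded value inside JSON stays percent-encoded after parsing.
encoded_detail = "This %3Cb%3Eis%3C%5Cb%3E a note"
parsed = json.loads(json.dumps({"detail": encoded_detail}))["detail"]
print(parsed)

# Decoding it by hand shows that %5C is a backslash, not a forward slash (%2F).
print(unquote(parsed))
```

If the URL-encoded value is stored as written, the field would contain fragments such as `%3Cb%3E` rather than a `<b>` tag, which is a likely reason the search for 'b' stops matching in that case; this is an inference from the encoding, not something confirmed in the thread.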

