Storing the html stripped version of a document in elasticsearch


(Melvyn Peignon) #1

Hi!

I have a piece of text, let's say:

"<p><br/>台灣人,奧運直播,使用PPStream,(PPS網路電視),觀看同步奧運實況</b>!"

I managed to index this text without the HTML tags by using the "html_strip" char filter. Now I'm trying to also store this text without the HTML tags:

PUT test
{
  "settings" : {
    "index" : {
        "number_of_shards" : 1, 
        "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "ch_analyzer": {
          "tokenizer": "icu_tokenizer",
          "char_filter":  [ "html_strip" ]
        }
      }
    }
  },
  "mappings": {
    "qa": {
      "properties": {
        "comment_desc": {
          "type":     "text",
          "analyzer": "ch_analyzer"
        },
        "article_title": {
          "type":     "text",
          "analyzer": "ch_analyzer"
        },
        "article_desc": {
          "type":     "text",
          "analyzer": "ch_analyzer"
        }
      }
    }, 
    "sport": {
      "properties": {
        "title": {
          "type":     "text",
          "analyzer": "ch_analyzer"
        },
        "content": {
          "type":     "text",
          "analyzer": "ch_analyzer"
        }
      }
    }
  }
}

These are my mappings and settings. What should I change to be able to store this text without its HTML tags? Is that possible, or should I preprocess my documents beforehand?


(Mike Barretta) #2

@Melvyn_Peignon as far as I know, the html_strip char filter is indeed removing those tags prior to indexing, so that you are not storing them.

Do you see those HTML tags coming back from a query?


(Melvyn Peignon) #3

@Mike.Barretta from what I understand, I am not indexing the HTML tags in my inverted index, but I am still storing the documents with the HTML tags. Here is a small example that is easy to reproduce:

PUT test
{
  "settings" : {
    "index" : {
        "number_of_shards" : 1, 
        "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "ch_analyzer": {
          "tokenizer": "standard",
          "char_filter":  [ "html_strip" ]
        }
      }
    }
  },
  "mappings": {
    "sport": {
      "properties": {
        "title": {
          "type":     "text",
          "analyzer": "ch_analyzer",
          "store": true
        }
      }
    }
  }
}

Now you can add a document that contains some HTML tags:

PUT test/sport/0
{
  "title": "<div>A little test just for testing </div>"
}

I cannot search this document using HTML tags (which is great):

GET test/sport/_search
{
  "query": {
    "match": {
      "title": "div"
    }
  }
}

This returns:

{
  "took": 27,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

But I cannot retrieve my document without the HTML:

GET test/sport/_search
{
  "stored_fields": "title" 
}

This returns:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "sport",
        "_id": "0",
        "_score": 1,
        "fields": {
          "title": [
            "<div>A little test just for testing </div>"
          ]
        }
      }
    ]
  }
}

You can see that the document still contains the HTML tags.


(Mike Barretta) #4

@Melvyn_Peignon I see, I'm sorry - missed the distinction you were making.

So no, if you elect to store the field ("store": true, or by default in _source if it is not disabled), Elasticsearch stores the "raw" value, not the value produced by the analyzer. The html_strip char filter only affects the tokens that go into the inverted index. If you want the raw stored value to not include HTML tags, you'll need to remove them before you put the document into Elasticsearch. That said, you could probably hack together a scripted field (which can be stored) that removes the tags, but I wouldn't recommend it.
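
For the "remove them before you put them into Elasticsearch" route, here is a minimal sketch of client-side preprocessing in Python, using only the standard library. The names TagStripper and strip_html are made up for this example and are not part of any Elasticsearch client. It will also not behave exactly like the html_strip char filter (for instance, it keeps the text inside <script> and <style> elements), so treat it as a starting point rather than a drop-in equivalent.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps only the text content of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into plain text
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are never collected.
        # Note: text inside <script>/<style> also arrives here and is kept.
        self.parts.append(data)


def strip_html(raw):
    """Return raw with HTML tags removed and outer whitespace trimmed."""
    stripper = TagStripper()
    stripper.feed(raw)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.parts).strip()


# Build the document body with the cleaned value before sending it to
# Elasticsearch (the indexing call itself is left out of this sketch).
doc = {"title": strip_html("<div>A little test just for testing </div>")}
print(doc)
```

The resulting doc can then be indexed as in the PUT test/sport/0 request above, so that both _source and the stored field hold the tag-free text.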

