HTML Filter - How do I use it in a search?


(blasearch) #1

Hola,

I'm trying, unsuccessfully, to use an analyzer with an HTML character filter, and could use some help. I'm pretty sure I'm missing something fundamental here.

Here is my html character filter:

		"char_filter" : {
			"html_char_filter" : {
				"type" : "html_strip"
			}
		},

Here is my analyzer:

		"analyzer" : {
			"english_stem_analyzer" : {
				"tokenizer" : "standard",
				"filter" : [
					"stem_english_possessive_filter",
					"stem_english_filter",
					"lowercase",
					"english_stop_filter",
					"asciifolding"
				],
				"char_filter" : [
					"html_char_filter"
				]
			}
		}

I applied it to a field like so in the mapping:

			"content" : {
				"type": "text",
				"analyzer" : "english_stem_analyzer",
				"search_analyzer" : "english_stem_analyzer"
			}

When I run a simple match_all query, the content field in the returned _source still contains HTML.

Why?

How do I actually strip the HTML out of the query response, or out of the field as it is indexed?

Any light you could shed is very appreciated.


(Ivan Brusic) #2

What do you mean when you say the content field still has HTML in it? Are
you saying that when you search for HTML, the document/field matches, or
simply that the response contains HTML? If it is the latter, then the
behavior is expected, since the document source is preserved as submitted.
The analysis chain only modifies what is actually indexed.

Since you are doing a match_all, which means there are no query terms, you
are probably looking for the response itself to be modified. There is no
good way to get the analyzed content back. Highlighting is the most commonly
used workaround.
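
One way to see what the analysis chain actually produces, as opposed to what is kept in _source, is the _analyze API. Below is a minimal Python sketch that only builds the request for it. The index name "my_index" and the sample text are placeholders, not something from this thread, and the HTTP call itself is left to whatever client you use.

```python
import json


def build_analyze_request(index, analyzer, text):
    """Build the path and JSON body for an _analyze call on a custom analyzer."""
    path = f"/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text})
    return path, body


# "my_index" is a hypothetical index name; the analyzer name is the one
# defined in the first post.
path, body = build_analyze_request(
    "my_index",
    "english_stem_analyzer",
    "<p>The <b>Foxes</b> jumped</p>",
)
print("POST", path)
print(body)
```

Sending that body to the printed path should return the list of tokens the analyzer emits. If the html_strip char filter is wired up correctly, the tags should not show up among them, even though the stored _source is untouched.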


(blasearch) #3

The response contains HTML in the _source.content field.

I'm using nested bool (should) queries with multiple term and match queries in them.

Can I use highlighting to get the value of the content field back with the HTML characters stripped out of it?

Thanks for any additional context and/or information!


(blasearch) #4

Ping ^^


(Ivan Brusic) #5

I have never used highlighting to return analyzed text, but that is what
you should try. I didn't respond earlier because I was hoping others would
chime in. An easy way to find out is to try it yourself!

I find it easier to index what I want and not have Elasticsearch do any
data munging.
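
As a rough illustration of that approach, here is a minimal Python sketch that strips the markup on the client before the document is sent for indexing. It uses only the standard library. The "content_text" field name is hypothetical, and this is not a reimplementation of the html_strip char filter, just a simple text extractor.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join the text nodes and collapse runs of whitespace
        return " ".join("".join(self._chunks).split())


def strip_html(markup):
    """Return the plain text of an HTML fragment, with tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


# Keep the original markup and add a plain-text copy next to it.
# "content_text" is a hypothetical field name.
doc = {"content": "<p>Hello <b>world</b> &amp; friends</p>"}
doc["content_text"] = strip_html(doc["content"])
print(doc["content_text"])  # Hello world & friends
```

Two caveats with this sketch: text inside script or style elements is kept as ordinary text, and adjacent block elements with no whitespace between them get run together, so real content may need a bit more handling.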

