Skip html tags on indexing


(Alexandr Yakubenko) #1

I have the following custom analyzer for the description field of my documents:

/_settings :

      "analysis" : {
        "analyzer" : {
          "snowball_without_tags" : {
            "char_filter" : [ "html_strip" ],
            "tokenizer" : "standard",
            "type" : "custom",
            "filter" : [ "standard", "lowercase", "stop", "snowball" ]
          }
        }
      },

/_mapping :

      "description" : {
        "type" : "string",
        "analyzer" : "snowball_without_tags"
      },

I have the following document:

{
"id":"426923832-32907_jobs_to_careers",
"position":"Executives - Be Your Own Boss and Own A Franchise (32907)",
"description":"

<span style=\"font-size: x-large;\">Be Your Own Boss!

<span style=\"font-size: x-large;\">Over 30 starting under $10,000. Browse the top franchises for 2013 for food, tech, home-based, and many more.

<span style=\"font-size: x-large;\">We offer one of the largest directories of Franchise and Business Opportunities on the internet.

<span style=\"font-size: x-large;\">We work with the big guys you've heard of and the new guys that everyone will be talking about soon, making Gator your one-stop-shop for business ownership information and research. We also have hundreds of articles about franchising to help educate you on buying a business.

"
}

I'm trying to search for this document with the query 'span':

curl -X GET 'http://localhost:9200/jobs/_search?size=0' -d '{"query":{"query_string":{"query":"span"}},"filter":{"term":{"id":"426923832-32907_jobs_to_careers"}}}'

The document is found, because the query matches the tag in the description:

{"took":9,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.0,"hits":[]}}

Could you help me configure my analyzer so that tags are skipped in the analyzed text?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7621a17e-31b6-4b2c-800d-019500188354%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Binh Ly-2) #2

The html_strip is applied to your description field so it will work only if
you search in that field:

{
  "query": {
    "query_string": {
      "query": "description:span"
    }
  }
}
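
To see why the field matters, here is a rough Python illustration. It is not Elasticsearch's implementation: the two regular expressions are crude stand-ins for the html_strip char filter and the standard tokenizer, the function names are made up for this example, and stop-word removal and stemming are left out. It only shows that the term "span" survives when markup is not stripped before tokenizing, and disappears when it is.

```python
import re

# Hypothetical stand-ins, not Elasticsearch code.
TAG_RE = re.compile(r"<[^>]*>")        # crude substitute for the html_strip char filter
WORD_RE = re.compile(r"[A-Za-z0-9]+")  # crude substitute for the standard tokenizer


def analyze_plain(text):
    """Tokenize and lowercase without removing markup."""
    return [token.lower() for token in WORD_RE.findall(text)]


def analyze_strip_html(text):
    """Replace tags with a space first, then tokenize and lowercase."""
    return analyze_plain(TAG_RE.sub(" ", text))


# First line of the description from the original question.
sample = '<span style="font-size: x-large;">Be Your Own Boss!'

plain_tokens = analyze_plain(sample)
stripped_tokens = analyze_strip_html(sample)

print("span" in plain_tokens)     # True: the tag name was indexed as a term
print("span" in stripped_tokens)  # False: the tag was removed before tokenizing
print(stripped_tokens)            # ['be', 'your', 'own', 'boss']
```

A field analyzed like analyze_plain will match a search for "span", while a field analyzed like analyze_strip_html will not.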

Your example search looks in the _all field by default which is using the
standard analyzer. You can apply your analyzer to the _all field if you
like after which you will get your desired results:

"mappings": {
  "doc": {
    "_all": {
      "type": "string",
      "analyzer": "snowball_without_tags"
    },
    "properties": {
    }
  }
}
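
For reference, the pieces from this thread could be put together roughly as follows. This is only a sketch in Python that builds and prints a JSON body; it does not talk to a cluster. It assumes the analyzer settings and the mapping are supplied together when the index is created, that the type is named "doc" as in the snippet above, and that description is the only field that needs an explicit entry. The helper name build_index_body is made up for this example. An existing index would presumably have to be recreated and reindexed for the change to affect documents already stored.

```python
import json

# Analyzer definition copied from the original question.
ANALYSIS = {
    "analyzer": {
        "snowball_without_tags": {
            "type": "custom",
            "char_filter": ["html_strip"],
            "tokenizer": "standard",
            "filter": ["standard", "lowercase", "stop", "snowball"],
        }
    }
}


def build_index_body(type_name="doc"):
    """Combine the analyzer settings with a mapping that uses the analyzer
    for both the _all field and the description field.

    build_index_body and type_name are illustrative names, not part of any API.
    """
    field = {"type": "string", "analyzer": "snowball_without_tags"}
    return {
        "settings": {"analysis": ANALYSIS},
        "mappings": {
            type_name: {
                "_all": dict(field),
                "properties": {"description": dict(field)},
            }
        },
    }


body = build_index_body()
print(json.dumps(body, indent=2))
```

With a mapping like this, the unqualified query "span" from the question would be analyzed against an _all field that has had its tags stripped, so it should no longer match on markup.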



(Alexandr Yakubenko) #3

Got it.

Your answer is very helpful. Thank you very much.

On Thursday, 27 March 2014, 15:59:03 UTC+3, Binh Ly wrote:

The html_strip is applied to your description field so it will work only
if you search in that field:

{
"query": {
"query_string": {
"query": "description:span"
}
}
}

Your example search looks in the _all field by default which is using the
standard analyzer. You can apply your analyzer to the _all field if you
like after which you will get your desired results:

"mappings": {
"doc": {
"_all": {
"type": "string",
"analyzer": "snowball_without_tags"
},
"properties": {
}
}
}


