Tokenizer for html tags attributes

Jack_5 · September 17, 2014, 6:56pm

Hi.

What I need to achieve is better indexing of HTML documents.

I started with an analyzer that strips HTML characters and works with text
only, but almost half of my searches will be through HTML tags (and, more
precisely, some specific HTML attributes). For example, I have an index with
a content field that stores the HTML page content, and a search might look like
name="generator" http-equiv="Wordpress 3.1" or it might look like

So I wonder if there is a way to create a tokenizer that would use only
HTML tags and split them into pieces (splitting on spaces is OK), so that we
get something like 'html', 'name="generator"', 'src="jquery.js"'. All I was
able to achieve so far is tokenizing each tag as a single token (with all its
params in it). Obviously this won't work...
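
To make the desired output concrete, here is a rough Python sketch of the splitting described above. It is not an Elasticsearch tokenizer, only an illustration of the tokens I am after; the function name tag_tokens and both regular expressions are made up for this example, and the regexes are a simplification that will not handle every malformed page.

```python
import re

# Opening tag: captures the tag name, then the raw attribute text up to '>'.
# Quoted values are consumed whole so a '>' inside quotes does not end the tag.
TAG_RE = re.compile(r'''<([a-zA-Z][a-zA-Z0-9]*)((?:"[^"]*"|'[^']*'|[^>"'])*)>''')

# One attribute: a name, optionally followed by = and a quoted or bare value.
ATTR_RE = re.compile(r'''([^\s"'=<>/]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+))?''')


def tag_tokens(html):
    """Return tag names and name=value attribute pairs, ignoring text content."""
    tokens = []
    for tag in TAG_RE.finditer(html):
        # Tag name becomes its own token, lowercased for consistent matching.
        tokens.append(tag.group(1).lower())
        for attr in ATTR_RE.finditer(tag.group(2)):
            name, value = attr.group(1).lower(), attr.group(2)
            # Valueless attributes (e.g. "async") are kept as the bare name.
            tokens.append(name if value is None else f"{name}={value}")
    return tokens


if __name__ == "__main__":
    page = '<html><meta name="generator" content="WordPress 3.1"><script src="jquery.js"></script></html>'
    for token in tag_tokens(page):
        print(token)
```

Closing tags and the text between tags are skipped on purpose, since only tags and attributes matter for these searches. Each attribute stays together with its value as one token, even when the value contains a space.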

Will be glad to hear any suggestions.

