Tokenizer for HTML tags and attributes


What I need to achieve is better indexing of HTML documents.

I started with a first analyzer that strips HTML characters and works with
text only, but almost half of my searches will go through HTML tags (and,
beyond that, some specific HTML attributes). For example, I have an index
with a content field that stores HTML page content, and a search might look like name="generator"
http-equiv="Wordpress 3.1"
or it might look like

So I wonder if there is a way to create a tokenizer that would use only
HTML tags and split them into pieces (splitting on spaces is OK), so that we
get something like 'html', 'name="generator"', 'src="jquery.js"'. All I was
able to achieve so far is tokenizing each tag as a single token (with all of
its attributes in it). Obviously this won't work...
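
To make the desired output concrete, here is a rough sketch in plain Python (outside Elasticsearch) of the splitting I have in mind. The function name tokenize_html_tags and both regexes are only illustrative assumptions, not an existing Elasticsearch tokenizer, and the approach is naive: it ignores text between tags, skips closing tags and comments, and would break on attribute values that contain '<' or '>'.

```python
import re

# Opening tag: captures the tag name and the raw attribute string.
# Closing tags (</x>) and comments (<!-- -->) do not match, because the
# character after '<' must be a letter.
TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b([^<>]*)>")

# One attribute: name="value", name='value', name=value, or a bare name.
# Quoted values may contain spaces and are kept as part of the same token.
ATTR_RE = re.compile(
    r"""[^\s"'=<>/]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'<>]+))?"""
)


def tokenize_html_tags(html):
    """Return tag names and attribute pairs as separate tokens."""
    tokens = []
    for match in TAG_RE.finditer(html):
        tokens.append(match.group(1).lower())           # tag name, e.g. 'html'
        tokens.extend(ATTR_RE.findall(match.group(2)))  # e.g. 'src="jquery.js"'
    return tokens


if __name__ == "__main__":
    page = '<html><meta name="generator"><script src="jquery.js"></script></html>'
    print(tokenize_html_tags(page))
    # ['html', 'meta', 'name="generator"', 'script', 'src="jquery.js"']
```

What I am looking for is a way to get this kind of token stream from an analyzer inside the index itself, so that a query for name="generator" matches a single token.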

Will be glad to hear any suggestions.
