Changing Analyzer behavior for hyphens - suggestions?

Hey guys,

After working with the ELK stack for a while now, we still have a very
annoying problem with the behavior of the standard analyzer: it splits
terms into tokens using hyphens or dots as delimiters.

For example, logsource:firewall-physical-management gets split into "firewall",
"physical" and "management". On the one hand that is nice, because if you search
for logsource:firewall you get all events that contain firewall as a token in
the logsource field.

The downside of this behaviour shows up when you do e.g. a "top 10" query
on a field in Kibana: every token is counted as a whole term and ranked by
its count:
top 10:

  1. firewall : 10
  2. physical : 10
  3. management: 10

instead of top 10:

  1. firewall-physical-management: 10

In the standard mapping from Logstash this is solved by adding a .raw
sub-field that is "not_analyzed", but the downside is that you end up with
two fields instead of one (even if it is a multi_field), and the usability
for Kibana users is not that great.
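
For reference, the multi_field that Logstash sets up looks roughly like this
(the field name is just an example here; the exact default template may differ
slightly):

"logsource" : {
  "type" : "string",
  "analyzer" : "standard",
  "fields" : {
    "raw" : { "type" : "string", "index" : "not_analyzed" }
  }
}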

So what we need is for logsource:firewall-physical-management to be
tokenized into "firewall-physical-management", "firewall", "physical" and
"management".

I tried this using the word_delimiter token filter with the following
settings:

"analysis" : {
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["lowercase", "asciifolding",
"my_worddelimiter"]
}
},
"filter" : {
"my_worddelimiter" : {
"type" : "word_delimiter",
"generate_word_parts": false,
"generate_number_parts": false,
"catenate_words": false,
"catenate_numbers": false,
"catenate_all": false,
"split_on_case_change": false,
"preserve_original": true,
"split_on_numerics": false,
"stem_english_possessive": true
}
}
}

But unfortunately this did not do the job.
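
For comparison, and untested: with the whitespace tokenizer, a word_delimiter
configuration that enables word parts next to the preserved original is what
should emit both the full term and its pieces; only the two relevant settings
are shown in this sketch:

"my_worddelimiter" : {
  "type" : "word_delimiter",
  "generate_word_parts" : true,
  "preserve_original" : true
}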

While researching I saw that some other people have a similar problem, but
apart from some replacement suggestions, no real solution was found.

If anyone has ideas on how to start working on this, I would be very
happy.

Thanks.


You are looking for a hyphen-aware tokenizer, like this?

Hyphen Tokenizer Demo · GitHub

It is in my plugin bundle:

GitHub - jprante/elasticsearch-plugin-bundle: A bundle of useful Elasticsearch plugins

Jörg


Hi,

thanks for the response and this awesome plugin bundle (especially nice for
me as a German).

Unfortunately the hyphen analyzer plugin did not do the job the way I
wanted it to.

The "hyphen" analyzer does something similar to the whitespace analyzer:
it just does not split on hyphens and instead treats them as ALPHANUM
characters (at least that is what I think right now).

So the term "this-is-a-test" gets tokenized into "this-is-a-test", which is
nice behaviour, but in order to allow a full-text search on this field it
should get tokenized into "this-is-a-test", "this", "is", "a" and "test",
as I wrote before.

I think maybe abusing the word_delimiter token filter could do the job,
because there is an option "preserve_original".

Unfortunately, if you configure the filter like this:

PUT /logstash-2014.11.20
{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "wordtest" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : [ "lowercase", "word" ]
        }
      },
      "filter" : {
        "word" : {
          "type" : "word_delimiter",
          "generate_word_parts" : false,
          "generate_number_parts" : false,
          "catenate_words" : false,
          "catenate_numbers" : false,
          "catenate_all" : false,
          "split_on_case_change" : false,
          "preserve_original" : true,
          "split_on_numerics" : false,
          "stem_english_possessive" : true
        }
      }
    }
  }
}

and run an analyze test:

curl -XGET 'localhost:9200/logstash-2014.11.20/_analyze?filters=word' -d 'this-is-a-test'

the response is this:

{
  "tokens" : [
    { "token" : "this", "start_offset" : 0,  "end_offset" : 4,  "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "is",   "start_offset" : 5,  "end_offset" : 7,  "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "a",    "start_offset" : 8,  "end_offset" : 9,  "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "test", "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 4 }
  ]
}

which shows that it produced every part except the original term, which
makes me wonder whether the preserve_original setting is working at all.
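
One thing to note: the tokens above look like plain standard-tokenizer output,
i.e. the text is split on the hyphens before the word_delimiter filter ever
sees it, because the _analyze call references neither the "wordtest" analyzer
nor the whitespace tokenizer. A sanity check against the full custom chain
defined in the settings above could look like one of these (both assume the
index from the PUT request and are untested sketches):

curl -XGET 'localhost:9200/logstash-2014.11.20/_analyze?analyzer=wordtest' -d 'this-is-a-test'

curl -XGET 'localhost:9200/logstash-2014.11.20/_analyze?tokenizer=whitespace&filters=lowercase,word' -d 'this-is-a-test'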

Any ideas on this?


The whitespace tokenizer has the problem that punctuation is not ignored. I
find that the word_delimiter filter does not work at all with whitespace,
only with the keyword tokenizer together with massive pattern matching,
which is complex and expensive :(

Therefore I took the classic tokenizer and generalized the hyphen rules in
the grammar. The tokenizer "hyphen" and the filter "hyphen" are two separate
routines: the tokenizer "hyphen" keeps hyphenated words together and handles
punctuation correctly, while the filter "hyphen" adds combinations to the
original form.

The main point is to add combinations of dehyphenated forms so they can be
searched.

Single words are only taken into account when the word is positioned at the
edge.

For example, the phrase "der-die-das" should be indexed in the following
forms:

"der-die-das", "derdiedas", "das", "derdie", "derdie-das", "die-das", "der"

Jörg

On Thu, Nov 20, 2014 at 9:29 AM, horst knete baduncle23@hotmail.de wrote:

So the term "this-is-a-test" get tokenized into "this-is-a-test" which is
nice behaviour, but in order to make an "full-text-search" on this field it
should get tokenized into "this-is-a-test", "this", "is", "a" and "test" as
i wrote before.


I think our solution now is to just have Elasticsearch replace all the
"non-letter" characters with an "_".

"char_filter" : {
"replace" : {
"type" : "mapping",
"mappings": ["\.=>", "\u2010=>",
"'''=>", "\:=>", "\u0020=>", "\u005C=>", "\u0028=>",
"\u0029=>
", "\u0026=>", "\u002F=>", "\u002D=>", "\u003F=>",
"\u003D=>_"]
}
},

This leads to the terms no longer being split into useless tokens by the
standard analyzer.
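
Next to the "char_filter" block above, an analyzer has to reference it for the
replacement to take effect; a minimal sketch of how it might be wired up (the
analyzer name is made up here, and the standard tokenizer plus lowercase are
only an assumption for the rest of the chain):

"analyzer" : {
  "replace_analyzer" : {
    "type" : "custom",
    "char_filter" : [ "replace" ],
    "tokenizer" : "standard",
    "filter" : [ "lowercase" ]
  }
}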

The downside of this solution is that some URLs or Windows paths now look
very ugly to the human eye, e.g.:

http://g.ceipmsn.com/8SE/411?MI=B9DC2E6D07184453A1EFC4E765A16D30-0&LV=3.0.131.0&OS=6.1.7601&AG=1217
=>

http___g_ceipmsn_com_8se_411_mi_b9dc2e6d07184453a1efc4e765a16d30_0_lv_3_0_131_0_os_6_1_7601_ag_1217

The good thing compared to not_analyzed is that if I search for url:8se,
the search will return the events with this URL in them.

I think this is not a perfect solution, but rather a good workaround until
Lucene gives us better analyzer types to work with.

Thanks for sharing your experience so far!

cheers



This is what I am looking for. How can I achieve it?

Heh, that's exactly what I'm looking for as well. Any solutions?