Hi,
thanks for the response and this awesome plugin bundle (especially for me as a German).
Unfortunately, the hyphen analyzer plugin didn't do the job the way I wanted it to.
The "hyphen-analyzer" does something similar like the whitespace analyzer -
it just dont split on hyphen and instead see them as ALPHANUM characters
(at least that is what i think right now).
So the term "this-is-a-test" get tokenized into "this-is-a-test" which is
nice behaviour, but in order to make an "full-text-search" on this field it
should get tokenized into "this-is-a-test", "this", "is", "a" and "test" as
i wrote before.
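For reference, I tested it roughly like this (assuming the bundle registers the analyzer under the name "hyphen", as the demo suggests - the name may differ):

curl -XGET 'localhost:9200/logstash-2014.11.20/_analyze?analyzer=hyphen' -d 'this-is-a-test'

which gave me back just the single token "this-is-a-test".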
I think maybe abusing the word_delimiter token filter could do the job, because there is an option "preserve_original".
Unfortunately, if you adjust the filter like this:
PUT /logstash-2014.11.20
{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "wordtest" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : ["lowercase", "word"]
        }
      },
      "filter" : {
        "word" : {
          "type" : "word_delimiter",
          "generate_word_parts": false,
          "generate_number_parts": false,
          "catenate_words": false,
          "catenate_numbers": false,
          "catenate_all": false,
          "split_on_case_change": false,
          "preserve_original": true,
          "split_on_numerics": false,
          "stem_english_possessive": true
        }
      }
    }
  }
}
and run an analyze test:
curl -XGET 'localhost:9200/logstash-2014.11.20/_analyze?filters=word' -d 'this-is-a-test'
the response is this:
{"tokens":[{"token":"this","start_offset":0,"end_offset":4,"type":"","position":1},{"token":"is","start_offset":5,"end_offset":7,"type":"","position":2},{"token":"a","start_offset":8,"end_offset":9,"type":"","position":3},{"token":"test","start_offset":10,"end_offset":14,"type":"","position":4}]
which says it tokenized everything except the original term, which makes me wonder whether the preserve_original setting is working at all?
Any ideas on this?
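One more check I still want to try: if I understand the analyze API correctly, passing only filters= may leave the tokenizer at its default instead of using my whitespace tokenizer, so testing against the full custom analyzer (or spelling out the whole chain) might behave differently:

curl -XGET 'localhost:9200/logstash-2014.11.20/_analyze?analyzer=wordtest' -d 'this-is-a-test'

curl -XGET 'localhost:9200/logstash-2014.11.20/_analyze?tokenizer=whitespace&filters=lowercase,word' -d 'this-is-a-test'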
On Wednesday, 19 November 2014 at 18:26:09 UTC+1, Jörg Prante wrote:
You are searching for a hyphen-aware tokenizer, like this?
Hyphen Tokenizer Demo · GitHub
It is in my plugin bundle:
GitHub - jprante/elasticsearch-plugin-bundle: A bundle of useful Elasticsearch plugins
Jörg
On Wed, Nov 19, 2014 at 5:46 PM, horst knete <badun...@hotmail.de> wrote:
Hey guys,
after working with the ELK stack for a while now, we still have a very annoying problem regarding the behavior of the standard analyzer - it splits terms into tokens using hyphens or dots as delimiters.
E.g. logsource:firewall-physical-management gets split into "firewall", "physical" and "management". On the one hand that's cool, because if you search for logsource:firewall you get all the events with firewall as a token in the field logsource.
The downside of this behaviour is that if you run e.g. a "top 10" panel on a field in Kibana, every token is counted as a whole term and ranked by its count:
top 10:
- firewall : 10
- physical : 10
- management: 10
instead of top 10:
- firewall-physical-management: 10
In the standard mapping from logstash this is solved by adding a .raw field that is "not_analyzed", but the downside of that is that you get 2 fields instead of one (even if it is a multi_field) and the usability for Kibana users is not that great.
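For context, the logstash template does roughly this for every string field (a simplified sketch, not the exact template; field name taken from the example above):

"logsource" : {
  "type" : "string",
  "fields" : {
    "raw" : { "type" : "string", "index" : "not_analyzed" }
  }
}

so full-text queries hit the analyzed logsource, while Kibana panels have to use logsource.raw.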
So what we need is that logsource:firewall-physical-management gets tokenized into "firewall-physical-management", "firewall", "physical" and "management".
I tried this using the word_delimiter token filter with the following mapping (the analyzer is attached to the field as shown in the sketch after the settings):
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["lowercase", "asciifolding",
"my_worddelimiter"]
}
},
"filter" : {
"my_worddelimiter" : {
"type" : "word_delimiter",
"generate_word_parts": false,
"generate_number_parts": false,
"catenate_words": false,
"catenate_numbers": false,
"catenate_all": false,
"split_on_case_change": false,
"preserve_original": true,
"split_on_numerics": false,
"stem_english_possessive": true
}
}
}
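The analyzer is then wired to the field roughly like this (a sketch; the field name is assumed from the example above):

"logsource" : {
  "type" : "string",
  "analyzer" : "my_analyzer"
}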
But unfortunately this didn't do the job.
I saw during my research that some other people have a similar problem, but except for some replacement suggestions, no real solution was found.
If anyone has any ideas on how to start working on this, I would be very happy.
Thanks.