ElasticSearch analyzer for short strings

redserpent7 · April 8, 2015, 9:22am

I am building an ES application that should index and search file content
(PDF, Word, txt, etc.) and file names, where any kind of file may be
indexed.

For the content, I use the Compact Language Detector (CLD) to detect the
language of the content and assign it to its corresponding content field.

I have no issues with content; the problem is with file names. File names
are short and can be written in any language, which makes the language hard
to detect, since CLD (like any language detector) does not perform well on
short strings.

Currently I have configured the following analyzer for file names:

"def_analyzer": {
    "type": "custom",
    "filter": ["word_delimiter", "icu_normalizer", "icu_folding"],
    "tokenizer": "icu_tokenizer"
}

I tested it and it works fine most of the time. I tried it with many Latin
languages and it returned the correct results most of the time.

The only times it failed to return correct results were when the file name
had an acronym in it.

For example, when I search for WinSCP, the top results do not contain the
files whose names have WinSCP in them. What I get instead are files with
names like Win32.

I believe this is the work of the Word Delimiter token filter, as it is
probably splitting WinSCP into Win, S, C and P.
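
One way to test this suspicion, as a rough sketch rather than a verified fix: with its default settings the word_delimiter filter splits on case changes and on letter-to-digit transitions, so WinSCP and Win32 would both most likely produce the token Win, which could explain the ranking. The filter's preserve_original option keeps the unsplit token alongside the parts. The filter name filename_word_delimiter below is a made-up name for illustration, and I am assuming the rest of the analyzer stays as it is:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "filename_word_delimiter": {
          "type": "word_delimiter",
          "preserve_original": true
        }
      },
      "analyzer": {
        "def_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["filename_word_delimiter", "icu_normalizer", "icu_folding"]
        }
      }
    }
  }
}
```

With this variant, a file named WinSCP should also be indexed under the whole token, so an exact match on the full name can contribute to the score in addition to the parts. Whether that is enough to outrank names like Win32 would still need to be checked against real data.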

So what I am asking is: what is a good analyzer/filter combination for
short strings, regardless of their language?

