Analyzers and char_filters o_0 creepy outputs

(georgi.mateev) #1

Hi! This is a sample setup, close to what I am working with.

As you can see, I am trying to remove the hyphens from all words, so that
words like "hand-made" are indexed as "handmade". The goal is to make a
search for "handmade" find all documents containing "hand-made" and vice
versa. For some reason it doesn't work, though :frowning:
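
To make the intent concrete, here is a rough sketch in plain Python (not Elasticsearch itself) of the pipeline I have in mind: a character filter that deletes hyphens, followed by a tokenizer. The function names are made up for illustration, and the regex tokenizer is only a stand-in for whatever tokenizer the real analyzer uses.

```python
import re
from typing import List


def strip_hyphens(text: str) -> str:
    """Character-filter stage: delete every hyphen before tokenizing."""
    return text.replace("-", "")


def tokenize(text: str) -> List[str]:
    """Tokenizer stage (stand-in): split on non-word characters, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def analyze(text: str) -> List[str]:
    """Full pipeline: the char filter runs first, then the tokenizer."""
    return tokenize(strip_hyphens(text))


# Without the char filter the hyphenated word splits into two tokens.
print(tokenize("hand-made"))
# With the char filter both spellings should end up as the same single token.
print(analyze("hand-made"))
print(analyze("handmade"))
```

If the same pipeline runs at index time and at search time, both spellings should hit the same term, which is the behaviour I expected.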

I have also attached 3 sample queries. The expected result would be for all
of them to return the same result set.

  1. Astonishingly, a search for "Chemie-ingenieur" finds 2 results, but a
    search for "Chemieingenieur" finds none. This is pretty creepy to me, since
    the char_filter is supposed to strip the hyphens prior to tokenizing in the
    indexing process.

  2. Another creepy fact is that if I specify the searchAnalyzer explicitly,
    I find no results (see query 3) from this document set.

  3. Moreover, the Analyze API shows that the search term "Chemie-ingenieur"
    gets translated to "Chemieingenieur" using this analyzer.

  4. And the creepiest fact is that when I run these queries with the
    actual index data (800+ documents), I get 17 results for "Chemie-ingenieur"
    and 22 for "Chemieingenieur", where NONE OF THEM OVERLAP. I.e. I have a
    total of 39 documents that should be matching either of the queries. Some
    of the documents that match "Chemie-ingenieur" actually don't contain the
    word with the hyphen. So I would expect these documents to be contained in
    both result sets, maybe with a different relevancy score. This is, however,
    not the case.
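
For reference, this is roughly the shape of the analysis configuration and Analyze API request described in the points above, written as Python dicts so they can be dumped to JSON. It is only a sketch: the names `strip_hyphens` and `dehyphenated` are placeholders rather than the names in my actual mapping, the `standard` tokenizer and `lowercase` filter are assumptions, and the exact request syntax may differ between Elasticsearch versions.

```python
import json

# Placeholder names: "strip_hyphens" and "dehyphenated" are illustrative only.
index_settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "strip_hyphens": {
                    "type": "mapping",
                    # Assumed rule: map "-" to the empty string.
                    "mappings": ["-=>"],
                }
            },
            "analyzer": {
                "dehyphenated": {
                    "type": "custom",
                    "char_filter": ["strip_hyphens"],
                    "tokenizer": "standard",  # assumed tokenizer
                    "filter": ["lowercase"],  # assumed token filter
                }
            },
        }
    }
}

# Body for an Analyze API call against the index, to inspect the tokens
# that the analyzer produces for a given search term.
analyze_request = {"analyzer": "dehyphenated", "text": "Chemie-ingenieur"}

print(json.dumps(index_settings, indent=2))
print(json.dumps(analyze_request))
```

The Analyze API only shows what the named analyzer does to a string. It does not show which analyzer was actually applied to a given field at index time, so comparing the two is probably the next thing to check.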

Please help me get past this; I have been struggling with it for a full
week already. Apart from a solution, I would be very grateful for an
explanation too, since the output is very different from what I expect
based on my understanding, which means I don't really understand the system.

P.S. Please focus on the actual problem and let's not discuss the mapping
in detail. The version I have pasted is pretty different from what I
started with, due to the trial-and-error approach I have been
using for almost a week.

Thanks sincerely,

(system) #2