edgeNGram tokenizer with the word delimiter filter


(Hieu Nguyen) #1

Hello guys,
I have been using the edgeNGram tokenizer to enable partial prefix matching
on a query. However, the tokenizer treats certain characters as punctuation
and splits on them (e.g. C# => C, I/O => I and O), so I had to add the
"punctuation" character class to the edgeNGram tokenizer and use the
word_delimiter filter to drop the punctuation:
'tokenizer': {
    'prefix_tokenizer': {
        'type': 'edgeNGram',
        'min_gram': 1,
        'max_gram': 30,
        'token_chars': ['letter', 'digit', 'symbol', 'punctuation'],
    },
}

'filter': {
    'my_word_delimiter': {
        'type': 'word_delimiter',
        'type_table': [
            '# => ALPHANUM'
        ]
    }
}

Unfortunately, this causes the highlight snippets to contain duplicate
tokens. For example, when the query is "U.S. pol" and the matching document
contains "U.S. politics", the snippet comes out as follows: UU.S.
Politics (the letter U is highlighted twice). I see how the word
delimiter creates the same token for different prefixes ("U" tokens for both
"U" and "U."), but the highlighting seems strange to me because "U" and "U.S"
have the same offset.
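
To make the duplication concrete, here is a minimal pure-Python sketch of the effect described above. It is a rough approximation, not Lucene's actual implementation: `edge_ngrams`, `strip_delimiters` and `analyze` are hypothetical helper names, the delimiter handling is far simpler than the real word_delimiter filter, and only the start offset of the originating prefix is tracked.

```python
def edge_ngrams(word, min_gram=1, max_gram=30):
    """Return (prefix, start_offset, end_offset) for each leading prefix of word."""
    upper = min(max_gram, len(word))
    return [(word[:n], 0, n) for n in range(min_gram, upper + 1)]


def strip_delimiters(token, keep="#"):
    """Simplified stand-in for word_delimiter.

    Splits on every character that is neither alphanumeric nor listed in
    `keep` (mirroring the '# => ALPHANUM' type_table entry).
    """
    parts, current = [], ""
    for ch in token:
        if ch.isalnum() or ch in keep:
            current += ch
        elif current:
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


def analyze(word):
    """Return (term, originating_prefix, start_offset) triples."""
    out = []
    for prefix, start, _end in edge_ngrams(word):
        for part in strip_delimiters(prefix):
            out.append((part, prefix, start))
    return out


tokens = analyze("U.S.")
for term, prefix, start in tokens:
    print(f"{term!r} from prefix {prefix!r} at start offset {start}")
```

In this sketch the prefixes "U", "U.", "U.S" and "U.S." each yield a "U" term with the same start offset, which is the kind of repetition that could plausibly lead to the doubled highlight.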

Do you have any suggestions?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a88b15ae-bfb8-419b-a58c-f3e7c8556faa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Adrien Grand) #2

Hi,

The highlighter should indeed highlight it only once, since both tokens share
the same offsets. Can you provide us with a full curl recreation?
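
As a starting point for such a recreation, the settings from post #1 could be assembled into a JSON request body along these lines. This is only a sketch under assumptions: the analyzer name `prefix_analyzer` and the way it wires the tokenizer and filter together are hypothetical, since the thread does not show the analyzer definition, the mapping, or the query.

```python
import json

analysis = {
    # Tokenizer and filter exactly as posted in #1.
    "tokenizer": {
        "prefix_tokenizer": {
            "type": "edgeNGram",
            "min_gram": 1,
            "max_gram": 30,
            "token_chars": ["letter", "digit", "symbol", "punctuation"],
        }
    },
    "filter": {
        "my_word_delimiter": {
            "type": "word_delimiter",
            "type_table": ["# => ALPHANUM"],
        }
    },
    # Assumption: a custom analyzer combining the two (not shown in the thread).
    "analyzer": {
        "prefix_analyzer": {
            "type": "custom",
            "tokenizer": "prefix_tokenizer",
            "filter": ["my_word_delimiter"],
        }
    },
}

index_body = {"settings": {"analysis": analysis}}
settings_json = json.dumps(index_body, indent=2)
print(settings_json)
```

The printed JSON could then serve as the body of an index-creation request in a curl script, followed by the mapping, a sample document and the highlighted query that reproduce the problem.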

(In reply to Hieu Nguyen's message of Sun, Apr 27, 2014 at 2:51 AM.)

--
Adrien Grand


