Phrases with special characters


Hi there!

I am facing a small issue with tokenization in Elasticsearch through the Java API. I found some answers about it, but I am still missing the last clue that would let me make all of this work together.

Basically, I have a field called "description" in my mapping, containing several words. The default tokenizer/analyzer does not work for me because I want to be able to store special characters in this field and take them into account later when searching.

I saw that the "not_analyzed" index option does just that. However, I'd like this field to remain analyzed, so that searches for "my field }" and "} my field" both return "my } field".

Then I found a way to use the pattern analyzer, which seemed like exactly what I'm looking for if I configure it to split only on whitespace. So I put this into my elasticsearch.yml file:

    type: pattern
    pattern: \\s+
  default.type: whitespace

This might not be perfect, since it defines this analyzer as the default one, whereas I'd prefer it to be used only for the "description" field. Still, it seems like a good start.
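To make clear what I expect from splitting on `\s+`, here is a minimal sketch in plain Java. It only uses `java.util.regex` and does not call Elasticsearch at all, so it ignores anything else the real pattern analyzer may do to the tokens. The class and method names are hypothetical, made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: imitates "split on whitespace" with plain Java regex.
public class WhitespaceSplitSketch {

    // Same regex as in the config above: one or more whitespace characters.
    private static final Pattern SEPARATOR = Pattern.compile("\\s+");

    // Returns each token followed by its [start,end] character offsets.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = SEPARATOR.matcher(text);
        int start = 0;
        while (m.find()) {
            if (m.start() > start) {          // skip empty tokens
                tokens.add(describe(text, start, m.start()));
            }
            start = m.end();                   // next token begins after the separator
        }
        if (start < text.length()) {           // trailing token, if any
            tokens.add(describe(text, start, text.length()));
        }
        return tokens;
    }

    private static String describe(String text, int start, int end) {
        return text.substring(start, end) + " [" + start + "," + end + "]";
    }

    public static void main(String[] args) {
        for (String token : tokenize("my } field")) {
            System.out.println(token);
        }
    }
}
```

For the input `my } field` this yields three tokens, with `}` kept as a token of its own, which is the behaviour I am after.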

However, even with this analyzer, I could not make my queries with special characters work. Here is what I found out with some curl requests:

  1. My field, once analyzed, is stored as follows:

    "tokens" : [ {
    "token" : "my",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
    }, {
    "token" : "\}",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "word",
    "position" : 2
    }, {
    "token" : "field",
    "start_offset" : 6,
    "end_offset" : 11,
    "type" : "word",
    "position" : 3
    } ]

  2. When I inspect the query I execute in Eclipse's debug mode, it looks like this, and it returns no results:

    "query_string" : {
    "query" : "description:my \} field",
    "default_operator" : "and"

  3. I tried to execute this simpler query with curl:

    "query_string" : {
    "query" : "description:my \} field"

Now it works! However, EVERY SINGLE element from my repositories is returned as a match, most of them with a score of 0.0!
Adding a filter to keep only the results with a score above, let's say, 2.0 might work, but such a threshold is so arbitrary that I really do not like the idea.

  4. The "match_phrase" query looks like exactly what I am looking for, but I could not get any results with it, either with Java or with curl.
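For reference, here is a minimal sketch in plain Java of how I understand the query string from steps 2 and 3 could be built. As far as I understand the query_string syntax, a `description:` prefix only applies to the term right after it, so the sketch groups all terms in parentheses to keep them on the same field. The escaping is hand-rolled, and the list of reserved characters is my assumption, modelled on what Lucene's `QueryParser.escape` handles; the class and method names are hypothetical.

```java
// Hypothetical helper: builds a query_string expression scoped to one field.
public class QueryStringSketch {

    // Assumed set of reserved characters (modelled on Lucene's QueryParser.escape).
    private static final String RESERVED = "\\+-!():^[]\"{}~*?|&/";

    // Prefixes every reserved character in a single term with a backslash.
    static String escapeTerm(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (RESERVED.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Splits the input on whitespace, escapes each term, and wraps the
    // result as field:(term1 term2 ...) so every term targets that field.
    static String fieldQuery(String field, String userInput) {
        StringBuilder sb = new StringBuilder(field).append(":(");
        String[] terms = userInput.trim().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(escapeTerm(terms[i]));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // Prints: description:(my \} field)
        System.out.println(fieldQuery("description", "my } field"));
    }
}
```

The resulting string would then be passed as the "query" value of the query_string query. When it is written by hand into a JSON body for curl, the backslash itself has to be doubled (`\\}`), since `\}` is not a valid JSON escape.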

I feel like the answer is really close, but I am lacking some little clues to find exactly what I am doing wrong.

I hope you will be able to help me without losing too much time (though I could have avoided writing such a huge post then ;-)).

Thanks a lot in advance!
