Dealing with the plus sign (+) character in Indexing/Searching


#1

I'm trying to search text for "C++" and can't get it to work. The other characters seem to be treated as ALPHANUM properly, but the + isn't working. Any help is much appreciated. I have the following settings:

settings: {
  analysis: {
    filter: {
      symbol_filter: {
        type: "word_delimiter",
        type_table: ["# => ALPHANUM", "@ => ALPHANUM", "/ => ALPHANUM", "$ => ALPHANUM", "& => ALPHANUM", "+ => ALPHANUM", "- => ALPHANUM"]
      }
    },
    analyzer: {
      symbol_analyzer: {
        type: "custom",
        tokenizer: "whitespace",
        filter: ["lowercase", "symbol_filter"]
      }
    }
  }
}

(Isabel Drost-Fromm) #2

Your question is better suited for the https://discuss.elastic.co/c/kibana forum, not the Elasticsearch forum. Please resend your message to the Kibana forum, where you are likely to get more/better responses.


#3

But I'm not using Kibana?

This question is strictly about Elasticsearch, so I'm confused as to why the Elasticsearch forum is the inappropriate one.


(Isabel Drost-Fromm) #4

Sigh. So this is where my "ask in the Kibana forum" answer went. I was already searching for it in the thread I thought I had added it to. Sorry for the confusion.

Can you also post an example document you are indexing as well as the query you would like to run? Makes it easier to reproduce what you are doing. The only thing I can think of right now is "+" being treated specially e.g. by the query_string query.


(Doug Turnbull) #5

How are you querying? To the query string query parser, a "+" means
mandatory. So you may have C++ in your index while the searches are failing.
In that case you need to either escape the "+" while searching or use a
different query method.
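
As a rough illustration of the escaping option, here is a minimal Python sketch. The helper name escape_query_string and the field name "title" are made up for this example, and the character list is an assumption based on the reserved characters documented for the query_string syntax. It only builds a request body and sends nothing. Escaping only changes how the query is parsed; the analyzer still has to keep the "+" in the token. A match query does not parse these operators at all, which is the "different query method" route.

```python
import json
import re

# Operator characters of the query_string syntax. "<" and ">" are left out of
# this class because they cannot be escaped; they are stripped instead.
_RESERVED = re.compile(r'[+\-=&|!(){}\[\]^"~*?:\\/]')


def escape_query_string(text: str) -> str:
    """Backslash-escape query_string operators so they are matched literally."""
    text = text.replace("<", "").replace(">", "")
    return _RESERVED.sub(r"\\\g<0>", text)


# Hypothetical request body; the field name "title" is only an example.
body = {
    "query": {
        "query_string": {
            "query": escape_query_string("C++"),
            "default_field": "title",
        }
    }
}
print(json.dumps(body))  # json.dumps escapes the backslashes again for JSON
```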

Have you confirmed with the _analyze API that you're producing the expected
tokens? I might also suggest elyzer for seeing your analyzer run
step by step.
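
A small sketch of that _analyze check, in Python. It assumes a version that accepts the analyzer and text as a JSON body (the request later in this thread passes them as URL parameters instead). The function names and the index name "my_index" are placeholders; "symbol_analyzer" is the analyzer from post #1. Nothing is sent over the network here: the first helper only builds the request, and the second reads the token strings out of a parsed response.

```python
import json


def build_analyze_request(index: str, analyzer: str, text: str):
    """Return (method, path, JSON body) for an _analyze call; nothing is sent."""
    body = json.dumps({"analyzer": analyzer, "text": text})
    return "POST", f"/{index}/_analyze", body


def tokens_from_response(response: dict) -> list:
    """Extract the token strings from a parsed _analyze response."""
    return [entry["token"] for entry in response.get("tokens", [])]


# "my_index" is a placeholder; "symbol_analyzer" is the analyzer from post #1.
method, path, body = build_analyze_request("my_index", "symbol_analyzer", "C++")
print(method, path)
print(body)
```

If the token list for "C++" comes back as just "c", the custom analyzer is not the one being applied.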


#6

Thanks for the response.

I've tried escaping during indexing as well. I've noticed that this is happening with other characters too, like the @ symbol. For example, a document that should only match when the @ is part of the query returns a hit whether I query with or without the @. I'm using a whitespace tokenizer, and the fields are returned with the @ symbol intact. When I run a basic test with the @ symbol through the analyze API, the token does show the symbol. Any help is much appreciated; I'm totally lost here.

Below is an example of how I query. This gets a successful hit, but I want to figure out how to keep this type of query from matching. Keep in mind that the screenName is indexed as @AJEnglish:

{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "screenName": {
                            "query": "AJEnglish"
                        }
                    }
                }
            ],
            "minimum_should_match": 1
        }
    },
    "sort": {
        "uuid": {
            "order": "desc"
        }
    },
    "size": 100,
    "aggs": {
        "types_count": {
            "terms": {
                "field": "_type"
            }
        }
    }
}

(mac2000) #7

Got into same issue, here is my workaround:

DELETE /sample

PUT /sample
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["c++,c ++=>cpp", "c#, c #=>csharp"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

POST /sample/product/1
{
  "title": "Programming C++"
}

POST /sample/product/2
{
  "title": "Programming C#"
}

GET /sample/_analyze?analyzer=my_analyzer&text=c%2B%2B

POST /sample/_search
{
  "query": {
    "multi_match": {
      "query": "c++",
      "fields": ["title"]
    }
  }
}

POST /sample/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "c#",
      "fields": ["title"]
    }
  }
}

The idea is to use synonyms, so that both while indexing (saving) and while searching, "c++" is replaced by "cpp". Also note that if you are going to use a custom analyzer you will probably want to add morphology, HTML stripping and other useful stuff; the example above is as minimal as possible, just to show a way around the issue.


#8

Cool, thanks for the help. Unfortunately this solution won't work for me, because I need to keep the ability to search for CPP in its original context without returning any C++ results. The C++ example is just one of many character-based use cases I need to account for without sacrificing the ability to search for any other character string. As noted, I'm also having trouble with @-based results. I thought treating the characters as ALPHANUM would work, so I'm awfully confused as to why I'm getting the results I am.


(mac2000) #9

Hm, I just tried the mapping from your first post, and all seems to be OK.

DELETE /sample

PUT /sample
{
  "settings": {
    "analysis": {
      "filter": {
        "my_delimiter": {
          "type": "word_delimiter",
          "type_table": [
            "# => ALPHANUM",
            "+ => ALPHANUM"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_delimiter"]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

POST /sample/product/1
{
  "title": "Programming C++"
}

POST /sample/product/2
{
  "title": "Programming C#"
}

POST /sample/_search
{
  "query": {
    "multi_match": {
      "query": "c++",
      "fields": ["title"]
    }
  }
}

#10

What are your hit results?

If you had a standalone "C", would the "C++" query return a hit? Is /sample/product/1 the only thing being returned for the sample search? That's the issue I'm having, at least: it returns both. It is able to catch C++, but it's also catching C followed by anything, which I don't want.


#12

Fixed the issue. There were a couple of causes. First, the mapping I was using in the actual app was not hitting the custom analyzer during the query. Second, oddly enough, the settings weren't always being set up, so default settings were what was actually in place. I've been using the JS API and wasn't initializing Elasticsearch correctly, so fixing that fixed this.

Thanks all for the help.
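
For anyone landing here with the same symptoms, the two causes above (settings not applied, field not mapped to the custom analyzer) can be checked from the responses of GET /<index>/_settings and GET /<index>/_mapping. Below is a minimal Python sketch that inspects already-parsed responses. The function names are hypothetical, and the response layout is assumed to follow the usual index -> "settings" -> "index" -> "analysis" nesting. The mapping walk is recursive so that it works with or without a type level such as "product".

```python
def has_custom_analyzer(settings_response: dict, index: str, analyzer: str) -> bool:
    """True if the named analyzer is defined in the index's analysis settings."""
    index_settings = settings_response.get(index, {}).get("settings", {}).get("index", {})
    analyzers = index_settings.get("analysis", {}).get("analyzer", {})
    return analyzer in analyzers


def field_analyzer(mapping_node, field: str):
    """Return the analyzer configured for `field`, or None if none is found."""
    if not isinstance(mapping_node, dict):
        return None
    props = mapping_node.get("properties")
    if isinstance(props, dict) and isinstance(props.get(field), dict):
        found = props[field].get("analyzer")
        if found is not None:
            return found
    # Descend into nested levels (index name, "mappings", type name, ...).
    for value in mapping_node.values():
        found = field_analyzer(value, field)
        if found is not None:
            return found
    return None
```

If has_custom_analyzer returns False, or field_analyzer returns None for the field being searched, the index is running on default analysis, which would explain a "C++" query also matching a bare "C".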

