Custom analyzer returning wrong document

mjohnst · August 26, 2015, 8:12am

Hello,

I am writing some custom analyzers to handle specific special characters that I care about, and still use the standard tokenizer (can't use whitespace or pattern for my documents).

This is the scenario I'm trying to get working:

A document contains the text AT&amp;T, which gets character-mapped to ATAMPERSAND_SYMBOLampSEMICOLON_SYMBOLT via the index_analyzer (this should be a single token with the standard tokenizer).

Then, by searching for AT&T (not AT&amp;T), the same document should be returned, because the search_analyzer converts AT&T to ATAMPERSAND_SYMBOLampSEMICOLON_SYMBOLT.
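To illustrate the intended character mapping outside of Elasticsearch, here is a minimal plain-Python sketch. The helper `apply_mappings` is hypothetical and only approximates a mapping char filter (left-to-right scan, longest key first); it does not tokenize anything or call Elasticsearch. The two dictionaries mirror the mappings in the settings further down.

```python
def apply_mappings(text, mappings):
    """Replace every key found in text with its mapped value.

    Scans left to right and, at each position, prefers the longest
    matching key. This is an approximation of a mapping char filter,
    not the Elasticsearch implementation.
    """
    keys = sorted(mappings, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mappings[key])
                i += len(key)
                break
        else:
            # No key matches at this position: keep the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


# Same rules as SpecialCharactersIndexFilter / SpecialCharactersSearchFilter.
INDEX_MAPPINGS = {"&": "AMPERSAND_SYMBOL", ";": "SEMICOLON_SYMBOL"}
SEARCH_MAPPINGS = {"&": "AMPERSAND_SYMBOLampSEMICOLON_SYMBOL"}

indexed_doc1 = apply_mappings("AT&amp;T", INDEX_MAPPINGS)  # document 1 at index time
indexed_doc2 = apply_mappings("AT&T", INDEX_MAPPINGS)      # document 2 at index time
searched = apply_mappings("AT&T", SEARCH_MAPPINGS)         # query text at search time

print(indexed_doc1)
print(indexed_doc2)
print(searched)
```

Under these assumptions the search-time string equals the index-time string for document 1 and differs from the one for document 2, which is the behavior I am expecting from the two analyzers.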

Here's my Sense code trying this out, but the wrong document is being returned... Any help is much appreciated!

The custom analyzers and mapping:

PUT /index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "TextIndexAnalyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": [ "SpecialCharactersIndexFilter" ]
                },
                "TextSearchAnalyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": [ "SpecialCharactersSearchFilter" ]
                }
            },
            "char_filter": {
                "SpecialCharactersIndexFilter": {
                    "type": "mapping",
                    "mappings": [
                        "&=>AMPERSAND_SYMBOL",
                        ";=>SEMICOLON_SYMBOL"
                    ]
                },
                "SpecialCharactersSearchFilter": {
                    "type": "mapping",
                    "mappings": [
                        "&=>AMPERSAND_SYMBOLampSEMICOLON_SYMBOL"
                    ]
                }
            }
        }
    },
    "mappings": {
        "Doc": {
            "properties": {
                "Text": {
                    "type": "string",
                    "index_analyzer": "TextIndexAnalyzer",
                    "search_analyzer": "TextSearchAnalyzer"
                }
            }
        }
    }
}

Two example documents:

PUT /index/Doc/1
{
    "Doc.Text": "AT&amp;T"
}

PUT /index/Doc/2
{
    "Doc.Text": "AT&T"
}

I want to get the document with _id: 1 from this search, but _id: 2 is returned:

GET /index/Doc/_search
{
    "query": {
        "match_phrase": {
           "Doc.Text": "AT&T"
        }
    }
}

Showing that the two analyzers return the same tokens:

POST /index/_analyze?analyzer=TextIndexAnalyzer
{
  AT&amp;T
}
POST /index/_analyze?analyzer=TextSearchAnalyzer
{
  AT&T
}

Thanks again for any help!

Sarwar · August 26, 2015, 12:02pm

You are using match_phrase, but is "AT&T" a phrase? The way you have set it up, it's a term being matched, not a phrase, and so the closest match is AT&T.

mjohnst · August 26, 2015, 8:57pm

Yeah, AT&T is still a phrase I think.

I've figured out my problem. The index and type parts of the URI I was using did not match what I had in the mapping.

I have it working now.