[SOLVED] Question about custom analyzer

Hi all,

I am indexing IP addresses in Elasticsearch and am trying to find the right custom analyzer to allow searches like the example below:

IP address: 192.168.1.1

Search requests: 192 or 192.168 or 192.168.1

Expected result:

**192**.168.1.1
or
**192.168**.1.1
or
**192.168.1**.1

(the bold text represents the highlight result that I want to retrieve)

Maybe I could use:

Analyzer: custom
tokenizer: keyword
token filter: edge n-gram
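
Roughly, I picture index settings like the sketch below. The index, filter and analyzer names (ip_edge_test, ip_edge_ngram, ip_edge_analyzer) are made up for illustration, and max_gram is set to 15 on the assumption that the field only ever holds dotted IPv4 addresses:

```
# Hypothetical names throughout; max_gram 15 assumes plain IPv4 strings.
PUT /ip_edge_test
{
  "settings": {
    "analysis": {
      "filter": {
        "ip_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 15
        }
      },
      "analyzer": {
        "ip_edge_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "ip_edge_ngram"
          ]
        }
      }
    }
  }
}
```

One thing I am unsure about: edge n-grams of the whole string would also contain partial octets such as 19 or 192.1, so a search for 19 would probably match too.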

Does anyone have any advice? :wink:

Thanks,
Alex

If the IP address is the only thing in the field you are analyzing, I would try the Path Hierarchy Tokenizer with the dot as the delimiter.

Thanks for your reply.

I think it will do the job! Thanks a lot!

I just played around with it a bit. If you use it, you will probably want a different analyzer at search time (e.g. keyword), because if the search term is also split by the path hierarchy tokenizer, you will get matches even when only the first octet matches. The dots might also need a little special handling.

Thanks for your advice.

It would be helpful if you had an example :wink: How did you handle the dots?

Even if I use the keyword filter, won't I have the same issue? My search would be tokenized by the path hierarchy tokenizer and then filtered with keyword. So a search for 192.168 would produce 192 and 192.168, and every address starting with 192 would match too, wouldn't it?

Thanks for your help !

PS: the _all field is disabled in my environment.

Okay, here's an example. It may still not be perfect, but I think it shows some of the possibilities you have:

I created a test index with two analyzers, one for indexing (using the path_hierarchy tokenizer) and one for the query, and a mapping for a doc just containing this ip field:

PUT /ip4test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ip_4_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "."
        }
      },
      "filter": {
        "remove_trailing_dot": {
          "type": "pattern_replace",
          "pattern": "\\.$",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "ip_4_tokenizer"
        },
        "dedot_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "remove_trailing_dot"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "ip": {
          "type": "string",
          "analyzer": "my_analyzer",
          "search_analyzer": "dedot_keyword"
        }
      }
    }
  }
}

If you now use the _analyze endpoint, you can see how the IP address gets broken up at index time:

curl -XGET 'localhost:9200/ip4test/_analyze?pretty&analyzer=my_analyzer' -d 11.22.33.44

{
  "tokens" : [ {
    "token" : "11",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "11.22",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "11.22.33",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "11.22.33.44",
    "start_offset" : 0,
    "end_offset" : 11,
    "type" : "word",
    "position" : 0
  } ]
}
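
The search-side analyzer can be checked the same way. This is only a sketch, and it assumes the same Elasticsearch version as the curl call above, where _analyze still accepts analyzer and text as URL parameters:

```
# Assumes an Elasticsearch version where _analyze takes analyzer/text as URL parameters.
GET /ip4test/_analyze?analyzer=dedot_keyword&text=11.4.
```

This should come back as a single token, 11.4: the keyword tokenizer keeps the whole input as one token and remove_trailing_dot strips the final dot. The query term is therefore not split into 11 and 11.4, and it only matches documents whose indexed tokens contain exactly 11.4.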

So now enter some docs:

PUT /ip4test/my_type/1
{
  "ip" : "11.4.76.03"
}

PUT /ip4test/my_type/2
{
  "ip" : "11.4.71.04"
}

PUT /ip4test/my_type/3
{
  "ip" : "11.41.71.04"
}

And do some querying:

GET /ip4test/my_type/_search
{
  "query": {
    "match": {
      "ip": "11.4"
    }
  },
  "highlight": {
    "fields": {
      "ip": {}
    }
  }
}

"hits": [
      {
        "_index": "ip4test",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.30685282,
        "_source": {
          "ip": "11.4.71.04"
        },
        "highlight": {
          "ip": [
            "<em>11.4</em>.71.04"
          ]
        }
      },
      {
        "_index": "ip4test",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "ip": "11.4.76.03"
        },
        "highlight": {
          "ip": [
            "<em>11.4</em>.76.03"
          ]
        }
      }
    ]

Note that this query did not match 11.41.71.04. I am not sure whether that was your intention. If not, you would have to use prefix queries.
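
For the prefix variant, an untested sketch against the same index and field could look like this. Keep in mind that prefix queries are not analyzed, so the search_analyzer (and with it the trailing-dot removal) does not apply to the value:

```
# Untested sketch; the prefix value is compared against the indexed terms as-is.
GET /ip4test/my_type/_search
{
  "query": {
    "prefix": {
      "ip": "11.4"
    }
  }
}
```

Because the indexed terms include 11.4 for docs 1 and 2 and 11.41 for doc 3, this should return all three documents.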

Without removing the dot at the end of the query term, the next example would return no results, but thanks to the pattern_replace filter it does:

GET /ip4test/my_type/_search
{
  "query": {
    "match": {
      "ip": "11.4."
    }
  },
  "highlight": {
    "fields": {
      "ip": {}
    }
  }
}

"hits": [
      {
        "_index": "ip4test",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.30685282,
        "_source": {
          "ip": "11.4.71.04"
        },
        "highlight": {
          "ip": [
            "<em>11.4</em>.71.04"
          ]
        }
      },
      {
        "_index": "ip4test",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "ip": "11.4.76.03"
        },
        "highlight": {
          "ip": [
            "<em>11.4</em>.76.03"
          ]
        }
      }
    ]

Depending on your exact use case you might have to modify this a bit. Hope that helps a bit.

Many thanks for this perfect and complete answer!

I ran some tests yesterday, but I had missed the pattern_replace filter for the search_analyzer. With your example it does the job!

Have a good day.