Case-insensitive search of a substring in many (or all) fields

Hi everyone, I'm new to Elasticsearch, and have been playing with the query language and reading the docs lately.

I'm finding it hard to construct the right query for my needs...

I have simple documents, with only a few text fields. I would like to search for a substring across all the fields (something like a wildcard search). It should be case-insensitive.

Is there a way to do that? I've looked at wildcard, multi_match, and other query types, but had little success so far...

Bonus: since the text is French, it would be great to have accented characters treated as equivalent to their non-accented versions (ç=c, é=e, etc.), so that searching for special matches spécial.

Bonjour :wink:

You probably need to define your own analyzer. I'd use an ngram tokenizer together with the lowercase and asciifolding token filters.
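As a quick way to see what those filters do, the _analyze API can run an ad-hoc tokenizer/filter chain (a sketch; the exact token output may vary by Elasticsearch version):

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "asciifolding"
  ],
  "text": "Spécial"
}

This should return the single token special, showing how lowercase and asciifolding normalize capitalized and accented input.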

If you can't make it work, could you provide a full recreation script as described in About the Elasticsearch category? It will help us better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers understand, reproduce and, if needed, fix your problem. It will also most likely get you a faster answer.

In case you need it, there's also a French-speaking space: #in-your-native-tongue:discussions-en-francais :wink:


Thanks!

Let's start simple: searching for a substring in a single field. Even that was not straightforward...

What took me some time to figure out is that the wildcard query's input is not analyzed, so a pattern with capital letters won't match terms that were lowercased at index time. That isn't explicit in the docs... Shouldn't it be documented?

For instance

DELETE index

PUT index/_doc/1
{
  "full_name": "Service de cardiologie"
}

GET index/_search
{
  "query": {
    "wildcard": {
      "full_name": "Ser*"
    }
  }
}

gives

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

but without the capital "S", it works:

GET index/_search
{
  "query": {
    "wildcard": {
      "full_name": "ser*"
    }
  }
}

gives a match:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "full_name" : "Service de cardiologie"
        }
      }
    ]
  }
}

This behavior is not intuitive, because the match query does not care about case:

GET index/_search
{
  "query": {
    "match": {
      "full_name": "Service"
    }
  }
}
GET index/_search
{
  "query": {
    "match": {
      "full_name": "service"
    }
  }
}
GET index/_search
{
  "query": {
    "match": {
      "full_name": "SERVICE"
    }
  }
}

all output the same result (a match).
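One workaround I found (an assumption on my part: this requires the case_insensitive parameter, which only exists in Elasticsearch 7.10 and later) is to ask the wildcard query itself to ignore case:

GET index/_search
{
  "query": {
    "wildcard": {
      "full_name": {
        "value": "Ser*",
        "case_insensitive": true
      }
    }
  }
}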

You need to understand how everything works behind the scenes and what the analysis process is.

I'd recommend to start here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html
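For instance, the _analyze API shows what the default standard analyzer actually puts in the index (a sketch; run it against your own cluster to confirm):

GET _analyze
{
  "analyzer": "standard",
  "text": "Service de cardiologie"
}

It should return the lowercased tokens service, de and cardiologie. The match query analyzes its input the same way, which is why it matches regardless of case, while the wildcard query does not analyze its input and so fails on Ser*.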


Thanks! I read that and now understand the analysis part better. But I still don't get how the search works...

For instance, given this analyzer and documents:

DELETE index

PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sloppy_french_analyser": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      }
    }
  }
}

PUT index/_doc/1
{
  "full_name": "Jean-François SCHAFF"
}

PUT index/_doc/2
{
  "full_name": "Jean-Bernard"
}

I would expect these two requests to perform the same search, but they don't...

GET index/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "jean",
      }
    }
  }
}

GET index/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "jean",
        "analyzer": "sloppy_french_analyser"
      }
    }
  }
}

I would also expect this request

GET index/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "jea",
      }
    }
  }
}

to find something, but it doesn't...

It's just very hard to get into Elasticsearch from reading the docs and examples when you are not already a search specialist... Very frustrating...

It's because of the mapping.
You did not define any mapping, so Elasticsearch generated one by default.
It applied the default standard analyzer to your full_name field.

You need to set your analyzer explicitly in the mapping.

DELETE index
PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sloppy_french_analyser": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "full_name": {
        "type": "text",
        "analyzer": "sloppy_french_analyser"
      }
    }
  }
}


PUT index/_doc/1
{
  "full_name": "Jean-François SCHAFF"
}

PUT index/_doc/2
{
  "full_name": "Jean-Bernard"
}

GET index/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "jea"
      }
    }
  }
}
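If it helps, you can also check what the query text becomes with this analyzer (a sketch):

GET index/_analyze
{
  "analyzer": "sloppy_french_analyser",
  "text": "jea"
}

With min_gram 2 and max_gram 3, this should produce the tokens je, ea and jea, which now also exist in the index for Jean-François and Jean-Bernard, so the match query finds both documents.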

To be complete, here is a "solution" to this topic:

DELETE index

PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sloppy_french_analyser": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "full_name": {
        "type": "text",
        "analyzer": "sloppy_french_analyser"
      },
      "unit": {
        "type": "text",
        "analyzer": "sloppy_french_analyser"
      }
    }
  }
}

PUT index/_doc/1
{
  "full_name": "Jean-François SCHAFF-SERANO",
  "unit": "MSP"
}

PUT index/_doc/2
{
  "full_name": "Marion Bernard",
  "unit": "Service de néonat"
}

GET index/_search
{
  "query": {
    "multi_match": {
      "query": "service"
    }
  }
}

The multi_match query outputs results in a reasonable order. The ngram tokenizer makes it match "too many" documents rather than "not enough", which is nice for the quick search box I want to implement.
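Note that a multi_match query without a fields list falls back to the index.query.default_field setting (which defaults to *, i.e. all eligible fields). Listing the fields explicitly, possibly with boosts, makes the intent clearer (the ^2 boost here is just an illustration, not something from the thread):

GET index/_search
{
  "query": {
    "multi_match": {
      "query": "service",
      "fields": [ "full_name^2", "unit" ]
    }
  }
}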

Thank you @dadoonet for your guidance!

BTW here is a script (for es 6.x) that gives you other ideas on how to combine multiple ways of querying the same field using different rules.
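One common pattern along those lines is multi-fields: index the same field several ways and query the sub-fields with different boosts. A sketch, assuming the sloppy_french_analyser from earlier in this thread is defined in the index settings (the sub-field names exact and raw are hypothetical):

PUT index
{
  "mappings": {
    "properties": {
      "full_name": {
        "type": "text",
        "analyzer": "sloppy_french_analyser",
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "standard"
          },
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

A multi_match over full_name, full_name.exact^2 and full_name.raw^3 would then rank exact matches above ngram matches.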

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.