Case-insensitive search of a substring in many (or all) fields

Hi everyone, I'm new to Elasticsearch, and have been playing with the query language and reading the docs lately.

I'm finding it hard to construct the right query for my needs...

I have simple documents, with only a few text fields. I would like to search for a substring across all the fields (something like a wildcard search). It should be case-insensitive.

Is there a way to do that? I've looked at wildcard, multi_match, and other query types, but had little success so far...

Bonus: since the text is French, it would be great to have accented characters treated as equivalent to their non-accented versions (ç=c, é=e, etc.), so that searching for special matches spécial.

Bonjour :wink:

You probably need to define your own analyzer. I'd use an ngram tokenizer together with the lowercase and asciifolding token filters.
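As a quick way to see what those filters do, the _analyze API can run an ad-hoc tokenizer/filter chain (a sketch; the exact token output may vary by Elasticsearch version):

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "asciifolding"
  ],
  "text": "Spécial"
}

This should return the single token special, showing how lowercase and asciifolding normalize capitalized and accented input.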

If you can't make it work, could you provide a full recreation script as described in About the Elasticsearch category? It will help us better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers understand, reproduce and, if needed, fix your problem. It will also most likely get you a faster answer.

In case you need it, there's also a French-speaking space: #in-your-native-tongue:discussions-en-francais :wink:


Thanks!

Let's start simple: searching for a substring in a single field. Even that was not straightforward...

What took me some time to figure out is that the wildcard query's input is not analyzed, so a pattern with capital letters won't match terms that were lowercased at index time. That isn't explicit in the docs... Shouldn't it be documented?

For instance

DELETE index

PUT index/_doc/1
{
  "full_name": "Service de cardiologie"
}

GET index/_search
{
  "query": {
    "wildcard": {
      "full_name": "Ser*"
    }
  }
}

gives

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

but without the capital "S", it works:

GET index/_search
{
  "query": {
    "wildcard": {
      "full_name": "ser*"
    }
  }
}

gives a match:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "full_name" : "Service de cardiologie"
        }
      }
    ]
  }
}

This behavior is not intuitive, because the match query does not care about case:

GET index/_search
{
  "query": {
    "match": {
      "full_name": "Service"
    }
  }
}
GET index/_search
{
  "query": {
    "match": {
      "full_name": "service"
    }
  }
}
GET index/_search
{
  "query": {
    "match": {
      "full_name": "SERVICE"
    }
  }
}

all output the same result (a match).
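One workaround I found (an assumption on my part: this requires the case_insensitive parameter, which only exists in Elasticsearch 7.10 and later) is to ask the wildcard query itself to ignore case:

GET index/_search
{
  "query": {
    "wildcard": {
      "full_name": {
        "value": "Ser*",
        "case_insensitive": true
      }
    }
  }
}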

You need to understand how everything works behind the scenes and what the analysis process is.

I'd recommend to start here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html
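For instance, the _analyze API shows what the default standard analyzer actually puts in the index (a sketch; run it against your own cluster to confirm):

GET _analyze
{
  "analyzer": "standard",
  "text": "Service de cardiologie"
}

It should return the lowercased tokens service, de and cardiologie. The match query analyzes its input the same way, which is why it matches regardless of case, while the wildcard query does not analyze its input and so fails on Ser*.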


Thanks! I read that and now understand the analysis part better. But I still don't get how the search works...

For instance, given this analyzer and documents:

DELETE index

PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sloppy_french_analyser": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      }
    }
  }
}

PUT index/_doc/1
{
  "full_name": "Jean-François SCHAFF"
}

PUT index/_doc/2
{
  "full_name": "Jean-Bernard"
}

I would expect these two requests to perform the same search, but they don't...

GET index/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "jean",
      }
    }
  }
}

GET index/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "jean",
        "analyzer": "sloppy_french_analyser"
      }
    }
  }
}

I would also expect this request

GET index/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "jea",
      }
    }
  }
}

to find something, but it doesn't...

It's just very hard to get into Elasticsearch from reading the docs and examples when you are not already a search specialist... Very frustrating...

It's because of the mapping.
You did not define any mapping, so Elasticsearch generated one by default.
It applied the default standard analyzer to your full_name field.

You need to set your analyzer explicitly in the mapping.

DELETE index
PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sloppy_french_analyser": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "full_name": {
        "type": "text",
        "analyzer": "sloppy_french_analyser"
      }
    }
  }
}


PUT index/_doc/1
{
  "full_name": "Jean-François SCHAFF"
}

PUT index/_doc/2
{
  "full_name": "Jean-Bernard"
}

GET index/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "jea"
      }
    }
  }
}
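If it helps, you can also check what the query text becomes with this analyzer (a sketch):

GET index/_analyze
{
  "analyzer": "sloppy_french_analyser",
  "text": "jea"
}

With min_gram 2 and max_gram 3, this should produce the tokens je, ea and jea, which now also exist in the index for Jean-François and Jean-Bernard, so the match query finds both documents.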

To be complete, here is a "solution" to this topic:

DELETE index

PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sloppy_french_analyser": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "full_name": {
        "type": "text",
        "analyzer": "sloppy_french_analyser"
      },
      "unit": {
        "type": "text",
        "analyzer": "sloppy_french_analyser"
      }
    }
  }
}

PUT index/_doc/1
{
  "full_name": "Jean-François SCHAFF-SERANO",
  "unit": "MSP"
}

PUT index/_doc/2
{
  "full_name": "Marion Bernard",
  "unit": "Service de néonat"
}

GET index/_search
{
  "query": {
    "multi_match": {
      "query": "service"
    }
  }
}

The multi_match query outputs results in a reasonable order. The ngram tokenizer makes it match "too many" documents rather than "not enough", which is nice for the quick search box I want to implement.
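Note that a multi_match query without a fields list falls back to the index.query.default_field setting (which defaults to *, i.e. all eligible fields). Listing the fields explicitly, possibly with boosts, makes the intent clearer (the ^2 boost here is just an illustration, not something from the thread):

GET index/_search
{
  "query": {
    "multi_match": {
      "query": "service",
      "fields": [ "full_name^2", "unit" ]
    }
  }
}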

Thank you @dadoonet for your guidance!

BTW here is a script (for es 6.x) that gives you other ideas on how to combine multiple ways of querying the same field using different rules.
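One common pattern along those lines is multi-fields: index the same field several ways and query the sub-fields with different boosts. A sketch, assuming the sloppy_french_analyser from earlier in this thread is defined in the index settings (the sub-field names exact and raw are hypothetical):

PUT index
{
  "mappings": {
    "properties": {
      "full_name": {
        "type": "text",
        "analyzer": "sloppy_french_analyser",
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "standard"
          },
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

A multi_match over full_name, full_name.exact^2 and full_name.raw^3 would then rank exact matches above ngram matches.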

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.