Multi-field full-text searching in multiple languages with wildcards, fuzzy, and phrase match

maurice · September 12, 2019, 10:59pm

Hey everybody,

I’m working on an index containing audio transcriptions that are full-text searchable using wildcards, fuzzy, and phrase matching with highlighting.
I’m currently expanding the mappings of this index to hold the text in multiple languages, and I want each highlight to show which language version of the text it came from.

The current data structure looks like this:

{
  "transcription_string": ""
}

And is queried like this:

{
  "highlight": {
    "fields": {
      "transcription_string": {}
    },
    "number_of_fragments": 0
  },
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "transcription_string": "some"
          }
        },
        {
          "match": {
            "transcription_string": "words"
          }
        }
      ]
    }
  }
}

The new structure nests one field per language inside transcription_string, though, and all of them need to be searched:

{
  "transcription_string": {
    "transcription": {
      "es": "",
      "en": ""
    }
  }
}
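
The mapping behind this structure is not shown here. As a rough sketch only, one way such per-language fields could be mapped is with a text field per language, each using a language analyzer. The helper name `transcription_mapping`, the typeless (Elasticsearch 7 style) mapping layout, and the choice of the built-in `english` and `spanish` analyzers are all assumptions made for illustration.

```python
def transcription_mapping(analyzers):
    """Build an index mapping with one text field per language.

    `analyzers` maps a language code to an analyzer name, e.g.
    {"en": "english", "es": "spanish"}. Both the helper and the analyzer
    choice are illustrative assumptions, not the poster's actual mapping.
    """
    per_language = {
        lang: {"type": "text", "analyzer": analyzer}
        for lang, analyzer in analyzers.items()
    }
    return {
        "mappings": {
            "properties": {
                "transcription_string": {
                    "properties": {
                        "transcription": {"properties": per_language}
                    }
                }
            }
        }
    }


mapping = transcription_mapping({"en": "english", "es": "spanish"})
```

The resulting dictionary could be sent as the body of an index-creation request. Adding a language would then only mean adding one more entry to the `analyzers` argument.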

At first I was investigating multi_match, something along the lines of:

{
  "highlight": {
    "fields": {
      "transcription_string.*": {}
    }, 
    "number_of_fragments": 0
  },
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "some words", 
          "fields": [ "transcription_string.*" ]
        }
      }
    }
  }
}

It appears that multi_match is not compatible with wildcards, though (and maybe not with fuzzy matching either).
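
For what it is worth, multi_match does accept a `fuzziness` parameter and a `"phrase"` type, although the two cannot be combined. It does not interpret wildcard patterns such as `wor*`; among the full-text queries, query_string is the one that parses them. The sketch below builds the fuzzy and phrase variants as Python dictionaries. The helper name `multi_match_body` and its `mode` argument are hypothetical, introduced only for this example.

```python
def multi_match_body(text, fields, mode="match"):
    """Build a search body around multi_match (hypothetical helper).

    mode: "match" (plain), "fuzzy" (adds fuzziness), or "phrase".
    """
    if mode not in ("match", "fuzzy", "phrase"):
        raise ValueError(f"unsupported mode: {mode}")
    clause = {"query": text, "fields": list(fields)}
    if mode == "phrase":
        # multi_match rejects fuzziness together with the phrase type,
        # so the phrase variant sets only the type.
        clause["type"] = "phrase"
    else:
        # Require every word to match, as the must clauses
        # in the original single-field query do.
        clause["operator"] = "and"
        if mode == "fuzzy":
            clause["fuzziness"] = "AUTO"
    return {
        "highlight": {
            "fields": {"transcription_string.*": {}},
            "number_of_fragments": 0,
        },
        "query": {"multi_match": clause},
    }


fuzzy_body = multi_match_body("some words", ["transcription_string.*"], mode="fuzzy")
phrase_body = multi_match_body("some words", ["transcription_string.*"], mode="phrase")
```

Whether this covers the wildcard case depends on whether switching to query_string for those searches is acceptable.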

I’ve had some success using this query structure:

{
  "highlight": {
    "fields": {
      "transcription_string.*": {}
    }, 
    "number_of_fragments": 0
  },
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              {
                "match": {"transcription_string.transcription.en": "some"}
              },
              {
                "match": {"transcription_string.transcription.en": "words"}
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "match": {"transcription_string.transcription.es": "some"}
              },
              {
                "match": {"transcription_string.transcription.es": "words"}
              }
            ]
          }
        }
      ]
    }
  }
}

This seems to be compatible with wildcards, phrase matching, and fuzzy searching: replace match with wildcard, match_phrase, or the desired structure for fuzzy searching.
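
Since the query body repeats the same clause for every language and every word, it can be generated instead of written by hand. The following is a minimal sketch of that idea in Python. The function name `per_language_query` and its parameters are hypothetical, and it assumes the clause has the short `{query_type: {field: term}}` form used above, which holds for match, match_phrase, and wildcard.

```python
def per_language_query(terms, languages, query_type="match",
                       field_prefix="transcription_string.transcription"):
    """Build the bool/should-of-bool/must query shown above.

    One inner bool per language; inside it, one clause per term.
    `query_type` can be "match", "match_phrase", or "wildcard".
    """
    should = []
    for lang in languages:
        field = f"{field_prefix}.{lang}"
        # Every term must match within this language's field.
        must = [{query_type: {field: term}} for term in terms]
        should.append({"bool": {"must": must}})
    return {
        "highlight": {
            "fields": {"transcription_string.*": {}},
            "number_of_fragments": 0,
        },
        # A document matches if any single language satisfies all terms.
        "query": {"bool": {"should": should}},
    }


body = per_language_query(["some", "words"], ["en", "es"])
wildcard_body = per_language_query(["wor*"], ["en", "es"], query_type="wildcard")
```

This does not make the request any shorter on the wire, but it keeps the application code short and makes adding a language a one-item change.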

This query looks super long, though.
Is there possibly a better way to structure this?
Is there any way I could make this more performant?
Is this nested mapping for multi-language transcriptions even the best way to store the data?

system · October 10, 2019, 10:59pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
