Multi-field full-text searching in multiple languages with wildcards, fuzzy, and phrase match

Hey everybody,

I’m working on an index of audio transcriptions that are full-text searchable with wildcard, fuzzy, and phrase matching, plus highlighting.
I’m currently expanding the mappings of this index to cover multiple languages, and the highlights need to show which language version of the text each match came from.

The current data structure looks like this:

{
  "transcription_string": ""
}

And is queried like this:

{
  "highlight": {
    "fields": {
      "transcription_string": {}
    }, 
    "number_of_fragments": 0
  },
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "transcription_string": "some"
          }
        },
        {
          "match": {
            "transcription_string": "words"
          }
        }
      ]
    }
  }
}

Now, though, there are multiple fields nested inside the transcription_string field, and all of them need to be searched:

{
  "transcription_string": {
    "transcription": {
      "es": "",
      "en": ""
    }
  }
}
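
A mapping for this structure could look something like the sketch below, written as a small Python helper that only builds the request body. The `text` type, the built-in `english` and `spanish` analyzers, the typeless mapping form, and the helper name `build_mapping` are all assumptions for illustration, not something confirmed against a real index.

```python
import json


def build_mapping(analyzers):
    """Return a mappings body with one text sub-field per language.

    build_mapping is a hypothetical helper; `analyzers` maps a language code
    to an analyzer name (assumed here to be Elasticsearch built-ins).
    """
    language_fields = {
        lang: {"type": "text", "analyzer": analyzer}
        for lang, analyzer in analyzers.items()
    }
    return {
        "mappings": {
            "properties": {
                "transcription_string": {
                    "properties": {
                        "transcription": {"properties": language_fields}
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Assumption: the built-in "english" and "spanish" analyzers are used.
    print(json.dumps(build_mapping({"en": "english", "es": "spanish"}), indent=2))
```

Keeping one sub-field per language means each language gets its own analyzer, and the field name in a highlight tells you which version matched.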

At first I was investigating multi_match, with something along these lines:

{
  "highlight": {
    "fields": {
      "transcription_string.*": {}
    }, 
    "number_of_fragments": 0
  },
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "some words", 
          "fields": [ "transcription_string.*" ]
        }
      }
    }
  }
}

It appears that multi_match is not compatible with wildcards, though (and maybe not with fuzzy matching either).
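
One alternative that might be worth testing is query_string, which as far as I know accepts Lucene syntax for wildcards (`wor*`), fuzzy terms (`some~`), and quoted phrases, and takes field patterns in `fields`. A rough Python sketch of the request body, where `build_query_string_body` is a made-up helper name and nothing here has been run against a cluster:

```python
import json


def build_query_string_body(query_text):
    """Build a query_string request body over all language sub-fields.

    Hypothetical helper for illustration; it only returns a dict.
    """
    return {
        "highlight": {
            "fields": {"transcription_string.*": {}},
            "number_of_fragments": 0,
        },
        "query": {
            "query_string": {
                # Field pattern intended to expand to every language sub-field.
                "fields": ["transcription_string.transcription.*"],
                # Lucene syntax: wor* is a wildcard, some~ is fuzzy,
                # "..." is a phrase.
                "query": query_text,
                # Require every term instead of the default OR.
                "default_operator": "AND",
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_query_string_body('wor* some~ "some words"'), indent=2))
```

One caveat: I believe each term is then matched against any of the fields, so the terms could be satisfied by different languages, which is not the same as requiring all terms within one language.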

I’ve had some success with this query structure:

{
  "highlight": {
    "fields": {
      "transcription_string.*": {}
    }, 
    "number_of_fragments": 0
  },
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              {
                "match": {"transcription_string.transcription.en": "some"}
              },
              {
                "match": {"transcription_string.transcription.en": "words"}
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "match": {"transcription_string.transcription.es": "some"}
              },
              {
                "match": {"transcription_string.transcription.es": "words"}
              }
            ]
          }
        }
      ]
    }
  }
}

This seems to be compatible with wildcards, phrase matching, and fuzzy searching by replacing match with wildcard, match_phrase, or the desired structure for fuzzy searching.
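
Since the structure just repeats per language and per term, the body can be generated rather than written by hand. A minimal Python sketch, assuming the field path from the mapping above; `build_query` and its parameters are names made up for illustration, and it only builds the request body as a dict without talking to a cluster:

```python
import json

# Field prefix taken from the document structure in this post.
FIELD_PREFIX = "transcription_string.transcription"


def build_query(terms, languages, clause="match"):
    """Build the should-of-musts body: all terms must match in one language.

    build_query is a hypothetical helper, not part of any Elasticsearch
    client. `clause` can be any query type that takes {field: value},
    e.g. "match", "match_phrase", "wildcard" or "fuzzy".
    """
    per_language = []
    for lang in languages:
        field = f"{FIELD_PREFIX}.{lang}"
        # One clause per term, all required within this language's field.
        must = [{clause: {field: term}} for term in terms]
        per_language.append({"bool": {"must": must}})
    return {
        "highlight": {
            "fields": {"transcription_string.*": {}},
            "number_of_fragments": 0,
        },
        # With only should clauses, a hit needs at least one language branch.
        "query": {"bool": {"should": per_language}},
    }


if __name__ == "__main__":
    print(json.dumps(build_query(["some", "words"], ["en", "es"]), indent=2))
```

Calling `build_query(["wor*"], ["en", "es"], clause="wildcard")` would produce the wildcard variant, and adding a language is just one more entry in the list.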

This query looks super long, though.
Is there possibly a better way to structure this?
Is there any way I could make this more performant?
Is this nested mapping for multi-language transcriptions even the best way to store the data?
