I am using the query_string query with the cross_fields type to provide a simple search UI that queries many different fields as though they were one big field. I want to allow different analyzers for some of the underlying fields without losing the cross_fields functionality.
The documentation explains that only fields with the same analyzer can be grouped together into one big field. Since this is an intentional, documented decision, I'm assuming there is a good reason behind it - I'm just wondering if someone familiar with the codebase can explain it in more detail and suggest any workarounds that might exist.
Ideally what I'd like is for the following to work:
PUT synonym-cross-fields
{
  "mappings": {
    "properties": {
      "forename": { "type": "text", "analyzer": "standard", "search_analyzer": "synonym" },
      "lastname": { "type": "text" }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym_graph": {
            "type": "synonym_graph",
            "lenient": true,
            "synonyms": ["Mike,Michael"]
          }
        },
        "analyzer": {
          "synonym": {
            "tokenizer": "standard",
            "filter": ["synonym_graph", "lowercase"]
          }
        }
      }
    }
  }
}
PUT synonym-cross-fields/_doc/1
{
  "forename": "michael",
  "lastname": "greene"
}
GET synonym-cross-fields/_search
{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "type": "cross_fields",
      "fields": ["forename", "lastname"],
      "query": "Mike Greene"
    }
  }
}
What I would naively expect a sensible implementation to produce for the above query is something like:
+blended(terms:[forename:mike, forename:michael, lastname:mike])
+blended(terms:[forename:greene, lastname:greene])
In reality, the different analyzers cause the query to be organized by field rather than by term:
(+Synonym(forename:michael forename:mike) +forename:greene)
|
(+lastname:michael +lastname:greene)
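This rewritten form can be inspected with the validate API (the exact response format may differ between versions), for example:

GET synonym-cross-fields/_validate/query?rewrite=true
{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "type": "cross_fields",
      "fields": ["forename", "lastname"],
      "query": "Mike Greene"
    }
  }
}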
In the case of synonyms, I can work around this by applying the synonym analysis at index time and then using "analyzer": "standard" on my query_string query (a sketch of this is below). While this workaround is fine for my synonyms, it only works because both the "standard" and "synonym" tokens end up in the index. It doesn't work for analyzers whose tokens don't match the "standard" ones (e.g. the uax_url_email tokenizer emits a single token for an email address, so the workaround fails there).
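To spell out that workaround (the index name here is made up, and I've moved lowercase before the synonym filter and lowercased the rule so that the index-time expansion matches the lowercased document values):

PUT synonym-cross-fields-workaround
{
  "mappings": {
    "properties": {
      "forename": { "type": "text", "analyzer": "synonym" },
      "lastname": { "type": "text" }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym_graph": {
            "type": "synonym_graph",
            "lenient": true,
            "synonyms": ["mike,michael"]
          }
        },
        "analyzer": {
          "synonym": {
            "tokenizer": "standard",
            "filter": ["lowercase", "synonym_graph"]
          }
        }
      }
    }
  }
}

PUT synonym-cross-fields-workaround/_doc/1
{
  "forename": "michael",
  "lastname": "greene"
}

GET synonym-cross-fields-workaround/_search
{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "type": "cross_fields",
      "analyzer": "standard",
      "fields": ["forename", "lastname"],
      "query": "Mike Greene"
    }
  }
}

Because the query-time analyzer is now the same for both fields, the terms are blended across fields again, and the document still matches "Mike" via the synonym expansion that happened at index time.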
I am assuming the functionality is designed this way because each analyzer maps the whole query string to an independent sequence of tokens (rather than being a map from a single token to a sequence of tokens). This probably makes it impossible to generically align the tokens from different analyzers.
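As a made-up illustration of that alignment problem, the same input produces token streams of different lengths under different tokenizers, so there is no generic way to line the terms up for blending:

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "john.smith@example.com"
}

POST _analyze
{
  "tokenizer": "standard",
  "text": "john.smith@example.com"
}

The first request returns the whole email address as a single token, while the standard tokenizer splits it into multiple tokens.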
A potential alternative way to group fields (tokenizer instead of analyzer)
Considering my specific use case makes me wonder whether the cross_fields grouping could be applied at the tokenizer level instead. This wouldn't solve every case, but it would solve the synonym use case, provided the synonym analyzer also uses the standard tokenizer. In that case, the logic for cross_fields would be to apply the tokenization, group the query around these starting tokens, and then apply the synonym expansions within each group where relevant.
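To illustrate with the example index above: because the synonym analyzer starts from the standard tokenizer, its expansions sit at the same token positions that the standard analyzer would produce, which is what would make this kind of grouping possible:

GET synonym-cross-fields/_analyze
{
  "analyzer": "synonym",
  "text": "Mike Greene"
}

Here "mike" and "michael" are both reported at position 0 and "greene" at position 1, i.e. the same positions as the plain standard analyzer, so in principle the expansions could be folded into per-position blended groups.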