How best to craft this query?

Greg_Mowery · November 9, 2015, 1:24pm

I am working on switching a legacy search app to use Elastic. Having cataloged the data, I am now looking at some of their typical queries, and writing a parsing routine to switch these to elastic. I am stumped on how to convert the below query into the equivalent elastic syntax. If anyone can offer any suggestions, it would help a lot.

TITLE,KEYWORDS,KEYWORDS+
("RNA"
"RNAS"
"MTRNA"
"TRNA*"
"RRNA*"
"RIBOSOM*"
"RIBOZYME*") AND (("INTERFER*"
"GENE SILENC*"
"INTERACTION*"
"PIWI"
"SPLICING"
"SPLICE*"
"CODING"
"NONCODING"
"ENCOD*"
"HELICASE*"
"PROCESSOSOME*"
"EDITING"
"BIOGENE*"
"PROCESSING"
"BINDING"
"SUBUNIT*"
"HELIX"
"STEM CELL*"
"EUKARYOTIC*"
"PROKARYOTIC*"
"VIRAL SYSTEM*"
"STRUCTURAL ANALY*"
"BIOCHEMICAL*"
"BIOPHYSICAL*"
"BIOGENESIS*"
"CHEMISTRY") NOT ("INTERFERON*"))

This query will be issued against 3 fields (not an issue), and has nested ANDs/NOTs (not an issue). What I am stumbling over are the terms "STEM CELL*" or "STRUCTURAL ANALY*" (anything with a trailing wildcard). In the current system they preform as a match_phrase_prefix query. I am fumbling on how to mix and match this into a workable query.

How could I best output this in Elastic?

Thanks
Greg

polyfractal · November 9, 2015, 4:43pm

Can I assume these are all optional OR'd keywords, regardless of whether they are wildcard or not?? E.g. "RNA" OR "RNAS" OR "TRNA*", etc?

If that's true, I think the easiest option is for your app to separate "single words" from "wildcards". All single words going into one large match, all wildcards go into individual match_phrase_prefix. Something like this:

{
   "query": {
      "bool": {
         "must": [
            {
               "bool": {
                  "should": [
                     {
                        "match": {
                           "keyword": "RNAS MTRNA"
                        }
                     },
                     {
                        "match_phrase_prefix": {
                           "keyword": "TRNA"
                        }
                     },
                     {
                        "match_phrase_prefix": {
                           "keyword": "RRNA"
                        }
                     },
                     ...
                  ]
               }
            },
            {
               "bool": {
                  "should": [
                     {
                        "match": {
                           "keyword": "PIWI SPLICING CODING NONCODING EDITING PROCESSING BINDING CHEMISTRY"
                        }
                     },
                     {
                        "match_phrase_prefix": {
                           "keyword": "INTERFER"
                        }
                     },
                     {
                        "match_phrase_prefix": {
                           "keyword": "GENE SILENC"
                        }
                     },
                     ...
                  ],
                  "must_not": [
                     {
                        "match_phrase_prefix": {
                           "keyword": "INTERFERON"
                        }
                     }
                  ]
               }
            }
         ]
      }
   }
}

So we have two levels of bools to setup the AND/OR/NOT sequence that you have, then a set of should clauses in the interior which has a single match for all single words (and defaults to OR'ing them) and then individual phrase prefix for all the wildcards.

That should be a fairly close approximation of your query, if I understand correctly. That said...I'm wary of any query that needs so many wildcards, for performance reasons. So you may want to investigate alternate schemes to reduce your wildcards if you run into performance problems =)

Greg_Mowery · November 9, 2015, 7:04pm

Your assumption is correct, they all are OR'd. Now that I can see the structure of how the query is constructed, its making some sense.

I really appreciate the help!

Greg

polyfractal · November 9, 2015, 8:01pm

Np, happy to help! Lemme know how it goes...as an ex-biologist, your query has piqued my curiosity