ES7.9.1 : match_phrase_prefix irregular behaviour

We have an ES index covering the 230 million+ files in our archive. Each file is indexed with details such as filename, directory path and file size.

I've been attempting to construct queries to aggregate information for a given directory path, such as total volume and total number of files. For this I've been using match_phrase_prefix, but I'm seeing differing levels of reliability in the responses I'm getting.

Here are two examples. In the first, for path /badc/ukmo-midas/data/WM, I'm expecting to get 77 hits. In the second, for path /badc/ukmo-midas/data/WH, 149 hits.

The two paths are to directories which each have a couple of files and a sub-directory called 'yearly_files', which contains data files for each year.

Here's my query:

    GET ceda-fbi/_search
    {
      "query": {
        "match_phrase_prefix": {
          "info.directory.analyzed": {
            "query": "/badc/ukmo-midas/data/WM"
          }
        }
      }
    }

This gets me just 4 results - 2 for the files in the directory as given in the query itself... and then just 2 from the 'yearly_files' sub-directory.

Meanwhile:

    GET ceda-fbi/_search
    {
      "query": {
        "match_phrase_prefix": {
          "info.directory.analyzed": {
            "query": "/badc/ukmo-midas/data/WH"
          }
        }
      }
    }

gives me 147 hits, i.e. a lot closer to what I'm expecting to get.

I've also tried:

    GET ceda-fbi/_search
    {
      "query": {
        "match_phrase_prefix": {
          "info.directory.analyzed": {
            "query": "/badc/ukmo-midas/data/WM/yearly_files"
          }
        }
      }
    }

This gave the right number, but we won't know a priori which sub-directories we'll need to query. (The use case is to sum up all files within a dataset, where the dataset is identified by a particular path in the directory hierarchy, below which all its files are stored, sub-divided into directories as needed.)

I've tried changing max_expansions up from the default of 50 to 100 and then 1000. For the first path (WM) there was no change at 100, but I get the expected number at 1000. For the second path (WH) there was no change at any point, i.e. the default of 50 gave the same result as 100 and 1000.
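
For reference, max_expansions goes inside the field object alongside the query string. A sketch of the variant that returned the expected count for WM (1000 is just the value that happened to work for this path, not a recommendation):

```
GET ceda-fbi/_search
{
  "query": {
    "match_phrase_prefix": {
      "info.directory.analyzed": {
        "query": "/badc/ukmo-midas/data/WM",
        "max_expansions": 1000
      }
    }
  }
}
```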

As a quick aside - we did also try using match_phrase - that seemed to work too... but we don't want that as order is important here to ensure we're getting the right directory path being picked up.

Whilst this might appear to resolve the issue there are a few questions we are left with:

  1. Why does changing max_expansions work in this case? From what we understand, match_phrase_prefix should match the given phrase against the field in question anyway.
  2. [rhetorical question only] If we are misinterpreting how the field is being checked, then how will we know what to set max_expansions to for all parts of the archive? That seems too much of an arbitrary number for us to use this method in our use case.
  3. Are we using the wrong query here, and should we instead look at something like the much more expensive regexp query?

We've been scratching our heads about this one and have tried to google/search here for a solution, having read the docs, and we're stuck. Are any ES gurus able to shed some light, please?

It is probably worth adding the field mapping:

    {
      "index_name" : {
        "mappings" : {
          "info.directory.analyzed" : {
            "full_name" : "info.directory.analyzed",
            "mapping" : {
              "analyzed" : {
                "type" : "text"
              }
            }
          }
        }
      }
    }

I'd use a path tokenizer based analyzer.
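
A rough, untested sketch of that idea, under some assumptions: the index name ceda-fbi-paths, the analyzer name path_analyzer, the sub-field name tree and the size field info.size are all hypothetical and would need adapting to the real mapping. The built-in path_hierarchy tokenizer emits one token per ancestor directory, so a directory of /badc/ukmo-midas/data/WM/yearly_files is indexed as /badc, /badc/ukmo-midas, /badc/ukmo-midas/data, /badc/ukmo-midas/data/WM and /badc/ukmo-midas/data/WM/yearly_files. An exact term query on a dataset path should then match every file at or below that path, with no prefix expansion involved.

```
# Hypothetical index with a path_hierarchy-analyzed sub-field
PUT ceda-fbi-paths
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "info": {
        "properties": {
          "directory": {
            "type": "keyword",
            "fields": {
              "tree": {
                "type": "text",
                "analyzer": "path_analyzer"
              }
            }
          },
          "size": { "type": "long" }
        }
      }
    }
  }
}

# Count and sum the size of everything at or below a dataset path
GET ceda-fbi-paths/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "term": {
      "info.directory.tree": "/badc/ukmo-midas/data/WM"
    }
  },
  "aggs": {
    "total_volume": {
      "sum": { "field": "info.size" }
    }
  }
}
```

The term query is not analyzed, so the path has to be given exactly as the tokenizer would emit it (no trailing slash). Because tokens only end at a / boundary, a sibling directory whose name merely starts with WM would not be picked up.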

In the past, I wrote something which might help you.

http://david.pilato.fr/blog/2015/12/10/building-a-directory-map-with-elk/

HTH

Thanks David.