Querying a file path field by just the basename

I have a field that stores the full path of a file on disk. The file's basename is unique and I'd like to be able to query for just the basename as well.

The simplest solution would be to add a second field that contains just the basename, but instead I tried using an analyzer:

  1. I created a path_tree_rev_tokenizer of type path_hierarchy with delimiter / and reverse set to true, so /home/bob is tokenized as [home/bob, bob] (a quick way to verify this is shown after the list).

  2. I created a path_tree_rev analyzer that uses this tokenizer.

  3. I created a text field called file.path that uses the path_tree_rev analyzer.
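
To sanity-check the tokenizer on its own, the _analyze API accepts an inline tokenizer definition, so no index is needed. This is just a sketch mirroring the settings above:

curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/_analyze?pretty' -d '{
  "tokenizer": {
    "type": "path_hierarchy",
    "delimiter": "/",
    "reverse": true
  },
  "text": "/home/bob"
}'

The tokens array in the response should include bob, which is what the basename query relies on.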

In a simple test separate from my main application code, I created a document with /home/bob in the file.path field and queried it with:

"query": {
    "match": {
        "file.path": "bob"
    }
}

and it matched my document successfully.

However, when I copied and pasted the same code into my main application and re-created my index with the new field and analyzer definitions, Elasticsearch cannot find a "real" document by querying for the basename. The only difference I can see is that in my application the field name is source.file.path.

Is there something I'm doing wrong here, or is there a way to diagnose why the query is not succeeding?
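
For reference, the only diagnostic I have run so far is pulling the live mapping to confirm the analyzer is actually attached to the field (the index name here stands in for my application's index):

curl -XGET 'http://localhost:9200/my_index/_mapping?pretty'

If the path_tree_rev analyzer is missing from the returned mapping, the index was created without the custom settings.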

Can you share a sample document as well as the mapping of the index?

Hi @Christian_Dahlqvist. Here is a sequence of queries that creates an index, adds a document, and then queries it:

curl -XDELETE 'http://localhost:9200/foo?pretty'
curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/foo?pretty' -d '{
  "mappings": {
    "properties": {
      "bar": {
        "properties": {
          "name": {
            "type": "text"
          },
          "path": {
            "analyzer": "path_tree_rev",
            "type": "text"
          }
        },
        "type": "object"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "path_tree_rev": {
          "tokenizer": "path_tree_rev_tokenizer",
          "type": "custom"
        }
      },
      "tokenizer": {
        "path_tree_rev_tokenizer": {
          "delimiter": "/",
          "reverse": true,
          "type": "path_hierarchy"
        }
      }
    }
  }
}'
curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/foo/_doc?refresh=true&pretty' -d '{
  "bar": {
    "name": "Bob",
    "path": "/home/bob"
  }
}'
curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/foo/_search?pretty' -d '{
  "query": {
    "match": {
      "bar.path": "bob"
    }
  }
}'

Please provide an example that can be run from the Kibana console.

I updated the post above with the actual requests made by Python. Hopefully they can be pasted into Kibana easily.

I discovered that I can dump all of the terms indexed for a given field with the term vectors API, so I tried it out on my full application, where searching by basename is not working.

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/my_index/_doc/gDEhSHYBoWN8sy6cFN7j/_termvectors' -d '{
  "fields" : ["source.file"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}'

This proves that the document is properly indexed by its basename:

        "some_boring_filename" : {
          "doc_freq" : 1,
          "ttf" : 1,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 66,
              "end_offset" : 92
            }
          ]
        },

Yet when I query it:

curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/my_index/_search?pretty' -d '{
  "query": {
    "match": {
      "source.file": "some_boring_filename"
    }
  }
}'

It is not found:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
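
Since a match query also analyzes its input, a useful cross-check is to run the query string through the analyzer mapped to the field via the _analyze API (same index and field as above):

curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/my_index/_analyze?pretty' -d '{
  "field": "source.file",
  "text": "some_boring_filename"
}'

If the tokens returned here do not include some_boring_filename, then search-time analysis is where the mismatch happens.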

Bump to prevent this from being closed.

As the previous post shows, the paths are being properly tokenized, but the search is not working.
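
One more sketch that may help whoever picks this up: the validate API with explain enabled prints the rewritten Lucene query, which shows the exact terms the match query ends up searching for (same index and field as in the posts above):

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/my_index/_validate/query?explain=true&pretty' -d '{
  "query": {
    "match": {
      "source.file": "some_boring_filename"
    }
  }
}'

Comparing the terms in the explanation against the term vectors dumped earlier should reveal whether the query and the index disagree.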