Querying a file path field by just the basename

I have a field that stores the full path of a file on disk. The file's basename is unique and I'd like to be able to query for just the basename as well.

The simplest solution would be to add a second field that contains just the basename, but instead I tried using an analyzer:

  1. I created a path_tree_rev_tokenizer of type path_hierarchy with delimiter / and reverse set to true, so /home/bob is tokenized as [/home/bob, home/bob, bob].

  2. I created a path_tree_rev analyzer that uses this tokenizer.

  3. I made a text field called file.path that uses the path_tree_rev analyzer.
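For reference, here is a minimal Python sketch of the token stream I expect from the reverse path_hierarchy tokenizer (an approximation of Elasticsearch's behavior for illustration, not its actual implementation):

```python
def reverse_path_hierarchy_tokens(path, delimiter="/"):
    """Rough sketch of what path_hierarchy with reverse=true emits:
    the full path first, then each shorter suffix, down to the basename."""
    parts = path.split(delimiter)
    tokens = [path]
    for i in range(1, len(parts)):
        suffix = delimiter.join(parts[i:])
        if suffix:  # skip empty suffixes caused by leading/trailing delimiters
            tokens.append(suffix)
    return tokens


print(reverse_path_hierarchy_tokens("/home/bob"))
# ['/home/bob', 'home/bob', 'bob']
```

The key point is that the basename always appears as the final token, which is why a match query for just the basename should hit the document.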

In a simple test separate from my main application code, I created a document with /home/bob in the file.path field and queried it with

"query": {
    "match": {
        "file.path": "bob"
    }
}

and it matched my document successfully.

However, when I copied the code verbatim into my main application and re-created my index with the new field and analyzer definitions, Elasticsearch was unable to find a "real" document by querying for the basename. The only difference I can see is that in my application the field name is source.file.path.

Is there something I'm doing wrong here, or is there a way to diagnose why the query is not succeeding?

Can you share a sample document as well as the mapping of the index?

Hi @Christian_Dahlqvist. Here is a sequence of queries that creates an index, adds a document, and then queries it:

curl -XDELETE 'http://localhost:9200/foo?pretty'
curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/foo?pretty' -d '{
  "mappings": {
    "properties": {
      "bar": {
        "properties": {
          "name": {
            "type": "text"
          },
          "path": {
            "analyzer": "path_tree_rev",
            "type": "text"
          }
        },
        "type": "object"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "path_tree_rev": {
          "tokenizer": "path_tree_rev_tokenizer",
          "type": "custom"
        }
      },
      "tokenizer": {
        "path_tree_rev_tokenizer": {
          "delimiter": "/",
          "reverse": true,
          "type": "path_hierarchy"
        }
      }
    }
  }
}'
curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/foo/_doc?refresh=true&pretty' -d '{
  "bar": {
    "name": "Bob",
    "path": "/home/bob"
  }
}'
curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/foo/_search?pretty' -d '{
  "query": {
    "match": {
      "bar.path": "bob"
    }
  }
}'

Please provide an example that can be run from the Kibana console.

I updated the post above with the actual requests made by Python. Hopefully they can be pasted into Kibana easily.
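For convenience, here is the same sequence in Kibana Dev Tools console syntax (a direct translation of the curl commands above):

```
DELETE /foo

PUT /foo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_tree_rev": {
          "type": "custom",
          "tokenizer": "path_tree_rev_tokenizer"
        }
      },
      "tokenizer": {
        "path_tree_rev_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "bar": {
        "properties": {
          "name": { "type": "text" },
          "path": { "type": "text", "analyzer": "path_tree_rev" }
        }
      }
    }
  }
}

POST /foo/_doc
{
  "bar": {
    "name": "Bob",
    "path": "/home/bob"
  }
}

POST /foo/_search
{
  "query": {
    "match": {
      "bar.path": "bob"
    }
  }
}
```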

I discovered that I can dump all of the terms indexed for a given field, so I tried it on my full application, where searching by basename is not working.

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/my_index/_doc/gDEhSHYBoWN8sy6cFN7j/_termvectors' -d '{
  "fields" : ["source.file"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}'

This shows that the document is properly indexed by its basename:

        "some_boring_filename" : {
          "doc_freq" : 1,
          "ttf" : 1,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 66,
              "end_offset" : 92
            }
          ]
        },

Yet when I query it:

curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/my_index/_search?pretty' -d '{
  "query": {
    "match": {
      "source.file": "some_boring_filename"
    }
  }
}'

It is not found:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Bump to prevent this from being closed.

As the previous post shows, the paths are being properly tokenized, but the search is not working.
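One way to compare the query side against those indexed terms is the _analyze API with the field parameter, which runs the text through the analyzer that would be applied at search time (using the index and field names from above):

```
POST /my_index/_analyze
{
  "field": "source.file",
  "text": "some_boring_filename"
}
```

If the tokens returned here do not include some_boring_filename, then the query text is being transformed differently from the indexed path, which would explain the empty result.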

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.