Keyword normalizer search behavior


(Eric Miller) #1

I am using ES 5.5.1. I have data which includes a path attribute. That attribute has values like foo/bar/file.txt. I want to find files by folder. For example, I'd like to search for foo/bar/ and find foo/bar/file.txt (as well as foo/bar/file2.txt, but not files in sub-folders like foo/bar/baz/file.txt). I would also like to aggregate and sort by folder. I think the most efficient way to do this is to have a keyword sub-field that is indexed with the folder path. However, I'm having trouble searching on such a value.

Consider the following index create command.

PUT path_index
{
    "settings": {
        "index": {
            "analysis": {
                "char_filter": {
                    "folder_filter": {
                        "pattern": "(.*/)[^/]+",
                        "type": "pattern_replace",
                        "replacement": "$1"
                    }
                },
                "analyzer": {
                    "folder": {
                        "tokenizer": "keyword",
                        "char_filter": [
                            "folder_filter"
                        ]
                    }
                },
                "normalizer": {
                    "folder": {
                        "char_filter": [
                            "folder_filter"
                        ]
                    }
                }
            }
        }
    },
    "mappings": {
        "pathType": {
            "properties": {
                "path": {
                    "type": "keyword",
                    "fields": {
                        "folder": {
                            "type": "text",
                            "analyzer": "folder",
                            "fielddata": true
                        },
                        "folderKeyword": {
                            "type": "keyword",
                            "normalizer": "folder"
                        }
                    }
                }
            }
        }
    }
}

The above creates a "folder" text sub-field that keeps only the folder part of the path (foo/bar/ for foo/bar/file.txt). That does what I want, and searches on it work correctly. I also have a sub-field "folderKeyword", which is meant to do the same thing but as a keyword with a normalizer. "folderKeyword" does not work as I expect.

I add one document to my index:

PUT path_index/pathType/0
{
  "path": "foo/bar/file.txt"
}

Then this search finds that document.

POST path_index/_search
{
    "query": {
        "term": {"path.folder": "foo/bar/"}
    }
}

But this search fails to find the document.

POST path_index/_search
{
    "query": {
        "term": {"path.folderKeyword": "foo/bar/"}
    }
}

I don't see the difference. Why does "folderKeyword" fail?

Additional info: a prefix search on "path.folder": "foo/bar/" works, but the same prefix search on "path.folderKeyword": "foo/bar/" fails. However, a prefix search without the trailing slash, "path.folderKeyword": "foo/bar", succeeds, while a term search without the trailing slash, "path.folderKeyword": "foo/bar", fails. A sort on "path.folder" and a sort on "path.folderKeyword" both show the same sort value: "foo/bar/".


(Abdon Pijpelink) #2

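The term query applies the field's normalizer to the query value too, so your query for foo/bar/ becomes a query for foo//: the pattern matches foo/bar, replaces it with foo/, and leaves the trailing slash in place. The indexed value is foo/bar/, so the term does not match. This is counter-intuitive, and it looks like it is going to change in the future (https://github.com/elastic/elasticsearch/issues/25487).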
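
To see why the query value changes, here is a minimal sketch that applies the same pattern and replacement outside Elasticsearch. It is only an approximation: the pattern_replace char filter uses Java regular expressions, and this uses Python's re module, on the assumption that the two behave the same for this simple pattern. The function name folder_filter is a hypothetical stand-in, not an Elasticsearch API.

```python
import re

# Pattern copied from the folder_filter char filter in the index settings.
# Java writes the group reference as $1; Python writes it as \1.
FOLDER_PATTERN = re.compile(r"(.*/)[^/]+")


def folder_filter(value: str) -> str:
    """Hypothetical stand-in for the pattern_replace char filter."""
    # Replace every match with its first capture group (the folder part).
    return FOLDER_PATTERN.sub(r"\1", value)


# Value stored at index time for the document's path.
print(folder_filter("foo/bar/file.txt"))  # foo/bar/

# Value the term query ends up searching for after normalization.
print(folder_filter("foo/bar/"))          # foo//

# Term query value without the trailing slash.
print(folder_filter("foo/bar"))           # foo/
```

Neither normalized query value equals the stored foo/bar/, which is consistent with both term searches on "path.folderKeyword" failing.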

Until then, a way around this is to use a different query that does not apply normalization, for example a query_string query. This does find your document:

POST path_index/_search
{
  "query": {
    "query_string": {
      "query": "path.folderKeyword:foo/bar/"
    }
  }
}
