Regexp matches when it shouldn't, doesn't match when it should


#1

Hi there!

Pretty new to ES so please bear with me!

So, I have an index "dirs" which contains a bunch of documents, each representing a directory in some filesystem. Looks a bit like this...

{
"settings":{
	"number_of_shards":1,
	"number_of_replicas":0
},
"mappings":{
	"_default_": {
		"_all": {
			"enabled": true
		}
	},
	"dir":{
		"_all": {
			"enabled": true
		},
		"properties":{
			"Path":{ "type": "text", "index" : "not_analyzed" },
			"Depth":{ "type": "integer"},
			"Fingerprint":{"type": "text"}
		}
	}
}
}

Now, I want to search this index using a regular expression, say "/Users.*", to get every directory under /Users.

$ curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "regexp":{
            "Path": "/Users"
        }
    }
}
'
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Is this ES saying there's nothing in the index with a path starting with /Users? If so, that's wrong! Let's prove it with a more liberal regex...

$ curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "regexp":{
            "Path": ".*"
        }
    }
}
'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "hits" : {
    "total" : 31321,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "dirs",
        "_type" : "dir",
        "_id" : "/Users/marmel01/ether/etherpad-lite/src/node_modules/wd/node_modules/caseless",
        "_score" : 1.0,
        "_source" : {
          "Path" : "/Users/marmel01/ether/etherpad-lite/src/node_modules/wd/node_modules/caseless",
          "Depth" : 9,
          "Fingerprint" : ""
        }
      }
    ]
  }
}

See, EVERYTHING starts with /Users, something's wrong here!

Here's another bit of weirdness: instead of matching "/Users", let's try "users" (lowercase, no leading slash).

$ curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "regexp":{
            "Path": "users"
        }
    }
}
'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "hits" : {
    "total" : 31321,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "dirs",
        "_type" : "dir",
        "_id" : "/Users/marmel01/ether/etherpad-lite/src/node_modules/wd/node_modules/caseless",
        "_score" : 1.0,
        "_source" : {
          "Path" : "/Users/marmel01/ether/etherpad-lite/src/node_modules/wd/node_modules/caseless",
          "Depth" : 9,
          "Fingerprint" : ""
        }
      }
    ]
  }
}

So, is "users" matching "Users" here? Is there some case sensitivity weirdness going on here?

What's more, we lose our match again if we put the slash back in ("/users"):

$ curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "regexp":{
            "Path": "/users"
        }
    }
}
'
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Is this expected behaviour? Is there a better way of doing these kinds of matches? I'm going the regex route because eventually I'd like to do more complex matches, for example "/Users/[^/]+" to get all directories under /Users.
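One thing worth keeping in mind for patterns like these: as far as I know, Lucene regular expressions are anchored, so a pattern has to match an entire indexed term rather than just a prefix of it. The sketch below only illustrates that idea in Python, with `re.fullmatch` standing in for the Lucene regexp engine (whose syntax differs in places). The `paths` list and the `anchored_matches` helper are made up for the example and assume each path is indexed as one unmodified term.

```python
import re

# Hypothetical sample of Path values, each assumed to be indexed as a single term.
paths = [
    "/Users",
    "/Users/marmel01",
    "/Users/marmel01/ether",
    "/var/log",
]

def anchored_matches(pattern, terms):
    """Return the terms that the pattern matches in full (anchored at both ends)."""
    compiled = re.compile(pattern)
    return [t for t in terms if compiled.fullmatch(t)]

print(anchored_matches("/Users", paths))        # only the exact value "/Users"
print(anchored_matches("/Users.*", paths))      # "/Users" and everything below it
print(anchored_matches("/Users/[^/]+", paths))  # direct children of /Users only
```

Under that assumption, a bare "/Users" would only ever hit a directory whose path is exactly "/Users", which is why the trailing ".*" matters.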

Thanks!

Mark


(Adrien Grand) #2

What is the output of the validate API with rewrite=true on your query?


#3

Hi jpountz, thanks for taking a look!

Here's the output for the lowercase, no-slash query, which shouldn't match but does:

curl -XGET 'localhost:9200/dirs/dir/_validate/query?rewrite=true&pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "regexp":{
            "Path": "users"
        }
    }
}
'
{
  "valid" : true,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "explanations" : [
    {
      "index" : "dirs",
      "valid" : true,
      "explanation" : "Path:/users/"
    }
  ]
}

The uppercase query, which should match but doesn't...

curl -XGET 'localhost:9200/dirs/dir/_validate/query?rewrite=true&pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "regexp":{
            "Path": "Users"
        }
    }
}
'
{
  "valid" : true,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "explanations" : [
    {
      "index" : "dirs",
      "valid" : true,
      "explanation" : "Path:/Users/"
    }
  ]
}

The uppercase, slashed query, which should match but doesn't...

curl -XGET 'localhost:9200/dirs/dir/_validate/query?rewrite=true&pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "regexp":{
            "Path": "/Users"
        }
    }
}
'
{
  "valid" : true,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "explanations" : [
    {
      "index" : "dirs",
      "valid" : true,
      "explanation" : "Path://Users/"
    }
  ]
}

All valid by the look of it!

Mark


#4

So, I think I managed to get the required behaviour by changing the type of the Path field from "text" to "keyword".

Not really sure what this does, or if what I was seeing is expected behaviour with the text type.
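A rough sketch of what seems to be going on, for anyone landing here later. This is a Python simulation under stated assumptions, not Elasticsearch itself. A "text" field is run through an analyzer (by default the standard analyzer, which lowercases and splits the value into terms), while a "keyword" field indexes the whole value as a single term. The regexp query is then matched against individual terms. My understanding, which I have not checked against every version, is that `"index": "not_analyzed"` belongs to the older string type and does not turn off analysis on a "text" field. The helpers `approx_standard_analyzer` and `regexp_hits` are made up for this example, and the `\w+` split is only an approximation of the real tokenizer.

```python
import re

def approx_standard_analyzer(value):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", value.lower())

def regexp_hits(pattern, terms):
    """Anchored match: the pattern must cover an entire term to count as a hit."""
    return any(re.fullmatch(pattern, t) for t in terms)

path = "/Users/marmel01/ether/etherpad-lite/src/node_modules/wd/node_modules/caseless"

text_terms = approx_standard_analyzer(path)  # approximately what a "text" field indexes
keyword_terms = [path]                        # a "keyword" field keeps the value as one term

# Print, for each pattern, whether it hits the "text" terms and the "keyword" term.
for pattern in ["users", "Users", "/Users", "/users", "/Users.*"]:
    print(pattern, regexp_hits(pattern, text_terms), regexp_hits(pattern, keyword_terms))
```

In this simulation "users" hits the analyzed terms but none of the slashed or uppercase patterns do, which lines up with the results above, and only "/Users.*" hits the keyword value.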

Time to RTFM I guess :wink:

Cheers,

Mark

