Unexpected results in regexp query

Markos_Fragkakis · August 13, 2014, 9:06am

Hi all,

I have created my own analyzer that tokenizes text into sentences. This is
the relevant part of the index settings:

        "analysis": {
           "filter": {
              "csr_token_length_filter": {
                 "type": "length",
                 "max": "32776",
                 "min": "0"
              }
           },
           "char_filter": {
              "csr_new_line_character_filter": {
                 "type": "mapping",
                 "mappings": [
                    " -\\n => ",
                    " —\\n => ",
                    " \\n =>\\u0020"
                 ]
              }
           },
           "analyzer": {
              "csr_sentence_analyzer": {
                 "filter": "csr_token_length_filter",
                 "char_filter": [
                    "html_strip",
                    "csr_new_line_character_filter"
                 ],
                 "type": "custom",
                 "tokenizer": "csr_sentence_tokenizer"
              }
           },
           "tokenizer": {
              "csr_sentence_tokenizer": {
                 "flags": [
                    "NONE"
                 ],
                 "type": "pattern",
                 "pattern": "(?<=[.?!])\\s+(?=[\\da-zA-Z])"
              }
           }
        },
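For anyone who wants to reproduce the sentence splitting outside Elasticsearch, here is a minimal sketch in Python. It assumes that Python's `re` module handles this pattern the same way as the Java regex behind the pattern tokenizer, and it leaves out the char filters and the length filter. The name `split_sentences` is made up for illustration.

```python
import re

# Same pattern as csr_sentence_tokenizer, with the JSON escaping removed.
# ASSUMPTION: Python's re treats it like the Java regex used by the tokenizer.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.?!])\s+(?=[\da-zA-Z])")


def split_sentences(text):
    """Split at whitespace that follows . ? or ! and precedes a letter or digit."""
    # Empty pieces are dropped, as a tokenizer would not emit empty tokens.
    return [piece for piece in SENTENCE_BOUNDARY.split(text) if piece]


sample = (
    "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod "
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, "
    "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo "
    "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse "
    "cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat "
    "non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
)

sentences = split_sentences(sample)
for sentence in sentences:
    print(repr(sentence))
```

Each printed string is one sentence, and the spaces between the words stay inside it because the pattern only consumes the whitespace at sentence boundaries.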

I am indexing a string containing the lorem ipsum text. When I tokenize it
in Marvel with this:

    GET /markosindex/_analyze?analyzer=csr_sentence_analyzer
    {
    Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
    tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
    quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
    consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
    cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat
    non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    }

I get this:

    {
       "tokens": [
          {
             "token": "{ Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.",
             "start_offset": 0,
             "end_offset": 126,
             "type": "word",
             "position": 1
          },
          {
             "token": "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.",
             "start_offset": 127,
             "end_offset": 234,
             "type": "word",
             "position": 2
          },
          {
             "token": "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.",
             "start_offset": 235,
             "end_offset": 337,
             "type": "word",
             "position": 3
          },
          {
             "token": "Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. } ",
             "start_offset": 338,
             "end_offset": 451,
             "type": "word",
             "position": 4
          }
       ]
    }

But when I run a regexp query on this field, I don't get any results:

    GET /markosindex/_search
    {
       "fields": ["filename", "mime"],
       "query": {
          "filtered": {
             "query": {
                "regexp": {
                   "fileTextContent.fileTextContentSentenceAnalyzed": ".*consectetur\s+adipisicing\s+elit.*"
                }
             },
             "filter": {
                "query": {
                   "match_all": {}
                }
             }
          }
       },
       "highlight": {
          "fields": {
             "fileTextContent.fileTextContentSentenceAnalyzed": {}
          }
       }
    }

However, when I change the intermediate "\s+" to ".+", the query works:

    GET /markosindex/_search
    {
       "fields": ["filename", "mime"],
       "query": {
          "filtered": {
             "query": {
                "regexp": {
                   "fileTextContent.fileTextContentSentenceAnalyzed": ".*consectetur.+adipisicing.+elit.*"
                }
             },
             "filter": {
                "query": {
                   "match_all": {}
                }
             }
          }
       },
       "highlight": {
          "fields": {
             "fileTextContent.fileTextContentSentenceAnalyzed": {}
          }
       }
    }

Any idea what is going on? The analyzer output shows spaces between the
words in each token (sentence), yet the regexp query with "\s+" returns
no results.

Cheers,

Markos

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a89f8dfc-16b0-46c7-b0ae-f68f1ac9079f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Markos_Fragkakis · August 13, 2014, 9:13am

Forgot to add the relevant part of my mapping:

           "fileTextContent": {
              "type": "string",
              "index": "no",
              "fields": {
                 "fileTextContentSentenceAnalyzed": {
                    "type": "string",
                    "analyzer": "csr_sentence_analyzer"
                 },
                 "fileTextContentAnalyzed": {
                    "type": "string"
                 }
              }
           },
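Regarding the regexp queries in the first post, here is a small Python sketch that may help narrow things down. It rests on several assumptions that are not confirmed anywhere in this thread: that the query patterns were meant to begin and end with `.*`, that the regexp query must match the whole indexed term (anchored matching), and that the query-side regex dialect reads `\s` as an escaped literal `s` rather than as a whitespace class. Python's `re.fullmatch` only stands in for the anchored behaviour, and `matches_whole_term` is a hypothetical helper.

```python
import re

# First token reported by _analyze in the first post
# (the e-mail line wrap inside the string is replaced by a single space).
token = (
    "{ Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do "
    "eiusmod tempor incididunt ut labore et dolore magna aliqua."
)


def matches_whole_term(pattern, term):
    """Emulate an anchored regexp query: the pattern must cover the entire term."""
    return re.fullmatch(pattern, term) is not None


# Python semantics: \s is a whitespace class, so the token matches.
with_class = matches_whole_term(r".*consectetur\s+adipisicing\s+elit.*", token)

# ASSUMPTION: a dialect without a \s class reads "\s" as a literal "s".
as_literal_s = matches_whole_term(r".*consecteturs+adipisicings+elit.*", token)

# A literal space instead of \s avoids relying on the class at all.
with_space = matches_whole_term(r".*consectetur +adipisicing +elit.*", token)

# Without the surrounding .* an anchored match cannot cover the whole sentence.
unanchored = matches_whole_term(r"consectetur +adipisicing +elit", token)

print(with_class, as_literal_s, with_space, unanchored)
```

If those assumptions hold, the token itself is fine and the difference lies in how the query dialect interprets `\s`. In that case a literal space (or `.+`, as in the second query) might be a workable substitute, but this is worth checking against the regexp syntax documentation for the Elasticsearch version in use.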

