Unexpected results in regexp query


(Markos Fragkakis) #1

Hi all,

I have created a custom analyzer that tokenizes text into sentences. This is
the relevant part of the index settings:

        "analysis": {
           "filter": {
              "csr_token_length_filter": {
                 "type": "length",
                 "max": "32776",
                 "min": "0"
              }
           },
           "char_filter": {
              "csr_new_line_character_filter": {
                 "type": "mapping",
                 "mappings": [
                    " -\\n => ",
                    " —\\n => ",
                    " \\n =>\\u0020"
                 ]
              }
           },
           "analyzer": {
              "csr_sentence_analyzer": {
                 "filter": "csr_token_length_filter",
                 "char_filter": [
                    "html_strip",
                    "csr_new_line_character_filter"
                 ],
                 "type": "custom",
                 "tokenizer": "csr_sentence_tokenizer"
              }
           },
           "tokenizer": {
              "csr_sentence_tokenizer": {
                 "flags": [
                    "NONE"
                 ],
                 "type": "pattern",
                 "pattern": "(?<=[.?!])\\s+(?=[\\da-zA-Z])"
              }
           }
        },
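
To make the tokenizer pattern easier to reason about, here is a minimal sketch that applies the same regular expression outside Elasticsearch. It assumes Python's `re` module behaves like the Java regex engine for this particular pattern (lookbehind on `.`, `?` or `!`, one or more whitespace characters, lookahead on a letter or digit); the names `SENTENCE_BOUNDARY` and `split_sentences` and the sample text are made up for illustration, and the char filters and length filter are not reproduced.

```python
import re

# Same pattern as csr_sentence_tokenizer, with the JSON escaping removed.
# Assumption: Python's re treats this pattern like Java's regex engine does.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.?!])\s+(?=[\da-zA-Z])")


def split_sentences(text):
    """Split at whitespace that follows . ? or ! and precedes a letter or digit."""
    return SENTENCE_BOUNDARY.split(text)


# Hypothetical sample text, shortened from the lorem ipsum passage.
sample = "Lorem ipsum dolor sit amet. Ut enim ad minim veniam! Duis aute irure?  Excepteur sint."
sentences = split_sentences(sample)

for sentence in sentences:
    print(repr(sentence))
```

Each sentence comes out as one token with its internal spaces intact, which is the behaviour the `_analyze` call is expected to show.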

I am indexing a string containing the lorem ipsum text. When I run it
through the analyzer in Marvel with this:

GET /markosindex/_analyze?analyzer=csr_sentence_analyzer
{
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat
non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
}

I get this:

{
  "tokens": [
    {
      "token": "{ Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.",
      "start_offset": 0,
      "end_offset": 126,
      "type": "word",
      "position": 1
    },
    {
      "token": "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.",
      "start_offset": 127,
      "end_offset": 234,
      "type": "word",
      "position": 2
    },
    {
      "token": "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.",
      "start_offset": 235,
      "end_offset": 337,
      "type": "word",
      "position": 3
    },
    {
      "token": "Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. } ",
      "start_offset": 338,
      "end_offset": 451,
      "type": "word",
      "position": 4
    }
  ]
}

But when I run a regexp query on this field, I don't get any results:

GET /markosindex/_search
{
  "fields": ["filename", "mime"],
  "query": {
    "filtered": {
      "query": {
        "regexp": {
          "fileTextContent.fileTextContentSentenceAnalyzed": ".consectetur\s+adipisicing\s+elit."
        }
      },
      "filter": {
        "query": {
          "match_all": {}
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "fileTextContent.fileTextContentSentenceAnalyzed": {}
    }
  }
}

However, when I change the intermediate "\s+" to ".+", the query works:

GET /markosindex/_search
{
  "fields": ["filename", "mime"],
  "query": {
    "filtered": {
      "query": {
        "regexp": {
          "fileTextContent.fileTextContentSentenceAnalyzed": ".consectetur.+adipisicing.+elit."
        }
      },
      "filter": {
        "query": {
          "match_all": {}
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "fileTextContent.fileTextContentSentenceAnalyzed": {}
    }
  }
}

Any idea what is going on? The analyzer output shows spaces between the
words in each token (sentence), yet the regexp query with "\s+" does not
match them.
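
As a sanity check outside Elasticsearch, the sketch below runs both patterns against the first sentence token using Python's `re` module. This is only an illustration under stated assumptions: Python's regex dialect is not Lucene's, `re.search` is unanchored whereas the regexp query matches against the whole indexed term, and the names `token`, `patterns` and `matches` are made up here. If both patterns match in Python, the difference between the two queries is more likely down to how the regexp query's own syntax treats `\s` than to the indexed text; whether that syntax supports `\s` as a whitespace class is worth verifying against the documentation for the version in use.

```python
import re

# First token from the _analyze output, copied from the post above.
token = ("{ Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do "
         "eiusmod tempor incididunt ut labore et dolore magna aliqua.")

# The two patterns from the queries, written as raw strings so the
# backslashes reach the regex engine unchanged.
patterns = {
    "whitespace": r".consectetur\s+adipisicing\s+elit.",
    "any_char": r".consectetur.+adipisicing.+elit.",
}


def matches(pattern, text):
    """Return True if the pattern is found anywhere in text (unanchored)."""
    return re.search(pattern, text) is not None


results = {name: matches(pattern, token) for name, pattern in patterns.items()}
print(results)
```

In Python both patterns find a match in the token, so the spaces shown by the analyzer are ordinary whitespace as far as this engine is concerned.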

Cheers,

Markos

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a89f8dfc-16b0-46c7-b0ae-f68f1ac9079f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Markos Fragkakis) #2

I forgot to add the relevant part of my mapping:

           "fileTextContent": {
              "type": "string",
              "index": "no",
              "fields": {
                 "fileTextContentSentenceAnalyzed": {
                    "type": "string",
                    "analyzer": "csr_sentence_analyzer"
                 },
                 "fileTextContentAnalyzed": {
                    "type": "string"
                 }
              }
           },


