Hello,
I built an analyzer plugin that tokenizes XML. It generates one token per XML attribute, and all tokens for an XML node are set at the same position.
I need a clean way to search for XML nodes that match several criteria at once. Example: given <w a=1 b=2 c=3>, I want to find all nodes that have a=1 AND b=2.
I found an ugly solution by experimenting with the span API (setting slop to -1, see below). I wonder whether there is a better one.
Now, in more detail:
POST /_analyze
{"analyzer":"annotation", "text":"<w lemma=be>am</w>"}
This outputs:
{
  "tokens": [
    {
      "token": "am",
      "start_offset": 0,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "lemma=be",
      "start_offset": 0,
      "end_offset": 18,
      "type": "attr",
      "position": 0
    }
  ]
}
I need a way to retrieve the document if it contains a node with the word am that has the lemma=be attribute.
Note: am and lemma=be are at the same position.
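To make the requirement concrete, here is a minimal Python sketch of the condition, run on the token stream from the _analyze output above. It is not Elasticsearch code, and the helper name positions_with_all is made up for illustration: it groups tokens by position and keeps the positions where every required token occurs.

```python
from collections import defaultdict

# Token text and position copied from the _analyze output above.
tokens = [
    {"token": "am", "position": 0},
    {"token": "lemma=be", "position": 0},
]


def positions_with_all(tokens, required):
    """Return the sorted positions at which every term in `required` occurs.

    Hypothetical helper, only meant to spell out the matching rule.
    """
    by_position = defaultdict(set)
    for tok in tokens:
        by_position[tok["position"]].add(tok["token"])
    wanted = set(required)
    # A position qualifies only if it carries all the wanted tokens.
    return sorted(pos for pos, terms in by_position.items() if wanted <= terms)


print(positions_with_all(tokens, ["am", "lemma=be"]))    # [0]
print(positions_with_all(tokens, ["am", "lemma=have"]))  # []
```

A document should be returned exactly when this list is non-empty for at least one of its nodes.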
I couldn't find a way to achieve this with the regular query language, but I got something working with the span_near query. It is kinda hacky: the trick was to set "slop" to -1 and "in_order" to false.
GET /corpus/segment/_search
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "sr": "am" } },
        { "span_term": { "sr": "lemma=be" } }
      ],
      "slop": -1,
      "in_order": false
    }
  }
}
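A possible explanation for why slop -1 works, sketched in Python. This rests on an assumption that has not been checked against a real index: that Lucene's unordered span-near matching measures a candidate match as (max end position - min start position) - (sum of the clause lengths) and accepts it when that value is <= slop. The function names below are hypothetical and only model that rule.

```python
def unordered_width(spans):
    """Assumed width of an unordered near match.

    `spans` is a list of (start, end) positions with `end` exclusive,
    so a single term at position p is the span (p, p + 1).
    """
    max_end = max(end for _, end in spans)
    min_start = min(start for start, _ in spans)
    total_length = sum(end - start for start, end in spans)
    return max_end - min_start - total_length


def matches(term_positions, slop):
    """True if single-term clauses at these positions pass the assumed slop check."""
    spans = [(p, p + 1) for p in term_positions]
    return unordered_width(spans) <= slop


print(matches([0, 0], -1))     # True: two terms stacked on one position
print(matches([0, 1], -1))     # False: two adjacent terms
print(matches([0, 0, 1], -1))  # True: three clauses, one of them NOT stacked
print(matches([0, 0, 1], -2))  # False
print(matches([0, 0, 0], -2))  # True: three terms stacked on one position
```

If this model is right, -1 is only strict enough for two clauses. With n single-term clauses the slop would have to be 1 - n (so -2 for a=1 AND b=2 AND c=3), otherwise a clause at a neighbouring position could slip through.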
If anyone has experience with this or advice on how to achieve it more cleanly, it would be much appreciated.
Thanks!
JC