Hello,
I built an analyzer plugin that tokenizes XML. It generates one token per XML attribute, and all tokens for an XML node are set at the same position.
I need a clean way to search for XML nodes that match multiple criteria. Example: given <w a=1 b=2 c=3>, I want to find all nodes that have a=1 AND b=2.
I found an ugly solution by experimenting with the span API (setting slop to -1, see below). I wonder if there is a better solution.
Now, in more detail:
POST /_analyze
{"analyzer":"annotation", "text":"<w lemma=be>am</w>"}
This will output:
{
  "tokens": [
    {
      "token": "am",
      "start_offset": 0,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "lemma=be",
      "start_offset": 0,
      "end_offset": 18,
      "type": "attr",
      "position": 0
    }
  ]
}
I need a way to retrieve the document if it contains a node with the word am that has the lemma=be attribute.
Note: am and lemma=be are at the same position.
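To make the matching rule concrete, here is a minimal Python sketch of what "same position" means for the _analyze output above. It only models the desired semantics on the client side and is not how Elasticsearch evaluates the query. The function name positions_matching is hypothetical, and the token list is copied from the output shown earlier.

```python
from collections import defaultdict


def positions_matching(tokens, required):
    """Return the sorted positions at which every required token occurs.

    `tokens` is the "tokens" list of an _analyze response;
    `required` is an iterable of token strings, e.g. ["am", "lemma=be"].
    """
    # Group the token strings by the position they were emitted at.
    by_position = defaultdict(set)
    for tok in tokens:
        by_position[tok["position"]].add(tok["token"])

    wanted = set(required)
    # A position matches only if it carries all required tokens.
    return sorted(pos for pos, present in by_position.items() if wanted <= present)


# Token list taken from the _analyze output for "<w lemma=be>am</w>".
tokens = [
    {"token": "am", "start_offset": 0, "end_offset": 18, "type": "word", "position": 0},
    {"token": "lemma=be", "start_offset": 0, "end_offset": 18, "type": "attr", "position": 0},
]

print(positions_matching(tokens, ["am", "lemma=be"]))
```

A node matches when the returned list is non-empty. This is the behaviour I want the search query to reproduce on the index side.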
I couldn't find a way to achieve this with the query language, but I got something working with the span_near API. It is kind of hacky: the trick was to set "slop" to -1 and "in_order" to false.
GET /corpus/segment/_search
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "sr": "am" } },
        { "span_term": { "sr": "lemma=be" } }
      ],
      "slop": -1,
      "in_order": false
    }
  }
}
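For reference, the same request body can be generated for any list of terms with a small Python helper. This is only a sketch of the workaround above, not a cleaner solution. The helper name same_position_query is hypothetical, and the field name "sr" comes from my mapping. I have only tried the two-clause case, so whether -1 is still the right slop for three or more clauses is untested, and the slop is left as a parameter for that reason.

```python
import json


def same_position_query(field, terms, slop=-1):
    """Build a span_near search body requiring all terms close together.

    Assumption: slop=-1 with in_order=False restricts matches to terms
    at the same position, as observed for two clauses only.
    """
    if len(terms) < 2:
        raise ValueError("span_near needs at least two terms to be useful here")
    return {
        "query": {
            "span_near": {
                # One span_term clause per required token.
                "clauses": [{"span_term": {field: term}} for term in terms],
                "slop": slop,
                "in_order": False,
            }
        }
    }


body = same_position_query("sr", ["am", "lemma=be"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of GET /corpus/segment/_search.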
If anyone has experience or advice on how to achieve this more cleanly, it would be appreciated.
Thanks!
JC