Querying tokens at the same position

Hello,

I built an analyzer plugin that tokenizes XML. It generates one token per XML attribute, and all tokens for an XML node are set at the same position.

I need a clean way to search for XML nodes that match multiple criteria. For example, given <w a=1 b=2 c=3>, I want to find all nodes that have a=1 AND b=2.

I found an ugly solution by experimenting with the span API (setting slop to -1, see below). I wonder whether there is a better one.

Now, in more detail:

POST /_analyze

{"analyzer":"annotation", "text":"<w lemma=be>am</w>"}

This outputs:

{
  "tokens": [
    {
      "token": "am",
      "start_offset": 0,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "lemma=be",
      "start_offset": 0,
      "end_offset": 18,
      "type": "attr",
      "position": 0
    }
  ]
}

I need a way to retrieve the document if it contains a node with the word am that has the lemma=be attribute.

Note: am and lemma=be are at the same position.
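To make the requirement concrete, here is a minimal Python sketch of the matching rule I am after, run against the tokens from the _analyze output above. It only models the semantics outside Elasticsearch; the function name positions_matching_all is made up for illustration and is not part of any API.

```python
from collections import defaultdict

# Tokens copied from the _analyze output above (offsets omitted).
tokens = [
    {"token": "am", "type": "word", "position": 0},
    {"token": "lemma=be", "type": "attr", "position": 0},
]


def positions_matching_all(tokens, required):
    """Return the sorted positions at which every term in `required` occurs.

    Hypothetical helper: it models the desired "same position" AND,
    not how Elasticsearch evaluates queries internally.
    """
    by_position = defaultdict(set)
    for tok in tokens:
        by_position[tok["position"]].add(tok["token"])
    wanted = set(required)
    return sorted(pos for pos, terms in by_position.items() if wanted <= terms)


print(positions_matching_all(tokens, ["am", "lemma=be"]))    # [0]
print(positions_matching_all(tokens, ["am", "lemma=have"]))  # []
```

A document should match only when this list is non-empty, which is stricter than a plain bool query that accepts the two terms anywhere in the field.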

I couldn't find a way to achieve this with the query language, but I got something working with the span_near query, which is rather hacky: the trick is to set "slop" to -1 and "in_order" to false.

GET /corpus/segment/_search

{
    "query": {
        "span_near" : {
            "clauses" : [
                { "span_term" : { "sr": "am" } },
                { "span_term" : { "sr": "lemma=be" } }
            ],
            "slop" : -1,
            "in_order" : false
        }
    }
}
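For reference, the request body above can be generated with a small Python helper, which is roughly how I would wrap the workaround on the client side. This is only a sketch: the function name same_position_query is hypothetical, and it deliberately covers just the two-clause case shown above, since I have not verified which slop value would be needed for more clauses.

```python
import json


def same_position_query(field, term_a, term_b):
    """Build the span_near request body for two terms at the same position.

    Hypothetical helper that mirrors the workaround above:
    slop = -1 and in_order = false, with exactly two span_term clauses.
    """
    return {
        "query": {
            "span_near": {
                "clauses": [
                    {"span_term": {field: term_a}},
                    {"span_term": {field: term_b}},
                ],
                "slop": -1,
                "in_order": False,
            }
        }
    }


# Prints the same JSON body as the GET /corpus/segment/_search request above.
print(json.dumps(same_position_query("sr", "am", "lemma=be"), indent=4))
```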

If anyone has experience or advice on how to achieve this more cleanly, it would be appreciated.

Thanks!
JC
