How can I get Spans' position data from a Near Span Query

hananlevy · January 19, 2017, 9:45pm

I'm running a SpanNear query against a non-stored text field in Elasticsearch. Is there a way of getting the spans' position info? In Lucene and Lucene.NET you can get it through the "getSpans" method. However, I can't find a way of getting the data using Elasticsearch.

Any thoughts?

P.S. I need the position information so I can access the term vectors and get the actual terms.

hananlevy · January 25, 2017, 5:46pm

This is the use case. I'd appreciate any thoughts.

Assume you're indexing large amounts of data in very big documents, so you can't store the actual data and would also like to exclude it from the _source field.

Let's say you wish to find a sequence of 4 terms that each match a particular regular expression,
for instance a MasterCard number sequence.
You could create 4 span_multi queries, each wrapping a regexp query, and join them all in an ordered SpanNear query with a slop of 0.
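As a rough sketch, the request body for such a query could be built like this in Python. The field name "content", the pattern "[0-9]{4}" and the helper name card_sequence_query are assumptions made for illustration; only the span_near, span_multi and regexp query types come from the Elasticsearch query DSL.

```python
import copy
import json


def card_sequence_query(field, group_pattern="[0-9]{4}", groups=4):
    """Build a span_near request body matching `groups` adjacent terms.

    `field`, `group_pattern` and `groups` are illustrative assumptions,
    not values taken from a real mapping.
    """
    # One span_multi clause wrapping a regexp (multi-term) query.
    clause = {
        "span_multi": {
            "match": {"regexp": {field: {"value": group_pattern}}}
        }
    }
    return {
        "query": {
            "span_near": {
                # Independent copies so the clauses can be edited separately.
                "clauses": [copy.deepcopy(clause) for _ in range(groups)],
                "slop": 0,          # terms must be adjacent
                "in_order": True,   # and appear in the given order
            }
        }
    }


body = card_sequence_query("content")
print(json.dumps(body, indent=2))
```

The printed JSON is what you would send as the search request body. It only returns the matching documents, not the positions of the matches.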

That would give you the doc_id of every document that contains such a sequence.
However, suppose you want to know the actual sequence that matched the SpanNearQuery. For example, if you want to verify that the matched terms actually constitute a valid credit card number (let's say you're monitoring sensitive information for security, classification and remediation purposes), you would have to run a verification algorithm on the actual terms. You'd have to join the 4 consecutive terms into a card number and verify it with the Luhn algorithm.
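To make the verification step concrete, here is a minimal sketch of the Luhn check applied to the joined terms. The function names are made up for this example, and the sample value is a commonly published test number, not real data.

```python
def luhn_valid(number):
    """Return True if `number` (a string of ASCII digits) passes the Luhn check."""
    if not number or any(ch not in "0123456789" for ch in number):
        return False
    total = 0
    # Walk the digits right to left, doubling every second one.
    for index, ch in enumerate(reversed(number)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def is_card_number(terms):
    """Join the matched terms into one number and run the Luhn check on it."""
    return luhn_valid("".join(terms))


print(is_card_number(["5555", "5555", "5555", "4444"]))  # True
print(is_card_number(["5555", "5555", "5555", "4445"]))  # False
```

This part is easy. The hard part is getting the four terms in the first place.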

You can't use highlighting, since the actual data isn't stored (it's too big) and you've excluded it from the _source.

One way of getting the data is based on spans and term vectors.
The spans information from the SpanNear query will contain the position information for each matching sequence (and for each term within that sequence) in every document that matched it.

Using this position information, you can then access the term vector of that specific field in that specific document and retrieve the actual terms.
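Here is a sketch of that lookup, assuming the span positions were somehow available. It assumes a term vectors response with positions for the field, and it treats the span end as exclusive, as Lucene's Spans do. The sample response, the field name "content" and the function names are invented for illustration.

```python
def terms_by_position(term_vectors_response, field):
    """Invert a term vectors response into a {position: term} mapping."""
    terms = term_vectors_response["term_vectors"][field]["terms"]
    by_position = {}
    for term, info in terms.items():
        # Each token entry carries the position of one occurrence of the term.
        for token in info.get("tokens", []):
            by_position[token["position"]] = term
    return by_position


def terms_in_span(by_position, start, end):
    """Return the terms at positions start..end-1 (end is exclusive)."""
    return [by_position[p] for p in range(start, end) if p in by_position]


# Invented sample shaped like a term vectors response with positions.
sample_response = {
    "term_vectors": {
        "content": {
            "terms": {
                "card": {"term_freq": 1, "tokens": [{"position": 2}]},
                "5555": {
                    "term_freq": 3,
                    "tokens": [{"position": 3}, {"position": 4}, {"position": 5}],
                },
                "4444": {"term_freq": 1, "tokens": [{"position": 6}]},
            }
        }
    }
}

positions = terms_by_position(sample_response, "content")
print(terms_in_span(positions, 3, 7))  # ['5555', '5555', '5555', '4444']
```

The start and end values (3 and 7 here) are exactly the piece that is missing: the term vector can be fetched per document, but the span positions cannot.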

There are probably tons of other situations in which you'd want to access the actual term at a specific position.

And since that information is already accessible internally, it's a shame not to make it available for use.
Lucene's span queries all expose getSpans as a public method, and you can easily access it using Lucene.
I don't think you can do that in Elasticsearch. (Do correct me if I'm wrong - it would be super helpful!)

A good response in my view would be a JSON array containing all span objects for each hit document, where each span object contains the start and end position of the match within that document.
