Can analyzer positions be overloaded to enable lattice / confusion network indexing & search?


#1

I am looking into storing probabilistic lattices/confusion networks rather than one-best sequences of words. In a lattice/confusion network, each word position in the document is really a set of possible words with associated probabilities that sum to 100%. As far as I can tell, there is no low-level support for lattices/confusion networks in Lucene, nor higher-level support in Elasticsearch. Is that correct?

Another major consideration is that I need to preserve phrase matching. My somewhat hackish idea after looking through the documentation is to abuse the "position" values in the output of the analyzers. If I could force multiple tokens to share the same position, I would essentially create an index from a lattice while potentially preserving the phrase match functionality, assuming it depends only on monotonically increasing position values. Is this reasonable? And if so, how can I accomplish this? A custom analyzer?
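To make the idea concrete, here is a minimal sketch in plain Python (not Lucene code) of a positional index in which every candidate word in a slot shares that slot's position, plus a phrase match that only checks for consecutive positions. The names `build_index` and `phrase_match` and the sample network are made up for illustration. In Lucene terms, sharing a position corresponds to emitting a token with a position increment of 0, which is how synonym filters stack tokens.

```python
from collections import defaultdict


def build_index(slots):
    """Map each term to the set of positions where it is a candidate.

    `slots` is a confusion network: a list of slots, each a dict
    {word: probability}. All candidates in a slot share one position.
    """
    index = defaultdict(set)
    for position, candidates in enumerate(slots):
        for word in candidates:
            index[word].add(position)
    return index


def phrase_match(index, phrase):
    """Return the start positions where the phrase's words occur at
    consecutive positions, regardless of which candidate was chosen."""
    if not phrase:
        return []
    starts = set(index.get(phrase[0], ()))
    for offset, word in enumerate(phrase[1:], start=1):
        positions = index.get(word, ())
        starts = {s for s in starts if s + offset in positions}
    return sorted(starts)


# Hypothetical recognizer output: four slots with competing words.
network = [
    {"i": 0.9, "eye": 0.1},
    {"scream": 0.6, "screen": 0.4},
    {"for": 0.7, "four": 0.3},
    {"ice": 0.8, "eyes": 0.2},
]
index = build_index(network)

print(phrase_match(index, ["eye", "screen", "for"]))   # [0]
print(phrase_match(index, ["scream", "four", "ice"]))  # [1]
print(phrase_match(index, ["i", "for"]))               # []
```

Note that the probabilities are ignored here; the sketch only shows that phrase matching survives when competitors share a position.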

Thanks!


#2

A confusion network is simpler, because it matches structurally what is encoded anyway (e.g. synonyms share the same position value). If you need to store additional per-position data (e.g. POS or some probability), maybe look into Lucene payloads, which are just a per-position byte[] where you can shove that stuff.
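As a rough illustration of the payload idea, the sketch below (plain Python, not the Lucene API) packs each candidate's probability into a 4-byte big-endian float and stores it next to the position in a postings list. The helper names and the 4-byte encoding are assumptions for this example; a real payload can hold whatever bytes you choose.

```python
import struct


def encode_payload(probability):
    """Pack a probability into 4 bytes (big-endian float32)."""
    return struct.pack(">f", probability)


def decode_payload(payload):
    """Unpack the 4-byte payload back into a float."""
    return struct.unpack(">f", payload)[0]


def build_postings(slots):
    """Map each term to a list of (position, payload bytes) entries.

    `slots` is a confusion network: a list of dicts {word: probability}.
    """
    postings = {}
    for position, candidates in enumerate(slots):
        for word, probability in candidates.items():
            entry = (position, encode_payload(probability))
            postings.setdefault(word, []).append(entry)
    return postings


# Hypothetical two-slot network.
network = [
    {"i": 0.75, "eye": 0.25},
    {"scream": 0.5, "screen": 0.5},
]
postings = build_postings(network)

position, payload = postings["eye"][0]
print(position, len(payload), decode_payload(payload))  # 0 4 0.25
```

The values 0.75, 0.25 and 0.5 are exactly representable as float32; other probabilities would come back with a small rounding error.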

A lattice would be more complex: you'd have to encode additional information into the payload, e.g. the "phrase length" (called positionLength in Lucene analyzer terminology).
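A small sketch of why the length matters, again in plain Python rather than Lucene: each arc of the lattice is stored as (start position, position length), and a phrase matches only if each word starts exactly where the previous arc ended. The function names and the sample lattice are hypothetical.

```python
from collections import defaultdict


def build_lattice_index(edges):
    """Map each term to a list of (start, length) arcs.

    `edges` is an iterable of (term, start_position, position_length).
    """
    index = defaultdict(list)
    for term, start, length in edges:
        index[term].append((start, length))
    return index


def lattice_phrase_match(index, phrase):
    """Return start positions of paths through the lattice that spell
    out the phrase, where each arc begins where the previous one ends."""
    if not phrase:
        return []
    # Each frontier entry is (where the phrase began, where the next arc must start).
    frontier = {(start, start + length)
                for start, length in index.get(phrase[0], ())}
    for term in phrase[1:]:
        arcs = index.get(term, ())
        frontier = {(origin, start + length)
                    for origin, expected in frontier
                    for start, length in arcs
                    if start == expected}
    return sorted({origin for origin, _ in frontier})


# Hypothetical lattice: "recognize" spans the same stretch as "wreck a nice".
edges = [
    ("recognize", 0, 3),
    ("wreck", 0, 1), ("a", 1, 1), ("nice", 2, 1),
    ("speech", 3, 1), ("beach", 3, 1),
]
index = build_lattice_index(edges)

print(lattice_phrase_match(index, ["recognize", "speech"]))  # [0]
print(lattice_phrase_match(index, ["nice", "beach"]))        # [2]
print(lattice_phrase_match(index, ["wreck", "speech"]))      # []
```

With positions alone, "recognize" at 0 and "speech" at 3 would not look adjacent, which is why the length has to be stored somewhere.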

In either case, keep in mind it's an inverted index, so I'm not sure how useful doing this would be.


#3

Realistically, by index time we would always store a confusion network. From what I can tell, the synonym behavior for phrase search is extremely similar to what I am looking to do, with one notable difference: we would want to declare "synonyms" at a specific time slice, not globally for a document or index. In fact, they aren't really synonyms in the traditional sense, just "probabilistic competitors". Is there an obvious way to do this?

The payload option is a good suggestion but doesn't address the critical issue of full phrase search. While this matters in many languages, the most important example I can come up with is Mandarin, where there are no word boundaries. In order to search for a given string of characters, you essentially perform a full string match. When dealing with confusion networks, this means storing each sausage as a single character and performing a phrase search (if you stored "words", they would be arbitrary due to the language; speakers agree on "word" boundaries only ~70% of the time). In other words, in this context phrase search = term search.

In English, for confusion networks phrase search != term search, but if you want to preserve normal phrase search then the payload suggestion wouldn't help.

Thanks!

