Can analyzer positions be overloaded to enable lattice / confusion network indexing & search?

pdiorio · June 11, 2015, 3:58am

I am looking into storing probabilistic lattices/confusion networks rather than one-best collections of words. In a lattice/confusion network, each word position in the document is really a set of possible words with associated probabilities that sum to 100%. As far as I can tell, there is no low-level support for lattices / confusion networks in Lucene, nor higher-level support in Elasticsearch. Is that correct?

Another major consideration is that I need to preserve phrase matching. My somewhat hack-ish idea after looking through the documentation is to abuse the "position" values in the output of the analyzers. If I could force multiple tokens to share the same position, I would essentially create an index from a lattice while potentially preserving the phrase match functionality, which depends on monotonically increasing position values. Is this reasonable? If so, how can I accomplish this? A custom analyzer?
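
To make the idea concrete, here is a toy sketch in plain Python of a positional index in which all competitors in a slot share one position. It is not Lucene or Elasticsearch code; `build_index` and `phrase_match` are made-up names, and the model assumes one position per confusion-network slot.

```python
from collections import defaultdict

def build_index(slots):
    """Build token -> set of positions. Slot i of the confusion network
    becomes position i, so every competitor in a slot shares that position."""
    index = defaultdict(set)
    for pos, candidates in enumerate(slots):
        for tok in candidates:
            index[tok].add(pos)
    return index

def phrase_match(index, phrase):
    """Return True if the phrase terms occur at consecutive positions."""
    if not phrase:
        return False
    starts = index.get(phrase[0], set())
    return any(
        all(start + i in index.get(tok, set()) for i, tok in enumerate(phrase))
        for start in starts
    )

# Hypothetical three-slot confusion network.
slots = [["the"], ["cat", "cap"], ["sat", "sad"]]
index = build_index(slots)
print(phrase_match(index, ["the", "cap", "sat"]))  # path through the network
print(phrase_match(index, ["cat", "cap"]))         # competitors in the same slot
```

In this model any path through the slots matches as a phrase, while two competitors from the same slot never match as adjacent terms.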

Thanks!

rmuir · June 17, 2015, 6:37pm

A confusion network is simpler, because it matches structurally what is encoded anyway (e.g. synonyms share the same position value). If you need to store additional per-position data (e.g. POS or some probability), maybe look into Lucene payloads, which are just a per-position byte[] where you can shove that stuff.
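
As a rough illustration of the payload idea, the toy Python model below stores each competitor's probability as a 4-byte blob next to its position. This is not Lucene's API: `encode_prob`, `build_index` and `phrase_score` are hypothetical names, and scoring a phrase by the product of the per-position probabilities is an assumption made for the example.

```python
import struct
from collections import defaultdict

def encode_prob(p):
    """Pack a probability into 4 bytes (big-endian float32)."""
    return struct.pack(">f", p)

def decode_prob(payload):
    """Unpack a probability from its 4-byte payload."""
    return struct.unpack(">f", payload)[0]

def build_index(slots):
    """slots: list of dicts mapping token -> probability.
    Result: token -> {position: payload bytes}."""
    index = defaultdict(dict)
    for pos, candidates in enumerate(slots):
        for tok, p in candidates.items():
            index[tok][pos] = encode_prob(p)
    return index

def phrase_score(index, phrase):
    """Best product of probabilities over consecutive-position matches,
    or 0.0 if the phrase does not match (assumed scoring rule)."""
    if not phrase:
        return 0.0
    best = 0.0
    for start in index.get(phrase[0], {}):
        score = 1.0
        for i, tok in enumerate(phrase):
            payload = index.get(tok, {}).get(start + i)
            if payload is None:
                score = 0.0
                break
            score *= decode_prob(payload)
        best = max(best, score)
    return best

slots = [{"the": 1.0}, {"cat": 0.75, "cap": 0.25}, {"sat": 0.5, "sad": 0.5}]
index = build_index(slots)
print(phrase_score(index, ["the", "cat", "sat"]))
```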

A lattice would be more complex: you'd have to encode additional information into the payload, e.g. the "phrase length" (called positionLength in Lucene analyzer terminology).
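
A sketch of why the extra length matters, again as a toy Python model rather than Lucene code: each arc carries a start position and a position length, and a phrase matches only if each term starts where the previous one ended. The names `build_lattice_index` and `lattice_phrase_match` and the example arcs are invented for illustration.

```python
from collections import defaultdict

def build_lattice_index(arcs):
    """arcs: iterable of (token, position, position_length).
    Result: token -> set of (position, position_length)."""
    index = defaultdict(set)
    for tok, pos, length in arcs:
        index[tok].add((pos, length))
    return index

def lattice_phrase_match(index, phrase):
    """True if the terms form a connected path: each term must start at the
    position where the previous term ended (start + position_length)."""
    if not phrase:
        return False
    # Positions at which the next term is allowed to start.
    frontier = {pos + length for pos, length in index.get(phrase[0], set())}
    for tok in phrase[1:]:
        frontier = {
            pos + length
            for pos, length in index.get(tok, set())
            if pos in frontier
        }
        if not frontier:
            return False
    return bool(frontier)

# "scream" spans two positions and competes with "ice" + "cream".
arcs = [("i", 0, 1), ("scream", 1, 2), ("ice", 1, 1), ("cream", 2, 1), ("loudly", 3, 1)]
index = build_lattice_index(arcs)
print(lattice_phrase_match(index, ["i", "scream", "loudly"]))
print(lattice_phrase_match(index, ["ice", "loudly"]))
```

Without the length, "ice loudly" could not be told apart from a valid path by position arithmetic alone, since a one-position step from "scream" would land on "cream" rather than "loudly".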

In either case, keep in mind it's an inverted index, so I'm not sure how useful doing this would be.

pdiorio · June 20, 2015, 3:26am

Realistically, by index time we would always store a confusion network. From what I can tell, the synonym behavior for phrase search is extremely similar to what I am looking to do, with one notable difference: we would want to declare "synonyms" at a specific time slice, not global to a document or index. In fact, they aren't really synonyms in the traditional sense, just "probabilistic competitors". Is there an obvious way to do this?

The payload option is a good suggestion but doesn't address the critical issue of full phrase search. While important in many languages, the most important example I can come up with is Mandarin, where there are no word boundaries. In order to search for a given string of characters, you essentially perform a full string match... but when dealing with confusion networks, this essentially means storing each sausage as a single character and performing a phrase search (because if you stored "words" they would be arbitrary due to the language; speakers agree ~70% of the time on "word" boundaries). In other words, in this context phrase search = term search.
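
For the character-level case, a minimal Python sketch of what I mean: every slot holds competing single characters, and a query string is split into characters and matched as a phrase. The function names and the example slots are made up, and this is a toy model rather than anything Lucene provides.

```python
from collections import defaultdict

def build_char_index(slots):
    """slots: list of lists of competing single characters.
    Result: character -> set of slot positions."""
    index = defaultdict(set)
    for pos, chars in enumerate(slots):
        for ch in chars:
            index[ch].add(pos)
    return index

def string_search(index, query):
    """Treat each character of the query as a term and require the
    characters to occur in consecutive slots (a phrase search)."""
    if not query:
        return False
    return any(
        all(start + i in index.get(ch, set()) for i, ch in enumerate(query))
        for start in index.get(query[0], set())
    )

# Hypothetical recognizer output: five slots, some with two competitors.
slots = [["我"], ["是", "事"], ["中", "钟"], ["国", "果"], ["人"]]
index = build_char_index(slots)
print(string_search(index, "中国"))
print(string_search(index, "国中"))
```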

In English, for confusion networks phrase search != term search, but if you want to preserve normal phrase search then the payload suggestion wouldn't help.

Thanks!
