Completion suggester with synonyms analyzer - getting TokenStream expanded to 450 finite strings. Only <= 256 finite strings are supported


#1

ES version used: 2.3
I'm using the completion suggester with a custom analyzer (named "synonimize" in the schema below) that expands the words in each indexed document with their synonyms.

When trying to index some documents with an input field containing many synonyms, I get the following message:

java.lang.IllegalArgumentException: TokenStream expanded to 450 finite strings. Only <= 256 finite strings are supported
        at org.elasticsearch.search.suggest.completion.CompletionTokenStream.incrementToken(CompletionTokenStream.java:66)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:634)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1256)
        at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:530)
        at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
        at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
        at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
        at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:119)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I tried setting the max_token_length to 900, but it does not have any effect and the above exception message continues to show up.

Here is my schema:

{
  "mappings": {
    "pname_ar": {
      "properties": {
        "name": { "type": "string" },
        "suggest": {
          "type": "completion",
          "analyzer": "synonimize",
          "search_analyzer": "autocomplete",
          "preserve_position_increments": false,
          "payloads": true,
          "context": {
            "catalog": {
              "type": "category"
            }
          }
        },
        "max_token_length": 900
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "addy_synonym_filter": {
          "type": "synonym",
          "synonyms": [__my synonyms list is placed here__]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "lowercase",
          "max_token_length": 900
        },
        "synonimize": {
          "type": "custom",
          "tokenizer": "bigStandardTokenizer",
          "filter": [
            "lowercase",
            "addy_synonym_filter"
          ],
          "max_token_length": 900
        }
      },
      "tokenizer": {
        "bigStandardTokenizer": {
          "type": "standard",
          "max_token_length": 900
        }
      }
    }
  }
}

As you can see, I tried adding "max_token_length" at several levels of the configuration, but none of them has any effect on the error message above.

Any suggestions for how to get around this error?


(Jimferenczi) #2

As you can see, I tried adding “max_token_length” at different tokens and they don’t seem to have an effect on the above mentioned error message.
Any suggestions for how to bypass the above message ?

The number of finite strings is the number of distinct paths that a single input produces once it has been analyzed.
Suppose you have a synonym rule like street,st and the input is 1 st street. The completion field will index 1 st street, 1 street street, 1 st st and 1 street st. That is 4 finite strings, and the count can easily explode if you define a lot of variations for the same word, because the alternatives multiply across tokens.
max_token_length restricts the length of the input, 1 st street in this example. If you still have the issue with a value of 900, it means that you have an input that, when restricted to its first 900 characters, creates more than 256 possible paths. Unfortunately there is no way to change the finite strings limit in 2.x (see below).
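
The path expansion described above can be sketched in a few lines of Python. This is only an illustration of the counting argument, not the Lucene implementation: the SYNONYMS table and the function names are hypothetical, tokens are split on whitespace, and multi-word synonyms are ignored.

```python
from itertools import product
from math import prod

# Hypothetical synonym table for illustration: every token maps to the full
# list of forms it can expand to (including itself).
SYNONYMS = {
    "street": ["street", "st"],
    "st": ["st", "street"],
}

MAX_FINITE_STRINGS = 256  # the hard limit quoted in the 2.x error message


def alternatives(text, synonyms=SYNONYMS):
    """Lowercase, split on whitespace, and list the variants of each token."""
    return [synonyms.get(tok, [tok]) for tok in text.lower().split()]


def finite_strings(text, synonyms=SYNONYMS):
    """Enumerate every path through the expanded token sequence."""
    return [" ".join(path) for path in product(*alternatives(text, synonyms))]


def count_finite_strings(text, synonyms=SYNONYMS):
    """Count the paths without materialising them."""
    return prod(len(alts) for alts in alternatives(text, synonyms))


if __name__ == "__main__":
    # The example from the post: two tokens with two variants each.
    for path in finite_strings("1 st street"):
        print(path)

    # Nine tokens with two variants each already exceed the limit.
    long_input = " ".join(["street"] * 9)
    n = count_finite_strings(long_input)
    status = "over the limit" if n > MAX_FINITE_STRINGS else "within the limit"
    print(n, "paths,", status)
```

Running a count like this over the inputs before indexing would show which documents exceed 256 paths, assuming the synonym table mirrors the one configured in the analyzer.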

java.lang.IllegalArgumentException: TokenStream expanded to 450 finite strings. Only <= 256 finite strings are supported at

You're hitting the hard limit set in 2.x to prevent the path explosions that synonyms or other expansions could produce.
This limit has been removed in the suggest v2 available for new indices in 5.x. You should upgrade to the latest version and reindex your suggestions into a new index to get this new suggester.


#4

Thank you for the response.
I can’t upgrade to ES v5 right now due to the issue I reported earlier on https://github.com/elastic/elasticsearch/issues/22912

I ended up using the edge n-grams search-as-you-type method explained here:
https://www.elastic.co/guide/en/elasticsearch/guide/master/_index_time_search_as_you_type.html

However, I’m struggling now with another issue. Word order in the query does not seem to be respected by ES.
For instance, when searching for “jo”, I get all of the following (in this order):

“school was funded by john White”
“John and his wife went to the pool”

I need to give higher priority to the second sentence since the term “john” appears at the start of that sentence.
The same problem happens with “john w” as the order of the terms is not respected with the edge n-grams method.


(Jimferenczi) #5

We plan to work on the issue you mentioned soon; no promises, but it's on the radar. I don't understand how using edge ngrams solves the duplication issue, though. If you switched to this method, you still need to deduplicate entries when you index your data, don't you?
Regarding your issue with edge ngrams, you could set up two fields: one with a keyword tokenizer and an edge ngram filter, and one with a standard tokenizer and the same edge ngram filter. Then at query time you could query both fields in should clauses, with a big boost on the prefix field (the keyword-tokenized one), to make sure that an exact prefix match is always preferred.
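
A minimal sketch of that two-field setup, written as Python dictionaries that can be serialized to JSON request bodies. Everything named here is an assumption for illustration: the analyzer and filter names, the "prefix" and "words" sub-fields, the gram sizes and the boost value. It has not been run against a cluster and targets the 2.x mapping syntax used elsewhere in this thread.

```python
import json


def index_body(type_name="pname_ar", field="name"):
    """Settings and mapping with two edge n-gram sub-fields for one string field."""
    analysis = {
        "filter": {
            # Hypothetical gram sizes; max_gram caps the prefix length that can match.
            "edge_filter": {"type": "edge_ngram", "min_gram": 1, "max_gram": 20}
        },
        "analyzer": {
            # Whole input as one token, then edge n-grams:
            # matches only prefixes of the entire sentence.
            "sentence_prefix": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase", "edge_filter"],
            },
            # One token per word, then edge n-grams:
            # matches prefixes of any word in the sentence.
            "word_prefix": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "edge_filter"],
            },
            # Search side of the prefix field: no n-grams, just lowercase.
            "sentence_search": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"],
            },
        },
    }
    mapping = {
        field: {
            "type": "string",
            "fields": {
                "prefix": {
                    "type": "string",
                    "analyzer": "sentence_prefix",
                    "search_analyzer": "sentence_search",
                },
                "words": {
                    "type": "string",
                    "analyzer": "word_prefix",
                    "search_analyzer": "standard",
                },
            },
        }
    }
    return {
        "settings": {"analysis": analysis},
        "mappings": {type_name: {"properties": mapping}},
    }


def prefix_boosted_query(text, field="name", boost=10):
    """Query both sub-fields; the sentence-prefix match carries the boost."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field + ".prefix": {"query": text, "boost": boost}}},
                    {"match": {field + ".words": {"query": text}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(index_body(), indent=2))
    print(json.dumps(prefix_boosted_query("jo"), indent=2))
```

With a mapping along these lines, "jo" would match both example sentences through the words sub-field, while only the sentence starting with "John" would also match the boosted prefix sub-field.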


#6

This is being used for a separate part of the application in which only unique values are added (deduplication is handled by a DB query and all values need to be inserted once).

I've been thinking about adding a separate field as you suggested, but totally forgot about the "keyword" type. I'll experiment with it and let you know how it goes. Thanks again.


#7

Unfortunately, adding a new field with the "keyword" type did not help, as I was unable to apply the synonyms filter to the individual words in the sentence.

What I did instead was rewrite the query as follows:

"query": {
  "bool": {
    "must": { "match": { "fieldName": request_text } },
    "should": {
      "span_first": {
        "match": {
          "span_term": { "fieldName": { "value": request_text, "boost": 15 } }
        },
        "end": 1
      }
    }
  }
}

The above works fine with a single term. However, when the user starts typing the second term, the span_first part no longer matches. I've been looking for a way to replace the span_term part with a span_phrase, but with no luck so far.


(Jimferenczi) #8

span_term queries are not analyzed (the input is taken as the exact term to look up in the index), so you cannot pass request_text to them directly. You would need to do the analysis on your side to use span queries, so I don't think it's a good solution.
Also, the synonyms limitation is not really a problem for the keyword solution, because inputs that match through synonyms would still be returned by the text field part. They will not get the boost, but at least they will match, so only exact prefix matches rank first, which seems to be the primary goal?
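
To see why an unanalyzed term query misses once the user types a second word, here is a toy model in Python. The simple_analyze function is a rough stand-in for a standard tokenizer plus lowercase filter, not the real analysis chain, and the tiny term index is purely illustrative.

```python
import re


def simple_analyze(text):
    """Rough stand-in for standard tokenizer + lowercase (an assumption)."""
    return re.findall(r"\w+", text.lower())


def build_term_index(docs):
    """Map each indexed term to the set of (doc_id, position) pairs."""
    index = {}
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(simple_analyze(text)):
            index.setdefault(term, set()).add((doc_id, pos))
    return index


docs = [
    "school was funded by john White",
    "John and his wife went to the pool",
]
index = build_term_index(docs)

if __name__ == "__main__":
    # A term-level lookup uses the raw input as a single term: no such term exists.
    print(index.get("john White"))
    # After client-side analysis each token is a real indexed term.
    for token in simple_analyze("john White"):
        print(token, sorted(index.get(token, set())))
```

The raw string is never a term in the index, while each analyzed token is, together with the position that span queries rely on.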


#9

I apply a simple cleanup to the user input on the application side before passing it to ES, so only the standard analyzer is applied to the search query.

I think what you are suggesting is that I apply the synonyms filter to the search query? If my understanding is correct, then there won't be any matching synonyms while the user is still typing a term, as partial terms won't have any hits in the synonyms dictionary.

I'm currently experimenting with span_near and adding it to my query. The goal is to give priority to matching terms that appear at the beginning of the indexed sentences. span_near does not help with restricting the search to only the first tokens of the documents. However, its 'slop' feature is very helpful. Here is what I have so far:

{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "team member" } },
        {
          "span_first": {
            "match": {
              "span_term": { "name": { "value": "team", "boost": 15 } }
            },
            "end": 1
          }
        },
        {
          "span_near": {
            "clauses": [
              { "span_term": { "name": "team" } },
              { "span_term": { "name": "member" } }
            ],
            "slop": 0,
            "in_order": true
          }
        }
      ]
    }
  },
  "size": 100
}

Of course, my application would have to rewrite the above query to accommodate the number of tokens in the search query. If the search query contains only one term, then the span_near part would not be needed. If more than two tokens exist, then span_near would have to accommodate all of the tokens.
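
The application-side rewriting described above might look roughly like the following Python sketch. The tokenize and build_query helpers are hypothetical, and the tokenizer is only an approximation of what the index-side analyzer does; the field name, boost and size are taken from the example query.

```python
import json
import re


def tokenize(text):
    """Client-side approximation of the analysis (lowercase, split on non-word chars)."""
    return re.findall(r"\w+", text.lower())


def build_query(text, field="name", boost=15, size=100):
    """Build the bool/should query, adding span_near only for two or more tokens."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("search text contains no tokens")

    should = [
        # Analyzed full-text match on the whole input.
        {"match": {field: text}},
        # Boost documents whose first token equals the first query token.
        {
            "span_first": {
                "match": {
                    "span_term": {field: {"value": tokens[0], "boost": boost}}
                },
                "end": 1,
            }
        },
    ]
    if len(tokens) >= 2:
        # All tokens, adjacent and in order.
        should.append(
            {
                "span_near": {
                    "clauses": [{"span_term": {field: t}} for t in tokens],
                    "slop": 0,
                    "in_order": True,
                }
            }
        )
    return {"query": {"bool": {"should": should}}, "size": size}


if __name__ == "__main__":
    print(json.dumps(build_query("team member"), indent=2))
```

As far as I know, span_first accepts any span query as its match clause, so wrapping the span_near inside a span_first with end set to the number of tokens may be a way to anchor the whole phrase to the start of the field. That variant is untested here.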

