Ingest attachment: increase the file content size to index

Hello. I am using Elasticsearch 5.4 with the ingest-attachment plugin. Indexing, searching, and analyzing work fine with file content up to 32 KB, but I have a requirement to index and search big files.
I found a possible solution in max_content_length: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/modules-http.html
But my problem is that I am using the TransportClient, and when I try to index a big file I get the following exception:

java.lang.IllegalArgumentException: Document contains at least one immense term in field="attachment.content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[80, 114, 101, 112, 97, 114, 101, 100, 32, 101, 120, 99, 108, 117, 115, 105, 118, 101, 108, 121, 32, 102, 111, 114, 32, 86, 106, 97, 99, 104]...', original message: bytes can be at most 32766 in length; got 395087

Could you explain how to increase the file content size for the TransportClient?

Hello. Could you answer my question above? I have to solve this problem in the near future.

I think you are using the ingest-attachment plugin incorrectly.

Could you tell us what your mapping is, what the ingest pipeline is, and how you index a document?

I think the error message is saying that there is a single huge term in the document. It seems that splitting it into many terms was not successful or not possible.

Is it possible that this document contains such a term? Or that Tika was not able to create multiple terms out of this document?
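One quick way to check how a given analyzer splits text into terms is the _analyze API. Just a sketch here, with the built-in english analyzer and a made-up sentence; replace them with whatever analyzer and content your field actually uses:

localhost:9200/_analyze

{
  "analyzer": "english",
  "text": "Prepared exclusively for testing the analysis chain"
}

If your real content comes back as one single huge token instead of many small ones, that would match the error above.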

My theory is that he is indexing the raw BASE64 content and the attachment processor has not been run on this field.

Thank you for replying.

localhost:9200/_ingest/pipeline/attachment

{
"attachment": {
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "payload",
                "indexed_chars": "-1",
                "properties": [
                    "content",
                    "content_type",
                    "content_length",
                    "title",
                    "language"
                ]
            },
            "remove": {
                "field": "payload"
            }
        }
    ]
} }

localhost:9200/payload_index/my_type/_mapping

{
"payload_index": {
	"mappings": {
		"my_type": {
			"properties": {
				"attachment": {
					"properties": {
						"content": {
							"type": "text",
							"fields": {
								"_lowercase": {
									"type": "text",
									"analyzer": "_lowercase"
								}
							},
							"analyzer": "english"
						},
						"content_length": {
							"type": "long"
						},
						"content_type": {
							"type": "text",
							"fields": {
								"keyword": {
									"type": "keyword",
									"ignore_above": 256
								}
							}
						},
						"language": {
							"type": "text",
							"fields": {
								"keyword": {
									"type": "keyword",
									"ignore_above": 256
								}
							}
						},
						"title": {
							"type": "text",
							"fields": {
								"keyword": {
									"type": "keyword",
									"ignore_above": 256
								}
							}
						}
					}
				},
				"payload": {
					"type": "text",
					"fields": {
						"_lowercase": {
							"type": "text",
							"analyzer": "_lowercase"
						}
					},
					"analyzer": "english"
				}
			}
		}
	}
} }

Here is a piece of code showing how I index a document:

import java.net.InetAddress;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.transport.client.PreBuiltTransportClient;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

// Connect to the local node over the transport protocol (the port comes from our own config helper).
TransportClient transportClient = new PreBuiltTransportClient( Settings.builder()
          .put( "cluster.name", "my_cluster" )
          .put( "node.name", "my_node" ).build() )
          .addTransportAddresses( new InetSocketTransportAddress(
            InetAddress.getByName( "127.0.0.1" ),
            elasticConf().getPortNumber( 9300 ) ) );

// 'bytes' holds the file content; it goes into the "payload" field that the pipeline reads.
XContentBuilder xContentBuilder = jsonBuilder().startObject();
xContentBuilder.field( "payload", bytes );
IndexRequestBuilder requestBuilder = transportClient.prepareIndex( "payload_index", "my_type", id );
requestBuilder.setPipeline( "attachment" ); // route the document through the "attachment" ingest pipeline
requestBuilder.setSource( xContentBuilder.endObject() ).execute().actionGet();
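
For completeness, here is a quick sketch of fetching the document back through the same client to check what actually ended up in the index (GetResponse is org.elasticsearch.action.get.GetResponse):

// Sketch: fetch the document back and verify that "attachment.content" is present
// and that the "payload" field was removed by the pipeline.
GetResponse getResponse = transportClient.prepareGet( "payload_index", "my_type", id ).get();
System.out.println( getResponse.getSourceAsString() );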

localhost:9200/payload_index/my_type/_search

{
"took": 1,
"timed_out": false,
"_shards": {
	"total": 5,
	"successful": 5,
	"failed": 0
},
"hits": {
	"total": 1,
	"max_score": 1,
	"hits": [
		{
			"_index": "payload_index",
			"_type": "my_type",
			"_id": "1",
			"_score": 1,
			"_source": {
				"attachment": {
					"content_type": "text/plain; charset=ISO-8859-1",
					"language": "en",
					"content": "Test attachment content.",
					"content_length": 24
				}
			}
		}
	]
}  }

Actually, I don't need the BASE64 attachment itself.

Could you run the _ingest/pipeline/attachment/_simulate API with your BASE64 content that is failing?

And paste the result here or upload it to gist.github.com?

I put the whole BASE64-encoded file into 'payload', but it is too large to paste in full, so I truncated it here for display. "content_length": 402701

localhost:9200/_ingest/pipeline/attachment/_simulate

{   "docs" : [
      {   "_index": "payload_index",
          "_type": "my_type",
          "_id": "5",
          "_source": {
          "payload": "VW5kZXIgQ29uc3RydWN0aW9uOiAgVGhlIGJvb2sgeW914oCZcmUgcmVhZGluZyBpcyBzdGlsbCB1bmRlcmRldmVsb3BtZW50LiBBcyBwYXJ0IG9mIG91ciBCZXRhIGJvb2sgcHJvZ3JhbSwgd2XigJlyZSByZWxlYXNpbmd0aGlzIGNvcHkgd2VsbCBiZWZvcmUgYSBub3JtYWwgYm9vayB3b3VsZCBiZSByZWxlYXNlZC4gVGhhdHdheSB5b3XigJlyZSBhYmxlIHRvIGdldCB0aGlzIGNvbnRlbnQgYSBjb3VwbGUgb2YgbW9udGhzIGJlZm9yZWl04oCZcyBhdmFpbGFibGUgaW4gZmluaXNoZWQgZm9ybS"
      }
    }
  ]
}

Response: Status: 200 OK, Time: 1539 ms, Size: 409.06 KB

{
"docs": [
    {
        "doc": {
            "_type": "my_type",
            "_id": "5",
            "_index": "payload_index",
            "_source": {
                "attachment": {
                    "content_type": "text/plain; charset=UTF-8",
                    "language": "en",
                    "content": "Under Construction:  The book you’re reading is still underdevelopment. As part of our Beta book program, we’re releasingthis copy well before a normal book would be released. Thatway you’re able to get this content a couple of months beforeit’s available in finished form",
                    "content_length": 402701
                }
            },
            "_ingest": {
                "timestamp": "2018-04-18T15:56:43.736Z"
            }
        }
    }
]  }

But there are about 39000 characters highlighted in blue in 'content' and the rest are shown in black. Is it possible to index the whole text?

Is my mapping wrong? And how can I index big files in my situation, given that I don't need the BASE64 payload?

Everything looks good. Are you sure you defined the pipeline when indexing?

It sounds like you did, but I want to double-check, as I don't get the full picture here.

Yes, I set the pipeline when I build the index request, as in the code above (requestBuilder.setPipeline("attachment")).

Should I put the Base64-encoded bytes in the index request, like in the simulate call? And what is the maximum file size for indexing?

Yes. What did you put in there?
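
For the index request it should be something like this; just a sketch with java.util.Base64, assuming bytes holds the raw file content and reusing the builder from your code above (the attachment processor expects the source field to contain BASE64-encoded data):

// Sketch: encode the raw file bytes explicitly before putting them into "payload".
String base64Payload = java.util.Base64.getEncoder().encodeToString( bytes );
xContentBuilder.field( "payload", base64Payload );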

I put the raw bytes there, as they are.

But when I run simulate, I put the Base64-encoded content.

I have set http.max_content_length: 500mb
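
In elasticsearch.yml that is just this one line; a sketch (per the HTTP module docs linked at the top, this limits the maximum size of an HTTP request body):

# elasticsearch.yml
http.max_content_length: 500mb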

But there are about 39000 characters highlighted in blue in 'content' and the rest are shown in black. Is it possible to index the whole text? And what is the maximum file size for indexing?

But there are about 39000 characters highlighted in blue in 'content' and the rest are shown in black.

What does it mean?

When I ran the simulate command for the big file with content_length 402701, I got that.

I don't have any problem with small files, but big files raise the exception described above (the first post in this discussion).

I got that.

What do you get? Can you do a screenshot?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.