Indexing 350MB of file content with the ingest attachment pipeline: Elasticsearch.Net throws System.OutOfMemoryException

Hi there, hope you are all well.

Our setup: our program runs on the .NET Full Framework, using NEST 7.9 and Elasticsearch 7.9. We changed http.max_content_length to 1000mb in elasticsearch.yml and use the ingest attachment plugin to parse files. We also mapped attachment.content as a term vector with positions and offsets.
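Simplified, the setup looks roughly like the sketch below (DocumentPOCO matches the type name in the stack trace; the DocumentContent property name and the target field are illustrative, the real code differs in detail):

```csharp
using System;
using Nest;

// DocumentPOCO matches the type name in the stack trace below;
// the DocumentContent property name is an assumption for this sketch.
public class DocumentPOCO
{
    public string DocumentContent { get; set; }   // base64-encoded file bytes
}

public static class SearchSetup
{
    public static IElasticClient CreateClientAndIndex()
    {
        var settings = new ConnectionSettings(new Uri("http://localhost:9200")).DefaultIndex("ktr");
        var client = new ElasticClient(settings);

        // Ingest pipeline that runs the attachment processor over the base64 field.
        client.Ingest.PutPipeline("attachmentsV2", p => p
            .Processors(ps => ps
                .Attachment<DocumentPOCO>(a => a
                    .Field(f => f.DocumentContent)
                    .TargetField("attachment"))));

        // Map attachment.content with term vectors (positions + offsets).
        client.Indices.Create("ktr", c => c
            .Map<DocumentPOCO>(m => m
                .Properties(props => props
                    .Object<object>(o => o
                        .Name("attachment")
                        .Properties(ap => ap
                            .Text(t => t
                                .Name("content")
                                .TermVector(TermVectorOption.WithPositionsOffsets)))))));

        return client;
    }
}
```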

Our project uses Elasticsearch to index physical files such as .pdf, .docx, .csv ..., and some of them can be large. We are now at the stage of stress testing the system and found that when indexing more than 350MB, Elasticsearch.Net throws 'System.OutOfMemoryException'; the exception is pasted at the bottom.
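The indexing call is roughly the following (file path and field names are examples); the whole file becomes one base64 string in a single JSON body, which is what the client is serializing when it runs out of memory:

```csharp
// client and DocumentPOCO as in the setup sketch above.
var bytes = System.IO.File.ReadAllBytes(@"C:\temp\large-file.pdf");

var doc = new DocumentPOCO
{
    DocumentContent = Convert.ToBase64String(bytes)   // roughly 1.33x the file size
};

var response = client.Index(doc, i => i
    .Index("ktr")
    .Id(Guid.NewGuid().ToString())
    .Pipeline("attachmentsV2")
    .Refresh(Elasticsearch.Net.Refresh.WaitFor));
```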

We just want to know: have we hit any limitation of Elasticsearch in terms of the size we can index in one go?

Is there any alternative that you can recommend to index large physical files?

Thanks!
Kiet Tran

Error: -----------------

FailureReason: Unrecoverable/Unexpected BadRequest while attempting PUT on http://localhost:9200/ktr/_doc/40554962-4e0d-4b6f-91fb-f77a7a5f4fb0?pipeline=attachmentsV2&refresh=wait_for
 - [1] BadRequest: Node: http://localhost:9200/ Exception: OutOfMemoryException Took: 00:00:00.0419857
# Audit exception in step 1 BadRequest:
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Buffers.DefaultArrayPool`1.Rent(Int32 minimumLength)
   at Elasticsearch.Net.Utf8Json.Internal.BinaryUtil.EnsureCapacity(Byte[]& bytes, Int32 offset, Int32 appendLength) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Utf8Json\Internal\BinaryUtil.cs:line 76
   at Elasticsearch.Net.Utf8Json.JsonWriter.WriteString(String value) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Utf8Json\JsonWriter.cs:line 343
   at Serialize(Byte[][] , Object[] , JsonWriter& , DocumentPOCO , IJsonFormatterResolver )
   at Elasticsearch.Net.Utf8Json.Resolvers.DynamicMethodAnonymousFormatter`1.Serialize(JsonWriter& writer, T value, IJsonFormatterResolver formatterResolver) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Utf8Json\Resolvers\DynamicObjectResolver.cs:line 1708
   at Elasticsearch.Net.Utf8Json.JsonSerializer.SerializeUnsafe[T](T value, IJsonFormatterResolver resolver) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Utf8Json\JsonSerializer.cs:line 167
   at Elasticsearch.Net.Utf8Json.JsonSerializer.Serialize[T](Stream stream, T value, IJsonFormatterResolver resolver) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Utf8Json\JsonSerializer.cs:line 113
   at Nest.DefaultHighLevelSerializer.Serialize[T](T data, Stream writableStream, SerializationFormatting formatting) in C:\Users\russc\source\elasticsearch-net\src\Nest\CommonAbstractions\SerializationBehavior\DefaultHighLevelSerializer.cs:line 40
   at Elasticsearch.Net.DiagnosticsSerializerProxy.Serialize[T](T data, Stream stream, SerializationFormatting formatting) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Serialization\DiagnosticsSerializerProxy.cs:line 100
   at Nest.IndexDescriptor`1.Nest.IProxyRequest.WriteJson(IElasticsearchSerializer sourceSerializer, Stream stream, SerializationFormatting formatting) in C:\Users\russc\source\elasticsearch-net\src\Nest\Document\Single\Index\IndexRequest.cs:line 39
   at Nest.ProxyRequestFormatterBase`2.Serialize(JsonWriter& writer, TRequestInterface value, IJsonFormatterResolver formatterResolver) in C:\Users\russc\source\elasticsearch-net\src\Nest\CommonAbstractions\SerializationBehavior\JsonFormatters\ProxyRequestFormatterBase.cs:line 43
   at Elasticsearch.Net.Utf8Json.JsonSerializer.SerializeUnsafe[T](T value, IJsonFormatterResolver resolver) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Utf8Json\JsonSerializer.cs:line 167
   at Elasticsearch.Net.Utf8Json.JsonSerializer.Serialize[T](Stream stream, T value, IJsonFormatterResolver resolver) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Utf8Json\JsonSerializer.cs:line 113
   at Nest.DefaultHighLevelSerializer.Serialize[T](T data, Stream writableStream, SerializationFormatting formatting) in C:\Users\russc\source\elasticsearch-net\src\Nest\CommonAbstractions\SerializationBehavior\DefaultHighLevelSerializer.cs:line 40
   at Elasticsearch.Net.DiagnosticsSerializerProxy.Serialize[T](T data, Stream stream, SerializationFormatting formatting) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Serialization\DiagnosticsSerializerProxy.cs:line 100
   at Elasticsearch.Net.SerializableData`1.Write(Stream writableStream, IConnectionConfigurationValues settings) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Transport\SerializableData.cs:line 31
   at Elasticsearch.Net.HttpWebRequestConnection.Request[TResponse](RequestData requestData) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Connection\HttpWebRequestConnection.cs:line 52
   at Elasticsearch.Net.RequestPipeline.CallElasticsearch[TResponse](RequestData requestData) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Transport\Pipeline\RequestPipeline.cs:line 172
   at Elasticsearch.Net.Transport`1.Request[TResponse](HttpMethod method, String path, PostData data, IRequestParameters requestParameters) in C:\Users\russc\source\elasticsearch-net\src\Elasticsearch.Net\Transport\Transport.cs:line 77

I would not send big binary files to Elasticsearch (big base64 JSON); instead, I would extract the data locally and only send the extracted text to Elasticsearch.

That's what the FSCrawler project is doing.
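For example, something along these lines, reusing the NEST client from the snippets above (ExtractTextSomehow is just a placeholder for whatever extraction library you pick; FSCrawler uses Apache Tika for that step):

```csharp
// Sketch only: extract the text on the client side, then index plain text.
// ExtractTextSomehow() is a placeholder for your extraction library of choice.
string text = ExtractTextSomehow(@"C:\temp\large-file.pdf");

var doc = new
{
    fileName = "large-file.pdf",
    content = text          // plain text only: no base64, no ingest pipeline needed
};

var response = client.Index(doc, i => i.Index("ktr"));
```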

If you continue sending big binary files to Elasticsearch, you will probably need a lot of heap on the ingest nodes. I would, by the way, dedicate some nodes as ingest-only nodes and send the index requests to them, so that if such a node crashes, it won't take down a data node.
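For example, something like this in the elasticsearch.yml of the dedicated nodes (7.x-style role settings; adapt to your own topology):

```yaml
# elasticsearch.yml of a dedicated ingest-only node (7.x legacy role settings)
node.master: false
node.data: false
node.ingest: true
```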

Thanks Dadoonet for the quick reply. We are just trying to get this to work using NEST, Elasticsearch and the ingest attachment plugin.

An alternative we are thinking of is to create a temporary index that only stores the large document content as base64 in a separate documentContent field. To get the large base64 payload across the wire, we would send the content in chunks and, on the Elasticsearch side, call Update with a script (+=) to append each base64 chunk.
Once the full file is stored, we would call the Elasticsearch reindex API with the ingest pipeline to parse the document into the searchable index.
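Roughly, the flow we have in mind is sketched below (ktr_temp, documentContent and the chunk size are illustrative; attachmentsV2 is our existing pipeline, and client is the same NEST client as above):

```csharp
// Sketch only. In practice we would read and encode the file in pieces
// rather than loading it all at once; this just shows the idea.
string docId = Guid.NewGuid().ToString();
string base64 = Convert.ToBase64String(System.IO.File.ReadAllBytes(@"C:\temp\large-file.pdf"));
const int chunkSize = 5 * 1024 * 1024;   // ~5 MB of base64 per request

// Create the temp document with an empty field so the script has something to append to.
client.Index(new { documentContent = string.Empty }, i => i.Index("ktr_temp").Id(docId));

// 1) Append each chunk with a scripted update.
for (var offset = 0; offset < base64.Length; offset += chunkSize)
{
    var chunk = base64.Substring(offset, Math.Min(chunkSize, base64.Length - offset));

    client.Update<object>(docId, u => u
        .Index("ktr_temp")
        .Script(s => s
            .Source("ctx._source.documentContent += params.chunk")
            .Params(p => p.Add("chunk", chunk))));
}

// 2) Once the full file is stored, reindex through the ingest pipeline
//    into the searchable index.
client.ReindexOnServer(r => r
    .Source(s => s.Index("ktr_temp"))
    .Destination(d => d
        .Index("ktr")
        .Pipeline("attachmentsV2")));
```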

With this approach, are there any hidden costs that we are naively not aware of?

Thanks,

Kiet

That could work, but I would not do such a thing.

That will require rewriting a lot of big segments, as you are going to update (delete + insert) the same big (and ever bigger) document many times.
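As a rough back-of-the-envelope example: appending a 350 MB file in 10 MB chunks means about 35 scripted updates, and each one writes a complete new copy of the growing document (10 MB, then 20 MB, and so on up to 350 MB). That is on the order of 6 GB indexed just to get one 350 MB document in, before merges clean up all the deleted versions.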

Really, I'd do the extraction outside Elasticsearch, and I would not store the binary blob in Elasticsearch.

My 2 cents


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.