ES indexing times - ES v2.4.1

We are using a cluster with 5 nodes in total (3 master, 2 data).
We plan to index documents as large as 100 MB each.
We expect around 50K rpm (requests per minute) of roughly this size, and need to provide search capabilities on top of that.

A single document looks roughly like this:

    name: "abc",
    id: "1",
    value: "Some long text ... assume 2000 chars"
    ... 10K such elements ...

Assume one such document is roughly ~100 MB in size.
We plan to index millions of such documents.
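A rough, hypothetical way to sanity-check the stated document size: build a synthetic document of the shape sketched above (~10K elements, each with ~2000 characters of text) and measure its serialized size. The field names and filler text are illustrative, not the actual data.

```python
import json

# Hedged sketch: a synthetic document shaped like the one described above,
# with 10K elements of ~2000 characters each, to estimate the JSON payload size.
# The real documents are stated to be ~100 MB, so the real per-element text is
# presumably larger than this filler.
doc = {
    "elements": [
        {"name": "abc", "id": str(i), "value": "x" * 2000}
        for i in range(10_000)
    ]
}
size_mb = len(json.dumps(doc)) / (1024 * 1024)
print(f"~{size_mb:.1f} MB")
```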

Indexing such a document with a dynamic mapping takes around ~20 seconds with
shards = 50 and replicas = 0.
We have gzip compression enabled, and we would want to scale out to more replicas too.

For reference, our settings:
on the cluster we have "heap_max": "69gb", "heap_used": "21.3gb",
on the node we have "heap_committed": "17.4gb", "heap_used": "5gb",

Also swap is disabled and we use SSD.
Along with that, we have experimented with refresh_interval values as large as 3 seconds. This leads to improvements, but we can't go beyond 2 seconds.
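For concreteness, a minimal sketch of the index settings body implied by the numbers above (50 shards, 0 replicas, a relaxed refresh_interval), using the Elasticsearch index-settings key names. The exact values are the ones quoted in this thread, not recommendations.

```python
import json

# Hedged sketch of the index settings discussed above. During a pure bulk-load
# phase, refresh_interval can be raised further, or set to "-1" to disable
# refreshes entirely until loading finishes.
settings = {
    "settings": {
        "number_of_shards": 50,
        "number_of_replicas": 0,
        "refresh_interval": "3s",
    }
}
print(json.dumps(settings, indent=2))
```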

Are there any indexing numbers that can shed light on how much time it takes for ES to ingest such payloads?

Are there any recommended optimizations for such huge payloads?
We have already looked at -

Note: we plan to use nested documents so as to provide search capabilities. Those numbers are lower than the ones mentioned above; the above case should be the best case, as we are using the simplest possible mapping.

PS: currently on ES v2.4.1 (we are upgrading soon).

Elasticsearch is IMO not really designed to work with such large documents. The amount of time it takes to index a document depends on its size and the amount of work that needs to be done, so I would expect indexing to be very slow. You can also suffer heap pressure. Apart from indexing, you also need to consider the effect on the cluster when you try to retrieve these large documents during querying. I would recommend looking into denormalizing your data model rather than using such large nested documents. This is especially true if you plan on updating documents.

Retrieval should be easy and fast, as it only needs to search within a single document. We won't be doing cross-shard queries. I will come back with average numbers for this.

Updating the document is really rare: roughly 1 in 1,000 operations.
For frequent updates a parent-child relationship makes sense, but querying becomes too tricky in that case, since you have to try all possible combinations for multi-word queries and then define the relevance based on factors like min-match.
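To illustrate the kind of query being discussed, here is a hedged sketch of a has_child query scoring multi-word input against child documents, with min-match expressed via minimum_should_match. The type name "child" and the field name "value" are assumptions for illustration; score_mode controls how child scores roll up into the parent.

```python
import json

# Hedged sketch: match a multi-word query against the child documents and
# aggregate their scores into the parent's relevance. All names here
# ("child", "value") are hypothetical placeholders.
query = {
    "query": {
        "has_child": {
            "type": "child",
            "score_mode": "sum",  # sum the matching children's scores
            "query": {
                "match": {
                    "value": {
                        "query": "some multi word query",
                        "minimum_should_match": "75%"
                    }
                }
            }
        }
    }
}
print(json.dumps(query, indent=2))
```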

Retrieval will return the full document, or at least process it to extract the relevant part, so it would likely add quite some load given the estimated size of the documents.

That can be handled by excluding the value field from _source.
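A minimal sketch of what that exclusion looks like in a mapping, assuming a hypothetical type name "doc_type" (ES 2.x still used mapping types) and the "value" field from the structure above. The field remains indexed and searchable; it just is not stored in or returned from _source.

```python
import json

# Hedged sketch: exclude the large "value" field from _source so that
# fetches and search hits do not carry the heavy text payload.
mapping = {
    "mappings": {
        "doc_type": {  # hypothetical type name
            "_source": {
                "excludes": ["value"]
            }
        }
    }
}
print(json.dumps(mapping, indent=2))
```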

@Christian_Dahlqvist do you have any more ideas?

No, I would still discourage using very large nested documents and instead change the data model. If you want any additional feedback on how to denormalize, you will need to provide a lot more information about your actual data and what the high-level requirements are for querying.

I also tried that: a parent-child relationship instead of nested. I found the indexing time to be very similar.
In the parent-child setup, I break the elements array mentioned above into child documents and associate each with a parent ID.
All of this is sent in a single bulk call.
To avoid size issues, we have increased http.max_content_length to as large as 120 MB in elasticsearch.yml.
The indexing time is still as large as ~20 seconds.
I also have the data posted above; if you need any clarification on that, please let me know.
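The parent-child bulk payload described above could be assembled roughly as follows. This is a sketch under assumptions: the index/type names ("docs", "parent", "child") are illustrative, and in ES 2.x a child action carries a "_parent" field in its bulk metadata.

```python
import json

# Hedged sketch: build a newline-delimited bulk body with one parent
# document and its elements broken out as child documents.
def build_bulk_body(parent_id, parent_doc, elements):
    lines = []
    lines.append(json.dumps({"index": {"_index": "docs", "_type": "parent",
                                       "_id": parent_id}}))
    lines.append(json.dumps(parent_doc))
    for i, elem in enumerate(elements):
        meta = {"index": {"_index": "docs", "_type": "child",
                          "_id": f"{parent_id}-{i}", "_parent": parent_id}}
        lines.append(json.dumps(meta))
        lines.append(json.dumps(elem))
    # The bulk API expects NDJSON terminated by a trailing newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body("1", {"name": "abc"},
                       [{"value": "some long text"}] * 3)
print(body)
```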

Have you tried completely denormalizing your data and using a completely flat data model? That should allow you to send smaller bulk requests in parallel, which is likely to be less taxing on system resources and probably faster.

Is your cluster backed by local SSD storage? If not, this could also be a contributing factor to slow indexing, as indexing tends to be I/O intensive. I would also recommend you optimize your mappings and tune for indexing speed if you have not already.

When it comes to the data model, you should also look at querying performance, not just indexing speed. Retrieving these huge documents from disk to process and/or return them may be slow and use a lot of system resources. If you cannot find a better model, you may also need to scale your cluster up or out.

Yes. Even with the simplest mapping, one without any nesting or parent-child relationship, it takes this long, whether we let ES generate a dynamic mapping or provide the simplest mapping ourselves. The original question is about exactly that case, on the understanding that it should be the best possible one.

The other method you suggest is making smaller bulk requests in parallel, which means sending, let's say, 100 elements in one bulk call. That would mean sending 100 calls (chunks of 100 elements with a total element count of 10K, i.e. 10,000 / 100). The full document would still only be indexed after all 100 calls complete.
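The chunked-bulk approach above amounts to a simple batching loop. A minimal sketch, assuming 10K elements split into batches of 100, each batch destined for its own bulk request:

```python
# Hedged sketch: split the elements into fixed-size batches, one bulk
# request per batch, instead of one giant request for the whole document.
def chunk(elements, size):
    """Yield successive fixed-size batches from a list of elements."""
    for i in range(0, len(elements), size):
        yield elements[i:i + size]

elements = [{"value": f"text {i}"} for i in range(10_000)]
batches = list(chunk(elements, 100))
print(len(batches))  # 10,000 elements / 100 per batch = 100 bulk calls
```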

Yes, it's SSD, with swap disabled. The querying performance is fine, since we only need the id field and not the entire source at query time, and we have excluded the text field as well.

When denormalized, what is the average size of the documents? What do your mappings look like? What indexing throughput are you seeing? What does CPU usage look like?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.