I have a set of documents that I'm trying to load into a single large index on a single node: about seven million documents, roughly 400 GB of text. The fields are in different files, so I'm using a script that loads each file, builds a bulk upsert, and submits it to the local DB. As the load proceeds, though, the database slows to a crawl and eventually exceeds my 10-second bulk update timeout. I can index just the first half of the docs quickly, or just the second half quickly, so I know there's nothing special about any particular docs in the DB.
I've tried the standard tricks: setting the refresh interval to -1, adjusting the number of shards (from 1 to 4 to 10), and disabling swap on the server, but none of them seem to have much effect.
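For reference, here's roughly what those settings calls look like in my load script (the index name "docs" and the localhost client are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create the index with an explicit shard count (I've tried 1, 4, and 10).
es.indices.create(
    index="docs",  # placeholder index name
    body={"settings": {"number_of_shards": 4}},
)

# Turn off automatic refreshes for the duration of the bulk load.
es.indices.put_settings(
    index="docs",
    body={"index": {"refresh_interval": "-1"}},
)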
I'm using a fast machine (8-core Xeon, 32 GB of RAM, 2x4 TB SSDs in RAID 0), but is this just too much data to expect ES to handle given how much RAM I have? Any other thoughts would be appreciated.
What is the structure and size of the documents? Are you using nested mappings? How are you updating the documents? Which version of Elasticsearch are you using?
The documents are mostly a couple of hundred kilobytes, though there are some that are up to 20MB. They look kinda like this:
{'field': 'big block of text',
 'other-field': 'bigger block of text',
 'list-o-data': [{'num': 1, 'text': 'medium block of text'}, {'num': 2, 'text': 'medium block of text'}]}
I'm running 7.12.1 on Ubuntu.
Each of the top-level fields (including the entire list) is added to the docs in the database using bulk updates with doc_as_upsert to preserve the other fields. I'm letting the low-level Python library's streaming_bulk helper batch the requests into bulk calls, all in one thread.
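Roughly, the loading loop looks like this (parse_field_file, the file path, and the index name are simplified placeholders):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")

def actions_for_file(path):
    # One partial-update action per document found in this field file.
    for doc_id, field_payload in parse_field_file(path):  # placeholder parser
        yield {
            "_op_type": "update",
            "_index": "docs",        # placeholder index name
            "_id": doc_id,
            "doc": field_payload,    # e.g. {"other-field": "bigger block of text"}
            "doc_as_upsert": True,   # create the doc if it doesn't exist yet
        }

# Single thread: stream the actions and let the helper batch them into bulk calls.
for ok, result in streaming_bulk(
    es,
    actions_for_file("fields/other-field.json"),  # placeholder path
    chunk_size=500,
    request_timeout=10,       # the 10-second timeout that eventually trips
    raise_on_error=False,
):
    if not ok:
        print("failed:", result)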
How many items can the list have? Is this mapped as nested? Are you upserting these one by one?
Each time you update a document, the whole document is reindexed. If you perform a large number of updates on each document, you will be reindexing the large content many times over, which can be slow.
Have you considered modelling this using parent-child instead of nested documents (if that is what you are using)?
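To illustrate what I mean (the index name, field names, and relation names below are just examples), a parent-child setup uses a join field, and each large block of text is indexed once as a child document routed to its parent rather than re-sent on every update:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Example mapping with a join field: "doc" parents and "section" children.
es.indices.create(
    index="docs-joined",  # example index name
    body={
        "mappings": {
            "properties": {
                "doc_relation": {
                    "type": "join",
                    "relations": {"doc": "section"},
                },
                "text": {"type": "text"},
            }
        }
    },
)

# The parent is indexed once and never touched again.
es.index(index="docs-joined", id="doc-1", body={"doc_relation": "doc"})

# Each large text block becomes a child routed to its parent, so adding
# another field later never reindexes the text that is already there.
es.index(
    index="docs-joined",
    id="doc-1-other-field",
    routing="doc-1",
    body={
        "text": "bigger block of text",
        "doc_relation": {"name": "section", "parent": "doc-1"},
    },
)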
Ok, that is better than I feared, but it still means all the text in these large documents is analyzed and indexed multiple times, likely getting slower as the documents grow.
I'm going to try to do all the fields as different docs and see if that fixes the performance issue before worrying about how to structure the joins. I'll let ya know!
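Concretely, the plan is something like this (parse_field_file, the index name, and the id scheme are placeholders), so each field is analyzed exactly once with a plain index action instead of an upsert:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")

def field_doc_actions(path, field_name):
    # Each field becomes its own small document instead of an update to a big one.
    for doc_id, field_payload in parse_field_file(path):  # placeholder parser
        yield {
            "_op_type": "index",
            "_index": "doc-fields",                 # placeholder index name
            "_id": f"{doc_id}--{field_name}",       # placeholder id scheme
            "_source": {"parent_id": doc_id, **field_payload},
        }

for ok, result in streaming_bulk(
    es,
    field_doc_actions("fields/other-field.json", "other-field"),  # placeholder path
    chunk_size=500,
    raise_on_error=False,
):
    if not ok:
        print("failed:", result)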