Single node, large database index performance

I have a set of documents that I'm trying to load into a single large index on a single node: about seven million documents, roughly 400 GB of text. The fields are split across different files, so I'm using a script that loads each file, builds a bulk upsert, and submits it to the local cluster. As the load proceeds, though, indexing slows to a crawl and eventually exceeds my 10-second bulk update timeouts. I can index just the first half of the docs quickly, or just the second half quickly, so I know there's nothing special about particular docs.

I've tried the standard tricks (setting the refresh interval to -1, adjusting the number of shards from 1 to 4 to 10, disabling swap on the server), and none of them seems to have much effect.
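
For reference, the settings change is just the usual put_settings call through the Python client, something like this sketch (the index name is a placeholder):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# turn off refresh while the bulk load runs; "my-index" stands in for the real index name
es.indices.put_settings(
    index="my-index",
    body={"index": {"refresh_interval": "-1"}},
)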

I'm using a fast machine (8-core Xeon, 32gb of RAM, 2x4T SSDs in RAID 0), but is this just too much data to expect ES to work with given how much RAM I have? Any other thoughts would be appreciated.

What is the structure and size of the documents? Are you using nested mappings? How are you updating the documents? Which version of Elasticsearch are you using?

The documents are mostly a couple of hundred kilobytes, though there are some that are up to 20MB. They look kinda like this:

{'field':'big block of text',
 'other-field':'bigger block of text',
 'list-o-data': [{'num':1, 'text':'medium block of text'}, {'num':2, 'text':'medium block of text'}]}

I'm running 7.12.1 on Ubuntu.

Each of the top-level fields (including the entire list) is added to the docs using bulk updates with doc_as_upsert, so the fields that are already there are preserved. I'm letting the low-level Python client's streaming_bulk helper batch the requests into bulk calls, all in one thread.
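
Sketching it out, each per-field pass looks roughly like this (index and field names are placeholders; the real script reads the values out of the files):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")

# toy stand-in for reading one field's values out of its file
records = [("1", "big block of text"), ("2", "bigger block of text")]

def actions():
    for doc_id, value in records:
        yield {
            "_op_type": "update",
            "_index": "my-index",      # placeholder index name
            "_id": doc_id,
            "doc": {"field": value},   # one top-level field per pass
            "doc_as_upsert": True,     # create the doc if it doesn't exist yet
        }

# streaming_bulk batches the actions into bulk requests; request_timeout matches my 10s limit
for ok, info in streaming_bulk(es, actions(), request_timeout=10):
    if not ok:
        print(info)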

7 million records is nothing. Like Christian said, it might be some other issue that you have to figure out, and this forum can definitely help you.

I uploaded 6 million records in 45 minutes, pulling from a database and inserting with the Python bulk helper, 3,000 records per bulk request.

How many items can the list have? Is this mapped as nested? Are you upserting these one by one?

Each time you update a document the whole document is reindexed. If you perform a large number of updates of each document you will be reindexing the large content many times over, which may be slow.

Have you considered modelling this using parent-child instead of nested documents (if that is used)?
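
Roughly, that would mean a join field, with each list item indexed as a child of the main doc. A sketch, with field names borrowed from your example and everything else a placeholder:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# parent holds the big top-level fields; each list entry becomes a child document
mapping = {
    "mappings": {
        "properties": {
            "field": {"type": "text"},
            "other-field": {"type": "text"},
            "num": {"type": "integer"},
            "text": {"type": "text"},
            # "doc" is the parent relation, "item" the child; both names are made up
            "doc_item": {"type": "join", "relations": {"doc": "item"}},
        }
    }
}
es.indices.create(index="my-index-pc", body=mapping)

# the parent is indexed once with its own fields
es.index(index="my-index-pc", id="1",
         body={"field": "big block of text", "doc_item": "doc"})

# each child carries one list entry; routing must be the parent id so they land on the same shard
es.index(index="my-index-pc", id="1-item-1", routing="1",
         body={"num": 1, "text": "medium block of text",
               "doc_item": {"name": "item", "parent": "1"}})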

I'm only performing about four or five updates total on each document. I batch up the entire list and upsert that at once.

Ok, that is better than I feared, but it still means all the text in these large documents is analyzed and indexed multiple times, likely getting slower as the documents grow.

If I use parent-child documents, can I still perform queries like (first-field:foo AND second-field:bar) if they're in separate documents?
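
I'm picturing something like the sketch below, with first-field on the parent and second-field on a child of type 'item' (guessing at the has_child syntax here, so this may be off):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# should return parents where first-field matches AND at least one child matches second-field
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"first-field": "foo"}},
                {"has_child": {
                    "type": "item",
                    "query": {"match": {"second-field": "bar"}},
                }},
            ]
        }
    }
}
hits = es.search(index="my-index-pc", body=query)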

I'm going to try to do all the fields as different docs and see if that fixes the performance issue before worrying about how to structure the joins. I'll let ya know!
