[Resolved] Requesting help for an initial indexing of over 30 Million Documents

Hey,

after a long time of testing and banging my head against the wall, I decided to ask here in the forum.

In our product we want to use Elasticsearch. Basically it has been implemented for a long time already (we started with version 0.90), but the product is not ready yet.
It's not a "new" product, but a re-engineered one, so we need to migrate.

Some facts before I go into detail about where my problem begins:

  • Elasticsearch version 1.7
  • Client language PHP
  • 1 index (currently)
  • 4 mappings
  • 10 million documents per mapping (currently), so 40 million in total
  • Every document is ~12 KB
  • Routing per custom field (over 5000 different routings in the end)
  • 2 mappings use parent/child
  • Daily growing index
  • Multiple daily updates on documents
  • Manual refresh after bulk indexing is disabled

Our mappings contain dynamic_templates, custom analyzers and raw fields.

The problem begins at the initial indexing phase.
We are migrating data from a MySQL database through a parallel worker mechanism. Basically we build chunks of 10-15 MB and send them to Elasticsearch.
We already tried 5, 10, 20 and 30 workers in parallel.
The chunk size was tested as well: 200, 500, 750, 1000 and 2000 documents.
Replicas are set to 0 while indexing.
The server runs on an SSD cache (next week "real" SSDs), the heap size was scaled from 16 to 32 GB, with a cluster (up to 4 nodes) and without; we have tested all of it.

That said, every "bulk job" currently sends to only ONE routing.
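Simplified, one of those bulk jobs looks roughly like this (index, type, field and variable names are placeholders, the real code builds the queue elsewhere):

    // Rough sketch of one bulk job: every document in the chunk shares one routing value.
    // "myindex", "document", $clientId and $docs are placeholders.
    $body = array();
    foreach ($docs as $id => $doc) {
        $body[] = array("index" => array(
            "_index"   => "myindex",
            "_type"    => "document",
            "_id"      => $id,
            "_routing" => $clientId   // all docs of this chunk end up on the same shard
        ));
        $body[] = $doc;
    }
    $this->elasticsearch->bulk(array("body" => $body));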

The problems are, well, hard to pin down. We have Marvel installed, we see the CPU go up and down, we see the heap go up and down and so on, but we don't know why.
**Indexing TEN million documents takes over 2 days.**

  • ES is rejecting bulks with "queue is full (50)"
  • ES starts throttling very fast (I guess about 30 minutes after the first bulks are sent)
  • Garbage collection kicks in at some point, and then nothing else gets done

I think many of these problems can be fixed with settings (I hope so).

Do you have some tips for me?

Best regards,
Dominik

Reduce your bulk size to ~5MB.
Also use different indices; putting all of those into a single one isn't ideal.
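If you don't want to count documents, you can flush on the encoded body size instead. A rough sketch (the threshold, index/type names and variables are just illustrative):

    // Rough sketch: flush the bulk queue whenever it reaches ~5 MB.
    $maxBytes = 5 * 1024 * 1024;
    $body     = array();
    $bytes    = 0;
    foreach ($docs as $id => $doc) {
        $body[]  = array("index" => array("_id" => $id));
        $body[]  = $doc;
        $bytes  += strlen(json_encode($doc));
        if ($bytes >= $maxBytes) {
            $client->bulk(array("index" => "myindex", "type" => "document", "body" => $body));
            $body  = array();
            $bytes = 0;
        }
    }
    if (!empty($body)) {   // flush the remainder
        $client->bulk(array("index" => "myindex", "type" => "document", "body" => $body));
    }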

Hey ,

thanks for the quick reply.

I will try this right away (okay, that's a lie... I need to wait until Monday, but then, I promise, I will try it right away!)

"different indices", my thought is now, that i will create a index for every client, that would make about 5000, thats too much, isnt it?

Thanks again!

Yes, that is a lot.
Put each document "type" into its own index, i.e. don't add all 4 different mappings into the same one.
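Conceptually something like this, one index per document "type" (index/type names and the $mappings variable here are made up):

    // Rough sketch: create one index per "type" instead of 4 types in a single index.
    foreach (array("type_a", "type_b", "type_c", "type_d") as $type) {
        $client->indices()->create(array(
            "index" => $type . "_index",   // e.g. "type_a_index"
            "body"  => array("mappings" => array($type => $mappings[$type]))
        ));
    }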

Doesn't this collide with the parent/child connections?

We will try some possibilities. Is there a "max" index size? (For example, Marvel creates an index every day, so that's 365 per year... it grows a lot too...?)

BTW: we already tested the bulk size with the 200-document chunks; that's where the MB usage per request is really low, but it hits the throttling & GC much faster that way, which is why we currently work with 1000 docs per bulk.

You need to put P/C in one index, yes.
But if the others aren't the same mapping then it makes sense to put them into their own indices, see Index vs. Type | Elastic Blog

Okay, let's say I've done this. That makes 20 million docs in one index, and 10 million each in another two (perhaps there's only one mapping without P/C).

That would at most cut the index size in half.
Since I can't index 10 million documents at the moment (because Elasticsearch says "NO"), do you have any other advice I can use as well?

We're running Elasticsearch with default settings at the moment. We tried setting the store throttle type to none (indices.store.throttle.type), but it's throttling all the time. So I think there is room for some optimization.
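What we tried, roughly (either in elasticsearch.yml or dynamically via the cluster settings API, as sketched here):

    curl -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient": { "indices.store.throttle.type": "none" }
    }'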

I know it's like looking into a black box, but what do you think would be a good server setup?

If you need any further details I will try to provide them, if I can :slight_smile:

You haven't even changed the heap size?! That's the first thing I would do.

We did, sorry that I didn't mention it in the last post! :slight_smile:

You should show your bulk indexing code. I suspect that executing bulks against just one shard per request may not be a good idea, because it does not distribute the load over the cluster.

I can't show you the bulk indexing code yet, because it contains sensitive client data. But we're doing it as described in

There are only "index" operations, so no deletions at this time.
We have also disabled the manual refresh after a bulk insert (also edited into the first post).

Yep, that's what sounds quite logical after a while. We will try bulk delivery to more than one shard next week, then we will see how that works out.

That said: since we're doing parallel indexing, there is the possibility that this already happens, yet the cluster still goes down.

I don't know how to scale yet. Do you have any further suggestions?

Agreed with removing one-shard routing as a first step, and also increasing the ES_HEAP_SIZE env var. 32G is the max you can normally use per node.

Are you launching multiple PHP scripts that coordinate the chunks?
If so, you might need to change some settings in elasticsearch.yml:

threadpool:
  bulk:
    type: fixed
    size: 100        # or whatever you can get away with depending on hardware; the default is smaller
    queue_size: 500  # or whatever you can get away with depending on hardware/memory; the default is 50
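To see whether the bulk queue is actually filling up, it helps to watch the bulk thread pool while indexing, for example via the cat API (the column selection here is just a suggestion):

    curl 'localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'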

Yep, agreed on the 32 GB; we're testing that already.

@gmack do you suggest maxing out the heap size?

We're using a PHP script that builds up a queue (per client, actually) and pushes it to Gearman (http://gearman.org/). Gearman then spreads it to the world, or in our case to our workers, which we can scale massively. There are 30 of them currently, but we could launch another 100, that's not a problem, if it would help.

What information would you need in order to say which size & queue_size would be good? I can get that information from the server tomorrow :slight_smile:

At that scale of reindexing, definitely max out (and lock) the memory allocation for the ES Java VM. If you want to do the work of capturing JVM memory pressure over time, you can bring that down and perhaps apply the savings to more nodes. But why bother, it's not like a million machines, right? :smile:
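In 1.x terms that means roughly the ES_HEAP_SIZE environment variable plus mlockall; keeping the heap a bit below 32 GB so compressed object pointers stay enabled is the usual advice:

    # environment (for example in /etc/default/elasticsearch or the init script)
    ES_HEAP_SIZE=31g

    # elasticsearch.yml
    bootstrap.mlockall: true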

We got those numbers during bulk index testing. We use a highly optimized, machine-code-compiled program, so it's easy to feed ES as much as it can eat, and to promptly overwhelm the threads and queues.

Those numbers apply to our scenario, but for yours, I say measure twice, cut once :wink:

G

Yes, that's why I asked for the code you wrote, not the data. It's quite interesting how you can saturate 32G of heap on multiple nodes with so little data.

@jprante

We send them all the time, one after another. As I said, with the current setup we fire, at a guess, 100-1000 bulk operations against one shard.
That's obviously where the heap size goes up, if you ask me.

Today I will try the following:

  • Switching from firing everything against one node to firing against multiple nodes (currently 4, since I have a cluster of 4); see the sketch after this list (suggested by, well, everyone?)
  • Maxing out the heap size (suggested by, well, everyone?)
  • Reducing the bulk size to around 5 MB (suggested by @warkolm)
  • Changing the threadpool settings in elasticsearch.yml (suggested by @gmack)
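For the first point the plan is roughly this (host names are placeholders); as far as I understand, the 1.x PHP client then round-robins the requests over the listed nodes:

    // Rough sketch: connect the client to all 4 nodes instead of a single one.
    $client = new Elasticsearch\Client(array(
        "hosts" => array("es-node1:9200", "es-node2:9200", "es-node3:9200", "es-node4:9200")
    ));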

I can't promise that I can do all of these things today, since resetting everything back to the start is work every time, but I will do my best.

In the meantime I will build a diagram that will hopefully show better information, but I don't know. I will also try to show the path to the bulk index, though I don't know how much benefit there is, since in the end it's a

    $request = array("body" => $this->queue);
    $this->elasticsearch->bulk($request);

I would be very grateful for any further optimizations.

Update:
Sadly, today I only tested one thing, but I will try the other recommendations in the next days!

  • Indexing without grouping by our _routing field (a.k.a. "do it all on one shard please"). It's currently running and seems much more stable. No warnings so far; it has been running since 3 PM GMT+1 :slight_smile: I'm "only" indexing 10M documents now, but this was a pain in the ass before, so I'm in a good mood that this will bring some good results.

A question that will then come up:
if the explicit routing was my "bottleneck", how much performance do I lose by dropping it? Since we don't have that many documents, I'm hoping it won't be that crucial. Does anybody have experience with that?

(Currently the routing is divided into roughly 5200 different routes.)

Okay,

so we take another step. My first mapping (the "parent") completed in roughly 4 hours, completely. That's something I can work with.

After indexing these 10 million documents we wanted to index their children. Per parent there are roughly 2 children. As before, we don't use the explicit routing (for the bulks), but randomly select some children and bulk them.
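For reference, the child bulks look roughly like this (index, type and field names are simplified); with no explicit routing, ES derives the routing from the _parent id:

    // Rough sketch of a child bulk: the parent id goes into the action metadata.
    $body = array();
    foreach ($children as $id => $child) {
        $body[] = array("index" => array(
            "_index"  => "myindex",
            "_type"   => "child_type",
            "_id"     => $id,
            "_parent" => $child["parent_id"]   // placeholder field holding the parent's _id
        ));
        $body[] = $child["doc"];
    }
    $this->elasticsearch->bulk(array("body" => $body));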

But here there seems to be a problem.

Where our cluster of 4 servers with a heap size of 16 GB each worked very smoothly for indexing the first mapping (no load above 2-3, no garbage collection, little throttling), there is a heavy impact when we start indexing the children. Every server's load goes up to over 6, the heap explodes, GC is heavily active, as is the throttling.

Received response for a request that has timed out, sent [104816ms] ago, timed out [57185ms] ago, action [internal:discovery/zen/fd/ping],

[gc][old][42682][82] duration [43.6s], collections [1]/[44.2s], total [43.6s]/[8.3m], memory [14.8gb]->[14.4gb]/[15.5gb], all_pools {[young] [2.5mb]->[115.3mb]/[399.4mb]}{[survivor] [11mb]->[0b]/[49.8mb]}{[old] [14.7gb]->[14.3gb]/[15.1gb]

Where is the difference? I know the children will be indexed on the same shard as their parent, but that's why we bulk them randomly now (which worked very well for the parents).

Does somebody have an idea here?

You should really upgrade to 2.X, there are a number of improvements around parent/child that may help.