[Resolved] Requesting help for an initial indexing of over 30 Million Documents

Hey,

after a long time of testing and banging my head against the wall, I decided to ask here in the forum.

In our product we want to use Elasticsearch. Basically it has been implemented for a long time already (we started with version 0.90), but the product is not ready yet.
It's not a "new" product, but a re-engineered one, so we need to migrate.

Some facts before I go into detail about where my problem begins:

  • Elasticsearch version 1.7
  • Client language PHP
  • 1 index (currently)
  • 4 mappings
  • 10 million documents per mapping (currently), so 40 million in total
  • Every document is ~12 KB
  • Routing per custom field (over 5000 different routings in the end)
  • 2 mappings use parent/child
  • Daily growing index
  • Multiple daily updates on documents
  • Manual refresh after bulk indexing is disabled

Our mappings contain dynamic_templates, custom analyzers and raw fields.

The problem begins at the initial indexing phase.
We are migrating data from a MySQL database through a parallel worker mechanism. Basically we build chunks of 10-15 MB and send them to Elasticsearch.
We already tried 5, 10, 20 and 30 workers in parallel.
The chunk size was tested as well: 200, 500, 750, 1000 and 2000 documents.
Replicas are set to 0 while indexing.
The server runs on an SSD cache (next week "real" SSDs), the heap size was scaled from 16 to 32 GB, with a cluster (up to 4 nodes) and without; we have tested all of it.

That said, every "bulk job" currently sends to only ONE routing.
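Simplified, one of those bulk jobs looks roughly like this (index, type, field and variable names are placeholders, the real code builds the queue elsewhere):

    // Rough sketch of one bulk job: every document in the chunk shares one routing value.
    // "myindex", "document", $clientId and $docs are placeholders.
    $body = array();
    foreach ($docs as $id => $doc) {
        $body[] = array("index" => array(
            "_index"   => "myindex",
            "_type"    => "document",
            "_id"      => $id,
            "_routing" => $clientId   // all docs of this chunk end up on the same shard
        ));
        $body[] = $doc;
    }
    $this->elasticsearch->bulk(array("body" => $body));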

The problems are, well, hard to pin down. We have Marvel installed, we see the CPU go up and down, we see the heap go up and down and so on, but we don't know why.
**Indexing TEN million documents takes over 2 days.**

  • ES is rejecting bulks with "queue is full (50)"
  • ES starts throttling very fast (I guess about 30 minutes after the first bulks are sent)
  • Garbage collection kicks in at some point, and then nothing else gets done

I think many of these problems can be fixed with settings (I hope so).

Do you have some tips for me?

Best regards,
Dominik

Reduce your bulk size to ~5MB.
Also use different indices; putting all of those into a single one isn't ideal.
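If you don't want to count documents, you can flush on the encoded body size instead. A rough sketch (the threshold, index/type names and variables are just illustrative):

    // Rough sketch: flush the bulk queue whenever it reaches ~5 MB.
    $maxBytes = 5 * 1024 * 1024;
    $body     = array();
    $bytes    = 0;
    foreach ($docs as $id => $doc) {
        $body[]  = array("index" => array("_id" => $id));
        $body[]  = $doc;
        $bytes  += strlen(json_encode($doc));
        if ($bytes >= $maxBytes) {
            $client->bulk(array("index" => "myindex", "type" => "document", "body" => $body));
            $body  = array();
            $bytes = 0;
        }
    }
    if (!empty($body)) {   // flush the remainder
        $client->bulk(array("index" => "myindex", "type" => "document", "body" => $body));
    }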

Hey ,

thanks for the quick reply.

I will try this right away (okay, that's a lie... I need to wait until Monday, but then, I promise, I will try it right away!)

"different indices", my thought is now, that i will create a index for every client, that would make about 5000, thats too much, isnt it?

Thanks again!

Yes, that is a lot.
Put each document "type" into its own index, i.e. don't add all 4 different mappings into the same one.
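Conceptually something like this, one index per document "type" (index/type names and the $mappings variable here are made up):

    // Rough sketch: create one index per "type" instead of 4 types in a single index.
    foreach (array("type_a", "type_b", "type_c", "type_d") as $type) {
        $client->indices()->create(array(
            "index" => $type . "_index",   // e.g. "type_a_index"
            "body"  => array("mappings" => array($type => $mappings[$type]))
        ));
    }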

Doesn't this collide with the parent/child connections?

We will try some possibilities. Is there a "max" index size? (For example, Marvel creates an index every day, so that's 365 per year... it grows a lot too...?)

BTW: we already tested the bulk size with the 200-document chunks; that's where the MB usage per request is really low, but it hits the throttling & GC much faster that way, which is why we currently work with 1000 docs per bulk.

You need to put P/C in one index, yes.
But if the others aren't the same mapping then it makes sense to put them into their own indices, see Index vs. Type | Elastic Blog

Okay, let's say I've done this. That makes 20 million docs in one index, and 10 million each in another two (perhaps there's only one mapping without P/C).

That would at most cut the index size in half.
Since I can't index 10 million documents at the moment (because Elasticsearch says "NO"), do you have any other advice I can use as well?

We're running Elasticsearch with default settings at the moment. We tried setting the store throttle type to none (indices.store.throttle.type), but it's throttling all the time. So I think there is room for some optimization.
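What we tried, roughly (either in elasticsearch.yml or dynamically via the cluster settings API, as sketched here):

    curl -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient": { "indices.store.throttle.type": "none" }
    }'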

I know it's like looking into a black box, but what do you think would be a good server setup?

If you need any further details I will try to provide them, if I can :slight_smile:

You haven't even changed the heap size?! That's the first thing I would do.

We did, sorry that I didn't mention it in the last post! :slight_smile:

You should show your bulk indexing code. I suspect that executing bulks against just one shard per request may not be a good idea, because it does not distribute the load over the cluster.

I can't show you the bulk indexing code yet, because it contains sensitive client data. But we're doing it as described in

There are only "index" operations, so no deletions at this time.
We have also disabled the manual refresh after a bulk insert (also edited into the first post).

Yep, that's what sounds quite logical after a while. We will try bulk delivery to more than one shard next week, then we will see how that works out.

That said: since we're doing parallel indexing, there is the possibility that this already happens, yet the cluster still goes down.

I don't know how to scale yet. Do you have any further suggestions?

Agreed with removing one-shard routing as a first step, and also increasing the ES_HEAP_SIZE env var. 32G is the max you can normally use per node.

Are you launching multiple PHP scripts that coordinate the chunks?
If so, you might need to change some settings in elasticsearch.yml:

threadpool:
  bulk:
    type: fixed
    size: 100        # or whatever you can get away with depending on hardware; the default is smaller
    queue_size: 500  # or whatever you can get away with depending on hardware/memory; the default is 50
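To see whether the bulk queue is actually filling up, it helps to watch the bulk thread pool while indexing, for example via the cat API (the column selection here is just a suggestion):

    curl 'localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'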

Yep, agreed on the 32 GB; we're testing that already.

@gmack do you suggest maxing out the heap size?

We're using a PHP script that builds up a queue (per client, actually) and pushes it to Gearman (http://gearman.org/). Gearman then spreads it to the world, or in our case to our workers, which we can scale massively. There are 30 of them currently, but we could launch another 100, that's not a problem, if it would help.

What information would you need in order to say which size & queue_size would be good? I can get that information from the server tomorrow :slight_smile:

At that scale of reindexing, definitely max out (and lock) the memory allocation for the ES Java VM. If you want to do the work of capturing JVM memory pressure over time, you can bring that down and perhaps apply the savings to more nodes. But why bother, it's not like a million machines, right? :smile:
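In 1.x terms that means roughly the ES_HEAP_SIZE environment variable plus mlockall; keeping the heap a bit below 32 GB so compressed object pointers stay enabled is the usual advice:

    # environment (for example in /etc/default/elasticsearch or the init script)
    ES_HEAP_SIZE=31g

    # elasticsearch.yml
    bootstrap.mlockall: true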

We got those numbers during bulk index testing. We use a highly optimized, machine-code-compiled program, so it's easy to feed ES as much as it can eat, and to promptly overwhelm the threads and queues.

Those numbers apply to our scenario, but for yours, I say measure twice, cut once :wink:

G

Yes, that's why I asked for the code you wrote, not the data. It's quite interesting how you can saturate 32G of heap on multiple nodes with so little data.

@jprante

We send them all the time, one after another. As I said, with the current setup we fire, at a guess, 100-1000 bulk operations against one shard.
That's obviously where the heap size goes up, if you ask me.

Today I will try the following:

  • Switching from firing everything against one node to firing against multiple nodes (currently 4, since I have a cluster of 4); see the sketch after this list (suggested by, well, everyone?)
  • Maxing out the heap size (suggested by, well, everyone?)
  • Reducing the bulk size to around 5 MB (suggested by @warkolm)
  • Changing the threadpool settings in elasticsearch.yml (suggested by @gmack)
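For the first point the plan is roughly this (host names are placeholders); as far as I understand, the 1.x PHP client then round-robins the requests over the listed nodes:

    // Rough sketch: connect the client to all 4 nodes instead of a single one.
    $client = new Elasticsearch\Client(array(
        "hosts" => array("es-node1:9200", "es-node2:9200", "es-node3:9200", "es-node4:9200")
    ));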

I can't promise that I can do all of these things today, since resetting everything back to the start is work every time, but I will do my best.

In the meantime I will build a diagram that will hopefully show better information, but I don't know. I will also try to show the path to the bulk index, though I don't know how much benefit there is, since in the end it's a

    $request = array("body" => $this->queue);
    $this->elasticsearch->bulk($request);

I would be very grateful for any further optimizations.

Update:
Sadly, today I only tested one thing, but I will try the other recommendations in the next days!

  • Indexing without grouping by our _routing field (a.k.a. "do it all on one shard please"). It's currently running and seems much more stable. No warnings so far; it has been running since 3 PM GMT+1 :slight_smile: I'm "only" indexing 10M documents now, but this was a pain in the ass before, so I'm in a good mood that this will bring some good results.

A question that will then come up:
if the explicit routing was my "bottleneck", how much performance do I lose by dropping it? Since we don't have that many documents, I'm hoping it won't be that crucial. Does anybody have experience with that?

(Currently the routing is divided into roughly 5200 different routes.)

Okay,

so we take another step. My first mapping (the "parent") completed in roughly 4 hours, completely. That's something I can work with.

After indexing these 10 million documents we wanted to index their children. Per parent there are roughly 2 children. As before, we don't use the explicit routing (for the bulks), but randomly select some children and bulk them.
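For reference, the child bulks look roughly like this (index, type and field names are simplified); with no explicit routing, ES derives the routing from the _parent id:

    // Rough sketch of a child bulk: the parent id goes into the action metadata.
    $body = array();
    foreach ($children as $id => $child) {
        $body[] = array("index" => array(
            "_index"  => "myindex",
            "_type"   => "child_type",
            "_id"     => $id,
            "_parent" => $child["parent_id"]   // placeholder field holding the parent's _id
        ));
        $body[] = $child["doc"];
    }
    $this->elasticsearch->bulk(array("body" => $body));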

But here there seems to be a problem.

Where our cluster of 4 servers with a heap size of 16 GB each worked very smoothly for indexing the first mapping (no load above 2-3, no garbage collection, little throttling), there is a heavy impact when we start indexing the children. Every server's load goes up to over 6, the heap explodes, GC is heavily active, as is the throttling.

Received response for a request that has timed out, sent [104816ms] ago, timed out [57185ms] ago, action [internal:discovery/zen/fd/ping],

[gc][old][42682][82] duration [43.6s], collections [1]/[44.2s], total [43.6s]/[8.3m], memory [14.8gb]->[14.4gb]/[15.5gb], all_pools {[young] [2.5mb]->[115.3mb]/[399.4mb]}{[survivor] [11mb]->[0b]/[49.8mb]}{[old] [14.7gb]->[14.3gb]/[15.1gb]

Where is the difference? I know the children will be indexed on the same shard as their parent, but that's why we bulk them randomly now (which worked very well for the parents).

Does somebody have an idea here?

You should really upgrade to 2.X, there are a number of improvements around parent/child that may help.