Slow indexing speed

Hi,

I'm doing bulk inserts into Elasticsearch with the C++ client.
Each bulk contains 30,000 documents.
Each document has 40 numeric fields.
The indexing duration is about 13 seconds.
The data in each bulk is targeted at one shard in one index.

Why is it taking so long?

The link below is always a good reference, but do you see high CPU or IO? https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html

Basic questions: how big is the cluster, what version, and does it have enough RAM/CPU? Is all data going to one index, and with how many replicas? Is anything else going on in the cluster?

What is your index refresh interval? It defaults to 1s, but ideally set it longer, e.g. 30s.
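For example, with the Python client (the index name is a placeholder; the same setting can be changed from any client or via plain REST):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Relax the refresh interval while bulk loading...
es.indices.put_settings(
    index="my-index",
    body={"index": {"refresh_interval": "30s"}},
)

# ...and restore the 1s default once the load is done.
es.indices.put_settings(
    index="my-index",
    body={"index": {"refresh_interval": "1s"}},
)
```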

People often run bulk loads in parallel, if they can, to make use of multiple CPUs. I think (but do not know) that a single bulk insert is processed single-threaded through the system. However, if you run many big bulk requests at once, you can run the system out of memory, so you have to test (around 10MB seems to be the recommended batch size).
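As a rough sketch of parallel bulk loading using the Python helpers (index and field names are invented; a C++ client would need its own chunking and threading, but the server sees the same thing):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")

def actions(docs):
    for doc in docs:
        yield {
            "_index": "my-index",
            "_type": "_doc",  # required on 6.x; drop on 7+
            "_source": doc,
        }

# 30,000 fake documents with 40 numeric fields each.
docs = ({"field_%d" % i: float(i) for i in range(40)} for _ in range(30000))

# parallel_bulk chunks the stream and sends requests from a thread pool;
# tune thread_count / chunk_size / max_chunk_bytes by testing.
for ok, info in parallel_bulk(
    es,
    actions(docs),
    thread_count=4,
    chunk_size=1000,
    max_chunk_bytes=10 * 1024 * 1024,  # ~10MB per request
):
    if not ok:
        print(info)
```

Start with a low thread_count and watch heap usage and bulk rejections as you raise it.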

Currently I have 25 shards.
The cluster's data size is 5GB.
Elasticsearch 6.x.
0 replicas.
One node.
I have enough RAM/CPU.
Data is inserted into one index only.
Nothing else is going on in the cluster.
The index refresh interval is 30 sec.
No parallel bulk loads.
My bulk size is already less than 10MB.

I will test the insert with a different bulk size.

Thanks,
Liron

How much RAM and CPU does the node have? Are you using local SSD as storage?

Why do you have 25 shards on a one-node system? That seems way too much. Why not one shard, up to 50GB? How much RAM and Java heap? Try rebuilding the index with 1 shard and run the bulk again, at least as a test. Also, move to version 7 when you can; not much difference, but it's good to be on new versions for new systems.
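As a sketch of that single-shard test via the reindex API, with the Python client for brevity ("metrics-v1"/"metrics-v2" are made-up index names):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# New index with one shard (and no replicas on a one-node cluster).
es.indices.create(
    index="metrics-v2",
    body={"settings": {"number_of_shards": 1, "number_of_replicas": 0}},
)

# Copy the existing data across, then point the bulk load at metrics-v2.
es.reindex(
    body={"source": {"index": "metrics-v1"}, "dest": {"index": "metrics-v2"}},
    wait_for_completion=True,
)
```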

Because I want to scale.
Right now I have 5GB of data, but I need the system to be able to handle TBs of data.

4 cores / 32GB RAM.
The Java heap is the default, 8GB.

I will try it tomorrow


Yes, I'm using SSD as storage.
4 cores / 32GB RAM.

Monitor GC and CPU usage, as well as disk I/O and iowait, and see if you can spot a resource bottleneck. Use Elasticsearch monitoring, top and iostat (or other similar tools).
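If it helps, here is a small sketch of polling the Elasticsearch-side numbers with the Python client (top/iostat cover the OS-level view):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# JVM, OS and filesystem stats per node.
stats = es.nodes.stats(metric="jvm,os,fs")
for node_id, node in stats["nodes"].items():
    print(
        node["name"],
        "heap_used_percent:", node["jvm"]["mem"]["heap_used_percent"],
        "old_gc_count:", node["jvm"]["gc"]["collectors"]["old"]["collection_count"],
    )
```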

Thanks, I’ll try it tomorrow

Also raise your heap to 16GB, about 50% of VM RAM, up to about 32GB. This may make a huge difference depending on all the other metadata, cluster state, etc. in there; it's easy to end up with no buffer space. But with one node right now and testing, I'd use a single shard to get a sense of things; then, when you add lots of nodes, you can reindex and raise your shard count.
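For reference, on 6.x the heap is set in config/jvm.options, and the minimum and maximum should match, e.g.:

```
-Xms16g
-Xmx16g
```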

I have no idea how a 25-shard index performs on one node with a limited heap, but I'd guess poorly.

Thanks a lot!!
You're right; I also had this problem when I tried to retrieve millions of records.
Does only the coordinating node need to be configured with that heap size?
Do the data nodes need a high heap size too?

It depends a lot on what you are doing, data size, etc. Generally, 50% of RAM up to a 32GB heap is the first guideline, but beyond that it depends on data size, how many indexes and shards, etc. Data nodes need RAM to manage the data, coordinators need RAM to manage results, etc. All I can suggest is to start with something medium, at least an 8GB heap on a 16GB VM, and run monitoring to watch heap, GCs, any circuit breakers, etc. Happy to give you a free license to our ELKman tool as well, to help manage and tune.

Each document has 40 fields, and I need to be able to fetch 50 million records (for now).
One index, 2 nodes; I configured the index with 3 shards and one replica (just for now).

Do you think I will be able to do that with an 8GB heap (32GB of RAM) on the coordinating node (with the same heap size configured on the second node)?

thanks a lot!!!

As others have mentioned in your other (numerous) threads, this is a very unusual use case for Elasticsearch; most people want a few records, not millions, so there is really no way to know unless you test it. Load up a cluster and query it is about all I can say. At least you only have numeric metric fields, which is good. Also be sure you fetch only the fields you want, i.e. don't pull back _all or the whole _source, etc., or pull aggregates instead.
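As an illustration of fetching only the needed fields from a large result set, here is a sketch using the Python client's scan (scroll) helper; the index and field names are placeholders:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")

# Stream a big result set, returning only two fields per document.
hits = scan(
    es,
    index="my-index",
    query={
        "query": {"match_all": {}},
        "_source": ["field_x", "field_y"],  # only the columns you plot
    },
)
points = ((h["_source"]["field_x"], h["_source"]["field_y"]) for h in hits)
```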

For shards, you generally want total shards <= node count, so for two nodes that means one shard and one replica. BUT there are exceptions: don't let a shard get bigger than 50GB or so (maybe 75GB for metrics only), and MAYBE shard queries are multi-threaded, but I don't think so on a single node; not sure.

Regardless, if you have a 32GB node, use at least 16GB of heap. Since you have an unusual use case, maybe even more if you run into heap issues, but all you can do is load in 50-100M records and query them.

Also, why are you doing this? What will you do with the 50M records you get, and why not do 'that' thing inside ES, such as with aggregations? Or split your feed, so whatever feeds ES can also feed your other target platform or use, or process it with Spark, etc.?


Thanks a lot!

There is no support for scatter graphs in Kibana, and it's important to analyze the data with that kind of chart.
There are no aggregations in a scatter graph.
I need millions of dots to be able to see a shape in the scatter graph.
(screenshot: scatter graph, generated with Spark)

The decision to use Elasticsearch was made because of the need to manage big data.
Does Spark have the same search performance as Elasticsearch?

The next question, I suppose, is: what are you doing with this graph? What is it used for? What meaningful information do you want to extract from this visualization?

And every graph of any sizable data has to aggregate, so do that aggregation in Elasticsearch to get a base set of data you can pull into Grafana or even Excel, etc. The 'big data' part is storing the data (though 50M records you could do in MySQL, too) and doing some level of analysis on it, often with time or other buckets. At least let ES aggregate your 50M down to 50-50K using your Mass K1/K2/K3 buckets. I'm not a query expert, but it seems doable, and you probably don't care how long that takes.
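As a sketch of that kind of server-side bucketing (the field names are made up, since I don't know what the K1/K2/K3 buckets actually map to):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Terms buckets with a per-bucket metric: Elasticsearch returns a few
# thousand rows instead of 50M raw documents.
resp = es.search(
    index="my-index",
    body={
        "size": 0,  # no raw hits, just the aggregation results
        "aggs": {
            "by_k1": {
                "terms": {"field": "k1", "size": 1000},
                "aggs": {"avg_x": {"avg": {"field": "field_x"}}},
            }
        },
    },
)
for b in resp["aggregations"]["by_k1"]["buckets"]:
    print(b["key"], b["doc_count"], b["avg_x"]["value"])
```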

I can't say what it is used for, but I can say that I learn a lot of things just from looking at the shapes that are created by the graph :slight_smile:

In Excel I only need to pick two columns, without any aggregations, and a scatter graph is created after a while. The graph needs to show me just the dots of the picked rows; it is a finite number, so there is no need for aggregation at all. If you meant that I only need to retrieve two columns to get this result, you're right, but I have many more use cases for that data (I forgot to mention it). When picking Elasticsearch, I didn't care about how long it takes, but I did care about getting better search performance than Oracle.
It's true that I thought I could build a scatter graph in Kibana, but it doesn't matter; I will find another solution for that.

Can Excel handle a scatter graph with over 20 million data points?

As you are looking for a scatter graph, I guess you should be able to calculate the values for the two axes per document. Once you have this, you should be able to show document density using e.g. a heat map, which can be based on aggregations and will allow you to zoom into areas to view them in greater detail. This should also perform and scale much better than extracting all the documents.
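One possible way to get those density counts out of aggregations is a histogram on one axis with a nested histogram on the other, so Elasticsearch returns a doc count per (x, y) cell instead of millions of raw points. A minimal sketch, with made-up field names and intervals:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="my-index",
    body={
        "size": 0,
        "aggs": {
            "x": {
                "histogram": {"field": "field_x", "interval": 10},
                "aggs": {
                    "y": {"histogram": {"field": "field_y", "interval": 10}}
                },
            }
        },
    },
)

# One (x, y, count) triple per heat-map cell.
cells = [
    (xb["key"], yb["key"], yb["doc_count"])
    for xb in resp["aggregations"]["x"]["buckets"]
    for yb in xb["y"]["buckets"]
]
```

Zooming in would then just be a range filter on the query plus smaller intervals.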


I don't know yet.
I've tried it with about 1 million dots; it was slow.
But I thought Kibana could handle it. The picture above was generated with Spark. Can Spark handle that? Is it possible to build a graph with 20 million dots (with high density) on any platform?

About the heat map, I will try it.

thanks