How to improve the data import speed


(Eric Yin Z) #1

Dear ELK developers,

I use Elasticsearch to store telecom-scale big data involving more than 3000 fields and more than 1000 related formulas (scripted fields).

I already use the bulk API to import the data, with no more than 100 MB per file, and I have set the memory configuration to 64 GB since my server has 128 GB.

I tried importing with multiple parallel processes (curl), but saw no further improvement. Is there any other way to import the data faster?


(David Pilato) #2

Memory does not play a big role at index time. Hard disk, number of primary shards, number of nodes are more important in that field.

Some thoughts:

100 MB per bulk request might be a lot.
No more than 30 GB of heap unless you really understand all the consequences of having more.

May I suggest you look at the following resources about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing
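
To illustrate the point about bulk size, here is a minimal sketch that splits one large bulk file into smaller request bodies. It assumes every action in the file is an `index` action, so lines strictly alternate between an action line and a source line. The function name `chunk_bulk_lines` and the size limit are hypothetical; a good limit has to be found by measuring.

```python
from typing import Iterable, Iterator, List


def chunk_bulk_lines(lines: Iterable[str], max_bytes: int) -> Iterator[str]:
    """Yield bulk request bodies of at most max_bytes each where possible.

    Assumption: the input alternates action line / source line, so a pair
    is never split across two bodies. A single pair larger than max_bytes
    is still emitted on its own.
    """
    batch: List[str] = []
    size = 0
    it = iter(lines)
    for action in it:
        source = next(it, None)
        if source is None:
            raise ValueError("action line without a matching source line")
        # Normalise line endings; the bulk format needs a newline per line.
        pair = action.rstrip("\n") + "\n" + source.rstrip("\n") + "\n"
        pair_size = len(pair.encode("utf-8"))
        if batch and size + pair_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(pair)
        size += pair_size
    if batch:
        yield "".join(batch)
```

Each yielded body can then be sent as its own bulk request, which makes it easy to try several request sizes and compare throughput.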


(Eric Yin Z) #3

Thanks for your reply. I just had a look at one of the videos. It looks like the idea is to use more shards to split the data. Previously I set the number of shards to 1; that means just a single process, right?

Now I have one physical server with 16 cores and 128 GB of RAM, and the data is around 200 GB per day. I want to keep at least 30 days of data for searching (6 TB in total). I have 8 TB of HDD space available.

Could you please give some suggestions on how to make full use of this server's resources?
How many nodes and how many shards should I configure? If I use 400 shards with a 32 GB heap for one instance, is that reasonable? (Would it make importing much faster, or is there any proven limitation?)


(David Pilato) #4

It looks like the idea is to use more shards to split the data. Previously I set the number of shards to 1; that means just a single process, right?

Correct.

Could you please give some suggestions on how to make full use of this server's resources?
How many nodes and how many shards should I configure?

That's exactly what the links I gave are about and explain. You need to try and find the best numbers for you.

If I use 400 shards with a 32 GB heap for one instance, is that reasonable?

Sounds like too much IMO. It might work, but you have to test that.
Actually I'd start small and increase. Measure. Then find the sweet spot for you.
A wild guess is that it might be something like 10 to 50 shards per node.

But nothing can replace your tests. :slight_smile:

Then, once you have found the ideal number and you're done with an index, it might be worth calling the shrink API to reduce the number of shards for searching.
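
One thing to keep in mind when picking the shard count: the shrink API requires the target number of primary shards to be a factor of the source number. The small sketch below lists the possible targets for a given source count; the function name `valid_shrink_targets` is just for illustration.

```python
from typing import List


def valid_shrink_targets(source_shards: int) -> List[int]:
    """Return the shard counts an index with source_shards primaries can shrink to.

    The target must divide the source count evenly and be smaller than it.
    """
    if source_shards < 1:
        raise ValueError("source_shards must be a positive integer")
    return [n for n in range(1, source_shards) if source_shards % n == 0]
```

For example, an index with 30 primary shards could be shrunk to 1, 2, 3, 5, 6, 10 or 15 shards, while a prime count such as 31 could only be shrunk to 1.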


(Christian Dahlqvist) #5

Is your data immutable or do you need to perform updates? Are you using time-based indices?


(Eric Yin Z) #6

I have just tested bulk import speed with the number of shards set to 30. It takes only around 10 s to import a 100 MB bulk JSON file with 2000 fields (previously it took 30 s). The import speed is faster than our previous SQL DB system.
It sounds like the shrink API will take more time if more shards are involved. I will test 400 shards later. Thank you very much, that is very kind of you.


(Eric Yin Z) #7

Thanks for your reply. The data is immutable, with no updates after import.

Currently, the data is like
{00:00, field A, field B ....}
{00:00, field B, field C, field D ....}
....
{00:15, field B, field C, field D ....}
...
{00:30, field A, field C, field D ....}
...

The scope is roughly as follows:

  1. There are more than 3000 fields, and the number of fields will keep growing dynamically. Currently I use dynamic mapping. The data itself will never be updated, so I assign a unique ID to each record to avoid importing duplicate data.

  2. I want to calculate formulas based on those fields and am trying to use Kibana scripted fields to do so. I am not sure how to import or export those hundreds of scripted fields (formulas) in Kibana, so if I use time-based indices, I would have to add the scripted fields to Kibana manually every time a new time-based index is created. I am not sure whether there is a better way to calculate the formulas.

  3. One month of data would occupy 6 TB. I am not sure whether there is a way to compress the historical data, or whether there are any tools to compute aggregates over a time period, such as hour-level or day-level aggregates. Transforming the historical data from 15-minute level to day level might save HDD space.
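
As a rough illustration of the day-level idea in item 3, the sketch below sums 15-minute records into one record per day and node before they are stored. It is only a sketch under assumptions: the field names `time` and `nodename`, the `YYYY-MM-DD HH:MM` timestamp format, and the function name `rollup_daily` are hypothetical, and plain summing is only correct for additive integer counters (not for maxima, averages or distributions).

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def rollup_daily(
    records: Iterable[dict],
    key_fields: Tuple[str, ...] = ("nodename",),
    time_field: str = "time",
) -> Dict[tuple, Dict[str, int]]:
    """Sum all counter fields per (day, *key_fields).

    Every field that is not the time field or a key field is treated as
    an additive integer counter (an assumption, see the note above).
    """
    totals: Dict[tuple, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for rec in records:
        day = rec[time_field][:10]  # "2018-02-19 10:15" -> "2018-02-19"
        key = (day,) + tuple(rec[f] for f in key_fields)
        for name, value in rec.items():
            if name == time_field or name in key_fields:
                continue
            totals[key][name] += int(value)
    # Convert nested defaultdicts to plain dicts for the caller.
    return {key: dict(counters) for key, counters in totals.items()}
```

The daily records could then go into a separate, much smaller index while the 15-minute indices are deleted once they pass the retention period.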


(Christian Dahlqvist) #8

How come you have such a large number of fields? Could you maybe model your data differently? Every time Elasticsearch encounters a new field and dynamically adds a mapping, the cluster state needs to be updated and propagated to all nodes in the cluster. This can slow down ingest considerably.

How come you need so many scripted fields? Can you describe your data in greater detail? Scripted fields add a lot of flexibility, but can slow down querying considerably. It may therefore be good to see if some of this processing can be done upfront before the data is indexed into Elasticsearch.

When you have immutable data with a defined retention period, it is generally recommended to use time-based indices as deleting complete indices is much more efficient than deleting documents individually.
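
A minimal sketch of what time-based indices with a retention period can look like on the client side. The prefix `pm-lte`, the daily `YYYY.MM.DD` suffix, and both function names are assumptions for illustration; the granularity (daily, weekly, monthly) should follow your data volume.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def daily_index(prefix: str, timestamp: datetime) -> str:
    """Return the name of the daily index a document with this timestamp goes to."""
    return f"{prefix}-{timestamp:%Y.%m.%d}"


def expired_indices(
    index_names: Iterable[str], prefix: str, today: date, retention_days: int
) -> List[str]:
    """Return the indices whose day is older than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix + "-"):
            continue
        try:
            day = datetime.strptime(name[len(prefix) + 1:], "%Y.%m.%d").date()
        except ValueError:
            continue  # not one of our daily indices
        if day < cutoff:
            expired.append(name)
    return expired
```

Everything returned by `expired_indices` can be dropped as whole indices, which is the cheap operation described above, instead of deleting documents one by one.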


(Eric Yin Z) #9

I wrote a script to transform our raw data into this JSON bulk format. Two records from a bulk file are included below as sample data. They may look a bit large.

One time period (18:15:00) maps to thousands of node names (e.g. ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1).

One node name maps to thousands of objects (XXXXX....msrbs_ManagedElement=ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1,msrbs_Transport=1,msrbs_SctpEndpoint=1,msrbs_SctpAssociation=36422-10.244.10.225).

One node name maps to several cell names.

Each record contains a varying number of counters. A counter is named like Class.counter. KPIs are calculated from counter values within the same class or across different classes.

A formula looks like
KPI_a = ClassA.counter1+ClassB.counter2/(ClassA.counter1+ClassB.counter2)
I want to calculate the result at different aggregation levels (time period level, node name level, cell name level, or counter/object level).

I am not sure whether there is a better way to optimize the model. Kindly advise. Thank you so much.
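
For per-record KPIs, one option is to compute the value in the transform script before indexing rather than in a scripted field. The sketch below evaluates the example formula exactly as written above, so by normal operator precedence only `ClassB.counter2` is divided by the sum; if a ratio of the whole numerator was intended, the parentheses need adjusting. The function name `kpi_a` is hypothetical, and counters are assumed to arrive as numeric strings as in the sample records.

```python
from typing import Optional


def kpi_a(doc: dict) -> Optional[float]:
    """Evaluate KPI_a = a + b / (a + b) for one record.

    Returns None when the denominator is zero so the field can simply be
    left out of the indexed document.
    """
    a = float(doc["ClassA.counter1"])
    b = float(doc["ClassB.counter2"])
    denominator = a + b
    if denominator == 0:
        return None
    return a + b / denominator
```

Note that a KPI precomputed per record is not automatically valid at coarser aggregation levels: for ratio-style KPIs the counters usually have to be summed first and the formula applied to the sums.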

{ "index":{ "_index": "eric_stlsg_lte","_type":"pm","_id":"2018-02-19-18:15:00_ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1_SubNetwork=ONRM_ROOT_MO_R,SubNetwork=ENB_CHANGI,MeContext=ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1,ManagedElement=ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1,Transport=1,SctpEndpoint=1,SctpAssociation=36422-10.244.10.225"}}

{"time":"2018-02-19 10:15","object":"SubNetwork=ONRM_ROOT_MO_R,SubNetwork=ENB_CHANGI,MeContext=ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1,msrbs_ManagedElement=ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1,msrbs_Transport=1,msrbs_SctpEndpoint=1,msrbs_SctpAssociation=36422-10.244.10.225","nodename":"ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1","nodetype":"MSRBS","nwid":"stlsg","ropfilename":"MSRBS_2018-02-19-18:15:00_ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1","SctpAssociation.pmSctpAssocInDataChunks":"0","SctpAssociation.pmSctpAssocOutDataChunks":"0","SctpAssociation.pmSctpAssocTimeUnavail":"0","SctpAssociation.pmSctpAssocCongestions":"0","SctpAssociation.pmSctpAssocInDiscardedDataChunks":"0","SctpAssociation.pmSctpAssocInAbnormalDataChunks":"0","SctpAssociation.pmSctpAssocOutControlChunks":"1500","SctpAssociation.pmSctpAssocInOctets":"138000","SctpAssociation.pmSctpAssocRtxChunks":"0","SctpAssociation.pmSctpAssocInDiscardedControlChunks":"0","SctpAssociation.pmSctpAssocOutDiscardedDataChunks":"0","SctpAssociation.pmSctpAssocOutDiscardedUserMsgs":"0","SctpAssociation.pmSctpAssocInPacks":"1500","SctpAssociation.pmSctpAssocInAbnormalControlChunks":"0","SctpAssociation.pmSctpAssocOutOctets":"138000","SctpAssociation.pm_count":"1","SctpAssociation.pmSctpAssocOutPacks":"1500","SctpAssociation.pmSctpAssocInControlChunks":"1500","SctpAssociation.pmSctpAssocAborteds":"0"}

{ "index":{ "_index": "eric_stlsg_lte","_type":"pm","_id":"2018-02-19-18:15:00_ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1_SubNetwork=ONRM_ROOT_MO_R,SubNetwork=ENB_CHANGI,MeContext=ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1,ManagedElement=ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1,ENodeBFunction=1,EUtranCellFDD=7339278"}}

{"time":"2018-02-19 10:15","object":"SubNetwork=ONRM_ROOT_MO_R,SubNetwork=ENB_CHANGI,MeContext=ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1,msrbs_ManagedElement=ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1,msrbs_ENodeBFunction=1,msrbs_EUtranCellFDD=7339278","nodename":"ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1","nodetype":"MSRBS","nwid":"stlsg","ropfilename":"MSRBS_2018-02-19-18:15:00_ENB_733927_SATS_Inflights_Catering_Centre_1_SICC1","cellname":"7339278","EUtranCellFDD.pmMimoSleepTime":"0","EUtranCellFDD.pmPucchCqiResLongUtilCell":"0","EUtranCellFDD.pmRadioUeRepCqiSubband1Sum":"2088","EUtranCellFDD.pmTtiBundlingUeMax":"0","EUtranCellFDD.pmRlcPollRetxDl":"305","EUtranCellFDD.pmUeCtxtRelCsfbCdma1xRtt":"0","EUtranCellFDD.pmPrbUtilUlDistr":[898,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"EUtranCellFDD.pmRlcPollRetxUl":"138","EUtranCellFDD.pmErabRelNormalEnbActArp":[0,0,0,3],"EUtranCellFDD.pmErabRelNormalEnbArp":[0,0,0,23,0,2],"EUtranCellFDD.pmRadioUeRepCqi2Subband1Sum":"2933","EUtranCellFDD.pmCellDownLockAuto":"0","EUtranCellFDD.pmRadioRecInterferencePwrPrb5":"4589506","EUtranCellFDD.pmRadioRecInterferencePwrPrb4":"3096742","EUtranCellFDD.pmRadioRecInterferencePwrPrb3":"1711150","EUtranCellFDD.pmRadioRecInterferencePwrPrb2":"4201523","EUtranCellFDD.pmRadioRecInterferencePwrPrb1":"7005382","EUtranCellFDD.pmRadioUeRepCqiSubband7Sum":"0","EUtranCellFDD.pmRadioRecInterferencePwrPrb9":"1649213","EUtranCellFDD.pmRadioRecInterferencePwrPrb8":"1392616","EUtranCellFDD.pmRadioRecInterferencePwrPrb7":"6562856","EUtranCellFDD.pmRadioRecInterferencePwrPrb6":"1546259","EUtranCellFDD.pmMacHarqDlNack16qam":"34","EUtranCellFDD.pmRrcConnReestSuccHo":"1","EUtranCellFDD.pmRrcConnEstabAttReattMos":"0","EUtranCellFDD.pmPucchCqiResMediumUtilCell":"0","EUtranCellFDD.pmPucchCqiResShortUtilCell":"1","EUtranCellFDD.pmPdcchCceAggregationDistr":[5162,1586,2944,122380],"EUtranCellFDD.pmLcgThpVolUlLcg":[0,37,784,0],"EUtranCellFDD.pmAnrNeighbrelAdd":"0","EUtranCellFDD.pmAdjustAccessDynLoadCtrlDistr":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"EUtranCellFDD.pmErabRelAbnormalEnbActTnFail":"0","EUtranCellFDD.pmFlexPdcpVolDlDrbLastTTIUe":[0,0,0],"EUtranCellFDD.pmRrcConnEstabAttReattMod":"0","EUtranCellFDD.pmPdcpPktDiscDlPelrUuQci":[0,3,0,0,0,0,0,0,0,107],"EUtranCellFDD.pmErabRelMmeActUserInactivity":"0","EUtranCellFDD.pmMacHarqDlNackQpsk":"1310","EUtranCellFDD.pmErabRelNormalEnbQci":[0,0,0,0,0,2,0,0,0,23],"EUtranCellFDD.pmCaCapableDlSum":[24,34,39,3,0],"EUtranCellFDD.pmUeThpTimeUl":"7327","EUtranCellFDD.pmRrcConnReconfSuccNoMobDlComp":"0","EUtranCellFDD.pmRrcConnEstabSuccMos":"20","EUtranCellFDD.pmUeCtxtRelAbnormalMmeAct":"0","EUtranCellFDD.pmPdcpBitrateUlDrbMax":"267","EUtranCellFDD.pmErabRelNormalEnbAct":"3","EUtranCellFDD.pmSchedActivityCellUl":"14454","EUtranCellFDD.pmMacHarqDlDtx16qam":"173","EUtranCellFDD.pmMeasRep":"524","EUtranCellFDD.pmPrbUsedDlSrbFirstTrans":"19059","EUtranCellFDD.pmRrcConnEstabSuccMod":"18","EUtranCellFDD.pmFlexErabEstabAttAddedGbr":[0,0,0]}


(Christian Dahlqvist) #10

As I do not know your data, it is hard for me to give any concrete advice. I can try to give some general advice though:

  1. Ensure that you do not have field names that contain a dynamic component, e.g. host or network name. (I can not see any examples of this in your samples).
  2. If you have different types of records that contain subsets of non-overlapping counters, try to send these to different indices as this will keep the size of the mapping per index down. In your example one document has a lot of SctpAssociation.* counters, which are not present in the second document, where most counters seem to start with EUtranCellFDD.*
  3. If you know all or a portion of the fields to expect for an index, create an index template with these in order to reduce the amount of dynamic mappings that need to be performed for new indices.
  4. Use time-based indices that suit your retention period.
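
Points 2 and 3 could be sketched roughly as follows: route each record to an index named after its counter class, and pre-declare the known counters in an index template. This is only a sketch under assumptions: each record is assumed to carry counters of exactly one class, the function names are hypothetical, the base name and the `pm` mapping type are taken from the sample above, mapping every counter as `long` is a guess, and the template body follows the legacy `_template` format with a mapping type, which depends on your Elasticsearch version.

```python
from typing import Iterable


def counter_class(doc: dict) -> str:
    """Return the single counter class (the part before the first dot) in a record."""
    classes = {key.split(".", 1)[0] for key in doc if "." in key}
    if len(classes) != 1:
        raise ValueError(f"expected exactly one counter class, found {sorted(classes)}")
    return classes.pop()


def target_index(base: str, doc: dict) -> str:
    """Build a per-class index name; index names must be lowercase."""
    return f"{base}_{counter_class(doc).lower()}"


def template_body(base: str, cls: str, counters: Iterable[str], doc_type: str = "pm") -> dict:
    """Build a template body that pre-maps the known counters of one class.

    Dotted field names such as "Class.counter" are declared as an object
    field "Class" with one sub-field per counter.
    """
    return {
        "index_patterns": [f"{base}_{cls.lower()}*"],
        "mappings": {
            doc_type: {
                "properties": {
                    cls: {"properties": {name: {"type": "long"} for name in counters}}
                }
            }
        },
    }
```

With known counters mapped up front, only genuinely new counters trigger a dynamic mapping update, and each per-class index keeps a much smaller mapping than one index holding all 3000 fields.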

(Eric Yin Z) #11

I found that Kibana supports index* patterns, which not only lets me query data from multiple indices (really good for splitting one big index into smaller ones) but also makes it really convenient to adjust important parameters such as the number of shards. Thank you so much; I will try to split the index based on your advice. So nice of you:blush:

