Design woes!

I have 1600 types of logs (for example log-a, log-b, log-c), and I can parse them nicely with Grok patterns such as %{value:fieldName} in Logstash. I have no problem indexing the data into Elasticsearch via Logstash across a 20-node cluster made up of [HP-DL360] machines when there are no mappings, since all the fields get treated as strings. The problem comes when I want to add mappings. I need to map each field to the correct type: string, double, float, etc. So I made a mapping file, and it has 3600 unique field mappings, for example cpu-time=double, name=string, iowait=double. Applying a mapping template of this size to a single index causes chaos at index time: the 40 shards go into meltdown, flipping between unallocated and allocated, and nothing gets indexed. Does anyone have a rule of thumb for doing mapping templates efficiently, please?


I'm curious to see what is causing all the trouble with the template. You evidently ended up with a 3,600-field string mapping just fine by indexing all the fields as strings. I think you should try creating the typed 3,600-field mapping directly: use a script to spit out all the fields and apply the mapping manually. It ought to be fine...
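Here is a rough sketch of the kind of script I mean. It only builds the JSON body from a field=type list like the one in your mapping file; it does not talk to the cluster. The function name build_mapping and the "logs" document type are made up for illustration, and the exact body shape depends on your Elasticsearch version (older versions nest properties under a document type name and use "string", newer ones drop the type name and use "text"/"keyword"), so treat the output as a starting point to check against your version's docs.

```python
import json


def build_mapping(field_types, doc_type=None):
    """Build a mapping body from {field name: Elasticsearch type} pairs.

    doc_type is optional: pass it only if your Elasticsearch version
    expects properties nested under a document type name (assumption).
    """
    properties = {
        name: {"type": es_type}
        for name, es_type in sorted(field_types.items())
    }
    body = {"properties": properties}
    if doc_type is not None:
        body = {doc_type: body}
    return {"mappings": body}


if __name__ == "__main__":
    # Three fields from the question; the real input would be all 3,600.
    fields = {"cpu-time": "double", "name": "string", "iowait": "double"}
    print(json.dumps(build_mapping(fields, doc_type="logs"), indent=2))
```

You could then send the printed JSON when creating the index, or diff it against what the template would have produced.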

Once you get that working you may run into issues around data sparsity. The trouble is that, the way doc values are encoded, it is inefficient for some documents in a Lucene segment to have a field and not others. It is one of those insidious things that shows up as lots of disk usage, but only after you have enough data that it is a huge pain to redo the layout.
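To get a feel for how sparse a single index would be before you commit to it, something like the following might help. It is only a back-of-the-envelope sketch: field_fill_ratios is a hypothetical helper, and the per-log-type field lists and document counts are inputs you would have to supply yourself. It estimates, for each field, the fraction of documents in the combined index that would actually carry it.

```python
def field_fill_ratios(fields_by_log_type, docs_by_log_type):
    """Return {field: fraction of all documents that contain the field}.

    fields_by_log_type: {log type: iterable of field names}
    docs_by_log_type:   {log type: number of documents of that type}
    """
    total_docs = sum(docs_by_log_type.values())
    if total_docs == 0:
        return {}
    counts = {}
    for log_type, fields in fields_by_log_type.items():
        n = docs_by_log_type.get(log_type, 0)
        for field in set(fields):
            counts[field] = counts.get(field, 0) + n
    return {field: count / total_docs for field, count in counts.items()}


if __name__ == "__main__":
    # Made-up example data, not taken from a real cluster.
    fields = {"log-a": ["name", "cpu-time"], "log-b": ["name", "iowait"]}
    docs = {"log-a": 100, "log-b": 300}
    for field, ratio in sorted(field_fill_ratios(fields, docs).items()):
        print(field, ratio)
```

Fields with a ratio close to 0 are the ones that are present in only a few documents, which is the sparse case described above.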

So may I suggest grouping your types of logs into 3 or 5 or 10 groups that have similar attributes and giving each group its own index? More indexes means more overhead in general, but if you can dodge the sparse-fields issue it might be worth it.
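One possible way to come up with the groups, sketched under my own assumptions: treat each log type as a set of field names and greedily merge types whose field sets overlap enough, measured by Jaccard similarity. The names jaccard and group_log_types and the 0.5 threshold are illustrative only. This is a simple greedy pass, not a proper clustering algorithm, and it does not cap the number of groups, so you would tune the threshold until you land somewhere around the 3 to 10 groups mentioned above.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_log_types(fields_by_log_type, threshold=0.5):
    """Greedily group log types whose field sets are similar.

    Each log type joins the existing group it is most similar to, if that
    similarity is at least `threshold`; otherwise it starts a new group.
    Returns a list of groups, each a list of log type names.
    """
    groups = []  # each entry: {"fields": set of fields, "members": [log types]}
    for log_type in sorted(fields_by_log_type):
        fields = set(fields_by_log_type[log_type])
        best, best_score = None, threshold
        for group in groups:
            score = jaccard(fields, group["fields"])
            if score >= best_score:
                best, best_score = group, score
        if best is None:
            groups.append({"fields": fields, "members": [log_type]})
        else:
            best["fields"] |= fields
            best["members"].append(log_type)
    return [group["members"] for group in groups]


if __name__ == "__main__":
    # Made-up field lists for three log types.
    example = {
        "log-a": ["cpu-time", "iowait", "name"],
        "log-b": ["cpu-time", "iowait", "host"],
        "log-c": ["url", "status"],
    }
    print(group_log_types(example))
```

Each resulting group would then get its own index and its own, much smaller, mapping.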