Recommendations on "schema" for an ElasticSearch project


(Mark Hanford) #1

I've built an ELK cluster for an internal proof-of-concept project. It covers log analysis (which is straightforward enough in an ELK context) and also a data-mining exercise, for which I'm just using ElasticSearch (and possibly Kibana as the front-end).

I'm hoping that if I describe the scenario here, I'll get some tips'n'tricks and guidance to save me from some nightmares later when I have ALL the data in there and can't easily re-index it all...

I'm storing a few million contact and company records in ElasticSearch, and will later need to extract subsets of them based on various criteria. A typical query will be of the form:

Show me all contacts that have a job-level of "CEO" at a company with a countryCode of "UK", revenue > £1m and industryCodes of "healthcare" or "finance" but not "marketing".
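To make that concrete, here is a rough sketch of how such a request might look as an ElasticSearch bool query, written as a Python dict that serializes to the JSON request body. The field names (`jobLevel`, `company.countryCode`, `company.revenue`, `company.industryCodes`) are assumptions based on the description above, not the real mapping. It also assumes the company fields are available on each contact document, that the code fields are indexed as exact (not analyzed) values, that revenue is stored as a plain number in pounds, and that the cluster is on a version where `bool` accepts a `filter` clause (2.0 or later).

```python
import json


def build_contact_query():
    """Build the example query as a plain dict (the JSON request body).

    All field names here are hypothetical placeholders for the real mapping.
    """
    return {
        "query": {
            "bool": {
                # Every filter clause must match; filters do not affect scoring.
                "filter": [
                    {"term": {"jobLevel": "CEO"}},
                    {"term": {"company.countryCode": "UK"}},
                    # revenue > 1,000,000 (assumed to be stored in pounds)
                    {"range": {"company.revenue": {"gt": 1000000}}},
                    # "healthcare" OR "finance"
                    {"terms": {"company.industryCodes": ["healthcare", "finance"]}},
                ],
                # Exclude anything tagged "marketing".
                "must_not": [
                    {"term": {"company.industryCodes": "marketing"}}
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_contact_query(), indent=2))
```

The resulting JSON could be sent as the body of a `_search` request; from NEST the same structure would be built with its own query types rather than a raw dict.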

We currently hold the data in various SQL Server databases, and I'm in the process of pumping them all into ES using a C# project with NEST. The aim is to combine various disparate data-sources into one definitive source.

The questions:

  1. I'm putting the company and contact objects into a single index. Because Kibana doesn't appear to support the parent/child relationships of ES, I'm duplicating a lot of the data by including the company data on the contact. For example, the field-list in Kibana shows both "companyname" (from all the company records) and also "company.companyname" (from all the contact records). Should I split these into two indexes? Can I even have a parent-child relationship across indexes? I could also ditch the parent-child concept and just flatten the "contact" object to contain all the fields on "company", and have a separate (much smaller!) index for just searching company info.

  2. I'm using a 5*2 shard setup. Is there any advantage to splitting the index like the Logstash defaults do? For example, corpdata-a, corpdata-b, etc. Or would increasing the number of shards be worthwhile?
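For question 1, the "flatten the contact" option could be sketched roughly as below: each contact document carries a copy of its company's fields under a `company` key, which is where field names like "company.companyname" come from. This is only an illustration in Python (the real loader is the C#/NEST project), and the function name and sample field names are hypothetical.

```python
def denormalise_contact(contact, company):
    """Return a contact document with the company's fields embedded.

    Copies are taken so the source records are not modified.
    """
    doc = dict(contact)
    # Embedded object; ElasticSearch exposes these as company.<field>.
    doc["company"] = dict(company)
    return doc


if __name__ == "__main__":
    # Sample records with made-up values, purely for illustration.
    company = {"companyname": "Acme Ltd", "countryCode": "UK"}
    contact = {"name": "Jane Doe", "jobLevel": "CEO"}
    print(denormalise_contact(contact, company))
```

The trade-off is the one described above: the company data is duplicated on every contact, so a change to a company means re-indexing all of its contacts.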

Currently running on two Dell PowerEdge R710 servers with 12 cores and 98 GB of memory. The hosts are ESX environments running a total of 4 ES nodes with an ES_HEAP_SIZE of 20G.

Apologies for the ramble, I just thought it might help to provide some context :smile:

