How to structure multi-site documents in Elasticsearch?

I have 12 sites, all producing 5 different types of log. The number of sites is growing rapidly, so by the end of the year I might have 100 sites.

Each log can have millions of documents.

So would you recommend:

  1. One index for each site, with 5 types in each index representing each log type, OR
  2. One index with 5 types that contains all the sites' logs?

I'm leaning towards option 1 because of the quantity of documents. Any advice is much appreciated.

One index per day/week/100 million docs. A keyword field for the site. A
type for the log type. Keep in mind that types are very similar to just
having a keyword field: if a field named "foo" is a string in one type, it
can't be a number in another type.
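
For concreteness, here's a minimal sketch of that layout with the Python client. This assumes an Elasticsearch 5.x-era cluster (where multiple mapping types per index are still allowed), and all the index, type, and field names are placeholders, not prescriptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local cluster on the default port

# One index per week, with a keyword field identifying the site.
es.indices.create(
    index="logs-2016.26",  # e.g. year.week
    body={
        "mappings": {
            "log_type_a": {  # one mapping type per log type (pre-6.x style)
                "properties": {
                    "site": {"type": "keyword"},
                    "message": {"type": "text"},
                    "@timestamp": {"type": "date"},
                }
            }
        }
    },
)

# Queries for a single site then just filter on the keyword field:
es.search(
    index="logs-2016.26",
    body={"query": {"bool": {"filter": {"term": {"site": "site-10"}}}}},
)
```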

If your logs are very different, with great disparity in field names or
values, you may want an index per day/month/whatever per log type. This is a
hard balancing act: more indexes have more overhead, but sparse fields, which
are often caused by squashing lots of types together, can cause inefficient
data structures both on disk and off, mostly in doc_values.
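
If you do go that way, the same idea with one time-based index series per log type might look like this sketch (the monthly naming scheme and the `index_log` helper are assumptions for illustration):

```python
from datetime import date

from elasticsearch import Elasticsearch

es = Elasticsearch()

def index_log(log_type, doc):
    # One time-based index series per log type, e.g. "log_type_a-2016.06".
    # Pick whatever period keeps shards a sensible size.
    index_name = "{}-{}".format(log_type, date.today().strftime("%Y.%m"))
    es.index(index=index_name, doc_type=log_type, body=doc)

index_log("log_type_a", {
    "site": "site-10",
    "message": "...",
    "@timestamp": "2016-06-14T12:00:00Z",
})
```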

Thanks Nik.

All the logs are identical in structure on each site. So each site has log_type_a, log_type_b, log_type_c, ... The structure of the documents is identical for each type. So log_type_a on site 10 has the same mapping as log_type_a on site 9.
There will be no requirement to search across sites, so I thought that keeping each site's logs encapsulated in its own index was neater. So you suggest that combining the documents into the same index with a site field, then starting a new index each month (for example), is a better solution? Is this more efficient, then?

While keeping each site's logs in its own index would be neater, and might net you a small performance boost because you don't have to filter on site, it's not a good idea for two reasons:

  1. Each shard has a non-trivial overhead.
  2. Deleting old documents from an index is way, way more work for Elasticsearch than deleting old indexes.

Given these, any time you can reshape your problem into one of time-series indexes, you'll tend to do better.
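
To illustrate why the second point matters, here's a sketch comparing the two retention approaches (index names assume the weekly scheme above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Dropping a whole index is a cheap, near-instant operation:
es.indices.delete(index="logs-2016.01")

# By contrast, deleting documents in place means finding each one,
# marking it deleted, and merging the deletes away later, e.g.:
# es.delete_by_query(
#     index="logs",
#     body={"query": {"range": {"@timestamp": {"lt": "2016-01-01"}}}},
# )
```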

It's fine, for instance, to put your biggest customers in their own indexes. It's just that you can't have tons and tons of indexes, because then you'll have tons and tons of shards.

The "index per week" thing is one of those balancing act things - the overhead of having lots of indexes is worth it because we can delete stuff after its retention period has passed more easily. And there are a few other nice things - writing to empty indexes is faster than full ones. Once you know you'll never modify an index again you can _optimize it to squash it into a single segment for faster searching and, typically, less disk usage. Also each index can only hold a maximum of java's MAX_INT documents, so about 2 billion. And time series indexes gives you a convenient place to say "now I'm making a new one" so you don't run into that.


Thanks Nik, will take all that on board, much appreciated.