High Scale Use Case

SGE · December 20, 2016, 12:53pm

Hi,

Our system currently uses another NoSQL data store to manage its data.
It currently supports 180,000 inserts per second (we never update), each document is 1KB and is stored in the system for 180 days, yielding 2.6PB of raw data. The system continuously analyzes the data and allows the users to browse analysis results and to randomly access the stored documents (by id).
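As a rough check of the figures above, here is a minimal sketch of the arithmetic. It assumes "1KB" means 1024 bytes and that the insert rate is sustained around the clock; neither is stated explicitly in the post.

```python
# Back-of-the-envelope estimate of the steady-state raw data volume.
SECONDS_PER_DAY = 86_400


def raw_volume(inserts_per_second: int, doc_bytes: int, retention_days: int) -> tuple[int, int]:
    """Return (document count, total bytes) retained at steady state."""
    docs = inserts_per_second * SECONDS_PER_DAY * retention_days
    return docs, docs * doc_bytes


# Assumption: "1KB" is 1024 bytes and the rate is constant.
docs, total_bytes = raw_volume(180_000, 1024, 180)
tib = total_bytes / 1024**4

print(f"{docs:,} documents, about {tib:,.0f} TiB")
```

This comes out at roughly 2.8 trillion documents and a little over 2,600 TiB, which is in line with the 2.6PB quoted.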
We have a new requirement to support advanced search capabilities over a subset of fields (20 out of 280).
We are considering using Elasticsearch in one of two ways:

1- As a search engine only (with _source disabled): we index the 20 fields, and when the user performs a search we retrieve the ids from Elasticsearch and then use them to fetch the actual documents from a document store (e.g. Cassandra).
2- As both search engine and document store (with _source enabled).
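For illustration, the two options differ mainly in one mapping setting. The sketch below builds an index-creation body for each; the field names are hypothetical stand-ins for the 20 searchable fields, and the layout assumes a typeless mapping as used by recent Elasticsearch versions (older versions nest the same settings under a mapping type name).

```python
import json

# Hypothetical field names; the real system would list its 20 searchable fields.
SEARCH_FIELDS = {
    "user_id": {"type": "keyword"},
    "created_at": {"type": "date"},
    "message": {"type": "text"},
}


def index_body(store_source: bool) -> dict:
    """Build an index-creation body for either approach."""
    return {
        "mappings": {
            # Option 1 disables _source, option 2 keeps it.
            "_source": {"enabled": store_source},
            # Fields not listed below are not indexed; with _source enabled
            # they are still kept in the stored document.
            "dynamic": False,
            "properties": SEARCH_FIELDS,
        }
    }


search_only = index_body(store_source=False)       # option 1
search_and_store = index_body(store_source=True)   # option 2

print(json.dumps(search_only, indent=2))
```

Either body could be sent when creating the index; everything else about the indexing chain stays the same, so the two options can be benchmarked against each other on the same sample of documents.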

What would be the better approach considering the high insert rate?

Thanks

jpountz · December 20, 2016, 1:28pm

When it comes to indexing speed, _source is not the bottleneck in my experience: it is cheap to add to the index as it is mostly about compressing chunks of 16KB and writing them sequentially. The bottleneck is usually more about analyzing documents (computing tokens) and indexing them, which involves a lot of sorting for instance.

Maybe you should rather look into the complexity of your indexing chain, i.e. which analyzers you are using, how many fields you have, etc.

_source will have to be evaluated too, but it should come later in the decision process I think. To me the biggest benefit of disabling _source is to make the index smaller to improve how much of the index the file system cache can cover.

nik9000 · December 20, 2016, 1:29pm

Elasticsearch is more designed around the second use case. Aggregations are
all implemented against a column store built when the documents are
indexed. Fetching IDs is not implemented against the column store. It uses
the more compact storage mechanism used to store _source which works fine
if you want to fetch a few of them, but poorly for millions. Fetching all
the results requires the scroll mechanism to let you get deep into the
result set, and it has more overhead than elasticsearch doing the
aggregation on its own.
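To make the cost of the search-engine-only approach concrete, here is a minimal sketch of the client-side flow: page through matching ids, then fetch the documents from the other store in batches. The `fake_search` and `fake_load_many` functions are hypothetical in-memory stand-ins; in a real system the first would be backed by a scroll (or similar paging) request and the second by a multi-get against the document store.

```python
from typing import Callable, Iterable, Iterator


def batched(ids: Iterable[str], size: int) -> Iterator[list[str]]:
    """Group an id stream into lists of at most `size` ids."""
    batch: list[str] = []
    for doc_id in ids:
        batch.append(doc_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def fetch_documents(
    search_ids: Iterable[str],
    load_many: Callable[[list[str]], list[dict]],
    batch_size: int = 500,
) -> Iterator[dict]:
    """Resolve ids from the search engine against a separate document store."""
    for batch in batched(search_ids, batch_size):
        yield from load_many(batch)


# Hypothetical in-memory stand-ins for the search engine and the document store.
store = {f"doc-{i}": {"id": f"doc-{i}", "payload": i} for i in range(10)}


def fake_search() -> Iterator[str]:
    return iter(["doc-1", "doc-4", "doc-7"])


def fake_load_many(ids: list[str]) -> list[dict]:
    return [store[i] for i in ids]


results = list(fetch_documents(fake_search(), fake_load_many, batch_size=2))
print([doc["id"] for doc in results])
```

Every result page costs a round trip to each system, which is fine for a screenful of hits and expensive for millions.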

A couple things:

- Aggregations have fairly limited support for join-like concepts, mostly
  parent/child like stuff. If you are used to arbitrary joins you are likely
  going to miss something. I'd index a few million docs and test.
- You don't have to store the _source to do the aggregations. It is worth
  investigating not storing it if you have another spot for it. But if you
  are thinking of shifting the hardware you were using then maybe you do need
  to store it so you can look at the document.
- Look at the shrink/rollover APIs to help manage your indexes. Experiment
  with index sizes and such. This is enough data that you'll have to find
  some comfortable index sizes to work with.
- Look at hot/warm machines and migrating data from one to the other. It
  might be a good setup if you don't usually query all the data. Like if you
  usually query the first month and only sometimes go back all 180 days.
- Make sure you clean up the data that rolls off the end of your time
  window by deleting the index it is in. Deleting documents doesn't work as
  well as deleting the whole index.
- Beware elasticsearch's dynamic mapping updates. They work fine for
  demonstrations but have scalability issues with lots of documents at once.
- Also having a zillion different fields doesn't work super well. The column
  store isn't super happy with sparse values. It is OK, but not great.
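The point about cleaning up by deleting whole indexes can be sketched as follows. This assumes time-based indexes named `docs-YYYY.MM.DD`; the prefix and the daily granularity are hypothetical and chosen only for illustration, since at this volume the comfortable index size would need to be found by experiment. The function only decides which index names have aged out of the 180-day window; issuing the actual delete requests is left out.

```python
from datetime import date, datetime, timedelta

INDEX_PREFIX = "docs-"   # hypothetical naming scheme: docs-YYYY.MM.DD
RETENTION_DAYS = 180     # retention window from the original post


def index_name(day: date) -> str:
    """Name of the index that holds documents inserted on `day`."""
    return f"{INDEX_PREFIX}{day:%Y.%m.%d}"


def expired_indices(existing: list[str], today: date,
                    retention_days: int = RETENTION_DAYS) -> list[str]:
    """Return the indexes whose day is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in existing:
        if not name.startswith(INDEX_PREFIX):
            continue  # not one of ours, leave it alone
        day = datetime.strptime(name[len(INDEX_PREFIX):], "%Y.%m.%d").date()
        if day < cutoff:
            expired.append(name)
    return expired


today = date(2016, 12, 20)
existing = [
    index_name(today - timedelta(days=181)),  # just outside the window
    index_name(today - timedelta(days=180)),  # oldest day still kept
    index_name(today),
    "unrelated-index",
]
print(expired_indices(existing, today))
```

Each name returned would then be dropped as a whole index, which avoids the per-document delete cost mentioned above.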

Folks run elasticsearch at this scale, but it is big enough that you'll
need to pay some attention to it to get it humming along.

SGE · December 20, 2016, 7:26pm

Thanks for the quick and thorough replies.

I have two more questions:
1- According to my calculations, the first approach (_source disabled) will result in roughly 200TB while the second approach (_source enabled) will result in 2700TB being managed by ES.
Do these extra 2500TB have any implication other than additional storage requirements? (I've read something about increasing the likelihood of index corruption taking place)
2- Is it best practice to use ES as a document store which is its own source-of-truth?

nik9000 · December 20, 2016, 8:08pm

For the first question it is probably worth experimenting with. It'll certainly take longer to replicate those indexes. If you don't need the source documents you certainly can turn off storing them. Source usually doesn't cause as drastic a size change as you've calculated though so I'm not sure what is up there.

As far as source-of-truth-ness goes: this is a good read. It really depends on your tolerance. Elasticsearch has a ways to go before I'd consider it rock solid, but for many uses it is fine.

SGE · December 21, 2016, 12:38pm

Thanks

