High Scale Use Case

Hi,

Our system currently uses another NoSQL data store to manage its data.
It currently supports 180,000 inserts per second (we never update), each document is 1KB and is stored in the system for 180 days, yielding 2.6PB of raw data. The system continuously analyzes the data and allows users to browse analysis results and to randomly access the stored documents (by id).
We have a new requirement to support advanced search capabilities over a subset of fields (20 out of 280).
We are considering using Elasticsearch in one of two ways:

  1. As a search engine only (with _source disabled): we index the 20 fields, and when a user performs a search we retrieve the ids from Elasticsearch and then use them to fetch the actual documents from a document store (e.g. Cassandra).
  2. As both search engine and document store (with _source enabled)
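A minimal sketch of the read path in option 1, using in-memory dictionaries as stand-ins for both systems. The names `SEARCH_INDEX`, `DOCUMENT_STORE`, `search_ids` and `fetch_documents`, and the sample fields, are hypothetical; a real deployment would issue network calls to Elasticsearch and to the document store instead.

```python
from typing import Dict, List

# Hypothetical stand-in for the search engine: only the searchable fields.
SEARCH_INDEX: Dict[str, Dict[str, str]] = {
    "doc-1": {"status": "error", "host": "web-01"},
    "doc-2": {"status": "ok", "host": "web-02"},
    "doc-3": {"status": "error", "host": "web-02"},
}

# Hypothetical stand-in for the document store: the full documents by id.
DOCUMENT_STORE: Dict[str, Dict[str, str]] = {
    "doc-1": {"status": "error", "host": "web-01", "payload": "full body 1"},
    "doc-2": {"status": "ok", "host": "web-02", "payload": "full body 2"},
    "doc-3": {"status": "error", "host": "web-02", "payload": "full body 3"},
}


def search_ids(field: str, value: str) -> List[str]:
    """Step 1: the search engine returns matching ids only."""
    return sorted(
        doc_id
        for doc_id, fields in SEARCH_INDEX.items()
        if fields.get(field) == value
    )


def fetch_documents(ids: List[str]) -> List[Dict[str, str]]:
    """Step 2: the document store resolves ids to full documents."""
    return [DOCUMENT_STORE[doc_id] for doc_id in ids if doc_id in DOCUMENT_STORE]


def search_documents(field: str, value: str) -> List[Dict[str, str]]:
    """Two round trips per search: ids first, then the documents."""
    return fetch_documents(search_ids(field, value))
```

Option 2 would collapse the two steps into one call, at the cost of Elasticsearch holding the full documents.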

What would be the better approach considering the high insert rate?
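For reference, the raw-data figure above can be checked with a few lines of arithmetic. This is only a sketch: it assumes "1KB" means 1024 bytes and ignores replication, compression and index overhead, none of which are stated in the question.

```python
INSERTS_PER_SECOND = 180_000
DOC_SIZE_BYTES = 1024      # assumption: "1KB" is taken as 1024 bytes
RETENTION_DAYS = 180
SECONDS_PER_DAY = 86_400

# Number of documents held at any time once the 180-day window is full.
total_docs = INSERTS_PER_SECOND * SECONDS_PER_DAY * RETENTION_DAYS

# Raw size before replication, compression or index structures.
total_bytes = total_docs * DOC_SIZE_BYTES
total_pib = total_bytes / 2**50

print(f"documents retained: {total_docs:,}")
print(f"raw data: {total_pib:.2f} PiB")
```

Under these assumptions the result is a little over 2.5 PiB, the same order as the 2.6PB quoted.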

Thanks

When it comes to indexing speed, _source is not the bottleneck in my experience: it is cheap to add to the index, as it is mostly about compressing chunks of 16KB and writing them sequentially. The bottleneck is usually more about analyzing documents (computing tokens) and indexing them, which involves a lot of sorting, for instance.

Maybe you should rather look into the complexity of your indexing chain, i.e. which analyzers you are using, how many fields you have, etc.

_source will have to be evaluated too, but it should come later in the decision process I think. To me the biggest benefit of disabling _source is to make the index smaller to improve how much of the index the file system cache can cover.
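As an illustration of what the two options look like in an index definition, here is a sketch that builds the create-index body with `_source` switched on or off. The field names and types are hypothetical placeholders for the 20 searchable fields, and the layout assumes a recent Elasticsearch version with typeless mappings; older versions nest the same settings under a mapping type name.

```python
import json

# Hypothetical field names and types standing in for the 20 searchable fields.
SEARCHABLE_FIELDS = {
    "user_id": "keyword",
    "event_type": "keyword",
    "message": "text",
    "created_at": "date",
}


def build_index_body(store_source: bool) -> dict:
    """Build a create-index body with _source enabled or disabled."""
    return {
        "mappings": {
            "_source": {"enabled": store_source},
            "properties": {
                name: {"type": field_type}
                for name, field_type in SEARCHABLE_FIELDS.items()
            },
        }
    }


search_only = build_index_body(store_source=False)   # option 1
search_and_store = build_index_body(store_source=True)  # option 2
print(json.dumps(search_only, indent=2))
```

Note that keyword and date fields skip the analysis step entirely, so the choice of field types affects the indexing cost discussed above as much as the `_source` setting does.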

Elasticsearch is designed more around the second use case. Aggregations are
all implemented against a column store built when the documents are
indexed. Fetching IDs is not implemented against the column store. It uses
the more compact storage mechanism used to store _source, which works fine
if you want to fetch a few of them, but poorly for millions. Fetching all
the results requires the scroll mechanism to let you get deep into the
result set, and it has more overhead than Elasticsearch doing the
aggregation on its own.
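To make the contrast concrete, here is a sketch of the request bodies involved. The helper names and the aggregation name `by_value` are hypothetical. The first scroll request would go to `/<index>/_search?scroll=1m` and each follow-up to `/_search/scroll`, repeated until no hits come back.

```python
def aggregation_body(field: str) -> dict:
    """One request: Elasticsearch aggregates and returns buckets, no hits."""
    return {
        "size": 0,
        "aggs": {"by_value": {"terms": {"field": field}}},
    }


def scroll_first_page_body(page_size: int) -> dict:
    """First of many requests: page through every hit, ids only."""
    return {
        "size": page_size,
        "_source": False,        # do not return document bodies
        "sort": ["_doc"],        # index order, the cheapest order for scrolling
        "query": {"match_all": {}},
    }


def scroll_next_page_body(scroll_id: str, keep_alive: str = "1m") -> dict:
    """Every later request carries the scroll id from the previous response."""
    return {"scroll": keep_alive, "scroll_id": scroll_id}
```

With millions of matching documents, the scroll path means thousands of round trips while the aggregation path is a single request.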

A couple things:

  • Aggregations have fairly limited support for join-like concepts, mostly
    parent/child like stuff. If you are used to arbitrary joins you are likely
    going to miss something. I'd index a few million docs and test.
  • You don't have to store the _source to do the aggregations. It is worth
    investigating not storing it if you have another spot for it. But if you
    are thinking of shifting the hardware you were using then maybe you do need
    to store it so you can look at the document.
  • Look at the shrink/rollover APIs to help manage your indexes. Experiment
    with index sizes and such. This is enough data that you'll have to find
    some comfortable index sizes to work with.
  • Look at hot/warm machines and migrating data from one to the other. It
    might be a good setup if you don't usually query all the data. Like if you
    usually query the first month and only sometimes go back all 180 days.
  • Make sure you clean up the data that rolls off the end of your time
    window by deleting the index it is in. Deleting documents doesn't work as
    well as deleting the whole index.
  • Beware elasticsearch's dynamic mapping updates. They work fine for
    demonstrations but have scalability issues with lots of documents at once.
    Also having a zillion different fields doesn't work super well. The column
    store isn't super happy with sparse values. It is OK, but not great.
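The point about dropping whole indexes can be sketched as a small retention helper. It assumes one index per day named `docs-YYYY.MM.DD`; that naming scheme, the prefix and the function names are hypothetical, and rollover-based setups would name their indexes differently.

```python
from datetime import date, timedelta
from typing import List

RETENTION_DAYS = 180
INDEX_PREFIX = "docs-"  # hypothetical naming scheme: docs-YYYY.MM.DD


def index_name(day: date) -> str:
    """Name of the daily index that holds documents for the given day."""
    return f"{INDEX_PREFIX}{day:%Y.%m.%d}"


def expired_indices(existing: List[str], today: date) -> List[str]:
    """Return the daily indexes strictly older than the retention window.

    Zero-padded YYYY.MM.DD names sort lexicographically in date order,
    so a plain string comparison against the cutoff name is enough.
    """
    cutoff = index_name(today - timedelta(days=RETENTION_DAYS))
    return sorted(
        name
        for name in existing
        if name.startswith(INDEX_PREFIX) and name < cutoff
    )
```

Each name returned would then be removed with a single delete-index request, which drops the files on disk instead of marking documents as deleted one by one.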

Folks run elasticsearch at this scale, but it is big enough that you'll
need to pay some attention to it to get it humming along.

Thanks for the quick and thorough replies.

I have two more questions:
1- According to my calculations, the first approach (_source disabled) will result in roughly 200TB while the second approach (_source enabled) will result in 2700TB being managed by ES.
Do these extra 2500TB have any implication other than additional storage requirements? (I've read something about increasing the likelihood of index corruption taking place)
2- Is it best practice to use ES as a document store which is its own source-of-truth?

The first question is probably worth experimenting with. It'll certainly take longer to replicate those indexes. If you don't need the source documents you certainly can turn off storing them. Source usually doesn't cause as drastic a size change as you've calculated though, so I'm not sure what is up there.

As far as source-of-truth-ness goes: this is a good read. It really depends on your tolerance. Elasticsearch has a ways to go before I'd consider it rock solid, but for many uses it is fine.

Thanks
