Handling Huge Index Cardinality

I'm planning for a system that will be indexing a large number (>100M/day) of telephone records. A new date-based index will be created each day. The important fields within each document are the called and calling phone numbers. These fields are strings and would be "not_analyzed". I would have to assume each call has a distinct called number, and therefore that there would be close to 100M unique values in the called number field each day.

My question is, can a phone number search across a week on an index with that level of cardinality be expected to perform reasonably well (say < 5 seconds) on a 3 to 5 node cluster (spinning disks)? Would a cluster of that size cope with that many incoming documents?

I realize there's no straight answer here, but any guidance would be appreciated.


100M records per day is not that much, and a 3 to 5 node cluster should easily be able to index this, even with spinning disks, assuming sufficient CPU is available and the records are of a reasonable size. If the records have an average size of 1 kB, that corresponds to roughly 100 GB a day.
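That back-of-envelope figure is easy to sanity-check (the document counts and record size here are the assumed numbers from the discussion, not measurements):

```python
# Rough daily ingest volume, using the figures assumed above.
docs_per_day = 100_000_000   # ~100M call records per day
avg_doc_size_kb = 1          # assumed average record size of 1 kB

# kB -> GB: divide by 1,000,000
daily_volume_gb = docs_per_day * avg_doc_size_kb / 1_000_000
print(f"~{daily_volume_gb:.0f} GB/day")  # ~100 GB/day
```

Replication and index overhead would sit on top of that, so actual disk usage per day will be higher than the raw document volume.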

What often determines the size of the cluster needed is a combination of indexing and query volume as well as retention period. The best way to find out is to perform some benchmarking on your chosen hardware.

Give it a shot! I expect the cardinality won't be an issue. The terms dictionary is stored in an FST, so I expect it'll take advantage of the common prefixes quite well.
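To illustrate why phone numbers compress well in an FST: the terms dictionary stores terms in sorted order, and dense numeric ranges share long prefixes that get stored once rather than per term. A quick sketch (the number range is made up for illustration):

```python
import os.path

# 10,000 consecutive NANPA-style numbers, as sorted keyword terms.
terms = sorted(f"1321555{n:04d}" for n in range(10_000))

# Every term shares this prefix; an FST stores the shared arcs once.
shared = os.path.commonprefix(terms)
print(shared)  # 1321555
```

Real traffic won't be this dense, but area codes and exchanges still give plenty of prefix sharing.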

Depending on how you query, you may want to pull bits of the phone numbers out into separate fields, or use a pattern-based analyzer to extract them into "sub-fields". The thing is that a prefix query over those high-cardinality terms is going to be much slower than an exact match. So if you think you'll frequently be searching for things like "all calls from numbers starting with 1321 to numbers starting with 1252", you might want to make a field that indexes the NPA (area code).
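One way to do that is client-side, deriving the extra fields before indexing so the frequent prefix-style questions become cheap exact-match term queries. A minimal sketch, assuming normalized NANPA numbers; the field names (`called_number`, `called_number_npa`, etc.) are illustrative, not from the thread:

```python
def enrich(record: dict) -> dict:
    """Add an NPA (area code) sub-field for each phone number field,
    so queries can do an exact term match instead of a prefix query."""
    doc = dict(record)
    for field in ("called_number", "calling_number"):
        number = doc[field]                # expected form: "1" + 10 digits
        doc[f"{field}_npa"] = number[1:4]  # NANPA area code after the leading 1
    return doc

doc = enrich({"called_number": "12525550123", "calling_number": "13215550100"})
print(doc["called_number_npa"])   # 252
print(doc["calling_number_npa"])  # 321
```

A term query on `called_number_npa` then replaces a prefix query over the full high-cardinality number field.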

Another thing: strip the non-numeric characters before sending the numbers to Elasticsearch. They just get in the way. Do send the leading 1 even if you are only dealing with NANPA, just in case.
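That normalization step might look like this; a sketch for the NANPA-only case described above (the function name and the 10-digit assumption are mine):

```python
import re

def normalize_nanpa(raw: str) -> str:
    """Strip formatting characters and ensure the leading 1 is present
    before the number is sent to Elasticsearch."""
    digits = re.sub(r"\D", "", raw)  # drop "+", "-", "(", ")", spaces, etc.
    if len(digits) == 10:            # bare 10-digit NANPA number
        digits = "1" + digits
    return digits

print(normalize_nanpa("+1 (321) 555-0100"))  # 13215550100
print(normalize_nanpa("252-555-0123"))       # 12525550123
```

International traffic would need real country-code handling (e.g. an E.164 library) rather than this length check.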

I ran a couple of tests on our index. We have daily indexes with a load somewhat similar to yours. Hardware and index structure can always make a difference, though, so the results may be meaningless.

Running a cardinality aggregation on a string field across 125 million records took 5 seconds; across 760 million records it took 16 seconds. We have fewer distinct values than you do, though: the second test only had 1.6 million distinct values. We also had other indexing and searching going on that could have affected the performance.
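For reference, a cardinality aggregation of that kind is just a small request body. A sketch built as a plain Python dict (the field and aggregation names are illustrative, not from my actual test):

```python
# Request body for a distinct-count over a keyword field.
# Elasticsearch's cardinality aggregation is approximate (HyperLogLog++);
# precision_threshold trades memory for accuracy, up to a cap.
cardinality_request = {
    "size": 0,  # skip hits, return only the aggregation result
    "aggs": {
        "distinct_called": {
            "cardinality": {
                "field": "called_number",
                "precision_threshold": 10000,
            }
        }
    },
}
```

Note the approximation: with ~100M distinct values per day, the returned count will be an estimate, which is usually fine for capacity checks like this.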

Thanks! Dealing with international traffic here, but that's helpful.