Elasticsearch Fan-Out Read

I am a first-time user of Elasticsearch. I am tasked with implementing a news feed that uses a fan-out-on-read to scan news from my friends and return it in chronological order.

My first hypothesis is that I could create a single index in which each document is tagged with the publisher's user ID. The consumer would fetch the news by passing a giant array of their friends' user IDs in the request. Is this a realistic solution?

Additionally, a friend suggested that I could leverage shards (each publisher gets their own shard, and the consumer scans all of their friends' shards).
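
As far as I can tell, shards aren't directly addressable from the client, so I assume this would really mean custom routing, where a routing value (e.g. the publisher's ID) decides which shard a document lands on. A rough sketch of what I have in mind with the NEST client (doc, client, and friends are placeholders from my test setup):

        // Write side: route each doc by its publisher, so all of one
        // publisher's docs land on the same shard. doc is a NewsModel.
        await client.IndexAsync(doc, i => i
            .Index("news")
            .Routing(doc.publisherId.ToString()));

        // Read side: pass every friend's ID as a routing value so only the
        // shards holding those publishers are queried, then filter by ID.
        var routingValues = friends.ConvertAll(f => f.ToString()).ToArray();
        var response = await client.SearchAsync<NewsModel>(s => s
            .Index("news")
            .Routing(routingValues)
            .Query(q => q.Terms(t => t.Field(f => f.publisherId).Terms<int>(friends))));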

Am I on the right path? Could you recommend any related best practices or blogs?

That'll hit limits eventually; each shard can only hold ~2^31 docs.

Just go with time-based indices and tag them with the various IDs; then you can filter with good efficiency.
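
Roughly, that pattern looks like this (a sketch reusing the NewsModel type from the examples below; the monthly buckets and the "news-" prefix are just illustrations):

        // Write side: index each doc into a bucket named for its timeframe,
        // e.g. "news-2017.05" for May 2017.
        var indexName = $"news-{doc.created:yyyy.MM}";
        await client.IndexAsync(doc, i => i.Index(indexName));

        // Read side: search only the recent buckets (a wildcard keeps it
        // simple), still filtering by the friends' publisher IDs.
        var result = await client.SearchAsync<NewsModel>(s => s
            .Index("news-*") // or list the specific recent indices
            .Query(q => q.Terms(t => t.Field(f => f.publisherId).Terms<int>(friends)))
            .Sort(so => so.Descending(a => a.created)));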

That'll hit limits eventually; each shard can only hold ~2^31 docs.

What if I were to include a TTL on the document?

Just go with time-based indices and tag them with the various IDs

I'm not sure what you mean by this. Like, create a new monthly index or something like that? This seems strange to me.

That said, I came up with a minimal example yesterday. I have a single index and tagged each document with a publisherId. I could request documents for a given set of publishers using a Terms query pretty easily.

I had 400k documents in my single index and no special shard logic at first. On a second iteration I split my documents across two shards. When I did this I saw no performance increase and was wondering whether the extra step was worth the risk. Perhaps I was doing it wrong?

Bulk Insert Logic

            // Assumes client (ElasticClient), rand (Random), route ("1" or "2"),
            // publisherStart, publisherCount and recordCount are set up by the caller.
            var descriptor = new BulkDescriptor();

            for (int p = 0; p < publisherCount; p++)
            {
                for (int r = 0; r < recordCount; r++)
                {
                    var ops = new BulkCreateDescriptor<NewsModel>();
                    ops.Routing(route); // custom routing value, "1" or "2"
                    ops.Document(new NewsModel
                    {
                        id = Guid.NewGuid().ToString(),
                        // spread created timestamps into the past so sorting is meaningful
                        created = DateTime.UtcNow.Subtract(TimeSpan.FromHours(rand.Next(1, publisherCount * recordCount))),
                        publisherId = p + publisherStart
                    });

                    descriptor.AddOperation(ops);
                }
            }

            var response = await client.BulkAsync(descriptor);

Query Logic

        // Build the list of friend (publisher) IDs to read from.
        List<int> friends = new List<int>();
        for (int i = 0; i < friendCount; i++)
        {
            friends.Add(i);
        }

        var search = new SearchDescriptor<NewsModel>();

        search.Routing("1", "2"); // only query the shards behind these routing values
        search.Sort(so => so.Descending(a => a.created)); // newest first
        search.Size(size);
        // match any document whose publisherId is one of my friends
        search.Query(q => q.Terms(t => t.Field(f => f.publisherId).Terms<int>(friends)));

        var result = client.Search<NewsModel>(search);

What if I were to include a TTL on the document?

The TTL feature has been removed now. See Mapping changes | Elasticsearch Guide [5.2] | Elastic.

Removing docs in Elasticsearch with something like a TTL field will cost you a lot of I/O. Removing a doc actually just adds a marker somewhere indicating that the doc has been removed; the doc is not physically removed from disk until a Lucene merge operation happens.

Removing docs will generate a lot of merges, and therefore a lot of I/O.

That's why @warkolm recommended:

Just go with time-based indices and tag them with the various IDs

That's the way to go.

I had 400k documents in my single index and no special shard logic at first. On a second iteration I split my documents across two shards. When I did this I saw no performance increase and was wondering whether the extra step was worth the risk. Perhaps I was doing it wrong?

No. You will see a big difference at scale. If you go with time-based indices, I'd recommend testing whether a single index with a single shard can hold all the data you need per timeframe (day, month, whatever), and increasing the number of shards only if needed.
Having more shards will also help spread the indexing load across more writers, I'd say. So find the right balance for you.
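
For example, when the job that rolls over to the next timeframe creates the new index (a sketch; the index name and settings are illustrative):

        // Create each time-based index with an explicit shard count: start
        // at 1 and raise it only if one shard can't hold a full timeframe.
        var createResponse = await client.CreateIndexAsync("news-2017.06", c => c
            .Settings(s => s
                .NumberOfShards(1)
                .NumberOfReplicas(1)));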

My 2 cents.

Just go with time-based indices and tag them with the various IDs

So, maybe an index each week, and then search over the last 4 weeks? Then my cron can just nuke older indices? This sounds good.
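
Something like this for the cleanup job, I guess (a sketch; the week-numbered index name is hypothetical and would be computed from the current date):

        // Weekly cron: drop any bucket older than 4 weeks. Deleting a whole
        // index is cheap compared to per-doc deletes: no tombstones, no merges.
        var staleIndex = "news-2017.17"; // hypothetical name computed by the cron
        var deleteResponse = await client.DeleteIndexAsync(staleIndex);

(Elastic's Curator tool can also automate this kind of time-based index cleanup.)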

Exactly!
