Elasticsearch Fan-Out Read

I am a first-time user of Elasticsearch. I am tasked with implementing a news feed that uses a fan-out-on-read to scan news from my friends and return it in chronological order.

My first hypothesis is that I could create a single index in which each document is tagged with the publisher's user ID. The consumer would fetch the news by passing a giant array of their friends' user IDs in the request. Is this a realistic solution?

Additionally, a friend suggested that I could leverage shards (each publisher gets their own shard, and the consumer scans all of their friends' shards).
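
As far as I can tell, shards aren't directly addressable from the client, so I assume this would really mean custom routing, where a routing value (e.g. the publisher's ID) decides which shard a document lands on. A rough sketch of what I have in mind with the NEST client (doc, client, and friends are placeholders from my test setup):

        // Write side: route each doc by its publisher, so all of one
        // publisher's docs land on the same shard. doc is a NewsModel.
        await client.IndexAsync(doc, i => i
            .Index("news")
            .Routing(doc.publisherId.ToString()));

        // Read side: pass every friend's ID as a routing value so only the
        // shards holding those publishers are queried, then filter by ID.
        var routingValues = friends.ConvertAll(f => f.ToString()).ToArray();
        var response = await client.SearchAsync<NewsModel>(s => s
            .Index("news")
            .Routing(routingValues)
            .Query(q => q.Terms(t => t.Field(f => f.publisherId).Terms<int>(friends))));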

Am I on the right path? Could you recommend any related best practices or blogs?

That'll hit limits eventually; each shard can only hold ~2^31 docs.

Just go with time-based indices and tag them with the various IDs; then you can filter with good efficiency.
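
Roughly, that pattern looks like this (a sketch reusing the NewsModel type from the examples below; the monthly buckets and the "news-" prefix are just illustrations):

        // Write side: index each doc into a bucket named for its timeframe,
        // e.g. "news-2017.05" for May 2017.
        var indexName = $"news-{doc.created:yyyy.MM}";
        await client.IndexAsync(doc, i => i.Index(indexName));

        // Read side: search only the recent buckets (a wildcard keeps it
        // simple), still filtering by the friends' publisher IDs.
        var result = await client.SearchAsync<NewsModel>(s => s
            .Index("news-*") // or list the specific recent indices
            .Query(q => q.Terms(t => t.Field(f => f.publisherId).Terms<int>(friends)))
            .Sort(so => so.Descending(a => a.created)));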

That'll hit limits eventually; each shard can only hold ~2^31 docs.

What if I were to include a TTL on the document?

Just go with time-based indices and tag them with the various IDs

I'm not sure what you mean by this. Like, create a new monthly index or something like that? This seems strange to me.

That said, I came up with a minimal example yesterday. I have a single index and tagged each document with a publisherId. I could request documents for a given set of publishers using a Terms query pretty easily.

I had 400k documents in my single index and no special shard logic at first. On a second iteration I split my documents across two shards. When I did this I saw no performance increase and was wondering whether the extra step was worth the risk. Perhaps I was doing it wrong?

Bulk Insert Logic

            // Assumes client (ElasticClient), rand (Random), route ("1" or "2"),
            // publisherStart, publisherCount and recordCount are set up by the caller.
            var descriptor = new BulkDescriptor();

            for (int p = 0; p < publisherCount; p++)
            {
                for (int r = 0; r < recordCount; r++)
                {
                    var ops = new BulkCreateDescriptor<NewsModel>();
                    ops.Routing(route); // custom routing value, "1" or "2"
                    ops.Document(new NewsModel
                    {
                        id = Guid.NewGuid().ToString(),
                        // spread created timestamps into the past so sorting is meaningful
                        created = DateTime.UtcNow.Subtract(TimeSpan.FromHours(rand.Next(1, publisherCount * recordCount))),
                        publisherId = p + publisherStart
                    });

                    descriptor.AddOperation(ops);
                }
            }

            var response = await client.BulkAsync(descriptor);

Query Logic

        // Build the list of friend (publisher) IDs to read from.
        List<int> friends = new List<int>();
        for (int i = 0; i < friendCount; i++)
        {
            friends.Add(i);
        }

        var search = new SearchDescriptor<NewsModel>();

        search.Routing("1", "2"); // only query the shards behind these routing values
        search.Sort(so => so.Descending(a => a.created)); // newest first
        search.Size(size);
        // match any document whose publisherId is one of my friends
        search.Query(q => q.Terms(t => t.Field(f => f.publisherId).Terms<int>(friends)));

        var result = client.Search<NewsModel>(search);

What if I were to include a TTL on the document?

The TTL feature has been removed now. See Mapping changes | Elasticsearch Guide [5.2] | Elastic.

Removing docs in Elasticsearch with something like a TTL field will cost you a lot of I/O. Removing a doc actually just adds a marker somewhere indicating that the doc has been removed; the doc is not physically removed from disk until a Lucene merge operation happens.

Removing docs will generate a lot of merges, and therefore a lot of I/O.

That's why @warkolm recommended:

Just go with time-based indices and tag them with the various IDs

That's the way to go.

I had 400k documents in my single index and no special shard logic at first. On a second iteration I split my documents across two shards. When I did this I saw no performance increase and was wondering whether the extra step was worth the risk. Perhaps I was doing it wrong?

No. You will see a big difference at scale. If you go with time-based indices, I'd recommend testing whether a single index with a single shard can hold all the data you need per timeframe (day, month, whatever), and increasing the number of shards only if needed.
Having more shards will also help spread the indexing load across more writers, I'd say. So find the right balance for you.
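
For example, when the job that rolls over to the next timeframe creates the new index (a sketch; the index name and settings are illustrative):

        // Create each time-based index with an explicit shard count: start
        // at 1 and raise it only if one shard can't hold a full timeframe.
        var createResponse = await client.CreateIndexAsync("news-2017.06", c => c
            .Settings(s => s
                .NumberOfShards(1)
                .NumberOfReplicas(1)));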

My 2 cents.

Just go with time-based indices and tag them with the various IDs

So, maybe an index each week, and then search over the last 4 weeks? Then my cron can just nuke older indices? This sounds good.
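
Something like this for the cleanup job, I guess (a sketch; the week-numbered index name is hypothetical and would be computed from the current date):

        // Weekly cron: drop any bucket older than 4 weeks. Deleting a whole
        // index is cheap compared to per-doc deletes: no tombstones, no merges.
        var staleIndex = "news-2017.17"; // hypothetical name computed by the cron
        var deleteResponse = await client.DeleteIndexAsync(staleIndex);

(Elastic's Curator tool can also automate this kind of time-based index cleanup.)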

Exactly!
