Entity Centric Architecture

Hello guys,

I recently watched @Mark_Harwood's lecture about entity-centric indices. I highly recommend it, especially for people who use Elasticsearch for aggregations!

I figured out that this architecture could solve a problem we have with our toughest query. For demonstration, let's say we need to know how many times each product has been watched and how many times it has been bought, in order to calculate the watched vs. bought percentage for every one of our products. This will allow us to quickly get the products with the lowest or highest percentages, instead of first fetching all the products and then calculating both watched and bought during the query phase. So basically, I decided to build a nightly calculation procedure which gets the last day's events, does the calculation, and then sends bulks of upserts into the entity index. Each one of our products is described by one document in the entity index.
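To make the roll-up concrete, here is a minimal Python sketch of the grouping step only. The event shape (`product_id`, `session_id`, `type`) and the definition of the percentage as bought sessions divided by watched sessions are assumptions for illustration, not our real schema.

```python
from collections import defaultdict


def roll_up(events):
    """Group one day's events into per-product sets of session ids."""
    products = defaultdict(lambda: {"watched": set(), "bought": set()})
    for event in events:
        kind = event["type"]
        if kind in ("watched", "bought"):  # ignore any other event types
            products[event["product_id"]][kind].add(event["session_id"])
    return products


def bought_percentage(entry):
    """Share of watching sessions that also bought; None if nobody watched."""
    watched = len(entry["watched"])
    if watched == 0:
        return None
    return 100.0 * len(entry["bought"]) / watched


events = [
    {"product_id": "p1", "session_id": "s1", "type": "watched"},
    {"product_id": "p1", "session_id": "s2", "type": "watched"},
    {"product_id": "p1", "session_id": "s1", "type": "bought"},
]
daily = roll_up(events)
```

Using sets means a session that watches the same product twice in one day is only counted once.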

I have got some questions concerning this technique:

  1. Technical question about field limitations: to keep things simple, I store session GUIDs in a keyword field inside the documents of the entity index. Is it okay to store something like 1000 IDs in one array field?

  2. Technical question about the procedure itself: I'm doing a two-phase calculation. I first get the events (I actually prefer reading the data from a separate store-first storage rather than from Elasticsearch, but it doesn't really matter), then add values to the watched list and the bought list in each document, and finally set isDirtyBit to 1, all via a Python script. After I finish going over the events, I run an update by query which finds all the documents with isDirtyBit set and updates the percentage of every relevant document via a Painless script. What do you think?

  3. Architectural question: Pre-calculation is great and a step in the right direction. BUT sometimes even that isn't enough. Supporting dynamic queries at runtime requires anticipating every possible view of the data at build time. My client would like to know what happened to his products during a dynamic date range rather than a specific fixed one. Is there a solution for that? The only solution I could think of is maintaining 2 indices, one describing the last week and one describing the last month.

Feel free to share your thoughts! :)

Remember our heritage is search and it would not be unusual to index a document with a thousand words in it. This should be manageable. Watch for the outlier with a million IDs though and have a policy for dealing with that (reject vs truncate).
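As a rough illustration of such a policy, the client-side sketch below caps the ID list before it is sent to the index. The cap of 1000, the function name `apply_cap` and the choice to keep the most recent IDs when truncating are all assumptions for the example.

```python
MAX_IDS = 1000  # hypothetical cap, tune to your own data


def apply_cap(ids, policy="truncate", limit=MAX_IDS):
    """Deduplicate ids, then enforce the size limit with the chosen policy."""
    unique = list(dict.fromkeys(ids))  # drops repeats, keeps first-seen order
    if len(unique) <= limit:
        return unique
    if policy == "truncate":
        return unique[-limit:]  # keep only the latest `limit` ids
    if policy == "reject":
        raise ValueError(f"{len(unique)} ids exceeds the limit of {limit}")
    raise ValueError(f"unknown policy: {policy}")
```

For example, `apply_cap(["a", "b", "a", "c"], limit=2)` keeps `["b", "c"]`, while the same call with `policy="reject"` raises instead of silently dropping data.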

I'm not clear on why the code that is setting "isDirty" is not also performing the percentage calculation but I suspect I'm missing something about the design.
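For what it's worth, a single-pass version might look like the untested sketch below: one scripted upsert per product appends the new session IDs and recomputes the percentage in the same Painless script, so no dirty flag or follow-up update by query is needed. The field names (`watched`, `bought`, `bought_pct`) and the helper `bulk_upsert_lines` are assumptions; the function only builds the two NDJSON lines for a bulk request and does not talk to a cluster.

```python
import json

# Painless source: merge new ids into both lists, then recompute the ratio.
# The size() calls see the lists as already updated earlier in this script.
UPSERT_SCRIPT = """
for (def sid : params.watched) {
  if (!ctx._source.watched.contains(sid)) { ctx._source.watched.add(sid); }
}
for (def sid : params.bought) {
  if (!ctx._source.bought.contains(sid)) { ctx._source.bought.add(sid); }
}
int w = ctx._source.watched.size();
if (w > 0) {
  ctx._source.bought_pct = 100.0 * ctx._source.bought.size() / w;
} else {
  ctx._source.bought_pct = null;
}
"""


def bulk_upsert_lines(index, product_id, watched, bought):
    """Return the action line and body line for one bulk update."""
    action = {"update": {"_index": index, "_id": product_id}}
    body = {
        "scripted_upsert": True,  # run the script even for a brand-new document
        "script": {
            "lang": "painless",
            "source": UPSERT_SCRIPT,
            "params": {"watched": sorted(watched), "bought": sorted(bought)},
        },
        "upsert": {"watched": [], "bought": []},  # starting state for new products
    }
    return json.dumps(action) + "\n" + json.dumps(body) + "\n"
```

The returned strings can be concatenated for many products and posted as one bulk request body.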

It's possible that the "entity" you are choosing to roll up might represent both an entity AND a time period e.g. Figures for Product X in month Y.
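A minimal sketch of that idea, assuming monthly buckets and a hypothetical `product:YYYY-MM` document ID scheme: each product gets one entity document per month, and a date range is answered by combining the monthly documents it touches. Ranges that start or end mid-month are only approximated at this granularity.

```python
from datetime import date


def entity_period_id(product_id, day):
    """Document id for the (product, month) entity that `day` falls into."""
    return f"{product_id}:{day:%Y-%m}"


def months_between(start, end):
    """List the YYYY-MM keys of every month touched by [start, end]."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return keys
```

For a query covering 20 November 2019 to 3 February 2020, `months_between` yields four keys, so the figures for Product X would be summed over four monthly documents.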


I'm not clear on why the code that is setting "isDirty" is not also performing the percentage calculation but I suspect I'm missing something about the design.

Actually, you are right. I hadn't thought about it. When I make the percentage calculation, the upsert script does see the REAL updated values, right? After all, the document is re-indexed on each update.

It's possible that the "entity" you are choosing to roll up might represent both an entity AND a time period e.g. Figures for Product X in month Y.

Actually, after reconsidering the problem, I decided to try to avoid pre-calculating the bought and watched lists. I still want to create this entity index to keep the relevant data together in each document so the data is much less spread out, but the calculation itself will be done in the query phase to get more control. This is described in another thread here. Hope it will also work.

Remember our heritage is search and it would not be unusual to index a document with a thousand words in it. This should be manageable. Watch for the outlier with a million IDs though and have a policy for dealing with that (reject vs truncate).

I would probably re-index the entity index every month, meaning I will use the scroll API to re-index only the products which have been relevant in the last 2 months, keeping within each document only the actions which are still relevant.
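The per-document pruning step could be sketched roughly as below. This assumes each stored action carries an ISO `timestamp` next to its `session_id`, which is not how the plain keyword array above works, so treat the document shape and the 60-day window as hypothetical.

```python
from datetime import datetime, timedelta


def prune_document(doc, now, keep_days=60):
    """Drop actions older than the window; return None if nothing is left."""
    cutoff = now - timedelta(days=keep_days)
    kept = {}
    for field in ("watched", "bought"):
        kept[field] = [
            action
            for action in doc.get(field, [])
            if datetime.fromisoformat(action["timestamp"]) >= cutoff
        ]
    if not kept["watched"] and not kept["bought"]:
        return None  # product no longer relevant, skip it when re-indexing
    return {**doc, **kept}
```

Each hit read from the scroll would go through `prune_document`, and only the non-None results would be written to the new index.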
