Index structure for storing blog posts and page analytics for said posts?

Consider a typical content platform. Thousands of stories published every day. Millions of pageviews a day. Millions of analytics data events per day.

How should I structure the index(es) to be able to query page views for a given time period for a subset of the posts? Example query: show page views for the period Jan 1 to Mar 31 for posts written by Author A in Section S.
One constraint is that we don't necessarily care who the author was or what the section was at the time of the pageview. We need to filter on the current values of those properties. Meaning, the author of the post could have been Author B on Jan 5, but today (June 1) the author is Author A; this post should be included in the results.

The posts table (in the RDBMS) has post_id, title, a sections array, and an authors array. The pageviews table has event_id, event_time, and post_id.

Hi,
if you want to aggregate (or filter exactly) by author and section, those fields must not be analyzed (or you can add a fieldname_raw variant to be used for aggregations). A draft of the mapping could look like:

{
	"Post": {
		"properties": {
			"PostID": {
				"type": "long"
			},
			"PostTitle":{
				"type": "string"
			},
			"PostDate": {
				"format": "dateOptionalTime",
				"type": "date"
			},
			"Author":{
				"type":"string"
			},
			"Author_raw":{
				"index": "not_analyzed",
				"type":"string"
			},
			"Section":{
				"type":"string"
			},
			"Section_raw":{
				"index": "not_analyzed",
				"type":"string"
			},
			"RelatedPostIDs":{
				"type": "long"
			}
		}
	}
}

This mapping is very simple, but it should be enough for your requirements.
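For example, a document indexed against that mapping could look like the following (a sketch; the index name posts is an assumption, and instead of sending Author/Author_raw twice you could also define them as a multi-field so one value populates both):

```json
PUT /posts/Post/1
{
  "PostID": 1,
  "PostTitle": "Hello world",
  "PostDate": "2016-01-05",
  "Author": "Author A",
  "Author_raw": "Author A",
  "Section": "Section S",
  "Section_raw": "Section S",
  "RelatedPostIDs": [2, 3]
}
```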

Hi,

Will the pageviews still be another index? How do I "join" them (or perform some action that helps me answer the sample question I asked)?

Also, is there any reason why the author and section names are not analyzed? I need to see pageview analytics for a specific author's and section's posts. The way I see it, I need to filter on those two fields.

Hi,
You can add a counter field to the mapping and follow this for the partial update (i.e. the counter increment). Please note that a partial update changes the version of the document accordingly, so keep this in mind if you have to deal with post updates and you rely on the document version.
But a simple counter will not allow you to see any time-based analytics on the pageviews. If you need to see the trend of the views, it would be better to have one index for the posts and another for the pageviews. The mapping for the views could look like:

{
	"pageview": {
		"properties": {
			"PostID": {
				"type": "long"
			},
			"ViewedOn": {
				"format": "dateOptionalTime",
				"type": "date"
			},
			...
		}
	}
}

You can also add additional information in place of the dots (like geodata, language, IP, ...) if you want to drill down by different criteria.
For a simple counter you can just aggregate by PostID and merge/join the results at the application layer.
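As a sketch of that per-post counter, assuming the pageview mapping above and an index named pageviews (the index name and bucket size are assumptions), a terms aggregation gives you one bucket per post, and each bucket's doc_count is the view count:

```json
POST /pageviews/_search
{
  "size": 0,
  "aggs": {
    "views_per_post": {
      "terms": {
        "field": "PostID",
        "size": 100
      }
    }
  }
}
```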

About aggregations and analyzed vs. not-analyzed fields, you should find the answers in the related documentation.

Hi,

Separate indexes for post and views. Yes, a counter based approach isn't useful for our needs.

How do I join these two indexes? From the Elasticsearch documentation, ES doesn't support joins across different indexes. I want to filter the posts by some author/section criteria and then filter the views on those specific posts for a time period. The only way I can think of is making one query to get all post IDs that match the section and author (this could be close to a million depending on the section), and a second query to the pageviews index that filters by these post IDs and the time duration. Is this even possible? How slow would it be to pass in many thousands of post IDs?

Hi,
As we both noted, you should perform the join at the application layer: get the post IDs from one index, then get the pageviews from the other index and join them. The first query is filtered by author and section, the second by the post IDs (from the first query's results) and the time window.
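A sketch of the two queries, assuming the mappings above and index names posts and pageviews (the index names, field values, and the placeholder post IDs in the second request are assumptions; the bool/filter syntax shown is ES 2.x-style):

```json
POST /posts/_search
{
  "size": 10000,
  "_source": ["PostID"],
  "query": {
    "bool": {
      "filter": [
        { "term": { "Author_raw": "Author A" } },
        { "term": { "Section_raw": "Section S" } }
      ]
    }
  }
}

POST /pageviews/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "PostID": [101, 102, 103] } },
        { "range": { "ViewedOn": { "gte": "2016-01-01", "lte": "2016-03-31" } } }
      ]
    }
  }
}
```

The terms filter in the second request would be filled with the PostIDs returned by the first; with many thousands of IDs, consider fetching them with the scroll API and batching the terms filter.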

About response time: you can start with a simple cluster and then scale horizontally when needed - if you host it on premises - or go for a managed service in the cloud. In the first scenario, prepare the cluster with scaling in mind (think about data and non-data nodes, etc.).

Consider reading about shards and replicas @ basic concepts (maybe this could help you too). Btw, don't worry about performance: Elasticsearch is more than fast, especially in aggregations. If your cluster is correctly sized and you follow these tips, the main issue may be on the data-fetching side, when data is sent from the ES cluster to your client: explicitly specify the fields to be returned instead of fetching whole documents, to reduce the amount of data to be transferred.
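For instance, to return only the fields you need instead of the whole document (a sketch, assuming the post mapping and a posts index as above):

```json
POST /posts/_search
{
  "_source": ["PostID", "PostTitle"],
  "query": { "match_all": {} }
}
```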

Focus on modeling your data and simulate every kind of search you need, to be sure you get what you expect when the search involves string fields (the casing/accenting of user input can matter).

This validates the thoughts I had. Thanks!
