Feedback on Document Structure (and Inefficiencies)


(James Addison) #1

I only just noticed the mailing list post about this new forum, so apologies for posting both there and here.


I'm looking to see if I can structure my current document schema better for the functionality we're providing - or maybe it's the queries that need improvement. Everything is working fine, but I can't help but see inefficiencies. Keep in mind that this is based on relational data from a Django project.

The basic idea is we have two types of Activities: Events and Classes.

An Event is a much simplified version of a Class - it's essentially a series of dates, each with a start and end time. The dates are usually sequential and while the times are usually the same, they can differ (the last day in the Event's date sequence might end earlier, for example).

A Class is more complex:
- each Class has one or more Sessions
- each Session has one or more 'session_data' structures, each of which holds days of the week (represented by a list of numbers from 0-6), start and end dates, and start/end times
- the reason for this is that a Session might happen from May 6 to Sept 8 on Tuesdays at 8pm and Thursdays at 7:30pm, so two 'session_data' structures are needed to represent it
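To make the structure concrete, here is a minimal Python sketch of the Session from the example above. The year (2015), the weekday numbering (0 = Monday, as in Python's date.weekday()) and the helper name occurs_on are assumptions for illustration only; the end times are left empty because the example does not give them.

```python
from datetime import date, time

# Assumption: 0 = Monday ... 6 = Sunday, matching Python's date.weekday().
TUESDAY, THURSDAY = 1, 3

# Hypothetical Session running May 6 to Sept 8 (the year 2015 is assumed).
session = {
    "start": date(2015, 5, 6),
    "end": date(2015, 9, 8),
    "session_data": [
        {   # Tuesdays at 8pm
            "days_of_week": [TUESDAY],
            "start": date(2015, 5, 6),
            "end": date(2015, 9, 8),
            "start_time": time(20, 0),
            "end_time": None,  # not stated in the example
        },
        {   # Thursdays at 7:30pm
            "days_of_week": [THURSDAY],
            "start": date(2015, 5, 6),
            "end": date(2015, 9, 8),
            "start_time": time(19, 30),
            "end_time": None,  # not stated in the example
        },
    ],
}


def occurs_on(session_data, day):
    """Return True if `day` is inside the date range and on a listed weekday."""
    in_range = session_data["start"] <= day <= session_data["end"]
    return in_range and day.weekday() in session_data["days_of_week"]


tuesdays, thursdays = session["session_data"]
```

Two structures are needed because each one can carry only a single start/end time for its listed weekdays.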

Events and Classes are both indexed in Elasticsearch under the same document mapping, 'activity', and are distinguished by a 'type' field. Here is the mapping, trimmed for brevity:

{
    "properties": {
        "type": {
            "type": "long"
        },
        "name": {
            "type": "string"
        },
        "dates": {
            "type": "date",
            "format": "dateOptionalTime"
        },
        "sessions": {
            "properties": {
                "session_data": {
                    "properties": {
                        "days_of_week": {
                            "type": "long"
                        },
                        "end": {
                            "type": "date",
                            "format": "dateOptionalTime"
                        },
                        "end_time": {
                            "type": "date",
                            "format": "hour_minute"
                        },
                        "start": {
                            "type": "date",
                            "format": "dateOptionalTime"
                        },
                        "start_time": {
                            "type": "date",
                            "format": "hour_minute"
                        }
                    }
                },
                "end": {
                    "type": "date",
                    "format": "dateOptionalTime"
                },
                "start": {
                    "type": "date",
                    "format": "dateOptionalTime"
                }
            }
        }
    }
}

In our product, we never show Events that have already happened: listings are based entirely on upcoming 'dates'. The same concept holds for Classes, except that we base it on 'sessions.start'.

So 'session_data' and 'dates' are not separate documents at all; they're embedded in the 'activity' document. For Classes in particular, I'd like to pull 'session_data' out into another document type/mapping with a '_parent' of 'activity', but we don't want duplicate items in the results we return. In fact, we want to display the Class information with its 'session_data' underneath in our search results listing (see http://www.chatterblock.com/camps-and-classes/victoria-british-columbia-c4098/?classes=y for a working demo).

To get that result, I have to loop over each activity's Sessions in my Django view/template code and check dates, so that I only display relevant (i.e. upcoming, not past) Sessions. This is the particularly inefficient bit I'm talking about. Sometimes there are MANY Sessions in an activity.
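For reference, the kind of view-side filtering described above might look roughly like the following. This is only a sketch: the function name upcoming_sessions is made up, and it assumes each hit's '_source' carries 'sessions' with ISO-formatted 'start' values (as the dateOptionalTime format suggests) and that "upcoming" means 'start' is today or later.

```python
from datetime import date


def upcoming_sessions(source, today):
    """Return the Sessions of one activity whose start date is not in the past.

    `source` is assumed to be the `_source` dict of an 'activity' hit.
    """
    upcoming = []
    for session in source.get("sessions", []):
        # dateOptionalTime values may or may not include a time part;
        # the first 10 characters are the YYYY-MM-DD date either way.
        start = date.fromisoformat(session["start"][:10])
        if start >= today:
            upcoming.append(session)
    # Show the soonest Session first.
    return sorted(upcoming, key=lambda s: s["start"][:10])


# Hypothetical activity with one past and two upcoming Sessions.
activity = {
    "name": "Example Class",
    "sessions": [
        {"start": "2015-09-14", "end": "2015-12-01"},
        {"start": "2015-01-05", "end": "2015-03-30"},
        {"start": "2015-06-01T00:00:00", "end": "2015-08-31"},
    ],
}
visible = upcoming_sessions(activity, today=date(2015, 6, 1))
```

This runs once per hit on every page render, which is why activities with many Sessions become the slow case.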

I have a feeling that the new inner_hits functionality might apply here, but I think I'd end up in a 'grandparent' relationship scenario (two levels of parent relationships), as there's an existing relationship between 'activity' and another document type called 'business' (which is running the Event or Class).
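One possible way to use inner_hits without a second parent level is sketched below as a Python request body. It is an assumption, not something the current mapping supports as-is: 'sessions' would have to be remapped with "type": "nested" and reindexed, and the Elasticsearch version must support inner_hits on nested queries. The function name is hypothetical. Nested objects are stored inside the 'activity' document itself, so the existing '_parent' link to 'business' should be unaffected.

```python
def upcoming_classes_body(now="now"):
    """Build a search body matching activities with at least one upcoming Session.

    Assumes 'sessions' is mapped as a nested type (it is a plain object in
    the mapping shown above). `inner_hits` asks Elasticsearch to return the
    matching Sessions alongside each activity hit.
    """
    return {
        "query": {
            "nested": {
                "path": "sessions",
                "query": {
                    "range": {"sessions.start": {"gte": now}}
                },
                "inner_hits": {},
            }
        }
    }


body = upcoming_classes_body()
```

Each activity would still come back as a single hit, so there are no duplicates to collapse, and only the Sessions that matched the range would appear under that hit's inner_hits section.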

I really hope that this writeup wasn't too confusing! Everything 'works' at the moment, but Class listings are slower than other listings we've got, which is why I'm digging into inefficiencies to deal with this as we're growing nicely.

Thanks,
James

