Feedback on Document Structure (and Inefficiencies)


(James Addison) #1

I'm looking to see if I can structure my current document schema better for
the functionality we're providing - or maybe it's the queries that need
improvement. Everything is working fine, but I can't help but see
inefficiencies. Keep in mind that this is based on relational data from a
Django project.

The basic idea is we have two types of Activities: Events and Classes.

An Event is a much simplified version of a Class - it's essentially a
series of dates, each with a start and end time. The dates are usually
sequential and while the times are usually the same, they can differ (the
last day in the Event's date sequence might end earlier, for example).

A Class is more complex:

  • each Class has one or more Sessions
  • each Session has one or more 'session_data' structures, each of which
    holds days of the week (represented by a list of numbers, from 0-6), start
    and end dates, and start/end times
  • the reason for this is that a Session might happen from May 6 to Sept 8
    on Tuesdays at 8pm and Thursdays at 7:30pm - two 'session_data' structures
    are needed to represent this
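
To make the structure above concrete, here is a rough sketch of what one Class document could look like as a Python dict. The field names come from the 'activity' mapping; everything else is an assumption for illustration only: the year, the end times, the name, the numeric 'type' code, and the day numbering (Monday = 0, as in Python's `date.weekday()`).

```python
# Hypothetical Class document. Field names follow the 'activity' mapping;
# all values (year, end times, name, type code) are invented for illustration.
class_activity = {
    "type": 2,  # assumed numeric code for "Class"
    "name": "Example Class",
    "sessions": [
        {
            "start": "2015-05-06",
            "end": "2015-09-08",
            "session_data": [
                {
                    # Tuesdays at 8pm (assuming Monday = 0, so Tuesday = 1)
                    "days_of_week": [1],
                    "start": "2015-05-06",
                    "end": "2015-09-08",
                    "start_time": "20:00",
                    "end_time": "21:00",  # end time is an assumption
                },
                {
                    # Thursdays at 7:30pm (Thursday = 3)
                    "days_of_week": [3],
                    "start": "2015-05-06",
                    "end": "2015-09-08",
                    "start_time": "19:30",
                    "end_time": "20:30",  # end time is an assumption
                },
            ],
        }
    ],
}

# One Session, two 'session_data' entries: one per day/time combination.
session = class_activity["sessions"][0]
start_times = [sd["start_time"] for sd in session["session_data"]]
```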

Each Event and Class is indexed in Elasticsearch with the same document
mapping 'activity' with a 'type' field. This is trimmed for brevity:

{
  "properties": {
    "type": {
      "type": "long"
    },
    "name": {
      "type": "string"
    },
    "dates": {
      "type": "date",
      "format": "dateOptionalTime"
    },
    "sessions": {
      "properties": {
        "session_data": {
          "properties": {
            "days_of_week": {
              "type": "long"
            },
            "end": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "end_time": {
              "type": "date",
              "format": "hour_minute"
            },
            "start": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "start_time": {
              "type": "date",
              "format": "hour_minute"
            }
          }
        },
        "end": {
          "type": "date",
          "format": "dateOptionalTime"
        },
        "start": {
          "type": "date",
          "format": "dateOptionalTime"
        }
      }
    }
  }
}

In our product, we never show Events happening in the past - we base it
completely on upcoming 'dates' - the same concept holds for Classes, but we
base it on 'sessions.start'.
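
As a rough illustration of that "upcoming only" rule, the request body could be built along these lines. This is a sketch rather than our actual query: the helper name `upcoming_query` is made up, and `now/d` (Elasticsearch date math for the start of the current day) is just one plausible cut-off.

```python
def upcoming_query(date_field, since="now/d"):
    """Build a request body matching documents whose `date_field`
    has at least one value on or after `since`.

    `upcoming_query` is a hypothetical helper, not part of any library.
    """
    return {"query": {"range": {date_field: {"gte": since}}}}


# Events are filtered on 'dates', Classes on 'sessions.start'.
event_query = upcoming_query("dates")
class_query = upcoming_query("sessions.start")
```

Note that a query like this only selects whole 'activity' documents. It does not say which of the embedded dates or Sessions matched.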

So 'session_data' and 'dates' are not separate documents at all, they're
embedded in the 'activity' document. For Classes in particular, I'd like to
pull out 'session_data' into another document type/mapping with a '_parent'
to 'activity', but when we're returning results in our product we don't
want to display duplicate items. In fact, we want to display the Class
information along with the 'session_data' underneath in our search results
listing
(see http://www.chatterblock.com/camps-and-classes/victoria-british-columbia-c4098/?classes=y
for a working demo).

To get that result, I have to loop over the Sessions and check their dates
in my Django view/template code to ensure I only display relevant (i.e.
upcoming, not past) Sessions. This is the particularly inefficient bit I'm
talking about. Sometimes there are MANY Sessions in an activity.
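
The looping is roughly equivalent to the sketch below. It is a simplification, not the real view code: the function name is made up, it assumes each hit's source is available as a dict shaped like the mapping, that 'start' is an ISO date string, and that "upcoming" means the Session's 'start' is today or later.

```python
from datetime import date


def upcoming_sessions(activity, today):
    """Return the Sessions of `activity` that start on or after `today`.

    Hypothetical helper; assumes 'start' is an ISO string such as
    '2015-05-06' or '2015-05-06T00:00:00' (only the date part is used).
    """
    upcoming = []
    for session in activity.get("sessions", []):
        start = date.fromisoformat(session["start"][:10])
        if start >= today:
            upcoming.append(session)
    return upcoming


# Made-up data: one past Session and one upcoming Session.
sample_activity = {
    "name": "Example Class",
    "sessions": [
        {"start": "2015-01-05", "end": "2015-03-30"},
        {"start": "2015-05-06T00:00:00", "end": "2015-09-08"},
    ],
}
visible = upcoming_sessions(sample_activity, today=date(2015, 4, 1))
```

This runs once per activity on every page of results, so the cost grows with the total number of Sessions in the returned documents, not just with the number of Sessions actually displayed.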

I have a feeling that the new inner_hits functionality might apply here,
but I think I'll get into a 'grandparent' relationship scenario (two levels
of parent relationships) as there's an existing relationship between
'activity' and another document type called 'business' (who's running the
Event or Class).
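
For reference, this is the sort of inner_hits request I have in mind. It is only a sketch, and it swaps in nested objects for the '_parent' idea described above. It assumes 'sessions' were remapped as `"type": "nested"` (which the mapping shown earlier does not do) and an Elasticsearch version that accepts `inner_hits` inside a `nested` query. The helper name is made up.

```python
def upcoming_sessions_query(since="now/d"):
    """Build a request body that matches activities having at least one
    upcoming Session and asks for the matching Sessions via inner_hits.

    Hypothetical helper; assumes 'sessions' is mapped as a nested type.
    """
    return {
        "query": {
            "nested": {
                "path": "sessions",
                "query": {"range": {"sessions.start": {"gte": since}}},
                # An empty inner_hits block requests the default behaviour.
                "inner_hits": {},
            }
        }
    }


body = upcoming_sessions_query()
```

If this works as I understand it, each hit would carry only its matching Sessions under 'inner_hits', which would replace the date checking in the view. Since nested objects live inside the 'activity' document, it would also avoid adding a second level of '_parent' below 'business'.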

I really hope that this writeup wasn't too confusing! Everything 'works' at
the moment, but Class listings are slower than other listings we've got,
which is why I'm digging into inefficiencies to deal with this as we're
growing nicely.

Thanks,
James



(system) #2