Hello ES users group,
I am trying to use elasticsearch for a somewhat unusual search task. We are
creating a web app that one could call "Google Books for music documents
(i.e. notes on a staff)". The idea is that the user will be able to enter a
sequence of pitches and have all instances of that pitch sequence be
highlighted in the document. Our data is originally stored in MEI (Music
Encoding Initiative) XML files, in which all aspects of the music document
and their location on the page are described. We decided to store our data
using couchdb and set up an ES river so that we can search it with
elasticsearch. I'm running into problems when trying to get the exact location
of a given pitch sequence.
In our couch, we were thinking of organizing our data in the following
manner: our base document is a page of music and every page will contain a
field called music which contains notes which have locations. The specific
data representation is not yet set in stone, as we are trying to design it so
that it is easily searchable. I am assuming that ES should be able to munge
our data in such a way that it can return the desired location attributes,
but I can't figure out how.
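
To make the layout concrete, here is a rough sketch of one page document as I
currently picture it. Every field name (`music`, `pitch`, `ulx`, `uly`, `lrx`,
`lry`) is a placeholder I made up for illustration, not a settled schema, and
the pitch_string helper is just one guess at what might be fed to an analyzer.

```python
# Hypothetical page document; all field names are placeholders.
page = {
    "_id": "page-001",
    "music": [
        # One entry per note: pitch name plus bounding box on the page.
        {"pitch": "c", "ulx": 100, "uly": 200, "lrx": 120, "lry": 230},
        {"pitch": "d", "ulx": 130, "uly": 195, "lrx": 150, "lry": 225},
        {"pitch": "e", "ulx": 160, "uly": 190, "lrx": 180, "lry": 220},
        {"pitch": "g", "ulx": 190, "uly": 180, "lrx": 210, "lry": 210},
    ],
}


def pitch_string(page):
    """Join a page's pitches into one space-separated string."""
    return " ".join(note["pitch"] for note in page["music"])


print(pitch_string(page))
```

The difficulty is that once the pitches are flattened like this, a match on
the string no longer tells me which notes (and so which coordinates) it
came from.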
At first I thought that I could write a mapping that would specify ngram
indexing of the children/arrays of the music field. I've looked into using
an ngram tokenizer, but the query still seems to return the entire
(couchdb) document, not just the matched sequence. I've also looked into the
built-in highlighting functionality, but it's not clear how to use this
feature in this case where we need to return the specific coordinates of a
sequence of notes. There must be some way to index our data such that there
is an ES internal representation for each ngram, no?
My only other idea so far is to perform the ngram indexing step in couchdb
by having a separate database for each ngram length. In such a database,
there would be a document for every possible ngram with a field "location"
specifying the coordinates of the sequence. Though I think this would work,
it feels like a poor solution: shouldn't the search engine be
responsible for the indexing?
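
For reference, this is roughly what I mean by doing the ngram step ourselves.
It is only a sketch: the note fields (`pitch`, `ulx`, `uly`, `lrx`, `lry`),
the output fields (`page`, `pitches`, `location`), and the function name are
all made up for illustration, and the location is simply the bounding box
around the notes in each window.

```python
def ngram_docs(page_id, notes, n):
    """Yield one document per run of n consecutive notes on a page."""
    for i in range(len(notes) - n + 1):
        window = notes[i:i + n]
        yield {
            "page": page_id,
            "pitches": " ".join(note["pitch"] for note in window),
            # Bounding box that covers every note in the window.
            "location": {
                "ulx": min(note["ulx"] for note in window),
                "uly": min(note["uly"] for note in window),
                "lrx": max(note["lrx"] for note in window),
                "lry": max(note["lry"] for note in window),
            },
        }


# Made-up sample notes, only to show the shape of the output.
sample_notes = [
    {"pitch": "c", "ulx": 100, "uly": 200, "lrx": 120, "lry": 230},
    {"pitch": "d", "ulx": 130, "uly": 195, "lrx": 150, "lry": 225},
    {"pitch": "e", "ulx": 160, "uly": 190, "lrx": 180, "lry": 220},
]

for doc in ngram_docs("page-001", sample_notes, 2):
    print(doc["pitches"], doc["location"])
```

Each of these would become its own couchdb document, so a hit on "pitches"
would carry its coordinates with it, at the cost of storing every window of
every length we want to support.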
Any suggestions as to how to organize our data in couchdb or to munge our
data in elasticsearch to make it possible to search for pitch sequences
would be greatly appreciated.
Thank you in advance for your help,