Google books for music - trying to highlight pitch sequences


(Jessica Thompson) #1

Hello ES users group,

I am trying to use elasticsearch for a somewhat unusual search task. We are
creating a web app that one could call "Google books for music documents
(i.e. notes on a staff)". The idea is that the user will be able to enter a
sequence of pitches and have all instances of that pitch sequence
highlighted in the document. Our data is originally stored in MEI (Music
Encoding Initiative) XML files, which describe all aspects of the music
document and their locations on the page. We decided to store our data in
couchdb and set up an ES river so that we can search it with elasticsearch.
I'm running into problems when trying to get the exact location of a given
pitch sequence.

In our couch, we were thinking of organizing our data in the following
manner: our base document is a page of music, and every page contains a
field called music, which holds the notes together with their locations.
The specific data representation is not yet set in stone, as we are trying
to design it in such a way that it will be easily searchable. I am assuming
that ES should be able to munge our data in a way that allows it to return
the desired location attributes, but I can't figure out how.

At first I thought that I could write a mapping that would specify ngram
indexing of the children/arrays of the music field. I've looked into using
an ngram tokenizer, but the query still seems to return the entire
(couchdb) document, not just the matching sequence. I've also looked into
the built-in highlighting functionality, but it's not clear how to use it
in this case, where we need to return the specific coordinates of a
sequence of notes. There must be some way to index our data such that there
is an ES internal representation for each ngram, no?
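
To make the desired result concrete, here is a minimal sketch in plain
Python (not elasticsearch) of the lookup we want the search engine to do for
us. The note shape with "pitch", "x" and "y" keys and the function name
find_sequence are assumptions for illustration only; our real representation
is not settled.

```python
from typing import Dict, List

# Hypothetical note shape: {"pitch": "c", "x": 10, "y": 40}
Note = Dict[str, object]


def find_sequence(notes: List[Note], query: List[str]) -> List[List[Note]]:
    """Return every run of consecutive notes whose pitches equal `query`."""
    n = len(query)
    if n == 0:
        return []
    pitches = [note["pitch"] for note in notes]
    # Slide a window of length n over the page and keep the matching runs,
    # so each result still carries the coordinates of its notes.
    return [
        notes[i:i + n]
        for i in range(len(notes) - n + 1)
        if pitches[i:i + n] == query
    ]


# One made-up page of music with five notes.
page = [
    {"pitch": "c", "x": 10, "y": 40},
    {"pitch": "d", "x": 30, "y": 36},
    {"pitch": "e", "x": 50, "y": 32},
    {"pitch": "c", "x": 70, "y": 40},
    {"pitch": "d", "x": 90, "y": 36},
]

for match in find_sequence(page, ["c", "d"]):
    print([(note["x"], note["y"]) for note in match])
```

Each match is a list of notes with their coordinates, which is what we would
need in order to draw the highlight on the page.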

My only other idea so far is to perform the ngram indexing step in couchdb
by having a separate database for each ngram length. In such a database,
there would be a document for every ngram that occurs, with a field
"location" specifying the coordinates of the sequence. Though I think this
would work, it feels like a poor solution: shouldn't the search engine be
responsible for the indexing?
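
As a rough sketch of that fallback, assuming the same hypothetical note
shape ("pitch", "x", "y") and made-up field names ("page", "n", "pitches",
"location"), the documents could be generated like this before being stored
in couchdb:

```python
from typing import Dict, List

# Hypothetical note shape: {"pitch": "c", "x": 10, "y": 40}
Note = Dict[str, object]


def ngram_documents(page_id: str, notes: List[Note], max_n: int = 10) -> List[dict]:
    """Build one document per n-gram occurrence, for n = 1..max_n."""
    docs = []
    for n in range(1, max_n + 1):
        for start in range(len(notes) - n + 1):
            window = notes[start:start + n]
            docs.append({
                "page": page_id,
                "n": n,
                # The searchable key: the pitches joined into one string.
                "pitches": " ".join(str(note["pitch"]) for note in window),
                # The coordinates of every note in this occurrence.
                "location": [{"x": note["x"], "y": note["y"]} for note in window],
            })
    return docs


# A made-up page with four notes.
example_page = [
    {"pitch": "c", "x": 10, "y": 40},
    {"pitch": "d", "x": 30, "y": 36},
    {"pitch": "e", "x": 50, "y": 32},
    {"pitch": "c", "x": 70, "y": 40},
]

example_docs = ngram_documents("page-1", example_page, max_n=3)
```

A query for "c d" would then be an exact match on the "pitches" field, and
the "location" field of each hit would say where to highlight.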

Any suggestions as to how to organize our data in couchdb or to munge our
data in elasticsearch to make it possible to search for pitch sequences
would be greatly appreciated.

Thank you in advance for your help,
Jess Thompson


(Paul Loy) #2

Hi Jessica,

The MEI format looks a little too complicated for me to work out from a
quick glance what your data looks like. Could you perhaps gist an example?
Something like this (https://gist.github.com/c86ae9e406cbb0c51279) would be
ideal. You can create a gist here: https://gist.github.com/

Thanks,

Paul.

On Fri, May 13, 2011 at 8:25 PM, Jessica Thompson <
jessicathompson00@gmail.com> wrote:

--

Paul Loy
paul@keteracel.com
http://uk.linkedin.com/in/paulloy


(Jessica Thompson) #3

Hi Paul,

I've created a gist here: https://gist.github.com/973486

Hopefully that will give you a better idea of the data I'm working
with and the type of queries I would like to be able to make. As you
can see from my example queries, for a search of n pitches, the search
engine will need to consider all possible subsequences of n consecutive
pitches (n-grams) that occur on every page. Given my data in the format
presented in the gist above (or another organization at your
suggestion), is there a way to ask ES to index every n-gram (say for
0 < n < 11) WITH the corresponding location for each n-gram? Or do I
need to do this myself by having a separate document in my couch for
each possible sequence a user might search for?
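
For a sense of scale, here is a back-of-the-envelope sketch of how many
n-gram documents one page would need if I did this myself. The page size of
100 notes is a made-up figure, and ngram_count is just an illustrative
helper.

```python
def ngram_count(num_notes: int, max_n: int = 10) -> int:
    """Number of n-gram occurrences on one page for n = 1..max_n."""
    # A page with m notes contains m - n + 1 windows of length n.
    return sum(
        num_notes - n + 1
        for n in range(1, max_n + 1)
        if n <= num_notes
    )


print(ngram_count(100))  # 100 + 99 + ... + 91 = 955
```

So the number of documents grows roughly as 10 times the number of notes
per page when 0 < n < 11.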

Thanks again,
Jessica

On May 14, 4:46 pm, Paul Loy ketera...@gmail.com wrote:

