Best way to proceed


(Ryan Stuart) #1

Hi Everyone,

Advance warning: this post is quite long and contains a lot of
questions. Apologies in advance.

I'm currently working with a university, helping them implement a test
suite to further refine some research they have been conducting. Their
research is based around dynamic schema searching. After spending some time
evaluating the various open-source search solutions, I settled on
elasticsearch as the base platform, and I am wondering what the best way to
proceed would be. I have spent about a week looking into the elasticsearch
documentation and the code itself, and also reading the Lucene
documentation, but I am struggling to see a clear way forward. (On a side
note, I was getting frustrated by the lack of documentation in the
elasticsearch code. I did a quick grep to find how many classes in the
codebase have an empty class-level documentation placeholder; the result
was 1378 classes. Is there any work going on to rectify this?)

The goal of the project is to provide the researchers with a piece of
software they can use to plug in revisions of the searching algorithm to
test and refine. They would like to be able to write the pluggable
algorithm in JVM languages other than Java, like Groovy, Python or
Clojure, but that isn't a hard requirement. Part of that will be to
provide them with a front end to run queries and see output, and an admin
interface to add documents to an index. I am comfortable with all of that
thanks to the very powerful and complete REST API. What I am not so sure
about is how to proceed with implementing the pluggable search algorithm.

The researchers' algorithm requires 4 inputs to function:

  1. The query term(s).
  2. A Word (term) x Document matrix across an index.
  3. A Document x Word (term) matrix across an index.
  4. A Word (term) frequency list across an index, i.e. how many times
     each word appears across the entire index.
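To make the shapes of inputs 2-4 concrete, here is a plain-Java sketch over a toy in-memory corpus. The class and method names are mine, purely for illustration (they are not Elasticsearch or Lucene API); a real implementation would build these structures from Lucene's term dictionary and postings lists, but the data shapes would be the same:

```java
import java.util.*;

// Sketch of the three index-derived inputs over a toy corpus.
// In practice these would come from Lucene's term dictionary and
// postings lists; only the data shapes are shown here.
public class MatrixSketch {

    // Word x Document matrix: term -> (docId -> frequency in that doc)
    static Map<String, Map<Integer, Integer>> termDocMatrix(List<String> docs) {
        Map<String, Map<Integer, Integer>> m = new TreeMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String t : docs.get(d).split("\\s+")) {
                m.computeIfAbsent(t, k -> new TreeMap<>()).merge(d, 1, Integer::sum);
            }
        }
        return m;
    }

    // Term frequency list: term -> total occurrences across the whole index
    static Map<String, Integer> termFrequencies(List<String> docs) {
        Map<String, Integer> f = new TreeMap<>();
        for (String doc : docs)
            for (String t : doc.split("\\s+"))
                f.merge(t, 1, Integer::sum);
        return f;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the cat sat", "the dog sat", "the cat ran");
        System.out.println(termFrequencies(docs));          // {cat=2, dog=1, ran=1, sat=2, the=3}
        System.out.println(termDocMatrix(docs).get("cat")); // {0=1, 2=1}
        // The Document x Word matrix is just the transpose of termDocMatrix.
    }
}
```

The Document x Word matrix (input 3) is the transpose of input 2, so only one of the two really needs to be materialised.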

For their purposes, a document doesn't correspond to an actual real-world
document (they actually call them text events). Rather, for now, it
corresponds to one sentence (having that configurable might also be
useful). I figure the best way to handle this is to break documents down
into their sentences (using Apache Tika or something similar), putting each
sentence in as its own document in the index. I am confident I can do this
in the admin UI I provide, using the mapper-attachments plugin as a
starting point. The downside is that breaking up the document before giving
it to elasticsearch isn't a very configurable way of doing it. If they want
to change the resolution of their algorithm, they would need to re-add all
documents to the index again. If the index stored the full documents as-is,
and the search algorithm could choose what resolution to work at per query,
that would be perfect. I'm not sure whether that is possible, though.
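For plain text, the sentence splitting itself doesn't strictly need Tika (Tika would still be useful for extracting text from binary formats first); the JDK's BreakIterator can do the split. A minimal sketch, with my own helper name, just to illustrate the pre-indexing step:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Minimal sketch: split a document into sentences before indexing,
// so each sentence can be indexed as its own "text event" document.
public class SentenceSplitter {
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "The cat sat. The dog barked! Did anyone notice?";
        for (String s : sentences(doc)) {
            System.out.println(s); // each of these would become one indexed document
        }
    }
}
```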

The next problem is how to get the three index-derived inputs they require
and pass them into their pluggable search algorithm. I'm really struggling
with where to start on this one. It seems from looking at Lucene that I
need to provide my own search/query implementation, but I'm not sure if
this is right or not. There also don't seem to be any search plugins listed
on the elasticsearch site, so I'm not even sure if it is possible. The
important thing here is that the algorithm needs to operate at the index
level, with the query terms available, to generate its schema before using
that schema to score each document in the index. From what I can tell, this
means that the scripting interface provided by elasticsearch won't be of
any use: the description of the scripting interface in the elasticsearch
guide makes it sound like a script operates at the document level and not
the index level. Other concerns/considerations are the ability to program
this algorithm in a range of languages (just like the scripting interface)
and the ability to augment what is returned by the REST API for a search to
include the schema the algorithm generated (which I assume means I will
need to define my own REST endpoint(s)).

Can anybody give me some advice on where to get started here? It seems like
I am going to have to write my own search plugin that can accept scripts as
its core algorithm. The plugin will be responsible for organising the 4
inputs that I outlined earlier before passing control to the script. It
will also be responsible for getting the output from the script and
returning it via its own REST API. Does this seem logical? If so, how do I
get started? What parts of the code do I need to look at?

If you have managed to read down this far then much gratitude to you. If
you can help me at all I'd really appreciate it.

Cheers

--


(Ryan Stuart) #2

I was hoping someone might have a few words of encouragement about where to
start. This is an unashamed bump. Can anyone help at all?

Cheers

On Wednesday, 26 September 2012 15:08:15 UTC+10, Ryan Stuart wrote:



(Clinton Gormley) #3

Hi Ryan

On Mon, 2012-10-01 at 05:55 -0700, Ryan Stuart wrote:

I was hoping someone might have a few words of encouragement about
where to start. This is an unashamed bump. Can anyone help at all?

I did look at your email, but unfortunately it's an area I know nothing
about. You'll need somebody with Lucene experience to help.

good luck

clint


--


(Ryan Stuart-2) #4

Hi Clinton,

On Mon, Oct 1, 2012 at 11:59 PM, Clinton Gormley clint@traveljury.com wrote:

I did look at your email, but unfortunately its an area I know nothing
about. You'll need somebody with Lucene experience to help

Thanks for taking the time to reply. The need for someone with Lucene
experience makes perfect sense. Is there a dev-specific mailing list, as
opposed to an administration/operations list? The posts here seem to be
quite operations-specific.

Cheers


--
Ryan Stuart, B.Eng
Software Engineer

--


(Ivan Brusic) #5

Ryan,

There are two Lucene mailing lists: one for users and another for
those developing the Lucene software. You would want the former:
http://lucene.apache.org/core/discussion.html

Your question is very specific/unique, and IMHO the ElasticSearch
mailing list excels at providing solutions to the standard problems.
Once you get off the beaten path, it is hard to suggest solutions. :)
The Lucene mailing list is incredibly good (they really know the
internals), and if there is a Lucene solution to your problem, perhaps
the ES mailing list can help translate it into ES.

Good luck,

Ivan

On Mon, Oct 1, 2012 at 2:41 PM, Ryan Stuart ryan@stuart.id.au wrote:

--


(Artem Grinblat) #6

The downside is that breaking up the document before giving it to
elasticsearch isn't a very configurable way of doing it. If they want to
change the resolution to their algorithm, they would need to re-add all
documents to the index again. If the index stored that full documents as is
and the search algorithm could chose what resolution to work at per query
then that would be perfect.

One way to do that is, from a custom tokenizer, to emit multiple versions
of the content with different prefixes.
E.g. for the text

“My good monsieur,” said the poor man, falling on his knees, “pray forgive
me; it is the first time and I swear that it shall be the last.”
“These rascals always say the same thing!”

the tokenizer would emit terms:
rP_“My good monsieur,” said the poor man, falling on his knees, “pray
forgive me; it is the first time and I swear that it shall be the last.”
rP_“These rascals always say the same thing!”
rW_My
rW_good
rW_monsieur
rW_said
rW_the
rW_poor
rW_man

Then at search time you choose the appropriate tokenizer/filter so that
only the terms with the matching prefix are used.
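As a plain-function sketch of the idea (a hypothetical illustration; a real implementation would be a custom Lucene Tokenizer emitting these terms into a single field):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of multi-resolution indexing: emit every sentence with an "rP_"
// prefix and every word with an "rW_" prefix, all into the same field.
// At query time, the query analyzer adds the prefix matching the
// resolution you want, so only terms at that resolution can match.
public class MultiResolutionTerms {
    static List<String> emit(String text) {
        List<String> terms = new ArrayList<>();
        // naive sentence split on terminal punctuation followed by whitespace
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            terms.add("rP_" + sentence);                      // sentence-resolution term
            for (String word : sentence.split("[^\\p{L}]+")) {
                if (!word.isEmpty()) terms.add("rW_" + word); // word-resolution term
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String t : emit("The first time. The last!")) System.out.println(t);
    }
}
```

Other resolutions (paragraph, whole document) would just be further prefixes emitted by the same tokenizer, at the cost of a larger index.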

Just my two cents.

--

