Hi Everyone,
Advance warning: this post is quite long and contains a lot of
questions. Apologies in advance.
I'm currently working with a university, helping them implement a test
suite to further refine some research they have been conducting. Their
research is based around dynamic schema searching. After spending some time
evaluating the various open source search solutions, I settled on
elasticsearch as the base platform, and I am wondering what the best way to
proceed would be. I have spent about a week looking into the elasticsearch
documentation and the code itself, and also reading the documentation of
Lucene, but I am struggling to see a clear way forward. (On a side note, I
was getting frustrated by the lack of documentation in the elasticsearch
code. I did a quick grep to find out how many classes in the codebase have
an empty class-level documentation placeholder. The result was 1378
classes. Is there any work going on to rectify this?)
The goal of the project is to provide the researchers with a piece of
software they can use to plug in revisions of the searching algorithm to
test and refine. They would like to be able to write the pluggable
algorithm in languages other than Java that are supported by the JVM, like
Groovy, Python or Clojure, but that isn't a hard requirement. Part of that
will be to provide them with a front end to run queries and see output, and
an admin interface to add documents to an index. I am comfortable with all
of that thanks to the very powerful and complete REST API. What I am not so
sure about is how to proceed with implementing the pluggable search
algorithm.
The researchers' algorithm requires 4 inputs to function:
1. The query term(s).
2. A Word (term) x Document matrix across an index.
3. A Document x Word (term) matrix across an index.
4. A Word (term) frequency list across an index, i.e. how many times
each word appears across the entire index.
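For what it's worth, inputs 2-4 look like they can all be read straight off
Lucene's inverted index, without any custom storage. Here is a minimal
sketch of that, assuming a Lucene 5.x-style API and a single indexed field
I'm hypothetically calling "body" (the enum method signatures shift a
little between Lucene versions):

    import java.nio.file.Paths;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;

    public class MatrixDump {
        public static void main(String[] args) throws Exception {
            try (IndexReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get(args[0])))) {
                // Merged view of the postings across all segments.
                Terms terms = MultiFields.getTerms(reader, "body");
                TermsEnum te = terms.iterator();
                BytesRef term;
                while ((term = te.next()) != null) {
                    // Input 4: occurrences of this term across the entire index.
                    long corpusFreq = te.totalTermFreq();
                    PostingsEnum postings = te.postings(null, PostingsEnum.FREQS);
                    int docId;
                    while ((docId = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                        // One cell of the term x document matrix (inputs 2 and 3
                        // are transposes of each other, so one walk fills both).
                        System.out.printf("%s\t%d\t%d\t%d%n",
                            term.utf8ToString(), docId, postings.freq(), corpusFreq);
                    }
                }
            }
        }
    }

For anything beyond toy indices, the researchers would presumably want the
two matrices built sparse in memory from that same single walk.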
For their purposes, a document doesn't correspond to an actual real-world
document (they actually call them text events). Rather, for now, it
corresponds to one sentence (having that configurable might also be
useful). I figure the best way to handle this is to break documents down
into their sentences (using Apache Tika or something similar), putting each
sentence in as its own document in the index. I am confident I can do this
in the Admin UI I provide, using the mapper-attachments plugin as a
starting point. The downside is that breaking up the document before giving
it to elasticsearch isn't a very configurable way of doing it. If they want
to change the resolution of their algorithm, they would need to re-add all
the documents to the index. If the index stored the full documents as-is,
and the search algorithm could choose what resolution to work at per query,
that would be perfect. I'm not sure whether that is possible, though.
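On the splitting itself: before reaching for Tika (which is really about
extracting text from file formats rather than detecting sentences), the
JDK's BreakIterator may be all that's needed. A sketch, with the hard-coded
locale as an assumption:

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    public class SentenceSplitter {
        // Split a document's plain text into sentences; each sentence would
        // then be indexed as its own elasticsearch document ("text event").
        public static List<String> sentences(String text) {
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
            it.setText(text);
            List<String> out = new ArrayList<>();
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE;
                    start = end, end = it.next()) {
                String sentence = text.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    out.add(sentence);
                }
            }
            return out;
        }
    }

Tika would still earn its keep upstream, turning PDFs and the like into the
plain text this consumes.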
The next problem is how to get the three index-derived inputs and pass
them, together with the query terms, into the pluggable search algorithm.
I'm really struggling with where to start on this one. From looking at
Lucene, it seems I need to provide my own search/query implementation, but
I'm not sure if that is right. There also don't seem to be any search
plugins listed on the elasticsearch site, so I'm not even sure it is
possible. The important thing here is that the algorithm needs to operate
at the index level, with the query terms available, to generate its schema
before using that schema to score each document in the index. From what I
can tell, this means that the scripting interface provided by elasticsearch
won't be of any use: its description in the elasticsearch guide makes it
sound like a script operates at the document level, not the index level.
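To make the document-level point concrete, this is the shape the current
scripting hooks take, e.g. a custom_score query (the index and field names
here are made up). The script runs once per candidate document and can only
see that document's values, so there is nowhere to do a whole-of-index
first pass:

    curl -XPOST 'localhost:9200/sentences/_search' -d '{
      "query": {
        "custom_score": {
          "query": { "match_all": {} },
          "script": "_score * doc[\"popularity\"].value"
        }
      }
    }'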
Other concerns/considerations are the ability to program this algorithm in
a range of languages (just like the scripting interface) and the ability to
augment what is returned by the REST API for a search to include the schema
the algorithm generated (which I assume means I will need to define my own
REST endpoint(s)).
Can anybody give me some advice on where to get started here? It seems like
I am going to have to write my own search plugin that can accept scripts as
its core algorithm. The plugin will be responsible for organising the 4
inputs that I outlined earlier before passing control to the script. It
will also be responsible for getting the output from the script and
returning it via its own REST API. Does this seem logical? If so, how do I
get started? What parts of the code do I need to look at?
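On the "accept scripts as its core algorithm" piece, one approach that
sidesteps the language question entirely is the JDK's own JSR-223
ScriptEngine machinery, which covers Groovy, Jython and Clojure via their
engine jars. A rough sketch of the hand-off, with every name hypothetical:

    import java.io.FileReader;
    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;

    public class AlgorithmRunner {
        // Bind the four inputs, run the researcher's script, and hand back
        // whatever it returns (ranking plus generated schema) to the REST layer.
        public static Object run(String scriptPath, Object queryTerms,
                Object termDocMatrix, Object docTermMatrix, Object termFreqs)
                throws Exception {
            // "groovy" resolves only if the Groovy JSR-223 jar is on the classpath.
            ScriptEngine engine = new ScriptEngineManager().getEngineByName("groovy");
            engine.put("queryTerms", queryTerms);       // input 1
            engine.put("termDocMatrix", termDocMatrix); // input 2
            engine.put("docTermMatrix", docTermMatrix); // input 3
            engine.put("termFreqs", termFreqs);         // input 4
            return engine.eval(new FileReader(scriptPath));
        }
    }

The elasticsearch plugin would then just be glue: build the inputs from the
Lucene reader, call this, and serialize the result through a custom REST
endpoint.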
If you have managed to read down this far, then much gratitude to you. If
you can help me at all, I'd really appreciate it.
I was hoping someone might have a few words of encouragement about where to
start. This is an unashamed bump. Can anyone help at all?
Cheers
On Mon, 2012-10-01 at 05:55 -0700, Ryan Stuart wrote:
I was hoping someone might have a few words of encouragement about
where to start. This is an unashamed bump. Can anyone help at all?
I did look at your email, but unfortunately it's an area I know nothing
about. You'll need somebody with Lucene experience to help.
Good luck,
clint
Thanks for taking the time to reply. The need for someone with Lucene
experience makes perfect sense. Is there a dev-specific mailing list, as
opposed to an administration/operations list? The posts here seem to be
quite operations-specific.
Cheers
There are two Lucene mailing lists: one for users and another for
those developing the Lucene software. You would want the former.
Your question is very specific/unique, and IMHO the Elasticsearch
mailing list excels at providing solutions to the standard problems.
Once you get off the beaten path, it is hard to suggest solutions.
The Lucene mailing list is incredibly good (they really know the
internals), and if there is a Lucene solution to your problem, perhaps
the ES mailing list can help translate it into ES.
Good luck,
Ivan
The downside is that breaking up the document before giving it to
elasticsearch isn't a very configurable way of doing it. If they want to
change the resolution of their algorithm, they would need to re-add all
the documents to the index. If the index stored the full documents as-is,
and the search algorithm could choose what resolution to work at per query,
that would be perfect.
One way to do that is to have a custom tokenizer emit multiple versions of
the terms, each with a different prefix.
E.g. for the text
“My good monsieur,” said the poor man, falling on his knees, “pray forgive
me; it is the first time and I swear that it shall be the last.”
“These rascals always say the same thing!”
the tokenizer would emit terms:
rP_“My good monsieur,” said the poor man, falling on his knees, “pray
forgive me; it is the first time and I swear that it shall be the last.”
rP_“These rascals always say the same thing!”
rW_My
rW_good
rW_monsieur
rW_said
rW_the
rW_poor
rW_man
Then at search time you choose the matching tokenizer/filter so that only
the terms with the correct prefix are used.
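The prefixing itself is only a few lines as a Lucene TokenFilter; a sketch
(the class name and prefixes are just illustrative), where you would
register one analyzer per resolution and pick the matching one per query:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class ResolutionPrefixFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final String prefix; // e.g. "rW_" for word-level resolution

        public ResolutionPrefixFilter(TokenStream input, String prefix) {
            super(input);
            this.prefix = prefix;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // Prepend the resolution marker to the term in place.
            String term = termAtt.toString();
            termAtt.setEmpty().append(prefix).append(term);
            return true;
        }
    }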