Wow thanks for your quick reply Jörg!
Tagging does not consume that much space - remember, you have an inverted
index, the frequency of words does not correlate with index growth.
Thank you for pointing that out, I didn't think of it!
It is the easiest method to classify documents, see folksonomies.
I will check folksonomies for sure, because I am not that familiar with
classification strategies.
Be careful of synonym files, they are inefficient, slow, and come with an
extra price - you have to restart the cluster each time you modify the
synonyms, and if you use synonyms in the index, you have to reindex. Maybe
you do not want that overhead.
True. But in this specific case, synonyms wouldn't vary once properly set,
as it is just (in the way I thought of it) a matter of reducing a few terms
(if present) to either "rent" or "buy" (from my example). I realized that
this is "tagging" strategy as well (sort of), except that it is just a "one
term tag" each time.
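The reduction described above could be written as index settings along the following lines. This is only a minimal sketch, not a tested configuration: the analyzer name `my_custom_analyzer` is the one used in the `_analyze` call further down in this message, while the filter name `type_synonyms` and the exact synonym list are assumptions made for illustration. It declares the synonyms inline in the settings rather than in a synonym file.

```python
import json


def build_index_settings():
    """Return index settings that map a few terms to "rent" or "buy".

    The filter name "type_synonyms" and the synonym list are hypothetical;
    adapt them to the real vocabulary.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "type_synonyms": {
                        "type": "synonym",
                        # "a, b => c" rewrites a and b to c at analysis time.
                        "synonyms": [
                            "rental, hire, hiring => rent",
                            "purchase => buy",
                        ],
                    }
                },
                "analyzer": {
                    "my_custom_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first so the synonym rules match "Rental" too.
                        "filter": ["lowercase", "type_synonyms"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the index.
    print(json.dumps(build_index_settings(), indent=2))
```

The printed JSON is what would go into the body of the index creation request. A language-specific stemmer filter could be appended to the same filter chain if stemming turns out to be needed.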
Stemming cannot help either in document classification.
I thought of stemming because of cases such as "rent a car in London" or
"car rental in London". The "keywords" here are "rent" and "rental". If I
understood your tagging suggestion correctly, I would then tag each and every
"rent"-typed document with both "rent" and "rental", am I right? The example
is trivial in English but becomes a bit more complicated in French, for
instance, where verb conjugation produces more forms that are "correct"
terms even though they are neither relevant nor necessary in the search
context. Quite a few people would, for example, type "loué", the past
participle of "louer", because the two are pronounced the same way and both
are spelled correctly. In the case of keyword analysis, there is no
stemming, right?
So the tough part of this would be to try and tag documents with all the
possible terms and forms so that ES could finally identify either a "rent"
or a "buy" document, am I still right? This seems pretty complicated to me,
actually, much more than processing the query with a language-specific
analyzer (which would use stemming, right?) and comparing it to the "type"
field. But again I may be totally wrong because I am really new to ES and
everything it brings. I have read a lot about it, the docs, etc., and quite
a few elements tend to get mixed up in my mind...
If you want to process natural language queries and examine the sentence
for the meaning and express the meaning in useful tags, you can try plugins
for POS tagging, e.g.
GitHub - richardwilly98/elasticsearch-auto-tagging
Oh thank you I will have a look at it, it will probably help me understand
properly the pros of this approach!
There are plenty of approaches in the natural language processing field,
most of them work in front of ES, not as plugins.
That is really interesting. As said, I would definitely prefer a solution
that would allow me to know, from the query, which type to query instead of
querying over them all and then filtering. For instance, imagine that I
have those two index types, "buy" with 10000 documents and "rent" with
1000000. Querying 1010000 total documents would not be a real issue if the
query was expected to return "rent" typed documents. But if the expected
set of documents was from "buy", querying that many documents for an
original pool of just 10000 would be a real overhead (well, so I think, but
I may be wrong again...)
Plus, filtering would still remain "redundant" in some way, as documents
are already properly stored either in "buy" or "rent": would be much better
to me to use only the relevant type right away in any case...
What I thought of a few minutes after I posted my original question was the
following: a 2-pass process.
- analyze the query (for instance:
/test_index/_analyze?analyzer=my_custom_analyzer&text=the+text+of+the+query)
=> this would return the parsed result, which could contain either "rent"
(from terms like "rent", "rental", "hire", "hiring",...) or "buy" (from
terms like "buy", "purchase", etc.) or none of them
- then do the real _search query on the proper index type, or on both
if no "rent" or "buy" term has been found from the first pass analysis
Does that process make sense? It is the only thing I can think of right now
that would avoid querying several index types then filtering the matches,
and avoid at the same time the use of an external process before querying
my ES index...
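The two-pass process above can be sketched as follows. This is a minimal illustration, not a tested client: it assumes the `_analyze` response is a JSON object with a `tokens` list whose entries carry a `token` string, and the helper names `tokens_from_analyze`, `types_to_query` and `search_path` are hypothetical. The HTTP calls themselves are left out so that the selection logic can be read and run on its own.

```python
from typing import Dict, Iterable, List

# The two index types from the example.
KNOWN_TYPES = ("rent", "buy")


def tokens_from_analyze(response: Dict) -> List[str]:
    """Extract the token strings from an _analyze response (assumed shape)."""
    return [entry["token"] for entry in response.get("tokens", [])]


def types_to_query(tokens: Iterable[str]) -> List[str]:
    """Return the types named by the analyzed query, or all types if none is."""
    present = set(tokens)
    matched = [t for t in KNOWN_TYPES if t in present]
    return matched if matched else list(KNOWN_TYPES)


def search_path(index: str, tokens: Iterable[str]) -> str:
    """Build the _search path for the second pass."""
    return "/{}/{}/_search".format(index, ",".join(types_to_query(tokens)))


if __name__ == "__main__":
    # First pass: pretend this came back from the _analyze call, with the
    # custom analyzer having already reduced "hire" to "rent".
    first_pass = {"tokens": [{"token": "rent"}, {"token": "flat"}, {"token": "rome"}]}
    print(search_path("test_index", tokens_from_analyze(first_pass)))
```

When neither "rent" nor "buy" comes out of the analysis, the sketch falls back to querying both types, which matches the last step of the list.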
Many thanks for your help!
Cheers
JM
On Wednesday, March 4, 2015 at 16:25:37 UTC+1, Jörg Prante wrote:
Tagging does not consume that much space - remember, you have an inverted
index, the frequency of words does not correlate with index growth.
It is the easiest method to classify documents, see folksonomies.
Be careful of synonym files, they are inefficient, slow, and come with an
extra price - you have to restart the cluster each time you modify the
synonyms, and if you use synonyms in the index, you have to reindex. Maybe
you do not want that overhead.
Stemming cannot help either in document classification.
If you want to process natural language queries and examine the sentence
for the meaning and express the meaning in useful tags, you can try plugins
for POS tagging, e.g.
GitHub - richardwilly98/elasticsearch-auto-tagging
There are plenty of approaches in the natural language processing field,
most of them work in front of ES, not as plugins.
Jörg
On Wed, Mar 4, 2015 at 4:02 PM, Jean-Marc F. <jims...@gmail.com> wrote:
Thank you Jörg !
I did think of the tag approach: it is very close to the first scheme I
described in my question, that is: querying over the two types then
filtering. It still seems to me that this is an overhead that can be avoided
(not critical with a few documents, but it might become so as both types
grow...).
I discarded the tag approach for another reason too: the need to tag each
"rent" or "buy" document with always the same words/expressions, which
would inflate the data size and would not leverage ES' intrinsic full-text
abilities (such as stemming, synonym handling, etc.). I do think that, in
that context, working on a simple field/tag ("type" or even "_type" if
feasible?) with the proper analyzer and synonym file would be more
efficient and less error prone.
But thank you again anyway for your feedback on this topic, it makes me
feel more confident as I did envisage this approach - letting me think I am
not totally lost ^^
Cheers,
JM
On Wednesday, March 4, 2015 at 12:19:17 UTC+1, Jörg Prante wrote:
My suggestion is, instead of selecting a unique type, you should tag
documents in the index with a given vocabulary, and at query time, you
could match certain phrases in the query text with that vocabulary in order
to build a filter clause.
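A minimal sketch of that idea, under stated assumptions: documents carry a `tags` field filled from a small controlled vocabulary, the searchable text lives in a field called `description`, and the vocabulary below is made up for illustration. All of these names are hypothetical. The sketch matches single words only, not multi-word phrases, and it uses the `filtered` query of the ES 1.x releases current at the time of this thread; later releases express the same thing with a `bool` query and a `filter` clause.

```python
from typing import Dict, List

# Hypothetical vocabulary: query word -> tag stored on the documents.
VOCABULARY = {
    "rent": "rent",
    "rental": "rent",
    "hire": "rent",
    "hiring": "rent",
    "buy": "buy",
    "purchase": "buy",
}


def tags_for(query_text: str) -> List[str]:
    """Return the tags whose vocabulary words occur in the query text."""
    tags: List[str] = []
    for word in query_text.lower().split():
        tag = VOCABULARY.get(word)
        if tag is not None and tag not in tags:
            tags.append(tag)
    return tags


def build_query(query_text: str) -> Dict:
    """Build a search body, adding a tag filter when the query names a tag."""
    match = {"match": {"description": query_text}}
    tags = tags_for(query_text)
    if not tags:
        # No vocabulary word found: plain full-text query over everything.
        return {"query": match}
    return {
        "query": {
            "filtered": {
                "query": match,
                "filter": {"terms": {"tags": tags}},
            }
        }
    }


if __name__ == "__main__":
    print(build_query("hire a flat in Rome"))
```

Because the filter is built from the query text alone, the documents only need one short tag each, not every spelling variant of it.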
Jörg
On Wed, Mar 4, 2015 at 11:10 AM, Jean-Marc F. jims...@gmail.com wrote:
Now that I have written my question: would this be a two-pass job? First
pass: send an "analyze" query to get the proper term, "rent" or "buy" (or
both if none is found), then second pass: query the proper type?
On Wednesday, March 4, 2015 at 11:07:43 UTC+1, Jean-Marc F. wrote:
Hi everyone,
I am pretty new to ES and need some advice for the following use case:
I have a unique input field for user search (Google like). In my test
index, I have two different types, let's call them "rent" and "buy". What I
would like to achieve is to leverage ES's powerful full-text features to
determine which index type to query depending on the query (part of it).
For instance, for a query such as "rent a motorcycle in Paris" or
"hire a flat in Rome" => is there a way to have ES "know" it should look
into the "rent" type?
I thought of a first possibility: query both types (/rent,buy/_search)
then filter on a (quite redundant) "type" field created each time a
document is indexed, with the proper analyzers/synonyms applied to this
"type" field so that it always reduces to "rent" or "buy" (or, more
directly, filter on the "_type" field, but I don't think you can apply
analysis to it, can you?).
The "con" of this approach is that I have to query both the rent and
the buy types, then filter to narrow the results to the expected type of
documents. The "pro" is that it should not be complicated to get it working
properly.
Now, I am wondering if it would be possible to have ES "figure out"
what index to query right after analysis? In a process like: query =>
analysis => "rent" or "buy" term identified => perform on the right index
type.
The pros would be that you obviously query one index type thus don't
need to filter afterwards: smaller data set + no filtering, should be
lighter/faster.
The cons: I do not think that ES can do it.
Another scenario would be to handle a first, app specific analysis
step before querying ES just to determine "rent" or "buy". With this
example it would not be that tough (two types, a few synonyms/a bit of
stemming to take into account, etc.), but with a more complex setup it
would become a real nightmare - not to mention the fact that not using ES's
abilities would be quite a pity, actually...
I would really appreciate your thoughts on this, everyone.
Thanks
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/f5de1e5b-c2e6-4cd2-9019-8e520979b6a2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.