Text Categorization in ES


(Prashant Agrawal) #1

Hi,

I would like to write queries for text categorization in Elasticsearch.
Is there an existing API for this? If not, how should I proceed?

Any help is appreciated.


(Hannes Korte) #2

On 21.02.2014 10:50, prashant.agrawal wrote:

I would like to write queries for text categorization in Elasticsearch.
Is there an existing API for this? If not, how should I proceed?

Hi,

you could do something like kNN:

Simply perform an MLT query and count the categories of the top-N docs.
Additionally, you could weight the categories by score.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
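
The kNN idea above could be sketched roughly as follows. This is only an illustration under assumptions: the `text` and `category` field names and both helper functions are hypothetical, and the MLT parameter name depends on the Elasticsearch version (`like_text` in the 1.x releases, `like` in later ones). The search call itself is left out; `hits` stands for the `hits.hits` array of a search response.

```python
from collections import defaultdict


def build_mlt_query(text, fields, top_n=10):
    """Build a More Like This request body that returns the top-N neighbours."""
    return {
        "size": top_n,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": text,  # assumption: "like_text" on 1.x-era clusters
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
    }


def classify_from_hits(hits, label_field="category", weighted=True):
    """Vote over the labels of the returned documents.

    With weighted=True each hit contributes its _score,
    otherwise every hit counts as one vote.
    """
    votes = defaultdict(float)
    for hit in hits:
        label = hit.get("_source", {}).get(label_field)
        if label is None:
            continue  # skip neighbours that carry no label
        votes[label] += hit.get("_score", 0.0) if weighted else 1.0
    if not votes:
        return None, {}
    best = max(votes, key=votes.get)
    return best, dict(votes)
```

Passing `weighted=False` gives plain majority voting over the top-N documents, while the default weights each category by the score of the documents that carry it.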

I don't think there is anything directly usable via the API to classify
documents. Maybe there is some neat trick using aggregations in ES 1.0.

Hannes

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5307A8B8.6080801%40hkorte.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Prashant Agrawal) #3

Hi Hannes,

Thanks for the info. I also came across Lingo3G/Carrot Search.
Could that also be a solution for this?


(Jörg Prante) #4

Install the carrot2 plugin and see if it fits your requirements:
http://download.carrotsearch.com/lingo3g/manual/#section.es

Jörg



(Dawid Weiss) #5

If you want classification, then Carrot2/Lingo3G won't be of much use.
In short, classification assigns an unlabeled example to one of a pool
of (previously known or computed) labels, whereas Lingo3G and Carrot2 are
for clustering (finding "labels" in an otherwise untagged set of documents
or search results).


I would agree with Hannes that the simplest way to "classify"
documents with an inverted index would be to use a kNN-like algorithm.

Dawid



(Prashant Agrawal) #6

Hi All,

To be specific, I want a query like this:
Searching for "Laptop" should automatically return results for "Dell, Sony, HP, Lenovo, Samsung..." as well. Since Lingo3G is used for clustering the documents, it should store references to the above terms as well.

For that, I have installed Carrot2 and Lingo3G on top of ES.

So what should my Lingo3G query look like to search for the specified items? Or is there anything else I have to do to make it work?


(Dawid Weiss) #7

Searching for "Laptop" should automatically return results for "Dell, Sony, HP,
Lenovo, Samsung..." as well. Since Lingo3G is used for clustering the documents,
it should store references to the above terms as well.

There is no way to get a clear, intuitive classification like this
from an unsupervised clustering algorithm. You rely on prior knowledge
(that these are companies, that they produce laptops, etc.).

I would use faceting and pre-tag your documents with all the labels
you may wish to display in your user interface. This will be more
reliable and faster. You can then add clustering on top of that as a
form of "dynamic faceting", which users can use to look up keywords or
key phrases for groups of search results not covered by the regular facets.
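
To make the pre-tagging suggestion concrete, here is a small sketch. It assumes each document was indexed with a `brand` field holding the label as a single untokenized value, and it uses a `terms` aggregation (available from ES 1.0) as the counterpart of a terms facet. The field names, the `by_label` aggregation name, and both helpers are hypothetical.

```python
def build_facet_search(query_text, text_field="description",
                       facet_field="brand", size=10):
    """Build a search body that also asks for per-label document counts."""
    return {
        "query": {"match": {text_field: query_text}},
        "aggs": {
            "by_label": {"terms": {"field": facet_field, "size": size}}
        },
    }


def facet_counts(response, agg_name="by_label"):
    """Turn the aggregation part of a search response into (label, count) pairs."""
    buckets = (
        response.get("aggregations", {})
        .get(agg_name, {})
        .get("buckets", [])
    )
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]
```

The counts come from labels stored with the documents, so they stay the same from one search to the next, unlike cluster labels computed per result set.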

So what should my Lingo3G query look like to search for the specified items?

The plugin contains the required documentation. Like I said though,
the results will be disappointing if you expect a perfect ontology from
raw text.

Dawid



(Prashant Agrawal) #8

So this means that all classification has to be done beforehand, based on a user-defined scenario.

And this is not supported automatically by either Carrot2 or Lingo3G, in the way that features such as the word-delimiter or hunspell filters are.

So what can we achieve by using Lingo3G?


(Dawid Weiss) #9

So this means that all classification has to be done beforehand, based on a
user-defined scenario.

For proper faceting yes -- this information would either come with
each document or would be extracted statically (when indexing each
document). I'm sure OpenNLP and other text mining projects have named
entity recognition that would be of help here. You may want to check
out Grant's book on the subject.
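
Static extraction at indexing time might look like the sketch below. A tiny hand-made brand list stands in for a real named entity recognizer such as OpenNLP; the list, the field names, and `tag_document` are all hypothetical and only illustrate where the tagging step would sit.

```python
import re

# Assumption: a hand-maintained lookup table replaces a trained NER model.
KNOWN_BRANDS = {
    "dell": "Dell",
    "sony": "Sony",
    "hp": "HP",
    "lenovo": "Lenovo",
    "samsung": "Samsung",
}


def tag_document(doc, text_field="description", tag_field="brand"):
    """Return a copy of doc with the brands found in its text added as a field."""
    tokens = re.findall(r"\w+", doc.get(text_field, "").lower())
    found = sorted({KNOWN_BRANDS[token] for token in tokens if token in KNOWN_BRANDS})
    tagged = dict(doc)  # leave the caller's document untouched
    tagged[tag_field] = found
    return tagged
```

The tagged copy is what would be sent to the index, so the labels are available later for faceting without any work at query time.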

And this is not supported automatically by either Carrot2 or Lingo3G, in the
way that features such as the word-delimiter or hunspell filters are.

Feel free to try Carrot2 (and Lingo3G) on your data. Cluster labels
are sort of dynamic facet labels, but they are not as "ideal" as
statically indexed facets. Also, they are context-dependent (they will
be created from scratch for each search result). They are essentially
a different tool for a different task (to get a fast glimpse into a
larger window of search results, for which static facets are not
indexed).

Dawid



(Hannes Korte) #10

On 26.02.2014 08:28, prashy wrote:

To be specific, I want a query like this:
Searching for "Laptop" should automatically return results for "Dell, Sony, HP,
Lenovo, Samsung..." as well.

I'm not sure I got that correctly. Besides the text classification we
talked about, this sentence could also mean that you want to expand your
query. So instead of searching only for the term "Laptop" you want the
query to be expanded automatically by adding highly correlated words
like "Dell", "Sony", "HP", etc. to get a broader search result. Is it
like that?
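
If query expansion is what is meant, one simple way to picture it is a hand-maintained expansion table turned into a `bool` query with `should` clauses. The table, the `title` field, and the boost value below are assumptions for illustration only; the correlated terms could just as well be mined from the data.

```python
# Assumption: related terms are listed by hand for each query term.
EXPANSIONS = {
    "laptop": ["Dell", "Sony", "HP", "Lenovo", "Samsung"],
}


def expand_query(term, field="title", original_boost=2.0):
    """Build a bool query matching the term or any of its related terms."""
    # The original term gets a boost so exact matches still rank first.
    clauses = [{"match": {field: {"query": term, "boost": original_boost}}}]
    for related in EXPANSIONS.get(term.lower(), []):
        clauses.append({"match": {field: related}})
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}
```

A term without an entry in the table falls back to a single `match` clause, so the search behaves as it would without expansion.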

Hannes


