User defined dictionary in lingo3g for Elasticsearch wrt label/word/synonym


(Prashant Agrawal) #1

While browsing the Lingo3G manual I came across http://download.carrotsearch.com/lingo3g/1.9.0/manual/#chapter.lexical-resources

It states that we can customize label names by means of predefined word/label dictionaries.

I have some questions about that:

  1. Where exactly do these files have to be kept in ES (in ES/config or somewhere else)?

  2. If we use these dictionaries, does the default dictionary with POS data stop being used when clustering labels?

  3. If we use these dictionaries, are label names after clustering formed on the basis of them alone, or is some other logic involved as well?

  4. How can I check the built-in word databases that ES uses for clustering? Is word-dictionary.en.xml the built-in database file for ES? Source: http://download.carrotsearch.com/lingo3g/manual/#section.attribute.use-built-in-word-database-for-label-filtering


(Jörg Prante) #2

Have you checked http://download.carrotsearch.com/lingo3g/manual/#section.es and
https://github.com/carrot2/elasticsearch-carrot2 and
https://github.com/carrot2/elasticsearch-carrot2/tree/master/src/main/resources?

Jörg

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/User-defined-dictionary-in-lingo3g-for-Elasticsearch-wrt-label-word-synonym-tp4052442.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/1395389303866-4052442.post%40n3.nabble.com
.
For more options, visit https://groups.google.com/d/optout.


(Prashant Agrawal) #3

Hi Jörg,

While exploring further I found the answers to the first three points; I just wanted clarification on point 4:

4) How can I check the built-in word databases that ES uses for clustering? Is word-dictionary.en.xml the built-in database file for ES, and if so, where can I find it in ES after configuring ES with carrot2 and Lingo3G?


(Dawid Weiss) #4

I'm not sure I understand your 4th question... The Lingo3G manual
(pointed to by Jörg) has an explicit location where lexical resources
should be placed:

If you have any custom lexical resources then the override folder is ${es.home}/config/ by default.
So, for example, placing word-dictionary.en.xml there will override the default English word dictionary.

All the default lexical resources come with Lingo3G bundles (for
example the Lingo3G Java API bundle) and you can copy the defaults
over from there.
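
A minimal sketch of the copy step described above, assuming the default file has already been extracted from a Lingo3G bundle. The function name `install_override` and the paths in the usage note are hypothetical; only the `${es.home}/config/` target folder comes from the manual excerpt quoted above.

```python
import shutil
from pathlib import Path


def install_override(resource: Path, es_home: Path) -> Path:
    """Copy a lexical resource file into <es_home>/config and return the new path.

    `resource` is a file taken from a Lingo3G bundle (or an edited copy of it);
    `es_home` is the Elasticsearch home directory. Both are supplied by the
    caller; nothing here is part of the Lingo3G or Elasticsearch API.
    """
    config_dir = Path(es_home) / "config"
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / Path(resource).name
    shutil.copy2(resource, target)  # keep the original file name, e.g. word-dictionary.en.xml
    return target
```

For example, `install_override(Path("bundle/resources/word-dictionary.en.xml"), Path("/opt/elasticsearch"))` would place the file at `/opt/elasticsearch/config/word-dictionary.en.xml` (both paths are placeholders).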

Dawid


(Prashant Agrawal) #5

By the 4th point I mean: where exactly can I look at the default word dictionary in ES? In my setup I installed ES and carrot2 and copied in the Java API for Lingo3G, but I have not manually copied the word dictionary to my config.

Since I have not copied the default dictionary from the Java API bundle to my ES/config and documents are still clustered, on what basis does that clustering happen?
Is the default dictionary also bundled in the lingo3g jar file, so that a custom dictionary file, if we place one, overrides the default bundled in the jar?


(Dawid Weiss) #6

> As I have not copied the defaults dictionary from java API bundle to My
> ES/Config still clustering of documents happens so on what basis that
> clustering happens? Is it like that the default dictionary is also bundled with the lingo3g jar
> file so after that if we places the custom dictionary file then it will
> override the default dictionary bundled with jar files if any?

Correct, the default lexical resources are also included in lingo3g.jar.
These are fallback defaults, so if you specify your own resources,
yours will take priority.

Dawid


(Prashant Agrawal) #7

Hi Dawid,

> so if you specify your own resources these will have a priority.

Does this mean the custom resources (if specified) take priority over the defaults, or that they replace the defaults entirely?

How will clustering behave in the scenarios below?

  1. Default resources are enabled, and custom resources (with empty tags, i.e. ... for all) are present in the config dir.
    Will clustering use only the custom resources, or both?

  2. Default resources are disabled by setting the use-built-in-word-database-for-label-filtering attribute, and custom resources (with empty tags, i.e. ... for all) are present in the config dir.

  3. Default resources are disabled by setting the use-built-in-word-database-for-label-filtering attribute, and no custom resources are present in config.

I am most interested in how clustering will happen in cases 2 and 3, since the defaults are disabled and custom resources are not present either.


(Prashant Agrawal) #8

Hi Dawid,
Is there any attribute in Lingo3G that stops ES from returning label names made up of multiple comma-separated keywords?

For example, if my cluster query returns label names such as:

  1. India development , india , hello india
  2. mobile samsung , motorola g 205, micromax canvas

Is there an attribute that makes it return a single value (which may contain multiple space-separated words) rather than multiple comma-separated values? A label such as "mobile samsung" is fine, because that can be controlled by the "max-label-words" attribute.


(Dawid Weiss) #9

> Is it like the custom resources(if specified) will have priority or it will
> override the default one.

Every resource file is read once, from the first location it is found
at. From there an internal dictionary is built and used for the
algorithm.
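
The "first location it is found at" rule can be illustrated with a short sketch. This is not Lingo3G's actual loading code: `resolve_resource` is a hypothetical helper, and the defaults bundled inside the jar are modeled here as an ordinary directory placed last in the search order.

```python
from pathlib import Path
from typing import Iterable, Optional


def resolve_resource(name: str, locations: Iterable[Path]) -> Optional[Path]:
    """Return the first existing copy of `name`, searching `locations` in order.

    Later locations are never consulted once a match is found, so a file in
    an earlier location (e.g. the config override folder) shadows a file of
    the same name in a later one (e.g. the bundled defaults).
    """
    for directory in locations:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None  # no copy anywhere: the caller would run without this resource
```

With a search order such as `[es_home / "config", bundled_defaults]`, an override file in `config` is the only copy read; the two files are not merged.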

>   2. Default resources are disabled by setting
>     use-built-in-word-database-for-label-filtering attribute, Custom

This attribute adds additional lexical information (part of speech
data). If it's disabled, it won't be used.

> So I am some what more interested to know that how the clustering will
> happen in case 2 and 3 as defaults are disabled and custom are also not
> present.

It will work without any additional hints and the results may be of
poorer quality.

D.


(dawid.weiss) #10

These labels are a result of merging multiple smaller clusters
together. If you don't care about the multiple tags, trim everything
after the first comma (with a regular expression, for example).

You can also disable cluster merging by setting merge threshold to 1.
http://download.carrotsearch.com/lingo3g/manual/#section.attribute.merge-threshold
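
A minimal sketch of the trimming approach, applied on the client side to the label strings returned in the clustering response. The function name `first_label` is hypothetical, and the sketch assumes a comma never appears inside a single label.

```python
import re

# Matches optional whitespace, the first comma, and everything after it.
_AFTER_FIRST_COMMA = re.compile(r"\s*,.*$", re.DOTALL)


def first_label(label: str) -> str:
    """Keep only the part of a merged cluster label before the first comma."""
    return _AFTER_FIRST_COMMA.sub("", label).strip()


labels = [
    "India development , india , hello india",
    "mobile samsung , motorola g 205, micromax canvas",
]
trimmed = [first_label(label) for label in labels]
# trimmed == ["India development", "mobile samsung"]
```

Labels without a comma pass through unchanged apart from surrounding whitespace.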

Dawid


(system) #11