While exploring further I found the answers to the first three points; I just want clarification on point 4:
<<4) How can I check the built-in word databases that ES uses for clustering? Is word-dictionary.en.xml the built-in database file for ES? If yes, where can I find it in ES after configuring ES with carrot2 and Lingo3G?
I'm not sure I understand your 4th question... The Lingo3G manual
(pointed to by Jörg) has an explicit location where lexical resources
should be placed:
If you have any custom lexical resources then the override folder is ${es.home}/config/ by default.
So, for example, placing word-dictionary.en.xml there will override the default English word dictionary.
All the default lexical resources come with Lingo3G bundles (for
example the Lingo3G Java API bundle) and you can copy the defaults
over from there.
By point 4 I meant: where exactly can I look at the default word dictionary in ES? My setup is ES + carrot2, with the Lingo3G Java API copied in, but I have not manually copied the word dictionary to my config.
I have not copied the default dictionary from the Java API bundle to my ES config, yet documents are still clustered. On what basis does that clustering happen?
Is the default dictionary also bundled with the lingo3g jar file, so that a custom dictionary file placed afterwards overrides the default bundled in the jar, if any?
The default lexical resources are also included in the lingo3g.jar,
correct. These are fallback defaults, so if you specify your own
resources, those will take priority.
so if you specify your own resources these will have a priority.
Does that mean the custom resources (if specified) merely take priority, or do they override the defaults completely?
How will clustering happen in the scenarios below?
1) Default resources are enabled; custom resources (having empty tags, i.e. ... for all) are present in the config dir. Will clustering use only the custom resources, or both?
2) Default resources are disabled by setting the use-built-in-word-database-for-label-filtering attribute; custom resources (having empty tags, i.e. ... for all) are present in the config dir.
3) Default resources are disabled by setting the use-built-in-word-database-for-label-filtering attribute; no custom resources are present in the config dir.
I am most interested in how clustering will happen in cases 2 and 3, where the defaults are disabled and custom resources are not present either.
Hi Dawid,
Is there any attribute in Lingo3G that suppresses the extra comma-separated keywords in the label names returned by ES?
For example, my cluster query returns label names such as:
India development , india , hello india
mobile samsung , motorola g 205, micromax canvas
Is there an attribute that makes it return a single value (which may contain several words separated by spaces) instead of multiple comma-separated values? A label such as "mobile samsung" is fine, because that can be controlled by the "max-label-words" attribute.
Does that mean the custom resources (if specified) merely take priority, or
do they override the defaults completely?
Every resource file is read once, from the first location it is found
at. From there an internal dictionary is built and used for the
algorithm.
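As a rough illustration of the "first location it is found at" rule described above, here is a minimal Python sketch. It is not Lingo3G's actual loading code: the function resolve_resource, the location labels, the placeholder file contents and the file name other-resource.xml are all hypothetical and only model the lookup order (override folder first, defaults bundled in the jar as a fallback).

```python
from typing import Mapping, Optional, Sequence, Tuple

# A location is modelled as (label, {file name: contents}).
# Purely illustrative; real resources live on disk or inside the jar.
Location = Tuple[str, Mapping[str, str]]


def resolve_resource(name: str, locations: Sequence[Location]) -> Optional[Tuple[str, str]]:
    """Return (location label, contents) from the first location that holds `name`."""
    for label, files in locations:
        if name in files:
            # First match wins; later locations are never consulted for this file.
            return label, files[name]
    return None


# Hypothetical lookup order: config override folder, then jar defaults.
locations = [
    ("config override folder", {"word-dictionary.en.xml": "<custom/>"}),
    ("lingo3g.jar defaults", {
        "word-dictionary.en.xml": "<default/>",
        "other-resource.xml": "<default/>",  # hypothetical file name
    }),
]

print(resolve_resource("word-dictionary.en.xml", locations))
print(resolve_resource("other-resource.xml", locations))
```

If this reading of the rule is right, a custom file found first would be used instead of the bundled one rather than merged with it, while any file not overridden would still come from the jar.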
Default resources are disabled by setting use-built-in-word-database-for-label-filtering attribute, Custom
This attribute adds additional lexical information (part of speech
data). If it's disabled, it won't be used.
I am most interested in how clustering will happen in cases 2 and 3,
where the defaults are disabled and custom resources are not present
either.
It will work without any additional hints and the results may be of
poorer quality.
These labels are a result of merging multiple smaller clusters
together. If you don't care about the multiple tags, trim everything
after the first comma (with a regular expression, for example).
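The trimming suggested above could be done on the client side, after the clustering response comes back. A minimal sketch in Python, assuming the labels arrive as plain strings like the ones in the question; the helper name primary_label is made up for this example and is not part of Lingo3G or ES:

```python
import re


def primary_label(label: str) -> str:
    """Keep only the part of a merged cluster label before the first comma."""
    # Split once on the first comma, swallowing the whitespace around it.
    first_part = re.split(r"\s*,\s*", label, maxsplit=1)[0]
    return first_part.strip()


labels = [
    "India development , india , hello india",
    "mobile samsung , motorola g 205, micromax canvas",
    "mobile samsung",
]
for label in labels:
    print(primary_label(label))
```

Labels without a comma pass through unchanged, so the same helper can be applied to every label in the response.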