Do you use an analyzer provider?
Example:
public class RussianLemmatizingTwitterAnalyzerProvider
        extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {

    // Created once per provider and shared by every analyzer it hands out
    private final MorphAnalyzer morphAnalyzer;
    ...

    @Inject
    public RussianLemmatizingTwitterAnalyzerProvider(Index index,
                                                     @IndexSettings Settings indexSettings,
                                                     Environment environment,
                                                     @Assisted String name,
                                                     @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
    }

    private MorphAnalyzer createMorphAnalyzer(...) {
    }
}
Only the provider is bound as a singleton. So the analyzer provider can
set up the analyzer configuration exactly once (a MorphAnalyzer instance,
etc.), and its get() method creates analyzers as required.
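To make the pattern concrete without pulling in Elasticsearch, here is a minimal plain-Java sketch. All class names in it (SharedLemmatizer, SketchAnalyzer, SketchAnalyzerProvider, ProviderSketch) are hypothetical stand-ins, not ES or Lucene classes; it only illustrates the idea that the provider builds the expensive object once and every analyzer returned by get() holds a reference to it.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the expensive lemmatizer (not a real library class). */
final class SharedLemmatizer {
    // Counts how many times the expensive object has been constructed
    static final AtomicInteger LOADS = new AtomicInteger();

    SharedLemmatizer() {
        LOADS.incrementAndGet();
    }
}

/** Hypothetical stand-in for a cheap analyzer that only references the shared lemmatizer. */
final class SketchAnalyzer {
    final SharedLemmatizer lemmatizer;

    SketchAnalyzer(SharedLemmatizer lemmatizer) {
        this.lemmatizer = lemmatizer;
    }
}

/** Hypothetical stand-in for the provider: constructed once, hands out analyzers on demand. */
final class SketchAnalyzerProvider {
    // Built exactly once per provider instance
    private final SharedLemmatizer lemmatizer = new SharedLemmatizer();

    SketchAnalyzer get() {
        // A new analyzer per call, all sharing the same lemmatizer
        return new SketchAnalyzer(lemmatizer);
    }
}

class ProviderSketch {
    public static void main(String[] args) {
        SketchAnalyzerProvider provider = new SketchAnalyzerProvider();
        SketchAnalyzer a = provider.get();
        SketchAnalyzer b = provider.get();
        System.out.println("distinct analyzers: " + (a != b));
        System.out.println("shared lemmatizer:  " + (a.lemmatizer == b.lemmatizer));
        System.out.println("lemmatizer loads:   " + SharedLemmatizer.LOADS.get());
    }
}
```

Note that this shares the lemmatizer per provider instance only. If several provider instances exist, each one builds its own copy.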
Jörg
On Wed, Mar 18, 2015 at 6:40 PM, Dmitry Kan dmitry.kan@gmail.com wrote:
Jörg,
Thanks for replying!
Here is the code of RussianLemmatizingTwitterAnalyzer, the deepest
class in the simplified class sequence I posted in the original
message.
> public class RussianLemmatizingTwitterAnalyzer extends Analyzer {
>
>     private static MorphAnalyzer morphAnalyzerGlobal;
>
>     boolean useSyncMethod = true;
>
>     private static final boolean verbose = false;
>     private MorphAnalyzer morphAnalyzer;
>     private boolean analyzeBest = false;
>
>     private static final Logger Log = Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());
>
>     public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile, boolean analyzeBest) throws IOException {
>         this.analyzeBest = analyzeBest;
>
>         if (useSyncMethod) {
>             this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
>         } else {
>             Properties properties = new Properties();
>
>             Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>
>             properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>             this.morphAnalyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>         }
>     }
>
>     private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile) throws IOException {
>         Properties properties = new Properties();
>
>         Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>
>         properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>         MorphAnalyzer morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>
>         if (verbose) {
>             if (morphAnalyzer1 != null) {
>                 Log.info("Successfully created the analyzer!");
>                 Log.info(morphAnalyzer1.analyzeBest("билета").toString());
>             } else {
>                 Log.severe("Failed to create the morphAnalyzer object");
>             }
>         }
>
>         return morphAnalyzer1;
>     }
>
>     public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile)
>             throws IOException {
>         if (morphAnalyzerGlobal == null) {
>             morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
>         }
>
>         return morphAnalyzerGlobal;
>     }
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName, final Reader reader) {
>         Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);
>
>         Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());
>
>         TokenStream tokenStream = tokenizer;
>         tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
>         tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
>         return new TokenStreamComponents(tokenizer, tokenStream);
>     }
>
> }
Note that in the code above, TwitterFlexLuceneTokenizer is not thread-safe
and extends o.a.lucene.analysis.Tokenizer. jvisualvm shows 97 instances
of this class.
Let me know if I should post code for the classes further up the class
sequence.
Dmitry
On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
Is it possible to examine the code of your plugin?
Generally speaking, analyzers are instantiated per index creation for
each thread.
In org.elasticsearch.index.analysis.AnalysisModule, you can see how
analyzer providers and factories are prepared for injection with the help
of the ES injection module, which is based on Guice. Basically, the
factories are kept as singletons, and each thread can pick analyzer
instances from the factory when needed. All in all, Lucene analyzer classes
are not thread-safe, in particular the tokenizers. This means it is up to
the implementor of an analyzer/tokenizer to store immutable objects as
singletons in a correct way, so that all threads can safely access them.
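As a rough illustration of storing an immutable object as a singleton that all threads can share, here is a minimal plain-Java sketch. It assumes Java 8 or later (for ConcurrentHashMap.computeIfAbsent), and every name in it (ImmutableModel, ModelCache, ModelCacheSketch, the "lemmatizer.properties" key) is hypothetical. It keeps at most one instance per configuration path within one class loader, which is one possible approach rather than the way ES itself does it.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical immutable resource standing in for an expensive lemmatizer model. */
final class ImmutableModel {
    private final String source;

    ImmutableModel(String source) {
        this.source = source;
    }

    String source() {
        return source;
    }
}

/** Hypothetical holder: at most one model per configuration path in this class loader. */
final class ModelCache {
    // Counts how many times a model was actually built
    static final AtomicInteger LOADS = new AtomicInteger();

    private static final ConcurrentHashMap<String, ImmutableModel> CACHE =
            new ConcurrentHashMap<>();

    private ModelCache() {
    }

    static ImmutableModel forConfig(String confPath) {
        // ConcurrentHashMap.computeIfAbsent applies the function at most once per key,
        // so concurrent callers with the same path end up with the same instance.
        return CACHE.computeIfAbsent(confPath, path -> {
            LOADS.incrementAndGet();
            return new ImmutableModel(path);
        });
    }
}

class ModelCacheSketch {
    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            // Every thread asks for the same (hypothetical) configuration path
            workers[i] = new Thread(() -> ModelCache.forConfig("lemmatizer.properties"));
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println("model loads: " + ModelCache.LOADS.get());
    }
}
```

A static cache like this is only shared within a single class loader. If the plugin classes are loaded more than once, each class loader gets its own copy.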
Jörg
On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan dmitr...@gmail.com wrote:
Hi,
Could somebody answer, please?
On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
Hello!
I'm a newbie in Elasticsearch, so forgive me if the question is lame.
I have implemented a custom plugin using a custom lemmatizer and a
tokenizer. The simplified class sequence:
AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
In the RussianLemmatizingTwitterAnalyzer's constructor I load the custom object for lemmatization (an object unrelated to Lucene/ES) in a singleton fashion (in a synchronized code block).
Then, when creating 14 indices in the same JVM, I see:
14 instances of RussianLemmatizingTwitterAnalyzer,
4 instances of SemanticAnalyzerTwitterLemmatizerProvider,
4 instances of MorphologyAnalysisBinderProcessor,
30 instances of the custom lemmatizer (only one instance is expected in each RussianLemmatizingTwitterAnalyzer, so there should be at most 14),
1 instance of AnalysisMorphologyPlugin.
The question is: can a RussianLemmatizingTwitterAnalyzer object be shared between indices? Or is it by design that they must be loaded separately per index?
What could be wrong in the code that produces 30 instances of the custom singleton lemmatizer instead of 14?
As it stands, with the plugin the JVM reserves 100M of RAM with no data. Without the plugin the JVM reserves 2M with no data. Elasticsearch 1.3.2, Lucene 4.9.0.
Regards,
Dmitry Kan
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.