Issue with singleton analyzer in single JVM multi-index setup

Dmitry_Kan · March 17, 2015, 5:05pm

Hello!

I'm new to Elasticsearch, so forgive me if the question is naive.

I have implemented a custom plugin using a custom lemmatizer and a
tokenizer. The simplified class sequence:

AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer

In the RussianLemmatizingTwitterAnalyzer's constructor I load the custom lemmatization object (unrelated to Lucene/ES) as a singleton, inside a synchronized code block.
Then, when creating 14 indices in the same JVM, I see:
14 instances of RussianLemmatizingTwitterAnalyzer,
4 instances of SemanticAnalyzerTwitterLemmatizerProvider,
4 instances of MorphologyAnalysisBinderProcessor,
30 instances of the custom lemmatizer (only one instance is expected per RussianLemmatizingTwitterAnalyzer, so there should be 14),
1 instance of AnalysisMorphologyPlugin.

The question is: can the RussianLemmatizingTwitterAnalyzer object be shared between indices, or is it by design that it must be loaded separately per index?
What could be wrong in the code that produces 30 instances of the custom singleton lemmatizer instead of 14?

Currently, with the plugin installed, the JVM reserves 100M of RAM with no data. Without the plugin, the JVM reserves 2M with no data. Elasticsearch 1.3.2, Lucene 4.9.0.

Regards,

Dmitry Kan

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7e0b09a0-c88c-4c56-bc8f-1b895d534cc0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dmitry_Kan · March 18, 2015, 3:02pm

Hi,

Could somebody answer, please?

jprante · March 18, 2015, 4:22pm

Is it possible to examine the code of your plugin?

Generally speaking, analyzers are instantiated per index creation for each
thread.

In org.elasticsearch.index.analysis.AnalysisModule, you can see how
analyzer providers and factories are prepared for injection with the help of
the ES injection module, which is based on Guice. Basically, the factories
are kept as singletons, and each thread can pick analyzer instances from
the factory when needed. All in all, Lucene analyzer classes are not
thread-safe, in particular the tokenizers. This means it is up to the
implementor of an analyzer/tokenizer to store immutable objects as
singletons in a correct way so that all threads can safely access them.
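
As a rough illustration of storing an immutable object as a singleton that all threads can share, here is a minimal sketch. It assumes Java 8 or later, and the names Lemmatizer, SharedLemmatizers and forConfig are hypothetical stand-ins, not part of Lucene, Elasticsearch or the plugin discussed here. The cache is keyed by configuration file, so each distinct file is loaded at most once per class loader.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive, immutable lemmatizer model. */
final class Lemmatizer {
    private final String confFile;

    Lemmatizer(String confFile) {
        this.confFile = confFile;
    }

    String confFile() {
        return confFile;
    }
}

/** Shared cache (hypothetical): at most one Lemmatizer per configuration file. */
final class SharedLemmatizers {
    private static final Map<String, Lemmatizer> CACHE = new ConcurrentHashMap<>();
    private static final AtomicInteger LOADS = new AtomicInteger();

    private SharedLemmatizers() {
    }

    static Lemmatizer forConfig(String confFile) {
        // computeIfAbsent applies the loader at most once per key, even under contention.
        return CACHE.computeIfAbsent(confFile, path -> {
            LOADS.incrementAndGet(); // the expensive load would happen here
            return new Lemmatizer(path);
        });
    }

    static int loadCount() {
        return LOADS.get();
    }
}

public class SharedLemmatizerDemo {
    public static void main(String[] args) throws InterruptedException {
        // Simulate 14 threads (for example, one per index) asking for the same model.
        Thread[] threads = new Thread[14];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> SharedLemmatizers.forConfig("lemmatizer.properties"));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("loads = " + SharedLemmatizers.loadCount()); // prints "loads = 1"
    }
}
```

Note that a static field is shared only among classes loaded by the same class loader, and that sharing is safe only if the cached object is immutable or otherwise thread-safe.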

Jörg

Dmitry_Kan · March 18, 2015, 5:40pm

Jörg,

Thanks for replying!

Here is the code of the RussianLemmatizingTwitterAnalyzer, the deepest
class in the simplified class sequence I have posted in the original
message.


public class RussianLemmatizingTwitterAnalyzer extends Analyzer {

    private static MorphAnalyzer morphAnalyzerGlobal;

    boolean useSyncMethod = true;

    private static final boolean verbose = false;
    private MorphAnalyzer morphAnalyzer;
    private boolean analyzeBest = false;

    private static final Logger Log = Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());

    public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile, boolean analyzeBest) throws IOException {
        this.analyzeBest = analyzeBest;

        if (useSyncMethod) {
            this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
        } else {
            Properties properties = new Properties();

            Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);

            properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
            this.morphAnalyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
        }
    }

    private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile) throws IOException {
        Properties properties = new Properties();

        Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);

        properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
        MorphAnalyzer morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));

        if (verbose) {
            if (morphAnalyzer1 != null) {
                Log.info("Successfully created the analyzer!");
                Log.info(morphAnalyzer1.analyzeBest("билета").toString());
            } else {
                Log.severe("Failed to create the morphAnalyzer object");
            }
        }

        return morphAnalyzer1;
    }

    public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile)
            throws IOException {
        if (morphAnalyzerGlobal == null) {
            morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
        }

        return morphAnalyzerGlobal;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, final Reader reader) {
        Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);

        Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());

        TokenStream tokenStream = tokenizer;
        tokenStream = new LowerCaseFilter(Version.LUCENE_4_9,tokenStream);
        tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
        return new TokenStreamComponents(tokenizer, tokenStream);
    }

}

Note that in the code above the TwitterFlexLuceneTokenizer is not thread-safe
and extends o.a.lucene.analysis.Tokenizer. In jvisualvm there are 97
instances of this class.

Let me know if I should post other code snippets from further up the class chain.

Dmitry

jprante · March 18, 2015, 6:52pm

Do you use an analyzer provider?

Example

public class RussianLemmatizingTwitterAnalyzerProvider
        extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {

    private final MorphAnalyzer morphAnalyzer;

    ...

    @Inject
    public RussianLemmatizingTwitterAnalyzerProvider(Index index,
                                                     @IndexSettings Settings indexSettings,
                                                     Environment environment,
                                                     @Assisted String name,
                                                     @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
    }

    private MorphAnalyzer createMorphAnalyzer(...) {
    }

}

Only such a provider is bound to a singleton. So the analyzer provider can
set up the analyzer configuration exactly once (with a MorphAnalyzer
instance etc.), and its get() method creates analyzers as required.
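
To make the division of labour concrete without depending on Elasticsearch classes, here is a self-contained sketch of the same idea. MorphModel, LemmatizingAnalyzer and LemmatizingAnalyzerProvider are hypothetical stand-ins for the real MorphAnalyzer, analyzer and provider. The sketch only shows that the heavy object is created once in the provider's constructor while get() hands out cheap new analyzers that share it.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the heavy lemmatizer model; assumed safe to share. */
final class MorphModel {
    static final AtomicInteger LOADED = new AtomicInteger();

    MorphModel(String confFile) {
        LOADED.incrementAndGet(); // the expensive load would happen here
    }
}

/** Hypothetical stand-in for a lightweight analyzer that borrows the model. */
final class LemmatizingAnalyzer {
    final MorphModel model;
    final boolean analyzeBest;

    LemmatizingAnalyzer(MorphModel model, boolean analyzeBest) {
        this.model = model;
        this.analyzeBest = analyzeBest;
    }
}

/** Provider: loads the model once in its constructor, builds analyzers on demand. */
final class LemmatizingAnalyzerProvider {
    private final MorphModel model;
    private final boolean analyzeBest;

    LemmatizingAnalyzerProvider(String confFile, boolean analyzeBest) {
        this.model = new MorphModel(confFile);
        this.analyzeBest = analyzeBest;
    }

    LemmatizingAnalyzer get() {
        // Every call returns a fresh analyzer that shares the single model.
        return new LemmatizingAnalyzer(model, analyzeBest);
    }
}

public class ProviderDemo {
    public static void main(String[] args) {
        LemmatizingAnalyzerProvider provider =
                new LemmatizingAnalyzerProvider("lemmatizer.properties", true);
        LemmatizingAnalyzer first = provider.get();
        LemmatizingAnalyzer second = provider.get();
        System.out.println(first != second);             // prints "true": distinct analyzers
        System.out.println(first.model == second.model); // prints "true": one shared model
        System.out.println(MorphModel.LOADED.get());     // prints "1": the model was loaded once
    }
}
```

In this sketch each provider instance still owns its own model, so several provider instances would mean several models. Sharing one model across providers would additionally need a shared cache of some kind.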

Jörg

Dmitry_Kan · March 18, 2015, 7:12pm

Yes, I use an analyzer provider. Here is the code:


public class SemanticAnalyzerTwitterLemmatizerProvider extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
    private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;

    private final Logger Log = Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());

    @Inject
    public SemanticAnalyzerTwitterLemmatizerProvider(Index index, @IndexSettings Settings indexSettings,
                                                     @Assisted String name, Settings settings) {
        super(index, indexSettings, name, settings);
        Log.info("called super with name=" + name);
        try {
            String lemmatizerConfFile = settings.get("lemmatizerConf");
            boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
            russianLemmatizingGenericAnalyzer = new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
        } catch (IOException ioe) {
            throw new ElasticsearchIllegalArgumentException("Unable to load Russian morphology analyzer", ioe);
        } catch (Exception e) {
            throw new ElasticsearchIllegalArgumentException("Unable to load Russian morphology analyzer", e);
        }
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return russianLemmatizingGenericAnalyzer;
    }
}

Would you recommend using your approach instead of this one? Do you spot
any issues in my implementation of the provider?

jprante · March 18, 2015, 7:27pm

In the get() method of the provider, I would rather always return a
new analyzer instance.

The configuration and setup of the analyzer could be refactored into the
provider.

Jörg



Dmitry_Kan · March 18, 2015, 8:36pm

Jörg,

Following your suggestion I refactored the code like so:


public class SemanticAnalyzerTwitterLemmatizerProvider extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
    //private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;
    private final MorphAnalyzer morphAnalyzer;
    private String lemmatizerConfFile;
    boolean analyzeBest;

    private final Logger Log = Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());

    @Inject
    public SemanticAnalyzerTwitterLemmatizerProvider(Index index, @IndexSettings Settings indexSettings,
                                                     @Assisted String name, Settings settings) {
        super(index, indexSettings, name, settings);
        Log.info("called super with name=" + name);
        try {
            /*
            String lemmatizerConfFile = settings.get("lemmatizerConf");
            boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
            russianLemmatizingGenericAnalyzer = new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
            */
            lemmatizerConfFile = settings.get("lemmatizerConf");
            morphAnalyzer = createMorphAnalyzer();

        } catch (IOException ioe) {
            throw new ElasticsearchIllegalArgumentException("Unable to load Russian morphology analyzer", ioe);
        } catch (Exception e) {
            throw new ElasticsearchIllegalArgumentException("Unable to load Russian morphology analyzer", e);
        }
    }

    private MorphAnalyzer createMorphAnalyzer() throws IOException {
        Log.info("start of createMorphAnalyzer()");
        MorphAnalyzer morphAnalyzer1;

        Properties properties = new Properties();

        Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);

        properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
        morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));

        Log.info("end of createMorphAnalyzer()");

        return morphAnalyzer1;
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer);
    }
}

Still, the logs show the MorphAnalyzer object being created more than
once. Is something still missing in the logic?

log excerpt:

[2015-03-18 22:34:06,900][INFO ][cluster.metadata ] [Soldier X] [rustest] deleting index
Mar 18, 2015 10:34:06 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider
INFO: called super with name=russian_morphology_twitter
Mar 18, 2015 10:34:06 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: start of createMorphAnalyzer()
Mar 18, 2015 10:34:06 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: Loading lemmatizer properties from /Users/dmitry/projects/information_retrieval/elasticsearch-analysis-morphology-youscan/lemmatizer/lemmatizer-ru.properties
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: end of createMorphAnalyzer()
[2015-03-18 22:34:07,711][INFO ][cluster.metadata ] [Soldier X] [rustest] creating index, cause [api], shards [5]/[1], mappings
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider
INFO: called super with name=russian_morphology_twitter
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: start of createMorphAnalyzer()
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: Loading lemmatizer properties from /Users/dmitry/projects/information_retrieval/elasticsearch-analysis-morphology-youscan/lemmatizer/lemmatizer-ru.properties
Mar 18, 2015 10:34:08 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: end of createMorphAnalyzer()
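
A possible reading of this log: the provider constructor runs again for each index operation, so a MorphAnalyzer held in an instance field of the provider is loaded once per provider rather than once per JVM. The following is only a minimal sketch of that effect in plain Java. FakeMorphAnalyzer, PerIndexProvider and PerIndexLoadDemo are hypothetical stand-ins invented for illustration; they are not Elasticsearch or plugin classes, and the sketch assumes one provider is built per index.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the expensive lemmatizer; counts how often it is loaded. */
final class FakeMorphAnalyzer {
    static final AtomicInteger LOADS = new AtomicInteger();

    FakeMorphAnalyzer(String confFile) {
        LOADS.incrementAndGet(); // stands in for parsing the dictionaries named in confFile
    }
}

/** Hypothetical stand-in for an analyzer provider that is constructed once per index. */
final class PerIndexProvider {
    private final FakeMorphAnalyzer morphAnalyzer; // instance field: one copy per provider

    PerIndexProvider(String confFile) {
        this.morphAnalyzer = new FakeMorphAnalyzer(confFile);
    }

    FakeMorphAnalyzer morphAnalyzer() {
        return morphAnalyzer;
    }
}

final class PerIndexLoadDemo {
    /** Builds one provider per index name and returns how many loads that caused. */
    static int loadsFor(String... indexNames) {
        int before = FakeMorphAnalyzer.LOADS.get();
        for (String ignored : indexNames) {
            new PerIndexProvider("lemmatizer-ru.properties");
        }
        return FakeMorphAnalyzer.LOADS.get() - before;
    }

    public static void main(String[] args) {
        // Two indices mean two providers, hence two loads of the heavy object.
        System.out.println("loads = " + loadsFor("rustest", "rustest2"));
    }
}
```

Under that assumption, sharing the heavy object across indices needs state that outlives any single provider, for example a static holder.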


Dmitry_Kan · March 20, 2015, 4:55am

Jörg,

It looks like I have found a solution: wrapping the MorphAnalyzer object
in a singleton has solved the issue (still to be tested on a larger
scale).

Here is the code:


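public class MorphAnalyzerSingleton {
    // volatile is required for double-checked locking: without it, another thread
    // may observe a non-null INSTANCE before the object is fully constructed.
    private static volatile MorphAnalyzer INSTANCE = null;

    private static final Logger Log = Logger.getLogger(MorphAnalyzerSingleton.class.getName());

    // Not meant to be instantiated: this class only holds the shared MorphAnalyzer.
    private MorphAnalyzerSingleton() {
        if (INSTANCE != null) {
            throw new AssertionError("Instance already exists");
        }
    }

    // Loads the MorphAnalyzer on the first call. Later calls return the same
    // instance and ignore lemmatizerConfFile.
    public static MorphAnalyzer getInstance(String lemmatizerConfFile) throws IOException {
        Log.info("start of getInstance()");

        if (INSTANCE == null) {
            synchronized (MorphAnalyzerSingleton.class) {
                // Second check under the lock, in case another thread loaded it first.
                if (INSTANCE == null) {
                    Properties properties = new Properties();

                    Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);

                    properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
                    INSTANCE = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
                }
            }
        }

        Log.info("end of getInstance()");

        return INSTANCE;
    }
}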
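
One limitation of a single static instance is that the configuration file passed on later calls is ignored, so indices configured with different lemmatizer files would silently share whichever one loaded first. A possible variation, sketched here under assumptions, keeps one shared object per configuration path. SharedResourceCache and SharedResourceCacheDemo are hypothetical names, the type parameter T stands in for MorphAnalyzer, and the loader here is a dummy function rather than the real MorphAnalyzerLoader. The sketch assumes Java 8 or later.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Process-wide cache that loads one shared resource per configuration path.
 * T stands in for MorphAnalyzer; the loader is supplied by the caller.
 */
final class SharedResourceCache<T> {
    private final Map<String, T> byConfPath = new ConcurrentHashMap<>();
    private final Function<String, T> loader;

    SharedResourceCache(Function<String, T> loader) {
        this.loader = loader;
    }

    /** Returns the cached resource for confPath, loading it at most once per path. */
    T get(String confPath) {
        // ConcurrentHashMap.computeIfAbsent applies the function atomically,
        // at most once per key, even when several threads race on the same path.
        return byConfPath.computeIfAbsent(confPath, loader);
    }
}

final class SharedResourceCacheDemo {
    static final AtomicInteger LOADS = new AtomicInteger();

    // Dummy loader: counts invocations instead of reading a real lemmatizer file.
    static final SharedResourceCache<String> CACHE =
            new SharedResourceCache<>(path -> {
                LOADS.incrementAndGet();
                return "lemmatizer loaded from " + path;
            });

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> CACHE.get("lemmatizer-ru.properties"));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("loads = " + LOADS.get());
    }
}
```

A real loader that throws IOException would need to wrap it, for example in UncheckedIOException, because the function passed to computeIfAbsent cannot throw checked exceptions. The shared object must also be safe to use from several threads at once, which this sketch does not check.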

On Wednesday, 18 March 2015 22:36:58 UTC+2, Dmitry Kan wrote:

Jörg,

Following your suggestion I refactored the code like so:
> public class SemanticAnalyzerTwitterLemmatizerProvider extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>     //private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;
>     private final MorphAnalyzer morphAnalyzer;
>     private String lemmatizerConfFile;
>     boolean analyzeBest;
>
>     private final Logger Log = Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());
>
>     @Inject
>     public SemanticAnalyzerTwitterLemmatizerProvider(Index index, @IndexSettings Settings indexSettings,
>                                                      @Assisted String name, Settings settings) {
>         super(index, indexSettings, name, settings);
>         Log.info("called super with name=" + name);
>         try {
>             /*
>             String lemmatizerConfFile = settings.get("lemmatizerConf");
>             boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
>             russianLemmatizingGenericAnalyzer = new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
>             */
>             lemmatizerConfFile = settings.get("lemmatizerConf");
>             morphAnalyzer = createMorphAnalyzer();
>
>         } catch (IOException ioe) {
>             throw new ElasticsearchIllegalArgumentException("Unable to load Russian morphology analyzer", ioe);
>         } catch (Exception e) {
>             throw new ElasticsearchIllegalArgumentException("Unable to load Russian morphology analyzer", e);
>         }
>     }
>
>     private MorphAnalyzer createMorphAnalyzer() throws IOException {
>         Log.info("start of createMorphAnalyzer()");
>         MorphAnalyzer morphAnalyzer1;
>
>         Properties properties = new Properties();
>
>         Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>
>         properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>         morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>
>         Log.info("end of createMorphAnalyzer()");
>
>         return morphAnalyzer1;
>     }
>
>     @Override
>     public RussianLemmatizingTwitterAnalyzer get() {
>         return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer);
>     }
> }
>
>
Still in the logs I see the creation of MorphAnalyzer object more than
once. Probably something is still missing in the logic?

log excerpt:

[2015-03-18 22:34:06,900][INFO ][cluster.metadata ] [Soldier X]
[rustest] deleting index
Mar 18, 2015 10:34:06 PM
org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider

INFO: called super with name=russian_morphology_twitter
Mar 18, 2015 10:34:06 PM
org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider
createMorphAnalyzer
INFO: start of createMorphAnalyzer()
Mar 18, 2015 10:34:06 PM
org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider
createMorphAnalyzer
INFO: Loading lemmatizer properties from
/Users/dmitry/projects/information_retrieval/elasticsearch-analysis-morphology-youscan/lemmatizer/lemmatizer-ru.properties
Mar 18, 2015 10:34:07 PM
org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider
createMorphAnalyzer
INFO: end of createMorphAnalyzer()
[2015-03-18 22:34:07,711][INFO ][cluster.metadata ] [Soldier X]
[rustest] creating index, cause [api], shards [5]/[1], mappings
Mar 18, 2015 10:34:07 PM
org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider

INFO: called super with name=russian_morphology_twitter
Mar 18, 2015 10:34:07 PM
org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider
createMorphAnalyzer
INFO: start of createMorphAnalyzer()
Mar 18, 2015 10:34:07 PM
org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider
createMorphAnalyzer
INFO: Loading lemmatizer properties from
/Users/dmitry/projects/information_retrieval/elasticsearch-analysis-morphology-youscan/lemmatizer/lemmatizer-ru.properties
Mar 18, 2015 10:34:08 PM
org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider
createMorphAnalyzer
INFO: end of createMorphAnalyzer()

On Wednesday, 18 March 2015 21:27:12 UTC+2, Jörg Prante wrote:
In the get() method of the provider, I would better try to always return
a new analyzer instance.

The configuration and setup of the analyzer could be refactored to the
provider.

Jörg

On Wed, Mar 18, 2015 at 8:12 PM, Dmitry Kan dmitr...@gmail.com wrote:
Yes, I use an analyzer provider. Here is the code:

public class SemanticAnalyzerTwitterLemmatizerProvider extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
    private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;

    private final Logger Log = Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());

    @Inject
    public SemanticAnalyzerTwitterLemmatizerProvider(Index index, @IndexSettings Settings indexSettings,
                                                     @Assisted String name, Settings settings) {
        super(index, indexSettings, name, settings);
        Log.info("called super with name=" + name);
        try {
            String lemmatizerConfFile = settings.get("lemmatizerConf");
            boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
            russianLemmatizingGenericAnalyzer = new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
        } catch (IOException ioe) {
            throw new ElasticsearchIllegalArgumentException("Unable to load Russian morphology analyzer", ioe);
        } catch (Exception e) {
            throw new ElasticsearchIllegalArgumentException("Unable to load Russian morphology analyzer", e);
        }
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return russianLemmatizingGenericAnalyzer;
    }
}

Would you recommend using your approach instead of this one? Do you
see any issues in my implementation of the provider?

On Wednesday, 18 March 2015 20:52:08 UTC+2, Jörg Prante wrote:
Do you use an analyzer provider?

Example

public class RussianLemmatizingTwitterAnalyzerProvider
        extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {

    private final MorphAnalyzer morphAnalyzer;

    ...

    @Inject
    public RussianLemmatizingTwitterAnalyzerProvider(Index index,
                                                     @IndexSettings Settings indexSettings,
                                                     Environment environment,
                                                     @Assisted String name,
                                                     @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
    }

    private MorphAnalyzer createMorphAnalyzer(...) {
    }
}

Only the provider is bound as a singleton. So the analyzer provider
can set up the analyzer configuration exactly once (with a MorphAnalyzer
instance etc.), and its get() method creates analyzers as required.
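
To make that division of labour concrete, here is a minimal, self-contained sketch of the pattern. It deliberately does not use the Elasticsearch or Lucene APIs: SharedLemmatizer, LightweightAnalyzer and LemmatizingAnalyzerProvider are hypothetical stand-ins, and the load counter exists only for illustration. It assumes the expensive object is immutable once loaded. How many provider instances Elasticsearch itself creates is not modelled here.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the heavyweight, immutable lemmatizer (not the real MorphAnalyzer).
final class SharedLemmatizer {
    static final AtomicInteger LOADS = new AtomicInteger();
    private final String confFile;

    SharedLemmatizer(String confFile) {
        LOADS.incrementAndGet(); // counts how often the expensive load happens
        this.confFile = confFile;
    }

    String confFile() {
        return confFile;
    }
}

// Hypothetical stand-in for a cheap analyzer that only holds a reference to the shared object.
final class LightweightAnalyzer {
    private final SharedLemmatizer lemmatizer;
    private final boolean analyzeBest;

    LightweightAnalyzer(SharedLemmatizer lemmatizer, boolean analyzeBest) {
        this.lemmatizer = lemmatizer;
        this.analyzeBest = analyzeBest;
    }

    SharedLemmatizer lemmatizer() {
        return lemmatizer;
    }

    boolean analyzeBest() {
        return analyzeBest;
    }
}

// The provider does the expensive setup once and hands out fresh analyzers from get().
final class LemmatizingAnalyzerProvider {
    private final SharedLemmatizer lemmatizer;
    private final boolean analyzeBest;

    LemmatizingAnalyzerProvider(String confFile, boolean analyzeBest) {
        this.lemmatizer = new SharedLemmatizer(confFile);
        this.analyzeBest = analyzeBest;
    }

    LightweightAnalyzer get() {
        return new LightweightAnalyzer(lemmatizer, analyzeBest);
    }
}

public class ProviderSketch {
    public static void main(String[] args) {
        LemmatizingAnalyzerProvider provider =
                new LemmatizingAnalyzerProvider("lemmatizer-ru.properties", true);
        LightweightAnalyzer a = provider.get();
        LightweightAnalyzer b = provider.get();
        System.out.println(a != b);                           // true: distinct analyzers
        System.out.println(a.lemmatizer() == b.lemmatizer()); // true: one shared lemmatizer
        System.out.println(SharedLemmatizer.LOADS.get());     // 1: loaded once
    }
}
```

With this split, the number of lemmatizer instances follows the number of providers rather than the number of analyzers handed out.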

Jörg

On Wed, Mar 18, 2015 at 6:40 PM, Dmitry Kan dmitr...@gmail.com wrote:
Jörg,

Thanks for replying!

Here is the code of the RussianLemmatizingTwitterAnalyzer, the
deepest class in the simplified class sequence I have posted in the
original message.

public class RussianLemmatizingTwitterAnalyzer extends Analyzer {

    private static MorphAnalyzer morphAnalyzerGlobal;

    boolean useSyncMethod = true;

    private static final boolean verbose = false;
    private MorphAnalyzer morphAnalyzer;
    private boolean analyzeBest = false;

    private static final Logger Log = Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());

    public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile, boolean analyzeBest) throws IOException {
        this.analyzeBest = analyzeBest;

        if (useSyncMethod) {
            this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
        } else {
            Properties properties = new Properties();

            Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);

            properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
            this.morphAnalyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
        }
    }

    private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile) throws IOException {
        Properties properties = new Properties();

        Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);

        properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
        MorphAnalyzer morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));

        if (verbose) {
            if (morphAnalyzer1 != null) {
                Log.info("Successfully created the analyzer!");
                Log.info(morphAnalyzer1.analyzeBest("билета").toString());
            } else {
                Log.severe("Failed to create the morphAnalyzer object");
            }
        }

        return morphAnalyzer1;
    }

    public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile)
            throws IOException {
        if (morphAnalyzerGlobal == null) {
            morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
        }

        return morphAnalyzerGlobal;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, final Reader reader) {
        Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);

        Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());

        TokenStream tokenStream = tokenizer;
        tokenStream = new LowerCaseFilter(Version.LUCENE_4_9,tokenStream);
        tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
        return new TokenStreamComponents(tokenizer, tokenStream);
    }

}

Note that in the code above the TwitterFlexLuceneTokenizer is not
thread-safe and extends o.a.lucene.analysis.Tokenizer. In jvisualvm there
are 97 instances of this class.
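
A count of that order is plausible given how Lucene reuses components: as far as I understand Lucene 4.x, an Analyzer by default caches its TokenStreamComponents per thread, so each analyzer instance can end up with one tokenizer for every thread that has used it. The sketch below mimics that idea with a plain ThreadLocal. StatefulTokenizer and PerThreadAnalyzer are hypothetical stand-ins, not Lucene classes, and the counter is there only to make the effect visible.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical non-thread-safe component: it keeps mutable state between calls.
final class StatefulTokenizer {
    static final AtomicInteger CREATED = new AtomicInteger();
    private String input = "";

    StatefulTokenizer() {
        CREATED.incrementAndGet();
    }

    void reset(String text) {
        input = text;
    }

    int tokenCount() {
        String trimmed = input.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}

// Hypothetical analyzer that keeps one tokenizer per thread and reuses it on later calls.
final class PerThreadAnalyzer {
    private final ThreadLocal<StatefulTokenizer> components =
            ThreadLocal.withInitial(StatefulTokenizer::new);

    int countTokens(String text) {
        StatefulTokenizer tokenizer = components.get(); // created on a thread's first use only
        tokenizer.reset(text);
        return tokenizer.tokenCount();
    }
}

public class PerThreadReuseSketch {
    public static void main(String[] args) throws InterruptedException {
        PerThreadAnalyzer analyzer = new PerThreadAnalyzer();
        Thread[] threads = new Thread[3];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                analyzer.countTokens("a b");
                analyzer.countTokens("c d e"); // reuses this thread's tokenizer
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(StatefulTokenizer.CREATED.get()); // 3: one tokenizer per thread
    }
}
```

Under this model the tokenizer count grows with (analyzer instances) x (threads that touched them), not with the number of indices alone.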

Let me know if I should post other code snippets from further up the class chain.

Dmitry

On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:

Is it possible to examine the code of your plugin?

Generally speaking, analyzers are instantiated at index creation, for
each thread.

In org.elasticsearch.index.analysis.AnalysisModule, you can see how
analyzer providers and factories are prepared for injection with the help
of the ES injection module, which is based on Guice. Basically, the
factories are kept as singletons, and each thread can pick analyzer
instances from the factory when needed. All in all, Lucene analyzer classes
are not thread-safe, in particular the tokenizers. This means it is up to
the implementor of an analyzer/tokenizer to store immutable objects as
singletons in a correct way, so that all threads can safely access them.
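
One way to store an immutable object as a singleton so that all threads can access it safely is sketched below. This is only an illustration under stated assumptions: ImmutableDictionary and DictionaryCache are hypothetical names, the cache is keyed by the configuration file path, and ConcurrentHashMap.computeIfAbsent requires Java 8 or later. Note also that a static field is unique per class loader, which is not necessarily the same as once per JVM.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical immutable resource: all fields are final, so it is safe to share once constructed.
final class ImmutableDictionary {
    private final String source;

    ImmutableDictionary(String source) {
        this.source = source;
    }

    String source() {
        return source;
    }
}

// Hypothetical cache that loads each configuration at most once per class loader.
final class DictionaryCache {
    static final AtomicInteger LOADS = new AtomicInteger();
    private static final ConcurrentMap<String, ImmutableDictionary> CACHE =
            new ConcurrentHashMap<>();

    private DictionaryCache() {
    }

    static ImmutableDictionary forConfig(String confFile) {
        // ConcurrentHashMap applies the loader at most once per key, even under contention.
        return CACHE.computeIfAbsent(confFile, key -> {
            LOADS.incrementAndGet();
            return new ImmutableDictionary(key);
        });
    }
}

public class SharedSingletonSketch {
    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> DictionaryCache.forConfig("lemmatizer-ru.properties"));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(DictionaryCache.LOADS.get()); // 1: eight threads, one load
    }
}
```

Keying by configuration path means two indices that point at the same file share one object, while indices with different files each get their own.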

Jörg
