Possible solution found Re: custom analyzer in ES


(Sisu Alexandru) #1

... but I'm not sure it's the correct one.
After looking at the AnalysisService class and the way the providers
and analyzers are instantiated,
I changed the name() method to return "default", and after redeploying it worked.

@Override
public String name() {
    System.out.println("name() called");
    return "default";
}

Cheers,

Alex

On Tue, Nov 22, 2011 at 3:46 PM, alex sisu.eugen@gmail.com wrote:

Hi,

I need some help setting up a custom-written analyzer.
I want to provide a certain type of analysis for my "posting"-type
documents.

Here's my setup:

[elasticsearch.yml]

index:
  analysis:
    analyzer:
      default:
        type: com.buzzbuzz.elasticsearch.analysis.CustomAnalyzerAugmentedProvider

The custom analysis classes are placed in lib:
customanalysis.zip

[CustomAnalyzerAugmentedProvider]
public class CustomAnalyzerAugmentedProvider extends AbstractIndexAnalyzerProvider {

    private CustomAnalyzerAugmented analyzer;

    @Inject
    public CustomAnalyzerAugmentedProvider(Index index, @IndexSettings Settings indexSettings,
            @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        System.out.println("{CustomAnalyzerAugmentedProvider} initialized");
        this.analyzer = new CustomAnalyzerAugmented(Version.LUCENE_31);
    }

    @Override
    public CustomAnalyzerAugmented get() {
        System.out.println("get() called");
        return this.analyzer;
    }

    @Override
    public String name() {
        System.out.println("name() called");
        return "escustomanalyzer";
    }
}

[schema]
SCHEMA = {
    "posting": {
        "properties": {
            "impact_factor": {"type": "integer", "store": "yes"},
            "username": {"type": "string", "store": "yes"},
            "body": {"type": "string", "store": "yes"},
        }
    }
}

[settings]
SETTINGS = {
    "index": {
        "number_of_shards": 1
    }
}

[version of elastic search] 0.18.4

I start Elasticsearch and can see the println() statements
displayed.
I then try to index some documents from Python.
The errors I get are at the end of this mail.

Additional information:
I've tried to investigate the problem by debugging the code (the
latest version from git!).
There the problem occurs in a different method:
    return getAnalyzer(fieldName).reusableTokenStream(fieldName, reader);
For the field "_all", the analyzers map contains no analyzer, and,
interestingly, the defaultAnalyzer of the FieldNameAnalyzer class
is also null.
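
The failure described above can be modelled in a few lines. This is only a sketch under assumptions, not the actual FieldNameAnalyzer code: it assumes the lookup falls back to a default analyzer when a field has no analyzer of its own, and that the result is used without a null check. SketchAnalyzer and FieldLookupSketch are hypothetical names invented for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal analyzer interface, for illustration only.
interface SketchAnalyzer {
    int getOffsetGap();
}

// Simplified model (an assumption, not Elasticsearch code) of a per-field
// analyzer lookup with a default fallback.
class FieldLookupSketch {
    private final Map<String, SketchAnalyzer> analyzers = new HashMap<>();
    private final SketchAnalyzer defaultAnalyzer; // null if no default was resolved

    FieldLookupSketch(SketchAnalyzer defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    void register(String fieldName, SketchAnalyzer analyzer) {
        analyzers.put(fieldName, analyzer);
    }

    // Falls back to the default when the field has no dedicated analyzer.
    SketchAnalyzer getAnalyzer(String fieldName) {
        SketchAnalyzer analyzer = analyzers.get(fieldName);
        return analyzer != null ? analyzer : defaultAnalyzer;
    }

    // Dereferences the lookup result without a null check.
    int getOffsetGap(String fieldName) {
        return getAnalyzer(fieldName).getOffsetGap();
    }

    // Returns true if the lookup for the field ends in a NullPointerException.
    static boolean failsWithNpe(FieldLookupSketch lookup, String fieldName) {
        try {
            lookup.getOffsetGap(fieldName);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        FieldLookupSketch broken = new FieldLookupSketch(null);
        FieldLookupSketch working = new FieldLookupSketch(() -> 1);
        System.out.println(failsWithNpe(broken, "_all"));
        System.out.println(failsWithNpe(working, "_all"));
    }
}
```

In this model a field such as "_all", which has no entry in the map, only fails when the default is missing as well, which matches what the debugger showed.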

The same problem occurs when I leave the elasticsearch.yml file
unchanged and instead create the index through REST calls.
The settings that I use are:
SETTINGS = {
    "index": {
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "com.buzzbuzz.elasticsearch.analysis.CustomAnalyzerAugmentedProvider",
                }
            }
        }
    }
}

Other questions:

  1. I want to make it work. What's the problem, and how can I solve it?

  2. Is configuring elasticsearch.yml (plus writing providers
    extending AbstractIndexAnalyzerProvider) the only way of
    providing a custom analysis?

  3. I took a look at the ICU plugin. That example shows a way of using
    a custom analysis. If I write a custom analysis plugin like ICU, what
    is the difference (performance, memory consumption, threads used) between
    this method and extending AbstractIndexAnalyzerProvider?

[Error]

[2011-11-22 15:23:52,031][DEBUG][action.index ] [Mentus] [testindex][0], node[sz3-xLAfQk61NHnuIKQbgA], [P], s[STARTED]: Failed to execute [index {[testindex][posting][Gh4_DfM3Tt-aerr129kVOg], source[{
"username":"Username_Alex",
"body":"some very important text to index",
"impact_factor":100,
}]}]
java.lang.NullPointerException
    at org.elasticsearch.index.analysis.FieldNameAnalyzer.getOffsetGap(FieldNameAnalyzer.java:66)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:196)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:278)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2067)
    at org.elasticsearch.index.engine.robin.RobinEngine.innerCreate(RobinEngine.java:460)
    at org.elasticsearch.index.engine.robin.RobinEngine.create(RobinEngine.java:353)
    at org.elasticsearch.index.shard.service.InternalIndexShard.create(InternalIndexShard.java:293)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:193)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:487)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:400)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)


(Shay Banon) #2

Or just don't override the name method in your custom analyzer provider.
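
To illustrate why both fixes in this thread have the same effect, here is a minimal sketch. It assumes, following Alex's reading of AnalysisService, that analyzers are registered under whatever the provider's name() returns and that the default is then looked up under the key "default". None of this is the real Elasticsearch code; SketchProvider, InheritedNameProvider, OverriddenNameProvider and AnalyzerRegistrySketch are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a provider base class: it stores
// the name it was configured under and returns it from name().
abstract class SketchProvider {
    private final String name;

    SketchProvider(String name) {
        this.name = name;
    }

    public String name() {
        return name;
    }
}

// Does not override name(): reports the key it was configured under.
class InheritedNameProvider extends SketchProvider {
    InheritedNameProvider(String name) {
        super(name);
    }
}

// Overrides name() with an unrelated value, as in the original post.
class OverriddenNameProvider extends SketchProvider {
    OverriddenNameProvider(String name) {
        super(name);
    }

    @Override
    public String name() {
        return "escustomanalyzer";
    }
}

class AnalyzerRegistrySketch {
    // Registers a provider under whatever name() returns (the assumption
    // stated above), then reports whether a lookup by "default" finds it.
    static boolean defaultIsResolvable(SketchProvider provider) {
        Map<String, SketchProvider> analyzers = new HashMap<>();
        analyzers.put(provider.name(), provider);
        return analyzers.get("default") != null;
    }

    public static void main(String[] args) {
        System.out.println(defaultIsResolvable(new InheritedNameProvider("default")));
        System.out.println(defaultIsResolvable(new OverriddenNameProvider("default")));
    }
}
```

Under that assumption, a provider configured as "default" only resolves if name() still returns "default", either because it is hard-coded or because the method is simply inherited.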

On Thu, Nov 24, 2011 at 6:13 PM, Sisu Alexandru sisu.eugen@gmail.com wrote:



(system) #3