Is there a concatenation filter?


(cole) #1

Was there any progress in adding the concatenation filter [1] to
Lucene (and ES) last summer? I can't find any evidence of built-in
support for this type of filter.

Thanks,
Cole

[1] http://elasticsearch-users.115913.n3.nabble.com/Code-contribution-Concatenate-filter-td3137058.html#a3707818


(Stephane Bastian) #2

Hi Cole,

No, I got so busy right after my email last summer that I didn't follow up
and dropped the ball.
However, if you are interested, I can send you the code for the filter. Just
let me know.

Stephane


(cole) #3

Hi Stephane,

Thanks for the reply. I'd very much appreciate seeing your
concatenation filter code. Do you have it up somewhere you can link
to?

Thanks,
Cole



(Stephane Bastian) #4

Hello Cole,

Here is the code for the concatenate filter. As you can see, it's very
simple but does the job for me.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public final class ConcatenateFilter extends TokenFilter {

        private static final String DEFAULT_TOKEN_SEPARATOR = " ";

        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final String tokenSeparator;
        private final StringBuilder builder = new StringBuilder();

        // matchVersion is accepted but not used by this filter
        public ConcatenateFilter(Version matchVersion, TokenStream input, String tokenSeparator) {
            super(input);
            this.tokenSeparator = tokenSeparator != null ? tokenSeparator : DEFAULT_TOKEN_SEPARATOR;
        }

        @Override
        public boolean incrementToken() throws IOException {
            boolean result = false;
            builder.setLength(0);
            // consume the whole input stream on the first call
            while (input.incrementToken()) {
                if (builder.length() > 0) {
                    // append the token separator
                    builder.append(tokenSeparator);
                }
                // append the term of the current token
                builder.append(termAtt.buffer(), 0, termAtt.length());
            }
            if (builder.length() > 0) {
                // emit everything collected as a single token
                termAtt.setEmpty().append(builder);
                result = true;
            }
            return result;
        }
    }
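To make the behavior of the filter easier to see without a Lucene dependency, here is a minimal plain-Java sketch of the same loop. It is only an illustration: the class name `ConcatenateSketch` and the `concatenate` method are hypothetical, and a plain `Iterator<String>` stands in for the Lucene `TokenStream`. A `null` return stands in for the case where `incrementToken()` returns false.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class ConcatenateSketch {

    private static final String DEFAULT_TOKEN_SEPARATOR = " ";

    // Joins all incoming tokens into one term, mirroring the filter's loop.
    // Returns null when there are no tokens (assumption: stands in for
    // incrementToken() returning false).
    static String concatenate(Iterator<String> tokens, String tokenSeparator) {
        String separator = tokenSeparator != null ? tokenSeparator : DEFAULT_TOKEN_SEPARATOR;
        StringBuilder builder = new StringBuilder();
        while (tokens.hasNext()) {
            if (builder.length() > 0) {
                // only add the separator once something has been collected
                builder.append(separator);
            }
            builder.append(tokens.next());
        }
        return builder.length() > 0 ? builder.toString() : null;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("new", "york", "city");
        System.out.println(concatenate(tokens.iterator(), "_"));   // new_york_city
        System.out.println(concatenate(tokens.iterator(), null));  // new york city
        System.out.println(concatenate(Collections.<String>emptyIterator(), "_")); // null
    }
}
```

Note that, as in the filter itself, a `null` separator falls back to a single space, and an empty input produces no output token at all.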

As you can see above, the code is pure Lucene (no ES code). In order to
use the filter in ES, you need to implement another class:
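
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.util.Version;
    import org.elasticsearch.common.inject.Inject;
    import org.elasticsearch.common.inject.assistedinject.Assisted;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.index.Index;
    import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
    import org.elasticsearch.index.settings.IndexSettings;

    public class ConcatenateTokenFilterFactory extends AbstractTokenFilterFactory {

        private final String tokenSeparator;

        @Inject
        public ConcatenateTokenFilterFactory(Index index, @IndexSettings Settings indexSettings,
                @Assisted String name, @Assisted Settings settings) {
            super(index, indexSettings, name, settings);
            // the token_separator is defined in the ES configuration file
            tokenSeparator = settings.get("token_separator");
        }

        @Override
        public TokenStream create(TokenStream tokenStream) {
            return new ConcatenateFilter(Version.LUCENE_CURRENT, tokenStream, tokenSeparator);
        }
    }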

and to glue things together you then need to declare the
ConcatenateTokenFilterFactory in the ES config file:

    "index": {
        "analysis": {
            "analyzer": {
                "myAnalyzer": {
                    "tokenizer": "letter",
                    "filter": ["lowercase", "asciifolding", "filter-concatenate"]
                }
            },
            "filter": {
                "filter-concatenate": {
                    "type": "com.monpetitguide.elasticsearch.analysis.ConcatenateTokenFilterFactory",
                    "token_separator": " "
                }
            }
        }
    }
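As a rough illustration of what the analyzer chain in this configuration produces, the following plain-Java sketch approximates each stage without Lucene or ES. Everything here is an assumption made for illustration: the class `AnalyzerChainSketch` and its methods are hypothetical, the letter tokenizer is approximated with `Character.isLetter`, and asciifolding is approximated by stripping combining marks, which covers far fewer characters than the real filter.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AnalyzerChainSketch {

    // Rough stand-in for the "letter" tokenizer: split wherever a char is not a letter.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Rough stand-in for "asciifolding": decompose, then drop combining marks.
    static String fold(String token) {
        return Normalizer.normalize(token, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Tokenize, lowercase, fold, then join into one term with the separator.
    // Unlike the real filter, an input without letters yields "" rather than no token.
    static String analyze(String text, String separator) {
        List<String> out = new ArrayList<>();
        for (String token : letterTokens(text)) {
            out.add(fold(token.toLowerCase(Locale.ROOT)));
        }
        return String.join(separator, out);
    }

    public static void main(String[] args) {
        // "Café-Crème, 2012!" written with Unicode escapes to avoid encoding issues
        System.out.println(analyze("Caf\u00e9-Cr\u00e8me, 2012!", " ")); // cafe creme
    }
}
```

The point of the chain is that the whole field value ends up indexed as one normalized term, with digits and punctuation dropped by the letter tokenizer.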

Cole, feel free to use any part of the code above. I'll be glad if it helps.

All the best,

Stephane Bastian



(cole) #5

Thanks, Stephane! I appreciate you explaining how everything is glued
together. Very helpful!

Thanks,
Cole


