Is there a concatenation filter?

cole · February 2, 2012, 8:25pm

Was there any progress in adding the concatenation filter [1] to
Lucene (and ES) last summer? I can't find any evidence of built-in
support for this type of filter.

Thanks,
Cole

[1] http://elasticsearch-users.115913.n3.nabble.com/Code-contribution-Concatenate-filter-td3137058.html#a3707818

Stephane_Bastian · February 3, 2012, 1:42pm

Hi Cole,

No, I got so busy right after my email last summer that I didn't follow up
and dropped the ball.
However, if you are interested I can send you the code for the filter. Just
let me know.

Stephane

cole · February 3, 2012, 7:27pm

Hi Stephane,

Thanks for the reply. I'd very much appreciate seeing your
concatenation filter code. Do you have it up somewhere you can link
to?

Thanks,
Cole

Stephane_Bastian · February 6, 2012, 10:25am

Hello Cole,

Here is the code for the concatenate filter. As you can see, it's very
simple but does the job for me.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public final class ConcatenateFilter extends TokenFilter {

    private final static String DEFAULT_TOKEN_SEPARATOR = " ";

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private String tokenSeparator = null;
    private StringBuilder builder = new StringBuilder();

    public ConcatenateFilter(Version matchVersion, TokenStream input, String tokenSeparator) {
        super(input);
        this.tokenSeparator = tokenSeparator != null ? tokenSeparator : DEFAULT_TOKEN_SEPARATOR;
    }

    @Override
    public boolean incrementToken() throws IOException {
        boolean result = false;
        builder.setLength(0);
        // consume every token of the wrapped stream
        while (input.incrementToken()) {
            if (builder.length() > 0) {
                // append the token separator
                builder.append(tokenSeparator);
            }
            // append the term of the current token
            builder.append(termAtt.buffer(), 0, termAtt.length());
        }
        if (builder.length() > 0) {
            // emit the concatenated terms as a single token
            termAtt.setEmpty().append(builder);
            result = true;
        }
        return result;
    }
}
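For anyone who wants to see what the loop does without pulling in Lucene, here is a minimal, dependency-free sketch. It is not the Lucene API: the class `ConcatenateSketch` and the method `concatenate` are hypothetical names, and a plain `Iterator<String>` stands in for the token stream. It only mirrors the logic above: drain all tokens, join them with the separator, and produce nothing when the stream is empty.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the filter's loop; an Iterator<String>
// plays the role of the wrapped TokenStream.
class ConcatenateSketch {

    // Drains every remaining token and joins them with the separator.
    // Returns null when there were no tokens, mirroring the filter
    // returning false from incrementToken().
    static String concatenate(Iterator<String> tokens, String separator) {
        // same default as the filter: a single space
        String sep = (separator != null) ? separator : " ";
        StringBuilder builder = new StringBuilder();
        while (tokens.hasNext()) {
            if (builder.length() > 0) {
                builder.append(sep);
            }
            builder.append(tokens.next());
        }
        return builder.length() > 0 ? builder.toString() : null;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("new", "york", "city");
        System.out.println(concatenate(tokens.iterator(), " "));  // new york city
        System.out.println(concatenate(tokens.iterator(), "_"));  // new_york_city
    }
}
```

The point to notice is that any number of input tokens collapses into at most one output token, which is what makes the filter useful for exact-phrase style matching on an analyzed field.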

As you can see above, the code is pure Lucene (no ES code). In order to
use the filter in ES you need to implement another class:
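
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.Version;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.index.settings.IndexSettings;

public class ConcatenateTokenFilterFactory extends AbstractTokenFilterFactory {

    private String tokenSeparator = null;

    @Inject
    public ConcatenateTokenFilterFactory(Index index, @IndexSettings Settings indexSettings,
            @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        // the token_separator is defined in the ES configuration file
        tokenSeparator = settings.get("token_separator");
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new ConcatenateFilter(Version.LUCENE_CURRENT, tokenStream, tokenSeparator);
    }
}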

And to glue things together, you then need to declare the
ConcatenateTokenFilterFactory in the ES config file:

"index": {
  "analysis": {
    "analyzer": {
      "myAnalyzer": {
        "tokenizer": "letter",
        "filter": ["lowercase", "asciifolding", "filter-concatenate"]
      }
    },
    "filter": {
      "filter-concatenate": {
        "type": "com.monpetitguide.elasticsearch.analysis.ConcatenateTokenFilterFactory",
        "token_separator": " "
      }
    }
  }
}
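To get a feel for what "myAnalyzer" would emit for a piece of text, the chain can be approximated in plain Java. This is only a rough sketch under stated assumptions, not what ES actually runs: `AnalyzerChainSketch` and its methods are hypothetical names, the letter tokenizer is imitated by splitting on non-letter characters, and asciifolding is approximated by Unicode decomposition plus stripping combining marks, which covers common accented Latin letters but far less than the real filter.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical approximation of: letter tokenizer -> lowercase ->
// asciifolding -> concatenate. None of this is ES or Lucene code.
class AnalyzerChainSketch {

    // Imitates the "letter" tokenizer: a token is a maximal run of letters.
    static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Approximates asciifolding: decompose, then drop combining marks.
    static String foldAccents(String token) {
        return Normalizer.normalize(token, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Runs the whole chain and returns the single concatenated term.
    static String analyze(String text, String separator) {
        StringBuilder builder = new StringBuilder();
        for (String token : letterTokenize(text)) {
            if (builder.length() > 0) {
                builder.append(separator);
            }
            builder.append(foldAccents(token.toLowerCase(Locale.ROOT)));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // "Crème Brûlée, s'il-vous plaît!" written with Unicode escapes
        String text = "Cr\u00e8me Br\u00fbl\u00e9e, s'il-vous pla\u00eet!";
        System.out.println(analyze(text, " "));  // creme brulee s il vous plait
    }
}
```

Because the concatenate filter comes last in the "filter" list, it sees tokens that are already lowercased and folded, so the whole field value ends up as one normalized term.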

Cole, feel free to use any part of the code above. I'd be glad if it helps.

All the best,

Stephane Bastian

cole · February 6, 2012, 10:58pm

Thanks, Stephane! I appreciate you explaining how everything is glued
together. Very helpful!

Thanks,
Cole
