Analyzer with stop words removal by language


(Sebastian Gavarini) #1

Hi all,

I am facing an issue with stop word removal for one of my fields. I
have 10 to 15 fields that are analyzed without the stop filter, but
one long field called "description" needs stop word removal.

The problem is that I need a solution for many languages. I defined my
analyzers in elasticsearch.yml, for example "default",
"en_stop_analyzer", "es_stop_analyzer", ... (en and es being English
and Spanish). So far so good. I also have some custom mappings
explicitly defined, with the dynamic features turned off.

However, I can't use the "analyzer" setting in the JSON mapping for my
types, because I use the same mapping for all languages. Say I have a
mapping for "myDocument" with a "description" field that I know must
be indexed with a stop filter; I will only know the language at
indexing time.

I could create many mappings, one for each language, but I already
have 5 different object types, multiplied by the languages I must
support. That is not nice to maintain just for the stop word list of a
single field.

Is there a way to use some "include/import" feature to at least
minimize the differences among the files?
Could the analyzer be passed with the index and bulk APIs?
Any other ideas?

Thanks,
Sebastian.
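
For readers following along, the per-language analyzers described above could be generated rather than written by hand. This is only a minimal sketch: it assumes each analyzer is a "standard" analyzer with its own `stopwords` list, the stop word lists are heavily truncated illustrations, and `build_analysis_settings` is a hypothetical helper that does not appear in the original post. The resulting structure mirrors what would go under `index.analysis` in elasticsearch.yml or in index settings.

```python
import json

# Illustrative, heavily truncated stop word lists (assumption: real lists are longer).
STOP_WORDS = {
    "en": ["a", "an", "and", "of", "the"],
    "es": ["de", "el", "la", "los", "y"],
}


def build_analysis_settings(stop_words_by_lang):
    """Build an analysis settings dict with one stop analyzer per language."""
    analyzers = {}
    for lang, words in sorted(stop_words_by_lang.items()):
        # Naming follows the thread: en_stop_analyzer, es_stop_analyzer, ...
        analyzers[f"{lang}_stop_analyzer"] = {
            "type": "standard",
            "stopwords": list(words),
        }
    return {"index": {"analysis": {"analyzer": analyzers}}}


settings = build_analysis_settings(STOP_WORDS)
print(json.dumps(settings, indent=2))
```

Supporting another language then only means adding one entry to `STOP_WORDS`; the analyzer names stay predictable, which helps when they have to be referenced from mappings or documents.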


(Shay Banon) #2

Hi,

Yeah, one of the features on my list is the ability to drive the
analyzer used based on a field in the JSON doc. I have most of it
implemented on a local branch. Can you open a feature request for it?

-shay.banon



(Sebastian Gavarini) #3

Hi Shay,

I think that is a very good idea.

Sure, I have just opened it: https://github.com/elasticsearch/elasticsearch/issues/#issue/487

I don't want to rush you, but roughly when do you expect it to be
implemented? For now I can go with my plan to create many files,
maybe with template generation.

Thanks,
Sebastian.
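
The "many files with template generation" workaround could be sketched roughly as follows. It assumes one shared mapping template from which a per-language copy is derived, with only the "description" analyzer differing. `BASE_MAPPING`, the `title` field, and `mappings_for_languages` are hypothetical names used for illustration, and how the generated mappings are registered (one type or one index per language) is left open.

```python
import copy
import json

# Shared template; "title" is a hypothetical example field.
BASE_MAPPING = {
    "myDocument": {
        "dynamic": False,
        "properties": {
            "title": {"type": "string"},
            "description": {"type": "string"},
        },
    }
}


def mappings_for_languages(base, languages, field="description"):
    """Return {lang: mapping}, setting a per-language analyzer on one field."""
    result = {}
    for lang in languages:
        mapping = copy.deepcopy(base)  # never mutate the shared template
        for type_mapping in mapping.values():
            if field in type_mapping["properties"]:
                type_mapping["properties"][field]["analyzer"] = (
                    f"{lang}_stop_analyzer"
                )
        result[lang] = mapping
    return result


generated = mappings_for_languages(BASE_MAPPING, ["en", "es"])
for lang, mapping in generated.items():
    print(lang, json.dumps(mapping))
```

Because every copy comes from the same template, the only difference between the generated files is the analyzer name on the one field that needs stop word removal.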



(Shay Banon) #4

Already implemented and pushed to master:
https://github.com/elasticsearch/elasticsearch/issues/closed#issue/485.
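
The contents of the linked issue are not quoted in this thread, so what follows is only a sketch of how a document-driven analyzer might be wired up. It assumes the feature is exposed as an `_analyzer` mapping entry with a `path` option, which is how later Elasticsearch releases documented it; the `language_analyzer` field name, `make_document`, and `resolve_analyzer` are hypothetical. `resolve_analyzer` is a local stand-in for the lookup and is not Elasticsearch code.

```python
import json

# One mapping for all languages; the analyzer name is read from the document.
# "_analyzer" with "path" is an assumption about how the feature is exposed.
MAPPING = {
    "myDocument": {
        "_analyzer": {"path": "language_analyzer"},
        "properties": {
            "language_analyzer": {"type": "string", "index": "not_analyzed"},
            "description": {"type": "string"},
        },
    }
}


def make_document(description, lang):
    """Build a document that names its own analyzer (hypothetical field name)."""
    return {
        "language_analyzer": f"{lang}_stop_analyzer",
        "description": description,
    }


def resolve_analyzer(mapping, doc_type, document, fallback="default"):
    """Local illustration of the lookup: read the analyzer name at the path."""
    path = mapping[doc_type].get("_analyzer", {}).get("path")
    if path is None:
        return fallback
    return document.get(path, fallback)


doc = make_document("La casa de los abuelos", "es")
print(json.dumps(doc))
print(resolve_analyzer(MAPPING, "myDocument", doc))
```

With this approach the language only has to be known at indexing time, which is the constraint described in the first post.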



(Sebastian Gavarini) #5

Hi Shay,

I posted an update in issue https://github.com/elasticsearch/elasticsearch/issues/issue/485

It isn't working for me. I followed the example, but the analyzer
field is not found by AnalyzerMapper (line 85): the document in which
the mapper looks for the analyzer field doesn't contain it yet.

Sebastian.


