Correct way to handle faceted search tokenization issue for dynamic fields?

Onur_Aktas · October 1, 2013, 10:26am

Hi all,

I want to index an object that has some static fields and some dynamic
fields (kept in a HashMap) holding product technical feature name/value
pairs. As you can guess, there are thousands of different product types
from various product categories, which results in thousands of different
technical features/attributes.

Each technical feature field name starts with f_, so I applied a mapping
like the one below.

    "dynamic_templates": [
       {
          "template_feature": {
             "mapping": {
                "type": "multi_field",
                "fields": {
                   "{name}": {
                      "type": "{dynamic_type}",
                      "index": "analyzed"
                   },
                   "org": {
                      "type": "{dynamic_type}",
                      "index": "not_analyzed"
                   }
                }
             },
             "match": "f_*"
          }
       }
    ]

However, when I check the mapping I see that Elasticsearch creates a
separate mapping entry for each inserted technical feature. This means that
as the product list grows, the mapping will grow significantly; there are
about 2000 different technical features in total.
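
To illustrate why the mapping grows, here is a small Python simulation of how a dynamic template turns into one concrete entry per distinct field name. It is only a sketch of the behaviour described above, not Elasticsearch's own code; the functions `substitute` and `expand_template` and the sample field names are assumptions made for the example.

```python
import copy
import fnmatch

# The dynamic template from the post, written as a plain dict.
TEMPLATE = {
    "match": "f_*",
    "mapping": {
        "type": "multi_field",
        "fields": {
            "{name}": {"type": "{dynamic_type}", "index": "analyzed"},
            "org": {"type": "{dynamic_type}", "index": "not_analyzed"},
        },
    },
}


def substitute(node, name, dynamic_type):
    """Recursively replace the {name} and {dynamic_type} placeholders."""
    if isinstance(node, dict):
        return {
            substitute(key, name, dynamic_type): substitute(value, name, dynamic_type)
            for key, value in node.items()
        }
    if isinstance(node, str):
        return node.replace("{name}", name).replace("{dynamic_type}", dynamic_type)
    return node


def expand_template(mapping, field_name, dynamic_type="string"):
    """Add a concrete entry to `mapping` the first time a matching field is seen."""
    if field_name in mapping:
        return
    if not fnmatch.fnmatchcase(field_name, TEMPLATE["match"]):
        return
    template_copy = copy.deepcopy(TEMPLATE["mapping"])
    mapping[field_name] = substitute(template_copy, field_name, dynamic_type)


mapping = {}
for field in ["f_material", "f_size", "f_material", "f_style", "title"]:
    expand_template(mapping, field)

# One concrete mapping entry exists per distinct f_* field name.
print(sorted(mapping))
```

With about 2000 distinct feature names, the same process would leave about 2000 such entries in the mapping.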

    "f_material": {
       "type": "multi_field",
       "fields": {
          "f_material": { "type": "string" },
          "org": {
             "type": "string",
             "index": "not_analyzed",
             "omit_norms": true,
             "index_options": "docs",
             "include_in_all": false
          }
       }
    },
    "f_period_type": {
       "type": "multi_field",
       "fields": {
          "f_period_type": { "type": "string" },
          "org": {
             "type": "string",
             "index": "not_analyzed",
             "omit_norms": true,
             "index_options": "docs",
             "include_in_all": false
          }
       }
    },
    "f_production_type": {
       "type": "multi_field",
       "fields": {
          "f_production_type": { "type": "string" },
          "org": {
             "type": "string",
             "index": "not_analyzed",
             "omit_norms": true,
             "index_options": "docs",
             "include_in_all": false
          }
       }
    },
    "f_size": {
       "type": "multi_field",
       "fields": {
          "f_size": { "type": "string" },
          "org": {
             "type": "string",
             "index": "not_analyzed",
             "omit_norms": true,
             "index_options": "docs",
             "include_in_all": false
          }
       }
    },
    "f_style": {
       "type": "multi_field",
       "fields": {
          "f_style": { "type": "string" },
          "org": {
             "type": "string",
             "index": "not_analyzed",
             "omit_norms": true,
             "index_options": "docs",
             "include_in_all": false
          }
       }
    }

I applied this mapping only so that I can run a faceted search without
whitespace tokenization, but I do not think it is the correct way.
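
For illustration, a terms facet request against the not_analyzed sub-field could be built as follows. This is a minimal sketch that uses the facets request shape shown elsewhere in this thread; the helper name `terms_facet_request` is hypothetical, and the `match_all` query is just a placeholder.

```python
import json


def terms_facet_request(field_name, sub_field="org"):
    """Build a search body with one terms facet on the not_analyzed sub-field."""
    return {
        "query": {"match_all": {}},
        "facets": {
            field_name: {
                # multi_field sub-fields are addressed as <field>.<sub_field>
                "terms": {"field": "%s.%s" % (field_name, sub_field)}
            }
        },
    }


print(json.dumps(terms_facet_request("f_material"), indent=2))
```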

Could you please advise on the correct way to do this?

KR,
Onur

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Tchinkatchuk · November 21, 2013, 10:43am

Hi,

I do it exactly the same way and have the same issue.
Could you help me if you found the answer?

Thanks.

Onur_Aktas · November 21, 2013, 10:58am

On Tuesday, October 1, 2013 1:26:01 PM UTC+3, Onur Aktaş wrote:

[...]

Onur_Aktas · November 21, 2013, 11:01am

Hi Georges,

We decided to give each feature a generic name and to use that name instead
of its title (name -> title), where the name is a prefix plus the position
of the feature within its category.

For example, let's say Category A and Category B have the following
technical feature fields.

Category A: Material, Size
Category B: Color, Size

Then we mapped Category A as:
p01 -> Material
p02 -> Size

Category B as:
p01 -> Color
p02 -> Size

We then assumed that any category can have at most 20 features, and mapped
each feature by its name instead of its title.

Finally we had a mapping something like this:

     "p01":{
        "type":"multi_field",
        "fields":{
           "analyzed":{
              "type":"string",
              "index":"analyzed"
           },
           "notanalyzed":{
              "type":"string",
              "index":"not_analyzed"
           }
        }
     },
     "p02":{
        "type":"multi_field",
        "fields":{
           "analyzed":{
              "type":"string",
              "index":"analyzed"
           },
           "notanalyzed":{
              "type":"string",
              "index":"not_analyzed"
           }
        }
     }

... and so on, up to p20.
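
Since the twenty entries are identical apart from their names, the mapping can be generated rather than written by hand. The following is a minimal sketch; the helper names `slot_name` and `build_slot_mapping` are assumptions made for the example, and the limit of 20 is the one stated above.

```python
import json

MAX_FEATURES = 20  # per-category limit assumed in the post


def slot_name(position):
    """Return the generic field name for a 1-based position, e.g. 1 -> 'p01'."""
    return "p%02d" % position


def build_slot_mapping(max_features=MAX_FEATURES):
    """Build the properties for p01 .. pNN, each with both sub-fields."""
    properties = {}
    for position in range(1, max_features + 1):
        properties[slot_name(position)] = {
            "type": "multi_field",
            "fields": {
                "analyzed": {"type": "string", "index": "analyzed"},
                "notanalyzed": {"type": "string", "index": "not_analyzed"},
            },
        }
    return properties


slot_mapping = build_slot_mapping()
print(json.dumps(slot_mapping["p01"], indent=2))
print(len(slot_mapping))
```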

So a product will have data like this:
Product A
p01 -> Steel
p02 -> 15 meters.

Pros
You do not have to create (category count * unique feature name) mappings.

Cons
Once a name has been assigned to a feature, you must not change that
assignment; otherwise existing products will show wrong data.
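
The translation between feature titles and generic names, and the drawback listed under Cons, can be sketched in Python as follows. The `CATEGORY_SLOTS` table and the helpers `to_document` and `to_titles` are hypothetical; they only mirror the Category A / Category B example above.

```python
# Per-category slot tables, following the example in the post.
# The position in each list decides the generic name: first -> p01, second -> p02.
CATEGORY_SLOTS = {
    "A": ["Material", "Size"],
    "B": ["Color", "Size"],
}


def slot_name(position):
    """Return the generic field name for a 1-based position, e.g. 1 -> 'p01'."""
    return "p%02d" % position


def to_document(category, features):
    """Translate {title: value} into {generic name: value} for indexing."""
    titles = CATEGORY_SLOTS[category]
    return {
        slot_name(titles.index(title) + 1): value
        for title, value in features.items()
    }


def to_titles(category, document):
    """Translate {generic name: value} back into {title: value} for display."""
    titles = CATEGORY_SLOTS[category]
    return {titles[int(slot[1:]) - 1]: value for slot, value in document.items()}


doc = to_document("A", {"Material": "Steel", "Size": "15 meters"})
print(doc)
print(to_titles("A", doc))

# If the assignment is changed after indexing, old documents are mislabeled.
CATEGORY_SLOTS["A"] = ["Size", "Material"]
print(to_titles("A", doc))
```

The last call labels "Steel" as the Size, which is the wrong-data problem described under Cons.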

Hope it helps.

KR,
Onur



Tchinkatchuk · November 27, 2013, 4:54pm

Thanks for the answer.
Unfortunately, I do not want to rename all my dynamic attributes.

Here is the mapping configuration I have:

{
  "article": {
    "_default_": {
      "dynamic_templates": [
        {
          "base": {
            "match": "*",
            "mapping": {
              "type": "multi_field",
              "fields": {
                "{name}": {
                  "type": "string",
                  "index": "analyzed",
                  "store": "yes",
                  "analyzer": "my_string_analyzer",
                  "search_analyzer": "default",
                  "index_analyzer": "default_edge_n_grams"
                },
                "raw_value": { "type": "string", "analyzer": "not_analyzed" }
              }
            }
          }
        }
      ]
    }
  }
}

I want to be able to get facets this way:

GET _search
{
  "facets": {
    "brand": {
      "terms": {
        "field": "brand.raw_value"
      }
    }
  },
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "***"
        }
      }
    }
  }
}

Because if I facet on brand rather than brand.raw_value, my brand value is
tokenized: "Elastic Search" will produce two facet entries, "Elastic" and
"Search", instead of just one.

Such a shame.
Am I missing something?
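
The difference between faceting on an analyzed field and on a not_analyzed field can be illustrated with a small Python simulation. This is only a sketch: splitting on whitespace and lowercasing is a simplified stand-in for a real analyzer, and the sample brand values and function names are made up for the example.

```python
from collections import Counter

# Hypothetical brand values, one per document.
brands = ["Elastic Search", "Elastic Search", "Apache Lucene"]


def facet_counts_analyzed(values):
    """Count terms as they would appear after simplified analysis."""
    counts = Counter()
    for value in values:
        # Each distinct token is counted once per document.
        counts.update(set(value.lower().split()))
    return counts


def facet_counts_not_analyzed(values):
    """Count whole values, as stored by a field that is not analyzed."""
    return Counter(values)


print(facet_counts_analyzed(brands))
print(facet_counts_not_analyzed(brands))
```

One detail that may be worth checking: the raw_value sub-field in the mapping above sets "analyzer": "not_analyzed", whereas the mappings earlier in this thread use "index": "not_analyzed" to switch analysis off.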