Problem indexing with my analyzer


(Tanguy Bernard) #1

Hello,
I have an issue when I index one particular field, "note_source" (SQL
longtext).
I use the same analyzer for every field (except date_source and id_source),
but for "note_source" I get a "WARN monitor.jvm" message in the logs.
When I remove "note_source", everything is fine. If I don't use an analyzer
on "note_source", everything is fine, but if I use my analyzer on
"note_source", I get crashes.

I think I have enough memory; I have set ES_HEAP_SIZE.
Maybe my problem is with accents (ASCII, UTF-8)?

Can you help me with this?

My Setting

public function createSetting($pf){
    $params = array('index' => $pf, 'body' => array(
        'settings' => array(
            'number_of_shards' => 5,
            'number_of_replicas' => 0,
            'analysis' => array(
                // Custom token filter producing n-grams of 3 to 250 characters
                'filter' => array(
                    'nGram' => array(
                        "token_chars" => array(),
                        "type" => "nGram",
                        "min_gram" => 3,
                        "max_gram" => 250
                    )
                ),
                // Analyzer applied to the text fields
                'analyzer' => array(
                    'reuters' => array(
                        'type' => 'custom',
                        'tokenizer' => 'standard',
                        'filter' => array('lowercase', 'asciifolding', 'nGram')
                    )
                )
            )
        )
    ));
    $this->elasticsearchClient->indices()->create($params);
    return;
}

My Indexing

public function indexTable($pf, $typeElement){
    // Register the JDBC river by indexing its definition as the "_meta"
    // document of the "_river" index.
    $params = array(
        "index" => '_river',
        "type" => $typeElement,
        "id" => "_meta",
        "body" => array(
            "type" => "jdbc",
            "jdbc" => array(
                "url" => "jdbc:mysql://ip/name",
                "user" => 'root',
                "password" => 'mdp',
                "index" => $pf,
                "type" => $typeElement,
                // The SQL statement has to be a quoted string
                "sql" => "select id_source as _id, id_sous_theme, titre_source, desc_source, note_source, adresse_source, type_source, date_source from source",
                "max_bulk_requests" => 5,
            )
        )
    );

    $this->elasticsearchClient->index($params);
}

Thanks in advance.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/dd6e60dc-d394-4d7d-b994-2105002d7bd7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Tanguy Bernard) #2

Additional information:
My "note_source" field contains pictures (.jpg, .png, ...) encoded in base64, as well as text.

For my mapping I used:
"type" => "string"
"analyzer" => "reuters" (the name of my analyzer)

Any ideas?

On Thursday, 19 June 2014 at 17:57:46 UTC+2, Tanguy Bernard wrote:



(Cédric Hourcade) #3

Does that mean you're applying the "reuters" analyzer to your
base64-encoded pictures?

I guess it generates a really huge number of tokens for each entry
because of your nGram filter (with a max of 250).
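
To give a rough sense of scale, here is a minimal sketch that counts how many grams an nGram filter would emit for a single token. The function name countNgrams and the token lengths (8 and 250 characters) are illustrative assumptions, not figures from the actual data; the count simply sums the possible start positions for each gram length.

```php
<?php
// Count the n-grams an nGram token filter would emit for ONE token.
// For every gram length n between $minGram and $maxGram there are
// ($tokenLength - n + 1) possible start positions.
function countNgrams(int $tokenLength, int $minGram, int $maxGram): int
{
    $total = 0;
    $longest = min($maxGram, $tokenLength);
    for ($n = $minGram; $n <= $longest; $n++) {
        $total += $tokenLength - $n + 1;
    }
    return $total;
}

// An ordinary 8-character word versus a hypothetical 250-character
// run of base64 characters.
foreach ([8, 250] as $length) {
    printf(
        "length %3d: %5d grams with 3..250, %3d grams with 3..4\n",
        $length,
        countNgrams($length, 3, 250),
        countNgrams($length, 3, 4)
    );
}
```

With these numbers, an 8-character word gives 21 grams, while a 250-character token gives 30876 grams with min/max 3..250 and 495 with 3..4.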

Cédric Hourcade
ced@wal.fr

On Fri, Jun 20, 2014 at 9:09 AM, Tanguy Bernard
bernardtanguy1pro@gmail.com wrote:



(Tanguy Bernard) #4

Yes, I am applying "reuters" to my whole document (composed of text and pictures).
My goal is to search the text of the document by any word or
part of a word.

Yes, the problem is my nGram filter.
How do I solve it? Decrease the nGram max? Switch to another
analyzer that still satisfies my goal?

On Friday, 20 June 2014 at 10:58:49 UTC+2, Cédric Hourcade wrote:



(Cédric Hourcade) #5

If you are only searching in the text, you should index the images in
another field, with no analyzer ("index: not_analyzed") or,
even better, "index: no" (not indexed). If you need to retrieve the
image data, it's still in the _source.
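
As a sketch of what such a mapping could look like in the PHP array style used earlier in the thread: the field names "note_text" and "note_raw" and the helper buildSourceMapping are hypothetical, and the "string" / "index: no" syntax is the one from the Elasticsearch version discussed here.

```php
<?php
// Sketch of a mapping where only the cleaned text is analyzed.
// "note_text" and "note_raw" are hypothetical field names; "reuters" is the
// analyzer defined in the index settings earlier in the thread.
function buildSourceMapping(string $typeElement): array
{
    return array(
        $typeElement => array(
            'properties' => array(
                // Searchable text, analyzed with the custom analyzer.
                'note_text' => array('type' => 'string', 'analyzer' => 'reuters'),
                // Original content with images: kept in _source, never indexed.
                'note_raw' => array('type' => 'string', 'index' => 'no'),
            ),
        ),
    );
}

echo json_encode(buildSourceMapping('source'), JSON_PRETTY_PRINT), "\n";
```

The returned array would presumably be passed as the 'body' of a putMapping call on the client's indices() namespace, together with the 'index' and 'type' parameters.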

But to be honest, I wouldn't even store this kind of information in ES:
your index is going to be bigger, merges are going to be slower... I'd
keep the binary files stored elsewhere.

Cédric Hourcade
ced@wal.fr

On Fri, Jun 20, 2014 at 11:25 AM, Tanguy Bernard
bernardtanguy1pro@gmail.com wrote:



(Tanguy Bernard) #6

I set max_gram=20. It's better, but towards the end I get this message many times:

[2014-06-20 11:42:14,201][WARN ][monitor.jvm ] [ik-test2]
[gc][young][528][263] duration [2s], collections [1]/[2.1s], total
[2s]/[43.9s], memory [536mb]->[580.2mb]/[1015.6mb], all_pools {[young]
[22.5mb]->[22.3mb]/[66.5mb]}{[survivor] [14.9kb]->[49.3kb]/[8.3mb]}{[old]
[513.4mb]->[557.8mb]/[940.8mb]}

I set ES_HEAP_SIZE to 2G, which I think is enough.
Is something wrong?
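
One way to read this warning, offered as a sketch rather than a diagnosis: in the "memory [before]->[after]/[total]" part of a monitor.jvm line, the last figure is the total heap the JVM reports. Here it is 1015.6mb, which looks like a heap of about 1 GB rather than 2 GB, so it may be worth checking that ES_HEAP_SIZE is really picked up by the running node. The small helper below (parseGcMemory is a hypothetical name, and it only handles sizes given in mb) pulls the three figures out of such a line.

```php
<?php
// Extract the heap figures from a monitor.jvm warning line.
// Returns sizes in megabytes, or null if the line has no
// "memory [..mb]->[..mb]/[..mb]" part. Only the "mb" unit is handled.
function parseGcMemory(string $logLine): ?array
{
    $pattern = '/memory \[([\d.]+)mb\]->\[([\d.]+)mb\]\/\[([\d.]+)mb\]/';
    if (!preg_match($pattern, $logLine, $m)) {
        return null;
    }
    return array(
        'before' => (float) $m[1],
        'after' => (float) $m[2],
        'max' => (float) $m[3],
    );
}

// The log line from the message above, joined back into one string.
$line = '[gc][young][528][263] duration [2s], collections [1]/[2.1s], total '
      . '[2s]/[43.9s], memory [536mb]->[580.2mb]/[1015.6mb]';
$heap = parseGcMemory($line);
printf("heap after GC: %.1f of %.1f mb\n", $heap['after'], $heap['max']);
```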



(Tanguy Bernard) #7

Users copy and paste the content of an HTML page, and I index that
content. I take the entire document, images included. I can't change this
behavior.

I set max_gram=20. It's better, but towards the end I get this message many times:

[2014-06-20 11:42:14,201][WARN ][monitor.jvm ] [ik-test2]
[gc][young][528][263] duration [2s], collections [1]/[2.1s], total
[2s]/[43.9s], memory [536mb]->[580.2mb]/[1015.6mb], all_pools {[young]
[22.5mb]->[22.3mb]/[66.5mb]}{[survivor] [14.9kb]->[49.3kb]/[8.3mb]}{[old]
[513.4mb]->[557.8mb]/[940.8mb]}

I set ES_HEAP_SIZE to 2G, which I think is enough.
Is something wrong?

On Friday, 20 June 2014 at 11:45:22 UTC+2, Cédric Hourcade wrote:



(Cédric Hourcade) #8

If your base64-encoded images are long, the standard tokenizer is going
to split them into a lot of tokens.

These tokens are often going to be a lot longer than ordinary words,
so your nGram filter will generate even more tokens, far more than
with ordinary text. That may be your problem.

You should really try to strip the encoded images from your documents
with a simple regex before indexing them. If you need to keep the
source, put the raw text in an unindexed field and the cleaned text in
another.
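
A minimal sketch of that idea in PHP, assuming the images are embedded as "data:image/...;base64,..." URIs and that documents pass through PHP before indexing (which is not the case with the JDBC river setup shown in the first post). The function names and the field names "note_raw" and "note_text" are hypothetical.

```php
<?php
// Remove inline base64 images ("data:image/...;base64,....") from pasted HTML.
// The pattern is an assumption about how the images are embedded.
function stripBase64Images(string $html): string
{
    return preg_replace('~data:image/[a-zA-Z0-9.+-]+;base64,[A-Za-z0-9+/=]+~', '', $html);
}

// Build the two fields: a raw copy (meant for an unindexed field) and a
// cleaned copy (meant for the analyzed field).
// "note_raw" and "note_text" are hypothetical field names.
function buildNoteFields(string $noteSource): array
{
    return array(
        'note_raw' => $noteSource,
        'note_text' => stripBase64Images($noteSource),
    );
}

$html = '<p>Hello</p><img src="data:image/png;base64,iVBORw0KGgo="><p>World</p>';
print_r(buildNoteFields($html));
```

The pattern deliberately excludes whitespace from the base64 character class so that it stops at the end of the encoded payload instead of swallowing the text that follows.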



(Tanguy Bernard) #9

Thank you, Cédric Hourcade!

On Friday, 20 June 2014 at 15:32:29 UTC+2, Cédric Hourcade wrote:



(Clinton Gormley) #10

You seriously don't want ngrams of length 3..250! That's ENORMOUS.

Typically you set min/max to 3 or 4, and that's it.

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_ngrams_for_partial_matching.html#_ngrams_for_partial_matching
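
To make the 3-or-4 advice concrete, here is a small sketch that lists the grams of one token. The function name ngrams is hypothetical, tokens are assumed to be ASCII (as they mostly would be after the asciifolding filter), and the order in which Elasticsearch emits the grams may differ from the order here.

```php
<?php
// List every n-gram of $token with a length between $minGram and $maxGram.
// Assumes an ASCII token, so byte-based strlen()/substr() are sufficient.
function ngrams(string $token, int $minGram, int $maxGram): array
{
    $grams = array();
    $length = strlen($token);
    for ($start = 0; $start < $length; $start++) {
        // Grow the gram from $minGram up to $maxGram while it still fits.
        for ($n = $minGram; $n <= $maxGram && $start + $n <= $length; $n++) {
            $grams[] = substr($token, $start, $n);
        }
    }
    return $grams;
}

echo implode(' ', ngrams('quick', 3, 4)), "\n";
```

For "quick" with min/max 3..4 this yields five grams: qui, quic, uic, uick, ick. A query on any of those fragments can then match the word.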

On 20 June 2014 16:05, Tanguy Bernard bernardtanguy1pro@gmail.com wrote:



(Tanguy Bernard) #11

Yes, I did not know how nGram works!
I found a perfect solution for my picture (base64) problem: use
'char_filter' => array('html_strip').

public function createSetting($pf){
    $params = array('index' => $pf, 'body' => array(
        'settings' => array(
            'number_of_shards' => 5,
            'number_of_replicas' => 0,
            'analysis' => array(
                'filter' => array(
                    'MYnGram' => array(
                        "token_chars" => array(),
                        "type" => "nGram",
                        "min_gram" => 3,
                        "max_gram" => 20
                    )
                ),
                'analyzer' => array(
                    'reuters' => array(
                        'type' => 'custom',
                        'tokenizer' => 'standard',
                        'filter' => array('lowercase', 'asciifolding', 'MYnGram'),
                        // Strip HTML markup before the text is tokenized
                        'char_filter' => array('html_strip'),
                    ),
                )
            )
        )
    ));
    $this->elasticsearchClient->indices()->create($params);
}
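
A rough illustration of why this helps, with one loud assumption: PHP's strip_tags() stands in here for Elasticsearch's html_strip char filter, and the two are not the same implementation. The point is only that a base64 image embedded in an img tag's attribute disappears together with the tag. The helper name visibleText and the sample note are hypothetical.

```php
<?php
// Illustration only: strip_tags() is a stand-in for the html_strip char
// filter, not the same implementation.
function visibleText(string $html): string
{
    // Removing the tags also removes their attributes, including any
    // base64 payload stored in an <img src="data:..."> attribute.
    return trim(strip_tags($html));
}

$note = '<p>Meeting notes</p><img src="data:image/png;base64,iVBORw0KGgo=">';
echo visibleText($note), "\n";
```

If the base64 data were pasted as plain text rather than inside a tag, stripping markup would presumably leave it in place, so this approach depends on how the images are embedded.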

Thanks to all of you !

On Saturday, 21 June 2014 at 00:35:39 UTC+2, Clinton Gormley wrote:


