Wrong analyser used when indexing dynamic property

paulmclellan_00 · August 15, 2013, 5:07pm

Hi,

I'm seeing some unusual behaviour when indexing documents in Elasticsearch
and am hoping someone here might be able to help me solve the problem. So...

I have a unit test that's failing intermittently. The flow of the test is
as follows:

1. Initialise in-memory Elasticsearch cluster (one local node, no
   replicas)
2. Create new index
3. Create new type mapping
4. Index some documents
5. Refresh index and wait for all documents to be processed
6. Query Elasticsearch for documents

The type mapping I'm using includes the following dynamic template
definition:

{
    "participants": {
        "path_match": "participants.*",
        "mapping": {
            "type": "string",
            "store": "yes",
            "index": "analyzed",
            "analyzer": "whitespace"
        }
    }
}

This template is intended to produce fields of the form:

participants.new = [ 'user-1', 'user-2' ]
participants.removed = [ 'user-3' ]

The problem I have is that occasionally (perhaps once in every ten runs)
the test will fail because step 6 does not return all the expected
documents. When I check the indexed terms for the missing documents I see
that values in the 'participants' field have been split into separate
tokens on the '-' character. This seems to suggest that the default
analyzer is being used for indexing instead of the whitespace one.
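
To illustrate the symptom, here is a minimal Java sketch that contrasts splitting on whitespace only with splitting on every non-alphanumeric character. It is a rough stand-in for the two analyzers, not Lucene's actual tokenizers, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenizerSketch {

    // Approximates the whitespace analyzer: split on runs of whitespace only.
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Rough approximation (an assumption, not Lucene's algorithm) of a
    // default-style analyzer: lowercase the text and split on anything that
    // is not a letter or digit, so '-' separates tokens.
    public static List<String> defaultLikeTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String value = "user-1 user-2";
        System.out.println(whitespaceTokens(value));   // [user-1, user-2]
        System.out.println(defaultLikeTokens(value));  // [user, 1, user, 2]
    }
}
```

With the second tokenisation an exact term lookup for 'user-1' finds nothing, which would be consistent with the documents missing from the query results.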

So far I haven't been able to detect any pattern to the failures. The
unexpected tokenisation only affects a portion of the indexed documents and
can occur at any point in the indexing process (i.e. it isn't always the
first or last document that has problems).

Let me know if I can provide any additional information to help diagnose
this issue. Any help you can provide will be much appreciated as I'm not
sure what to try next.

Cheers,
Paul

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

spinscale · August 17, 2013, 4:35pm

Hey,

Can you provide that unit test somewhere so we can take a look?

--Alex


paulmclellan_00 · August 19, 2013, 10:51pm

Hi,

I can't provide the actual unit test unfortunately, but I'll put together
something less company-specific to recreate the problem and add it to the
post.

Cheers,
Paul


spinscale · August 20, 2013, 6:41am

Hey,

Also, please create a GitHub issue along with the test, so we have a
central place for this.

Thanks a lot for your additional efforts!

--Alex


paulmclellan_00 · August 20, 2013, 7:18pm

Hi Alex,

I've put together a simplified version of my original test and created issue
#3544 on GitHub, which includes a link to my gist (DynamicMappingTest.java).
One thing I noticed while building this test was that the problem only
happens when indexing documents asynchronously. If I include an 'actionGet()'
call at the end of each index operation then everything runs fine.
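
The difference between the two call patterns can be sketched with plain Java futures. This does not use the Elasticsearch client: `indexAsync` is a hypothetical stand-in for an index request that simply records the document id, and blocking with `join()` plays the role of 'actionGet()'. It shows only how the calls are ordered, not why the analyzer differs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IndexCallPatterns {

    // Hypothetical stand-in for an index request: records the id on a pool thread.
    static CompletableFuture<String> indexAsync(ExecutorService pool, String id, List<String> log) {
        return CompletableFuture.supplyAsync(() -> {
            synchronized (log) {
                log.add(id);
            }
            return id;
        }, pool);
    }

    // Send every request first, then wait: requests may be processed
    // concurrently and in any order.
    public static List<String> fireThenWait(List<String> ids) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> log = new ArrayList<>();
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (String id : ids) {
                futures.add(indexAsync(pool, id, log));
            }
            for (CompletableFuture<String> future : futures) {
                future.join();
            }
        } finally {
            pool.shutdown();
        }
        return log;
    }

    // Block on each request before sending the next one, so at most one
    // request is in flight at a time.
    public static List<String> waitAfterEach(List<String> ids) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> log = new ArrayList<>();
        try {
            for (String id : ids) {
                indexAsync(pool, id, log).join();
            }
        } finally {
            pool.shutdown();
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("doc-1", "doc-2", "doc-3", "doc-4");
        System.out.println("fire then wait:  " + fireThenWait(ids));  // order may vary
        System.out.println("wait after each: " + waitAfterEach(ids)); // submission order
    }
}
```

In the second pattern each document is fully processed before the next request is sent, which is the variant of the test that runs fine.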

Let me know if there's anything else I can do to help.

Cheers,
Paul

