Wrong analyser used when indexing dynamic property

Hi,

I'm seeing some unusual behaviour when indexing documents in Elasticsearch
and am hoping someone here might be able to help me solve the problem. So...

I have a unit test that's failing intermittently. The flow of the test is
as follows:

  1. Initialise in-memory Elasticsearch cluster (one local node, no
    replicas)
  2. Create new index
  3. Create new type mapping
  4. Index some documents
  5. Refresh index and wait for all documents to be processed
  6. Query Elasticsearch for documents

The type mapping I'm using includes the following dynamic template
definition:

{
    "participants": {
        "path_match": "participants.*",
        "mapping": {
            "type": "string",
            "store": "yes",
            "index": "analyzed",
            "analyzer": "whitespace"
        }
    }
}

This template is intended to produce fields of the form:

participants.new = [ 'user-1', 'user-2' ]
participants.removed = [ 'user-3' ]

The problem I have is that occasionally (perhaps once in every ten runs)
the test will fail because step 6 does not return all the expected
documents. When I check the indexed terms for the missing documents I see
that values in the 'participants' field have been split into separate
tokens on the '-' character. This seems to suggest that the default
analyzer is being used for indexing instead of the whitespace one.
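
To make the symptom concrete, here is a minimal plain-Java sketch of the
two tokenisation behaviours. It does not use Elasticsearch or Lucene: both
methods are simplified stand-ins (assumptions for illustration), and the
"standard-like" one only approximates the default analyzer by splitting on
any character that is not a letter or digit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified stand-ins for the two analyzers; NOT the real Lucene classes.
class AnalyzerSketch {

    // Approximates the whitespace analyzer: split on runs of whitespace only.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Rough approximation of the default analyzer: split on anything that
    // is not a letter or digit, then lowercase each token.
    static List<String> standardLikeTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String value = "user-1 user-2";
        System.out.println(whitespaceTokens(value));   // [user-1, user-2]
        System.out.println(standardLikeTokens(value)); // [user, 1, user, 2]
    }
}
```

The second output matches what I see for the affected documents: a term
query for 'user-1' finds nothing because only 'user' and '1' were indexed.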

So far I haven't been able to detect any pattern to the failures. The
unexpected tokenisation only affects a portion of the indexed documents and
can occur at any point in the indexing process (i.e. it isn't always the
first or last document that has problems).

Let me know if I can provide any additional information to help diagnose
this issue. Any help you can provide will be much appreciated as I'm not
sure what to try next.

Cheers,
Paul

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hey,

Can you share that unit test somewhere so we can take a look?

--Alex

Hi,

I can't provide the actual unit test, unfortunately, but I'll put together
something less company-specific to recreate the problem and add it to the
post.

Cheers,
Paul

Hey,

Also, please create a GitHub issue along with the test, so we have a
central place for this.

Thanks a lot for your additional efforts!

--Alex

Hi Alex,

I've put together a simplified version of my original test and created issue
#3544 on GitHub, which includes a link to my gist at
https://gist.github.com/pmclellan/64b192537c97529ec2e4. One thing I noticed
while building this test was that the problem only happens when indexing
documents asynchronously. If I include an 'actionGet()' call at the end of
each index operation then everything runs fine.
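
For anyone following along without opening the gist, the two indexing modes
differ roughly as in the sketch below. This is plain Java with no
Elasticsearch dependency, so it is only an analogy: 'indexAsync' is a
hypothetical stand-in for an asynchronous index request, and 'join()' plays
the role that 'actionGet()' plays in the real client. It shows the shape of
the two call patterns, not the cause of the bug.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class IndexingModeSketch {

    // Hypothetical stand-in for an async index request: it only records the id.
    static CompletableFuture<Void> indexAsync(ExecutorService pool,
                                              List<String> log, String docId) {
        return CompletableFuture.runAsync(() -> log.add(docId), pool);
    }

    // Blocking mode: wait for each request before sending the next one,
    // analogous to calling actionGet() after every index operation.
    static List<String> indexBlocking(List<String> docIds) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> log = new CopyOnWriteArrayList<>();
        try {
            for (String id : docIds) {
                indexAsync(pool, log, id).join();
            }
        } finally {
            pool.shutdown();
        }
        return log;
    }

    // Asynchronous mode: send every request first, wait only at the end,
    // so requests may be processed concurrently and in any order.
    static List<String> indexConcurrently(List<String> docIds) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> log = new CopyOnWriteArrayList<>();
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (String id : docIds) {
                pending.add(indexAsync(pool, log, id));
            }
            CompletableFuture.allOf(pending.toArray(new CompletableFuture<?>[0])).join();
        } finally {
            pool.shutdown();
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("doc-1", "doc-2", "doc-3");
        System.out.println(indexBlocking(ids));            // [doc-1, doc-2, doc-3]
        System.out.println(indexConcurrently(ids).size()); // 3
    }
}
```

In the blocking mode only one request is ever in flight, which is the mode
where I have not been able to reproduce the failure.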

Let me know if there's anything else I can do to help.

Cheers,
Paul
