Strange encoding-related regression when upgrading to 0.19.0


(Jan Fiedler) #1

I ran into a strange regression in a test case when upgrading to 0.19.0.
The test uses the Java API (and I therefore do not have a standalone curl
reproduction at hand). The test creates a brand new index every time, so
the issue cannot be related to data indexed with 0.18.x. Before digging
deeper I wanted to check whether anybody has seen similar issues. This is the
exception I am seeing when trying to call 'sourceAsString' on a search
hit.

java.lang.AssertionError
at org.elasticsearch.common.Unicode.UTF8toUTF16(Unicode.java:177)
at org.elasticsearch.common.Unicode.unsafeFromBytesAsUtf16(Unicode.java:106)
at org.elasticsearch.common.Unicode.fromBytes(Unicode.java:80)
at org.elasticsearch.search.internal.InternalSearchHit.sourceAsString(InternalSearchHit.java:195)


(Shay Banon) #2

Can you provide a simple Java recreation for this? If it's a problem, I would love to fix this for 0.19.1.



(Rotem Hermon) #3

Hi,

Just ran into the same issue with 0.19.1.

I've upgraded a server from 0.18.7. The client uses the Java API with a transport
connection. Getting sourceAsString from the hit now returns a malformed
string (containing all kinds of junk bytes).

Some more things that may help:
If I look at the toString of the response object itself, the sources look
fine, as opposed to getting them from the hit with sourceAsString.

When searching I get some errors from some indices (which is fine - the
errors themselves are something that I expect). But the formatting of the
query on the server side again seems wrong, as if the index didn't read the
string correctly.
For instance:

query[ConstantScore(NotDeleted(org.elasticsearch.common.lucene.search.AndFilter@67e4e437))],from[-1],size[30]:
Parse Failure [Failed to parse source
[:)\n\u0005\uD9CF\uDCE9ze$¼?query\uD9F6\uDCEFnstant_score\uD9D6\uDDA9lter\uD9CA\uDC6Ed\uD9DA\uDDA9lters\uD8A8\uDD34erms\uD9CB\uDD69d?_25801560\uD9AF\uDEFA?range\uD9DF\uDC34.$date\uD9CE\uDDB2omW2011-07-31T21:00:00.000ZtoW2012-03-27T12:13:24.991ZŒinclude_lower#Œinclude_upper#\uDBAF\uDEFA?not\uD9CF\uDE83term\uD9CE\uDD2Ctd#\uDBAF\uDEFB\uD9AF\uDEFB\uDACF\uDCEFrt\uD8A8\uDE7A?orderCdesc\uDBAF\uDE7B]]];
nested: SearchParseException[[posts-811][0]:
query[ConstantScore(NotDeleted(org.elasticsearch.common.lucene.search.AndFilter@67e4e437))],from[-1],size[30]:
Parse Failure [No mapping found for [pt.$date] in order to sort on]];

This of course blocks us from upgrading to the new release...



(Shay Banon) #4

The format of the query you show in the logs seems fine; it's just a SMILE
format query (and not JSON), which is what you get if you use the Java API
to build the query.

Do you use compression on the source field?
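
To illustrate the point about the query format: a SMILE-encoded payload starts with the three bytes `:)\n` followed by a version/flags byte, which is what appears at the start of the logged source. A minimal sketch of such a check is below; `SmileHeaderCheck` and `looksLikeSmile` are hypothetical names made up for this example and are not part of the Elasticsearch API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not an Elasticsearch class).
class SmileHeaderCheck {

    // A SMILE document begins with ':' ')' '\n' and then one version/flags byte.
    static boolean looksLikeSmile(byte[] data) {
        return data != null
                && data.length >= 4
                && data[0] == ':'
                && data[1] == ')'
                && data[2] == '\n';
    }

    public static void main(String[] args) {
        // Made-up sample payloads: a SMILE-style header and a plain JSON body.
        byte[] smileLike = new byte[] {':', ')', '\n', 0x05, 'x'};
        byte[] json = "{\"query\":{}}".getBytes(StandardCharsets.UTF_8);

        System.out.println("smile-like payload: " + looksLikeSmile(smileLike));
        System.out.println("json payload:       " + looksLikeSmile(json));
    }
}
```

Binary SMILE content printed as if it were text will look like junk in a log even when the query itself is valid.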



(Rotem Hermon) #5

Yes, sources are compressed.



(Shay Banon) #6

That might be the problem. I double-checked the code, and it seems that it
does not decompress the source when converting it to a string. You can call
source() to get the uncompressed byte array and build a string from it
until it's fixed: https://github.com/elasticsearch/elasticsearch/issues/1814
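
The workaround described above could look roughly like the sketch below. It is an illustration under assumptions, not code from the Elasticsearch codebase: `SourceWorkaround` and `sourceToString` are hypothetical names, the byte array in `main` is a made-up stand-in for what `hit.source()` would return, and the source is assumed to be UTF-8 encoded JSON.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not an Elasticsearch class).
class SourceWorkaround {

    // Builds a string from the already uncompressed source bytes,
    // assuming they hold UTF-8 encoded JSON.
    static String sourceToString(byte[] uncompressedSource) {
        return new String(uncompressedSource, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the bytes returned by hit.source(); in real code
        // this would be: byte[] source = hit.source();
        byte[] source = "{\"user\":\"jan\"}".getBytes(StandardCharsets.UTF_8);

        System.out.println(sourceToString(source));
    }
}
```

In client code, the call to sourceAsString() on each hit would be replaced by passing the result of source() to a helper like this.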



(Rotem Hermon) #7

OK, thanks for the quick response.


