How to resolve NumberFormatException issues caused by an empty string


(pulkitsinghal) #1

BTW, please forgive me in advance for even mentioning the word Solr in
this forum because I know ES folks cringe at comparisons between the
two technologies. I understand they are different and I am simply
making an analogy for the "Data Input & Indexing behavior" angle ...
so bear with me here.

The stacktrace from the ES server's NFE is at the end of this thread.

I have faced similar NumberFormatException issues before in Solr as
well. I think these happen simply because the underlying Lucene isn't
ready to accept/ignore an empty string for numbers or date/time data.
So I am assuming that this is no different for ES which is built atop
Lucene as well. (1) Let me know if you agree with me so far.

In Solr, I got around this by having its Data Import Handler run
scripts on the incoming documents, either placing a number like -1 as
a placeholder or removing the field explicitly during document
construction.

So with ES, I was hoping it would be more straightforward. My feed in
ES is the magical and much revered CouchDB river :) And I try not to
define the mappings myself because ES does such a great job of
figuring them out and it is one of the many many many conveniences of
ES that I want to take advantage of.

I was hoping that ES would acknowledge the fact that letting empty
strings through (for core type fields like number, date and time) has
no merit and would simply ignore the empty values. (2) Is this a "bad"
thing to hope for?

The data that failed looks like:

"shipping" :
[
  {
    "nextDay" : "",
    "vendorDelivery" : 69.99,
    "ground" : "",
    "secondDay" : ""
  }
]

So imagine my surprise: ES did well enough to guess that
shipping.nextDay was supposed to be a number, but then did not
ignore the junk pumped into it as an empty string.

(2) I'm not bad-mouthing ES, I'm asking: Can we expect ES to tackle
this, or would we be wrong to place such an expectation on ES?

(3) If the data appropriately had a null value, then ES would have
handled it already: when there is a (JSON) null value for the field
and null_value has not been set up, ES defaults to not adding the
field at all. That is not the case here, so what would the
workaround be, if any? Sanitize my data? Oh lord, the tears are rolling
down my cheeks, please say that's not my only option.
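
For reference, the "sanitize my data" option can be sketched in a few lines of Python. This is only an illustration under assumptions: it presumes the documents can be intercepted before they reach ES (nothing in this thread says the CouchDB river offers such a hook), and the helper name strip_empty_strings is hypothetical. It drops every empty-string value so that ES sees a missing field instead of "".

```python
from typing import Any


def strip_empty_strings(value: Any) -> Any:
    """Return a copy of a JSON-like value with empty-string entries removed."""
    if isinstance(value, dict):
        # Drop keys whose value is "", and clean whatever remains.
        return {
            key: strip_empty_strings(item)
            for key, item in value.items()
            if item != ""
        }
    if isinstance(value, list):
        # Drop "" elements from arrays as well.
        return [strip_empty_strings(item) for item in value if item != ""]
    return value


doc = {
    "shipping": [
        {"nextDay": "", "vendorDelivery": 69.99, "ground": "", "secondDay": ""}
    ]
}
cleaned = strip_empty_strings(doc)
print(cleaned)  # {'shipping': [{'vendorDelivery': 69.99}]}
```

Note that this removes empty strings from genuine string fields too, which may or may not be acceptable for a given data set.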

Please let me know what you think.

=== STACKTRACE ====
org.elasticsearch.index.mapper.MapperParsingException: Failed to parse [shipping.nextDay]
    at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:312)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:577)
    at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:443)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:491)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeArray(ObjectMapper.java:557)
    at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:435)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:567)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:491)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:289)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:131)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:464)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:377)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)
Caused by: java.lang.NumberFormatException: empty String
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:992)
    at java.lang.Double.parseDouble(Double.java:510)
    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.doubleValue(AbstractXContentParser.java:88)
    at org.elasticsearch.index.mapper.core.DoubleFieldMapper.parseCreateField(DoubleFieldMapper.java:227)
    at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:299)
    ... 14 more


(Shay Banon) #2

This actually has nothing to do with Lucene, but with how elasticsearch
derives field types and handles "" text for numeric values.

First, deriving a type for a field. When a field is first introduced, its
type is derived from its value. This will not work well if the first
document introducing nextDay has an empty string, since the type for the
field will then be string, and not a number (long / double).

As for empty text, then yes, it will fail to index the doc if an empty text
is provided and it's a numeric type. As you mentioned, a null value for the
field is what it handles; it does not handle empty text as a null value.
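
The two behaviours described above can be mimicked with a toy model in Python. To be clear, this is not elasticsearch code: derive_type and parse_as are hypothetical names, and the rules are simplified to what this thread describes (the type comes from the first value seen, null is skipped, and anything else is parsed strictly).

```python
def derive_type(value):
    """Toy stand-in for deriving a field type from the first value seen."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"  # an empty string lands here


def parse_as(field_type, value):
    """Parse a value against an already-derived type; only null is skipped."""
    if value is None:
        return None  # null: the field is simply not added
    if field_type == "double":
        # float("") raises ValueError, much like Double.parseDouble("")
        # raises NumberFormatException in the stacktrace.
        return float(value)
    if field_type == "long":
        return int(value)
    return value


print(derive_type(69.99))  # double
print(derive_type(""))     # string

try:
    parse_as("double", "")
except ValueError as err:
    print("failed to parse:", err)
```

The first call order matters: a field first seen as 69.99 becomes a double and later rejects "", while a field first seen as "" becomes a string and never gets numeric handling at all.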


(pulkitsinghal) #3

Hello Shay,

Thanks for the info!

May I ask: "Does it make sense for ES to handle an empty string as it
would handle a null value, once it has already derived that a field is
numeric based on the first value introduced to the system?"

Thoughts?

- Pulkit


(Shay Banon) #4

It can make sense, maybe with a special flag in the mapping. But note that
you still have a problem if your first doc has an empty string and you did
not specify the mapping initially, since the field will then be detected as
a string type, and not a numeric type.
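
As a rough sketch of specifying the mapping up front, the snippet below builds a mapping body that declares the shipping fields as double. The type name "product" and the helper build_mapping are made up for illustration, and the exact request used to submit the body depends on the elasticsearch version, so check the put-mapping documentation for yours. This only addresses the first-document detection problem; an empty string sent to a double field would still be rejected.

```python
import json

# Field names taken from the sample document in the original post.
SHIPPING_FIELDS = ["nextDay", "vendorDelivery", "ground", "secondDay"]


def build_mapping(type_name: str) -> dict:
    """Build a mapping body that declares every shipping field as double."""
    return {
        type_name: {
            "properties": {
                "shipping": {
                    "properties": {
                        field: {"type": "double"} for field in SHIPPING_FIELDS
                    }
                }
            }
        }
    }


# "product" is a hypothetical type name.
mapping = build_mapping("product")
print(json.dumps(mapping, indent=2))
```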

