Lowering the effort to use Elasticsearch when loading data


(Rinaldo DiGiorgio) #1

Hi,

I have some comments on working with Elasticsearch from an end-user
perspective. Data loading often has issues. Expecting users to look at log
files that are distributed across many nodes is odd for a product
that is all about finding things quickly. Are there any plans or
discussions about providing an audit trail, perhaps as a searchable
index, so one can find issues quickly? Am I missing something? What is the
best practice for debugging a mapping?

  1. Post it and go find the log file; configure a one-node system so you
    can find it reasonably quickly
  2. Use a unit test framework and call the indexing code directly so you can
    get the results
  3. Set a (yet to be found) synchronous param so the returned JSON carries
    the results back
  4. Use validate, though I don't think validate works when adding documents

Data issues never go away. I would be interested in helping with this,
if it is considered an issue. Having the error state live only in the log
files is a bit concerning.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f1249d39-be89-49c6-8b0b-1b92fae6fa9a%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

I'm not sure I understand. Elasticsearch returns usage errors to the
API so the client can react appropriately. Only failures are in the log;
they are often decoupled from the client action and cannot yet be
exposed to the caller. Some usage errors or misconceptions are not
detectable, and Elasticsearch happily continues.
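
To illustrate what reacting to returned errors can look like when loading in bulk, here is a minimal Python sketch. It assumes the bulk response body has already been parsed into a dict and that each entry in its `items` array carries an `error` field when that document was rejected. The helper name `failed_items` and the sample response are made up for illustration and do not come from this thread.

```python
def failed_items(bulk_response):
    """Collect the rejected entries from a parsed bulk response.

    Assumption: each element of "items" is a single-key dict such as
    {"index": {...}}, and a rejected document has an "error" field.
    """
    failures = []
    for position, item in enumerate(bulk_response.get("items", [])):
        for action, result in item.items():
            if "error" in result:
                failures.append({
                    "position": position,   # index of the action in the request
                    "action": action,       # e.g. "index" or "create"
                    "id": result.get("_id"),
                    "error": result["error"],
                })
    return failures


# Hypothetical response, hand-written for this example only.
sample_response = {
    "items": [
        {"index": {"_index": "test", "_type": "b", "_id": "1"}},
        {"index": {"_index": "test", "_type": "b", "_id": "2",
                   "error": "MapperParsingException[failed to parse [date_time_recorded]]"}},
    ]
}

for failure in failed_items(sample_response):
    print(failure["position"], failure["id"], failure["error"])
```

This only helps a client that sees the response itself. When a river does the indexing inside the cluster, there is no caller to hand the errors to, so they would presumably still end up in the log.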

What do you mean by debugging a mapping?

Maybe you can post an example of your error or failure in the mapping so we
can get a clearer picture of what you have encountered? I know that the
first steps in creating a custom mapping are quite hard.

Jörg



(Rinaldo DiGiorgio) #3

On Dec 9, 2013, at 11:53 AM, joergprante@gmail.com wrote:

> I'm not sure I understand. Elasticsearch returns usage errors to the API so the client can react appropriately. Only failures are in the log; they are often decoupled from the client action and cannot yet be exposed to the caller. Some usage errors or misconceptions are not detectable, and Elasticsearch happily continues.

I am not saying that the approach is wrong, simply that it could be easier. In case it isn't clear: I think that Elasticsearch will reduce the cost of doing complicated things with data, allowing one to focus on the problem instead of the mechanics. To me, Elasticsearch is up there with developments like Java in the 90s and Node more recently (Hadoop, Mahout, Bigtable, ... are too much work to keep running). I hope it keeps going.

> What do you mean by debugging a mapping?

Suppose I have a document that has many different data conversion issues with time values. (As time goes on, there will be other types with similar issues.)

It would be a time saver if the POST could return the error results.

What happens when someone sends 1 billion documents and wants to see what failed? They need a special regex to find it in the log. I used to process 500,000,000 log entries a day with Oracle, and it had issues, but I could easily find every one that failed, using the same tools I used to add them. Log files are out of band. Just a comment.

It is a broader issue about the general approach to using a technology. Often products developed by developers are easy for developers to use. I was trying to suggest an improvement over the status quo. Can I make this work the way it is? Yes, but there is a much better way.

> Maybe you can post an example of your error or failure in the mapping so we can get a clearer picture of what you have encountered? I know that the first steps in creating a custom mapping are quite hard.

Thanks for that offer. I can't seem to figure it out.

I am using a river that uses follow from couchdb to automatically index. There are so many exceptions from my date/time processing that the indexing actually slows down, I guess due to exception stack processing. I have four different time values in a nested JSON object:

consumerPrintToken: aaaaa
bbbb: [
    {
        ccc: 2013-12-09T07:30:10.488-0500
        ......
    }
    {
        ccc: 2013-12-09T07:30:10.488-0500
    }
]
unix_timestamp_created: 1386592206  (seconds since 1970)
date_time_recorded: 2013-12-09T12:30:06.301Z
token_time_stamp: 1386592142134  (milliseconds since 1970)

So I need to create a mapping for the time values above and then start the couchdb river. I am cautious, since I spent the last few days undoing a couchbase and couchdb interaction where couchbase changed a template for all documents added, and I didn't know enough to undo it, so I had to start over.

I see that couchbase has a filter option so I could grab the value, but it is not clear how you get all the nested time values.

{
  "type" : "couchdb",
  "couchdb" : {
    "script" : "ctx.doc.field1 = 'value1'"
  }
}

I am trying something like the mapping below, but I don't think I can do the division without using the filter. I don't recall whether Java DateFormat supports milliseconds or microseconds; sometimes microseconds are useful at high data rates so the time can still be the key.

curl $CURL_OPTIONS -XPOST localhost:9200/test -d '{
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "b" : {
      "_source" : {
        "enabled" : true
      },
      "default" : {
        "date_formats" : ["yyyy-MM-dd", "dd-MM-yyyy", "date_optional_time"]
      },
      "properties" : {
        "cookie" : {
          "type" : "string", "index" : "not_analyzed"
        }
      }
    }
  }
}' | ./jq .



(Jörg Prante) #4

Ah, I see. The date/time format. The short version is: just submit ISO 8601
date/time values in GMT and you'll make ES happy ;)

Note that microseconds are not possible; each timestamp value is converted
to milliseconds since 01-01-1970 (Unix timestamp) behind the scenes. JSON
is not aware of a time type.

BTW the format pattern is processed by the Joda library.
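
As a minimal sketch of that advice in Python, assuming the conversion is done on the client before the documents are sent: the three helpers below turn the timestamp styles from the sample document earlier in the thread (epoch seconds, epoch milliseconds, and a local time with a numeric offset) into ISO 8601 UTC strings with millisecond precision. The function names are hypothetical.

```python
from datetime import datetime, timezone


def _format_utc(dt):
    # Render with exactly three fractional digits (milliseconds) and a "Z" suffix.
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + "%03dZ" % (dt.microsecond // 1000)


def epoch_seconds_to_iso(seconds):
    """Seconds since 1970 -> ISO 8601 UTC string."""
    return _format_utc(datetime.fromtimestamp(seconds, tz=timezone.utc))


def epoch_millis_to_iso(millis):
    """Milliseconds since 1970 -> ISO 8601 UTC string."""
    # Split with integer arithmetic to avoid float rounding of the fraction.
    seconds, ms = divmod(millis, 1000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return _format_utc(dt.replace(microsecond=ms * 1000))


def offset_string_to_iso(text):
    """'2013-12-09T07:30:10.488-0500' style -> ISO 8601 UTC string."""
    dt = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")
    return _format_utc(dt.astimezone(timezone.utc))


print(epoch_seconds_to_iso(1386592206))                      # 2013-12-09T12:30:06.000Z
print(epoch_millis_to_iso(1386592142134))                    # 2013-12-09T12:29:02.134Z
print(offset_string_to_iso("2013-12-09T07:30:10.488-0500"))  # 2013-12-09T12:30:10.488Z
```

Anything finer than a millisecond is dropped by `_format_utc`, which matches the note above that microseconds are not kept.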

Jörg


