Ok - well let me know when you're around.
The mapreduce inputformat works fine. I'm using it with Spark to access
the ES data via ESInputFormat and run analytics
and machine learning jobs on that data, and the same _ts field works and
is the correct data (though it comes through as
org.apache.hadoop.io.Text, which I convert to Long or a DateTime as
required).
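For reference, the conversion itself is trivial - here's a minimal standalone sketch of what I do, assuming the Text value is just the epoch-millis long rendered as a string (the class and method names are made up for illustration, and I pin the zone to UTC):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class TsFieldConversion {
    // The Hadoop Text value carries the epoch-millis long as a string,
    // so parse it to a long first...
    static long textToLong(String tsFieldValue) {
        return Long.parseLong(tsFieldValue.trim());
    }

    // ...and build a date from it when needed, pinning the zone to UTC
    // so the cluster's local timezone can't skew the result.
    static ZonedDateTime textToDateTime(String tsFieldValue) {
        return Instant.ofEpochMilli(textToLong(tsFieldValue)).atZone(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        // e.g. the raw _ts value 1397130475607
        System.out.println(textToDateTime("1397130475607")); // 2014-04-10T11:47:55.607Z
    }
}
```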
Perhaps I'm missing it somewhere, but is it possible to force a field to
be a type? i.e. similar to es.mapping.names, could I tell it that it must
parse the field as a string (since then I can take it and do whatever
parsing / casting I want).
I could just use the new Spark SQL module (which I'm seriously
considering right now having explored it a bit in the
last few days), but some of the stuff we do requires a SQL Console and
JDBC, so having Shark able to just pull in ES
data is definitely very useful...
On Tue, May 13, 2014 at 8:18 PM, Costin Leau <costin.leau@gmail.com> wrote:
Hi Nick,
I'm glad to see you are making progress. This week I'm mainly on the
road but maybe we can meet on IRC next week - my invitation still stands.
Timestamp is a relatively new type and doesn't handle timezones
properly - it is backed by java.sql.Timestamp so it inherits a lot of its
issues.
For some reason the year in your date is rather off, so it's worth
checking the data read by es-hadoop before passing it to Hive (see [1]).
I've had issues with it myself, and at the moment, when the cluster is in
a different timezone than the dataset itself, things get buggy.
Try using a UDF to do the conversion from the long to a timestamp -
I've tried doing something similar in our conversion, but since we don't
know the timezones used, it's easy for things to get mixed up.
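The core of such a UDF is a one-liner, since java.sql.Timestamp takes epoch-millis directly. A sketch only - in Hive you would put this evaluate logic in a class extending org.apache.hadoop.hive.ql.exec.UDF and register it with ADD JAR / CREATE TEMPORARY FUNCTION; the class name here is made up:

```java
import java.sql.Timestamp;

public class MillisToTimestampUdf {
    // In a real Hive UDF this would be the evaluate() method of a class
    // extending org.apache.hadoop.hive.ql.exec.UDF.
    public static Timestamp evaluate(Long millis) {
        if (millis == null) {
            return null; // Hive UDFs must tolerate NULL input
        }
        return new Timestamp(millis); // the constructor takes epoch-millis
    }

    public static void main(String[] args) {
        // toInstant() is timezone-independent, unlike Timestamp.toString()
        System.out.println(evaluate(1397130475607L).toInstant());
    }
}
```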
Cheers,
[1] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/troubleshooting.html
On 5/13/14 8:25 PM, Nick Pentreath wrote:
Hi Costin
Sorry for the silence on this issue. This went a bit quiet.
But the good news is I've come back to it and managed to get it
all working with the new Shark 0.9.1 release and es-hadoop 2.0.0.RC1.
Actually, if I used ADD JAR I got the same exception, but when I just put
the JAR into the shark lib/ folder it worked fine (which seems to point
to the classpath issue you mention).
However, I seem to have an issue with date <-> timestamp
conversion.
I have a field in ES called "_ts" that has type "date" and the
default format "dateOptionalTime". When I do a
query that
includes the timestamp it comes back NULL:
select ts from table ...
(note I use a correct es.mapping.names to map the _ts field in ES
to the ts field in Hive/Shark, which has timestamp type).
below is some of the debug-level output:
14/05/13 19:19:47 DEBUG lazy.LazyPrimitive: Data not in the TIMESTAMP data type range so converted to null. Given data is :96997506-06-30 19:08:16.768
14/05/13 19:19:47 DEBUG lazy.LazyPrimitive: Data not in the TIMESTAMP data type range so converted to null. Given data is :96997605-06-28 19:08:16.768
14/05/13 19:19:47 DEBUG lazy.LazyPrimitive: Data not in the TIMESTAMP data type range so converted to null. Given data is :96997624-06-28 19:08:16.768
14/05/13 19:19:47 DEBUG lazy.LazyPrimitive: Data not in the TIMESTAMP data type range so converted to null. Given data is :96997629-06-28 19:08:16.768
14/05/13 19:19:47 DEBUG lazy.LazyPrimitive: Data not in the TIMESTAMP data type range so converted to null. Given data is :96997634-06-29 19:08:16.768
NULL
NULL
NULL
NULL
NULL
The data that I index in the _ts field is a timestamp in ms (long).
It doesn't seem to be converted correctly, but the data is correct (in ms
at least) and I can query against it using date formats and date math in
ES.
Example snippet from debug log from above:
,"_ts":1397130475607}}]}}"
Any ideas or am I doing something silly?
I do see that the Hive timestamp expects either seconds since
epoch or a string-based format that has nanosecond granularity. Is this
the issue with just ms long timestamp data?
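As a quick sanity check on the units (a standalone sketch, no Hive involved - the exact garbled years in the debug log presumably also depend on how Hive's lazy deserializer slices the string, so I wouldn't expect this to reproduce them exactly, but it shows why a ms value read as seconds blows past the valid TIMESTAMP range):

```java
import java.time.Instant;
import java.time.ZoneOffset;

public class EpochUnitsCheck {
    // Year if the value is interpreted as milliseconds since epoch
    static int yearIfMillis(long ts) {
        return Instant.ofEpochMilli(ts).atZone(ZoneOffset.UTC).getYear();
    }

    // Year if the same value is (mis)interpreted as seconds since epoch
    static int yearIfSeconds(long ts) {
        return Instant.ofEpochSecond(ts).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        long ts = 1397130475607L; // a raw _ts value from the index
        System.out.println(yearIfMillis(ts));  // 2014 - sane
        System.out.println(yearIfSeconds(ts)); // tens of thousands of years in the future
    }
}
```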
Thanks
Nick
On Thu, Mar 27, 2014 at 4:50 PM, Costin Leau <costin.leau@gmail.com> wrote:
Using the latest hive and hadoop is preferred as they
contain various bug fixes.
The error suggests a classpath issue - namely the same class
is loaded twice for some reason and hence the
casting
fails.
Let's connect on IRC - give me a ping when you're available
(user is costin).
Cheers,
On 3/27/14 4:29 PM, Nick Pentreath wrote:
Thanks for the response.
I tried the latest Shark (cdh4 version of 0.9.1 here
http://cloudera.rst.im/shark/) - this uses hadoop 1.0.4 and hive 0.11 I
believe - and built elasticsearch-hadoop from github master.
Still getting the same error:
org.elasticsearch.hadoop.hive.EsHiveInputFormat$EsHiveSplit cannot be
cast to org.elasticsearch.hadoop.hive.EsHiveInputFormat$EsHiveSplit
Will using hive 0.11 / hadoop 1.0.4 vs hive 0.12 /
hadoop 1.2.1 in es-hadoop master make a difference?
Anyone else actually got this working?
On Thu, Mar 20, 2014 at 2:44 PM, Costin Leau <costin.leau@gmail.com> wrote:
I recommend using master - there are several
improvements done in this area. Also using the latest
Shark
(0.9.0) and
Hive (0.12) will help.
On 3/20/14 12:00 PM, Nick Pentreath wrote:
Hi
I am struggling to get this working too. I'm
just trying locally for now, running Shark 0.8.1,
Hive
0.9.0 and ES
1.0.1
with ES-hadoop 1.3.0.M2.
I managed to get a basic example working with
WRITING into an index. But I'm really after
READING an index.
I believe I have set everything up correctly,
I've added the jar to Shark:
ADD JAR /path/to/es-hadoop.jar;
created a table:
CREATE EXTERNAL TABLE test_read (name string,
price double)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test_index/test_type/_search?q=*');
And then trying to 'SELECT * FROM test_read'
gives me:
org.apache.spark.SparkException: Job aborted: Task 3.0:0 failed more than 0 times; aborting job
java.lang.ClassCastException: org.elasticsearch.hadoop.hive.EsHiveInputFormat$ESHiveSplit cannot be cast to org.elasticsearch.hadoop.hive.EsHiveInputFormat$ESHiveSplit
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
In fact I get the same error thrown when trying
to READ from the table that I successfully
WROTE to...
On Saturday, 22 February 2014 12:31:21 UTC+2,
Costin Leau wrote:
Yeah, it might have been some sort of network configuration issue where
services were running on different machines and localhost pointed to a
different location.
Either way, I'm glad to hear things are moving forward.
Cheers,
On 22/02/2014 1:06 AM, Max Lang wrote:
> I managed to get it working on ec2
without issue this time. I'd say the biggest
difference was
that this
time I set up a
> dedicated ES machine. Is it possible
that, because I was using a cluster with slaves,
when I used
"localhost" the slaves
> couldn't find the ES instance running on
the master? Or do all the requests go through
the master?
>
>
> On Wednesday, February 19, 2014 2:35:40
PM UTC-8, Costin Leau wrote:
>
> Hi,
>
> Setting logging in Hive/Hadoop can
be tricky since the log4j needs to be picked up
by the
running JVM
otherwise you
> won't see anything.
> Take a look at this link on how to
tell Hive to use your logging settings [1].
>
> For the next release, we might
introduce dedicated exceptions for the simple fact
that some
libraries, like Hive,
> swallow the stack trace and it's
unclear what the issue is which makes the exception
(IllegalStateException) ambiguous.
>
> Let me know how it goes and whether you encounter any issues with Shark - or if you don't!
>
> Thanks!
>
>
[1] https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs
>
> On 20/02/2014 12:02 AM, Max Lang
wrote:
> > Hey Costin,
> >
> > Thanks for the swift reply. I
abandoned EC2 to take that out of the equation and
managed
to get
everything working
> > locally using the latest version
of everything (though I realized just now I'm
still on
hive 0.9).
I'm guessing you're
> > right about some port connection
issue because I definitely had ES running on
that machine.
> >
> > I changed hive-log4j.properties
and added
> > |
> > #custom logging levels
> > #log4j.logger.xxx=DEBUG
> > log4j.logger.org.elasticsearch.hadoop.rest=TRACE
> > log4j.logger.org.elasticsearch.hadoop.mr=TRACE
> > |
> >
> > But I didn't see any trace
logging. Hopefully I can get it working on EC2 without
issue,
but, for
the future, is this
> > the correct way to set TRACE
logging?
> >
> > Oh and, for reference, I tried running without ES up and I got the
following exceptions:
> >
> > 2014-02-19 13:46:08,803 ERROR shark.SharkDriver (Logging.scala:logError(64)) - FAILED: Hive Internal Error:
> > java.lang.IllegalStateException(Cannot discover Elasticsearch version)
> > java.lang.IllegalStateException: Cannot discover Elasticsearch version
> > at org.elasticsearch.hadoop.hive.EsStorageHandler.init(EsStorageHandler.java:101)
> > at org.elasticsearch.hadoop.hive.EsStorageHandler.configureOutputJobProperties(EsStorageHandler.java:83)
> > at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:706)
> > at org.apache.hadoop.hive.ql.plan.PlanUtils.configureOutputJobPropertiesForStorageHandler(PlanUtils.java:675)
> > at org.apache.hadoop.hive.ql.exec.FileSinkOperator.augmentPlan(FileSinkOperator.java:764)
> > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.putOpInsertMap(SemanticAnalyzer.java:1518)
> > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFileSinkPlan(SemanticAnalyzer.java:4337)
> > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:6207)
> > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:6138)
> > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6764)
> > at shark.parse.SharkSemanticAnalyzer.analyzeInternal(SharkSemanticAnalyzer.scala:149)
> > at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:244)
> > at shark.SharkDriver.compile(SharkDriver.scala:215)
> > at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
> > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:895)
> > at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:324)
> > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
> > at shark.SharkCliDriver$.main(SharkCliDriver.scala:232)
> > at shark.SharkCliDriver.main(SharkCliDriver.scala)
> > Caused by: java.io.IOException: Out of nodes and retries; caught exception
> > at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:81)
> > at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:221)
> > at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:205)
> > at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:209)
> > at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:103)
> > at org.elasticsearch.hadoop.rest.RestClient.esVersion(RestClient.java:274)
> > at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:84)
> > at org.elasticsearch.hadoop.hive.EsStorageHandler.init(EsStorageHandler.java:99)
> > ... 18 more
> > Caused by: java.net.ConnectException: Connection refused
> > at java.net.PlainSocketImpl.socketConnect(Native Method)
> > at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> > at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> > at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
> > at java.net.Socket.connect(Socket.java:579)
> > at java.net.Socket.connect(Socket.java:528)
> > at java.net.Socket.<init>(Socket.java:425)
> > at java.net.Socket.<init>(Socket.java:280)
> > at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
> > at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
> > at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
> > at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
> > at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
> > at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
> > at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
> > at org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransport.execute(CommonsHttpTransport.java:160)
> > at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:74)
> > ... 25 more
> >
> > Let me know if there's anything in
particular you'd like me to try on EC2.
> >
> > (For posterity, the versions I
used were: hadoop 2.2.0, hive 0.9.0, shark 8.1,
spark 8.1,
es-hadoop
1.3.0.M2, java
> > 1.7.0_15, scala 2.9.3,
elasticsearch 1.0.0)
> >
> > Thanks again,
> > Max
> >
> > On Tuesday, February 18, 2014
10:16:38 PM UTC-8, Costin Leau wrote:
> >
> > The error indicates a network
error - namely es-hadoop cannot connect to
Elasticsearch
on the
default (localhost:9200)
> > HTTP port. Can you double
check whether that's indeed the case (using curl or
even
telnet on
that port) - maybe the
> > firewall prevents any
connections to be made...
> > Also you could try using the
latest Hive, 0.12 and a more recent Hadoop such
as 1.1.2
or 1.2.1.
> >
> > Additionally, can you enable TRACE logging in your job on the
> > es-hadoop packages org.elasticsearch.hadoop.rest and
> > org.elasticsearch.hadoop.mr and report back?
> >
> > Thanks,
> >
> > On 19/02/2014 4:03 AM, Max
Lang wrote:
> > > I set everything up using this guide:
> > > https://github.com/amplab/shark/wiki/Running-Shark-on-EC2 on an ec2
> > > cluster. I've copied the elasticsearch-hadoop jars into the hive lib
> > > directory and I have elasticsearch running on localhost:9200. I'm
> > > running shark in a screen session with --service screenserver and
> > > connecting to it at the same time using shark -h localhost.
> > >
> > > Unfortunately, when I
attempt to write data into elasticsearch, it fails.
Here's an
example:
> > >
> > > |
> > > [localhost:10000] shark> CREATE EXTERNAL TABLE wiki (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-sample/';
> > > Time taken (including network latency): 0.159 seconds
> > > 14/02/19 01:23:33 INFO CliDriver: Time taken (including network latency): 0.159 seconds
> > >
> > > [localhost:10000] shark> SELECT title FROM wiki LIMIT 1;
> > > Alpokalja
> > > Time taken (including network latency): 2.23 seconds
> > > 14/02/19 01:23:48 INFO CliDriver: Time taken (including network latency): 2.23 seconds
> > >
> > > [localhost:10000] shark> CREATE EXTERNAL TABLE es_wiki (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'wikipedia/article');
> > > Time taken (including network latency): 0.061 seconds
> > > 14/02/19 01:33:51 INFO CliDriver: Time taken (including network latency): 0.061 seconds
> > >
> > > [localhost:10000] shark> INSERT OVERWRITE TABLE es_wiki SELECT w.id, w.title, w.last_modified, w.xml, w.text FROM wiki w;
> > > [Hive Error]: Query returned non-zero code: 9, cause: FAILED: Execution Error, return code -101 from shark.execution.SparkTask
> > > Time taken (including network latency): 3.575 seconds
> > > 14/02/19 01:34:42 INFO CliDriver: Time taken (including network latency): 3.575 seconds
> > > |
> > >
> > > The stack trace looks like
this:
> > >
> > >
> > > org.apache.hadoop.hive.ql.metadata.HiveException (org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Out of nodes and retries; caught exception)
> > >
> > > org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:602)
> > > shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:84)
> > > shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:81)
> > > scala.collection.Iterator$class.foreach(Iterator.scala:772)
> > > scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
> > > shark.execution.FileSinkOperator.processPartition(FileSinkOperator.scala:81)
> > > shark.execution.FileSinkOperator$.writeFiles$1(FileSinkOperator.scala:207)
> > > shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:211)
> > > shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:211)
> > > org.apache.spark.scheduler.ResultTask.
> > > ...