Elasticsearch-hadoop and hadoop-streaming


(Peter Sheridan) #1

Let's say I have an index, my_twitter_river, which has been populated by
the Twitter river plugin. I want to do some analysis on the data using a
Hadoop streaming job (http://hadoop.apache.org/docs/r1.2.1/streaming.html),
where essentially command line programs can be used to write map/reduce
jobs.

Is there a good way for me to have the mapper for that streaming job
receive parsable json using only available configuration options? What I
see is this:

my-macbook:misc psheridan$ hadoop jar \
    /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
    -D es.resource=my_twitter_river/status -D mapred.reduce.tasks=0 \
    -inputformat org.elasticsearch.hadoop.mr.EsInputFormat \
    -mapper /bin/cat -input in -output out

14/05/05 08:54:13 INFO mr.EsInputFormat: Discovered mapping
{my_twitter_river=[mappings=[status=[created_at=DATE, hashtag=[end=LONG,
start=LONG, text=STRING], in_reply=[status=LONG, user_id=LONG,
user_screen_name=STRING], language=STRING, link=[display_url=STRING,
end=LONG, expand_url=STRING, start=LONG, url=STRING], location=GEO_POINT,
mention=[end=LONG, id=LONG, name=STRING, screen_name=STRING, start=LONG],
place=[country=STRING, country_code=STRING, full_name=STRING, id=STRING,
name=STRING, type=STRING, url=STRING], retweet=[id=LONG,
retweet_count=LONG, user_id=LONG, user_screen_name=STRING],
retweet_count=LONG, source=STRING, text=STRING, truncated=BOOLEAN,
user=[description=STRING, id=LONG, location=STRING, name=STRING,
profile_image_url=STRING, profile_image_url_https=STRING,
screen_name=STRING]]]]} for [my_twitter_river/status]
14/05/05 08:54:13 INFO mr.EsInputFormat: Created [5] shard-splits

[...output snipped...]

my-macbook:misc psheridan$ hadoop dfs -ls out
Found 7 items
-rw-r--r-- 1 psheridan supergroup 0 2014-05-05 08:54
/user/psheridan/out/_SUCCESS
drwxr-xr-x - psheridan supergroup 0 2014-05-05 08:54
/user/psheridan/out/_logs
-rw-r--r-- 1 psheridan supergroup 10541885 2014-05-05 08:54
/user/psheridan/out/part-00000
-rw-r--r-- 1 psheridan supergroup 10252834 2014-05-05 08:54
/user/psheridan/out/part-00001
-rw-r--r-- 1 psheridan supergroup 10492008 2014-05-05 08:54
/user/psheridan/out/part-00002
-rw-r--r-- 1 psheridan supergroup 10497346 2014-05-05 08:54
/user/psheridan/out/part-00003
-rw-r--r-- 1 psheridan supergroup 10489611 2014-05-05 08:54
/user/psheridan/out/part-00004

my-macbook:misc psheridan$ hadoop dfs -cat /user/psheridan/out/part-00000 | head -2

458635348029886464 {text=RT @adrianrmante: Having lunch with my great
friend OH MR. MOOSSEEBBYY!! I love that guy for life!!
#thesuitelifeofzackandcody http://t.co/…,
created_at=2014-04-22T15:56:02.000Z, source=Twitter for iPhone,
truncated=false, language=en, mention=[{id=38270235,
name=Adrian R'Mante, screen_name=adrianrmante, start=3, end=16}],
retweet_count=0, retweet={id=456201545562873856, user_id=38270235,
user_screen_name=adrianrmante, retweet_count=3279},
hashtag=[{text=thesuitelifeofzackandcody, start=100, end=126}], link=[],
user={id=467951508, name=Marenna Nonya, screen_name=ThatGirlish,
location=(null), description=(null),
profile_image_url=http://pbs.twimg.com/profile_images/431121751456890880/pItAYIpY_normal.jpeg,
profile_image_url_https=https://pbs.twimg.com/profile_images/431121751456890880/pItAYIpY_normal.jpeg}}
458635352232976384 {text=Dae vc sobe com seu amigão, ele fica de love e vc
separando briga --' K, created_at=2014-04-22T15:56:03.000Z, source=Twitter for Android,
truncated=false, language=pt, mention=[], retweet_count=0,
hashtag=[], link=[], user={id=2187296905, name=Carlinhos ,
screen_name=AngeloChionpat0, location=(null), description=(null),
profile_image_url=http://pbs.twimg.com/profile_images/448849601844772864/ZNwdGDC7_normal.jpeg,
profile_image_url_https=https://pbs.twimg.com/profile_images/448849601844772864/ZNwdGDC7_normal.jpeg}}

When I cat the output files, I'd like to see valid json for the value.
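
To make the problem concrete, here is a minimal Python sketch (not part of the original job) of what a streaming mapper runs into. The sample line is a shortened, hypothetical record in the same key=value style as the output above, assuming the default tab between key and value; the JSON string is only an assumed target format. Splitting on ", " misreads the comma inside the tweet text, while JSON parsing does not.

```python
import json

# Hypothetical, shortened record in the style of the output above:
# a document id, a tab, then the value printed as {key=value, ...}.
line = "458635352232976384\t{text=Dae vc sobe com seu amigão, ele fica de love, language=pt}"

doc_id, value = line.split("\t", 1)

# Naive parsing: strip the braces and split on ", ".
# The comma inside the tweet text produces a piece with no "=" in it.
pieces = value[1:-1].split(", ")
bogus = [p for p in pieces if "=" not in p]

# The same record as JSON (an assumed target format) parses unambiguously.
json_value = '{"text": "Dae vc sobe com seu amigão, ele fica de love", "language": "pt"}'
doc = json.loads(json_value)

print(len(pieces), bogus)
print(doc["language"])
```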

Thanks for any assistance. If this isn't possible, I'll likely submit a
pull request when I get to it, which will not be as soon as I'd like. :)

--Pete

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a3fc229e-4bd3-4c31-ae0a-bec7362ba724%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Costin Leau) #2

Hi,

Reading data in JSON format from ES (which I think is what you are interested in doing) is not available out of the box,
simply because you can do the same thing directly from the command line with curl or any HTTP client.
One of the reasons behind hadoop-streaming is to allow native clients to interact with Hadoop, primarily with HDFS;
since you are interacting with ES, why not talk to it directly?

Am I missing something?
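
For illustration, talking to ES directly could look roughly like the Python sketch below. It is only a sketch under assumptions: the address http://localhost:9200 and the helper names build_search_request and fetch_hits are made up for this example, and the query is a plain match_all search over my_twitter_river/status. Run as-is, it only prints the request it would send; fetch_hits needs a reachable cluster.

```python
import json
import urllib.request


def build_search_request(base_url, index, doc_type, size=100):
    """Return (url, body_bytes) for a match_all search against index/doc_type."""
    url = "{}/{}/{}/_search".format(base_url.rstrip("/"), index, doc_type)
    body = json.dumps({"query": {"match_all": {}}, "size": size}).encode("utf-8")
    return url, body


def fetch_hits(base_url, index, doc_type, size=100):
    """Run the search and yield (id, source) pairs; needs a reachable cluster."""
    url, body = build_search_request(base_url, index, doc_type, size)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read().decode("utf-8"))
    for hit in result["hits"]["hits"]:
        yield hit["_id"], hit["_source"]


if __name__ == "__main__":
    # Assumed address; this only builds and prints the request, no network call.
    url, body = build_search_request(
        "http://localhost:9200", "my_twitter_river", "status", size=2
    )
    print(url)
    print(body.decode("utf-8"))
```

Each _source returned this way is already JSON, but the read runs from a single client rather than being split across shards as a Hadoop job would be.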

--
Costin



(Peter Sheridan) #3

This is a simple example to illustrate the point. The real use case:

  1. Is a rather large amount of data which I'd like to handle in parallel &
    also take advantage of es-hadoop's handling of shards.
  2. Uses an existing job execution framework & toolset based on
    hadoop-streaming; I would rather not have a special case to handle it.
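
As a rough illustration of the kind of mapper such a toolset would run, here is a Python sketch that counts tweets per language. It assumes each input line is "id<TAB>json", which is the format being asked for in this thread and not what the job above currently emits; the sample lines and the third document id are made up.

```python
import json
from collections import Counter


def count_languages(lines):
    """Count tweets per language from 'id<TAB>json' lines, skipping bad lines."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue
        try:
            doc = json.loads(parts[1])
        except ValueError:
            # Not valid JSON, e.g. the {key=value} style shown earlier.
            continue
        if isinstance(doc, dict):
            counts[doc.get("language", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # In a real streaming mapper, pass sys.stdin instead of this sample.
    sample = [
        '458635348029886464\t{"language": "en", "retweet_count": 0}',
        '458635352232976384\t{"language": "pt", "retweet_count": 0}',
        "458635352232976385\t{text=not json, language=pt}",
    ]
    for language, count in sorted(count_languages(sample).items()):
        print("{}\t{}".format(language, count))
```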

Thanks for your feedback, and nice work on es-hadoop; I started using the Hive
integration today as well.

--Pete



(Costin Leau) #4

How would the JSON format help you, though? If you plan to parse the output, using a simpler text representation (such as
TextOutputFormat) should be a lot easier.
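
One possible reading of "a simpler text representation" is a flat, tab-separated line per document. The Python sketch below shows that idea only; the helper names get_path and to_tsv and the choice of fields are hypothetical, and it starts from an already-parsed document rather than from anything es-hadoop emits.

```python
def get_path(doc, path, default=""):
    """Follow a dotted path such as 'user.screen_name' through nested dicts."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def to_tsv(doc_id, doc, paths):
    """Render one document as a single tab-separated line."""
    cells = [str(doc_id)]
    for path in paths:
        value = str(get_path(doc, path))
        # Tabs and newlines inside a value would break the line format.
        cells.append(value.replace("\t", " ").replace("\n", " "))
    return "\t".join(cells)


# Hypothetical, shortened document modeled on the sample output in this thread.
doc = {
    "text": "Dae vc sobe com seu amigão,\nele fica de love",
    "language": "pt",
    "user": {"screen_name": "AngeloChionpat0"},
}
print(to_tsv(458635352232976384, doc, ["language", "user.screen_name", "text"]))
```

The trade-off is that nested or repeated fields (mention, hashtag, link) have to be dropped or flattened by hand, whereas JSON keeps them intact.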


--
Costin


