Let's say I have an index, my_twitter_river, which has been populated by
the Twitter river plugin. I want to analyze the data using a Hadoop
streaming job (http://hadoop.apache.org/docs/r1.2.1/streaming.html), where
command-line programs act as the mapper and reducer of a map/reduce job.
Is there a good way for the mapper of that streaming job to receive
parsable JSON using only the available configuration options? Here is what
I see:
my-macbook:misc psheridan$ hadoop jar \
    /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
    -D es.resource=my_twitter_river/status -D mapred.reduce.tasks=0 \
    -inputformat org.elasticsearch.hadoop.mr.EsInputFormat \
    -mapper /bin/cat -input in -output out
14/05/05 08:54:13 INFO mr.EsInputFormat: Discovered mapping
{my_twitter_river=[mappings=[status=[created_at=DATE, hashtag=[end=LONG,
start=LONG, text=STRING], in_reply=[status=LONG, user_id=LONG,
user_screen_name=STRING], language=STRING, link=[display_url=STRING,
end=LONG, expand_url=STRING, start=LONG, url=STRING], location=GEO_POINT,
mention=[end=LONG, id=LONG, name=STRING, screen_name=STRING, start=LONG],
place=[country=STRING, country_code=STRING, full_name=STRING, id=STRING,
name=STRING, type=STRING, url=STRING], retweet=[id=LONG,
retweet_count=LONG, user_id=LONG, user_screen_name=STRING],
retweet_count=LONG, source=STRING, text=STRING, truncated=BOOLEAN,
user=[description=STRING, id=LONG, location=STRING, name=STRING,
profile_image_url=STRING, profile_image_url_https=STRING,
screen_name=STRING]]]]} for [my_twitter_river/status]
14/05/05 08:54:13 INFO mr.EsInputFormat: Created [5] shard-splits
[...output snipped...]
my-macbook:misc psheridan$ hadoop dfs -ls out
Found 7 items
-rw-r--r-- 1 psheridan supergroup 0 2014-05-05 08:54 /user/psheridan/out/_SUCCESS
drwxr-xr-x - psheridan supergroup 0 2014-05-05 08:54 /user/psheridan/out/_logs
-rw-r--r-- 1 psheridan supergroup 10541885 2014-05-05 08:54 /user/psheridan/out/part-00000
-rw-r--r-- 1 psheridan supergroup 10252834 2014-05-05 08:54 /user/psheridan/out/part-00001
-rw-r--r-- 1 psheridan supergroup 10492008 2014-05-05 08:54 /user/psheridan/out/part-00002
-rw-r--r-- 1 psheridan supergroup 10497346 2014-05-05 08:54 /user/psheridan/out/part-00003
-rw-r--r-- 1 psheridan supergroup 10489611 2014-05-05 08:54 /user/psheridan/out/part-00004
my-macbook:misc psheridan$ hadoop dfs -cat /user/psheridan/out/part-00000 | head -2
458635348029886464 {text=RT @adrianrmante: Having lunch with my great
friend OH MR. MOOSSEEBBYY!! I love that guy for life!!
#thesuitelifeofzackandcody http://t.co/…,
created_at=2014-04-22T15:56:02.000Z, source=Twitter for
iPhone, truncated=false, language=en, mention=[{id=38270235,
name=Adrian R'Mante, screen_name=adrianrmante, start=3, end=16}],
retweet_count=0, retweet={id=456201545562873856, user_id=38270235,
user_screen_name=adrianrmante, retweet_count=3279},
hashtag=[{text=thesuitelifeofzackandcody, start=100, end=126}], link=[],
user={id=467951508, name=Marenna Nonya, screen_name=ThatGirlish,
location=(null), description=(null),
profile_image_url=http://pbs.twimg.com/profile_images/431121751456890880/pItAYIpY_normal.jpeg,
profile_image_url_https=https://pbs.twimg.com/profile_images/431121751456890880/pItAYIpY_normal.jpeg}}
458635352232976384 {text=Dae vc sobe com seu amigão, ele fica de love e vc
separando briga --' K, created_at=2014-04-22T15:56:03.000Z, source=Twitter for
Android, truncated=false, language=pt, mention=[], retweet_count=0,
hashtag=[], link=[], user={id=2187296905, name=Carlinhos ,
screen_name=AngeloChionpat0, location=(null), description=(null),
profile_image_url=http://pbs.twimg.com/profile_images/448849601844772864/ZNwdGDC7_normal.jpeg,
profile_image_url_https=https://pbs.twimg.com/profile_images/448849601844772864/ZNwdGDC7_normal.jpeg}}
When I cat the output files, I'd like to see valid JSON for the value.
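
To make the goal concrete, here is a minimal sketch of the kind of mapper
this would allow. It assumes each input line were the document id, a tab,
and the document as a single JSON object, which is not what the job emits
today. The function names and the choice to emit the language field are
purely illustrative.

```python
#!/usr/bin/env python3
"""Illustrative streaming mapper; assumes "<id>\t<JSON document>" lines."""
import json
import sys


def parse_record(line):
    """Split one input line into (key, parsed JSON value).

    Assumption: key and value are separated by the first tab, and the
    value is valid JSON (the behaviour being asked for, not the current one).
    """
    key, _, value = line.rstrip("\n").partition("\t")
    return key, json.loads(value)


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        if not line.strip():
            continue  # skip blank lines
        doc_id, doc = parse_record(line)
        # Emit "<language>\t<id>"; "unknown" when the field is absent.
        stdout.write("%s\t%s\n" % (doc.get("language", "unknown"), doc_id))


if __name__ == "__main__":
    main()
```

A script like this would take the place of /bin/cat in the -mapper option
of the command above.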
Thanks for any assistance. If this isn't possible, I'll likely submit a
pull request when I get to it, which will not be as soon as I'd like.
--Pete