Word count works perfectly now. Thanks a lot!
First of all, thanks for the nice write-up - I wish all bug reports were as
detailed as yours.
After trying several versions of Cascading and Hadoop, it turned out the
problem was with my tests. The issue was fairly
obscure, but your logs helped.
Ironically, the issue occurred only sometimes, when running against a
pseudo-distributed Hadoop tracker (like yours), not
when running against a remote one.
I've pushed the fix into master and published a nightly build as well - let
me know how it goes for you.
Thanks again for the continuous feedback and detailed report.
On 18/11/2013 7:32 PM, Jeoffrey Lim wrote:
I tried using Cascading 2.1.6, directly using the samples from
Cascading for the Impatient.
When just performing a copy:
|
Pipe docPipe = new Pipe("copy");

FlowDef flowDef = FlowDef.flowDef()
    .addSource(docPipe, docTap)
    .addTailSink(docPipe, wcTap);

flowConnector.connect(flowDef).complete();
|
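For reference, docTap and wcTap are not shown in the snippet above; going by
the full example quoted later in this thread, their definitions would look
roughly like this (index name hard-coded for brevity):
|
Fields fieldArguments = new Fields("doc_id", "description");

// ESTap reads from the index; Hfs writes tab-delimited output files
Tap docTap = new ESTap("/test/_search?q=*:*", fieldArguments);
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), "output");
|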
for i in output/part-0000* ; do echo "[$i]" && cat $i ; done
|
[output/part-00000]
doc_id	description
3	org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
2	org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
1	org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
[output/part-00001]
doc_id	description
3	org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
2	org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
1	org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
[output/part-00002]
doc_id	description
3	org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
2	org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
1	org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
[output/part-00003]
doc_id	description
3	org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
2	org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
1	org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
[output/part-00004]
doc_id	description
3	org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
2	org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
1	org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
|
With field extractor:
|
Pipe docPipe = new Pipe("copy");
Pipe copyPipe = new Each(docPipe, new Fields("doc_id", "description"),
    new ESFieldExtractor(fieldArguments), Fields.RESULTS);

FlowDef flowDef = FlowDef.flowDef()
    .addSource(copyPipe, docTap)
    .addTailSink(copyPipe, wcTap);

flowConnector.connect(flowDef).complete();
|
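ESFieldExtractor isn't shown anywhere in this thread, so here is only a
minimal sketch of what such a Cascading Function might look like, assuming
the second tuple position carries the raw MapWritable document (consistent
with the LinkedMapWritable output above):
|
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical sketch - the real ESFieldExtractor is not shown in this
// thread. Assumes tuples arrive as (doc_id, MapWritable source document).
public class ESFieldExtractor extends BaseOperation implements Function {
    public ESFieldExtractor(Fields fieldDeclaration) {
        super(2, fieldDeclaration); // two incoming arguments, declared outgoing fields
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry args = functionCall.getArguments();
        Object docId = args.getObject(0);                      // plain scalar
        MapWritable source = (MapWritable) args.getObject(1);  // raw document map
        Writable desc = source.get(new Text("description"));
        functionCall.getOutputCollector().add(
            new Tuple(docId, desc == null ? null : desc.toString()));
    }
}
|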
for i in output/part-0000* ; do echo "[$i]" && cat $i ; done
|
[output/part-00000]
doc_id	description
3	d3
2	d2
1	d1
[output/part-00001]
doc_id	description
3	d3
2	d2
1	d1
[output/part-00002]
doc_id	description
3	d3
2	d2
1	d1
[output/part-00003]
doc_id	description
3	d3
2	d2
1	d1
[output/part-00004]
doc_id	description
3	d3
2	d2
1	d1
|
Manual query of data by shards using curl:
|
curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:0'
{"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":}}
curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:1'
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":}}
curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:2'
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"1","_score":1.0,"_source":{"timestamp":1348984904161,"category":"category1","description":"d1"}}]}}
curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:3'
{"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"2","_score":1.0,"_source":{"timestamp":1348984904162,"category":"category2","description":"d2"}}]}}
curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:4'
{"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"3","_score":1.0,"_source":{"timestamp":1348984904163,"category":"category3","description":"d3"}}]}}
|
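For comparison, the index-wide total can be checked with the standard _count
API, which should report only the 3 documents indexed in the test case quoted
later in this thread:
|
curl 'http://localhost:9200/test/_count?q=*:*'
|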
Here is the console output as well:
|
13/11/19 01:20:06 INFO property.AppProps: using app.id:
9BCEBB939DE70E7BBBC0F8427387176E
13/11/19 01:20:06 INFO util.HadoopUtil: resolving application jar from
found main method on: eshadoop.WordCount
13/11/19 01:20:06 INFO planner.HadoopPlanner: using application jar:
/home/jeoffrey/test/eshadoopwc.jar
13/11/19 01:20:06 INFO hadoop.Hfs: forcing job to local mode, via sink:
Hfs["TextDelimited[[UNKNOWN]->['doc_id',
'description']]"]["output"]
13/11/19 01:20:08 INFO util.Version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:08 INFO flow.Flow: starting
13/11/19 01:20:08 INFO flow.Flow: source:
ESHadoopTap["ESHadoopScheme[['doc_id',
'description']]"]["/test/_search?q=:"]
13/11/19 01:20:08 INFO flow.Flow: sink:
Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:08 INFO flow.Flow: parallel execution is enabled:
true
13/11/19 01:20:08 INFO flow.Flow: starting jobs: 1
13/11/19 01:20:08 INFO flow.Flow: allocating threads: 1
13/11/19 01:20:08 INFO flow.FlowStep: starting step: (1/1) output
13/11/19 01:20:08 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
13/11/19 01:20:08 INFO mr.ESInputFormat: Discovered mapping
{test=[raw-logs=, test=[category=STRING,
description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
13/11/19 01:20:08 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:08 INFO flow.FlowStep: submitted hadoop job:
job_local257186463_0001
13/11/19 01:20:08 INFO mapred.LocalJobRunner: Waiting for map tasks
13/11/19 01:20:08 INFO mapred.LocalJobRunner: Starting task:
attempt_local257186463_0001_m_000000_0
13/11/19 01:20:08 INFO util.ProcessTree: setsid exited with exit code 0
13/11/19 01:20:08 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4235e6e3
13/11/19 01:20:08 INFO mapred.MapTask: Processing split:
cascading.tap.hadoop.io.MultiInputSplit@62803d5
13/11/19 01:20:08 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:08 INFO hadoop.FlowMapper: cascading version: Concurrent,
Inc - Cascading 2.1.6
13/11/19 01:20:08 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:08 INFO hadoop.FlowMapper: sourcing from:
ESHadoopTap["ESHadoopScheme[['doc_id',
'description']]"]["/test/_search?q=:"]
13/11/19 01:20:08 INFO hadoop.FlowMapper: sinking to:
Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:08 INFO mr.ESInputFormat: Discovered mapping
{test=[raw-logs=, test=[category=STRING,
description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
13/11/19 01:20:08 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:08 INFO mapred.Task:
Task:attempt_local257186463_0001_m_000000_0 is done. And is in the process
of commiting
13/11/19 01:20:08 INFO mapred.LocalJobRunner:
13/11/19 01:20:08 INFO mapred.Task: Task
attempt_local257186463_0001_m_000000_0 is allowed to commit now
13/11/19 01:20:08 INFO mapred.FileOutputCommitter: Saved output of task
'attempt_local257186463_0001_m_000000_0' to
file:/home/jeoffrey/test/output
13/11/19 01:20:08 INFO mapred.LocalJobRunner: ShardInputSplit
[node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
Cordelia|192.168.1.120:9200],shard=0]
13/11/19 01:20:08 INFO mapred.Task: Task
'attempt_local257186463_0001_m_000000_0' done.
13/11/19 01:20:08 INFO mapred.LocalJobRunner: Finishing task:
attempt_local257186463_0001_m_000000_0
13/11/19 01:20:08 INFO mapred.LocalJobRunner: Starting task:
attempt_local257186463_0001_m_000001_0
13/11/19 01:20:08 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d3b5a3a
13/11/19 01:20:08 INFO mapred.MapTask: Processing split:
cascading.tap.hadoop.io.MultiInputSplit@4ab9b8d0
13/11/19 01:20:08 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:08 INFO hadoop.FlowMapper: cascading version: Concurrent,
Inc - Cascading 2.1.6
13/11/19 01:20:08 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:08 INFO hadoop.FlowMapper: sourcing from:
ESHadoopTap["ESHadoopScheme[['doc_id',
'description']]"]["/test/_search?q=:"]
13/11/19 01:20:08 INFO hadoop.FlowMapper: sinking to:
Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping
{test=[raw-logs=, test=[category=STRING,
description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:09 INFO mapred.Task:
Task:attempt_local257186463_0001_m_000001_0 is done. And is in the process
of commiting
13/11/19 01:20:09 INFO mapred.LocalJobRunner:
13/11/19 01:20:09 INFO mapred.Task: Task
attempt_local257186463_0001_m_000001_0 is allowed to commit now
13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task
'attempt_local257186463_0001_m_000001_0' to
file:/home/jeoffrey/test/output
13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit
[node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
Cordelia|192.168.1.120:9200],shard=1]
13/11/19 01:20:09 INFO mapred.Task: Task
'attempt_local257186463_0001_m_000001_0' done.
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task:
attempt_local257186463_0001_m_000001_0
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task:
attempt_local257186463_0001_m_000002_0
13/11/19 01:20:09 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d46e381
13/11/19 01:20:09 INFO mapred.MapTask: Processing split:
cascading.tap.hadoop.io.MultiInputSplit@52cd19d
13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent,
Inc - Cascading 2.1.6
13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from:
ESHadoopTap["ESHadoopScheme[['doc_id',
'description']]"]["/test/_search?q=:"]
13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to:
Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping
{test=[raw-logs=, test=[category=STRING,
description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:09 INFO mapred.Task:
Task:attempt_local257186463_0001_m_000002_0 is done. And is in the process
of commiting
13/11/19 01:20:09 INFO mapred.LocalJobRunner:
13/11/19 01:20:09 INFO mapred.Task: Task
attempt_local257186463_0001_m_000002_0 is allowed to commit now
13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task
'attempt_local257186463_0001_m_000002_0' to
file:/home/jeoffrey/test/output
13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit
[node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
Cordelia|192.168.1.120:9200],shard=4]
13/11/19 01:20:09 INFO mapred.Task: Task
'attempt_local257186463_0001_m_000002_0' done.
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task:
attempt_local257186463_0001_m_000002_0
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task:
attempt_local257186463_0001_m_000003_0
13/11/19 01:20:09 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4ca68fd8
13/11/19 01:20:09 INFO mapred.MapTask: Processing split:
cascading.tap.hadoop.io.MultiInputSplit@2e097617
13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent,
Inc - Cascading 2.1.6
13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from:
ESHadoopTap["ESHadoopScheme[['doc_id',
'description']]"]["/test/_search?q=:"]
13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to:
Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping
{test=[raw-logs=, test=[category=STRING,
description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:09 INFO mapred.Task:
Task:attempt_local257186463_0001_m_000003_0 is done. And is in the process
of commiting
13/11/19 01:20:09 INFO mapred.LocalJobRunner:
13/11/19 01:20:09 INFO mapred.Task: Task
attempt_local257186463_0001_m_000003_0 is allowed to commit now
13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task
'attempt_local257186463_0001_m_000003_0' to
file:/home/jeoffrey/test/output
13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit
[node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
Cordelia|192.168.1.120:9200],shard=3]
13/11/19 01:20:09 INFO mapred.Task: Task
'attempt_local257186463_0001_m_000003_0' done.
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task:
attempt_local257186463_0001_m_000003_0
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task:
attempt_local257186463_0001_m_000004_0
13/11/19 01:20:09 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6e848ecc
13/11/19 01:20:09 INFO mapred.MapTask: Processing split:
cascading.tap.hadoop.io.MultiInputSplit@40363068
13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent,
Inc - Cascading 2.1.6
13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from:
ESHadoopTap["ESHadoopScheme[['doc_id',
'description']]"]["/test/_search?q=:"]
13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to:
Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping
{test=[raw-logs=, test=[category=STRING,
description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:09 INFO mapred.Task:
Task:attempt_local257186463_0001_m_000004_0 is done. And is in the process
of commiting
13/11/19 01:20:09 INFO mapred.LocalJobRunner:
13/11/19 01:20:09 INFO mapred.Task: Task
attempt_local257186463_0001_m_000004_0 is allowed to commit now
13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task
'attempt_local257186463_0001_m_000004_0' to
file:/home/jeoffrey/test/output
13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit
[node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
Cordelia|192.168.1.120:9200],shard=2]
13/11/19 01:20:09 INFO mapred.Task: Task
'attempt_local257186463_0001_m_000004_0' done.
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task:
attempt_local257186463_0001_m_000004_0
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Map task executor
complete.
13/11/19 01:20:13 INFO util.Hadoop18TapUtil: deleting temp path
output/_temporary
|
Thanks.
On Tuesday, November 19, 2013 12:01:05 AM UTC+8, Costin Leau wrote:
Can you please try with Cascading 2.1.6 just in case? I haven't
tested Cascading 2.2 yet.
Thanks,
On 18/11/2013 5:13 PM, Jeoffrey Lim wrote:
> I'm using Hadoop 1.2.1, ElasticSearch 0.90.3, Cascading 2.2 and
> tried running the word count tutorial sample in local
> standalone operation and pseudo-distributed mode, resulting in
> the same output.
>
>
> Thanks!
>
>
> On Mon, Nov 18, 2013 at 11:03 PM, Costin Leau <costi...@gmail.com> wrote:
>
> By the way, can you confirm the setup you are using - Hadoop
version and network topology?
>
> Thanks,
>
>
> On 18/11/2013 4:22 PM, Costin Leau wrote:
>
> Something is amiss here since I'm unable to reproduce the
> problem. I'll try again using your verbatim example.
> The ESFieldExtractor shouldn't be needed since ESTap
> returns the results as a TupleEntry iterator, so there
> should be no need to do the field extraction; just refer
> to the fields by name.
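(For illustration - a minimal sketch of that no-extractor wiring, reusing
the field, tap, and splitter names from the examples in this thread:)
|
// Sketch only: ESTap already declares the fields, so the splitter can
// consume "description" by name - no ESFieldExtractor step needed.
Tap docTap = new ESTap("/test/_search?q=*:*", new Fields("doc_id", "description"));

Fields token = new Fields("token");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
Pipe docPipe = new Each("wc", new Fields("description"), splitter, Fields.RESULTS);
|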
> Let me know what didn't work for you - I'll go through
> your example myself, but just in case I'm missing something,
> feedback is useful.
>
> Cheers,
>
> On 18/11/2013 1:18 PM, Jeoffrey Lim wrote:
>
> Thanks a lot for looking at this problem. Even
> when I used the nightly build with the intended fix, the
> problem still occurs and the word count is still 5, which is
> the total number of shards of the index.
>
> On Saturday, November 16, 2013 1:50:18 AM UTC+8,
Costin Leau wrote:
>
> Hi,
>
> I've pushed a nightly build [1] which should
address the problem. Let me know if it works for you.
>
> It looks like Cascading is acting differently
> when running against a pseudo/local cluster, as opposed to
> a proper one, when it comes to dealing with multiple
> splits. The bug fix should make the behaviour consistent.
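(Illustrative only, not es-hadoop source code: given the ShardInputSplit
entries in the console log above, the invariant the fix should enforce is
one input split per shard, each scoped to its own shard - the same _shards
preference trick used in the manual curl queries:)
|
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: build one shard-scoped query per split, so no
// mapper re-reads the whole index. With 3 docs and 5 shards, a job that
// ignores this scoping emits every doc once per shard - hence counts of 5.
public class ShardSplitSketch {
    static List<String> splitQueries(String index, int shardCount) {
        List<String> queries = new ArrayList<String>();
        for (int shard = 0; shard < shardCount; shard++) {
            queries.add("/" + index + "/_search?q=*:*&preference=_shards:" + shard);
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String query : splitQueries("test", 5)) {
            System.out.println(query);
        }
    }
}
|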
>
> Cheers,
>
> [1] elasticsearch-hadoop-1.3.0.BUILD-20131115.172649-221
>
> On 14/11/2013 5:51 AM, Jeoffrey Lim wrote:
> > I'm testing the cascading plugin of
> > elasticsearch-hadoop and trying the word count example from
> > http://docs.cascading.org/impatient/impatient2.html
> > The problem is that the result of the word count is
> > incorrect and somehow gets multiplied by the number of
> > shards of the index.
> >
> >
> > Code:
> >
> > |
> > String index = args[0];
> > String wcPath = args[1];
> >
> > JobConf jobConf = new JobConf();
> > jobConf.setBoolean("mapred.map.tasks.speculative.execution", false);
> > jobConf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
> >
> > Properties properties = AppProps.appProps().buildProperties(jobConf);
> > AppProps.setApplicationJarClass(properties, WordCountESHadoop.class);
> > FlowConnector flowConnector = new HadoopFlowConnector(properties);
> >
> > Fields fieldArguments = new Fields("doc_id", "description");
> >
> > Tap docTap = new ESTap("/" + index + "/_search?q=*:*", fieldArguments);
> > Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);
> >
> > Pipe descFieldPipe = new Each("desconly", fieldArguments,
> >     new ESFieldExtractor(fieldArguments), Fields.RESULTS);
> >
> > Fields token = new Fields("token");
> > Fields text = new Fields("description");
> > RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
> > Pipe docPipe = new Each(descFieldPipe, text, splitter, Fields.RESULTS);
> >
> > // determine the word counts
> > Pipe wcPipe = new Pipe("wc", docPipe);
> > wcPipe = new GroupBy(wcPipe, token);
> > wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);
> >
> > // connect the taps, pipes, etc., into a flow
> > FlowDef flowDef = FlowDef.flowDef()
> >     .setName("wc")
> >     .addSource(docPipe, docTap)
> >     .addTailSink(wcPipe, wcTap);
> >
> > // write a DOT file and run the flow
> > Flow wcFlow = flowConnector.connect(flowDef);
> > wcFlow.writeDOT("dot/wc.dot");
> > wcFlow.complete();
> > |
> >
> >
> >
> > Test case with 5 shards:
> >
> > |
> > curl -XPUT 'http://localhost:9200/test/'
> > curl -XPUT 'http://localhost:9200/test/test/1' -d '{
> >     "timestamp":1348984904161, "category":"category1", "description":"d1" }'
> > curl -XPUT 'http://localhost:9200/test/test/2' -d '{
> >     "timestamp":1348984904162, "category":"category2", "description":"d2" }'
> > curl -XPUT 'http://localhost:9200/test/test/3' -d '{
> >     "timestamp":1348984904163, "category":"category3", "description":"d3" }'
> > |
> >
> >
> > Run using index 'test':
> >
> > |
> > hadoop jar wordcount.jar test output
> > |
> >
> >
> > Display output:
> >
> > |
> > cat output/part-00000
> >
> > token count
> > d1 5
> > d2 5
> > d3 5
> > |
> >
> >
> >
> > Test case with one shard only:
> >
> > |
> > curl -XPUT 'http://localhost:9200/oneshardtest/' -d '{ "settings" : {
> >     "index" : { "number_of_shards" : 1, "number_of_replicas" : 1 } } }'
> > curl -XPUT 'http://localhost:9200/oneshardtest/test/1' -d '{
> >     "timestamp":1348984904161, "category":"category1", "description":"d1" }'
> > curl -XPUT 'http://localhost:9200/oneshardtest/test/2' -d '{
> >     "timestamp":1348984904162, "category":"category2", "description":"d2" }'
> > curl -XPUT 'http://localhost:9200/oneshardtest/test/3' -d '{
> >     "timestamp":1348984904163, "category":"category3", "description":"d3" }'
> > |
> >
> >
> > Run using index 'oneshardtest':
> >
> > |
> > hadoop jar wordcount.jar oneshardtest output
> > |
> >
> >
> > Display output:
> >
> > |
> > cat output/part-00000
> >
> >
> > token count
> > d1 1
> > d2 1
> > d3 1
> > |
> >
> >
> >
> > If an index has only 1 shard, the word count result
> > is correct. Somehow there is a problem in elasticsearch-hadoop
> > in the data input split during the map process. I have
> > already tried this in a Hadoop single-node environment and
> > the result is the same. Did I miss something in how to
> > properly configure Cascading in elasticsearch-hadoop?
> >
> > Thanks.
> >
> >
>
> --
> Costin
>
>
> --
> Costin
>
--
Costin
--
Costin