Invalid word count results when using elasticsearch-hadoop with Cascading

I'm testing the Cascading plugin of elasticsearch-hadoop and trying the
word count example from http://docs.cascading.org/impatient/impatient2.html.
The problem is that the word count result is incorrect: it somehow gets
multiplied by the number of shards of the index.

Code:

String index = args[0];
String wcPath = args[1];

JobConf jobConf = new JobConf();
jobConf.setBoolean("mapred.map.tasks.speculative.execution", false);
jobConf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

Properties properties = AppProps.appProps().buildProperties(jobConf);
AppProps.setApplicationJarClass(properties, WordCountESHadoop.class);
FlowConnector flowConnector = new HadoopFlowConnector(properties);

Fields fieldArguments = new Fields("doc_id", "description");

// read all documents from the index; *:* matches everything
Tap docTap = new ESTap("/" + index + "/_search?q=*:*", fieldArguments);
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

// pull the declared fields out of the MapWritable returned per document
Pipe descFieldPipe = new Each("desconly", fieldArguments, new ESFieldExtractor(fieldArguments), Fields.RESULTS);

Fields token = new Fields("token");
Fields text = new Fields("description");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
Pipe docPipe = new Each(descFieldPipe, text, splitter, Fields.RESULTS);

// determine the word counts
Pipe wcPipe = new Pipe("wc", docPipe);
wcPipe = new GroupBy(wcPipe, token);
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
    .setName("wc")
    .addSource(docPipe, docTap)
    .addTailSink(wcPipe, wcTap);

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect(flowDef);
wcFlow.writeDOT("dot/wc.dot");
wcFlow.complete();

Test case with 5 shards:

curl -XPUT 'http://localhost:9200/test/'
curl -XPUT 'http://localhost:9200/test/test/1' -d '{ "timestamp":1348984904161,"category":"category1","description":"d1" }'
curl -XPUT 'http://localhost:9200/test/test/2' -d '{ "timestamp":1348984904162,"category":"category2","description":"d2" }'
curl -XPUT 'http://localhost:9200/test/test/3' -d '{ "timestamp":1348984904163,"category":"category3","description":"d3" }'

Run using index 'test':

hadoop jar wordcount.jar test output

Display output:

cat output/part-00000

token count
d1 5
d2 5
d3 5

Test case with one shard only:

curl -XPUT 'http://localhost:9200/oneshardtest/' -d '{ "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 1 } } }'
curl -XPUT 'http://localhost:9200/oneshardtest/test/1' -d '{ "timestamp":1348984904161,"category":"category1","description":"d1" }'
curl -XPUT 'http://localhost:9200/oneshardtest/test/2' -d '{ "timestamp":1348984904162,"category":"category2","description":"d2" }'
curl -XPUT 'http://localhost:9200/oneshardtest/test/3' -d '{ "timestamp":1348984904163,"category":"category3","description":"d3" }'

Run using index 'oneshardtest':

hadoop jar wordcount.jar oneshardtest output

Display output:

cat output/part-00000

token count
d1 1
d2 1
d3 1

If an index has only 1 shard, the word count result is correct. It looks
like there is a problem in elasticsearch-hadoop with the data input splits
during the map phase. I have already tried this in a Hadoop single-node
environment and the result is the same. Did I miss something in how to
properly configure Cascading with elasticsearch-hadoop?

Thanks.


Hi,

I've pushed a nightly build [1] which should address the problem. Let me know if it works for you.

It looks like Cascading acts differently when running against a pseudo/local cluster as opposed to a proper one when
it comes to dealing with multiple splits. The bug fix should make the behaviour consistent.

Cheers,

[1] elasticsearch-hadoop-1.3.0.BUILD-20131115.172649-221


--
Costin


Hi,
I came across a problem when reading results from an elasticsearch index:
it seems I only get the _id and a MapWritable object. In your code you used
ESFieldExtractor; I guess this class is used to extract the result (here,
the MapWritable). Can you show me the code?


My code is:

Properties props = new Properties();
AppProps.setApplicationJarClass(props, ESReader.class);
Fields fieldArguments = new Fields("no", "word");
Tap in = new ESTap("daily/2013/_search?q=*", fieldArguments);

Tap out = new Hfs(new TextDelimited(), "/daily/2012");
Pipe pipe = new Pipe("read-from-ES");
new HadoopFlowConnector().connect(in, out, pipe).complete();
In my elasticsearch index, I have this data:
{
"_index" : "daily",
"_type" : "2013",
"_id" : "vxcvPBsvTHOK8ab5gCYoBg",
"_score" : 1.0, "_source" : {"no":"352","word":"kr"}
}, {
"_index" : "daily",
"_type" : "2013",
"_id" : "2frEVEokRj6kOJJ7o85juw",
"_score" : 1.0, "_source" : {"no":"231","word":"world"}
}, {
"_index" : "daily",
"_type" : "2013",
"_id" : "cVjJUzTXQEiF1MoG3z_AAQ",
"_score" : 1.0, "_source" : {"no":"342","word":"guo"}
}, {
"_index" : "daily",
"_type" : "2013",
"_id" : "CLNaTjRYReG1vmsC4Y3mIw",
"_score" : 1.0, "_source" : {"no":"221","word":"hello"}
}
After running my code, I got this result:
vxcvPBsvTHOK8ab5gCYoBg org.apache.hadoop.io.MapWritable@15ba1da3
2frEVEokRj6kOJJ7o85juw org.apache.hadoop.io.MapWritable@15ba1da3
cVjJUzTXQEiF1MoG3z_AAQ org.apache.hadoop.io.MapWritable@15ba1da3
CLNaTjRYReG1vmsC4Y3mIw org.apache.hadoop.io.MapWritable@15ba1da3

I would really appreciate it if you could give me a hand. Thanks.


Hi,

Here is the code of the ESFieldExtractor that I used:

import java.util.Map;

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Writable;

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

public class ESFieldExtractor extends BaseOperation implements Function {

    public ESFieldExtractor(Fields fieldDeclaration) {
        // two incoming arguments: the document id and the MapWritable of source fields
        super(2, fieldDeclaration);
    }

    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry argument = functionCall.getArguments();
        String doc_id = argument.getString(0);
        MapWritable map = (MapWritable) argument.getObject(1);

        Tuple result = new Tuple();
        result.add(doc_id);

        // look up each declared field in the MapWritable and append its value
        if (getFieldDeclaration().size() > 1) {
            for (int i = 0; i < getFieldDeclaration().size(); i++) {
                Comparable field = getFieldDeclaration().get(i);
                for (Map.Entry<Writable, Writable> s : map.entrySet()) {
                    if (field.compareTo(s.getKey().toString()) == 0) {
                        result.add(s.getValue());
                        break;
                    }
                }
            }
        }

        functionCall.getOutputCollector().add(result);
    }
}

It basically retrieves the fields I specify and converts them into a
Cascading-friendly data type. This was the first problem I encountered when
using Cascading with elasticsearch-hadoop; hope this helps you.
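
For reference, it is wired into the flow the same way as in my word count code above:

Fields fieldArguments = new Fields("doc_id", "description");
Pipe descFieldPipe = new Each("desconly", fieldArguments, new ESFieldExtractor(fieldArguments), Fields.RESULTS);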


Thanks a lot for looking at this problem. But even with the nightly build
containing the intended fix, the problem still occurs: the word count is
still 5, which is the total number of shards of the index.


Awesome, thank you for your time helping me. As for your problem I have no
idea; I am just a newbie with Cascading. Hope you solve it soon.


Something is amiss here, since I'm unable to reproduce the problem. I'll try again using your verbatim example.
The ESFieldExtractor shouldn't be needed: ESTap returns the results as a TupleEntry iterator, so there should be no need
to do the field extraction; you can just refer to the fields by name.
Let me know what didn't work for you - I'll go through your example myself, but just in case I'm missing something,
feedback is useful.
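
A rough, untested sketch of what I mean, reusing the field names from your example:

// Declare the fields to read; ESTap returns them as named tuple fields.
Fields fieldArguments = new Fields("doc_id", "description");
Tap docTap = new ESTap("/" + index + "/_search?q=*:*", fieldArguments);

// Feed the "description" field straight into the splitter - no ESFieldExtractor in between.
Fields token = new Fields("token");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
Pipe docPipe = new Each("desconly", new Fields("description"), splitter, Fields.RESULTS);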

Cheers,


--
Costin


By the way, can you confirm the setup you are using - Hadoop version and network topology?

Thanks,


--
Costin


I'm using Hadoop 1.2.1, Elasticsearch 0.90.3, and Cascading 2.2. I tried
running the word count tutorial sample in local standalone operation and in
pseudo-distributed mode, with the same output in both.

Thanks!

On Mon, Nov 18, 2013 at 11:03 PM, Costin Leau costin.leau@gmail.com wrote:

By the way, can you confirm the setup you are using - Hadoop version and
network topology?

Thanks,

On 18/11/2013 4:22 PM, Costin Leau wrote:

Something is a miss here since I'm unable to reproduce the problem. I'll
try again using your verbatim example.
The ESFieldExtractor shouldn't be needed since ESTap returns the results
as a TupleEntry iterator so there shouldn't be
no need to do the field extraction but rather just refer to the fields by
name.
Let me know what didn't work for you - I'll go through yoru example
myself but just in case I'm missing something,
feedback is useful.

Cheers,

On 18/11/2013 1:18 PM, Jeoffrey Lim wrote:

Thanks a lot for looking at this problem. But still when I used the
nightly build with the intended fix ,the problem
still occurs and the word count is still 5 which is the total number of
shards of the index.

On Saturday, November 16, 2013 1:50:18 AM UTC+8, Costin Leau wrote:

Hi,

I've pushed a nightly build [1] which should address the problem.

Let me know if it works for you.

It looks like Cascading is acting differently when running against a

pseudo/local cluster as oppose to a proper one
when
it comes to dealing with multiple splits. The bug fix should make
the behaviour consistent.

Cheers,

[1] elasticsearch-hadoop-1.3.0.BUILD-20131115.172649-221


--
Costin


Can you please try with Cascading 2.1.6 just in case? I haven't tested Cascading 2.2 yet.

Thanks,


--
Costin


I tried Cascading 2.1.6, using the samples directly from http://docs.cascading.org/impatient/

When just performing a copy:

Pipe docPipe = new Pipe("copy");

FlowDef flowDef = FlowDef.flowDef()
    .addSource(docPipe, docTap)
    .addTailSink(docPipe, wcTap);

flowConnector.connect(flowDef).complete();

for i in output/part-0000* ; do echo "[$i]" && cat $i ; done

[output/part-00000]
doc_id description
3 org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
2 org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
1 org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
[output/part-00001]
doc_id description
3 org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
2 org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
1 org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
[output/part-00002]
doc_id description
3 org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
2 org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
1 org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
[output/part-00003]
doc_id description
3 org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
2 org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
1 org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
[output/part-00004]
doc_id description
3 org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
2 org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
1 org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d

Note that every part file (one per shard-split) contains all three documents, so each split is reading the entire index instead of just its own shard.

With the field extractor:

Pipe docPipe = new Pipe("copy");
Pipe copyPipe = new Each(docPipe, new Fields("doc_id", "description"),
    new ESFieldExtractor(fieldArguments), Fields.RESULTS);

FlowDef flowDef = FlowDef.flowDef()
    .addSource(copyPipe, docTap)
    .addTailSink(copyPipe, wcTap);

flowConnector.connect(flowDef).complete();

for i in output/part-0000* ; do echo "[$i]" && cat $i ; done

[output/part-00000]
doc_id description
3 d3
2 d2
1 d1
[output/part-00001]
doc_id description
3 d3
2 d2
1 d1
[output/part-00002]
doc_id description
3 d3
2 d2
1 d1
[output/part-00003]
doc_id description
3 d3
2 d2
1 d1
[output/part-00004]
doc_id description
3 d3
2 d2
1 d1
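
(Editor's note: the thread never shows ESFieldExtractor itself. Below is a hypothetical sketch of what such a Cascading Function might look like, assuming each incoming tuple carries the document id plus the hit as a Writable map of field names to values - which would explain why the extractor variant prints "d1"/"d2"/"d3" where the plain copy prints LinkedMapWritable@... All names and details are illustrative, not the poster's actual code.)

import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

// Hypothetical ESFieldExtractor-style Function: unpacks the map emitted for
// each hit into plain tuple fields.
public class ESFieldExtractor extends BaseOperation implements Function {

    public ESFieldExtractor(Fields fieldDeclaration) {
        // Expects two arguments per tuple: the document id and the document map.
        super(2, fieldDeclaration);
    }

    @SuppressWarnings("unchecked")
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        Tuple result = new Tuple();
        // First argument: the document id, passed through unchanged.
        result.add(functionCall.getArguments().getObject(0));
        // Second argument: the hit as a map; pull the "description" field out of it.
        Map<Writable, Writable> doc =
            (Map<Writable, Writable>) functionCall.getArguments().getObject(1);
        result.add(String.valueOf(doc.get(new Text("description"))));
        functionCall.getOutputCollector().add(result);
    }
}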

Manual query of the data by shard using curl. Shards 0 and 1 are empty and shards 2, 3 and 4 hold one document each, so a correct per-shard split would see each document exactly once:

curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:0'
{"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:1'
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:2'
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"1","_score":1.0,"_source":{"timestamp":1348984904161,"category":"category1","description":"d1"}}]}}

curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:3'
{"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"2","_score":1.0,"_source":{"timestamp":1348984904162,"category":"category2","description":"d2"}}]}}

curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:4'
{"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"3","_score":1.0,"_source":{"timestamp":1348984904163,"category":"category3","description":"d3"}}]}}

Here is also the console output:

13/11/19 01:20:06 INFO property.AppProps: using app.id: 9BCEBB939DE70E7BBBC0F8427387176E
13/11/19 01:20:06 INFO util.HadoopUtil: resolving application jar from found main method on: eshadoop.WordCount
13/11/19 01:20:06 INFO planner.HadoopPlanner: using application jar: /home/jeoffrey/test/eshadoopwc.jar
13/11/19 01:20:06 INFO hadoop.Hfs: forcing job to local mode, via sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:08 INFO util.Version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:08 INFO flow.Flow: starting
13/11/19 01:20:08 INFO flow.Flow: source: ESHadoopTap["ESHadoopScheme[['doc_id', 'description']]"]["/test/_search?q=*:*"]
13/11/19 01:20:08 INFO flow.Flow: sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:08 INFO flow.Flow: parallel execution is enabled: true
13/11/19 01:20:08 INFO flow.Flow: starting jobs: 1
13/11/19 01:20:08 INFO flow.Flow: allocating threads: 1
13/11/19 01:20:08 INFO flow.FlowStep: starting step: (1/1) output
13/11/19 01:20:08 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/11/19 01:20:08 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=, test=[category=STRING, description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
13/11/19 01:20:08 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:08 INFO flow.FlowStep: submitted hadoop job: job_local257186463_0001
13/11/19 01:20:08 INFO mapred.LocalJobRunner: Waiting for map tasks
13/11/19 01:20:08 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000000_0
13/11/19 01:20:08 INFO util.ProcessTree: setsid exited with exit code 0
13/11/19 01:20:08 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4235e6e3
13/11/19 01:20:08 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@62803d5
13/11/19 01:20:08 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:08 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:08 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:08 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id', 'description']]"]["/test/_search?q=*:*"]
13/11/19 01:20:08 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:08 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=, test=[category=STRING, description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
13/11/19 01:20:08 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:08 INFO mapred.Task: Task:attempt_local257186463_0001_m_000000_0 is done. And is in the process of commiting
13/11/19 01:20:08 INFO mapred.LocalJobRunner:
13/11/19 01:20:08 INFO mapred.Task: Task attempt_local257186463_0001_m_000000_0 is allowed to commit now
13/11/19 01:20:08 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000000_0' to file:/home/jeoffrey/test/output
13/11/19 01:20:08 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost, Cordelia|192.168.1.120:9200],shard=0]
13/11/19 01:20:08 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000000_0' done.
13/11/19 01:20:08 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000000_0
13/11/19 01:20:08 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000001_0
13/11/19 01:20:08 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d3b5a3a
13/11/19 01:20:08 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@4ab9b8d0
13/11/19 01:20:08 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:08 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:08 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:08 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id', 'description']]"]["/test/_search?q=*:*"]
13/11/19 01:20:08 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=, test=[category=STRING, description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000001_0 is done. And is in the process of commiting
13/11/19 01:20:09 INFO mapred.LocalJobRunner:
13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000001_0 is allowed to commit now
13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000001_0' to file:/home/jeoffrey/test/output
13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost, Cordelia|192.168.1.120:9200],shard=1]
13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000001_0' done.
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000001_0
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000002_0
13/11/19 01:20:09 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d46e381
13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@52cd19d
13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id', 'description']]"]["/test/_search?q=*:*"]
13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=, test=[category=STRING, description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000002_0 is done. And is in the process of commiting
13/11/19 01:20:09 INFO mapred.LocalJobRunner:
13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000002_0 is allowed to commit now
13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000002_0' to file:/home/jeoffrey/test/output
13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost, Cordelia|192.168.1.120:9200],shard=4]
13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000002_0' done.
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000002_0
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000003_0
13/11/19 01:20:09 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4ca68fd8
13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@2e097617
13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id', 'description']]"]["/test/_search?q=*:*"]
13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=, test=[category=STRING, description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000003_0 is done. And is in the process of commiting
13/11/19 01:20:09 INFO mapred.LocalJobRunner:
13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000003_0 is allowed to commit now
13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000003_0' to file:/home/jeoffrey/test/output
13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost, Cordelia|192.168.1.120:9200],shard=3]
13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000003_0' done.
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000003_0
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000004_0
13/11/19 01:20:09 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6e848ecc
13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@40363068
13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id', 'description']]"]["/test/_search?q=*:*"]
13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=, test=[category=STRING, description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000004_0 is done. And is in the process of commiting
13/11/19 01:20:09 INFO mapred.LocalJobRunner:
13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000004_0 is allowed to commit now
13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000004_0' to file:/home/jeoffrey/test/output
13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost, Cordelia|192.168.1.120:9200],shard=2]
13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000004_0' done.
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000004_0
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Map task executor complete.
13/11/19 01:20:13 INFO util.Hadoop18TapUtil: deleting temp path output/_temporary

Thanks.


First of all, thanks for the nice write-up - I wish all bug reports were as detailed as yours.
After trying several versions of Cascading and Hadoop, it turned out the problem was with my tests. The issue was fairly obscure, but your logs helped.
Ironically, the issue occurred only sometimes, when running against a pseudo-distributed Hadoop tracker (as yours is), not when running against a remote one.

I've pushed the fix into master and pushed a nightly build as well - let me know how it goes for you.

Thanks again for the continuous feedback and detailed report.

On 18/11/2013 7:32 PM, Jeoffrey Lim wrote:

I tried using Cascading 2.1.6, and directly using the samples from Cascading for the Impatient

When just performing a copy:

|
PipedocPipe =newPipe("copy");

FlowDefflowDef =FlowDef.flowDef()
.addSource(docPipe,docTap)
.addTailSink(docPipe,wcTap);

flowConnector.connect(flowDef).complete();
|

for i in output/part-0000* ; do echo "[$i]" && cat $i ; done

|
[output/part-00000]
doc_id description
3org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
2org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
1org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
[output/part-00001]
doc_id description
3org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
2org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
1org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
[output/part-00002]
doc_id description
3org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
2org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
1org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
[output/part-00003]
doc_id description
3org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
2org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
1org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
[output/part-00004]
doc_id description
3org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
2org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
1org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
|

With field extractor:

|
PipedocPipe =newPipe("copy");
PipecopyPipe =newEach(docPipe,newFields("doc_id","description"),newESFieldExtractor(fieldArguments),Fields.RESULTS );

FlowDefflowDef =FlowDef.flowDef()
.addSource(copyPipe,docTap)
.addTailSink(copyPipe,wcTap);

flowConnector.connect(flowDef).complete();
|

for i in output/part-0000* ; do echo "[$i]" && cat $i ; done

|
[output/part-00000]
doc_id description
3d3
2d2
1d1
[output/part-00001]
doc_id description
3d3
2d2
1d1
[output/part-00002]
doc_id description
3d3
2d2
1d1
[output/part-00003]
doc_id description
3d3
2d2
1d1
[output/part-00004]
doc_id description
3d3
2d2
1d1
|

Manual query of data by shards using curl:

|
curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:0'
{"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":}}

curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:1'
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":}}

curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:2'
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"1","_score":1.0,"_source":{"timestamp":1348984904161,"category":"category1","description":"d1"}}]}}

curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:3'
{"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"2","_score":1.0,"_source":{"timestamp":1348984904162,"category":"category2","description":"d2"}}]}}

curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:4'
{"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"3","_score":1.0,"_source":{"timestamp":1348984904163,"category":"category3","description":"d3"}}]}}
|

Here is also the console output:

|
13/11/19 01:20:06 INFO property.AppProps: using app.id: 9BCEBB939DE70E7BBBC0F8427387176E
13/11/19 01:20:06 INFO util.HadoopUtil: resolving application jar from found main method on: eshadoop.WordCount
13/11/19 01:20:06 INFO planner.HadoopPlanner: using application jar: /home/jeoffrey/test/eshadoopwc.jar
13/11/19 01:20:06 INFO hadoop.Hfs: forcing job to local mode, via sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id',
'description']]"]["output"]
13/11/19 01:20:08 INFO util.Version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:08 INFO flow.Flow: starting
13/11/19 01:20:08 INFO flow.Flow: source: ESHadoopTap["ESHadoopScheme[['doc_id',
'description']]"]["/test/_search?q=:"]
13/11/19 01:20:08 INFO flow.Flow: sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:08 INFO flow.Flow: parallel execution is enabled: true
13/11/19 01:20:08 INFO flow.Flow: starting jobs: 1
13/11/19 01:20:08 INFO flow.Flow: allocating threads: 1
13/11/19 01:20:08 INFO flow.FlowStep: starting step: (1/1) output
13/11/19 01:20:08 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/11/19 01:20:08 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=, test=[category=STRING,
description=STRING, timestamp=LONG]]} for [/test/_search?q=:]
13/11/19 01:20:08 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:08 INFO flow.FlowStep: submitted hadoop job: job_local257186463_0001
13/11/19 01:20:08 INFO mapred.LocalJobRunner: Waiting for map tasks
13/11/19 01:20:08 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000000_0
13/11/19 01:20:08 INFO util.ProcessTree: setsid exited with exit code 0
13/11/19 01:20:08 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4235e6e3
13/11/19 01:20:08 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@62803d5
13/11/19 01:20:08 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:08 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:08 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:08 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
'description']]"]["/test/_search?q=:"]
13/11/19 01:20:08 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:08 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=, test=[category=STRING,
description=STRING, timestamp=LONG]]} for [/test/_search?q=:]
13/11/19 01:20:08 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:08 INFO mapred.Task: Task:attempt_local257186463_0001_m_000000_0 is done. And is in the process of commiting
13/11/19 01:20:08 INFO mapred.LocalJobRunner:
13/11/19 01:20:08 INFO mapred.Task: Task attempt_local257186463_0001_m_000000_0 is allowed to commit now
13/11/19 01:20:08 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000000_0' to
file:/home/jeoffrey/test/output
13/11/19 01:20:08 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
Cordelia|192.168.1.120:9200],shard=0]
13/11/19 01:20:08 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000000_0' done.
13/11/19 01:20:08 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000000_0
13/11/19 01:20:08 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000001_0
13/11/19 01:20:08 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d3b5a3a
13/11/19 01:20:08 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@4ab9b8d0
13/11/19 01:20:08 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:08 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:08 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:08 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
'description']]"]["/test/_search?q=:"]
13/11/19 01:20:08 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=, test=[category=STRING,
description=STRING, timestamp=LONG]]} for [/test/_search?q=:]
13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000001_0 is done. And is in the process of commiting
13/11/19 01:20:09 INFO mapred.LocalJobRunner:
13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000001_0 is allowed to commit now
13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000001_0' to
file:/home/jeoffrey/test/output
13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
Cordelia|192.168.1.120:9200],shard=1]
13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000001_0' done.
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000001_0
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000002_0
13/11/19 01:20:09 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d46e381
13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@52cd19d
13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
'description']]"]["/test/_search?q=:"]
13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=, test=[category=STRING,
description=STRING, timestamp=LONG]]} for [/test/_search?q=:]
13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000002_0 is done. And is in the process of commiting
13/11/19 01:20:09 INFO mapred.LocalJobRunner:
13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000002_0 is allowed to commit now
13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000002_0' to
file:/home/jeoffrey/test/output
13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
Cordelia|192.168.1.120:9200],shard=4]
13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000002_0' done.
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000002_0
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000003_0
13/11/19 01:20:09 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4ca68fd8
13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@2e097617
13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
'description']]"]["/test/_search?q=:"]
13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=, test=[category=STRING,
description=STRING, timestamp=LONG]]} for [/test/_search?q=:]
13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000003_0 is done. And is in the process of commiting
13/11/19 01:20:09 INFO mapred.LocalJobRunner:
13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000003_0 is allowed to commit now
13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000003_0' to
file:/home/jeoffrey/test/output
13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
Cordelia|192.168.1.120:9200],shard=3]
13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000003_0' done.
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000003_0
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000004_0
13/11/19 01:20:09 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6e848ecc
13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@40363068
13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
'description']]"]["/test/_search?q=:"]
13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=, test=[category=STRING,
description=STRING, timestamp=LONG]]} for [/test/_search?q=:]
13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000004_0 is done. And is in the process of commiting
13/11/19 01:20:09 INFO mapred.LocalJobRunner:
13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000004_0 is allowed to commit now
13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000004_0' to
file:/home/jeoffrey/test/output
13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
Cordelia|192.168.1.120:9200],shard=2]
13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000004_0' done.
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000004_0
13/11/19 01:20:09 INFO mapred.LocalJobRunner: Map task executor complete.
13/11/19 01:20:13 INFO util.Hadoop18TapUtil: deleting temp path output/_temporary
|

Thanks.

On Tuesday, November 19, 2013 12:01:05 AM UTC+8, Costin Leau wrote:

Can you please try with Cascading 2.1.6 just in case? I haven't tested Cascading 2.2 yet.

Thanks,


Word count works perfectly now. Thanks a lot :)

On Tuesday, November 19, 2013 4:11:37 AM UTC+8, Costin Leau wrote:

First of all, thanks for the nice write-up - I wish all bug reports were as detailed as yours.
After trying several versions of Cascading and Hadoop, it turned out the problem was with my tests. The issue was fairly obscure, but your logs helped.
Ironically, the issue occurred only sometimes - when running against a pseudo-distributed Hadoop tracker (as in your setup), not when running against a remote one.

I've pushed the fix into master and pushed a nightly build as well - let me know how it goes for you.

Thanks again for the continuous feedback and the detailed report.
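
For reference, the simplified flow suggested earlier in the thread - dropping ESFieldExtractor and referring to the ESTap fields by name - would look roughly like the sketch below. This is only a sketch against the Cascading 2.1.x APIs already used in this thread; the ESTap import path is an assumption based on the es-hadoop 1.3 snapshots being tested here, and the JobConf speculative-execution flags from the original post are omitted for brevity.

import java.util.Properties;

import org.apache.hadoop.mapred.JobConf;
import org.elasticsearch.hadoop.cascading.ESTap; // assumed package for the 1.3 snapshots

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCountNoExtractor {
    public static void main(String[] args) {
        String index = args[0];
        String wcPath = args[1];

        Properties properties = AppProps.appProps().buildProperties(new JobConf());
        AppProps.setApplicationJarClass(properties, WordCountNoExtractor.class);

        // ESTap already exposes 'doc_id' and 'description' as named tuple
        // fields, so the splitter can consume 'description' directly and
        // no field-extraction step is needed.
        Fields docFields = new Fields("doc_id", "description");
        Tap docTap = new ESTap("/" + index + "/_search?q=*:*", docFields);
        Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

        Fields token = new Fields("token");
        RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
        Pipe docPipe = new Each("docs", new Fields("description"), splitter, Fields.RESULTS);

        // determine the word counts, exactly as in the original flow
        Pipe wcPipe = new Pipe("wc", docPipe);
        wcPipe = new GroupBy(wcPipe, token);
        wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

        FlowDef flowDef = FlowDef.flowDef()
                .setName("wc")
                .addSource(docPipe, docTap)
                .addTailSink(wcPipe, wcTap);

        new HadoopFlowConnector(properties).connect(flowDef).complete();
    }
}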


Excellent - glad to hear it!

On 19/11/2013 4:35 AM, Jeoffrey Lim wrote:

Word count works perfectly now. Thanks a lot :)

On Tuesday, November 19, 2013 4:11:37 AM UTC+8, Costin Leau wrote:

First of all, thanks for the nice write-up - I wish all bug reports are as detailed as yours.
After trying several versions of cascading and Hadoop it turned up the problem was with my tests. The issue was fairly
obscure but your logs helped.
Ironically the issue occurred only sometimes, when running against a pseudo-distributed Hadoop tracker (as yours) not
when running against a remote one.

I've pushed the fix into master and pushed a nightly build as well - let me know how it goes for you.

Thanks again for the continous feedback and detailed report.

On 18/11/2013 7:32 PM, Jeoffrey Lim wrote:
> I tried using Cascading 2.1.6, and directly using the samples fromhttp://docs.cascading.org/impatient/ <http://docs.cascading.org/impatient/>
>
> When just performing a copy:
>
> |
> PipedocPipe =newPipe("copy");
>
> FlowDefflowDef =FlowDef.flowDef()
> .addSource(docPipe,docTap)
> .addTailSink(docPipe,wcTap);
>
> flowConnector.connect(flowDef).complete();
> |
>
> for i in output/part-0000* ; do echo "[$i]" && cat $i ; done
>
> |
> [output/part-00000]
> doc_id description
> 3org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
> 2org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
> 1org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
> [output/part-00001]
> doc_id description
> 3org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
> 2org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
> 1org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
> [output/part-00002]
> doc_id description
> 3org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
> 2org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
> 1org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
> [output/part-00003]
> doc_id description
> 3org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
> 2org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
> 1org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
> [output/part-00004]
> doc_id description
> 3org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
> 2org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
> 1org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
> |
>
>
>
> With field extractor:
>
> |
> PipedocPipe =newPipe("copy");
> PipecopyPipe =newEach(docPipe,newFields("doc_id","description"),newESFieldExtractor(fieldArguments),Fields.RESULTS );
>
> FlowDefflowDef =FlowDef.flowDef()
> .addSource(copyPipe,docTap)
> .addTailSink(copyPipe,wcTap);
>
> flowConnector.connect(flowDef).complete();
> |
>
>
> for i in output/part-0000* ; do echo "[$i]" && cat $i ; done
>
> |
> [output/part-00000]
> doc_id description
> 3d3
> 2d2
> 1d1
> [output/part-00001]
> doc_id description
> 3d3
> 2d2
> 1d1
> [output/part-00002]
> doc_id description
> 3d3
> 2d2
> 1d1
> [output/part-00003]
> doc_id description
> 3d3
> 2d2
> 1d1
> [output/part-00004]
> doc_id description
> 3d3
> 2d2
> 1d1
> |
>
>
> Manual query of data by shards using curl:
>
> |
> curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:0
<http://localhost:9200/test/_search?q=*:*&preference=_shards:0>'
> {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
>
>
> curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:1
<http://localhost:9200/test/_search?q=*:*&preference=_shards:1>'
> {"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
>
>
> curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:2
<http://localhost:9200/test/_search?q=*:*&preference=_shards:2>'
> {"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"1","_score":1.0,"_source":{"timestamp":1348984904161,"category":"category1","description":"d1"}}]}}

>
>
> curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:3
<http://localhost:9200/test/_search?q=*:*&preference=_shards:3>'
> {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"2","_score":1.0,"_source":{"timestamp":1348984904162,"category":"category2","description":"d2"}}]}}

>
>
> curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:4
<http://localhost:9200/test/_search?q=*:*&preference=_shards:4>'
> {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"3","_score":1.0,"_source":{"timestamp":1348984904163,"category":"category3","description":"d3"}}]}}

> |
>
>
> Here is also the console output:
>
> |
> 13/11/19 01:20:06 INFO property.AppProps: usingapp.id <http://app.id>: 9BCEBB939DE70E7BBBC0F8427387176E
> 13/11/19 01:20:06 INFO util.HadoopUtil: resolving application jar from found main method on: eshadoop.WordCount
> 13/11/19 01:20:06 INFO planner.HadoopPlanner: using application jar: /home/jeoffrey/test/eshadoopwc.jar
> 13/11/19 01:20:06 INFO hadoop.Hfs: forcing job to local mode, via sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id',
> 'description']]"]["output"]
> 13/11/19 01:20:08 INFO util.Version: Concurrent, Inc - Cascading 2.1.6
> 13/11/19 01:20:08 INFO flow.Flow: [] starting
> 13/11/19 01:20:08 INFO flow.Flow: []  source: ESHadoopTap["ESHadoopScheme[['doc_id',
> 'description']]"]["/test/_search?q=*:*"]
> 13/11/19 01:20:08 INFO flow.Flow: []  sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
> 13/11/19 01:20:08 INFO flow.Flow: []  parallel execution is enabled: true
> 13/11/19 01:20:08 INFO flow.Flow: []  starting jobs: 1
> 13/11/19 01:20:08 INFO flow.Flow: []  allocating threads: 1
> 13/11/19 01:20:08 INFO flow.FlowStep: [] starting step: (1/1) output
> 13/11/19 01:20:08 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 13/11/19 01:20:08 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
> description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
> 13/11/19 01:20:08 INFO mr.ESInputFormat: Created [5] shard-splits
> 13/11/19 01:20:08 INFO flow.FlowStep: [] submitted hadoop job: job_local257186463_0001
> 13/11/19 01:20:08 INFO mapred.LocalJobRunner: Waiting for map tasks
> 13/11/19 01:20:08 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000000_0
> 13/11/19 01:20:08 INFO util.ProcessTree: setsid exited with exit code 0
> 13/11/19 01:20:08 INFO mapred.Task:  Using ResourceCalculatorPlugin :
> org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4235e6e3
> 13/11/19 01:20:08 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@62803d5
> 13/11/19 01:20:08 INFO mapred.MapTask: numReduceTasks: 0
> 13/11/19 01:20:08 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
> 13/11/19 01:20:08 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
> 13/11/19 01:20:08 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
> 'description']]"]["/test/_search?q=*:*"]
> 13/11/19 01:20:08 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
> 13/11/19 01:20:08 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
> description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
> 13/11/19 01:20:08 INFO mr.ESInputFormat: Created [5] shard-splits
> 13/11/19 01:20:08 INFO mapred.Task: Task:attempt_local257186463_0001_m_000000_0 is done. And is in the process of commiting
> 13/11/19 01:20:08 INFO mapred.LocalJobRunner:
> 13/11/19 01:20:08 INFO mapred.Task: Task attempt_local257186463_0001_m_000000_0 is allowed to commit now
> 13/11/19 01:20:08 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000000_0' to
> file:/home/jeoffrey/test/output
> 13/11/19 01:20:08 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
> Cordelia|192.168.1.120:9200 <http://192.168.1.120:9200>],shard=0]
> 13/11/19 01:20:08 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000000_0' done.
> 13/11/19 01:20:08 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000000_0
> 13/11/19 01:20:08 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000001_0
> 13/11/19 01:20:08 INFO mapred.Task:  Using ResourceCalculatorPlugin :
> org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d3b5a3a
> 13/11/19 01:20:08 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@4ab9b8d0
> 13/11/19 01:20:08 INFO mapred.MapTask: numReduceTasks: 0
> 13/11/19 01:20:08 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
> 13/11/19 01:20:08 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
> 13/11/19 01:20:08 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
> 'description']]"]["/test/_search?q=*:*"]
> 13/11/19 01:20:08 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
> 13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
> description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
> 13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
> 13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000001_0 is done. And is in the process of commiting
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner:
> 13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000001_0 is allowed to commit now
> 13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000001_0' to
> file:/home/jeoffrey/test/output
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
> Cordelia|192.168.1.120:9200 <http://192.168.1.120:9200>],shard=1]
> 13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000001_0' done.
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000001_0
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000002_0
> 13/11/19 01:20:09 INFO mapred.Task:  Using ResourceCalculatorPlugin :
> org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d46e381
> 13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@52cd19d
> 13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
> 13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
> 13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
> 13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
> 'description']]"]["/test/_search?q=*:*"]
> 13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
> 13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
> description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
> 13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
> 13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000002_0 is done. And is in the process of commiting
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner:
> 13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000002_0 is allowed to commit now
> 13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000002_0' to
> file:/home/jeoffrey/test/output
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
> Cordelia|192.168.1.120:9200 <http://192.168.1.120:9200>],shard=4]
> 13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000002_0' done.
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000002_0
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000003_0
> 13/11/19 01:20:09 INFO mapred.Task:  Using ResourceCalculatorPlugin :
> org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4ca68fd8
> 13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@2e097617
> 13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
> 13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
> 13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
> 13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
> 'description']]"]["/test/_search?q=*:*"]
> 13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
> 13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
> description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
> 13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
> 13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000003_0 is done. And is in the process of commiting
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner:
> 13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000003_0 is allowed to commit now
> 13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000003_0' to
> file:/home/jeoffrey/test/output
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
> Cordelia|192.168.1.120:9200 <http://192.168.1.120:9200>],shard=3]
> 13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000003_0' done.
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000003_0
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000004_0
> 13/11/19 01:20:09 INFO mapred.Task:  Using ResourceCalculatorPlugin :
> org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6e848ecc
> 13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@40363068
> 13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
> 13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
> 13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
> 13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
> 'description']]"]["/test/_search?q=*:*"]
> 13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
> 13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
> description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
> 13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
> 13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000004_0 is done. And is in the process of commiting
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner:
> 13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000004_0 is allowed to commit now
> 13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000004_0' to
> file:/home/jeoffrey/test/output
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
> Cordelia|192.168.1.120:9200],shard=2]
> 13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000004_0' done.
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000004_0
> 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Map task executor complete.
> 13/11/19 01:20:13 INFO util.Hadoop18TapUtil: deleting temp path output/_temporary
> |
>
>
> Thanks.
>

--
Costin


So did word count start working perfectly with or without the
ESFieldExtractor? I've run into a similar issue where I'm trying to use
elasticsearch-hadoop with cascalog; however, ESTap is only returning the
doc_id and a LinkedMapWritable object, making it difficult to use the
cascalog abstraction to cherry-pick specific fields from the tap.

On Monday, November 18, 2013 9:35:07 PM UTC-5, Jeoffrey Lim wrote:

Word count works perfectly now. Thanks a lot :)


Task execution should produce correct results with or without ESFieldExtractor.
As for the latter, there's an open issue to expose the map content directly in the tuple so that the 'extractor' is no
longer needed [1].

[1] "Cascading tap should use the field declaration as a selector", https://github.com/elastic/elasticsearch-hadoop/issues/108

On 04/12/2013 12:32 AM, Müslim Baig wrote:

So did word count start working perfectly with or without the ESFieldExtractor? I've run into a similar issue where I'm
trying to use elasticsearch-hadoop with cascalog; however, ESTap is only returning the doc_id and a LinkedMapWritable
object, making it difficult to use the cascalog abstraction to cherry-pick specific fields from the tap.

On Monday, November 18, 2013 9:35:07 PM UTC-5, Jeoffrey Lim wrote:

Word count works perfectly now. Thanks a lot :)

On Tuesday, November 19, 2013 4:11:37 AM UTC+8, Costin Leau wrote:

    First of all, thanks for the nice write-up - I wish all bug reports were as detailed as yours.
    After trying several versions of Cascading and Hadoop, it turned out the problem was with my tests. The issue was
    fairly obscure, but your logs helped.
    Ironically, the issue occurred only sometimes, when running against a pseudo-distributed Hadoop tracker (as yours),
    not when running against a remote one.

    I've pushed the fix into master and pushed a nightly build as well - let me know how it goes for you.

    Thanks again for the continuous feedback and detailed report.

    On 18/11/2013 7:32 PM, Jeoffrey Lim wrote:
    > I tried using Cascading 2.1.6, and directly using the samples from http://docs.cascading.org/impatient/
    >
    > When just performing a copy:
    >
    > |
    > Pipe docPipe = new Pipe("copy");
    >
    > FlowDef flowDef = FlowDef.flowDef()
    >     .addSource(docPipe, docTap)
    >     .addTailSink(docPipe, wcTap);
    >
    > flowConnector.connect(flowDef).complete();
    > |
    >
    > for i in output/part-0000* ; do echo "[$i]" && cat $i ; done
    >
    > |
    > [output/part-00000]
    > doc_id	description
    > 3	org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
    > 2	org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
    > 1	org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
    > [output/part-00001]
    > doc_id	description
    > 3	org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
    > 2	org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
    > 1	org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
    > [output/part-00002]
    > doc_id	description
    > 3	org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
    > 2	org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
    > 1	org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
    > [output/part-00003]
    > doc_id	description
    > 3	org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
    > 2	org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
    > 1	org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
    > [output/part-00004]
    > doc_id	description
    > 3	org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
    > 2	org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
    > 1	org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
    > |
    >
    >
    >
    > With field extractor:
    >
    > |
    > Pipe docPipe = new Pipe("copy");
    > Pipe copyPipe = new Each(docPipe, new Fields("doc_id", "description"), new ESFieldExtractor(fieldArguments), Fields.RESULTS);
    >
    > FlowDef flowDef = FlowDef.flowDef()
    >     .addSource(copyPipe, docTap)
    >     .addTailSink(copyPipe, wcTap);
    >
    > flowConnector.connect(flowDef).complete();
    > |
    >
    >
    > for i in output/part-0000* ; do echo "[$i]" && cat $i ; done
    >
    > |
    > [output/part-00000]
    > doc_id	description
    > 3	d3
    > 2	d2
    > 1	d1
    > [output/part-00001]
    > doc_id	description
    > 3	d3
    > 2	d2
    > 1	d1
    > [output/part-00002]
    > doc_id	description
    > 3	d3
    > 2	d2
    > 1	d1
    > [output/part-00003]
    > doc_id	description
    > 3	d3
    > 2	d2
    > 1	d1
    > [output/part-00004]
    > doc_id	description
    > 3	d3
    > 2	d2
    > 1	d1
    > |
    >
    >
    > Manual query of data by shards using curl:
    >
    > |
    > curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:0'
    > {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
    >
    > curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:1'
    > {"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
    >
    > curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:2'
    > {"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"1","_score":1.0,"_source":{"timestamp":1348984904161,"category":"category1","description":"d1"}}]}}
    >
    > curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:3'
    > {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"2","_score":1.0,"_source":{"timestamp":1348984904162,"category":"category2","description":"d2"}}]}}
    >
    > curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:4'
    > {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"3","_score":1.0,"_source":{"timestamp":1348984904163,"category":"category3","description":"d3"}}]}}

    > |
    >
    >
    > Here is also the console output:
    >
    > |
    > 13/11/19 01:20:06 INFO property.AppProps: using app.id: 9BCEBB939DE70E7BBBC0F8427387176E
    > 13/11/19 01:20:06 INFO util.HadoopUtil: resolving application jar from found main method on: eshadoop.WordCount
    > 13/11/19 01:20:06 INFO planner.HadoopPlanner: using application jar: /home/jeoffrey/test/eshadoopwc.jar
    > 13/11/19 01:20:06 INFO hadoop.Hfs: forcing job to local mode, via sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id',
    > 'description']]"]["output"]
    > 13/11/19 01:20:08 INFO util.Version: Concurrent, Inc - Cascading 2.1.6
    > 13/11/19 01:20:08 INFO flow.Flow: [] starting
    > 13/11/19 01:20:08 INFO flow.Flow: []  source: ESHadoopTap["ESHadoopScheme[['doc_id',
    > 'description']]"]["/test/_search?q=*:*"]
    > 13/11/19 01:20:08 INFO flow.Flow: []  sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
    > 13/11/19 01:20:08 INFO flow.Flow: []  parallel execution is enabled: true
    > 13/11/19 01:20:08 INFO flow.Flow: []  starting jobs: 1
    > 13/11/19 01:20:08 INFO flow.Flow: []  allocating threads: 1
    > 13/11/19 01:20:08 INFO flow.FlowStep: [] starting step: (1/1) output
    > 13/11/19 01:20:08 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    > 13/11/19 01:20:08 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
    > description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
    > 13/11/19 01:20:08 INFO mr.ESInputFormat: Created [5] shard-splits
    > 13/11/19 01:20:08 INFO flow.FlowStep: [] submitted hadoop job: job_local257186463_0001
    > 13/11/19 01:20:08 INFO mapred.LocalJobRunner: Waiting for map tasks
    > 13/11/19 01:20:08 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000000_0
    > 13/11/19 01:20:08 INFO util.ProcessTree: setsid exited with exit code 0
    > 13/11/19 01:20:08 INFO mapred.Task:  Using ResourceCalculatorPlugin :
    > org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4235e6e3
    > 13/11/19 01:20:08 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@62803d5
    > 13/11/19 01:20:08 INFO mapred.MapTask: numReduceTasks: 0
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
    > 'description']]"]["/test/_search?q=*:*"]
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
    > 13/11/19 01:20:08 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
    > description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
    > 13/11/19 01:20:08 INFO mr.ESInputFormat: Created [5] shard-splits
    > 13/11/19 01:20:08 INFO mapred.Task: Task:attempt_local257186463_0001_m_000000_0 is done. And is in the process of commiting
    > 13/11/19 01:20:08 INFO mapred.LocalJobRunner:
    > 13/11/19 01:20:08 INFO mapred.Task: Task attempt_local257186463_0001_m_000000_0 is allowed to commit now
    > 13/11/19 01:20:08 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000000_0' to
    > file:/home/jeoffrey/test/output
    > 13/11/19 01:20:08 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
    > Cordelia|192.168.1.120:9200],shard=0]
    > 13/11/19 01:20:08 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000000_0' done.
    > 13/11/19 01:20:08 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000000_0
    > 13/11/19 01:20:08 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000001_0
    > 13/11/19 01:20:08 INFO mapred.Task:  Using ResourceCalculatorPlugin :
    > org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d3b5a3a
    > 13/11/19 01:20:08 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@4ab9b8d0
    > 13/11/19 01:20:08 INFO mapred.MapTask: numReduceTasks: 0
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
    > 'description']]"]["/test/_search?q=*:*"]
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
    > description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
    > 13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000001_0 is done. And is in the process of commiting
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner:
    > 13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000001_0 is allowed to commit now
    > 13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000001_0' to
    > file:/home/jeoffrey/test/output
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
    > Cordelia|192.168.1.120:9200],shard=1]
    > 13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000001_0' done.
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000001_0
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000002_0
    > 13/11/19 01:20:09 INFO mapred.Task:  Using ResourceCalculatorPlugin :
    > org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d46e381
    > 13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@52cd19d
    > 13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
    > 'description']]"]["/test/_search?q=*:*"]
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
    > description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
    > 13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000002_0 is done. And is in the process of commiting
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner:
    > 13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000002_0 is allowed to commit now
    > 13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000002_0' to
    > file:/home/jeoffrey/test/output
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
    > Cordelia|192.168.1.120:9200],shard=4]
    > 13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000002_0' done.
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000002_0
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000003_0
    > 13/11/19 01:20:09 INFO mapred.Task:  Using ResourceCalculatorPlugin :
    > org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4ca68fd8
    > 13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@2e097617
    > 13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
    > 'description']]"]["/test/_search?q=*:*"]
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
    > description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
    > 13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000003_0 is done. And is in the process of commiting
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner:
    > 13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000003_0 is allowed to commit now
    > 13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000003_0' to
    > file:/home/jeoffrey/test/output
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
    > Cordelia|192.168.1.120:9200],shard=3]
    > 13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000003_0' done.
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000003_0
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000004_0
    > 13/11/19 01:20:09 INFO mapred.Task:  Using ResourceCalculatorPlugin :
    > org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6e848ecc
    > 13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@40363068
    > 13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
    > 'description']]"]["/test/_search?q=*:*"]
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
    > description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
    > 13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000004_0 is done. And is in the process of commiting
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner:
    > 13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000004_0 is allowed to commit now
    > 13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000004_0' to
    > file:/home/jeoffrey/test/output
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
    > Cordelia|192.168.1.120:9200],shard=2]
    > 13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000004_0' done.
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000004_0
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Map task executor complete.
    > 13/11/19 01:20:13 INFO util.Hadoop18TapUtil: deleting temp path output/_temporary
    > |
    >
    >
    > Thanks.
    >
    > On Tuesday, November 19, 2013 12:01:05 AM UTC+8, Costin Leau wrote:
    >
    >     Can you please try with Cascading 2.1.6 just in case? I haven't tested Cascading 2.2 yet.
    >
    >     Thanks,
    >
    >     On 18/11/2013 5:13 PM, Jeoffrey Lim wrote:
    >     > I'm using Hadoop 1.2.1, Elasticsearch 0.90.3, and Cascading 2.2, and I tried running the word count tutorial sample in
    >     > local standalone operation and in pseudo-distributed mode, with the same result in both.
    >     >
    >     >
    >     > Thanks!
    >     >
    >     >
    >     > On Mon, Nov 18, 2013 at 11:03 PM, Costin Leau <costi...@gmail.com> wrote:
    >     >
    >     >     By the way, can you confirm the setup you are using - Hadoop version and network topology?
    >     >
    >     >     Thanks,
    >     >
    >     >
    >     >     On 18/11/2013 4:22 PM, Costin Leau wrote:
    >     >
    >     >         Something is amiss here, since I'm unable to reproduce the problem. I'll try again using your verbatim example.
    >     >         The ESFieldExtractor shouldn't be needed: ESTap returns the results as a TupleEntry iterator, so there is no
    >     >         need to do the field extraction - you can simply refer to the fields by name.
    >     >         Let me know what didn't work for you - I'll go through your example myself, but just in case I'm missing
    >     >         something, feedback is useful.
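    >     >
    >     >         For instance, reading the tap directly should hand back ready-made tuples - a rough, untested sketch
    >     >         (reusing the docTap and jobConf from your code):
    >     >
    >     >         |
    >     >         import cascading.flow.hadoop.HadoopFlowProcess;
    >     >         import cascading.tuple.TupleEntry;
    >     >         import cascading.tuple.TupleEntryIterator;
    >     >
    >     >         TupleEntryIterator it = docTap.openForRead(new HadoopFlowProcess(jobConf));
    >     >         while (it.hasNext()) {
    >     >             TupleEntry entry = it.next();
    >     >             // each field is addressable by name - no ESFieldExtractor in between
    >     >             String description = entry.getString("description");
    >     >         }
    >     >         it.close();
    >     >         |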
    >     >
    >     >         Cheers,
    >     >
    >     >         On 18/11/2013 1:18 PM, Jeoffrey Lim wrote:
    >     >
    >     >             Thanks a lot for looking at this problem. But even when I use the nightly build with the intended fix, the
    >     >             problem still occurs: the word count is still 5, which is the total number of shards of the index.
    >     >
    >     >             On Saturday, November 16, 2013 1:50:18 AM UTC+8, Costin Leau wrote:
    >     >
    >     >                  Hi,
    >     >
    >     >                  I've pushed a nightly build [1] which should address the problem. Let me know if it works for you.
    >     >
    >     >                  It looks like Cascading acts differently when running against a pseudo/local cluster, as opposed to a
    >     >                  proper one, when it comes to dealing with multiple splits. The bug fix should make the behaviour
    >     >                  consistent.
    >     >
    >     >                  Cheers,
    >     >
    >     >                  [1] elasticsearch-hadoop-1.3.0.BUILD-20131115.172649-221
    >     >
    >     >
    >     >                  --
    >     >                  Costin
    >     >
    >     >
    >     >
    >     >     --
    >     >     Costin
    >     >
    >     >
    >     >
    >
    >     --
    >     Costin
    >

    --
    Costin


--
Costin


Hi,

I'm reviving an old thread to announce that the Elasticsearch-Hadoop master has several improvements on the Cascading
front. Specifically, documents are now 'unwrapped' into a tuple, so there is no more need for ESFieldExtractor.
The infrastructure has also been revised to take the Tap Fields into account (including the type coercion available
in Cascading 2.2 - note that Cascading 2.1 is still supported).

A nightly build has already been published to Maven - please try it out and let us know how it works for you.

Cheers,

On 04/12/2013 12:32 AM, Müslim Baig wrote:

So did word count start working perfectly with or without the ESFieldExtractor? I've run into a similar issue trying
to use elasticsearch-hadoop with Cascalog: ESTap is only returning the doc_id and a LinkedMapWritable object, making
it difficult to use the Cascalog abstraction to cherry-pick specific fields from the tap.

On Monday, November 18, 2013 9:35:07 PM UTC-5, Jeoffrey Lim wrote:

Word count works perfectly now. Thanks a lot :)

On Tuesday, November 19, 2013 4:11:37 AM UTC+8, Costin Leau wrote:

    First of all, thanks for the nice write-up - I wish all bug reports were as detailed as yours.
    After trying several versions of Cascading and Hadoop, it turned out the problem was with my tests. The issue was
    fairly obscure, but your logs helped.
    Ironically, the issue occurred only sometimes - when running against a pseudo-distributed Hadoop tracker (like
    yours), not when running against a remote one.

    I've pushed the fix into master and published a nightly build as well - let me know how it goes for you.

    Thanks again for the continuous feedback and the detailed report.

    On 18/11/2013 7:32 PM, Jeoffrey Lim wrote:
    > I tried using Cascading 2.1.6, and directly using the samples fromhttp://docs.cascading.org/impatient/ <http://docs.cascading.org/impatient/>
    >
    > When just performing a copy:
    >
    > |
    > PipedocPipe =newPipe("copy");
    >
    > FlowDefflowDef =FlowDef.flowDef()
    > .addSource(docPipe,docTap)
    > .addTailSink(docPipe,wcTap);
    >
    > flowConnector.connect(flowDef).complete();
    > |
    >
    > for i in output/part-0000* ; do echo "[$i]" && cat $i ; done
    >
    > |
    > [output/part-00000]
    > doc_id description
    > 3org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
    > 2org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
    > 1org.elasticsearch.hadoop.mr.LinkedMapWritable@3fde891b
    > [output/part-00001]
    > doc_id description
    > 3org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
    > 2org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
    > 1org.elasticsearch.hadoop.mr.LinkedMapWritable@b05236
    > [output/part-00002]
    > doc_id description
    > 3org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
    > 2org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
    > 1org.elasticsearch.hadoop.mr.LinkedMapWritable@3d762027
    > [output/part-00003]
    > doc_id description
    > 3org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
    > 2org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
    > 1org.elasticsearch.hadoop.mr.LinkedMapWritable@1aeca36e
    > [output/part-00004]
    > doc_id description
    > 3org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
    > 2org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
    > 1org.elasticsearch.hadoop.mr.LinkedMapWritable@23904e0d
    > |
    >
    >
    >
    > With field extractor:
    >
    > |
    > PipedocPipe =newPipe("copy");
    > PipecopyPipe =newEach(docPipe,newFields("doc_id","description"),newESFieldExtractor(fieldArguments),Fields.RESULTS );
    >
    > FlowDefflowDef =FlowDef.flowDef()
    > .addSource(copyPipe,docTap)
    > .addTailSink(copyPipe,wcTap);
    >
    > flowConnector.connect(flowDef).complete();
    > |
    >
    >
    > for i in output/part-0000* ; do echo "[$i]" && cat $i ; done
    >
    > |
    > [output/part-00000]
    > doc_id description
    > 3d3
    > 2d2
    > 1d1
    > [output/part-00001]
    > doc_id description
    > 3d3
    > 2d2
    > 1d1
    > [output/part-00002]
    > doc_id description
    > 3d3
    > 2d2
    > 1d1
    > [output/part-00003]
    > doc_id description
    > 3d3
    > 2d2
    > 1d1
    > [output/part-00004]
    > doc_id description
    > 3d3
    > 2d2
    > 1d1
    > |
    >
    >
    > Manual query of data by shards using curl:
    >
    > |
    > curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:0
    <http://localhost:9200/test/_search?q=*:*&preference=_shards:0>'
    > {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
    >
    >
    > curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:1
    <http://localhost:9200/test/_search?q=*:*&preference=_shards:1>'
    > {"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
    >
    >
    > curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:2
    <http://localhost:9200/test/_search?q=*:*&preference=_shards:2>'
    > {"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"1","_score":1.0,"_source":{"timestamp":1348984904161,"category":"category1","description":"d1"}}]}}

    >
    >
    > curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:3
    <http://localhost:9200/test/_search?q=*:*&preference=_shards:3>'
    > {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"2","_score":1.0,"_source":{"timestamp":1348984904162,"category":"category2","description":"d2"}}]}}

    >
    >
    > curl 'http://localhost:9200/test/_search?q=*:*&preference=_shards:4
    <http://localhost:9200/test/_search?q=*:*&preference=_shards:4>'
    > {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"3","_score":1.0,"_source":{"timestamp":1348984904163,"category":"category3","description":"d3"}}]}}

    > |
    >
    >
    > Here is also the console output:
    >
    > |
    > 13/11/19 01:20:06 INFO property.AppProps: usingapp.id <http://app.id>: 9BCEBB939DE70E7BBBC0F8427387176E
    > 13/11/19 01:20:06 INFO util.HadoopUtil: resolving application jar from found main method on: eshadoop.WordCount
    > 13/11/19 01:20:06 INFO planner.HadoopPlanner: using application jar: /home/jeoffrey/test/eshadoopwc.jar
    > 13/11/19 01:20:06 INFO hadoop.Hfs: forcing job to local mode, via sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id',
    > 'description']]"]["output"]
    > 13/11/19 01:20:08 INFO util.Version: Concurrent, Inc - Cascading 2.1.6
    > 13/11/19 01:20:08 INFO flow.Flow: [] starting
    > 13/11/19 01:20:08 INFO flow.Flow: []  source: ESHadoopTap["ESHadoopScheme[['doc_id',
    > 'description']]"]["/test/_search?q=*:*"]
    > 13/11/19 01:20:08 INFO flow.Flow: []  sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
    > 13/11/19 01:20:08 INFO flow.Flow: []  parallel execution is enabled: true
    > 13/11/19 01:20:08 INFO flow.Flow: []  starting jobs: 1
    > 13/11/19 01:20:08 INFO flow.Flow: []  allocating threads: 1
    > 13/11/19 01:20:08 INFO flow.FlowStep: [] starting step: (1/1) output
    > 13/11/19 01:20:08 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    > 13/11/19 01:20:08 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
    > description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
    > 13/11/19 01:20:08 INFO mr.ESInputFormat: Created [5] shard-splits
    > 13/11/19 01:20:08 INFO flow.FlowStep: [] submitted hadoop job: job_local257186463_0001
    > 13/11/19 01:20:08 INFO mapred.LocalJobRunner: Waiting for map tasks
    > 13/11/19 01:20:08 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000000_0
    > 13/11/19 01:20:08 INFO util.ProcessTree: setsid exited with exit code 0
    > 13/11/19 01:20:08 INFO mapred.Task:  Using ResourceCalculatorPlugin :
    > org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4235e6e3
    > 13/11/19 01:20:08 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@62803d5
    > 13/11/19 01:20:08 INFO mapred.MapTask: numReduceTasks: 0
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
    > 'description']]"]["/test/_search?q=*:*"]
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
    > 13/11/19 01:20:08 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
    > description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
    > 13/11/19 01:20:08 INFO mr.ESInputFormat: Created [5] shard-splits
    > 13/11/19 01:20:08 INFO mapred.Task: Task:attempt_local257186463_0001_m_000000_0 is done. And is in the process of commiting
    > 13/11/19 01:20:08 INFO mapred.LocalJobRunner:
    > 13/11/19 01:20:08 INFO mapred.Task: Task attempt_local257186463_0001_m_000000_0 is allowed to commit now
    > 13/11/19 01:20:08 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000000_0' to
    > file:/home/jeoffrey/test/output
    > 13/11/19 01:20:08 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
    > Cordelia|192.168.1.120:9200 <http://192.168.1.120:9200>],shard=0]
    > 13/11/19 01:20:08 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000000_0' done.
    > 13/11/19 01:20:08 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000000_0
    > 13/11/19 01:20:08 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000001_0
    > 13/11/19 01:20:08 INFO mapred.Task:  Using ResourceCalculatorPlugin :
    > org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d3b5a3a
    > 13/11/19 01:20:08 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@4ab9b8d0
    > 13/11/19 01:20:08 INFO mapred.MapTask: numReduceTasks: 0
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
    > 'description']]"]["/test/_search?q=*:*"]
    > 13/11/19 01:20:08 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
    > description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
    > 13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000001_0 is done. And is in the process of commiting
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner:
    > 13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000001_0 is allowed to commit now
    > 13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000001_0' to
    > file:/home/jeoffrey/test/output
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
    > Cordelia|192.168.1.120:9200 <http://192.168.1.120:9200>],shard=1]
    > 13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000001_0' done.
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000001_0
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000002_0
    > 13/11/19 01:20:09 INFO mapred.Task:  Using ResourceCalculatorPlugin :
    > org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d46e381
    > 13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@52cd19d
    > 13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
    > 'description']]"]["/test/_search?q=*:*"]
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
    > description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
    > 13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000002_0 is done. And is in the process of commiting
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner:
    > 13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000002_0 is allowed to commit now
    > 13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000002_0' to
    > file:/home/jeoffrey/test/output
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
    > Cordelia|192.168.1.120:9200 <http://192.168.1.120:9200>],shard=4]
    > 13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000002_0' done.
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000002_0
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000003_0
    > 13/11/19 01:20:09 INFO mapred.Task:  Using ResourceCalculatorPlugin :
    > org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4ca68fd8
    > 13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@2e097617
    > 13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
    > 'description']]"]["/test/_search?q=*:*"]
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
    > description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
    > 13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000003_0 is done. And is in the process of commiting
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner:
    > 13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000003_0 is allowed to commit now
    > 13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000003_0' to
    > file:/home/jeoffrey/test/output
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
    > Cordelia|192.168.1.120:9200 <http://192.168.1.120:9200>],shard=3]
    > 13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000003_0' done.
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000003_0
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Starting task: attempt_local257186463_0001_m_000004_0
    > 13/11/19 01:20:09 INFO mapred.Task:  Using ResourceCalculatorPlugin :
    > org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6e848ecc
    > 13/11/19 01:20:09 INFO mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@40363068
    > 13/11/19 01:20:09 INFO mapred.MapTask: numReduceTasks: 0
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: sourcing from: ESHadoopTap["ESHadoopScheme[['doc_id',
    > 'description']]"]["/test/_search?q=*:*"]
    > 13/11/19 01:20:09 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'description']]"]["output"]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Discovered mapping {test=[raw-logs=[], test=[category=STRING,
    > description=STRING, timestamp=LONG]]} for [/test/_search?q=*:*]
    > 13/11/19 01:20:09 INFO mr.ESInputFormat: Created [5] shard-splits
    > 13/11/19 01:20:09 INFO mapred.Task: Task:attempt_local257186463_0001_m_000004_0 is done. And is in the process of commiting
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner:
    > 13/11/19 01:20:09 INFO mapred.Task: Task attempt_local257186463_0001_m_000004_0 is allowed to commit now
    > 13/11/19 01:20:09 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local257186463_0001_m_000004_0' to
    > file:/home/jeoffrey/test/output
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: ShardInputSplit [node=[WwgCOxR5S7KdVZlDLBWH-A/Frost,
    > Cordelia|192.168.1.120:9200 <http://192.168.1.120:9200>],shard=2]
    > 13/11/19 01:20:09 INFO mapred.Task: Task 'attempt_local257186463_0001_m_000004_0' done.
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local257186463_0001_m_000004_0
    > 13/11/19 01:20:09 INFO mapred.LocalJobRunner: Map task executor complete.
    > 13/11/19 01:20:13 INFO util.Hadoop18TapUtil: deleting temp path output/_temporary
    > |
    >
    >
    > Thanks.
    >
    > On Tuesday, November 19, 2013 12:01:05 AM UTC+8, Costin Leau wrote:
    >
    >     Can you please try with Cascading 2.1.6 just in case? I haven't tested Cascading 2.2 yet.
    >
    >     Thanks,
    >
    >     On 18/11/2013 5:13 PM, Jeoffrey Lim wrote:
    >     > I'm using Hadoop 1.2.1, ElasticSearch 0.90.3, Cascading 2.2 and tried running the word count tutorial sample in local
    >     > stand alone operation, and pseudo distributed mode resulting with the same output.
    >     >
    >     >
    >     > Thanks!
    >     >
    >     >
    >     > On Mon, Nov 18, 2013 at 11:03 PM, Costin Leau <costi...@gmail.com <javascript:> <mailto:costi...@gmail.com <javascript:>>> wrote:
    >     >
    >     >     By the way, can you confirm the setup you are using - Hadoop version and network topology?
    >     >
    >     >     Thanks,
    >     >
    >     >
    >     >     On 18/11/2013 4:22 PM, Costin Leau wrote:
    >     >
    >     >         Something is a miss here since I'm unable to reproduce the problem. I'll try again using your verbatim example.
    >     >         The ESFieldExtractor shouldn't be needed since ESTap returns the results as a TupleEntry iterator so there
    >     >         shouldn't be
    >     >         no need to do the field extraction but rather just refer to the fields by name.
    >     >         Let me know what didn't work for you - I'll go through yoru example myself but just in case I'm missing something,
    >     >         feedback is useful.
    >     >
    >     >         Cheers,
    >     >
    >     >         On 18/11/2013 1:18 PM, Jeoffrey Lim wrote:
    >     >
    >     >             Thanks a lot for looking at this problem. But still when I used the nightly build with the intended fix ,the
    >     >             problem
    >     >             still occurs and the word count is still 5 which is the total number of shards of the index.
    >     >
    >     >             On Saturday, November 16, 2013 1:50:18 AM UTC+8, Costin Leau wrote:
    >     >
    >     >                  Hi,
    >     >
    >     >                  I've pushed a nightly build [1] which should address the problem. Let me know if it works for you.
    >     >
    >     >                  It looks like Cascading is acting differently when running against a pseudo/local cluster as oppose to
    >     >             a proper one
    >     >                  when
    >     >                  it comes to dealing with multiple splits. The bug fix should make the behaviour consistent.
    >     >
    >     >                  Cheers,
    >     >
    >     >                  [1] elasticsearch-hadoop-1.3.0.__BUILD-20131115.172649-221
    >     >
    >     >                  On 14/11/2013 5:51 AM, Jeoffrey Lim wrote:
    >     >                  > [original message quoted in full - trimmed for brevity; see the first post in this thread]
    >     >
    >     >                  --
    >     >                  Costin
    >     >
    >     >
    >     >
    >     >     --
    >     >     Costin
    >     >
    >     >
    >     >
    >
    >     --
    >     Costin
    >

    --
    Costin


--
Costin

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/52AB7135.8070303%40gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.