Custom in-memory map/reduce using ES data

Hi,

I have about a billion records on 20 nodes and would like to run a custom
map/reduce or "aggregation" (word count, sentiment analysis, etc.) immediately
after the ES result set is determined.

I came up with using the plugin system to implement a custom "aggregation", like this:

https://github.com/algolia/elasticsearch-cardinality-plugin/tree/1.0.X/src/main/java/org/alg/elasticsearch/search/aggregations/cardinality

but I would need to update the jar quite often, which eventually requires ES to be
reloaded, so I looked at the scripted metric aggregation instead:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.4/search-aggregations-metrics-scripted-metric-aggregation.html
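
For reference, my rough understanding of a minimal scripted metric from the 1.4 Java client is sketched below. It is only a sketch: the index name "blogs" is a placeholder, the builder method names are from memory and may differ between 1.4.x releases, and inline Groovy scripting has to be enabled. It just counts documents per shard and sums the shard results in the reduce phase, i.e. the skeleton rather than a real word count:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.scripted.ScriptedMetric;

public class ScriptedMetricSketch {
    public static void main(String[] args) {
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // Skeleton of a scripted metric: per-shard state lives in _agg,
        // per-shard results are merged from _aggs in the reduce script.
        SearchResponse resp = client.prepareSearch("blogs")   // index name is a placeholder
                .setSize(0)
                .addAggregation(AggregationBuilders.scriptedMetric("doc_total")
                        .initScript("_agg.count = 0")                 // per-shard state
                        .mapScript("_agg.count += 1")                 // runs once per matching doc
                        .combineScript("return _agg.count")           // one value per shard
                        .reduceScript("total = 0; for (c in _aggs) { total += c }; return total"))
                .execute().actionGet();

        ScriptedMetric metric = resp.getAggregations().get("doc_total");
        System.out.println("total = " + metric.aggregation());

        client.close();
    }
}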

but I was not sure about the memory usage or how far the scripted metric can be
customized, so I decided to run Hazelcast or Spark on the same node (or in the
same JVM) and use their map/reduce framework. I use the filter phase to push the
ES data out, like this:

https://github.com/medcl/elasticsearch-filter-redis/blob/master/src/main/java/org/elasticsearch/index/query/RedisFilterParser.java#L121

but it just takes quite a long time to put the data into that in-memory
middleware...
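
For a sense of what "putting the data into the middleware" means mechanically, even the plain scan/scroll route (instead of the filter-phase hook above) comes down to a copy loop like the sketch below, so every hit crosses the wire once. Index, field and map names are placeholders; the client calls are the 1.x Java API:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class CopyHitsToHazelcast {
    public static void main(String[] args) {
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> docs = hz.getMap("blogs");      // target in-memory map

        // Open a scan/scroll over the result set; the match_all query is a placeholder.
        SearchResponse scroll = client.prepareSearch("blogs")
                .setSearchType(SearchType.SCAN)
                .setScroll(new TimeValue(60000))
                .setQuery(QueryBuilders.matchAllQuery())
                .setSize(500)                                 // hits per shard per round trip
                .execute().actionGet();

        while (true) {
            scroll = client.prepareSearchScroll(scroll.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .execute().actionGet();
            SearchHit[] hits = scroll.getHits().getHits();
            if (hits.length == 0) {
                break;                                        // scroll exhausted
            }
            for (SearchHit hit : hits) {
                // Copy only the one field the later job needs, not the whole _source.
                Object body = hit.getSource().get("body");    // "body" is a placeholder field
                if (body != null) {
                    docs.put(hit.getId(), body.toString());
                }
            }
        }
        client.close();
    }
}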

Is there any best practice for putting ES data into in-memory middleware, just to
re-use the same data efficiently in a subsequent program?
I don't think I can use the ES query result set (on each shard), which seems to be
held in memory, directly from my own program. Am I right?

Thanks,

Haji


I met the same problem as you!

On Thursday, October 23, 2014 at 9:17:18 AM UTC+8, Hajime Takase wrote:


I use Hazelcast on the same JVM and run the map/reduce in memory. It works really
well. For about 100,000 blog documents and a word count, the ES request plus the
Hazelcast map/reduce finishes in less than 3 seconds.
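
It is basically the stock Hazelcast word count; a minimal sketch against the Hazelcast 3.x MapReduce API is below. The "blogs" map name and the whitespace tokenising are placeholders, and it assumes the IMap has already been filled with the text of the ES hits:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.mapreduce.Context;
import com.hazelcast.mapreduce.Job;
import com.hazelcast.mapreduce.JobTracker;
import com.hazelcast.mapreduce.KeyValueSource;
import com.hazelcast.mapreduce.Mapper;
import com.hazelcast.mapreduce.Reducer;
import com.hazelcast.mapreduce.ReducerFactory;

import java.util.Map;

public class WordCountJob {

    // Mapper: split each stored document body into tokens and emit (token, 1).
    public static class TokenMapper implements Mapper<String, String, String, Long> {
        @Override
        public void map(String docId, String body, Context<String, Long> context) {
            for (String token : body.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    context.emit(token, 1L);
                }
            }
        }
    }

    // Reducer: sum the partial counts for each token.
    public static class SumReducerFactory implements ReducerFactory<String, Long, Long> {
        @Override
        public Reducer<Long, Long> newReducer(String token) {
            return new Reducer<Long, Long>() {
                private long sum;

                @Override
                public void reduce(Long value) {
                    sum += value;
                }

                @Override
                public Long finalizeReduce() {
                    return sum;
                }
            };
        }
    }

    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> docs = hz.getMap("blogs");    // assumed already filled with ES hits

        JobTracker tracker = hz.getJobTracker("word-count");
        Job<String, String> job = tracker.newJob(KeyValueSource.fromMap(docs));

        // Map phase runs member-local on each partition, then the counts are reduced per token.
        Map<String, Long> counts = job
                .mapper(new TokenMapper())
                .reducer(new SumReducerFactory())
                .submit()
                .get();

        for (Map.Entry<String, Long> e : counts.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }

        hz.shutdown();
    }
}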

On Tue, Feb 3, 2015 at 8:30 PM, chengtao cheng <chengtaotxwd@gmail.com> wrote:

I met the same problem as you!
