[Hadoop] New feature - writing bulks to different indexes from Hadoop

Hey,

I am designing a solution for indexing with Hadoop.
I am thinking of using the same approach as Logstash and creating one index
per time period of my records (10 days or a month), in order to avoid
working with very large indexes (from experience, merging huge segments in
Lucene makes the whole index slow). That way I am also not limited to a
fixed number of shards: I can change the period dynamically and move
indexes between nodes in the cluster...

So I thought of adding an option to elasticsearch-hadoop that extracts the
index name from the value object (or even uses the key as the index name),
then holds one RestRepository object per index name, which buffers bulk
requests per index and sends them when a bulk is full or the Hadoop job ends.
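To make the idea concrete, here is a rough sketch of that buffering scheme in plain Python. It is not es-hadoop code; the class name, the `send_bulk` callback, and the `BULK_SIZE` threshold are all made up for illustration.

```python
from collections import defaultdict

BULK_SIZE = 1000  # hypothetical flush threshold


class PerIndexBulkWriter:
    """Buffers documents per index name and flushes each buffer
    when it is full, or when the job ends (close())."""

    def __init__(self, send_bulk):
        # send_bulk(index, docs) stands in for the actual bulk request
        self.send_bulk = send_bulk
        self.buffers = defaultdict(list)

    def write(self, record):
        # Derive the index name from the record itself,
        # e.g. one index per month of the record's timestamp.
        index = "logs-%s" % record["date"][:7]
        self.buffers[index].append(record)
        if len(self.buffers[index]) >= BULK_SIZE:
            self.flush(index)

    def flush(self, index):
        docs = self.buffers.pop(index, [])
        if docs:
            self.send_bulk(index, docs)

    def close(self):
        # Flush all remaining partial buffers at job end.
        for index in list(self.buffers):
            self.flush(index)
```

For example, writing two records from different months and then closing the writer would emit two separate bulks, one per monthly index.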

Another option is to simply write the index name + type into each bulk entry
and send the whole bulk to a master ES node (instead of taking the shard
list of a given index and picking one shard per Hadoop task instance).
But in that scenario I think the master ES node would work too hard, because
many mappers/reducers would write to the same node and it would have to
route those index records one by one...

To those who have worked with the elasticsearch-hadoop code: I would
appreciate your input. What do you think? Which approach is better?

Thanks,
Igor

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/696de734-e97e-4cb5-ae80-5fa8717b6190%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hi,

The functionality is available in master and will soon be released in es-hadoop 1.3 M3. The docs are not there yet, but
in short, you can declare a dynamic index/type using the data being parsed. For example:

es.resource={media_type}/location_{id}

where 'media_type' and 'id' are resolved from the current entry. In M/R this means looking into the current
MapWritable, in Cascading and Pig the current tuple, and for Hive the current 'column'.

Of course, raw JSON can also be used, in which case the field will be extracted from the document itself.
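To illustrate the mechanism, here is a toy re-implementation of that placeholder substitution in Python. This is not es-hadoop's actual resolver; a plain dict stands in for the MapWritable / tuple / row mentioned above.

```python
import re


def resolve_resource(pattern, record):
    """Substitute {field} placeholders in an es.resource-style
    pattern with values taken from the current record."""
    def lookup(match):
        field = match.group(1)
        try:
            return str(record[field])
        except KeyError:
            raise ValueError("field %r not found in record" % field)
    return re.sub(r"\{(\w+)\}", lookup, pattern)
```

For instance, `resolve_resource("{media_type}/location_{id}", {"media_type": "videos", "id": 42})` yields `"videos/location_42"`, so each record is routed to an index/type derived from its own fields.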

Try it out and let us know what you think.
Cheers,

On 4/2/14 12:24 PM, Igor Romanov wrote:


--
Costin


Thanks! Exactly what I needed :)

Igor

On Wednesday, April 2, 2014 8:07:28 PM UTC+3, Costin Leau wrote:


Great! For the record, 1.3 M3 has been released.

On 4/11/14 9:57 PM, Igor Romanov wrote:

--
Costin
