I work on a complex workflow using Spark (parsing, cleaning, machine
learning...).
At the end of the workflow I want to send aggregated results to
Elasticsearch so my portal can query the data.
There will be two types of processing: streaming, and the possibility to
relaunch the workflow on all available data.
Right now I use elasticsearch-hadoop, and in particular its Spark support, to
send documents to Elasticsearch with the saveJsonToEs(myindex, mytype)
method.
The goal is to have one index per day, using the template that we build.
As far as I know, elasticsearch-hadoop does not let you route a document to
the proper index based on a field of that document.
What is the proper way to implement this?
Add a dedicated step that uses Spark and the bulk API, so that each executor
sends documents to the proper index according to the field of each line?
Or is there something I missed in elasticsearch-hadoop?
I think I have a solution:
Build JSON documents so I can send them directly to _bulk:
saveJsonToEs("_bulk")
Not sure whether it will be efficient or even work; I'll try.
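For reference, the _bulk endpoint itself does accept a different target index per document, because each action metadata line names its own _index; that is what per-document routing ultimately relies on. A rough Python sketch of building such a body (the index names, type, and documents are invented for illustration):

```python
import json

def build_bulk_body(docs, doc_type="my-type"):
    """Build a newline-delimited _bulk request body where each document
    carries its own target index in the action metadata line."""
    lines = []
    for index_name, doc in docs:
        # Action line: routes this single document to its own index.
        lines.append(json.dumps({"index": {"_index": index_name, "_type": doc_type}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # a _bulk body must end with a newline

body = build_bulk_body([
    ("my-index-2014-01-01", {"value": "ici"}),
    ("my-index-2014-01-04", {"value": "la"}),
])
```

Each document is preceded by its own action line, so a single request can span many daily indices.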
On Thursday, January 15, 2015 at 4:17:57 PM UTC+1, Julien Naour wrote:
My previous idea doesn't seem to work: documents cannot be sent directly to
_bulk, only to an "index/type" pattern.
I implemented a solution for my problem.
I use foreachPartition and instantiate a bulk processor backed by a
transport client (i.e. one per partition) to send the documents.
It's not fast, but it works.
Does anybody have an idea to make it more efficient?
Julien
On Thursday, January 15, 2015 at 4:40:22 PM UTC+1, Julien Naour wrote:
I'm unclear on what you are trying to achieve and what doesn't work.
es-hadoop allows either a static index/type or a dynamic one [1] [2]. One can
also use a 'formatter': for example, with a pattern like
"{@timestamp:YYYY-MM-dd}", the field @timestamp will be used as the target,
but it will first be formatted into year/month/day.
There's work underway to extend that for API/real-time environments like
Spark, to customize the metadata per entry [3], in case the global settings
(which are pluggable) are not enough.
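The idea behind such a dynamic pattern can be illustrated in plain Python. This is only a sketch of the concept, not es-hadoop's actual implementation: the pattern grammar handled here and the assumption of an epoch-seconds timestamp field are simplifications.

```python
import re
from datetime import datetime, timezone

def resolve_resource(pattern, doc):
    """Sketch of dynamic resource resolution: '{field}' is replaced by the
    document's value for that field; '{field:FMT}' formats the value first.
    Only the YYYY-MM-dd format is handled, for illustration."""
    def repl(match):
        field, fmt = match.group(1), match.group(2)
        value = doc[field]
        if fmt == "YYYY-MM-dd":
            # Assume an epoch-seconds timestamp for this sketch.
            return datetime.fromtimestamp(value, tz=timezone.utc).strftime("%Y-%m-%d")
        return str(value)
    return re.sub(r"\{([^:{}]+)(?::([^}]+))?\}", repl, pattern)

# 1388534400 is 2014-01-01T00:00:00Z
resolve_resource("my-index-{@timestamp:YYYY-MM-dd}/my-type", {"@timestamp": 1388534400})
```

The point is that the target index/type is computed per document from the document's own fields, instead of being fixed once for the whole job.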
Thanks for the reply, Costin.
I wasn't really clear, but the basic idea is to index data by day according
to a field available in each line.
Example of data (~700 million lines over ~90 days):
2014-01-01,05,06,ici
2014-01-04,05,06,la
The first line has to be sent to my-index-2014-01-01/my-type and the second
to my-index-2014-01-04/my-type.
I would like to do it without having to launch 90 saveJsonToEs calls (using
the elasticsearch-hadoop Spark API).
Is that clearer?
It seems that the dynamic index could work for me. I'll try that right away.
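Given lines like the two above, the per-line target is just the first CSV field appended to the index prefix; a minimal Python sketch (field positions taken from the example, prefix and type names as in the example):

```python
def target_index(line, prefix="my-index", doc_type="my-type"):
    """Derive the daily index/type target from a CSV line whose first
    field is the date, e.g. '2014-01-01,05,06,ici'."""
    day = line.split(",")[0]  # the date field, e.g. "2014-01-01"
    return f"{prefix}-{day}/{doc_type}"

target_index("2014-01-01,05,06,ici")  # -> "my-index-2014-01-01/my-type"
```

With a dynamic index pattern this mapping happens per document inside the connector, so one save call can fan out to all ~90 daily indices.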
--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/5-LwjQxVlhk/unsubscribe.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/54BD2062.1050907%40gmail.com.
I am trying to achieve something similar. In my case, my JSON contains a
field "time" in Unix time, and I want to partition my indexes by this
field. That is, if JSON1 contains 1422904680 in "time" and JSON2
contains 1422991080, then I want to create indexes partitioned by
time (24 hours), like:
index_1422835200_1422921599 - which will contain JSON1, because the value
"time" = 1422904680 falls in its range.
index_1422921600_1423007999 - which will contain JSON2, because the value
"time" = 1422991080 falls in its range.
Now I want to create my indexes dynamically. There is also the possibility
of receiving a JSON with a time value before the current date.
To achieve this, I need to compute the index name dynamically, at write
time, then and there. Programmatically, I want to achieve something like
this
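The bucket boundaries above are midnight-aligned 24-hour windows in epoch seconds, so the index name can be derived from the "time" field alone; a small Python sketch of that calculation, matching the example ranges:

```python
SECONDS_PER_DAY = 86400

def bucket_index(unix_time, prefix="index"):
    """Map an epoch-seconds timestamp to its 24-hour bucket index name,
    e.g. 1422904680 -> index_1422835200_1422921599."""
    start = unix_time - unix_time % SECONDS_PER_DAY  # floor to the day boundary
    end = start + SECONDS_PER_DAY - 1                # inclusive end of the window
    return f"{prefix}_{start}_{end}"
```

Because the name is a pure function of "time", late-arriving documents from earlier days land in the right bucket automatically.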
Yes, that's probably the only possible workaround. What I am planning to do
is calculate the index name prior to writing, add a field named "indexname"
to my JSON, and then I will use
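The enrichment step described here can be sketched in a few lines of Python; the "indexname" field name is the one proposed above, and the 24-hour bucketing follows the earlier examples (this is an illustration, not the poster's actual code):

```python
import json

SECONDS_PER_DAY = 86400

def add_index_name(doc_json):
    """Enrich a JSON document with a precomputed 'indexname' field derived
    from its 'time' field (24-hour bucket), so a dynamic resource pattern
    can route on that field at write time."""
    doc = json.loads(doc_json)
    start = doc["time"] - doc["time"] % SECONDS_PER_DAY
    doc["indexname"] = f"index_{start}_{start + SECONDS_PER_DAY - 1}"
    return json.dumps(doc)
```

Applied to each document before the save step, every record carries its own target index name.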