Loading JSON to ElasticSearch

IronMike · January 28, 2014, 3:45pm

I would like to get your perspective on how to load JSON into an index server in
my scenario.
We have about 15 million documents (HTML, PDF, ...) on Server 1.
I would like to process the data and convert it to JSON on Server 2.
I would like the indexer to index the JSON on a separate machine, Server 3.

Ideally, on Server 2, as I prepare the JSON and have it ready in
memory, I could feed it straight to the indexer. But since data processing is CPU
intensive, I want indexing to be done on a separate machine.
How do you deal with this, since I can no longer feed in-memory JSON to
the indexer on a separate machine? Do I just grab the files from Server 2 and
index them then?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/05b977ac-00d0-45c0-9e58-8df523e6978c%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

dadoonet · January 28, 2014, 6:05pm

Did you try https://github.com/dadoonet/fsriver?
I have never tested it with so many docs, but maybe it could help you here.

If you have already generated JSON files on a server, then I would recommend trying logstash to send them into Elasticsearch.

My 2 cents

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr
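
For the case where the JSON files already exist on the processing server, the general idea of shipping them in batches can be sketched in Python. This is only a rough illustration under stated assumptions, not what fsriver or logstash do internally: the helper names (`iter_json_files`, `bulk_batches`), the index name and the batch size are all hypothetical. The only Elasticsearch-specific part is the newline-delimited body expected by the `_bulk` endpoint (one action line followed by one document line). Older Elasticsearch releases, such as those current when this thread was written, also expect a `_type` in the action line.

```python
import json
import os
from typing import Dict, Iterable, Iterator, List


def iter_json_files(directory: str) -> Iterator[Dict]:
    """Yield one parsed document per *.json file, in sorted filename order."""
    for name in sorted(os.listdir(directory)):
        if name.endswith(".json"):
            with open(os.path.join(directory, name), encoding="utf-8") as fh:
                yield json.load(fh)


def bulk_batches(docs: Iterable[Dict], index: str, batch_size: int = 500) -> Iterator[str]:
    """Group documents into newline-delimited bodies for the _bulk endpoint.

    `index` and `batch_size` are illustrative values chosen by the caller.
    """
    lines: List[str] = []
    count = 0
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document line
        count += 1
        if count == batch_size:
            yield "\n".join(lines) + "\n"  # a bulk body must end with a newline
            lines, count = [], 0
    if lines:  # flush the final, possibly smaller, batch
        yield "\n".join(lines) + "\n"
```

Each yielded string can then be sent to the indexing machine in a single request, so the processing server never has to hold more than one batch in memory.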

IronMike · January 28, 2014, 6:56pm

Thanks David, I will certainly look into logstash. Do you think it is a good
idea to separate data analysis and indexing onto two different machines, since
both require a lot of CPU time?
If I use logstash to send files over to ES, will I be able to use the native
Java API or HTTP, and is there any preference between the two? I have noticed
that some things aren't very easy, and maybe don't even work, in
the native API.
Thanks again.

IronMike · January 28, 2014, 7:13pm

Thanks David, I will certainly look into logstash. Do you think it is a
good idea to separate data analysis and indexing onto two different machines,
since both require a lot of CPU time?
If I use logstash to send files over to ES, will I be able to use the native
Java API or HTTP, and is there any preference between the two? I have noticed
that some things aren't very easy, and maybe don't even work, in
the native API.
Thanks again.

dadoonet · January 30, 2014, 8:01am

Logstash uses the native API when you choose the elasticsearch output: http://logstash.net/docs/1.3.3/outputs/elasticsearch
As for separating the machines, I would say that you should test it. If your nodes are not used very intensively (CPU / IO), you can probably use the same machine for extracting content and producing JSON docs.

HTH

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr
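
On the HTTP side of the question: JSON prepared in memory on one machine can be posted directly to a node on another machine, with no intermediate files. A minimal Python sketch using only the standard library follows. It assumes a node reachable at a hypothetical host name such as `server3` on the default HTTP port 9200, and it assumes `body` is an already-built newline-delimited payload for the `_bulk` endpoint. The function names are illustrative, and error handling and retries are left out.

```python
import json
import urllib.request


def make_bulk_request(host: str, body: str, port: int = 9200) -> urllib.request.Request:
    """Build (but do not send) a POST to the _bulk endpoint of a remote node.

    `host` is a placeholder for the indexing machine; `body` must be
    newline-delimited JSON ending with a trailing newline.
    """
    return urllib.request.Request(
        url=f"http://{host}:{port}/_bulk",
        data=body.encode("utf-8"),
        # Recent Elasticsearch versions require this content type for bulk bodies.
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and return the parsed JSON response from the node."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `send(make_bulk_request("server3", body))` from the processing machine would then index one batch remotely. Whether this is fast enough compared with the native API is something to measure on your own data.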
