Design question

We are in the process of evaluating Elasticsearch for potentially indexing
hundreds of millions of XML files every 6 months. We'll get these files in
near real time and need to index them as they arrive. The XML looks like this:

<field id="DaytimePhoneNumber">7777777777</field> <field id="...">INFOWKS, SR</field>

I have Elasticsearch working with the HDFS gateway. Currently it's using
all the default values. I am starting out testing with one node.

I have several questions; I would really appreciate it if someone could
help me answer them:

  1. How should I design my indexing logic? Should I split indices by year?
  2. How should I design my sharding logic? How do I decide how many shards
    are enough per node?
  3. Should I just convert the above XML to JSON?
  4. Since this XML is just key-value pairs, does it pose any problems for
    queries? E.g., can somebody say "get me all the fields that have an id
    of DaytimePhoneNumber"?
  5. All of the XML files are going to be stored in Hadoop, and from Hadoop
    we'll index all the files. Does it make sense to also use HDFS as the
    gateway?
  6. What could be the issue with using HDFS as a gateway? Or is it still a
    good idea?
  7. And most important, is there a way to automatically index data from
    Hadoop using MapReduce?
  8. Anything specific I should be worried about?

Could someone help me answer these questions?


Answers below:

On Tue, May 15, 2012 at 6:47 AM, Mohit Anchlia mohitanchlia@gmail.com wrote:

We are in the process of evaluating Elasticsearch for potentially indexing
hundreds of millions of XML files every 6 months. We'll get these files in
near real time and need to index them as they arrive. The XML looks like this:

<field id="DaytimePhoneNumber">7777777777</field> <field id="...">INFOWKS, SR</field>

I have Elasticsearch working with the HDFS gateway. Currently it's using
all the default values. I am starting out testing with one node.

I recommend using the local gateway; it's considerably more lightweight than
the HDFS gateway.

I have several questions; I would really appreciate it if someone could
help me answer them:

  1. How should I design my indexing logic? Should I split indices by year?

Hard to say, not enough information. This thread might help:
https://groups.google.com/forum/?fromgroups#!searchin/elasticsearch/data$20flow/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ
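
For what it's worth, if you do split by year, one common pattern is an
index per year plus an alias spanning them, so searches don't need to know
the index names. A sketch (the index and alias names here are made up):

curl -XPUT 'http://localhost:9200/docs-2012'
curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions": [
    { "add": { "index": "docs-2012", "alias": "docs" } }
  ]
}'

Each new year gets its own index added to the same alias, and you can still
target a single year's index directly when the query allows it.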

  2. How should I design my sharding logic? How do I decide how many shards
    are enough per node?

It really depends on the hardware you have. The best thing is to do some
initial capacity planning, since a lot of it depends on what you index and
how you search your data.
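
Note that the shard count is fixed when an index is created (replicas can
be changed later), so it pays to settle on a number before loading data.
Something like this sets it explicitly (the numbers are placeholders, not a
recommendation):

curl -XPUT 'http://localhost:9200/docs-2012' -d '{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'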

  3. Should I just convert the above XML to JSON?

Yes, sounds good.

  4. Since this XML is just key-value pairs, does it pose any problems for
    queries? E.g., can somebody say "get me all the fields that have an id
    of DaytimePhoneNumber"?

You probably want to have the JSON field named DaytimePhoneNumber, i.e.,
something like this: "DaytimePhoneNumber" : "value goes here".
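
So after conversion a document would be flat key-value JSON, and a query
can target the id directly as a field name. A quick sketch (the Name field
and all values are made up):

{
  "DaytimePhoneNumber": "7777777777",
  "Name": "INFOWKS, SR"
}

curl -XPOST 'http://localhost:9200/docs/_search' -d '{
  "query": { "term": { "DaytimePhoneNumber": "7777777777" } }
}'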

  5. All of the XML files are going to be stored in Hadoop, and from Hadoop
    we'll index all the files. Does it make sense to also use HDFS as the
    gateway?

I recommend using the local gateway. The HDFS gateway means periodic
snapshots of the index files will be stored in Hadoop as well (but the
index will still be stored locally on each node).

  6. What could be the issue with using HDFS as a gateway? Or is it still a
    good idea?

See above. Basically, the local gateway came after the "shared" gateways
(like HDFS). The proper future solution for backups is a backup API that
allows backing up the index files and cluster metadata to HDFS, for
example.

  7. And most important, is there a way to automatically index data from
    Hadoop using MapReduce?

You will need to write some code to do that: basically a MapReduce job that
reads data from HDFS, munges it, and indexes it. wonderdog by Infochimps
can provide a starting point:
https://github.com/infochimps-labs/wonderdog
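
Whatever you build with, the indexing side usually boils down to the bulk
API: each map/reduce task batches its documents into newline-delimited
pairs of an action line and a source line, POSTed to /_bulk. A rough sketch
(index/type names and the documents are placeholders):

{ "index": { "_index": "docs", "_type": "doc", "_id": "1" } }
{ "DaytimePhoneNumber": "7777777777" }
{ "index": { "_index": "docs", "_type": "doc", "_id": "2" } }
{ "DaytimePhoneNumber": "8888888888" }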

  8. Anything specific I should be worried about?

On Wed, May 16, 2012 at 3:37 PM, Shay Banon kimchy@gmail.com wrote:

You probably want to have the JSON field named DaytimePhoneNumber, i.e.,
something like this: "DaytimePhoneNumber" : "value goes here".

But the problem is that I am converting the XML to JSON and directly
indexing that. Users can search on any of the <field id="xyz"> id values.
How do I go about making this searchable? From what you describe, I'll have
to read the entire XML and transform it somehow, which I want to avoid.

Do you think I should write my own converter that emits name : value pairs
instead of just converting the XML to JSON as-is?

On Sun, May 20, 2012 at 12:48 PM, Shay Banon kimchy@gmail.com wrote:
Based on what you described, it makes more sense to index the data as I
suggested, so yes, you will need to write a converter that does that. It
really depends on the type of searches you want to execute.
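
Concretely, such a converter would lift each field element's id up into a
JSON key, something like:

<field id="DaytimePhoneNumber">7777777777</field>  ->  "DaytimePhoneNumber": "7777777777"

so the ids become searchable field names instead of values buried inside
generic "field" objects.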


The problem I am having is this. Say I have the two JSON docs below. Now I
want to query forms.id = 40 AND fields.id = L31A AND fields.value = 3000.
I expect it to return doc 1, but with a regular search I'll also get doc 2.
What's the best way of designing searches for such queries? I tried the
"and" filter, but that doesn't seem to work either. I am unable to follow
the examples on the site, as most of them, I think, assume a good knowledge
of Lucene.
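
For reference, this is roughly the "and" filter I tried (a sketch from
memory; the exact field paths may be off):

curl -XPOST 'http://localhost:9200/docs/_search' -d '{
  "query": { "match_all": {} },
  "filter": {
    "and": [
      { "term": { "forms.id": "40" } },
      { "term": { "fields.id": "L31A" } },
      { "term": { "fields.value": "3000." } }
    ]
  }
}'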

Json doc 1

{
  "fileName": "filename",
  "createdDate": "05/20/12 16:21:56",
  "setModel": [
    {
      "id": "1",
      "compliance": false,
      "forms": [
        {
          "id": "40",
          "copy": null,
          "tpsId": null,
          "forms": [
            {
              "id": "F40_SW_2",
              "copy": null,
              "tpsId": "1/F40",
              "forms": [],
              "tables": [],
              "fields": [
                { "id": "L31A", "security": null, "value": "3000." },
                { "id": "MRSSN1", "security": null, "value": "656465464" }
              ]
            }
          ]
        }
      ]
    }
  ]
}

Json doc 2

{
  "fileName": "filename",
  "createdDate": "05/20/12 16:21:56",
  "setModel": [
    {
      "id": "1",
      "compliance": false,
      "forms": [
        {
          "id": "50",
          "copy": null,
          "tpsId": null,
          "forms": [
            {
              "id": "F50_SW_2",
              "copy": null,
              "tpsId": "1/F50",
              "forms": [],
              "tables": [],
              "fields": [
                { "id": "L31A", "security": null, "value": "3000." },
                { "id": "MRSSN1", "security": null, "value": "656465464" }
              ]
            }
          ]
        }
      ]
    }
  ]
}
