Performance suggestions for indexing large documents

Are there any performance suggestions for indexing documents of size
300k-500k? I am planning to do the following:

  1. Increase the default shard count from 5 to 20
  2. Increase the heap size
  3. Enable compression
  4. Use 4 nodes

We expect a volume of 100 documents/sec. Is there anything else I should do?
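
For reference, a minimal sketch of how points 1 and 3 of that plan might be expressed at index-creation time, assuming an Elasticsearch of this era and an illustrative index name of docs. Heap size (point 2) is a per-node JVM setting and the 4 nodes (point 4) are a deployment choice, so neither appears in the index API:

# Hypothetical index creation: 20 shards (point 1) and compressed
# _source storage (point 3). The shard count cannot be changed after
# the index is created, so it has to be set here.
curl -XPUT 'http://localhost:9200/docs/' -d '{
  "settings": {
    "number_of_shards": 20,
    "number_of_replicas": 1
  },
  "mappings": {
    "doc": {
      "_source": { "compress": true }
    }
  }
}'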

One suggestion is to avoid indexing large docs, e.g. break them into smaller units like chapters, paragraphs, or even sentence groups. Elasticsearch's parent-child feature is a natural fit in this context.

Do you really want to present a 500k doc in response to a phrase search? Probably not. You want to present the matching context.

Of course, we need to know more about the requirements in order to be more helpful.
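
To make the parent-child idea concrete, here is a hedged sketch; the index, type, and field names are illustrative, not from the thread. Each file becomes a parent doc, each chapter- or paragraph-sized chunk becomes a child, and a has_child query matches on chunks while returning files:

# Declare the child type with _parent pointing at the file type.
curl -XPUT 'http://localhost:9200/docs/chunk/_mapping' -d '{
  "chunk": {
    "_parent": { "type": "file" }
  }
}'

# Index a parent and one of its chunks (the parent parameter links
# them and routes the child to the parent's shard).
curl -XPUT 'http://localhost:9200/docs/file/1' -d '{
  "fileName": "report.xml"
}'
curl -XPUT 'http://localhost:9200/docs/chunk/10?parent=1' -d '{
  "text": "chapter one ..."
}'

# Phrase-search the small chunks, but get the matching files back.
curl -XPOST 'http://localhost:9200/docs/file/_search' -d '{
  "query": {
    "has_child": {
      "type": "chunk",
      "query": {
        "query_string": { "query": "\"some phrase\"" }
      }
    }
  }
}'

The response then carries the small parent doc for each matching chunk rather than a 500k body, which is exactly the "matching context" point above.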

I have a large XML doc which consists of a set of forms that users fill
out. Essentially, each JSON doc has a list of forms, each form has a list
of fields, and each field has a value that the user types. This arrives as
an XML doc, which I then need to convert to JSON. User requirements:

  1. Users want every field to be searchable.
  2. Users want to know just the filename that a particular field might be in.
  3. They might want to pull the entire doc if needed, but in most cases
    they won't need that.

What do you suggest I should do?

Typical JSON looks like this:

JSON doc 1

{
  "fileName": "filename",
  "createdDate": "05/20/12 16:21:56",
  "setModel": [
    {
      "id": "1",
      "compliance": false,
      "forms": [
        {
          "id": "40",
          "copy": null,
          "tpsId": null,
          "forms": [
            {
              "id": "F40_SW_2",
              "copy": null,
              "tpsId": "1/F40",
              "forms": [],
              "tables": [],
              "fields": [
                { "id": "L31A", "security": null, "value": "3000." },
                { "id": "MRSSN1", "security": null, "value": "656465464" }
              ]
            }
          ]
        }
      ]
    }
  ]
}

JSON doc 2

{
  "fileName": "filename",
  "createdDate": "05/20/12 16:21:56",
  "setModel": [
    {
      "id": "1",
      "compliance": false,
      "forms": [
        {
          "id": "50",
          "copy": null,
          "tpsId": null,
          "forms": [
            {
              "id": "F50_SW_2",
              "copy": null,
              "tpsId": "1/F50",
              "forms": [],
              "tables": [],
              "fields": [
                { "id": "L31A", "security": null, "value": "3000." },
                { "id": "MRSSN1", "security": null, "value": "656465464" }
              ]
            }
          ]
        }
      ]
    }
  ]
}
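
Given documents shaped like this, and assuming the default dynamic mapping (which indexes the nested arrays under flattened, dotted field paths), a minimal sketch of requirements 1 and 2; the index and type names are illustrative, not from the thread:

# Every field is indexed by default, so this finds docs containing a
# field with id L31A, and "fields" trims the response to the filename.
curl -XPOST 'http://localhost:9200/forms/doc/_search' -d '{
  "query": {
    "query_string": { "query": "setModel.forms.forms.fields.id:L31A" }
  },
  "fields": ["fileName"]
}'

One caveat: with plain object mapping, the id and value of sibling entries in a fields array are flattened together, so a combined query such as id:L31A AND value:3000 can match across different fields. The nested type, or splitting each form into its own smaller doc as suggested earlier, avoids that cross-matching.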

Hey Mohit,

Is that 100 documents/second for indexing? That would bring you to an
average of around 40 megabytes per second (100 docs/sec at ~400 KB each),
which, while certainly possible, may be difficult with 4 nodes and
documents of that size, I'd think. If you're planning on using a single
index, 20 shards over 4 nodes may also not give you the sort of
performance you'd want, unless you're distributing them over multiple
disks, disk sets, or perhaps even virtual machines. Could you send along
a little more information on your nodes?

Patrick

These are 12-CPU, 48 GB machines, each with 8 15k-rpm disks in RAID 10
behind a caching controller card. I can add more machines as needed, but
I want to make sure that my initial design is scalable.
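
On the feeding side, one thing worth testing for a sustained 100 docs/sec is batching rather than sending one request per document. A minimal sketch with the bulk API, where the index and type names are illustrative and the batch size is something to measure rather than a recommendation:

# Several index operations in one round trip; the body is
# newline-delimited: an action line, then the document source.
curl -XPOST 'http://localhost:9200/forms/_bulk' --data-binary '
{ "index": { "_index": "forms", "_type": "doc", "_id": "1" } }
{ "fileName": "file1", "createdDate": "05/20/12 16:21:56" }
{ "index": { "_index": "forms", "_type": "doc", "_id": "2" } }
{ "fileName": "file2", "createdDate": "05/20/12 16:21:56" }
'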

Could someone help answer my questions?

Hi Mohit,

Unfortunately, you've not provided answers to all of the questions raised
in your thread. While I'm certain this is something you're eager to get
going, this is not paid support, and responses come on a best-effort
basis. You have some of the smartest search people in the open-source
community here, but alas, they have day jobs, and while they'll do their
best to respond, it's not usually the best etiquette to poke for responses
within a few hours of your last mail.

For us to best assist you, we should probably get an example document (or
documents), some idea of the statistics you're aiming for (inserts?
searches? etc.), perhaps example searches you're planning to run, and a
breakdown of the hardware (which you've partially given). The more
information you can provide, the better the answers you'll get.

Patrick

I did provide information, or at least I thought I had; my previous mail
above includes the requirements and the sample JSON docs.
