Performance suggestions for indexing large documents

Are there any performance suggestions for indexing documents of size
300k-500k? I am planning to do the following:

  1. Increase the default shard count from 5 to 20
  2. Increase the heap size
  3. Enable compression
  4. Use 4 nodes

We expect a volume of 100 documents/sec. Is there anything else I should do?
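
For reference, a minimal sketch of how points 1 and 3 of that plan might be expressed at index-creation time, assuming an Elasticsearch of this era and an illustrative index name of docs. Heap size (point 2) is a per-node JVM setting and the 4 nodes (point 4) are a deployment choice, so neither appears in the index API:

# Hypothetical index creation: 20 shards (point 1) and compressed
# _source storage (point 3). The shard count cannot be changed after
# the index is created, so it has to be set here.
curl -XPUT 'http://localhost:9200/docs/' -d '{
  "settings": {
    "number_of_shards": 20,
    "number_of_replicas": 1
  },
  "mappings": {
    "doc": {
      "_source": { "compress": true }
    }
  }
}'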

One suggestion is to avoid indexing large docs, e.g. break them into smaller units like chapters, paragraphs, or even sentence groups. Elasticsearch's parent-child feature is a natural fit in this context.

Do you really want to present a 500k doc in response to a phrase search? Probably not. You want to present the matching context.

Of course, we need to know more about the requirements in order to be more helpful.
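
To make the parent-child idea concrete, here is a hedged sketch; the index, type, and field names are illustrative, not from the thread. Each file becomes a parent doc, each chapter- or paragraph-sized chunk becomes a child, and a has_child query matches on chunks while returning files:

# Declare the child type with _parent pointing at the file type.
curl -XPUT 'http://localhost:9200/docs/chunk/_mapping' -d '{
  "chunk": {
    "_parent": { "type": "file" }
  }
}'

# Index a parent and one of its chunks (the parent parameter links
# them and routes the child to the parent's shard).
curl -XPUT 'http://localhost:9200/docs/file/1' -d '{
  "fileName": "report.xml"
}'
curl -XPUT 'http://localhost:9200/docs/chunk/10?parent=1' -d '{
  "text": "chapter one ..."
}'

# Phrase-search the small chunks, but get the matching files back.
curl -XPOST 'http://localhost:9200/docs/file/_search' -d '{
  "query": {
    "has_child": {
      "type": "chunk",
      "query": {
        "query_string": { "query": "\"some phrase\"" }
      }
    }
  }
}'

The response then carries the small parent doc for each matching chunk rather than a 500k body, which is exactly the "matching context" point above.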

I have a large XML doc which consists of a set of forms that users fill
out. Essentially, each JSON doc has a list of forms, each form has a list
of fields, and each field has a value that the user types. This arrives as
an XML doc, which I then need to convert to JSON. User requirements:

  1. Users want every field to be searchable.
  2. Users want to know just the filename that a particular field might be in.
  3. They might want to pull the entire doc if needed, but in most cases
    they won't need that.

What do you suggest I should do?

Typical JSON looks like this:

JSON doc 1

{
  "fileName": "filename",
  "createdDate": "05/20/12 16:21:56",
  "setModel": [
    {
      "id": "1",
      "compliance": false,
      "forms": [
        {
          "id": "40",
          "copy": null,
          "tpsId": null,
          "forms": [
            {
              "id": "F40_SW_2",
              "copy": null,
              "tpsId": "1/F40",
              "forms": [],
              "tables": [],
              "fields": [
                { "id": "L31A", "security": null, "value": "3000." },
                { "id": "MRSSN1", "security": null, "value": "656465464" }
              ]
            }
          ]
        }
      ]
    }
  ]
}

JSON doc 2

{
  "fileName": "filename",
  "createdDate": "05/20/12 16:21:56",
  "setModel": [
    {
      "id": "1",
      "compliance": false,
      "forms": [
        {
          "id": "50",
          "copy": null,
          "tpsId": null,
          "forms": [
            {
              "id": "F50_SW_2",
              "copy": null,
              "tpsId": "1/F50",
              "forms": [],
              "tables": [],
              "fields": [
                { "id": "L31A", "security": null, "value": "3000." },
                { "id": "MRSSN1", "security": null, "value": "656465464" }
              ]
            }
          ]
        }
      ]
    }
  ]
}
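
Given documents shaped like this, and assuming the default dynamic mapping (which indexes the nested arrays under flattened, dotted field paths), a minimal sketch of requirements 1 and 2; the index and type names are illustrative, not from the thread:

# Every field is indexed by default, so this finds docs containing a
# field with id L31A, and "fields" trims the response to the filename.
curl -XPOST 'http://localhost:9200/forms/doc/_search' -d '{
  "query": {
    "query_string": { "query": "setModel.forms.forms.fields.id:L31A" }
  },
  "fields": ["fileName"]
}'

One caveat: with plain object mapping, the id and value of sibling entries in a fields array are flattened together, so a combined query such as id:L31A AND value:3000 can match across different fields. The nested type, or splitting each form into its own smaller doc as suggested earlier, avoids that cross-matching.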

Hey Mohit,

Is that 100 documents/second for indexing? That would bring you to an
average of around 40 megabytes per second (100 docs/sec at ~400 KB each),
which, while certainly possible, may be difficult with 4 nodes and
documents of that size, I'd think. If you're planning on using a single
index, 20 shards over 4 nodes may also not give you the sort of
performance you'd want, unless you're distributing them over multiple
disks, disk sets, or perhaps even virtual machines. Could you send along
a little more information on your nodes?

Patrick

These are 12-CPU, 48 GB machines, each with 8 15k-rpm disks in RAID 10
behind a caching controller card. I can add more machines as needed, but
I want to make sure that my initial design is scalable.
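
On the feeding side, one thing worth testing for a sustained 100 docs/sec is batching rather than sending one request per document. A minimal sketch with the bulk API, where the index and type names are illustrative and the batch size is something to measure rather than a recommendation:

# Several index operations in one round trip; the body is
# newline-delimited: an action line, then the document source.
curl -XPOST 'http://localhost:9200/forms/_bulk' --data-binary '
{ "index": { "_index": "forms", "_type": "doc", "_id": "1" } }
{ "fileName": "file1", "createdDate": "05/20/12 16:21:56" }
{ "index": { "_index": "forms", "_type": "doc", "_id": "2" } }
{ "fileName": "file2", "createdDate": "05/20/12 16:21:56" }
'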

Could someone help answer my questions?

Hi Mohit,

Unfortunately, you've not provided answers to all of the questions raised
in your thread. While I'm certain this is something you're eager to get
going, this is not paid support, and responses come on a best-effort
basis. You have some of the smartest search people in the open-source
community here, but alas, they have day jobs, and while they'll do their
best to respond, it's not usually the best etiquette to poke for responses
within a few hours of your last mail.

For us to best assist you, we should probably get an example document (or
documents), some idea of the statistics you're aiming for (inserts?
searches? etc.), perhaps example searches you're planning to run, and a
breakdown of the hardware (which you've partially given). The more
information you can provide, the better the answers you'll get.

Patrick

I did provide information, or at least I thought I had; my previous mail
above includes the requirements and the sample JSON docs.
