Deployment architecture

Hi

I am building a product search based on Elasticsearch at my workplace, and
I have a few general architecture and deployment questions for the more
experienced among you.

Here are a few numbers:

  • 25M products
  • 23 GB of indexed data
  • 20 searches / second at peak times
  • Accessed from the NEST client (C#); might be irrelevant, but still...

What would be the optimal settings (number of nodes, shards, replicas) and
how many physical machines (RAM, cores) would I need to fulfill my
requirements?

I am currently in the development stage, using:

  • 2 VMs running Linux with 12 GB RAM each
  • 2 nodes per VM, each node configured the same:
    {node.master=true; node.data=true}
  • Elasticsearch process with an 8 GB max heap
  • 5 shards with 1 replica

It seems I am getting into trouble with this configuration. First of all,
my indexing takes ages: I reach 1,000 documents/second when indexing, and
I am not sure where the bottleneck is, or whether there is one at all.
Then, suddenly, one of my nodes might die without any error logs, just
"shutting down". Has anyone had this experience?

Here is my mapping, along with a sample document from the index. I would
love to hear if there is anything I could do better.

Mapping:

{
  "product": {
    "properties": {
      "author": {
        "type": "string",
        "index": "not_analyzed"
      },
      "boosts": {
        "properties": {
          "isAvailable": {
            "type": "boolean"
          },
          "searsProduct": {
            "type": "boolean"
          },
          "staticBoost": {
            "type": "double"
          }
        }
      },
      "categoryPath": {
        "type": "long"
      },
      "extendedFacets": {
        "type": "string",
        "index": "not_analyzed"
      },
      "externalId": {
        "type": "string",
        "index": "not_analyzed"
      },
      "externalProductId": {
        "type": "string",
        "index": "not_analyzed"
      },
      "gtin": {
        "type": "string",
        "index": "not_analyzed"
      },
      "id": {
        "type": "long"
      },
      "likes": {
        "type": "integer"
      },
      "manufacturer": {
        "type": "string",
        "boost": 2,
        "index": "not_analyzed"
      },
      "manufacturerPartNumber": {
        "type": "string",
        "index": "not_analyzed"
      },
      "name": {
        "type": "string",
        "omit_norms": true
      },
      "nameSort": {
        "type": "string",
        "index": "not_analyzed"
      },
      "owns": {
        "type": "integer"
      },
      "price": {
        "type": "double"
      },
      "priceBucket": {
        "type": "double"
      },
      "seller": {
        "type": "string",
        "boost": 2,
        "index": "not_analyzed"
      },
      "tagNames": {
        "type": "string"
      },
      "tags": {
        "type": "long"
      },
      "tagsForFacet": {
        "type": "long"
      },
      "wants": {
        "type": "integer"
      }
    }
  }
}

Sample product:

{
  "_index": "products",
  "_type": "product",
  "_id": "43688439",
  "_score": 1,
  "_source": {
    "id": 43688439,
    "name": "ROPER REIN NYLON",
    "nameSort": "roper rein",
    "externalId": "2095933377",
    "manufacturerPartNumber": "352030BLK",
    "gtin": "0000399118010",
    "manufacturer": "Weaver Leather",
    "seller": "Big Dee's Tack & Vet Supplies",
    "price": 13.5,
    "priceBucket": 15,
    "categoryPath": [4520, 4728],
    "boosts": {
      "staticBoost": 1,
      "searsProduct": false,
      "isAvailable": true
    },
    "extendedFacets": [],
    "tags": [],
    "tagsForFacet": [],
    "tagNames": [],
    "likes": 0,
    "owns": 0,
    "wants": 0
  }
}

--

Hi Roman,

On Tue, Nov 20, 2012 at 1:07 PM, Roman Kournjaev kournjaev@gmail.com wrote:


I am currently in the development stage, using:

2 VMs running Linux with 12 GB RAM each
2 nodes per VM, each node configured the same:
{node.master=true; node.data=true}
Elasticsearch process with an 8 GB max heap
5 shards with 1 replica

I don't think you need 2 nodes per VM; one node per VM would be a better
use of resources. Also, regarding memory settings:

  • the rule of thumb is to allocate ~half of your total memory to ES, to
    leave some room for the OS to do caching. If you have two 8 GB heaps on
    12 GB of RAM, you're probably swapping
  • assuming you only run ES on the machine, it is good to have
    min_size=max_size. So I think you should start with ES_HEAP_SIZE=6g for
    a single node per VM. You can find more info on the topic here:
    http://www.elasticsearch.org/guide/reference/setup/installation.html
  • you can make sure the ES memory is not swapped by setting
    bootstrap.mlockall to "true" in your configuration. But before that
    you might need to run `ulimit -l unlimited`

Also, you might find that searches are faster if you configure your index
with fewer shards.
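Note that the shard count is fixed when an index is created, so trying
fewer shards means reindexing into a new index created with settings along
these lines (the shard and replica numbers here are just an example):

```json
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}
```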

Beyond that, I think it depends a lot on what your typical queries look
like, how much indexing activity you have, and so on.

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene


--

Roman Kournjaev wrote:

  • 2 vm's running Linux with 12GB each
  • 2 Nodes each per vm , each node configured the same
    {node.master=true;node.data=true}
  • Elastic process with 8G max heap size
  • 5 shards with 1 replica

Just to clarify, does this mean you're running two ES nodes on each 12 GB
virtual machine, each with -Xmx8g? If so, reduce that to one node per VM,
more in the range of -Xmx6g or -Xmx7g.

It seems I am getting into trouble with this configuration. First of all,
my indexing takes ages: I reach 1,000 documents/second when indexing, and
I am not sure where the bottleneck is, or whether there is one at all.

1k docs/sec isn't too bad, actually. Are you using the bulk API?
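For reference, the _bulk endpoint takes an NDJSON body: one action line
followed by one source line per document. A small sketch of building that
payload (the index/type names come from the sample above; the helper
function itself is hypothetical, not part of any client library):

```python
import json

def build_bulk_body(products, index="products", doc_type="product"):
    """Build an NDJSON _bulk payload: one action line plus one source
    line per document, terminated by a trailing newline as the bulk
    API requires."""
    lines = []
    for p in products:
        # Action line: index this document under its own id.
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": p["id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(p))
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    {"id": 43688439, "name": "ROPER REIN NYLON", "price": 13.5},
    {"id": 43688440, "name": "EXAMPLE PRODUCT", "price": 9.99},
])
```

POST the resulting body to a node's /_bulk endpoint in batches of a few
thousand documents, rather than one request per document; that is usually
far faster than single-document indexing.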

Then, suddenly, one of my nodes might die without any error logs, just
"shutting down". Has anyone had this experience?

This is likely due to the oom_killer if you're running on Linux,
which isn't surprising with your heap settings. Check
/var/log/kern.log or /var/log/messages for "killed" or "oom_killer"
lines.

-Drew

--

I didn't express myself clearly enough.
I meant 2 VMs, with only 1 node on each VM.

On Tuesday, November 20, 2012 5:28:31 PM UTC+2, Drew Raines wrote:

Just to clarify, does this mean you're running two ES nodes on a 12G
virtual machine each with -Xmx8g?

--

Then, suddenly, one of my nodes might die without any error logs, just
"shutting down". Has anyone had this experience?


Yes. Also, observe heap_used vs. heap_max with a tool such as BigDesk or
Paramedic, or via the API. Depending on what type of searches you run,
whether and how much you facet, and whether you sort and on how many
fields, the available RAM may not be able to keep up with demand.

Also consider setting the ES heap to 1/2 of total RAM.

Karel

--