Deployment architecture

Hi,

I am building a product search based on Elasticsearch at my workplace.
I have a few general, architectural and deployment questions for the more
experienced among you.

Here are a few numbers:

  • 25M products
  • 23 GB of indexed data
  • 20 searches / second at peak times
  • Accessed from the NEST client (C#); this might be irrelevant, but still

What would be the optimal settings (number of nodes, shards, replicas) and
physical machine specs (RAM, cores) to fulfill my requirements?

I am currently in the development stage and using:

  • 2 vm's running Linux with 12GB each
  • 2 Nodes each per vm , each node configured the same
    {node.master=true;node.data=true}
  • Elastic process with 8G max heap size
  • 5 shards with 1 replica

It seems that I am getting into trouble with this configuration. First of
all, my indexing takes ages: I reach 1000 documents / sec when indexing, and
I am not sure where the bottleneck is, or whether there is one at all.
Then, suddenly, one of my nodes might die without any error logs, just
"shutting down". Has anyone had this experience?

Here is my mapping and also a sample document from the index. I would love
to hear if there is anything I could do better.

Mapping:

{
  "product": {
    "properties": {
      "author": {
        "type": "string",
        "index": "not_analyzed"
      },
      "boosts": {
        "properties": {
          "isAvailable": {
            "type": "boolean"
          },
          "searsProduct": {
            "type": "boolean"
          },
          "staticBoost": {
            "type": "double"
          }
        }
      },
      "categoryPath": {
        "type": "long"
      },
      "extendedFacets": {
        "type": "string",
        "index": "not_analyzed"
      },
      "externalId": {
        "type": "string",
        "index": "not_analyzed"
      },
      "externalProductId": {
        "type": "string",
        "index": "not_analyzed"
      },
      "gtin": {
        "type": "string",
        "index": "not_analyzed"
      },
      "id": {
        "type": "long"
      },
      "likes": {
        "type": "integer"
      },
      "manufacturer": {
        "type": "string",
        "boost": 2,
        "index": "not_analyzed"
      },
      "manufacturerPartNumber": {
        "type": "string",
        "index": "not_analyzed"
      },
      "name": {
        "type": "string",
        "omit_norms": true
      },
      "nameSort": {
        "type": "string",
        "index": "not_analyzed"
      },
      "owns": {
        "type": "integer"
      },
      "price": {
        "type": "double"
      },
      "priceBucket": {
        "type": "double"
      },
      "seller": {
        "type": "string",
        "boost": 2,
        "index": "not_analyzed"
      },
      "tagNames": {
        "type": "string"
      },
      "tags": {
        "type": "long"
      },
      "tagsForFacet": {
        "type": "long"
      },
      "wants": {
        "type": "integer"
      }
    }
  }
}

Sample product:

{
  "_index": "products",
  "_type": "product",
  "_id": "43688439",
  "_score": 1,
  "_source": {
    "id": 43688439,
    "name": "ROPER REIN NYLON",
    "nameSort": "roper rein",
    "externalId": "2095933377",
    "manufacturerPartNumber": "352030BLK",
    "gtin": "0000399118010",
    "manufacturer": "Weaver Leather",
    "seller": "Big Dee's Tack & Vet Supplies",
    "price": 13.5,
    "priceBucket": 15,
    "categoryPath": [
      4520,
      4728
    ],
    "boosts": {
      "staticBoost": 1,
      "searsProduct": false,
      "isAvailable": true
    },
    "extendedFacets": [],
    "tags": [],
    "tagsForFacet": [],
    "tagNames": [],
    "likes": 0,
    "owns": 0,
    "wants": 0
  }
}

--

Hi Roman,

On Tue, Nov 20, 2012 at 1:07 PM, Roman Kournjaev kournjaev@gmail.com wrote:

I am currently in the stage of developing and using :

2 vm's running Linux with 12GB each
2 Nodes each per vm , each node configured the same
{node.master=true;node.data=true}
Elastic process with 8G max heap size
5 shards with 1 replica

I don't think you need 2 nodes per VM; I would assume one node per VM
is a better use of resources. Also, regarding memory settings:

  • the rule of thumb is to allocate ~half of your total memory to ES, to
    leave some space for the OS to do caching. If you have 2x8GB of heap on
    12GB of RAM, you're probably swapping
  • assuming that you only run ES on the machine, it would be good to
    have min_size=max_size, that is, the minimum and maximum heap size set
    to the same value. So I think you should start with ES_HEAP_SIZE=6g for
    a single node per VM
  • you can make sure the ES memory is not swapped by setting
    bootstrap.mlockall to "true" in your configuration. But before that
    you might need to run ulimit -l unlimited
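
As a rough illustration of the arithmetic behind the points above, here is a minimal Python sketch. The function names are made up for this example, and the 50% fraction is only the rule of thumb mentioned above, not a hard limit.

```python
def suggested_heap_gb(total_ram_gb, heap_fraction=0.5):
    """Rule-of-thumb heap size: about half of RAM, the rest is left to the OS cache."""
    return total_ram_gb * heap_fraction


def is_oversubscribed(total_ram_gb, nodes_per_vm, heap_per_node_gb):
    """True when the configured heaps alone exceed the VM's physical RAM."""
    return nodes_per_vm * heap_per_node_gb > total_ram_gb


if __name__ == "__main__":
    # A 12 GB VM as described in this thread, running a single node.
    print(suggested_heap_gb(12))        # 6.0
    # Two nodes with 8 GB of heap each would need 16 GB for heap alone.
    print(is_oversubscribed(12, 2, 8))  # True
```

The second check is the "2x8GB on 12GB" case: the heaps by themselves already exceed the RAM, before the OS cache gets anything.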

Also, you might find that searches are faster if you set your
index to have fewer shards.
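
To make the shard point a bit more concrete, here is a small Python sketch, offered as an assumption-laden illustration rather than a recommendation. It counts how many shard copies each node ends up holding and builds an index-creation body with a lower shard count. The helper names are hypothetical, and the value 2 is only a placeholder to experiment with, not a measured optimum.

```python
import json


def shard_copies_per_node(primaries, replicas, nodes):
    """Average number of shard copies (primaries plus replicas) held by each node."""
    return primaries * (1 + replicas) / nodes


def index_settings(number_of_shards, number_of_replicas=1):
    """Request body for creating an index with the given shard layout."""
    return {
        "settings": {
            "number_of_shards": number_of_shards,
            "number_of_replicas": number_of_replicas,
        }
    }


if __name__ == "__main__":
    # Current layout in this thread: 5 shards, 1 replica, 2 nodes.
    print(shard_copies_per_node(5, 1, 2))  # 5.0
    # Hypothetical smaller layout to try; 2 is an assumed value.
    print(json.dumps(index_settings(2), indent=2))
```

The shard count is chosen when the index is created, so trying a different value means creating a new index and reindexing into it.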

Beyond that, I think it has a lot to do with what typical queries look
like, how much indexing activity you have, and so on.

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

Roman Kournjaev wrote:

  • 2 vm's running Linux with 12GB each
  • 2 Nodes each per vm , each node configured the same
    {node.master=true;node.data=true}
  • Elastic process with 8G max heap size
  • 5 shards with 1 replica

Just to clarify, does this mean you're running two ES nodes on a 12G
virtual machine, each with -Xmx8g? If so, reduce that to one node per
VM with a heap more in the range of -Xmx6g or -Xmx7g.

It seems that I am getting into trouble with this configuration. First of
all, my indexing takes ages: I reach 1000 documents / sec when indexing, and
I am not sure where the bottleneck is, or whether there is one at all.

1k docs/sec isn't too bad actually. Are you using the bulk API?
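
In case it helps, here is a minimal sketch of what a bulk request body looks like, written in Python purely for illustration (the thread's client is NEST, which has its own bulk support). The helper name `build_bulk_body` is made up. The index and type names are taken from the sample document in this thread, and the batch shown is far smaller than anything you would send in practice.

```python
import json


def build_bulk_body(docs, index="products", doc_type="product"):
    """Build a newline-delimited body for the _bulk endpoint.

    Each document becomes two lines: an action line, then the document source.
    The body must end with a newline.
    """
    if not docs:
        return ""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc["id"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    batch = [
        {"id": 43688439, "name": "ROPER REIN NYLON", "price": 13.5},
        {"id": 43688440, "name": "hypothetical second product", "price": 9.99},
    ]
    print(build_bulk_body(batch), end="")
```

The resulting string is what gets POSTed to `_bulk`; sending a few hundred to a few thousand documents per request is a common starting point, but the right batch size is something to measure.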

Then, suddenly, one of my nodes might die without any error logs, just
"shutting down". Has anyone had this experience?

This is likely due to the oom_killer if you're running on Linux,
which isn't surprising with your heap settings. Check
/var/log/kern.log or /var/log/messages for "killed" or "oom_killer"
lines.
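
If you prefer a script over grep, a small Python sketch along these lines could do the check. The function names and the marker strings are assumptions for illustration; the exact wording of kernel messages varies between kernel versions, so the match is deliberately loose and case-insensitive.

```python
def find_oom_lines(lines, markers=("oom", "killed")):
    """Return the lines that contain any of the markers, ignoring case."""
    lowered = tuple(marker.lower() for marker in markers)
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line.lower() for marker in lowered)
    ]


def scan_logs(paths=("/var/log/kern.log", "/var/log/messages")):
    """Scan the given log files, skipping any that are missing or unreadable."""
    hits = []
    for path in paths:
        try:
            with open(path, errors="replace") as handle:
                hits.extend(find_oom_lines(handle))
        except OSError:
            # The file does not exist on this distribution, or needs root to read.
            continue
    return hits


if __name__ == "__main__":
    for hit in scan_logs():
        print(hit)
```

Reading these files usually requires root, so an empty result without root privileges does not prove much.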

-Drew

--

I didn't express myself clearly enough.
I mean 2 VMs and only 1 node on each VM.

On Tuesday, November 20, 2012 5:28:31 PM UTC+2, Drew Raines wrote:

Roman Kournjaev wrote:

  • 2 vm's running Linux with 12GB each
  • 2 Nodes each per vm , each node configured the same
    {node.master=true;node.data=true}
  • Elastic process with 8G max heap size
  • 5 shards with 1 replica

Just to clarify, does this mean you're running two ES nodes on a 12G
virtual machine, each with -Xmx8g? If so, reduce that to one node per
VM with a heap more in the range of -Xmx6g or -Xmx7g.

--

Then, suddenly, one of my nodes might die without any error logs, just
"shutting down". Has anyone had this experience?

This is likely due to the oom_killer if you're running on Linux,
which isn't surprising with your heap settings. Check
/var/log/kern.log or /var/log/messages for "killed" or "oom_killer"
lines.

Yes. Also, observe heap_used vs heap_max with a tool such as BigDesk or
Paramedic, or via the API. Depending on what type of searches you do,
whether and how much you're faceting, and whether you're sorting and on how
many fields, the available RAM may not be able to keep up with the demand.
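
As a sketch of what to look for, assuming you have already read the two numbers from one of those tools or from the API, the comparison is just a ratio. The function names and the 0.75 threshold below are arbitrary assumptions for illustration, and the code does not talk to Elasticsearch itself.

```python
def heap_usage_ratio(heap_used_bytes, heap_max_bytes):
    """Fraction of the maximum heap that is currently in use."""
    if heap_max_bytes <= 0:
        raise ValueError("heap_max_bytes must be positive")
    return heap_used_bytes / heap_max_bytes


def heap_under_pressure(heap_used_bytes, heap_max_bytes, threshold=0.75):
    """True when usage is at or above the (assumed) warning threshold."""
    return heap_usage_ratio(heap_used_bytes, heap_max_bytes) >= threshold


if __name__ == "__main__":
    gib = 1024 ** 3
    # Example numbers only: 6 GiB used out of an 8 GiB maximum heap.
    print(heap_usage_ratio(6 * gib, 8 * gib))     # 0.75
    print(heap_under_pressure(6 * gib, 8 * gib))  # True
```

A single reading says little; what matters is whether the ratio stays high after garbage collection while faceting and sorting queries are running.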

Also consider setting the heap for ES to 1/2 of total RAM.

Karel

--