Hi,
We have been using Elasticsearch 0.19.2 to store and analyze data
from social media blogs and forums. The data volume goes up to
500,000 documents per index, and the size of this data in the
Elasticsearch index goes up to 3 GB per index per node (all
shards). We always keep the number of replicas at one less than the
total number of nodes, so that a copy of every shard resides on
every node at all times. The number of shards is generally 10 for
indexes of the size mentioned above.
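To make the replica rule concrete, here is a small sketch of the arithmetic in Python. The function name is ours, not part of Elasticsearch, and it assumes shards are allocated evenly. The node counts are the ones from the two configurations listed later in this post.

```python
def shard_copies(primary_shards: int, replicas: int, nodes: int) -> dict:
    """Count shard copies for one index and how they spread over the nodes."""
    # Every primary shard exists once, plus once per replica.
    total = primary_shards * (1 + replicas)
    return {
        "total_copies": total,
        "copies_per_node": total / nodes,  # assumes an even allocation
    }


# replicas = nodes - 1, as described above
for nodes in (2, 6):
    print(nodes, shard_copies(primary_shards=10, replicas=nodes - 1, nodes=nodes))
```

With replicas set to nodes - 1, each node ends up holding one copy of all 10 shards, regardless of how many nodes there are.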
We run different queries on this data for advanced visualization,
mainly facets for showing trend charts or keyword clouds.
The following are examples of the queries we execute:
{
  "query" : {
    "match_all" : { }
  },
  "size" : 0,
  "facets" : {
    "tag" : {
      "terms" : {
        "field" : "nouns",
        "size" : 100
      },
      "_cache" : false
    }
  }
}
{
  "query" : {
    "match_all" : { }
  },
  "size" : 0,
  "facets" : {
    "tag" : {
      "terms" : {
        "field" : "phrases",
        "size" : 100
      },
      "_cache" : false
    }
  }
}
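The two requests differ only in the field name, so they can be generated from one template. The following is a minimal Python sketch of that; the helper name terms_facet_body is ours, and it only builds the JSON body, it does not talk to Elasticsearch.

```python
import json


def terms_facet_body(field: str, size: int = 100) -> dict:
    """Build the request body for a terms facet over all documents."""
    return {
        "query": {"match_all": {}},
        "size": 0,  # return no hits, only the facet counts
        "facets": {
            "tag": {
                "terms": {"field": field, "size": size},
                "_cache": False,
            }
        },
    }


for field in ("nouns", "phrases"):
    print(json.dumps(terms_facet_body(field), indent=2))
```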
While executing such queries we often run out of heap space, and the
nodes become unresponsive. Our main concern is that the nodes do not
return to a normal state even after dumping the heap to an hprof
file. The java.exe process still consumes the maximum allocated
memory as shown in Task Manager, and the nodes remain unresponsive
until we manually kill and restart them.
ES Configuration 1:
ElasticSearch Version 0.19.2
2 Nodes, one on each physical server
Max heap size 6GB per node.
10 shards, 1 replica.
ES Configuration 2:
ElasticSearch Version 0.19.2
6 Nodes, three on each physical server
Max heap size 2GB per node.
10 shards, 5 replicas.
Server Configuration:
Windows 7 64 bit
64 bit JVM
8 GB physical memory
Dual Core processor
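For reference, here is how the heap settings above add up against the 8 GB of physical memory on each server. This is only a rough Python sketch: the function name is ours, and it ignores any JVM memory used outside the heap.

```python
def heap_budget(physical_gb: float, nodes_per_server: int, heap_gb_per_node: float) -> dict:
    """Compare the total max heap on one server with its physical memory."""
    committed = nodes_per_server * heap_gb_per_node
    return {
        "heap_committed_gb": committed,
        # What remains for the OS, the filesystem cache and everything else.
        "left_for_os_gb": physical_gb - committed,
    }


print("Configuration 1:", heap_budget(8, nodes_per_server=1, heap_gb_per_node=6))
print("Configuration 2:", heap_budget(8, nodes_per_server=3, heap_gb_per_node=2))
```

In both configurations 6 GB of the 8 GB is committed to Java heap on each server, leaving about 2 GB for everything else.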
In both of the configurations above, Elasticsearch was unable to
respond to the facet queries shown earlier, and it was also unable to
recover once a query had failed due to heap space shortage.
We are facing this issue in our production environments, and would
appreciate suggestions for a better configuration, or a different
approach if required.
The mapping of the data we use is as follows
(keyword1 is a customized keyword analyzer; similarly, standard1 is a
customized standard analyzer):
{
"properties": {
"adjectives": {
"type": "string",
"analyzer": "stop2"
},
"alertStatus": {
"type": "string",
"analyzer": "keyword1"
},
"assignedByUserId": {
"type": "integer",
"index": "analyzed"
},
"assignedByUserName": {
"type": "string",
"analyzer": "keyword1"
},
"assignedToDepartmentId": {
"type": "integer",
"index": "analyzed"
},
"assignedToDepartmentName": {
"type": "string",
"analyzer": "keyword1"
},
"assignedToUserId": {
"type": "integer",
"index": "analyzed"
},
"assignedToUserName": {
"type": "string",
"analyzer": "keyword1"
},
"authorJsonMetadata": {
"properties": {
"favourites": {
"type": "string"
},
"followers": {
"type": "string"
},
"following": {
"type": "string"
},
"likes": {
"type": "string"
},
"listed": {
"type": "string"
},
"subscribers": {
"type": "string"
},
"subscription": {
"type": "string"
},
"uploads": {
"type": "string"
},
"views": {
"type": "string"
}
}
},
"authorKloutDetails": {
"dynamic": "true",
"properties": {
"amplificationScore": {
"type": "string"
},
"authorKloutDetailsFound": {
"type": "string"
},
"description": {
"type": "string"
},
"influencees": {
"dynamic": "true",
"properties": {
"kscore": {
"type": "string"
},
"twitter_screen_name": {
"type": "string"
}
}
},
"influencers": {
"dynamic": "true",
"properties": {
"kscore": {
"type": "string"
},
"twitter_screen_name": {
"type": "string"
}
}
},
"kloutClass": {
"type": "string"
},
"kloutClassDescription": {
"type": "string"
},
"kloutScore": {
"type": "string"
},
"kloutScoreDescription": {
"type": "string"
},
"kloutTopic": {
"type": "string"
},
"slope": {
"type": "string"
},
"trueReach": {
"type": "string"
},
"twitterId": {
"type": "string"
},
"twitterScreenName": {
"type": "string"
}
}
},
"author_media": {
"type": "string",
"analyzer": "keyword1"
},
"brandTerms": {
"type": "string",
"analyzer": "keyword1"
},
"calculatedSentimentId": {
"type": "integer",
"index": "analyzed"
},
"calculatedSentimentName": {
"type": "string",
"analyzer": "keyword1"
},
"categories": {
"properties": {
"category": {
"type": "string",
"analyzer": "keyword1"
},
"categoryWords": {
"type": "string",
"analyzer": "keyword1"
},
"score": {
"type": "double"
}
}
},
"commentCount": {
"type": "integer",
"index": "analyzed"
},
"contentAuthorId": {
"type": "integer",
"index": "analyzed"
},
"contentAuthorName": {
"type": "string",
"analyzer": "keyword1"
},
"contentId": {
"type": "integer",
"index": "analyzed"
},
"contentJsonMetadata": {
"properties": {
"comment Count": {
"type": "string"
},
"dislikes": {
"type": "string"
},
"favourites": {
"type": "string"
},
"likes": {
"type": "string"
},
"retweet Count": {
"type": "string"
},
"views": {
"type": "string"
}
}
},
"contentPublishedTime": {
"type": "date",
"index": "analyzed",
"format": "dateOptionalTime"
},
"contentTextFull": {
"type": "string",
"analyzer": "standard1"
},
"contentTextFullHighlighted": {
"type": "string",
"analyzer": "standard1"
},
"contentTextSnippetHighlighted": {
"type": "string",
"analyzer": "standard1"
},
"contentType": {
"type": "string",
"analyzer": "keyword1"
},
"contentUrlId": {
"type": "integer",
"index": "analyzed"
},
"contentUrlPath": {
"type": "string",
"analyzer": "keyword1"
},
"contentUrlPublishedTime": {
"type": "date",
"index": "analyzed",
"format": "dateOptionalTime"
},
"ctmId": {
"type": "long"
},
"domainName": {
"type": "string",
"analyzer": "keyword1"
},
"domainUrl": {
"type": "string",
"analyzer": "keyword1"
},
"domain_media": {
"type": "string",
"analyzer": "keyword1"
},
"findings": {
"type": "string",
"analyzer": "keyword1"
},
"geographyId": {
"type": "integer",
"index": "analyzed"
},
"geographyName": {
"type": "string",
"analyzer": "keyword1"
},
"kloutScore": {
"type": "object"
},
"languageId": {
"type": "integer",
"index": "analyzed"
},
"languageName": {
"type": "string",
"analyzer": "keyword1"
},
"listListeningObjectiveName": {
"type": "string",
"analyzer": "keyword1"
},
"mediaSourceIconPath": {
"type": "string",
"analyzer": "keyword1"
},
"mediaSourceId": {
"type": "integer",
"index": "analyzed"
},
"mediaSourceName": {
"type": "string",
"analyzer": "keyword1"
},
"mediaSourceTypeId": {
"type": "integer",
"index": "analyzed"
},
"mediaSourceTypeName": {
"type": "string",
"analyzer": "keyword1"
},
"notesCount": {
"type": "integer",
"index": "analyzed"
},
"nouns": {
"type": "string",
"analyzer": "stop2"
},
"opinionWords": {
"type": "string",
"analyzer": "keyword1"
},
"phrases": {
"type": "string",
"analyzer": "keyword1"
},
"profileId": {
"type": "integer",
"index": "analyzed"
},
"profileName": {
"type": "string",
"analyzer": "keyword1"
},
"topicId": {
"type": "integer",
"index": "analyzed"
},
"topicName": {
"type": "string",
"analyzer": "keyword1"
},
"userSentimentId": {
"type": "integer",
"index": "analyzed"
},
"userSentimentName": {
"type": "string",
"analyzer": "keyword1"
},
"verbs": {
"type": "string",
"analyzer": "stop2"
}
}
}
A sample of the structure of the data is as follows:
{
"contentType": "comment",
"topicId": 9,
"mediaSourceId": 3,
"contentId": 34834,
"ctmId": 73322,
"contentTextFull": "The low numbers nationally published by
Corelogic were a result of banks holding off foreclosures until
settlement. \nAs Bloomberg and RealtyTrac stated. this will result in
more foreclosure pain in the short term as some of the foreclosures
that should have happened last year instead happen this year which
will likely result in higher foreclosure numbers in 2012 than
2011.\nThe estimates from Realtytrac and Zillow are hovering around 1
million completed foreclosures, or REOs, in 2012, a 25 percent
increase from 2011. \nThe positive is that the data suggests that
short sales net the banks more money so they should be expected to
increase\nThe bottom line is that in the longer term the bank
settlement will help to more quickly clear the so-called shadow
inventory, which will in turn help the housing market finally bottom
out once and for all. \nMy buddy who bought in Santa Luz in 2006 is
asked every month by his bank when he makes his payment on his $1.2mm
underwater home, do you plan on staying in the house? . Per
Corelogic, there are still large numbers still underwater in SD\n-
3800 underwater in 92127\n- 2700 underwater in 92130\nThe good news is
we only have one last market to get hit, and expect the high end.
The $1mm to $2mm has to get hit next.\nhttp://www.mercurynews.com/
business/ci_19899224\nUnfortunately, we can not avoid the headwinds.",
"contentTextFullHighlighted": null,
"contentTextSnippetHighlighted": "The low numbers nationally
published by Corelogic were a result of banks holding off foreclosures
until settlement. \nAs Bloomberg and RealtyTrac stated. this will
result in more foreclosure pain in the short term as some of the
foreclosures that should have happened last year instead happen...",
"contentJsonMetadata": null,
"commentCount": 117,
"contentUrlId": 13535,
"contentUrlPath": "http://www.bubbleinfo.com/2012/02/09/mortgage-
settlement-renegade/",
"domainUrl": "http://www.bubbleinfo.com",
"domainName": null,
"contentAuthorId": 15614,
"contentAuthorName": "Hankster",
"authorJsonMetadata": null,
"authorKloutDetails": null,
"mediaSourceName": "Board Reader Blog",
"mediaSourceIconPath": "BoardReaderBlog.gif",
"mediaSourceTypeId": 1,
"mediaSourceTypeName": "Blog",
"geographyId": 0,
"geographyName": "Unknown",
"languageId": 1,
"languageName": "English",
"topicName": "Bank of America",
"profileId": 3,
"profileName": "USAA_Competition1",
"contentPublishedTime": 1328798840000,
"contentUrlPublishedTime": 1329336423000,
"calculatedSentimentId": 4,
"calculatedSentimentName": "POS",
"userSentimentId": 0,
"userSentimentName": null,
"listListeningObjectiveName": [
"Untagged LO"
],
"alertStatus": "assigned",
"assignedToUserId": 2,
"assignedToUserName": null,
"assignedByUserId": 1,
"assignedByUserName": null,
"assignedToDepartmentId": 0,
"assignedToDepartmentName": null,
"notesCount": 0,
"nouns": [
"bank",
"banks",
"Bloomberg",
"buddy",
"Corelogic",
"data",
"estimates",
"foreclosure",
"foreclosures",
"headwinds",
"home",
"house",
"housing",
"increase",
"inventory",
"line",
"Luz",
"market",
"mm",
"money",
"month",
"net",
"news",
"numbers",
"pain",
"payment",
"percent",
"Realtytrac",
"RealtyTrac",
"REOs",
"result",
"sales",
"Santa",
"SD",
"settlement",
"shadow",
"term",
"turn",
"year",
"Zillow"
],
"verbs": [
"asked",
"avoid",
"bought",
"completed",
"expect",
"expected",
"get",
"happen",
"happened",
"help",
"hit",
"holding",
"hovering",
"increase",
"makes",
"plan",
"published",
"result",
"stated",
"staying",
"suggests"
],
"adjectives": [
"bottom",
"clear",
"finally",
"good",
"high",
"higher",
"instead",
"large",
"last",
"likely",
"longer",
"low",
"nationally",
"next",
"not",
"positive",
"quickly",
"short",
"so-called",
"underwater",
"Unfortunately"
],
"phrases": [
"2012 than 2011",
"25 percent",
"25 percent increase",
"2700 underwater in 92130",
"3800 underwater in 92127",
"92130 The good news",
"asked every month",
"avoid the headwinds",
"bank settlement",
"banks holding off foreclosures",
"banks more money",
"Bloomberg and RealtyTrac",
"bottom line",
"bought in Santa",
"bought in Santa Luz",
"clear the so-called shadow",
"completed foreclosures",
"estimates from Realtytrac",
"foreclosure numbers",
"foreclosure numbers in 2012",
"foreclosure pain",
"foreclosures until settlement",
"good news",
"happen this year",
"happen this year --",
"happened last year",
"help the housing",
"help the housing market",
"higher foreclosure",
"higher foreclosure numbers",
"holding off foreclosures",
"housing market",
"increase from 2011",
"increase The bottom line",
"instead happen this year",
"large numbers",
"last market",
"last year",
"longer term",
"longer term the bank",
"low numbers",
"Luz in 2006",
"makes his payment",
"million completed foreclosures",
"mm underwater home",
"month by his bank",
"nationally published by Corelogic",
"net the banks",
"not avoid the headwinds",
"numbers in 2012",
"percent increase",
"percent increase from 2011",
"published by Corelogic",
"Realtytrac and Zillow",
"result in higher foreclosure",
"result in more foreclosure",
"result of banks",
"sales net",
"sales net the banks",
"Santa Luz",
"Santa Luz in 2006",
"shadow inventory",
"short sales",
"short sales net",
"short term",
"so-called shadow",
"so-called shadow inventory",
"staying in the house",
"suggests that short sales",
"term the bank",
"term the bank settlement",
"turn help the housing",
"underwater home",
"underwater in 92127",
"underwater in 92130",
"underwater in SD",
"year --"
],
"author_media": "15614~Hankster~1~Blog",
"domain_media": "http://www.bubbleinfo.com~null~1~Blog",
"categories": [
{
"category": "post closing",
"categoryWords": [
"foreclosure",
"foreclosure"
],
"score": "2.0"
},
{
"category": "pre buy research",
"categoryWords": [
"term",
"term"
],
"score": "2.0"
}
],
"opinionWords": [
"positive",
"good news",
"expect",
"unfortunately"
],
"brandTerms": [],
"findings": []
}