Efficient and scalable way of constructing a word cloud


(l00l) #1

Hi

I am trying to build a word cloud (of 1-, 2-, and 3-gram words) using the Shingle token filter.
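To make clear what I mean by 1-, 2-, and 3-gram words, here is a rough Python sketch of the tokens I expect from a shingle filter with a maximum shingle size of 3 and unigram output enabled. The `shingles` helper is a name I made up for illustration; it is not Elasticsearch code and only approximates what the filter emits.

```python
def shingles(tokens, max_size=3):
    """Return every run of 1..max_size adjacent tokens (hypothetical helper)."""
    out = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            # Skip runs that would extend past the end of the token list.
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


if __name__ == "__main__":
    for term in shingles("the quick brown fox".split()):
        print(term)
```

Every position in the text contributes up to three terms, so the number of distinct terms per field grows much faster than with plain single-word tokens.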

I have around 1 million records, and the word cloud is generated by running a facet operation on 7 fields (all of which contain LOTS of text data).

I am using the default resident type for field data caching.

This works perfectly for a small number of records (say, 1,000), and it was still acceptable for 500,000 records: building the field data cache took some time (and 8 GB of heap memory) the first time, but after that the computation was quick. However, when I tested the facet query on 1 million records (which is not that huge either), my entire 15 GB of heap memory (all that I had) was used up because Elasticsearch tried to cache all 7 fields.

Clearly this method is not scalable. Please suggest a better or alternative way of constructing a word (words and phrases) cloud. Also, I don't think caching the fields for a long time is the right thing to do, because I want the word cloud to be very dynamic.

My Facet Query:

curl -X POST 'http://localhost:9200/monitoring/mention_reports/_search?pretty=true' -d '
{
  "size": "0",
  "query": {
    "filtered": {
      "query": {
        "text": {
          "positive_keyword": {
            "query": "quora"
          }
        }
      },
      "filter": {

        . . .

      }
    }
  },
  "facets": {
    "tagcloud": {
      "terms": {
        "fields": ["field1", "field2", "field3", "field4", "field5", "field6", "field7"],
        "size": "300"
      }
    }
  }
}
'
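Conceptually, all I need from this facet is the top 300 terms by count across the listed fields. Here is a small Python sketch of that computation, assuming each document is already a dict of field name to list of analyzed terms. The `top_terms` helper and the sample documents are made up for illustration, and this is not how Elasticsearch implements facets internally.

```python
from collections import Counter


def top_terms(docs, fields, size):
    """Count, per term, how many (document, field) pairs contain it (hypothetical helper)."""
    counts = Counter()
    for doc in docs:
        for field in fields:
            # set() so a term repeated within one field of one document counts once.
            counts.update(set(doc.get(field, [])))
    return counts.most_common(size)


if __name__ == "__main__":
    sample_docs = [
        {"field1": ["quora", "quora answers"], "field2": ["quora"]},
        {"field1": ["quora", "quora questions"]},
    ]
    print(top_terms(sample_docs, ["field1", "field2"], 2))
```

The counting itself is cheap; the expensive part in my case is that every distinct term of every field has to be loaded into memory before it can be counted.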

My Mapping:

curl -XPOST http://localhost:9200/monitoring/ -d '
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    },
    "analysis": {
      "filter": {
        "myCustomShingle": {
          "type": "shingle",
          "max_shingle_size": 3,
          "output_unigrams": true
        },
        "myCustomStop": {
          "type": "stop",
          "stopwords": ["a","about","abov ... ]
        }
      },
      "analyzer": {
        "myAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "myCustomShingle",
            "stop",
            "myCustomStop"
          ]
        }
      }
    }
  },
  "mappings": {
    "mention_reports": {
      "_source": {
        "enabled": true
      },
      "_all": {
        "enabled": false
      },
      "index.query.default_field": "post_message",
      "properties": {
        "id": {
          "type": "string",
          "index": "not_analyzed",
          "include_in_all": "false",
          "null_value": "null"
        },
        "creation_time": {
          "type": "date"
        },
        "field1": {
          "type": "string",
          "analyzer": "standard",
          "include_in_all": "false",
          "null_value": 0
        },
        "field2": {
          "type": "string",
          "index": "not_analyzed",
          "include_in_all": "false",
          "null_value": "null"
        },

        . . .

        "field6": {
          "type": "string",
          "analyzer": "myAnalyzer",
          "term_vector": "with_positions_offsets",
          "null_value": "null"
        }
      }
    }
  }
}
'


(Nazar Hussain) #2

Hello @l00l, I am in the exact same scenario. What approach did you end up with?

