Find trending articles

I have an index with lots of news articles, each with its own datetime. How can Elasticsearch help me find the trending articles for a given period, e.g. today's trending articles?

I searched a lot and found the "significant text" aggregation, but I couldn't find any real examples.

These are my significant_text results for some recent news articles:

[Kibana screenshot]

All pretty topical. Here they are clustered using the adjacency_matrix agg:
[Kibana screenshot]

Some tips:

  1. Query the most recent docs using a range query and use the significant_text aggregation with the filter_duplicate_text setting turned on.
  2. Use a single index and shard if possible (it's hard to do this sort of "what's new?" analysis if you use time-based indices and today's content is on a machine separated from the previous days we'd want to compare against).
  3. Index using 2 word "shingles" to spot the sort of things shown in my example: prince andrew, chagos islands and journalist murder.
  4. Use the adjacency matrix aggregation to see how the discovered concepts are related.
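
Tips 1 and 3 combined might look something like the request below - a minimal sketch, assuming a `date` date field and a `headline` text field with a 2-word-shingle sub-field called `headline.shingled` (both names are placeholders to adapt to your own mapping):

```json
{
  "size": 0,
  "query": {
    "range": { "date": { "gte": "now/d" } }
  },
  "aggregations": {
    "trending": {
      "significant_text": {
        "field": "headline.shingled",
        "filter_duplicate_text": true
      }
    }
  }
}
```

The foreground set is today's articles and the background set is the whole index, which is what lets the scoring surface terms that are unusually frequent today.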

Thank you for the great explanation. As you said in point 3 (index using 2-word "shingles"), I ran the query against a shingle field, but it returns empty buckets. The response looks like this:

{
  "took" : 18,
  "timed_out" : false,
  "_shards" : {
    "total" : 64,
    "successful" : 64,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 847,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "trending" : {
      "doc_count" : 847,
      "keywords" : {
        "doc_count" : 847,
        "bg_count" : 301038,
        "buckets" : [ ]
      }
    }
  }
} 

If I change the query field from the shingle field to the normal text-analyzer field, it returns results like this:

{
  "took" : 40,
  "timed_out" : false,
  "_shards" : {
    "total" : 64,
    "successful" : 64,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 847,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "trending" : {
      "doc_count" : 847,
      "keywords" : {
        "doc_count" : 847,
        "bg_count" : 301038,
        "buckets" : [
          {
            "key" : "watling",
            "doc_count" : 9,
            "score" : 2.1136922940749283,
            "bg_count" : 16
          },
          {
            "key" : "kshiti",
            "doc_count" : 4,
            "score" : 1.6737509565673134,
            "bg_count" : 4
          },
          {
            "key" : "sumatran",
            "doc_count" : 4,
            "score" : 1.3380562552184319,
            "bg_count" : 5
          },
          {
            "key" : "kakade",
            "doc_count" : 4,
            "score" : 1.3380562552184319,
            "bg_count" : 5
          },
          {
            "key" : "bj",
            "doc_count" : 5,
            "score" : 1.3054042394227003,
            "bg_count" : 8
          }
        ]
      }
    }
  }
}

Should work fine. I think I'd need to see the relevant JSON for your mapping, your query and an example doc.

Here is the query (the range clause restricts it to the most recent day's articles):

{
    "query": {
        "range" : {
            "date" : {
                "gt" : "now-1d/d",
                "lte" :  "now/d"
            }
        }
    },
    "size": 0,
    "aggregations" : {
        "trending" : {
            "sampler" : {
                "shard_size" : 100
            },
            "aggregations": {
                "keywords" : {
                    "significant_text" : { "field" : "headline_shingle", "filter_duplicate_text": true }
                }
            }
        }
    }
}

And here is the mapping for headline_shingle:

'headline_shingle' => [
  'type' => 'text',
  'analyzer' => 'shingle_two_words',
]

'shingle_two_words' => [
  'type'			=> 'custom',
  'char_filter'	=> ['html_strip', 'quotes'],
  'tokenizer'		=> 'icu_tokenizer',
  'filter'		=> ['lowercase', 'icu_normalizer', 'icu_folding', 'shingle_word'],
],

'shingle_word' => [
  'type'				=> 'shingle',
  'min_shingle_size'	=> 2,
  'max_shingle_size'	=> 2,
  'output_unigrams'	=> TRUE,
],

Ok - maybe it's failing to find anything statistically significant in the sample of 100 headlines that differs materially from other days.
I suggest increasing the sample size to a few thousand and/or reducing the minimum number of word uses in the results - set "shard_min_doc_count" to 2 or perhaps 1 (the default is 3).
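
Applied to the query in question, those two suggestions would look like this (the exact numbers are starting points to experiment with, not recommendations):

```json
"aggregations": {
  "trending": {
    "sampler": { "shard_size": 2000 },
    "aggregations": {
      "keywords": {
        "significant_text": {
          "field": "headline_shingle",
          "filter_duplicate_text": true,
          "shard_min_doc_count": 1
        }
      }
    }
  }
}
```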

Thank you so much, Mark Harwood, for your help. The shingle field still returns empty buckets, but I am happy aggregating the normal text field instead of the shingle field; it gives me buckets of trending keywords, and that will work for what I need to do. I have been using Elasticsearch for the last 5 years and it is a great database.

I think it must be because your document is missing the 'headline_shingle' field in the JSON?

I have added the "headline_shingle" field to the docs and reindexed them all for testing. I also have the same "headline" field with a text type and ICU analyzer; when I query that field, it returns buckets of trending keywords. I used the copy_to parameter on the "headline" field to populate the "headline_shingle" field.

OK, I've figured out your shingle problem. significant_text relies on parsing the JSON of matching docs. Because you used copy_to to populate the headline_shingle field, it can't take that field name and find the original text in the _source JSON. There is no traceability of where headline_shingle content may have come from, because copy_to is designed to let multiple fields store their content in one indexed field (to provide the sort of matching experience we used to offer with the _all field).
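
For contrast, the copy_to arrangement described above looks roughly like this in JSON mapping terms (reconstructed from the PHP mapping earlier in the thread, so treat it as a sketch) - note that headline_shingle never appears in _source, which is why significant_text finds nothing to read:

```json
"properties": {
  "headline": {
    "type": "text",
    "copy_to": "headline_shingle"
  },
  "headline_shingle": {
    "type": "text",
    "analyzer": "shingle_two_words"
  }
}
```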

Normally, an indexing variation of a single field is done using a sub-field - e.g.

  "properties": {
    "headline": {
      "type": "text",
      "fields":{
        "shingled":{
          "type": "text",
          "analyzer": "shingle_two_words"           
        }
      }
    }
  }

Indexing this way means that significant text can work on the headline.shingled field.
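
With that mapping, the aggregation from earlier in the thread only needs its field name changed, since headline.shingled resolves back to the stored headline text in _source - e.g.:

```json
"keywords": {
  "significant_text": {
    "field": "headline.shingled",
    "filter_duplicate_text": true
  }
}
```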


Thank you, Mark Harwood, for your fantastic help. As you said, I had to use "multi-field" mappings, so I remapped all the required fields accordingly, including "headline_shingle". This not only solved the shingle issue we discussed here, it also improved search quality and speed: the aggregation now returns all the bucket keys for the "headline_shingle" field. So the issue was with the "copy_to" parameter.

I know the concept of "multi-field", and I actually used "multi-field" index mappings previously, but I read somewhere that using "copy_to" instead of "multi-field" improves indexing time, so I changed my mapping to "copy_to". I did not know that this change would also cause this kind of issue on the search side. So, as we discussed here, I changed my index mapping back to "multi-field", and it works fine, as you said.

Thank you for your help; now I can find today's trending articles.


Good to hear! Glad we got it working in the end
