Find trending article

I have an index with lots of news articles, each with a datetime. How can Elasticsearch help me find the trending articles for a given period, for example today's trending articles?

I searched a lot and found the "significant text aggregation", but I couldn't find any real examples.

These are my significant_text results for some recent news articles:

[Image: Kibana screenshot]

All pretty topical. Here they are clustered using the adjacency_matrix agg:
[Image: Kibana screenshot]

Some tips:

  1. Query the most recent docs using a range query and use the significant_text aggregation with the filter_duplicate_text setting turned on.
  2. Use a single index and shard if possible (it's hard to do this sort of "what's new?" analysis if you use time-based indices and today's content is on a machine separated from the previous days we'd want to compare against).
  3. Index using 2 word "shingles" to spot the sort of things shown in my example: prince andrew, chagos islands and journalist murder.
  4. Use the adjacency matrix aggregation to see how the discovered concepts are related.
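
Tips 1 and 4 might translate into request bodies along these lines. This is only a sketch in Python that builds the JSON; it does not talk to a cluster. The field names `date` and `headline`, the `shard_size` of 1000 and the helper names are assumptions for illustration, and the phrases fed to the adjacency matrix would come from the significant_text buckets.

```python
import json


def trending_keywords_body(text_field="headline", date_field="date", since="now-1d/d"):
    """Search body: recent docs only, significant_text on one text field."""
    return {
        "size": 0,
        # Foreground set: only the most recent documents.
        "query": {"range": {date_field: {"gte": since}}},
        "aggs": {
            "trending": {
                # Sample the top-matching docs per shard (size is an assumption).
                "sampler": {"shard_size": 1000},
                "aggs": {
                    "keywords": {
                        "significant_text": {
                            "field": text_field,
                            "filter_duplicate_text": True,
                        }
                    }
                },
            }
        },
    }


def related_concepts_body(phrases, text_field="headline", date_field="date", since="now-1d/d"):
    """Search body: one named phrase filter per concept, fed to adjacency_matrix."""
    filters = {p: {"match_phrase": {text_field: p}} for p in phrases}
    return {
        "size": 0,
        "query": {"range": {date_field: {"gte": since}}},
        "aggs": {"related": {"adjacency_matrix": {"filters": filters}}},
    }


if __name__ == "__main__":
    print(json.dumps(trending_keywords_body(), indent=2))
    print(json.dumps(related_concepts_body(["prince andrew", "chagos islands"]), indent=2))
```

The adjacency matrix buckets then show how many recent docs match each phrase and each pair of phrases, which is one way to see which concepts belong to the same story.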

Thank you for the great explanation. As you said in point 3 (index using 2 word "shingles"), I ran the query against a shingle field, but it returns empty buckets. The response looks like this:

{
  "took" : 18,
  "timed_out" : false,
  "_shards" : {
    "total" : 64,
    "successful" : 64,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 847,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "trending" : {
      "doc_count" : 847,
      "keywords" : {
        "doc_count" : 847,
        "bg_count" : 301038,
        "buckets" : [ ]
      }
    }
  }
} 

If I change the aggregated field from the shingle field to a field with a normal text analyzer, it returns the results below.

{
  "took" : 40,
  "timed_out" : false,
  "_shards" : {
    "total" : 64,
    "successful" : 64,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 847,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "trending" : {
      "doc_count" : 847,
      "keywords" : {
        "doc_count" : 847,
        "bg_count" : 301038,
        "buckets" : [
          {
            "key" : "watling",
            "doc_count" : 9,
            "score" : 2.1136922940749283,
            "bg_count" : 16
          },
          {
            "key" : "kshiti",
            "doc_count" : 4,
            "score" : 1.6737509565673134,
            "bg_count" : 4
          },
          {
            "key" : "sumatran",
            "doc_count" : 4,
            "score" : 1.3380562552184319,
            "bg_count" : 5
          },
          {
            "key" : "kakade",
            "doc_count" : 4,
            "score" : 1.3380562552184319,
            "bg_count" : 5
          },
          {
            "key" : "bj",
            "doc_count" : 5,
            "score" : 1.3054042394227003,
            "bg_count" : 8
          }
        ]
      }
    }
  }
}

Should work fine. I think I’d need to see the relevant JSON for your mapping, your query and an example doc.

Here is the query:

{
    "query": {
        //for only today's news articles
        "range" : {
            "date" : {
                "gt" : "now-1d/d",
                "lte" :  "now/d"
            }
        }
    },
    "size": 0,
    "aggregations" : {
        "trending" : {
            "sampler" : {
                "shard_size" : 100
            },
            "aggregations": {
                "keywords" : {
                    "significant_text" : { "field" : "headline_shingle", "filter_duplicate_text": true }
                }
            }
        }
    }
}

And here is the mapping for headline_shingle:

'headline_shingle' => [
  'type'     => 'text',
  'analyzer' => 'shingle_two_words',
]

'shingle_two_words' => [
  'type'        => 'custom',
  'char_filter' => ['html_strip', 'quotes'],
  'tokenizer'   => 'icu_tokenizer',
  'filter'      => ['lowercase', 'icu_normalizer', 'icu_folding', 'shingle_word'],
],

'shingle_word' => [
  'type'             => 'shingle',
  'min_shingle_size' => 2,
  'max_shingle_size' => 2,
  'output_unigrams'  => TRUE,
],

Ok - maybe it’s failing to find anything statistically significant in the sample of 100 headlines that differs materially from other days.
I suggest increasing the sample size to a few thousand and/or reducing the minimum number of word uses in the results - set ‘shard_min_doc_count’ to 2 or perhaps 1 (default is 3)
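
Applied to the query posted above, that suggestion might look like the following. This is a rough sketch in Python that only edits the request body; the helper name `widen_sample` and the concrete values (3000 and 2) are assumptions picked to match "a few thousand" and "2 or perhaps 1".

```python
import copy

# The request body from the thread, unchanged.
original_body = {
    "query": {"range": {"date": {"gt": "now-1d/d", "lte": "now/d"}}},
    "size": 0,
    "aggregations": {
        "trending": {
            "sampler": {"shard_size": 100},
            "aggregations": {
                "keywords": {
                    "significant_text": {
                        "field": "headline_shingle",
                        "filter_duplicate_text": True,
                    }
                }
            },
        }
    },
}


def widen_sample(body, shard_size=3000, shard_min_doc_count=2):
    """Return a copy of the body with a larger sample and a lower per-shard threshold."""
    tuned = copy.deepcopy(body)  # leave the caller's body untouched
    trending = tuned["aggregations"]["trending"]
    trending["sampler"]["shard_size"] = shard_size
    sig = trending["aggregations"]["keywords"]["significant_text"]
    sig["shard_min_doc_count"] = shard_min_doc_count
    return tuned


tuned_body = widen_sample(original_body)
```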

Thank you so much, Mark Harwood, for your help. The shingle field still returns empty buckets, but I am happy aggregating on the normal text field instead. It gives me a list of buckets with trending article keywords, and that will work for what I have to do. I have been using Elasticsearch for the last 5 years and it is a great database.

I think it must be because your document is missing the ‘headline_shingle’ field in the JSON?

I have added the "headline_shingle" field to the docs and reindexed all of them for testing. I also have a "headline" field with the same text, analyzed with an ICU analyzer, and when I use that field it returns buckets of trending keywords. I use the copy_to parameter on the "headline" field to populate the "headline_shingle" field.

OK, I've figured out your shingle problem. Significant text relies on parsing the JSON of matching docs. Because you used copy_to to populate the headline_shingle field, it can't take that field name and find the original text in the _source JSON. There's no traceability of where headline_shingle content may have come from, because copy_to is designed to allow multiple fields to store their content in one indexed field (to provide the sort of matching experience we used to offer with the _all field).

Normally, an indexing variation of a single field is done using a sub-field, e.g.:

  "properties": {
    "headline": {
      "type": "text",
      "fields":{
        "shingled":{
          "type": "text",
          "analyzer": "shingle_two_words"           
        }
      }
    }
  }

Indexing this way means that significant text can work on the headline.shingled field.
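
Put together, an index definition and a matching aggregation request could be sketched as below (Python that only builds the JSON bodies). Several things here are assumptions: the analyzer is simplified to the `standard` tokenizer plus `lowercase` so the sketch does not depend on the ICU plugin or the custom `quotes` char filter from the thread, the mapping uses the typeless form of Elasticsearch 7+ (older versions need a type name level under `mappings`), and the helper names are made up.

```python
def index_body():
    """Index settings and mappings with a shingled sub-field of headline."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Same shingle filter settings as posted in the thread.
                    "shingle_word": {
                        "type": "shingle",
                        "min_shingle_size": 2,
                        "max_shingle_size": 2,
                        "output_unigrams": True,
                    }
                },
                "analyzer": {
                    # Simplified analyzer (assumption): no ICU, no char filters.
                    "shingle_two_words": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "shingle_word"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "date": {"type": "date"},
                "headline": {
                    "type": "text",
                    # Sub-field: same source text, different analysis.
                    "fields": {
                        "shingled": {"type": "text", "analyzer": "shingle_two_words"}
                    },
                },
            }
        },
    }


def shingled_keywords_body(since="now-1d/d"):
    """significant_text on the sub-field; the text is read from headline in _source."""
    return {
        "size": 0,
        "query": {"range": {"date": {"gte": since}}},
        "aggs": {
            "keywords": {
                "significant_text": {
                    "field": "headline.shingled",
                    "filter_duplicate_text": True,
                }
            }
        },
    }
```

Documents are still indexed with just a `headline` value; the sub-field needs no key of its own in the JSON.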

Thank you, Mark Harwood, for your fantastic help. As you said, I have to use a "multi-field" index mapping, so I remapped all the required fields as multi-fields, including "headline_shingle". That not only solved the shingle issue we discussed here but also improved search quality and search time. It works well and returns all the bucket keys for the "headline_shingle" field. So the issue was the "copy_to" parameter.

I know the concept of "multi-field" and had actually used a "multi-field" mapping in my index before, but I read somewhere that using "copy_to" instead of "multi-field" improves indexing time. So I changed my mapping to "copy_to", not knowing that this change would cause this kind of issue on the search side. As we discussed here, I changed my index mapping back to "multi-field", and it works fine, as you said.

Thank you for your help. Now I can find today's trending articles.

Good to hear! Glad we got it working in the end
