Finding relevant documents

NikhilJoshi2 · August 15, 2017, 9:32am

Hello, I was going through the "more_like_this" query but am not able to find relevant documents. I have the data below in Elasticsearch, and the "description" field holds huge non-indexed text of more than 1 million bytes. I have ten thousand documents like the one below. How can I find the sets of documents that match each other by at least 80%?

{
	"_index": "school",
	"_type": "book",
	"_id": "1",
	"_source": {
	  "title": "How to drive safely",
	  "description": "The book is written to help readers about giving driving safety guidelines. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. LONG...."
	}
}

At the end, I am looking for a list of document IDs whose contents match by at least 80%. A possible expected result containing the matching document IDs (any format is fine):
[ [1, 30, 500, 8000], [2, 40, 199], .... ]

Do I need to write a batch job that compares each document with all the others and builds the output set?
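For reference, the brute-force batch I have in mind would look roughly like the Python sketch below. It is only an illustration under my own assumptions: "80% matching" is taken to mean a Jaccard overlap of the word sets of two descriptions, and the names `word_set`, `jaccard` and `group_similar` are made up for this example. It does not talk to Elasticsearch at all; it expects the descriptions to be loaded into a dict already.

```python
from itertools import combinations


def word_set(text):
    """Lowercased set of whitespace-separated words in a description."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two sets have in common, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_similar(docs, threshold=0.8):
    """Group document IDs whose descriptions overlap by at least `threshold`.

    `docs` maps document ID -> description text. Every pair is compared,
    so the cost grows quadratically with the number of documents.
    """
    words = {doc_id: word_set(text) for doc_id, text in docs.items()}
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if jaccard(words[a], words[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), []).append(doc_id)
    # Keep only groups with at least two members.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)
```

Note that the groups are built transitively: if document 1 matches 2 and 2 matches 30, all three end up together even if 1 and 30 are slightly below the threshold against each other. With 10,000 documents this means roughly 50 million pair comparisons, which is why I am asking whether there is a better way.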

Please guide.

NikhilJoshi2 · August 16, 2017, 8:27am

Can someone please help?

++ @dadoonet @warkolm

NikhilJoshi2 · August 18, 2017, 7:57am

Would anyone like to help?

xavierfacq · August 18, 2017, 8:34am

If you run a query with only the title, it won't be relevant. If you run a query with the description, you'll have to truncate it, and you'll get some strange results due to small words (the, is, to, about, etc.). I think you should extract the words from the description, keep only the relevant ones (len > 5, for example), and finally run a query with minimum_should_match set to 80% (see "minimum_should_match parameter" in the Elasticsearch Guide). Using a dictionary could be interesting too.
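A rough Python sketch of that idea, in case it helps. The helper names `relevant_words` and `build_match_query` and the cap of 50 words are assumptions of mine, not anything provided by Elasticsearch; the code only builds the request body for a match query and does not send it.

```python
import re


def relevant_words(text, min_len=6, max_words=50):
    """Distinct lowercase words with at least `min_len` letters, in order of first appearance."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= min_len]
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(words))[:max_words]


def build_match_query(text, field="description", minimum_should_match="80%"):
    """Search body for a match query over the extracted words."""
    return {
        "query": {
            "match": {
                field: {
                    "query": " ".join(relevant_words(text)),
                    "minimum_should_match": minimum_should_match,
                }
            }
        }
    }
```

Here `min_len=6` corresponds to "len > 5". The resulting dict can be sent as the body of a search request against the index.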

NikhilJoshi2 · August 18, 2017, 8:55am

Hello @xavierfacq, thanks a lot for the response. So if I run the query below, will it compare only the "description" field of all available books against the "description" field of document ID 50, and show the documents matching 80% of that field?

GET school/book/_search
{
  "query": {
    "more_like_this": {
      "fields": [
        "description"
      ],
      "like": [
        {
          "_index": "school",
          "_type": "book",
          "_id": "50"
        }
      ],
      "minimum_should_match": "80%"
    }
  }
}

xavierfacq · August 18, 2017, 9:02am

I would say yes but I'm not very familiar with the more_like_this query...

What I suggested was to get the document (_id 50), extract the relevant words, and then run a match query with "minimum_should_match": "80%".

NikhilJoshi2 · August 18, 2017, 9:10am

This is certainly doable, but volume is an issue. I need to do it for all 10,000 documents, keep doing it as new documents are added, and remove references to purged documents. Using "stop_words" lets the query ignore specific words during search, but I am not sure if there is anything better.
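For reference, this is roughly how I would generate one "more_like_this" query per document with "stop_words" set. It is a sketch only: the helper name `mlt_query` and the stop word list are placeholders of mine, and it just builds the request bodies without sending them.

```python
# Placeholder stop word list; a real one would be much longer.
STOP_WORDS = ["the", "is", "to", "about"]


def mlt_query(doc_id, stop_words=STOP_WORDS, index="school", doc_type="book"):
    """Search body that looks for documents similar to `doc_id` on "description"."""
    return {
        "_source": False,  # only the matching IDs are needed
        "query": {
            "more_like_this": {
                "fields": ["description"],
                "like": [{"_index": index, "_type": doc_type, "_id": str(doc_id)}],
                "stop_words": list(stop_words),
                "minimum_should_match": "80%",
            }
        },
    }


# One body per document; newly added documents only need their own query.
bodies = [mlt_query(doc_id) for doc_id in (1, 2, 50)]
```

That still means one search per document, which is the volume problem I mentioned.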

Does Elasticsearch offer something out of the box?

xavierfacq · August 18, 2017, 9:20am

I don't know

system · September 15, 2017, 9:20am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.