Find duplicate candidate (similar) articles


(Yusef Sheypoor) #1

We have a website where users can post content, which is published after a moderation check. Title and Description are the most important fields of that content. We want to prevent users from publishing similar posts, so we are looking for a way to find similar posts and hint to the moderators that a new post is very similar to some older ones, so that they check it carefully for duplication. In other words, we want to flag such posts to the moderators as suspected duplicates. We index all content in Elasticsearch, and my question is about the optimal query to write.
This is part of the code we have tried so far:

    $nameDesc = $item->Title . ' ' . $item->Description;

    $query = [
        '_source' => ['name', 'description', 'price'],
        'query' => [
            'filtered' => [
                'query' => [
                    'multi_match' => [
                        'fields' => ['title', 'description'],
                        'type' => 'cross_fields',
                        'query' => $nameDesc
                    ]
                ],
                'filter' => [
                    'not' => [
                        'ids' => ['values' => [$item->ID]]
                    ]
                ],
            ],
        ]
    ];
    $dupeCandidates = $this->indexService->buildSearch('articles', $query)->setLimit(4)->get();

I suppose that instead of concatenating Title and Description and running a cross_fields multi_match, it would be better to try two separate match queries, or perhaps there is a better solution.
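
To make the "two separate match queries" idea concrete, here is a minimal sketch of what it might look like. It is untested against our index. The helper name `buildDuplicateQuery`, the `minimum_should_match` percentages and the `boost` value are assumptions for illustration only and would need tuning. It also uses a `bool` query with `must_not` in place of `filtered` and `not`, since those were deprecated in Elasticsearch 2.0 and removed in 5.0.

```php
<?php
// Hypothetical helper (not part of our code base): builds a duplicate-candidate
// query that matches the title and the description separately.
function buildDuplicateQuery(string $title, string $description, $excludeId): array
{
    return [
        '_source' => ['title', 'description'],
        'query' => [
            'bool' => [
                // Each field is matched against its own text; the scores of
                // the clauses that match are added together.
                'should' => [
                    ['match' => ['title' => [
                        'query' => $title,
                        'minimum_should_match' => '75%', // assumed threshold
                        'boost' => 2,                    // assumed weight
                    ]]],
                    ['match' => ['description' => [
                        'query' => $description,
                        'minimum_should_match' => '60%', // assumed threshold
                    ]]],
                ],
                // At least one of the two clauses has to match.
                'minimum_should_match' => 1,
                // Leave out the article that is being checked.
                'must_not' => [
                    ['ids' => ['values' => [$excludeId]]],
                ],
            ],
        ],
    ];
}

// Example call with made-up values; prints the query body as JSON.
$query = buildDuplicateQuery('Red mountain bike', 'Used bike in good condition', 42);
echo json_encode($query, JSON_PRETTY_PRINT), PHP_EOL;
```

The resulting array could be passed to `buildSearch('articles', $query)` in the same way as the query above. Keeping the fields separate means a near-identical title cannot be drowned out by a long description, which is one reason this might behave better than the concatenated `cross_fields` version.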

In short, we are looking for the optimal Elasticsearch query to detect highly similar content by Title and Description.

