Retrieving document groups with MoreLikeThis query


(Alain Désilets) #1

Say I have a set of documents D={d1,..., dn} indexed in ES.

I also have a set of document groups G={g1,...,gk}, where the groups may
not be disjoint.

Given a group g in G, called the Query Group, I want to find "similar"
groups g' in G, called the Similar Groups, where "similar" means that the
documents in g tend to use the same words as documents in each of the g'.

What would be the easiest way to do this with ES?

Note that I already know how to find individual documents d which are
similar to a group g. You just do a MoreLikeThis search, feeding it a
multi-get query with the IDs of the various documents in g. But this will
retrieve individual documents, when in fact what I want is document groups.
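
For reference, a minimal sketch of the kind of request body described above, built as a plain Python dict. The index name "docs" and the field name "content" are hypothetical, and the `like` syntax shown is the one used by more recent ES releases; the exact parameter names depend on the ES version, so treat this as an illustration rather than the request actually used here.

```python
def mlt_query_for_group(index, doc_ids, fields=("content",), size=10):
    """Build a search body asking for documents similar to the given ones."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                # Fields whose terms are compared (hypothetical field name).
                "fields": list(fields),
                # Each entry points at a document that is already indexed.
                "like": [{"_index": index, "_id": doc_id} for doc_id in doc_ids],
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        },
    }


# IDs of the documents in the Query Group (made-up values).
body = mlt_query_for_group("docs", ["d1", "d2", "d3"])
```

The resulting dict would be sent as the body of a search request; the hits are still individual documents, which is exactly the limitation described above.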

I can think of several ways of doing this (see below). Not sure what the
pros and cons of each approach are, in terms of speed and "accuracy". Does
anyone have advice on this?

Thx (PS Sorry for the long post, but it's a pretty complex issue)

Alain

===
Approach 1: Groups as pseudo-documents that concatenate content of their
members

Create a type 'documents-group'. These are "pseudo-documents" whose content
will be the concatenation of the content of all documents in a given group.

You then do a MoreLikeThis (MLT) query, feeding it the pseudo-document for
the Query Group, and asking specifically for documents of type
documents-group.
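
A minimal sketch of how the application might assemble those pseudo-documents, assuming the document texts and group memberships are available in memory. The function name, the "group_id" and "content" field names, and the sample data are all hypothetical.

```python
def build_group_pseudo_docs(documents, memberships):
    """Concatenate member texts into one pseudo-document per group.

    documents:   {doc_id: text}
    memberships: {group_id: [doc_id, ...]}  (groups may overlap)
    """
    pseudo_docs = {}
    for group_id, doc_ids in memberships.items():
        pseudo_docs[group_id] = {
            "group_id": group_id,
            # A document's text is copied into every group it belongs to.
            "content": "\n".join(documents[d] for d in doc_ids),
        }
    return pseudo_docs


documents = {"d1": "alpha beta", "d2": "beta gamma", "d3": "delta"}
memberships = {"g1": ["d1", "d2"], "g2": ["d2", "d3"]}
pseudo = build_group_pseudo_docs(documents, memberships)
```

Here the text of d2 ends up in both g1 and g2, which illustrates the duplication cost listed under CONS below.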

PROS:

  • Can be supported for sure.

CONS:

  • Need to write code in the application to create and manage those
    pseudo-documents.
  • The content of each document will be duplicated N times, where N is the
    number of groups that it belongs to.
    -- This means the index will be much larger (in my application, a document
    can belong to hundreds of groups)
  • Requires a lot of reindexing
    -- Every time you change a document, you need to re-index all the groups it
    belongs to.
    -- If you add/remove a document to a group, you need to re-index that
    group, which means reindex the equivalent of all the documents it contains
    (in my application, a group may contain 100 documents or more)

===
Approach 2: Groups as pseudo-documents containing only the most important
terms

This is similar to approach 1, except that the pseudo-document contains a
list of the most significant terms of documents in that group (obtained
with the Significant Terms aggregation), as opposed to containing a
concatenation of all the member documents.

You then do a MoreLikeThis (MLT) query, feeding it the documents-group
document for the Query Group, and asking specifically for documents of type
documents-group.
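
A rough sketch of the two steps, assuming each document carries a "groups" field listing its group IDs and a "content" field with its text (both names are hypothetical). The first function builds a Significant Terms request restricted to one group; the second turns the returned buckets into a pseudo-document. Depending on the ES version and mapping, running this aggregation on an analyzed text field may need extra configuration, so this is only an outline.

```python
def significant_terms_request(group_id, group_field="groups",
                              text_field="content", size=20):
    """Request the most significant terms among the documents of one group."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"term": {group_field: group_id}},
        "aggs": {
            "group_terms": {
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


def pseudo_doc_from_buckets(group_id, buckets):
    """Build the group's pseudo-document from the aggregation buckets."""
    return {
        "group_id": group_id,
        "content": " ".join(bucket["key"] for bucket in buckets),
    }


request = significant_terms_request("g1")
# Made-up buckets, shaped like the "buckets" list of an aggregation response.
sample_buckets = [{"key": "lucene", "doc_count": 7}, {"key": "shard", "doc_count": 4}]
group_doc = pseudo_doc_from_buckets("g1", sample_buckets)
```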

PROS:

  • Can be supported for sure.
  • The content of individual documents is not completely duplicated. At
    worst, a handful of terms from each document may be duplicated.

CONS:

  • Need to write code in the application to create and manage those
    document-group pseudo-documents.
  • Requires a lot of reindexing
    -- Every time you change a document, you need to re-generate the list of
    significant terms for all the groups it belongs to.
    -- If you add/remove a document to a group, you re-generate the list of
    significant terms for that group.
  • May be less accurate, as similarity is based on only the most significant
    terms, instead of all available terms. In other words, it could be that
    terms which seem a-priori unimportant, turn out to be very important when
    trying to determine similarity to the Query Group.

===
Approach 3: Multi-get MLT, with automatic aggregation of average score on
the groups

Each document includes a field that specifies the groups it belongs to.

You then do a MLT search, feeding it a Multi-Get query of the IDs of all
the documents in the Query Group. The query also asks ES to do a metric
aggregate using the Group membership field as the basis for the
aggregation, and the average MLT score as the metric to be computed.

The group that has the highest MLT average score is considered to be the
most relevant group.
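
A sketch of what such a request body might look like, assuming a "groups" field holding the group IDs and a "content" field holding the text (both hypothetical). It buckets the MLT matches by group with a terms aggregation and averages the score through a script that reads `_score`. Whether this works, and with which script syntax, depends on the ES version, so it is untested and meant only to make the idea concrete.

```python
def mlt_with_group_scores(index, doc_ids, group_field="groups",
                          text_field="content", max_groups=50):
    """MLT query plus a per-group average of the MLT score."""
    return {
        "size": 0,  # only the per-group buckets are of interest
        "query": {
            "more_like_this": {
                "fields": [text_field],
                "like": [{"_index": index, "_id": d} for d in doc_ids],
            }
        },
        "aggs": {
            "by_group": {
                "terms": {
                    "field": group_field,
                    "size": max_groups,
                    # Rank the buckets by the sub-aggregation defined below.
                    "order": {"avg_score": "desc"},
                },
                "aggs": {
                    # Assumption: the query-time score is readable from a script.
                    "avg_score": {"avg": {"script": "_score"}}
                },
            }
        },
    }


body_with_aggs = mlt_with_group_scores("docs", ["d1", "d2"])
```

Note that the average here would be taken over the matching documents of each group only, not over every document in the group.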

PROS:

  • No need to write application code to create and manage pseudo-documents
    for the groups.
  • Does not require too much reindexing
    -- If you modify a document, you only need to reindex that document.
    -- If you add/remove a document in a group, you only need to reindex that
    one document (because the list of groups that the document belongs to has
    changed).

CONS:

  • Not clear that it can be done. Possible problems are:
    -- The groups are overlapping. In other words, a given document may appear
    in more than one part of the aggregation. Not sure if ES aggregation deals
    properly with that.
    -- Not sure that you can use the MLT score as the metric to be averaged.
    The reason being that the score is a dynamic property of each document,
    whose value is determined specifically by MLT at query time. Metric
    aggregation is typically used with static attributes (ex: a document's
    author).

===
Approach 4: Multi-get MLT, with "manual" aggregation of average score on
the groups

This is very similar to Approach 3, except that the aggregation of MLT
scores is done by the application instead of ES.

You do a MLT search, feeding it a Multi-Get query of the IDs of all the
documents in the Query Group, and asking ES to give you up to, say, 10K
individual similar documents.

For each group g in G, you then compute a relevance score by looking at the
10K hits. For example, you may take from those 10K hits, the ones that
belong to a given group, add their scores, and divide that sum by the total
number of documents in the group (because larger groups are a-priori more
likely to contain a hit in the top 10K most similar documents).
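
The computation described above can be sketched as follows, assuming the hits have already been reduced to (score, list-of-groups) pairs and that the size of each group is known. The function name and the sample numbers are made up for illustration.

```python
from collections import defaultdict


def score_groups(hits, group_sizes):
    """Rank groups by (sum of member hit scores) / (group size).

    hits:        iterable of (score, [group_id, ...]) pairs
    group_sizes: {group_id: total number of documents in that group}
    """
    totals = defaultdict(float)
    for score, groups in hits:
        # A hit contributes its score to every group it belongs to.
        for group_id in groups:
            totals[group_id] += score
    ranked = [(g, total / group_sizes[g]) for g, total in totals.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


hits = [(2.0, ["g1", "g2"]), (1.0, ["g1"]), (4.0, ["g3"])]
group_sizes = {"g1": 2, "g2": 4, "g3": 10}
ranking = score_groups(hits, group_sizes)
```

With these numbers g1 comes first (3.0 / 2), then g2 (2.0 / 4), then g3 (4.0 / 10): dividing by the group size keeps the large group g3 from winning on a single high-scoring hit.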

PROS:

  • Definitely doable
  • Does not require too much reindexing
    -- If you modify a document, you only need to reindex that document.
    -- If you add/remove a document in a group, you only need to reindex that
    one document (because the list-of-groups attribute for that document will
    have changed)

CONS:

  • Need to write application code to do the aggregation.
  • Not clear that the aggregation will always work well
    -- In particular, not clear how to normalize for the fact that larger
    groups will tend to include more hits in the 10K most similar documents.

===
Approach 5: Create the groups as parents of their documents

Create a new type 'documents-group'.

Create a parent-child relationship between a group and each of the
documents it contains.

When searching, do a MLT on the parent document that represents the Query
Group, and ask for it to return documents that are of type
'documents-group'.

PROS:

  • No need to write application code to create pseudo-documents for the
    groups.
  • Fast (Probably):
    -- If you modify a document, you only need to reindex that document
    -- If you add/remove a document in a group, you probably don't end up
    reindexing anything

CONS:

  • Not clear that it can be done. Possible problems are:
    -- Not clear that an ES document can have more than one parent (and our
    groups are overlapping)
    -- When you do a MLT on a parent, does it actually consider the content of
    every child it contains?
    -- Similarly, when evaluating the similarity of various parent documents,
    does it consider the content of every one of its children?
    -- Not really sure what actually happens in terms of reindexing if you
    modify the list of children, or the content of one of the children. It
    could be that under the hood, ES keeps some kind of duplicate index in the
    parent, and that this duplicate needs to be updated when you change one of
    the children.



(system) #2