Opinion sought: uniqueness of document when its meta is available from different sources at different times


(Venkateshprasanna) #1

We often come across scenarios where a document to be indexed derives its
metadata from different sources at different times.

Take the example of an online document that is editable by several people
and accessed by a much larger group. The list of people with edit
permissions could be predefined metadata on the document, while the list of
people who accessed the document over a period of time only becomes
available through the logs, as and when access happens. If we bring these
two fields into two different types of the same index (which seems a
reasonable thing to do), keyed by the same document id, what is an elegant
way to treat both entries as the "same document" at search time?

That is, if someone searches for documents and we have to show the list of
contributors and accessors for each one, we derive this information from
two types and get it back as two "results", which we then have to combine
while normalizing the overall ranking across all hits. Is parent-child the
only way, or are there other options people have tried?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

You can identify all contributors and accessors, visitors etc. by ID and
after obtaining a single doc as search result, use multiget on the IDs in
the doc to retrieve additional info about them, even from other indexes.
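
A minimal sketch of that second step, assuming the search hit stores person IDs in two hypothetical fields (`contributor_ids`, `accessor_ids`) and that the person records live in a hypothetical index called `persons`. It only builds the request body for `_mget`; sending it is left to whatever client the app uses, and older ES versions also accept a `_type` per entry.

```python
def build_mget_body(hit_source, id_fields, index="persons"):
    """Collect the IDs stored in one search hit and build an _mget body.

    The field names and the index name are illustrative assumptions,
    not something taken from the thread.
    """
    seen = []
    for field in id_fields:
        for entity_id in hit_source.get(field, []):
            if entity_id not in seen:  # keep each ID once, in first-seen order
                seen.append(entity_id)
    return {"docs": [{"_index": index, "_id": eid} for eid in seen]}


hit = {
    "url": "http://en.wikipedia.org/wiki/Information_retrieval",
    "contributor_ids": ["u1", "u2"],
    "accessor_ids": ["u2", "u3"],
}
body = build_mget_body(hit, ["contributor_ids", "accessor_ids"])
```

A person who is both contributor and accessor ("u2" here) is requested only once.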

Jörg


(Venkateshprasanna) #3

Yes, I can use multiget to fetch all the metadata once I have all the
unique IDs. But what if the initial step you mention, identifying the
contributors and accessors, itself returns multiple results for the same
"logical document"? By that I mean the same id appearing across the two
types, although I do understand that for elasticsearch they are two
different documents anyway. This situation arises from the need to index at
different times, as stated in the original post.

So, if I search for documents where a specific person is either the
contributor or the accessor, and the same document comes back twice because
that person is both, I need to merge these results right there, before
moving on to fetch further metadata. Creating a parent type and mapping
both these types to it on the document id field is one way. But are there
any others?



(Jörg Prante) #4

I do not fully understand your challenge and the role you want ES to play.
If you operate with IDs, you can iterate through the multiget response and
visit contributors, accessors etc. and select a unique list of members in
your app simply by looking at the doc ID.
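
As a rough illustration of that app-side step, the sketch below walks a multiget response and keeps one entry per doc ID. The response shape (a top-level `docs` list whose entries carry `_id` and, when the document exists, `_source`) follows the `_mget` API; the sample data is made up.

```python
def unique_members(mget_response):
    """Return {doc ID: source}, keeping the first occurrence of each ID."""
    members = {}
    for doc in mget_response.get("docs", []):
        if "_source" not in doc:  # entry for an ID that was not found
            continue
        members.setdefault(doc["_id"], doc["_source"])
    return members


# Made-up response: "u2" was requested twice, "u9" does not exist.
response = {
    "docs": [
        {"_index": "persons", "_id": "u1", "_source": {"name": "Cutting"}},
        {"_index": "persons", "_id": "u2", "_source": {"name": "Raghavan"}},
        {"_index": "persons", "_id": "u2", "_source": {"name": "Raghavan"}},
        {"_index": "persons", "_id": "u9"},
    ]
}
members = unique_members(response)
```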

It is not very elegant to assign more than one ID to the same entity,
because then an ID is no longer unique. You would then have to address the
problem of entity identification or entity matching to obtain unique IDs,
which is outside the scope of ES; it belongs in the domain of the app.

Parent/child is for special queries on parent/child relationships
(has_parent/has_child) and for updating parent and child docs
independently, where child docs link to a unique parent ID. So I'm not
sure how parent/child would solve your challenge.

For more inspiration about relationships, I recommend this overview by
Zachary Tong:
http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/

Jörg



(Venkateshprasanna) #5

Thanks Jörg, I do understand your point. As for the clarity of my question,
let me explain with an example.

Let us say we have an index called "document" and two types "contrib" and
"access".

Here are the inserts into these types:

XPUT document/contrib/1

{
  "url": "http://en.wikipedia.org/wiki/Semantic_web",
  "contributors": [
    { "user": "Cutting" },
    { "user": "Lee" }
  ]
}

XPUT document/contrib/2

{
  "url": "http://en.wikipedia.org/wiki/Information_retrieval",
  "contributors": [
    { "user": "Cutting" },
    { "user": "Raghavan" }
  ]
}

XPUT document/access/1

{
  "url": "http://en.wikipedia.org/wiki/Semantic_web",
  "accessors": [
    { "user": "Mahesh" },
    { "user": "Suresh" }
  ]
}

XPUT document/access/2

{
  "url": "http://en.wikipedia.org/wiki/Information_retrieval",
  "accessors": [
    { "user": "Banon" },
    { "user": "Raghavan" }
  ]
}

These two types are stored separately because:

1. Contributions are rare, so updates to the index would be infrequent.
2. Accesses are very regular, and updates need to be real time.

Now, if we want to get all documents where "Cutting" is an actor (either
contributor or accessor), then the results would be:

  "hits": [
     {
        "_index": "document",
        "_type": "contrib",
        "_id": "2",
        "_score": 0.375,
        "_source": {
           "url": "http://en.wikipedia.org/wiki/Information_retrieval",
           "contributors": [
              {
                 "user": "Cutting"
              },
              {
                 "user": "Raghavan"
              }
           ]
        }
     },
     {
        "_index": "document",
        "_type": "contrib",
        "_id": "1",
        "_score": 0.375,
        "_source": {
           "url": "http://en.wikipedia.org/wiki/Semantic_web",
           "contributors": [
              {
                 "user": "Cutting"
              },
              {
                 "user": "Lee"
              }
           ]
        }
     }
  ]
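
The search request behind hits like these is not shown in the post. One way to express "either contributor or accessor" is a `bool` query with two `should` clauses, sketched below as a plain request body. That the original search looked like this is an assumption; the field paths come from the inserts above, and the body would be sent to a search across both types.

```python
def actor_query(user):
    """Build a search body matching docs where `user` is contributor or accessor."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"contributors.user": user}},
                    {"match": {"accessors.user": user}},
                ],
                # at least one of the two roles has to match
                "minimum_should_match": 1,
            }
        }
    }


query = actor_query("Cutting")
```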

If I do this for "Raghavan", it would be:

  "hits": [
     {
        "_index": "document",
        "_type": "contrib",
        "_id": "2",
        "_score": 0.22295055,
        "_source": {
           "url": "http://en.wikipedia.org/wiki/Information_retrieval",
           "contributors": [
              {
                 "user": "Cutting"
              },
              {
                 "user": "Raghavan"
              }
           ]
        }
     },
     {
        "_index": "document",
        "_type": "access",
        "_id": "2",
        "_score": 0.22295055,
        "_source": {
           "url": "http://en.wikipedia.org/wiki/Information_retrieval",
           "accessors": [
              {
                 "user": "Banon"
              },
              {
                 "user": "Raghavan"
              }
           ]
        }
     }
  ]

Ultimately, I would like to show the results in terms of the documents,
i.e., the URLs. As you can see, in the case of Cutting the two hits are
indeed two different documents he acted on. Raghavan, however, plays two
roles with respect to the same document. I would like to group these
results into one and make sure the ranking reflects the fact that he has
played both roles. So, if Raghavan had only contributed to or only
accessed some other document, that document would have to rank lower than
this one, where he appears twice.

Does this make the scenario a little clearer? I hope I have also clarified
why it is difficult to roll these two types into one, given the huge
difference in their update frequency.
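
To make the desired grouping concrete, here is a small app-side sketch that folds hits sharing a URL into one result and adds up their scores, so a document matched in both roles outranks one matched in a single role. Summing is only one possible normalization and is an assumption here; scores from different types are not strictly comparable.

```python
def merge_hits_by_url(hits):
    """Group search hits by _source.url, summing scores and collecting roles."""
    merged = {}
    for hit in hits:
        url = hit["_source"]["url"]
        entry = merged.setdefault(url, {"url": url, "score": 0.0, "roles": []})
        entry["score"] += hit["_score"]
        entry["roles"].append(hit["_type"])  # "contrib" or "access"
    # highest combined score first
    return sorted(merged.values(), key=lambda e: e["score"], reverse=True)


# The two "Raghavan" hits from above, plus one made-up single-role hit.
hits = [
    {"_type": "contrib", "_id": "2", "_score": 0.22295055,
     "_source": {"url": "http://en.wikipedia.org/wiki/Information_retrieval"}},
    {"_type": "access", "_id": "2", "_score": 0.22295055,
     "_source": {"url": "http://en.wikipedia.org/wiki/Information_retrieval"}},
    {"_type": "access", "_id": "1", "_score": 0.22295055,
     "_source": {"url": "http://en.wikipedia.org/wiki/Semantic_web"}},
]
results = merge_hits_by_url(hits)
```

Note that merging after the fact only covers the hits already fetched, so paging needs care: the two halves of a logical document may land on different pages.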

Regards,
VP.



(Venkateshprasanna) #6

Does this make the scenario a little clearer? I hope I have also clarified
why it is difficult to roll these two types into one, given the huge
difference in their update frequency. Also, I am not saying parent/child
would be the best solution in this case; it was the best I could think
of, although I agree it is not ideal.

Regards,
VP.



(Jörg Prante) #7

From the example you give, I see you do not use IDs for your entities in
the application. This makes it very hard, because ES cannot solve
domain-specific knowledge organization for us.

What I'd suggest is indexing your entities not by names (which may not be
unique) but by IDs: session IDs, primary keys, whatever. The point is that
ES is not aware of this ID->entity function at all; it is domain
independent.

Imagine you have software and persons with relationships, e.g. authors or
contributors. You may have different entities, but the method is the same.

The first challenge in your app is to assign unique IDs to all of your
entities. This is easy; in web apps, for example, visitors could be
assigned a session ID.

Then, write the attributes of the entities into documents and index them in
ES.

Finally, create the main search index, the one you want users to search on.
This index is separate from the entities you indexed. In this main search
index, enrich each doc with the IDs of the entities related to it.

Iterating over a result set, you can retrieve more information about your
entities by using _mget, using just one call per relation. In a web UI,
this can be attached to user interaction (mouse click or move), so you
limit the number of _mgets.
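
The layout described in the steps above might be sketched as follows. This is not the gist mentioned below; the index names (`persons`, `main`) and the field names are hypothetical, and the function only plans what to index as (index, id, source) tuples without talking to ES.

```python
def plan_index_actions(people, documents):
    """Plan entity docs and ID-enriched main docs as (index, id, source) tuples.

    `people` items carry an app-assigned unique "id"; `documents` refer to
    people by those IDs only, never by name.
    """
    actions = []
    for person in people:
        # entity index: attributes of each entity, keyed by its unique ID
        actions.append(("persons", person["id"], {"name": person["name"]}))
    for doc in documents:
        # main search index: the doc plus the IDs of its related entities
        actions.append(("main", doc["id"], {
            "url": doc["url"],
            "contributor_ids": sorted(set(doc.get("contributors", []))),
            "accessor_ids": sorted(set(doc.get("accessors", []))),
        }))
    return actions


people = [{"id": "u1", "name": "Cutting"}, {"id": "u2", "name": "Raghavan"}]
documents = [{
    "id": "2",
    "url": "http://en.wikipedia.org/wiki/Information_retrieval",
    "contributors": ["u1", "u2"],
    "accessors": ["u2"],
}]
actions = plan_index_actions(people, documents)
```

With this layout a person who is both contributor and accessor still yields a single hit in the main index, and the names are resolved afterwards by ID.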

As a demonstration, I have prepared a gist

Jörg


