Faceted search using RDF-triple like related documents


(peos) #1

Hi,

We are using Elasticsearch to navigate our product catalog. We have
fairly simple documents like:

    {
        "_index": "catalog",
        "_type": "product",
        "_id": "476",
        "_score": 1,
        "_source": {
           "id": 476,
           "description": "Product description",
           "a8": "100 mm",
           "a12": "250 g",
           "categories": [
              8,
              4213
           ]
        }
     }

where every product has the following attributes:

  • id, unique identifier;
  • description, a short description;
  • a*, custom defined attributes;
  • categories, the categories the product is linked to.

We've added queries (including autocomplete), filters and facets so far,
and it all works really well.

Lately we've added a new feature where users can add RDF-triple-like
relations between products using custom predicates, e.g.:

  1. is an alternative for ;
  2. is a dispenser for ;
  3. etc.

My question is about the second example where products are dispensers for
other products.

We want the user to be able to find disposables using both the disposable
product attributes as well as the linked dispenser product attributes.
Example:

For every printer there are different toners available (e.g. different
capacities, different brands, etc.) and several printers can use the same
toner. When trying to find a toner we want the user to be able to select
both attributes of the toners as well as attributes of the printers linked
to the toners. So when the user selects the brand "Brother" for the toner
brand facet, only "Brother" toners are shown. But when the user selects
"Brother" as a filter for the printer brand facet, all toners that are
suited for the printer are shown, regardless of the toner brand.

So how would this translate into a document design in ES? As both the
dispenser and disposable products are documents within ES, we could store
only references on each document, keyed by the custom predicate, like:

    {
        "_index": "catalog",
        "_type": "product",
        "_id": "476",
        "_score": 1,
        "_source": {
           "id": 476,
           "description": "Product description",
           "a8": "100 mm",
           "a12": "250 g",
           "categories": [
              8,
              4213
           ],
           "<predicate_p>": [
              <product_id_x>,
              <product_id_y>
           ]
        }
     }

However, we also want facet counts that make sense for both dispenser and
disposable attributes, meaning the counts for both kinds of attributes are
based on the resulting disposables. With the design above this would
probably not work: we would first need to filter on the dispenser and then
on the disposable, which yields different counts for the dispenser and the
disposable attributes.

Another option would be to store the whole related document(s) under the
predicate defined for every document. This would greatly expand the index
and duplicate a lot of data, which would make maintaining the documents
much more complex.

So what would be a best-practice solution for this scenario? Or could it be
that a document store is the wrong type of storage for this kind of
question, and we should be looking at a graph database instead?

Any idea on this would be very welcome. Thank you in advance!

Cheers,

Peter

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6d379b8e-4452-4ced-a025-8dd80e22fc10%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

I am using JSON-LD, which boils down to something like this:

    {
        ...
        "_source": {
            "@context": { "rel": "..." },
            "@id": 476,
            "@type": " .... ",
            "description": "Product description",
            "a8": "100 mm",
            "a12": "250 g",
            "categories": [ 8, 4213 ],
            "rel:hasDispenser": [
                "prod_id_x",
                "prod_id_y"
            ],
            "rel:hasDisposable": [
                "prod_id_z"
            ]
        }
    }

and it is possible to lay aggregations over rel:hasDispenser and
rel:hasDisposable.
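
As a minimal sketch of what such an aggregation request could look like, the Python snippet below builds a search body with one `terms` aggregation per relation field. The helper name `relation_aggs` and the aggregation names are made up for illustration, and it assumes the relation fields are indexed as exact (not analyzed) values.

```python
import json


def relation_aggs(fields, size=10):
    """Build a search body with one terms aggregation per relation field."""
    return {
        "size": 0,  # no hits requested, only the aggregation buckets
        "aggs": {
            # Hypothetical aggregation name: the field name with ':' replaced.
            field.replace(":", "_"): {"terms": {"field": field, "size": size}}
            for field in fields
        },
    }


body = relation_aggs(["rel:hasDispenser", "rel:hasDisposable"])
print(json.dumps(body, indent=2))
```

The buckets of such an aggregation would be keyed by the referenced product IDs, not by attributes of the referenced products.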

Does this help?

Jörg



(peos) #3

Hi Jörg,

Thank you for your answer. There is a lot of new stuff in there, though,
which will take me some studying to understand :)

JSON-LD seems like an excellent addition to JSON, which could actually mean
some competition for graph databases?!

I've tried to set up the following simple example with 2 dispensers and 2
disposables to support the JSON-LD data, based on the example you gave me
and the resources I've found on the web so far (no namespace for rel,
though?):

    {
        "_index": "catalog5",
        "_type": "product",
        "_id": "2",
        "_score": 1,
        "_source": {
            "@context": {
                "rel": ""
            },
            "@id": 2,
            "@type": "product",
            "description": "Brother TN-3230",
            "categories": [
                2
            ],
            "rel:hasDispenser": [
                1,
                4
            ]
        }
    },
    {
        "_index": "catalog5",
        "_type": "product",
        "_id": "1",
        "_score": 1,
        "_source": {
            "@context": {
                "rel": ""
            },
            "@id": 1,
            "@type": "product",
            "description": "Brother HL-5340",
            "categories": [
                1
            ],
            "rel:hasDisposable": [
                2,
                3
            ]
        }
    },
    {
        "_index": "catalog5",
        "_type": "product",
        "_id": "4",
        "_score": 1,
        "_source": {
            "@context": {
                "rel": ""
            },
            "@id": 4,
            "@type": "product",
            "description": "Brother HL-5350",
            "categories": [
                2
            ],
            "rel:hasDisposable": [
                2,
                3
            ]
        }
    },
    {
        "_index": "catalog5",
        "_type": "product",
        "_id": "3",
        "_score": 1,
        "_source": {
            "@context": {
                "rel": ""
            },
            "@id": 3,
            "@type": "product",
            "description": "Brother TN-3280",
            "categories": [
                2
            ],
            "rel:hasDispenser": [
                1,
                4
            ]
        }
    }

So I'm now trying to wrap my head around the second part of your sentence,
"... it is possible to lay aggregations over rel:hasDispenser and
rel:hasDisposable". For my use case I want to be able to use the attributes
of the hasDispenser relations for faceting. But that would mean actually
creating a relation between the two documents. As far as I understand,
aggregations are an improved concept of faceted search, but I don't see how
they can handle the relations I'm looking for...

I'm curious what kind of data you are using this concept for and if you are
willing to share an example with me?

Cheers,

Peter

2014-02-11 11:44 GMT+01:00 joergprante@gmail.com:

> I am using JSON-LD, which boils down to something like this [...]


(Jörg Prante) #4

I'm not sure I fully understand your use case, but I'm trying hard.

I assume you want a single query that can filter on attributes of both the
entity "1" and the related entities "2" and "3".

As you have noticed, this is not possible in a single query unless you have
"bubbled up" the relevant attributes into one big document that holds all
attributes of "1", "2", and "3" you want to filter on.

If you are willing to execute multiple queries in a sequence, the following
might work: either "top down" (first search the entity "1", filter for
attributes, and expand from there all related entities with a second filter
query) or "bottom up" (first search the entities "2" and "3" for their
attributes, and then go up to entities like "1" and filter them for their
attributes).
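
A rough, in-memory illustration of the "top down" sequence, using the printer/toner example from the first post. The documents, the `type` and `brand` fields, and the function names are all hypothetical; a plain Python list stands in for the index, and each call to `search` stands in for one query.

```python
# Hypothetical sample data; "type" and "brand" are assumed attributes.
DOCS = [
    {"@id": 1, "type": "printer", "brand": "Brother", "rel:hasDisposable": [2, 3]},
    {"@id": 4, "type": "printer", "brand": "Acme", "rel:hasDisposable": [2, 5]},
    {"@id": 2, "type": "toner", "brand": "Brother", "rel:hasDispenser": [1, 4]},
    {"@id": 3, "type": "toner", "brand": "Generic", "rel:hasDispenser": [1]},
    {"@id": 5, "type": "toner", "brand": "Acme", "rel:hasDispenser": [4]},
]


def search(docs, **attrs):
    """Stand-in for one query: docs whose fields equal all given values."""
    return [d for d in docs if all(d.get(k) == v for k, v in attrs.items())]


def toners_for_printer_filter(docs, printer_attrs, toner_attrs):
    # Query 1: dispensers (printers) matching the dispenser-side filters.
    printer_ids = {d["@id"] for d in search(docs, type="printer", **printer_attrs)}
    # Query 2: disposables (toners) matching their own filters and linked
    # to at least one of the printers found in query 1.
    return [
        d for d in search(docs, type="toner", **toner_attrs)
        if printer_ids & set(d.get("rel:hasDispenser", []))
    ]


hits = toners_for_printer_filter(DOCS, {"brand": "Brother"}, {})
print([d["@id"] for d in hits])  # prints [2, 3]
```

Against Elasticsearch, the second step would presumably be a terms filter on `rel:hasDispenser` with the IDs collected in the first step.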

It depends on how many IDs you want to handle, because a single multi-term
query is limited to 1024 clauses (by default). So if you have many
thousands of hits, you have to iterate over several search results and
concatenate the IDs into a result set.
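
One way to stay under the clause limit is to split the ID list into batches and merge the results. The sketch below assumes a caller-supplied `run_query` callable (hypothetical) that takes one batch of IDs and returns the IDs of the matching documents.

```python
def chunked(ids, size=1024):
    """Yield successive lists of at most `size` IDs."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def collect(ids, run_query, size=1024):
    """Run one query per batch and concatenate the hit IDs.

    Duplicates are dropped while the order of first appearance is kept.
    """
    seen, result = set(), []
    for batch in chunked(ids, size):
        for hit_id in run_query(batch):
            if hit_id not in seen:
                seen.add(hit_id)
                result.append(hit_id)
    return result


batch_sizes = [len(batch) for batch in chunked(range(2500))]
print(batch_sizes)  # prints [1024, 1024, 452]
```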

If you are not able to invest in the query side, you have to expand the
attributes on the index side. This can be expensive if you want a "live"
index with many updates in real time. But if you can afford to recreate the
index only occasionally (nightly, for example), the index expansion
strategy is worth it. For expansion, you have two options: write the
expansion logic into your indexing client, or let an ES plugin expand the
documents automatically (with some caution, because the order in which
documents are indexed is important).
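
A minimal sketch of client-side expansion, assuming all documents are available in a dictionary keyed by ID before indexing. The function name `expand`, the `brand` attribute, and the `dispenser_` field prefix are illustrative choices, not something prescribed by Elasticsearch or JSON-LD.

```python
def expand(doc, by_id, predicate="rel:hasDispenser",
           attrs=("brand",), prefix="dispenser_"):
    """Return a copy of `doc` with selected attributes of related docs bubbled up."""
    expanded = dict(doc)  # shallow copy, the input document is left untouched
    for attr in attrs:
        # Collect the distinct values of `attr` across all related documents.
        values = sorted({
            by_id[rid][attr]
            for rid in doc.get(predicate, [])
            if rid in by_id and attr in by_id[rid]
        })
        if values:
            expanded[prefix + attr] = values
    return expanded


# Hypothetical sample data.
printers = {
    1: {"@id": 1, "brand": "Brother"},
    4: {"@id": 4, "brand": "Acme"},
}
toner = {"@id": 2, "brand": "Generic", "rel:hasDispenser": [1, 4]}
print(expand(toner, printers)["dispenser_brand"])  # prints ['Acme', 'Brother']
```

With the expanded field stored on each disposable, a facet or aggregation on `dispenser_brand` would count disposables, which is the count semantics asked for in the first post.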

Just a hint for expansion in the index: you could also maintain two
instances of your documents, one version with IDs and one with the IDs
expanded (this is also called "inferring triples" and "materialization").
You could then choose whichever index is best for a given task. Program
logic could also work dynamically on the documents that arrive for update,
find inconsistencies in the materialized ones, and fix them instantly (this
is called "dynamic materialization").
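
To make the idea of finding inconsistencies concrete, here is a small sketch that, given an updated document, lists the materialized documents whose copied values no longer include the new value. It assumes copied attributes are stored under a `dispenser_` prefix (a made-up convention), and it only detects missing new values; removing outdated values would still require re-expanding the stale documents.

```python
def stale_documents(updated, materialized, predicate="rel:hasDispenser",
                    attrs=("brand",), prefix="dispenser_"):
    """Return IDs of materialized docs whose copied values disagree with `updated`."""
    stale = []
    for doc in materialized:
        if updated["@id"] not in doc.get(predicate, []):
            continue  # not related to the updated document
        for attr in attrs:
            copied = doc.get(prefix + attr, [])
            if attr in updated and updated[attr] not in copied:
                stale.append(doc["@id"])
                break  # one mismatch is enough to mark the doc as stale
    return stale


# Hypothetical sample data: printer 1 was just renamed to brand "Brother".
updated_printer = {"@id": 1, "brand": "Brother"}
materialized_toners = [
    {"@id": 2, "rel:hasDispenser": [1], "dispenser_brand": ["Bro"]},
    {"@id": 3, "rel:hasDispenser": [1], "dispenser_brand": ["Brother"]},
    {"@id": 5, "rel:hasDispenser": [4], "dispenser_brand": ["Acme"]},
]
print(stale_documents(updated_printer, materialized_toners))  # prints [2]
```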

Jörg



(peos) #5

Hi Jörg,

The assumptions you've made on my use case are correct.

The nightly update could definitely work, but I think even live updates
could work as the data is quite static in nature.

A few more questions:

  • You mention recreating the index; by this you mean updating it, I
    presume?

  • To what paradigm do the terms you're using ("inferring triples",
    "materialization", "dynamic materialization") belong?

Lots of reading for me to do!

Thank you for your comments and hints.

Best regards,

Peter

In our case, recreating the index nightly is no big problem, and there
aren't many mutations anyway.

2014-02-11 16:05 GMT+01:00 joergprante@gmail.com:

> I'm not sure, and I try hard to understand your use case. [...]


(Jörg Prante) #6

It's the semantic web.

For inference, see http://www.w3.org/standards/semanticweb/inference

Materialization is the pre-computation and storage of inferred triples; see
http://www.w3.org/wiki/LargeTripleStores

In fact, I use JSON-LD, which is convenient for both storing triples and
loading them for search into Elasticsearch, because I don't want to mess
with triple stores.

Jörg


