Hi Jelle,
On Tuesday, October 16, 2012 3:10:30 PM UTC+2, wires wrote:
Exactly. Do you know how such segmentation can be efficiently computed?
You know, graph algorithms are hard in general. For practical cases there are
a lot of papers on the net (labelled graphs, DAGs, trees), but little code
that relates to document retrieval with graph data. That's understandable,
though, since it's a hard topic.
I suppose for my case one could keep track of nodes with a high degree and
select those nodes as "centers" and perform a small 'breadth first' walk
around them. The subgraphs generated this way could overlap a bit. If the
graph consists of many small disconnected components then I can "fatten"
the subgraphs by picking multiple starting nodes for the walk.
There are two approaches. In the first, you have a ready-built graph and
can't create it yourself. This requires traversing the graph as you describe
and getting the valuable data out of it. Whether you segment it or not
doesn't matter much, as long as the data can be indexed into ES somehow.
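Just to illustrate what I mean by traversing and segmenting, a rough sketch in
plain Java (with a made-up minimal Graph type, nothing ES-specific) could look
like this:

import java.util.*;

// Hypothetical minimal graph model, just for illustration.
class Graph {
    Map<String, Set<String>> adjacency = new HashMap<String, Set<String>>();

    Set<String> neighbors(String node) {
        Set<String> n = adjacency.get(node);
        return n != null ? n : Collections.<String>emptySet();
    }

    int degree(String node) {
        return neighbors(node).size();
    }
}

class Segmenter {
    // Pick nodes with a degree of at least minDegree as "centers" and
    // collect a bounded breadth-first neighborhood around each of them.
    // The resulting subgraphs may overlap, as you noted.
    static List<Set<String>> segment(Graph g, int minDegree, int radius) {
        List<Set<String>> subgraphs = new ArrayList<Set<String>>();
        for (String center : g.adjacency.keySet()) {
            if (g.degree(center) < minDegree) {
                continue;
            }
            Set<String> visited = new HashSet<String>();
            Map<String, Integer> depth = new HashMap<String, Integer>();
            Deque<String> queue = new ArrayDeque<String>();
            queue.add(center);
            depth.put(center, 0);
            while (!queue.isEmpty()) {
                String node = queue.poll();
                if (!visited.add(node)) {
                    continue;
                }
                int d = depth.get(node);
                if (d >= radius) {
                    continue; // stop expanding at the chosen radius
                }
                for (String next : g.neighbors(node)) {
                    if (!visited.contains(next)) {
                        if (!depth.containsKey(next)) {
                            depth.put(next, d + 1);
                        }
                        queue.add(next);
                    }
                }
            }
            subgraphs.add(visited);
        }
        return subgraphs;
    }
}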
In the second approach, you have only a graph model and can build the data
yourself, starting from the simplest data structures (tuples and sequences)
and working up to the most complex ones. It's not that complicated if you
implement the graph traversal as iterating over an ordered stream of nodes,
going attribute by attribute. In RDF, you end up with a huge sequence of such
traversed elements, called triples (or statements).
The graphs I work with are RDF graphs (directed edge-labelled graphs, also
known as linked data, LD) and contain a lot of predefined identifiers (URIs).
The idea is to pick up these IDs (the subject IDs) and interpret the connected
nodes as features (or attribute lists) of an object (connected by value URIs
or blank nodes). These features can be iterated and expressed in JSON syntax,
and the resulting JSON document can be indexed in ES, carrying all the
hierarchical information of the original RDF.
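As a rough illustration only (not my actual implementation; the grouping by
predicate and the index/type names are made up), turning the features of one
subject into a JSON document and indexing it could look like this:

import org.elasticsearch.client.Client;
import org.elasticsearch.common.xcontent.XContentBuilder;

import java.util.List;
import java.util.Map;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public class SubjectIndexer {

    private final Client client;

    public SubjectIndexer(Client client) {
        this.client = client;
    }

    // Take the features of one subject (object values already grouped by
    // predicate URI) and index them as a single JSON document. The subject
    // URI doubles as the document ID, so re-indexing the same subject
    // simply replaces the old document.
    public void index(String subjectUri, Map<String, List<String>> featuresByPredicate)
            throws Exception {
        XContentBuilder builder = jsonBuilder().startObject();
        builder.field("@id", subjectUri);
        for (Map.Entry<String, List<String>> feature : featuresByPredicate.entrySet()) {
            // ES handles arrays natively, so multi-valued predicates
            // simply become JSON arrays.
            builder.field(feature.getKey(), feature.getValue());
        }
        builder.endObject();

        client.prepareIndex("rdf", "resource", subjectUri)
                .setSource(builder)
                .execute()
                .actionGet();
    }
}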
But this is speculation, if you know about any algorithms etc, I'm very
interested.
As an initial implementation, can't I just create an index that doesn't
index but only stores the document, effectively turning it into a key/value
store without the Lucene overhead?
Exactly. I have done some work on constructing key/value streams, mapping
them to an indexable representation, creating RDF triples, converting them to
JSON (there is now JSON-LD, an evolving W3C standard, which makes the process
easier), and pushing the JSON-LD to an ES cluster.
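The last step, pushing ready-made JSON-LD strings into ES, could be sketched
with a bulk request like this (index/type names are just placeholders):

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

import java.util.Map;

public class JsonLdPusher {

    // Push a batch of ready-made JSON-LD documents, keyed by their @id,
    // into ES with a single bulk request.
    public static void push(Client client, Map<String, String> jsonLdById) {
        BulkRequestBuilder bulk = client.prepareBulk();
        for (Map.Entry<String, String> doc : jsonLdById.entrySet()) {
            bulk.add(client.prepareIndex("linkeddata", "jsonld", doc.getKey())
                    .setSource(doc.getValue()));
        }
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            // In a real pipeline you would retry or log the failed items here.
            throw new RuntimeException(response.buildFailureMessage());
        }
    }
}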
I won't be doing any queries on it anyway, just lots of gets. You can
then run the denormalization process on a dedicated node which has that
index locally.
What is your opinion on this?
Please consider that querying is the strength of ES. A lot of gets is
challenging, because that I/O style does not scale well; the system will soon
bog down under heavy I/O if you mix reads and writes.
If you consider a document as a collection of attributes relating to a
unique ID, all you need to do is build a sequential list of subgraphs
and index them into ES. The unique ID is crucial for the denormalization.
I think of denormalization as part of an input data processing pipeline.
While building a document with the XContentBuilder, it should be convenient
to connect to external sources via unique IDs/URIs (perhaps key/value
streams) and fetch additional fields by evaluating the current ones.
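A sketch of what I mean, with a made-up KeyValueSource interface standing in
for the external store:

import org.elasticsearch.common.xcontent.XContentBuilder;

import java.util.Map;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public class Denormalizer {

    // Made-up abstraction for an external key/value source (could be a
    // triple store, a database, another ES index, ...), keyed by URI.
    public interface KeyValueSource {
        Map<String, String> lookup(String uri);
    }

    private final KeyValueSource source;

    public Denormalizer(KeyValueSource source) {
        this.source = source;
    }

    // While building the JSON document, evaluate fields that hold URIs and
    // pull in additional attributes from the external source, so the indexed
    // document already carries the denormalized data.
    public XContentBuilder build(String id, Map<String, String> fields) throws Exception {
        XContentBuilder builder = jsonBuilder().startObject();
        builder.field("@id", id);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            builder.field(field.getKey(), field.getValue());
            if (field.getValue().startsWith("http://")) {
                Map<String, String> extra = source.lookup(field.getValue());
                if (extra != null && !extra.isEmpty()) {
                    // Nest the resolved attributes under the original field name.
                    builder.startObject(field.getKey() + "_resolved");
                    for (Map.Entry<String, String> e : extra.entrySet()) {
                        builder.field(e.getKey(), e.getValue());
                    }
                    builder.endObject();
                }
            }
        }
        return builder.endObject();
    }
}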
A demonstration of SKOS/RDF-based denormalization targeting a single Lucene
field can be found in Bernhard Haslhofer's SKOS for Lucene. It has some
limitations as it requires the SKOS graph in memory. I ported it to ES, see
Another remark that might be interesting: Documents are identified by
(id, version). To each document I associate its subgraph and label all the
vertices and edges in the subgraph with the id and version of the document.
This way I can just take a subgraph out of the graph and replace it with a
newer one and keep the graph consistent. Several of these subgraphs together
might be natural candidates for the documents in the "graph index".
Cool idea; versioning of data (documents) is also on my list of challenges
to tackle. Right now I use versioning only in the ES style, that is, an
atomic counter to prevent collisions.
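If you carry your own (id, version) pairs, one option would be ES's external
version type instead of the internal counter; roughly like this (index/type
names made up):

import org.elasticsearch.client.Client;
import org.elasticsearch.index.VersionType;

public class VersionedIndexer {

    // Index a subgraph document under its own (id, version), letting ES
    // reject anything older than what is already stored.
    public static void index(Client client, String id, long version, String json) {
        client.prepareIndex("graph", "subgraph", id)
                .setSource(json)
                .setVersion(version)
                .setVersionType(VersionType.EXTERNAL)
                .execute()
                .actionGet();
    }
}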
But a good implementation could hide this overhead by caching subgraphs
as you explained above, right?
Yes, but be aware that ES is not something like neo4j (a graph database),
though the REST GET API makes it tempting to use it like one. Its strength is
the inverted index and fast search over large amounts of text literals.
Anyway, it is a good fit for our project (I'm only interested in
dependency tracking of denormalized attributes ATM); but it might be a good
idea to see how we can store and process graphs on ES. The distributed
nature should make ES quite suited for BSP style graph processing. I'm
interested in thoughts you might have on this.
Hm, BSP is something for Apache Hama. But I agree that distributed
computing makes sense with ES, as long as search and indexing are the main
part of the computation.
So while I like this approach (we could mix it with our ES proxy
application), I don't think you can still use the rivers etc. to get data
into ES and have it automatically denormalized? (As you need to explicitly
pass it through the wrapped client.)
Rivers are no magic when it comes to indexing; in the end you need a client
(such as a node client) to get the data into ES. The "content scripting"
process could run in a river, for example, or just in a normal ingest
client.
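For example, a plain client-only node is enough to feed the cluster, whether
that code runs inside a river or in a standalone ingest program; a minimal
sketch (cluster name made up):

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.node.Node;

import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class IngestClient {

    public static void main(String[] args) {
        // A client-only node joins the cluster but holds no data.
        Node node = nodeBuilder()
                .settings(ImmutableSettings.settingsBuilder().put("cluster.name", "mycluster"))
                .client(true)
                .node();
        Client client = node.client();
        try {
            client.prepareIndex("test", "doc", "1")
                    .setSource("{\"field\":\"value\"}")
                    .execute()
                    .actionGet();
        } finally {
            node.close();
        }
    }
}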
I think I can do this using the IndexingOperationsListener; I've
committed some initial code to the git repository:
https://github.com/0x01/degraphmalizer/blob/030be1c7ae35759f1a6e4bbfbde0ea2684b06b37/src/main/java/org/elasticsearch/plugin/degraphmalizer/DegraphmalizerListener.java
Thanks for sharing this. AFAIK the indexing operation listener's job is to
catch write/update events inside a Lucene index within an ES shard (so its
scope is not the whole index, only a part of it), much like the translog.
Pretty low-level stuff. I'm not sure shard-level operations are the best
place for adding denormalized data to documents; I would prefer the node
client level, where you have access to the field mapping and the cluster
state.
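At the client level, the cluster state (which carries the index metadata and
mappings) is just one API call away, roughly like this (the exact getter
names differ a bit between ES versions):

import org.elasticsearch.action.admin.cluster.state.ClusterStateResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.cluster.ClusterState;

public class ClusterInfo {

    // Fetch the cluster state from a client; the index metadata in there
    // contains the field mappings a denormalizer would need to inspect.
    public static ClusterState state(Client client) {
        ClusterStateResponse response = client.admin().cluster()
                .prepareState()
                .execute()
                .actionGet();
        return response.getState(); // .state() on older ES versions
    }
}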
Best regards
Jörg
--