Support for Dublin Core


(ian mayo) #1

Hi all,
I'd like to find out if/how ElasticSearch supports Dublin Core
metadata, as formalised by the Dublin Core Metadata Initiative (DCMI)

I'm aware that ElasticSearch sits on top of Apache Lucene, but I can't
find any reference to DCMI on the Lucene Documentation.

So, does ElasticSearch have any native support for DCMI?

cheers,
Ian


(Shay Banon) #2

Not really, but you can name the fields you index using the dublin core metadata scheme.
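
A minimal sketch of what that could look like: a document whose field names are simply Dublin Core element names. The `dc:` prefix, the sample values, and the identifier are assumptions made for illustration; nothing here is required by Elasticsearch.

```python
import json

# Hypothetical record: the field names follow the Dublin Core element set.
doc = {
    "dc:title": "An example report",
    "dc:creator": ["A. Author", "B. Author"],  # a repeated element becomes an array
    "dc:date": "2012-02-10",
    "dc:identifier": "urn:example:report-1",
}

# The JSON body that would be handed to the index API.
body = json.dumps(doc, sort_keys=True)
print(body)
```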



(Jörg Prante) #3

I use Dublin Core in Elasticsearch extensively - together with all
kinds of metadata and bibliographic standards such as CQL - and might
be able to give some advice on best practice.

Besides naming the fields with Dublin Core element terms, I tackled the
namespacing challenge and implemented JSON to XML rendering in that
context.

Jörg



(Michael Sick) #4

Jörg - I bet a number of people would find your DC work interesting. +1 for
sure.



(Jörg Prante) #5

Hi,

the question was about native support in ES for Dublin Core. There are
at least two cases to think about.

If you just need to deal with the syntax of DC element names and you
have a bunch of static documents to index, the simplest way is to read
the content with an XML reader (SAX or StAX API), pick up the local
names of the elements, and write them either directly into the ES Java
API for building content (XContentBuilder) or to JSON files for later
bulk indexing.
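
A rough sketch of the "JSON files for later bulk indexing" route, written in Python rather than against the ES Java API. `xml.etree.ElementTree` stands in for a SAX/StAX reader here, and the sample record, the index name `dc-records`, and the helper names are all assumptions. The exact action metadata the bulk API expects depends on the ES version, so treat the action line as illustrative.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical input record using the DC elements namespace.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example report</dc:title>
  <dc:creator>A. Author</dc:creator>
  <dc:creator>B. Author</dc:creator>
  <dc:date>2012-02-10</dc:date>
</record>"""


def local_name(tag):
    """Strip the '{namespace-uri}' part that ElementTree puts in front of a tag."""
    return tag.rsplit("}", 1)[-1]


def record_to_doc(xml_text):
    """Collect the child elements of one record into a flat dict of lists."""
    doc = {}
    for child in ET.fromstring(xml_text):
        text = (child.text or "").strip()
        if text:
            doc.setdefault(local_name(child.tag), []).append(text)
    return doc


def bulk_lines(docs, index_name):
    """Yield action/source line pairs in newline-delimited bulk style."""
    for doc in docs:
        yield json.dumps({"index": {"_index": index_name}})
        yield json.dumps(doc, sort_keys=True)


doc = record_to_doc(SAMPLE)
payload = "\n".join(bulk_lines([doc], "dc-records")) + "\n"
print(payload)
```

Only the local names survive in this variant, which matches the simple case where nothing but plain Dublin Core is present in the data.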

But on the other hand, Dublin Core also has semantics, specified by an
abstract model, see http://dublincore.org/documents/abstract-model/

So if the question was whether there is native support in ES for RDF
semantics, which is the underlying model for Dublin Core, you have to
do it yourself outside ES. It is possible to build JSON syntax
from an RDF model. The way to go here is a simplified
resource/properties abstraction API that is responsible for realizing your
semantics. Preferably, use a light-weight resource/property RDF API to
transform everything you need to manage in a search index into a
nested structure of URIs and literal values. Then roll out a JSON
structure, which is finally picked up by the ES API. This process is
quite similar to generating RDF triples. It should be accompanied by
some hints in the ES mapping (e.g. dc:date should be given a date type
according to W3CDTF, which is recommended best practice).
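
To make the resource/property idea concrete, here is a small sketch. The `Resource` class, its method names, the `uri` field, and the example URIs are hypothetical and are not the API described in this post. It shows the same data rendered once as triples and once as a nested JSON structure, plus a mapping fragment that types `dc:date` as a date.

```python
import json


class Resource:
    """A resource identified by a URI, carrying properties (hypothetical helper)."""

    def __init__(self, uri):
        self.uri = uri
        self.properties = {}  # predicate -> list of literals or Resources

    def add(self, predicate, value):
        self.properties.setdefault(predicate, []).append(value)
        return self

    def triples(self):
        """Yield (subject, predicate, object), descending into nested resources."""
        for predicate, values in self.properties.items():
            for value in values:
                if isinstance(value, Resource):
                    yield (self.uri, predicate, value.uri)
                    yield from value.triples()
                else:
                    yield (self.uri, predicate, value)

    def to_json(self):
        """Render the resource as a nested structure of URIs and literals."""
        out = {"uri": self.uri}
        for predicate, values in self.properties.items():
            rendered = [v.to_json() if isinstance(v, Resource) else v for v in values]
            out[predicate] = rendered[0] if len(rendered) == 1 else rendered
        return out


# Mapping hint: treat dc:date as a date instead of a plain string.
MAPPING = {"properties": {"dc:date": {"type": "date"}}}

creator = Resource("http://example.org/person/1").add("foaf:name", "A. Author")
report = (
    Resource("http://example.org/report/1")
    .add("dc:title", "An example report")
    .add("dc:date", "2012-02-10")
    .add("dc:creator", creator)
)
doc = report.to_json()
print(json.dumps(doc, indent=2))
print(list(report.triples()))
```

Which date formats the `date` type accepts by default differs between ES versions, so W3CDTF values with reduced precision (a bare year, for example) may need an explicit format in the mapping.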

For jsonizing, I recommend replacing all XML namespace URIs with a
short prefix, for Dublin Core e.g. "dc:title", "dc:creator",
"dc:identifier" etc. By doing this, it is possible to keep a
prefix/namespace URI map to manage arbitrary XML in your Dublin Core model.
This is a rather common use case because the 15 core elements are
quite coarse and will need some refinements in most cases. These
custom refining elements can be nested in XML and so in JSON. And
that is totally fine with ES because ES supports JSON, which is
nested. So, ES can index RDF Dublin Core models almost naturally. ES
also works smoothly with a colon delimiter in field names, so field
names like "dc:title" are the way to go.
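
One way the prefix/namespace map could work, as a sketch only. The `ex` namespace, the sample record, and the function names are made up for illustration, and a namespace URI missing from the map raises a `KeyError` here instead of being handled gracefully.

```python
import xml.etree.ElementTree as ET

# prefix -> namespace URI; the 'ex' entry is a made-up refinement namespace.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "ex": "http://example.org/terms/",
}
URI_TO_PREFIX = {uri: prefix for prefix, uri in PREFIXES.items()}


def field_name(tag):
    """Turn ElementTree's '{uri}local' tag into 'prefix:local'."""
    if not tag.startswith("{"):
        return tag  # element without a namespace
    uri, local = tag[1:].split("}", 1)
    return f"{URI_TO_PREFIX[uri]}:{local}"


def to_json(element):
    """Leaf elements become strings; elements with children become nested objects."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    out = {}
    for child in children:
        out.setdefault(field_name(child.tag), []).append(to_json(child))
    # Unwrap single values so only genuinely repeated fields are arrays.
    return {k: v[0] if len(v) == 1 else v for k, v in out.items()}


SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:ex="http://example.org/terms/">
  <dc:title>An example report</dc:title>
  <dc:creator>
    <ex:name>A. Author</ex:name>
    <ex:affiliation>Example University</ex:affiliation>
  </dc:creator>
</record>"""

doc = to_json(ET.fromstring(SAMPLE))
print(doc)
```

Keeping `PREFIXES` next to the index is what lets the short field names be expanded back to full namespace URIs later.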

It depends on the task at hand: if there are a
lot of static documents ready for indexing, it boils down to creating a
smart XML parser for jsonizing the data. Probably even XML namespaces
could be dropped, because they are not needed when only simple
Dublin Core is present in the data.

If an RDF data model based on Dublin Core is required, the
challenge is to integrate the core ingredients found in the RDF model,
which are resources and properties. There is a lot more, e.g.
ontologies, which is not essential for the rather straightforward
task of indexing and querying RDF literals. If you need validation over
RDF elements, you have much more work to do, which is outside the
scope of search indexes.

Once the data is indexed, query result processing is trivial: transform
JSON back to XML (with an extra root wrapper element to ensure a sound
XML tree). It's just the other way round. From there, you
can go ahead with XSLT and the like. If you are willing to drop
JSON altogether, the most elegant solution would be an XContentGenerator
producing SAX/StAX events. Optionally, query transformers can be put in
front of the ES DSL, for example CQL, using only relevant subsets of the
power of the ES DSL.
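
The JSON-to-XML direction might look roughly like this. It is a sketch assuming field names of the form `prefix:local`; the root element name `record`, the helper names, and the sample source document are assumptions, not part of any ES API.

```python
import xml.etree.ElementTree as ET

PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}
for prefix, uri in PREFIXES.items():
    ET.register_namespace(prefix, uri)  # serialize with 'dc:' instead of 'ns0:'


def qname(field):
    """'dc:title' -> '{http://purl.org/dc/elements/1.1/}title'; plain names pass through."""
    prefix, sep, local = field.partition(":")
    return f"{{{PREFIXES[prefix]}}}{local}" if sep else field


def append_value(parent, field, value):
    """Append one JSON value under `parent`, recursing into objects and arrays."""
    if isinstance(value, list):
        for item in value:
            append_value(parent, field, item)
    elif isinstance(value, dict):
        child = ET.SubElement(parent, qname(field))
        for key, inner in value.items():
            append_value(child, key, inner)
    else:
        ET.SubElement(parent, qname(field)).text = str(value)


def hit_to_xml(source, root_name="record"):
    """Wrap the fields of one hit's source document in a single root element."""
    root = ET.Element(root_name)
    for field, value in source.items():
        append_value(root, field, value)
    return ET.tostring(root, encoding="unicode")


source = {"dc:title": "An example report", "dc:creator": ["A. Author", "B. Author"]}
xml_text = hit_to_xml(source)
print(xml_text)
```

The wrapper element is what keeps the result well-formed when a hit has several top-level fields; the output can then be fed to XSLT or any other XML tooling.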

To simplify the jsonizing of XML, I would
avoid special treatment of attributes in favor of nesting elements. It
helps a bit to get JSON back from XML. Otherwise you have to introduce
contracts, such as a leading '@' in an ES field name always
denoting an XML attribute, which might not always be transparent to
all ES search clients.
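
As an illustration of treating attributes like nested elements, here is a sketch in which an XML attribute simply becomes another field of the enclosing object, with no `@` marker. The field name `value` for the element text, the sample record, and the assumption that attribute and child element names never collide are all mine.

```python
import xml.etree.ElementTree as ET


def to_json(element):
    """Treat XML attributes exactly like child elements: plain nested fields."""
    out = dict(element.attrib)  # attribute name -> value, no '@' prefix
    for child in element:
        # Assumes no attribute shares its name with a child element.
        out.setdefault(child.tag, []).append(to_json(child))
    # Unwrap single-item lists so only repeated elements stay arrays.
    out = {k: v[0] if isinstance(v, list) and len(v) == 1 else v for k, v in out.items()}
    text = (element.text or "").strip()
    if not out:
        return text  # pure text leaf
    if text:
        out["value"] = text  # 'value' is an assumed name for the element text
    return out


SAMPLE = '<record><date scheme="W3CDTF">2012-02-10</date><title>Example</title></record>'
doc = to_json(ET.fromstring(SAMPLE))
print(doc)
```

A search client then sees `date.scheme` as an ordinary field and needs no knowledge of which fields started life as attributes.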

These are my own experiences based on my implementation since ES
0.5.1. I would be glad to share more thoughts if there are further
questions.

Jörg



(Michael Sick) #6

Jörg,

I haven't thought through the details but I've been curious if ES could be
used as an effective custom data store for Apache Jena. If it could be done
efficiently, Jena has lots of the Semantic/RDF functionality vs. rebuilding
it over ES.

http://incubator.apache.org/jena/



(Jörg Prante) #7

Michael,

yes, there were several projects to lay fulltext search over RDF
storage like Jena with the help of search engines like Lucene. See
here http://jena.sourceforge.net/ARQ/lucene-arq.html or here
http://www.dfki.uni-kl.de/~sauermann/papers/Minack+2008.pdf
Smart folks at Talis tried Elasticsearch with Jena a year ago, see
here https://github.com/castagna/EARQ

Well, you have an impedance mismatch. A graph of interpretable
statements (think of it as datalog or a deductive database) versus an
information retrieval model of data flow with almost no logical
interpretations inside, but documents (hits) constructed by queries
instead. Compare SPARQL semantics with Elasticsearch DSL and you get
the idea.

Most RDF fulltext search approaches place RDF storage beside
search indexes, loosely coupled, and try to sync both worlds. That
comes at a price. And it's tough to manage when everything needs to
scale.

Jörg


