Revisiting colons in field names


#1

TL;DR version: Is there any possibility of searching on field names that contain colons in ElasticSearch? (This is necessary for semantically tagged JSON-LD documents.)


This is a bit of a revisit of "Problem with colon ':' in fieldname. Where can I find naming guidelines?" from 2011, but the conclusion of that discussion seemed to be simply not to use colons if possible. Fast forward to today, and the development of semantic markup and JSON-LD. Semantic tags/fields link to a specific ontology, which provides a formal description of the field and its meaning. For an example, see here: http://json-ld.org/spec/ED/json-ld-syntax/20120522/#dfn-compact_iri (reproduced below; note dc:creator, dc:title, ex:contains...).

{
  "@context":
  {
    "dc": "http://purl.org/dc/elements/1.1/",
    "ex": "http://example.org/vocab#"
  },
  "@id": "http://example.org/library",
  "@type": "ex:Library",
  "ex:contains":
  {
    "@id": "http://example.org/library/the-republic",
    "@type": "ex:Book",
    "dc:creator": "Plato",
    "dc:title": "The Republic",
    "ex:contains":
    {
      "@id": "http://example.org/library/the-republic#introduction",
      "@type": "ex:Chapter",
      "dc:description": "An introductory chapter on The Republic.",
      "dc:title": "The Introduction"
    }
  }
}

In my case, I have JSON-LD documents containing tags/fields such as dc:title and foaf:organization, which are used as shorthand for http://purl.org/dc/terms/title and http://xmlns.com/foaf/0.1/organization respectively. The colon is a necessary part of the field identifier, expanded or not. The documents seem to be indexing properly, and 'exists' searches are possible using {"query" : {"query_string" : {"query" : "_exists_:\"dc:title\""}}}. But different permutations of searching on the field name fail - e.g.

{"query_string" : {"\"dc:title\"\:Republic"}}
{"query_string" : {"\"dc:title\":Republic"}}
{"query_string" : {"dc\:title\:Republic"}}
{"query_string" : {"dc:title:Republic"}}

etc...

Given that it looks like the colon is a critical part of semantically tagged documents, is there any possibility that ElasticSearch could/will handle colons as special characters in field names? The basic goal is to be able to search for content associated with a given semantic tag - e.g. 'dc:title'. If it's a case of me using improper syntax, a fix would be welcomed.
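One way to sidestep the query-string syntax entirely is a structured query in which the field name is an ordinary JSON key, so no parser ever has to interpret the colon. The following is only a minimal sketch: it assumes an ElasticSearch version that offers the `match` query, and the helper name `match_query` is hypothetical. It has not been checked against the setup described above.

```python
import json


def match_query(field, text):
    """Build a search body that matches `text` in `field`.

    The field name is used as a plain JSON key, so the colon in
    "dc:title" is never run through the query-string parser.
    (Assumption: the target cluster supports the `match` query.)
    """
    return {"query": {"match": {field: text}}}


body = match_query("dc:title", "Republic")
print(json.dumps(body))
```

The printed JSON can be sent as the request body of a `_search` call. Whether it returns the expected hits still depends on how the field was mapped and analyzed at index time.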


#2

After some more digging, it looks like the issue is down at the level of Lucene's QueryParser, i.e. the parser would need to be customized. Thankfully, Lucene is open source, so I'll keep digging and possibly re-post there.
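For reference, the classic Lucene query syntax lets special characters be backslash-escaped, and inside a JSON request body the backslash itself has to be doubled. The sketch below builds a `query_string` body in which only the colon inside the field name is escaped, while the colon separating field from term is left alone. The helper names and the exact list of special characters are assumptions, and whether a given ElasticSearch/Lucene version honors an escaped colon in a field name is untested here.

```python
import json
import re

# Characters assumed to be special in Lucene's classic query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_field_name(name):
    """Prefix each special character in `name` with a single backslash."""
    return _SPECIAL.sub(r'\\\1', name)


def query_string_body(field, term):
    """Build a query_string body of the form <escaped field>:<term>.

    Only the field name is escaped; the separating colon stays bare
    so the parser can still split field from term.
    """
    query = f"{escape_field_name(field)}:{term}"
    return {"query": {"query_string": {"query": query}}}


body = query_string_body("dc:title", "Republic")
print(json.dumps(body))
```

`json.dumps` takes care of doubling the backslash, which is easy to get wrong when a request body is written by hand.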


#3

I'm also trying to index JSON-LD with prefixes. Did you have any success?


#4

In my case, indexing fields with colons worked (I think); it was searching on fields containing colons that was the problem (and there isn’t much point in storing them if you can’t retrieve them properly). As mentioned, I did dig into Lucene and found the culprit, but unfortunately I didn’t get around to developing a patch. Good luck!

