Metadata: using one field to populate another


(jschelle-2) #1

I've been searching around trying to find the best way to do this, but
haven't really found anything so far. Any help would be appreciated.
Let's say I have simple documents that come in through a custom River:

doc:
  title: string
  content: string

And I want to add the following fields (numberOfNames, numberOfWords,
numberOfPlaces) and populate them based on a custom parsing of the content
field. Essentially I'm analyzing one field of the original document and
using it to populate additional fields (not part of the original document)
so that I can provide faceted search on these additional fields. Are there
existing approaches for this? I would imagine this is pretty common, but I
haven't been able to find much out there.

I was thinking of making a new Analyzer, something like a MetadataAnalyzer,
where you could configure a sequence of Tokenizer and Filter objects that
produce the tokens to be indexed for a given field. For example, you could
do something like this:

index:
  analysis:
    analyzer:
      metadata_analyzer:
        numberOfWords: WordsTokenizer, CountTokenFilter
        numberOfPlaces: WordsTokenizer, PlaceTokenFilter, CountTokenFilter
        numberOfNames: WordsTokenizer, NameTokenFilter, CountTokenFilter
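
To make the CountTokenFilter idea concrete, here's a rough sketch of the
shape I have in mind, as a plain Lucene TokenFilter (none of these classes
exist today; the names are just illustrative). It consumes the whole token
stream and emits a single token holding the count:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical filter: collapses the incoming token stream into a
// single token whose text is the number of tokens seen.
public final class CountTokenFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private boolean done = false;

    public CountTokenFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (done) {
            return false;
        }
        int count = 0;
        while (input.incrementToken()) {
            count++;
        }
        // Emit exactly one token: the count, as a string.
        clearAttributes();
        termAtt.setEmpty().append(Integer.toString(count));
        done = true;
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        done = false;
    }
}

The PlaceTokenFilter/NameTokenFilter would sit in front of it and drop
every token that isn't a place/name, so the count that comes out is the
metric I want.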

Thoughts??


(Ivan Brusic) #2

If the only difference between each field is the analysis, you could
use multi-fields on the original source field.

http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html

Each sub-field can have its own analyzer, custom or not; in fact, defining
different analyzers on the same source field is the primary use case for
multi-field.
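
For example, something along these lines (the sub-field and analyzer names
are placeholders; you would define the count analyzers yourself under
index.analysis):

{
  "doc" : {
    "properties" : {
      "content" : {
        "type" : "multi_field",
        "fields" : {
          "content"        : { "type" : "string", "analyzer" : "standard" },
          "numberOfWords"  : { "type" : "string", "analyzer" : "word_count_analyzer" },
          "numberOfPlaces" : { "type" : "string", "analyzer" : "place_count_analyzer" },
          "numberOfNames"  : { "type" : "string", "analyzer" : "name_count_analyzer" }
        }
      }
    }
  }
}

You can then facet on content.numberOfWords and so on, while plain
content keeps its normal analysis.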

--
Ivan



(jschelle-2) #3

Oh perfect, thanks!

Thanks,
Jay



(Jconwell) #4

I'm trying to do something very similar.

Wouldn't using multi-field in this scenario cause the same field to be tokenized four different times: once for the normal text-field tokenization, and once for each of the three metrics being calculated?

Is there a way to perform all three analyses in one analyzer pipeline and then store the three resulting metrics in new fields?


(Nik Everett) #5

Not right now, no.
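
The usual workaround is to compute the metrics yourself before the document
reaches the analysis chain, e.g. in your river, and index them as ordinary
numeric fields. A rough sketch with the Java client (countPlaces and
countNames are stand-ins for whatever detection you'd actually run):

import org.elasticsearch.client.Client;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public class MetricsIndexer {

    private final Client client;

    public MetricsIndexer(Client client) {
        this.client = client;
    }

    public void index(String title, String content) throws Exception {
        // Compute the derived metrics up front, then index them as
        // regular fields alongside the original document.
        String trimmed = content.trim();
        int numberOfWords = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        client.prepareIndex("myindex", "doc")
              .setSource(jsonBuilder().startObject()
                  .field("title", title)
                  .field("content", content)
                  .field("numberOfWords", numberOfWords)
                  .field("numberOfPlaces", countPlaces(content))
                  .field("numberOfNames", countNames(content))
                  .endObject())
              .execute().actionGet();
    }

    // Stand-ins: plug in whatever name/place detection you actually use.
    private int countPlaces(String text) { return 0; }
    private int countNames(String text) { return 0; }
}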

