Metadata: using one field to populate another

I've been searching around trying to find the best way to do this, but
haven't really found anything so far. Any help would be appreciated.
Let's say I have simple documents that come in through a custom River:

doc:
  title: string
  content: string

And I want to add the following fields (numberOfNames, numberOfWords,
numberOfPlaces) and populate them based on a custom parsing of the content
field. Essentially I'm analyzing one field of the original document and
using it to populate additional fields (not part of the original document) so
that I can provide faceted search on these additional fields. Are there
some existing approaches to do this? I would imagine this is pretty common
but haven't been able to find much out there.

I was thinking of making a new Analyzer, something like MetadataAnalyzer,
where you could configure a sequence of Tokenizer and Filter objects that
produce the tokens to be indexed for a given field. For example, you could
do something like this:

index:
  analysis:
    analyzer:
      metadata_analyzer:
        numberOfWords: WordsTokenizer, CountTokenFilter
        numberOfPlaces: WordsTokenizer, PlaceTokenFilter, CountTokenFilter
        numberOfNames: WordsTokenizer, NameTokenFilter, CountTokenFilter

Thoughts??

If the only difference between each field is the analysis, you could
use multi-fields on the original source field.

Each field can have its own analyzer (custom or not). The primary use
of multi-fields is defining different analyzers on the same source
field.
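
For example, a mapping along these lines (a rough sketch of the
multi_field syntax; the three custom analyzer names here are
placeholders for whatever you define under index.analysis):

{
  "doc": {
    "properties": {
      "content": {
        "type": "multi_field",
        "fields": {
          "content":        { "type": "string" },
          "numberOfWords":  { "type": "string", "analyzer": "word_count_analyzer" },
          "numberOfPlaces": { "type": "string", "analyzer": "place_count_analyzer" },
          "numberOfNames":  { "type": "string", "analyzer": "name_count_analyzer" }
        }
      }
    }
  }
}

You would then facet on content.numberOfWords, content.numberOfPlaces,
and content.numberOfNames, with the analyzers themselves defined under
index.analysis.analyzer in the settings, much like the config you
sketched.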

--
Ivan


Oh perfect, thanks!

Thanks,
Jay


I'm trying to do something very similar.

Wouldn't using a multi-field for this scenario cause the same field to be tokenized four different times? Once for the normal text field tokenization, and once more for each of the three metrics you are calculating.

Is there a way to perform all three analyses in one analyzer pipeline and then store the three resulting metrics in new fields?

Not right now, no.
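
One workaround is to compute the metrics yourself before indexing and
send them as ordinary fields in the document, so nothing has to be
re-tokenized at all. A rough sketch in Python (the index/type names and
the place/name counters are hypothetical stand-ins for your own
parsing):

import json
import re
import urllib.request

def count_words(text):
    # naive word count; swap in whatever tokenization you actually need
    return len(re.findall(r"\w+", text))

def count_places(text):
    return 0  # stub: replace with your gazetteer/NER place lookup

def count_names(text):
    return 0  # stub: replace with your person-name extraction

def index_doc(doc):
    # derive the metadata fields from content, then index the enriched doc
    doc["numberOfWords"] = count_words(doc["content"])
    doc["numberOfPlaces"] = count_places(doc["content"])
    doc["numberOfNames"] = count_names(doc["content"])
    req = urllib.request.Request(
        "http://localhost:9200/myindex/doc/",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

index_doc({"title": "example", "content": "Alice flew to Paris"})

Since the counts arrive as plain numeric fields, faceting on them is
cheap and the content field is only analyzed once.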