How to extract the fields using regex?


(Dong Aihua) #1

Hi,
I want to extract fields using a regex. For example, I have a doc {body:
jackie's email is xxx@gmail.com}, and I want to extract the email address
from the body using a regex.
I know that script fields seem to be meant for this, but I can't find an
exact example in the API docs.
Does anyone know how to use a regex in a script field?
Thanks a lot.


(Chris Male) #2

So if I understand correctly, you have indexed documents with a single
field, say 'body', and you want to extract some content from that? Is this
just for display purposes, or is it for search? Could you do it before
the content is indexed, or on the client side after the search?


(Dong Aihua) #3

Yes, I indexed the documents with a field "body", and I want to
extract some content from it.
This is not only for display, but also for search and statistics.


(Dong Aihua) #4

In fact, the real purpose is to extract new fields from the "body"
content and add these extracted fields back into the original doc.


(olof) #5

I would say it is better to extract fields before indexing, for example in
a document processing framework. Or simply do the content extraction in
your indexing client.
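
A minimal sketch of what extraction in the indexing client could look like, assuming the document has a "body" field as in the original question. The helper name `withExtractedEmail`, the target field name "email", and the deliberately simple pattern are all made up for illustration; none of this is an Elasticsearch API.

```javascript
// Hypothetical helper: pull the first email-like token out of doc.body and
// return a copy of the doc with an extra "email" field.
// The pattern is intentionally simple and is not a full email validator.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;

function withExtractedEmail(doc) {
  const match = EMAIL_RE.exec(doc.body || "");
  // Leave the document untouched when nothing matches.
  if (match === null) {
    return doc;
  }
  // Copy the doc so the caller's object is not mutated.
  return Object.assign({}, doc, { email: match[0] });
}

const enriched = withExtractedEmail({ body: "jackie's email is xxx@gmail.com" });
console.log(JSON.stringify(enriched));
```

The enriched object is what the client would then send to the index, so the extracted value becomes a regular field that can be searched and aggregated like any other.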

Do you want to do this for all documents or just the ones that a certain
query finds? I suppose you could do the extraction in the search client and
then do an _update of the document with the new fields. This will be
expensive in the long run though, since it will cause the document to be
reindexed. You'll end up with lots of deletes in your index or, if you use
versioning, double the index size.


(Dong Aihua) #6

I saw this feature in Splunk. Splunk lets users define their own regexes
to extract fields from existing docs, and then use those new fields in
searches.
I would like to have the same feature.
But from the response above, it seems expensive to do this in
Elasticsearch.


(olof) #8

Well, it probably depends on the size of your index and the number of
updates you'll be doing, and how often. If you're only modifying some
documents, it might not be a problem.

Also, it sounds like you want to modify just some docs that the user gets
from a query, which I suppose won't work with preprocessing the data.

If you have the time, do some tests. Change your client to run a query,
extract data with a regex and then update the document in the index with
that. Or look into alternative scripting languages for script fields; the
documentation
(http://www.elasticsearch.org/guide/reference/modules/scripting.html)
mentions several plugins you can install to get JavaScript, Python and
Groovy, for example. That should give you regex capabilities. I'm not sure
how you'd go about storing the results from those scripts in the document
for later, though...
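
A rough sketch of the "query, extract, update" test described above, limited to the part that needs no network access: turning search hits into partial-update payloads. The helper `buildUpdates`, the sample hits, and the target field "email" are hypothetical. It also assumes your Elasticsearch version accepts a partial document under "doc" in the update API; check that before relying on it.

```javascript
// Hypothetical helper: for each search hit, apply a regex to one source
// field and describe the partial update that would store the result.
// Pass a regex without the "g" flag, since exec() on a global regex keeps
// state between calls.
function buildUpdates(hits, sourceField, targetField, regex) {
  const updates = [];
  for (const hit of hits) {
    const text = String((hit._source || {})[sourceField] || "");
    const match = regex.exec(text);
    if (match === null) continue; // skip docs with nothing to extract
    updates.push({
      index: hit._index,
      type: hit._type,
      id: hit._id,
      // Assumption: the update API takes a partial document under "doc".
      body: { doc: { [targetField]: match[0] } },
    });
  }
  return updates;
}

// Made-up hits in the shape a search response returns them.
const sampleHits = [
  { _index: "test", _type: "doc", _id: "1", _source: { body: "jackie's email is xxx@gmail.com" } },
  { _index: "test", _type: "doc", _id: "2", _source: { body: "no address in this one" } },
];

const updates = buildUpdates(
  sampleHits, "body", "email", /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/
);
console.log(JSON.stringify(updates, null, 2));
```

Only documents that actually match produce an update, which keeps the number of reindexed documents (and the resulting deletes) as low as possible.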


(Christoph Wurm) #9

FYI: I just tested what @olof suggested, so here's how you would use regular expressions to extract content from existing fields.

Given a document in Elasticsearch that looks like this:

{
    "_index": "test",
    "_type": "tweet",
    "_id": "4",
    "_score": 1,
    "_source": {
       "date": "2014-09-16",
       "name": "John Smith",
       "tweet": "The Elasticsearch API is really easy to use",
       "user_id": 1
    }
}

You could extract the first name from the name field like this:

GET test/_search
{
   "script_fields": {
       "first_name": {
           "script": "/\\w+/.exec(_source['name'])[0]",
           "lang": "javascript"
       }
   },
   "_source": "*"
}

Install the JavaScript language plugin first.
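
The regex used in the script can be tried outside Elasticsearch first. This is only a sketch in plain JavaScript; the plugin's JavaScript engine may differ in details from the runtime you test with, and `firstWord` is a hypothetical helper name, not something the plugin provides.

```javascript
// Same pattern as in the script field above, run against the sample name.
const name = "John Smith";
const firstName = /\w+/.exec(name)[0];
console.log(firstName); // "John"

// exec() returns null when nothing matches, so indexing [0] directly would
// throw. A guarded variant returns null instead.
function firstWord(value) {
  const match = /\w+/.exec(String(value));
  return match === null ? null : match[0];
}

console.log(firstWord("   ")); // null
```

The same guard could be written inline in the script field if some documents might have a name with no word characters in it.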

