Any way to index but not save parts of the document?


(mattx) #1

Hi,

I'm sorry if this is documented somewhere, but I cannot find it, although I think I remember reading it.

I want to index some email messages for search but they are already in a database so I do not want ES to save a copy of all the data (for example the body). Is there any way I can index an ENTIRE doc that looks like:

{
  "dbid": 12345,
  "tos": ["a@foo.com", "b@foo.com"],
  "from": "c@foo.com",
  "subject": "A subject",
  "body": "A body"
}

and tell ES to only store a doc that looks like {dbid: 12345}?


(Clinton Gormley) #2

Hi Matt

Is there any way I can index an ENTIRE doc that looks like:
and tell ES to only store a doc that looks like {dbid: 12345}?

You can disable the _source field if you like.

http://www.elasticsearch.org/guide/reference/mapping/source-field.html
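As a rough sketch of what disabling _source could look like, the snippet below builds a mapping body as a Python dict and serializes it to JSON. The type name `email` is an assumption made for illustration, and the envelope around the mapping differs between Elasticsearch versions, so treat the layout as approximate and check the page linked above for the release in use.

```python
import json

# Hypothetical type name; it does not come from the thread.
TYPE_NAME = "email"


def source_disabled_mapping(type_name: str) -> dict:
    """Build a mapping body that turns off storage of the original JSON (_source)."""
    return {
        type_name: {
            # With _source disabled, the submitted document is indexed for
            # search but the JSON itself is not kept.
            "_source": {"enabled": False},
        }
    }


mapping_json = json.dumps(source_disabled_mapping(TYPE_NAME), indent=2)
print(mapping_json)
```

The resulting JSON would be sent as the body of a put-mapping request before any documents are indexed.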

clint


(mattx) #3

Wow. The all or nothing approach doesn't work for me. I need to be able to at least get back the document ID of the thing I am indexing. What I really need is the ability to throw away only certain fields in the _source.

Am I being stupid? If I bulk index 100k email messages and don't include the _source then how can I later fetch these emails after doing a search? Do I have to store the IDs generated by the indexing operation as they map to my original IDs? I don't love that idea and I'm not even sure how to do that in a bulk indexing operation.


(Clinton Gormley) #4

Hi Matt

On Tue, 2011-06-28 at 10:44 -0700, mattx wrote:

Wow. The all or nothing approach doesn't work for me. I need to be able to
at least get back the document ID of the thing I am indexing. What I really
need is the ability to throw away only certain fields in the _source.

Am I being stupid? If I bulk index 100k email messages and don't include
the _source then how can I later fetch these emails after doing a search?
Do I have to store the IDs generated by the indexing operation as they map
to my original IDs? I don't love that idea and I'm not even sure how to do
that in a bulk indexing operation.

I don't follow what it is you are trying to do. Whether you index or
bulk_index, you get back the ID (either the ID that you specify or an
autogenerated ID).
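
To illustrate the point about specifying the ID, here is a minimal sketch that builds a bulk request body in which each message's dbid is reused as the document ID, so a search hit's _id maps straight back to the database row. The index name `emails` and the helper `bulk_index_body` are assumptions for illustration; older releases also expect a `_type` entry in the action line.

```python
import json


def bulk_index_body(index_name: str, messages: list) -> str:
    """Build a newline-delimited bulk body that reuses each dbid as the document ID."""
    lines = []
    for msg in messages:
        # Action line: supplying _id means no autogenerated ID has to be tracked.
        action = {"index": {"_index": index_name, "_id": str(msg["dbid"])}}
        lines.append(json.dumps(action))
        # Document line: the full message, so every field is available for indexing.
        lines.append(json.dumps(msg))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


messages = [
    {
        "dbid": 12345,
        "tos": ["a@foo.com", "b@foo.com"],
        "from": "c@foo.com",
        "subject": "A subject",
        "body": "A body",
    }
]
body = bulk_index_body("emails", messages)
print(body)
```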

Why don't you want the _source? Because it contains too much
information? What about deleting the information that you don't want to
store before indexing the email?

Perhaps a bit more context will help...

clint


(medcl.net) #5

Hey, you can use the fields parameter to specify which fields you
want returned.
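
A minimal sketch of such a search request body, built as a Python dict, might look like the following. The helper name `search_body` and the use of a query_string query are assumptions for illustration. The `fields` key reflects the search API of the releases current at the time of this thread; later Elasticsearch versions changed how returned fields are selected, so the key name may need adjusting.

```python
import json


def search_body(query_text: str, fields: list) -> dict:
    """Build a search body that asks for only the listed fields in each hit."""
    return {
        "query": {"query_string": {"query": query_text}},
        # Only these fields are requested, instead of the whole document.
        "fields": fields,
    }


request_json = json.dumps(search_body("subject:invoice", ["dbid"]))
print(request_json)
```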



(Shay Banon) #6

Another way is to not store the source at all, and instead explicitly store only certain fields (by enabling store for each of those fields in the mapping).
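
As a sketch of that approach, the mapping below disables _source and marks only dbid as stored, so the other fields stay searchable without being kept. The type name `email`, the field type `long`, and the helper `stored_fields_mapping` are assumptions for illustration. Releases from the time of this thread wrote the setting as `"store": "yes"`, while later releases use a boolean, as shown here.

```python
import json


def stored_fields_mapping(type_name: str, stored: dict) -> dict:
    """Build a mapping that disables _source and stores only the given fields.

    `stored` maps field name to field type, e.g. {"dbid": "long"}.
    """
    properties = {
        name: {"type": field_type, "store": True}
        for name, field_type in stored.items()
    }
    return {
        type_name: {
            "_source": {"enabled": False},
            # Fields not listed here keep the default of not being stored.
            "properties": properties,
        }
    }


mapping = stored_fields_mapping("email", {"dbid": "long"})
print(json.dumps(mapping, indent=2))
```

A search that requests dbid would then return just that value per hit, which is enough to look the message up in the original database.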


