Attachment Plugin Questions on Storing


(Mike Gaffney) #1

I'm trying to use the attachments plugin. I've got the following
mapping:

{
    "docs": {
        "properties": {
            "contents": {
                "type": "attachment",
                "fields": {
                    "contents": {"store": "no"}
                }
            },
            "lastModified": {"type": "long", "index": "analyzed", "store": "no"}
        }
    }
}

And the following index code:

XContentBuilder objectBuilder = jsonBuilder().startObject();
objectBuilder.startObject(Index.CONTENTS);
if (extension.equals("xml")) {
    objectBuilder.field("_content_type", MimeTypes.XML);
} else {
    objectBuilder.field("_content_type", MimeTypes.PLAIN_TEXT);
}
objectBuilder.field("_name", file.getName());
objectBuilder.field("content",
        Base64.encodeBase64(FileUtils.readFileToString(file).getBytes()));
objectBuilder.endObject();
objectBuilder.field(Index.LAST_MODIFIED, file.lastModified());
objectBuilder.endObject();
IndexRequestBuilder setSource = client.prepareIndex(Index.INDEX,
        Index.TYPE, file.getAbsolutePath()).setSource(objectBuilder);
setSource.execute().actionGet();

But when I look at the indexing on the server I see:
{
    doc: {
        properties: {
            lastModified: {
                index: "analyzed"
                type: "long"
            }
            contents: {
                path: "full"
                type: "attachment"
                fields: {
                    author: {
                        type: "string"
                    }
                    title: {
                        type: "string"
                    }
                    keywords: {
                        type: "string"
                    }
                    contents: {
                        type: "string"
                    }
                    date: {
                        format: "dateOptionalTime"
                        type: "date"
                    }
                    content_type: {
                        type: "string"
                    }
                }
            }
        }
    }
}

Basically, I don't really want to store the contents, just index the
documents and be able to search on them. I'm indexing files that are
on the computer already so I don't need the contents, and in fact it's
taking up a ton of space to have the contents in there.

Another question: the contents seem to just be the base64. Is that
correct, or am I doing something wrong?

I'm using this as a local machine file search mechanism for a large
art / document tree that each user has locally on their machines.

My results look like this (sorry for the redactions, it's proprietary info):

{"took":4,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1663,"max_score":1.0,"hits":[{"_index":"docs","_type":"doc","_id":"the_art_redacted","_score":1.0,"fields":{"contents":{"content":"REALLY_LONG_BASE64_STRING","_name":"the_file_name_redacted","_content_type":"text/plain"}}}]}}

Any additional explanation of attachments will be quite helpful.

Thanks,
Mike
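
A side note on the encoding step in the index code above, offered as a minimal sketch rather than an answer to the storage question: FileUtils.readFileToString(file).getBytes() decodes the file into a String and re-encodes it with the platform default charset, which can alter the bytes of any file that is not plain text in that charset. Reading the raw bytes and Base64-encoding those avoids the round trip. The names AttachmentEncoding, encode, and encodeFile are hypothetical, and java.util.Base64 assumes Java 8 or newer, which is later than the setup in this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper, not part of the original code in this thread.
final class AttachmentEncoding {
    private AttachmentEncoding() {}

    /** Base64-encodes raw bytes without any charset conversion. */
    static String encode(byte[] rawBytes) {
        return Base64.getEncoder().encodeToString(rawBytes);
    }

    /** Reads the file as bytes (never as text) and returns the Base64 text. */
    static String encodeFile(Path file) throws IOException {
        return encode(Files.readAllBytes(file));
    }
}
```

Under these assumptions, the string returned by encodeFile(file.toPath()) would take the place of the Base64.encodeBase64(...) value passed for the "content" field.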


(Paul Loy) #2

Can you have a look in your logs for statements about mapping changes? It
may be that you don't have everything specified in your mapping, so it's
getting overridden by dynamic mappings.

You should see update mapping messages in your logs.

--

Paul Loy
paul@keteracel.com
http://uk.linkedin.com/in/paulloy


(Mike Gaffney) #3

On debug all I see related to mapping is:

20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-pool-11-thread-1] DEBUG org.elasticsearch.index.mapper - [Ani-Mator] [docs] using dynamic[true], default mapping: location[null] and source[{
"_default_" : {
}
}]
20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-pool-11-thread-1] DEBUG org.elasticsearch.index.cache.field.data.resident - [Ani-Mator] [docs] using [resident] field cache with max_size [-1], expire [null]
20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-pool-11-thread-1] DEBUG org.elasticsearch.index.cache - [Ani-Mator] [docs] Using stats.refresh_interval [1s]
20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-pool-11-thread-1] INFO org.elasticsearch.cluster.metadata - [Ani-Mator] [docs] creating index, cause [api], shards [5]/[1], mappings [doc]

I'm guessing that using dynamic[true] means that I'm not doing it
right.


(Paul Loy) #4

If you can gist full logs, mappings, settings, code, etc, (or as much as you
can without giving away proprietary stuff) that's quite useful :wink:

So at the end you have a create index[api] for docs. Do you push the
mappings in there? Can I see that code?


(Mike Gaffney) #5

https://gist.github.com/1251943


(Paul Loy) #6

I would have expected the following line to produce a put mapping,
cause [api] message (or something similar) in the logs:

InputStream docsMappings = IndexerMain.class.getResourceAsStream("/mappings/docs.json");

On Thu, Sep 29, 2011 at 2:03 PM, Mike Gaffney mr.gaffo@gmail.com wrote:

https://gist.github.com/1251943


(Paul Loy) #7

or rather the block:

InputStream docsMappings = IndexerMain.class.getResourceAsStream("/mappings/docs.json");
String docsMappingAsString = IOUtils.toString(docsMappings);
prepareCreate.addMapping(Index.TYPE, docsMappingAsString);
prepareCreate.execute().actionGet();


(Mike Gaffney) #8

Added the log output. There is a create by API that happens, but not
much else that I can tell.


(Paul Loy) #9

20110929-14:22:17 [elasticsearch[Xi'an Chi Xan]clusterService#updateTask-pool-11-thread-1] DEBUG org.elasticsearch.index.mapper - [Xi'an Chi Xan] [docs] using dynamic[true], default mapping: location[null] and source[{
"_default_" : {
}
}]

So yeah, that doesn't look good. Can you try putting the mapping after
creating the index?
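
One way to try the suggestion above without the Java client is the REST put-mapping endpoint. The sketch below is an illustration under stated assumptions, not the code from the gist: it uses only JDK classes, the names PutMappingSketch, mappingUrl, and putMapping are hypothetical, and the /{index}/{type}/_mapping path is the form used by Elasticsearch releases of this era (later versions changed it). The index has to exist before the call is made.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; only JDK classes are used.
final class PutMappingSketch {
    private PutMappingSketch() {}

    /** Builds the put-mapping endpoint for one index and type. */
    static String mappingUrl(String baseUrl, String index, String type) {
        return baseUrl + "/" + index + "/" + type + "/_mapping";
    }

    /** Sends the mapping JSON with HTTP PUT and returns the HTTP status code. */
    static int putMapping(String baseUrl, String index, String type, String mappingJson)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(mappingUrl(baseUrl, index, type)).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(mappingJson.getBytes(StandardCharsets.UTF_8));
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

A call might look like PutMappingSketch.putMapping("http://localhost:9200", "docs", "doc", docsMappingAsString), where the address is an assumed local default and docsMappingAsString is the string already loaded from /mappings/docs.json.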


(Mike Gaffney) #10

Done. I get this log output:

20110929-15:02:25 [elasticsearch[cached]-pool-1-thread-2] TRACE org.elasticsearch.index.shard.service - [White Tiger] [docs][4] refresh with waitForOperations[false]
20110929-15:02:25 [elasticsearch[cached]-pool-1-thread-2] DEBUG org.elasticsearch.index.gateway - [White Tiger] [docs][4] recovery completed from local, took [6ms]
    index : files [0] with total_size [0b], took[1ms]
          : recovered_files [0] with total_size [0b]
          : reusing_files [0] with total_size [0b]
    translog : number_of_operations [0], took [6ms]
20110929-15:02:25 [elasticsearch[cached]-pool-1-thread-2] DEBUG org.elasticsearch.cluster.action.shard - [White Tiger] sending shard started for [docs][4], node[LK1HamGVSqCXrOxh7zr8yA], [P], s[INITIALIZING], reason [after recovery from gateway]
20110929-15:02:25 [elasticsearch[cached]-pool-1-thread-2] DEBUG org.elasticsearch.cluster.action.shard - [White Tiger] received shard started for [docs][4], node[LK1HamGVSqCXrOxh7zr8yA], [P], s[INITIALIZING], reason [after recovery from gateway]
20110929-15:02:25 [elasticsearch[White Tiger]clusterService#updateTask-pool-11-thread-1] DEBUG org.elasticsearch.cluster.metadata - [White Tiger] [docs] create_mapping [doc] with source [{"doc":{"properties":{"contents":{"type":"attachment","path":"full","fields":{"contents":{"type":"string"},"author":{"type":"string"},"title":{"type":"string"},"date":{"type":"date","format":"dateOptionalTime"},"keywords":{"type":"string"},"content_type":{"type":"string"}}},"lastModified":{"type":"long","index":"analyzed"}}}}]
... added
20110929-15:02:25 [elasticsearch[White Tiger]clusterService#updateTask-pool-11-thread-1] TRACE org.elasticsearch.cluster.service - [White Tiger] cluster state updated:
version [5], source [put-mapping [doc]]
nodes:

with this config:

{
    "docs": {
        "properties": {
            "contents": {
                "type": "attachment",
                "path": "full",
                "store": "no",
                "fields": {
                    "contents": {"type": "string", "store": "no", "index": "analyzed"},
                    "author": {"type": "string"},
                    "title": {"type": "string"},
                    "date": {"type": "date", "store": "no", "format": "dateOptionalTime"},
                    "keywords": {"type": "string"},
                    "content_type": {"type": "string"}
                }
            },
            "lastModified": {"type": "long", "index": "analyzed", "store": "no"}
        }
    }
}


(Mike Gaffney) #11

I never got the attachment system to stop storing the full base64
document in the db or to stop returning it in results. The results
aren't that big of a deal, but I can't really store all of the
documents twice (once for real and once in the index). We're already
hard-drive constrained as it is.

Shay, do you have any thoughts?


(Lukáš Vlček) #12

Hi,

Maybe there are other possibilities, but you can completely disable
storing the _source [1], and you can also return only selected fields [2] in
the search results.

[1] http://www.elasticsearch.org/guide/reference/mapping/source-field.html
[2] http://www.elasticsearch.org/guide/reference/api/search/fields.html
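As a rough illustration of [2], a search request body along the following lines should keep the base64 payload out of the results. This is only a sketch: the search term "invoice" is a made-up placeholder, the field name "contents" is taken from the mapping earlier in the thread, and it assumes that an empty "fields" list makes each hit carry only its _id and _type, which is how the search API of that era is documented to behave.

```json
{
  "query": {
    "query_string": {
      "default_field": "contents",
      "query": "invoice"
    }
  },
  "fields": []
}
```

Since the documents in this thread are indexed with the file's absolute path as their id, the _id of each hit would already be enough to locate the file on disk.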

However, your request to disable storing only the attachment's base64 data
might be reasonable. You are probably not the only user requesting this. On
the other hand, this can make things more complicated later because the
complete document source may not be available for re-indexing. It is
probably up to Shay whether he wants to allow this or not. You can always
clone the mapper-attachments plugin, make your customizations and try to send
a pull request.

Just my 2 cents.

Regards,
Lukas



(Shay Banon) #13

Currently, you can disable _source, and then the attachment will not be
stored either. There is no option to "remove" the attachment from _source
(the JSON doc) and store everything but the attachment in _source.
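For reference, disabling _source is done in the type mapping. The fragment below is a minimal sketch, not a tested configuration: the type name "doc" is assumed from the create_mapping log lines quoted in this thread, and the two properties are reduced to the bare minimum from the mapping posted earlier.

```json
{
  "doc": {
    "_source": { "enabled": false },
    "properties": {
      "contents": { "type": "attachment" },
      "lastModified": { "type": "long" }
    }
  }
}
```

With _source disabled, the original JSON (including the base64 attachment) is no longer kept, so it cannot be returned from the index or used for re-indexing later. The files would have to be read from disk again for that.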



(Mike Gaffney) #14

Thanks for the clarification, guys! Good enough for what I'm doing.



(system) #15