Mapper Plugin Issues

I am using the ElasticSearch mapper plugin to index the contents of pdf, xls,
and ppt files. My mapping is shown below.

Indexing of the documents seems to be working fine and I am getting the
expected results. However, when I look at the actual index size, it
increases linearly with the file size. In other words, if I index a 100KB
pdf, the actual index size increases by ~100KB. Ideally, the mapper should
extract only the text data and index that. However, it doesn't seem to do so.
I have the following two questions:

  1. Is it required to specify "content_type" for indexing contents of
    "non-text" files?
  2. What is the right way of doing content indexing? Doesn't the mapper take
    care of file types? Based on the documentation, it looks like it does.
    However, that doesn't seem to be the case in practice.

Using ElasticSearch Nest for C#

[ElasticType(
    Name = "IndexDocument",
    SearchAnalyzer = "standard",
    IndexAnalyzer = "standard",
    DateDetection = true,
    NumericDetection = true
)]
public class Document
{
    public string id { get; set; }
    [ElasticProperty(Type = Nest.FieldType.attachment, Store = false, TermVector = Nest.TermVectorOption.with_positions_offsets)]
    public ESAttachment esAttachment { get; set; }
}

public class ESAttachment
{
    public string _content_type { get; set; }
    public string _name { get; set; }
    public string content { get; set; }
}

Here is the code for indexing:

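    // Create the mapping from the [ElasticType]/[ElasticProperty] attributes.
    esClient.MapFromAttributes<Document>();

    // Build one document; the file is sent as a Base64 string in "content".
    var item = new Document();
    item.esAttachment = new ESAttachment();
    item.esAttachment._content_type = "application/pdf";
    item.esAttachment.content = Convert.ToBase64String(System.IO.File.ReadAllBytes(file));
    item.esAttachment._name = "test-pdf";

    List<Document> bulkDoc = new List<Document>();
    bulkDoc.Add(item);

    // Queue one index operation per document in a single bulk request.
    var des = new BulkDescriptor();
    foreach (var doc in bulkDoc)
    {
        des.Index<Document>(j => j.Object(doc).Index("indexname"));
    }

    // BulkAsync returns a Task; it has to be awaited before the response is read.
    var status = esClient.BulkAsync(des);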

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/02b8b822-ed47-4da5-901b-07b020179614%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

One concern with the mapper attachment is that you have to provide the full document (100 KB) even if, in the end, only a single character is extracted.
Also, by default, _source is stored. That means your Base64-encoded field will be stored as is in Elasticsearch.

You can disable _source, or you can remove parts of the source using the include/exclude settings: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-source-field.html#include-exclude

Also, the _all field, which is enabled by default, indexes the content a second time. You may want to disable it.
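
As a rough sketch of the two suggestions above, the mapping could look something like this. The type name IndexDocument and the field name esAttachment are assumed from the question's classes, and the exclude path is only an illustration; adjust it to the field names your client actually sends:

```json
{
  "IndexDocument": {
    "_source": {
      "excludes": ["esAttachment.content"]
    },
    "_all": {
      "enabled": false
    },
    "properties": {
      "id": { "type": "string" },
      "esAttachment": { "type": "attachment" }
    }
  }
}
```

With the Base64 content excluded from _source, the extracted text is still indexed, but the original file is no longer returned with search hits and cannot be rebuilt from Elasticsearch, so keep the files somewhere else.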

1/ You don't have to set _content_type. It will be automatically set by the plugin. If you force it, you need to make sure it corresponds to the actual content.
2/ Do you mean the file extension? No, we don't care about the filename or extension…

I hope this helps

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

On 23 June 2014 at 22:46:39, Deepikaa Subramaniam (deeps.subramaniam@gmail.com) wrote:

