Index size increases dramatically

I am running a 3-node Elasticsearch v2.4.0 cluster, mainly indexing PDF files. I was using Base64.getEncoder().encodeToString(bytes) for the PDF content. That method uses a deprecated String constructor, which caused some Tika exceptions, so I changed to Base64.getEncoder().encode(bytes), but the index size increased so dramatically that I ran out of disk space on my EC2 instance.

Anybody seen this before? OS is Ubuntu 14.04.3 LTS (GNU/Linux 3.13.0-74-generic x86_64); java version "1.8.0_91", Java(TM) SE Runtime Environment (build 1.8.0_91-b14), Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode).
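For reference, the two calls produce the same base64 data and differ only in return type. A minimal sketch, with a hypothetical file path:

	// java.util.Base64 (Java 8), java.nio.file.Files / Paths
	byte[] bytes = Files.readAllBytes(Paths.get("some.pdf")); // hypothetical path
	String asString = Base64.getEncoder().encodeToString(bytes); // returns String
	byte[] asBytes = Base64.getEncoder().encode(bytes); // returns byte[]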

You might want to try encoding your documents outside of ES to see what the differences are between the two method calls. I presume these are meant to be unanalyzed fields if they are base64 encoded?

Thanks for your reply.

I am using https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-attachments.html with the type "attachment". Not sure if they are analyzed or unanalyzed. But I do want to search the content of the PDF.

I see, you are using the mapper attachments plugin to set the attachment content, which means you have to provide the base64 encoding of the attachment. When you run those two different methods to get the base64 encoding, are there any differences? It sounds like a Tika issue, not an ES one?
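For context, indexing into an attachment-typed field looks roughly like this. This is a sketch only: the index, type, and id are taken from the error messages in this thread, while the field name "file", the variable names, and the client setup are assumptions:

	// org.elasticsearch.common.xcontent.XContentFactory, java.util.Base64
	// Assumes "client" is a connected ES 2.x Client and "pdfBytes" holds the raw PDF
	String b64 = Base64.getEncoder().encodeToString(pdfBytes);
	client.prepareIndex("tech_pdf0", "pdf", "ptv855.pdf")
	        .setSource(XContentFactory.jsonBuilder()
	                .startObject()
	                .field("file", b64) // the attachment-typed field expects base64 text
	                .endObject())
	        .get();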

In the PDF there are some diagrams. If I use encodeToString(byte[]), I get an error like this:

bulk has failure: failure in bulk execution:
[0]: index [tech_pdf0], type [pdf], id [ptv855.pdf], message [MapperParsingException[Failed to extract [-1] characters of text for [null] : Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@4f3926fb]; nested: NotSerializableExceptionWrapper[tika_exception: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@4f3926fb]; nested: NotSerializableExceptionWrapper[runtime_exception: java.io.IOException: Value is not an integer: 38567420115579128561790115]; nested: NotSerializableExceptionWrapper[i_o_exception: Value is not an integer: 38567420115579128561790115];]

in Base64.java:

	@SuppressWarnings("deprecation")
	public String encodeToString(byte[] src) {
	    byte[] encoded = encode(src);
	    // this String(byte[], int, int, int) constructor is deprecated
	    return new String(encoded, 0, 0, encoded.length);
	}

(That constructor is deprecated because it does not properly convert bytes into characters in general; with a hibyte of 0 and base64's pure-ASCII output, it should still produce the correct string.)

So I changed to encode(byte[]); it was able to index without failure, but the index size blows up by about 3 to 4 times for my documents.

You might be right; not an Elasticsearch issue, but a Tika issue...
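One possible explanation, though it is only a guess and not confirmed in this thread: when the Java client serializes the document, a field set from a raw byte[] is base64-encoded again by the JSON generator, so passing the output of encode() as bytes would double-encode the content. A sketch of where that would happen:

	// org.elasticsearch.common.xcontent.XContentBuilder / XContentFactory
	byte[] alreadyEncoded = Base64.getEncoder().encode(bytes); // byte[], not String
	XContentBuilder builder = XContentFactory.jsonBuilder()
	        .startObject()
	        .field("file", alreadyEncoded) // field(String, byte[]) writes the value as base64
	        .endObject();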

The encoded content sizes are the same:

	// IOUtils is org.apache.commons.io.IOUtils; Base64 is java.util.Base64
	byte[] bytes = IOUtils.toByteArray(content);
	byte[] encoded = Base64.getEncoder().encode(bytes);
	logger.info("encoded: " + encoded.length);
	logger.info("string size " + Base64.getEncoder().encodeToString(bytes).length());

encoded: 5900292
string size 5900292

encoded: 5903112
string size 5903112

What happens if you call Base64.getEncoder().encode(bytes), then wrap that in a String yourself: new String(encoded, StandardCharsets.ISO_8859_1)? See the Javadocs here.
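That wrapping would look like this (a minimal sketch; base64 output is pure ASCII, so ISO-8859-1 maps each byte to the identical char):

	// java.nio.charset.StandardCharsets, java.util.Base64
	byte[] encoded = Base64.getEncoder().encode(bytes);
	String s = new String(encoded, StandardCharsets.ISO_8859_1); // 1:1 byte-to-char for ASCII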

Thought about that, but the docs say it's the same. I will try and update here when I am back! Appreciate your help.

If I use new String(encoded, StandardCharsets.ISO_8859_1), I see errors like this:

[0]: index [tech_pdf0], type [pdf], id [AVehT8YU7ThCmg0aRx0r], message [MapperParsingException[Failed to extract [-1] characters of text for [null] : Unable to extract PDF content]; nested: NotSerializableExceptionWrapper[tika_exception: Unable to extract PDF content]; nested: NotSerializableExceptionWrapper[i_o_exception: null]; nested: NotSerializableExceptionWrapper[data_format_exception: invalid block type];]