Hi all,
I'm looking at incorporating ES into our environment so that we can
search some large databases that simply don't do well with standard SQL
queries. First, I just want to say I am very impressed with ES so far.
Great stuff.
One of the big requirements I need to test out is indexing not just of
plain text but of attachments - usually standard business docs like
docx, html, pdf, xls, etc. After getting the attachment plugin installed
I am seeing two issues, both of which I can reproduce from a fresh index
and JVM.
- Image files (GIFs in particular in my testing) seem to cause problems
in replication between nodes. The document is indexed on the node it was
posted to (or perhaps the node holding the primary shard? My mental
picture of the clustering side of things isn't entirely formed yet) and
shows up fine if I pull up that id there, but if I try to retrieve it
from the other node I get a 404 error. On the node I posted the data to
I get this in the log:
[2011-01-29 17:20:26,063][WARN ][action.index ] [Rafferty]
Failed to perform indices/index/shard/index on replica Index Shard
[tickets][2]
org.elasticsearch.transport.RemoteTransportException: [Nathaniel
Essex][inet[/10.140.20.168:9300]][indices/index/shard/index/replica]
Caused by: java.lang.NoClassDefFoundError: Could not initialize class
sun.java2d.Disposer
    at javax.imageio.stream.FileCacheImageInputStream.<init>(FileCacheImageInputStream.java:94)
    at com.sun.imageio.spi.InputStreamImageInputStreamSpi.createInputStreamInstance(InputStreamImageInputStreamSpi.java:51)
    at javax.imageio.ImageIO.createImageInputStream(ImageIO.java:331)
    at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:72)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.tika.Tika.parseToString(Tika.java:290)
    at org.elasticsearch.index.mapper.xcontent.AttachmentMapper.parse(AttachmentMapper.java:254)
    at org.elasticsearch.index.mapper.xcontent.ObjectMapper.serializeValue(ObjectMapper.java:377)
    at org.elasticsearch.index.mapper.xcontent.ObjectMapper.parse(ObjectMapper.java:295)
    at org.elasticsearch.index.mapper.xcontent.ObjectMapper.serializeObject(ObjectMapper.java:316)
    at org.elasticsearch.index.mapper.xcontent.ObjectMapper.serializeArray(ObjectMapper.java:360)
    at org.elasticsearch.index.mapper.xcontent.ObjectMapper.parse(ObjectMapper.java:289)
    at org.elasticsearch.index.mapper.xcontent.ObjectMapper.serializeObject(ObjectMapper.java:316)
    at org.elasticsearch.index.mapper.xcontent.ObjectMapper.serializeArray(ObjectMapper.java:360)
    at org.elasticsearch.index.mapper.xcontent.ObjectMapper.parse(ObjectMapper.java:289)
    at org.elasticsearch.index.mapper.xcontent.XContentDocumentMapper.parse(XContentDocumentMapper.java:430)
    at org.elasticsearch.index.mapper.xcontent.XContentDocumentMapper.parse(XContentDocumentMapper.java:368)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:230)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:187)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:180)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:173)
    at org.elasticsearch.transport.netty.MessageChannelHandler$3.run(MessageChannelHandler.java:195)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
[2011-01-29 17:20:26,063][WARN ][cluster.action.shard ] [Rafferty]
sending failed shard for [tickets][2], node[I6QZH35TSTiYm0Ud5EIQ3A],
[R], s[STARTED], reason [Failed to perform [indices/index/shard/index]
on replica, message [RemoteTransportException[[Nathaniel
Essex][inet[/10.140.20.168:9300]][indices/index/shard/index/replica]];
nested: NoClassDefFoundError[Could not initialize class
sun.java2d.Disposer]; ]]
On the other node I'll see this:
[2011-01-29 17:20:27,605][WARN ][cluster.action.shard ] [Nathaniel
Essex] received shard failed for [tickets][2],
node[I6QZH35TSTiYm0Ud5EIQ3A], [R], s[STARTED], reason [Failed to
perform [indices/index/shard/index] on replica, message
[RemoteTransportException[[Nathaniel
Essex][inet[/10.140.20.168:9300]][indices/index/shard/index/replica]];
nested: NoClassDefFoundError[Could not initialize class
sun.java2d.Disposer]; ]]
It looks like a class that is missing or failing to initialize, I
guess, but it is obviously a problem. It doesn't seem to happen with all
attachments. If it happens often enough, I seem to get full shard
failures where the shards go offline. I haven't been able to reproduce
that particular aspect of the problem, though, so it may have been
unrelated.
- When I encountered issue #1 I figured I could just filter out image
files since we really don't care about them anyway. Once I did this my
index build went along at a nice clip until I ran into a .doc file. The
post simply times out waiting for a server response. No errors come
back, and nothing is logged on either node (I am running logging at
DEBUG). The document is small, so it isn't transport time that is
causing the timeout, and I don't see anything else to troubleshoot with.
I can use the tika jar directly and it returns data
from the exact same document without issue in a very reasonable amount
of time, maybe 5 seconds including JVM startup. It doesn't seem to
happen with all .doc files, which is odd. FWIW, here is what the file
command tells me about the document in question:
CDF V2 Document, Little Endian, Os: Windows, Version 6.0
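Going back to issue #1 for a moment: the trace fails inside
javax.imageio.ImageIO.createImageInputStream, before Tika gets to do any
real parsing. Below is a minimal sketch of a standalone check that could
be run on each node, with the same JVM that runs ES, to see whether the
failure reproduces outside ES. The class and method names (ImageIoCheck,
tryCreateImageStream) are made up for illustration, and printing the
java.awt.headless setting is only an assumption on my part that the
graphics environment might be involved:

```java
import java.awt.GraphicsEnvironment;
import java.io.ByteArrayInputStream;
import javax.imageio.ImageIO;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper, not part of ES or Tika.
class ImageIoCheck {

    /** Returns null if an image input stream could be created, else a description of the failure. */
    static String tryCreateImageStream() {
        try {
            // Same entry point as in the stack trace; the bytes do not need to be a valid image.
            ImageInputStream in = ImageIO.createImageInputStream(
                    new ByteArrayInputStream(new byte[] {'G', 'I', 'F'}));
            if (in == null) {
                return "no ImageInputStream provider found";
            }
            in.close();
            return null;
        } catch (Throwable t) {
            // Throwable, so that errors such as NoClassDefFoundError are reported too.
            return t.toString();
        }
    }

    public static void main(String[] args) {
        // Assumption: the headless setting may matter, so print it for comparison between nodes.
        System.out.println("java.awt.headless = " + System.getProperty("java.awt.headless"));
        System.out.println("isHeadless = " + GraphicsEnvironment.isHeadless());
        String failure = tryCreateImageStream();
        System.out.println(failure == null
                ? "ImageIO stream OK"
                : "ImageIO stream FAILED: " + failure);
    }
}
```

If this fails on one node but not the other, the difference is more
likely in the JVM or OS setup of that node than in ES or the plugin.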
If there's any more info that would assist let me know and I will be
happy to provide it. I am going to do another pass and log the files
that cause the timeout and see if I can find any more of a pattern to it.
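For that pass, here is a rough sketch of the kind of harness I have in
mind: run each indexing call with a time limit and collect the names of
the files that do not finish. The names here (TimeoutLogger,
findTimeouts, sleeper) are hypothetical, and the sleeper tasks are
stand-ins for the real HTTP post to ES, which is not shown:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical harness, not part of ES.
class TimeoutLogger {

    /** Runs each named task with a time limit and returns the names that did not finish in time. */
    static List<String> findTimeouts(Map<String, Callable<Void>> tasks, long timeoutMillis) {
        List<String> timedOut = new ArrayList<String>();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (Map.Entry<String, Callable<Void>> entry : tasks.entrySet()) {
                Future<Void> result = pool.submit(entry.getValue());
                try {
                    result.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    result.cancel(true);           // interrupt the stuck task
                    timedOut.add(entry.getKey());  // remember which file hung
                } catch (Exception e) {
                    // The task failed outright; report it but keep going.
                    System.err.println(entry.getKey() + " failed: " + e);
                }
            }
        } finally {
            pool.shutdownNow();
        }
        return timedOut;
    }

    /** Stand-in for an indexing request: just sleeps for the given time. */
    static Callable<Void> sleeper(final long millis) {
        return new Callable<Void>() {
            public Void call() throws Exception {
                Thread.sleep(millis);
                return null;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, Callable<Void>> tasks = new LinkedHashMap<String, Callable<Void>>();
        tasks.put("fast.doc", sleeper(10));
        tasks.put("slow.doc", sleeper(5000));
        System.out.println("timed out: " + findTimeouts(tasks, 500));
    }
}
```

In the real run each task would post one file to ES, and the resulting
list of names is what I would compare to look for a pattern.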
Thanks!
Travis