Defining a useful mapping for indexing Office docs on network drives


(Rian Stockbower) #1

I'm indexing MS Office documents, PDFs, and plain text files on a network
share drive. In small-scale testing, I've successfully used the fsriver and
mapper-attachments plugins to index and search the files. The mapping was
created automatically when I created the index and pointed fsriver at the
directory of test data.

I'm pretty sure I need a more robust mapping to scale up and support
faceted search (at some point in the future). The attributes I'm interested
in are these:

  • UNC path to file
  • file name
  • parent directory of file
  • file type[1]
  • last modified timestamp
  • file contents[2]
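
Most of the path-based attributes in this list can be derived from the UNC path alone. A minimal sketch, assuming Python's pathlib; the server and share names are hypothetical, and the field names are only one possible choice:

```python
from pathlib import PureWindowsPath


def path_attributes(unc_path: str) -> dict:
    """Derive the path-based attributes from a UNC path string."""
    p = PureWindowsPath(unc_path)
    return {
        "unc_path": str(p),
        "file_name": p.name,
        "parent_directory": str(p.parent),
        # Lower-case extension without the leading dot, e.g. "xlsx".
        "file_extension": p.suffix.lower().lstrip("."),
    }


# Hypothetical example path; no file system access is needed.
attrs = path_attributes(r"\\fileserver\share\finance\Budget.XLSX")
print(attrs)
```

The last modified timestamp and the file contents have to come from the file system itself, so they are left out here.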

I think the UNC path to the file (or a hash thereof) should serve as the
ID of the document in the index. That should make it easy to re-index
individual documents, ideally with O(1) lookup characteristics?
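
A hashed path gives a fixed-length ID with no slashes or backslashes in it. A minimal sketch, assuming SHA-1 is acceptable and that paths should compare case-insensitively (both are assumptions, not requirements of Elasticsearch):

```python
import hashlib


def document_id(unc_path: str) -> str:
    """Return a stable hex ID for a UNC path."""
    # Assumption: treat forward and back slashes alike and ignore case,
    # since Windows shares are normally case-insensitive.
    normalized = unc_path.replace("/", "\\").lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Hypothetical paths: the same file spelled two ways gets the same ID.
id_a = document_id(r"\\fileserver\share\finance\Budget.xlsx")
id_b = document_id(r"\\FILESERVER\share\finance\budget.XLSX")
print(id_a == id_b)
```

Because the ID can be recomputed from the path at any time, the watcher never needs to search for a document before re-indexing or deleting it.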

[1] If I want useful faceted search later, I think I need something like a
"spreadsheet" type that corresponds to the xls, xlsx, etc. file extensions.
Is that reasonable?
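
The extension-to-category idea can be sketched as a plain lookup table. The category names and the set of extensions below are hypothetical; only "spreadsheet" for xls and xlsx comes from the question itself:

```python
# Hypothetical mapping from file extension to a coarse, facet-friendly type.
FILE_TYPES = {
    "xls": "spreadsheet",
    "xlsx": "spreadsheet",
    "doc": "document",
    "docx": "document",
    "ppt": "presentation",
    "pptx": "presentation",
    "pdf": "pdf",
    "txt": "text",
}


def file_type(extension: str) -> str:
    """Map an extension such as '.XLSX' or 'xlsx' to a coarse file type."""
    return FILE_TYPES.get(extension.lower().lstrip("."), "other")


print(file_type(".XLSX"))
```

Storing the result in its own field keeps the raw extension available while giving facets a small, stable set of values.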

[2] mapper-attachments + fsriver stored the file contents as a
base64-encoded string, and the index took up ~50MB on disk -- nearly the
same size as the documents themselves. That doesn't seem very scalable to
me. Do I need to extract the text in some way, instead of naively encoding
the entire binary file? (A problem to be solved later, perhaps.)

I think I need an index that looks something like this:

{
  "settings" : {...},
  "mappings" : {
    "xlsx" : {
      "unc_path" : "",
      "file_name" : "",
      "parent_directory" : "",
      "file_extension" : "",
      "file_type" : "spreadsheet",
      "last_modified" : "",
      "file_contents" : ""
    },
    "xls" : {

    }
  }
}

I feel like I could create some general types that would have certain
characteristics. (Perhaps these would make good separate indices so I could
call localhost:9200/docs/spreadsheet/_search?q=whatever ?)
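
An alternative to one type per extension is a single type with a file_type field that is not analyzed, so it can be faceted on directly. A minimal sketch, assuming the Elasticsearch 1.x-era mapping syntax (string fields with "index": "not_analyzed"; later versions use different field types). The type name "file" is hypothetical, and the field names follow the outline above:

```python
import json


def exact(field_type: str = "string") -> dict:
    """Field stored as a single exact term (good for IDs and facets)."""
    return {"type": field_type, "index": "not_analyzed"}


# Body for creating the index; settings are omitted on purpose.
index_body = {
    "mappings": {
        "file": {  # hypothetical single type for every kind of document
            "properties": {
                "unc_path": exact(),
                "file_name": {"type": "string"},
                "parent_directory": exact(),
                "file_extension": exact(),
                "file_type": exact(),
                "last_modified": {"type": "date"},
                "file_contents": {"type": "string"},
            }
        }
    }
}

print(json.dumps(index_body, indent=2))
```

With this layout a search restricted to spreadsheets becomes a filter on file_type rather than a separate type or index.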

--

When adding a document to the index (or re-indexing it), I'd like to
reference it specifically by id. (Each document's UNC path is unique.) I
have no idea how to do that.
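
In the REST API, a document is addressed as /index/type/id, and a PUT to that URL creates the document or replaces the existing one with the same ID. A minimal sketch that only builds the request and does not send it; the index name "docs" is taken from the URL above, while the type name "file" and the ID value are hypothetical:

```python
import json
from urllib.parse import quote
from urllib.request import Request


def index_request(base_url: str, index: str, doc_type: str,
                  doc_id: str, doc: dict) -> Request:
    """Build a PUT request that indexes `doc` under an explicit ID."""
    # Percent-encode the ID so slashes in a path-like ID stay in one segment.
    url = "{}/{}/{}/{}".format(
        base_url.rstrip("/"), index, doc_type, quote(doc_id, safe="")
    )
    return Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = index_request(
    "http://localhost:9200", "docs", "file",
    "finance/Budget.xlsx", {"file_name": "Budget.xlsx"},
)
print(req.get_method(), req.full_url)
```

Sending the request (for example with urllib.request.urlopen) is left out so the sketch runs without a cluster. Client libraries such as NEST wrap the same operation in their own API.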

Perhaps these things are documented somewhere, but I haven't found good
examples. Most of the ES docs use Twitter for their examples, which isn't
very helpful because the mapping between Twitter and my use case is unclear
to me, probably because I haven't grokked the gestalt of Elasticsearch.

More context, probably not necessary:
The documents I'm most interested in (for now) are MS Office documents,
PDFs, and plain text files. I'm using the mapper-attachments plugin, and
combined with the fsriver plugin, I have gotten useful search results in
small-scale testing.

Taking this outside a small sandbox is where I'm having some trouble: the
idea is to use ES to index the contents of our network share drives.
There are about a million directories and 10-15 million individual
documents, consuming 20-30 TB of space. Walking the directory trees (which
is what fsriver does) won't scale very well, so I wrote a utility to watch
for CRUD operations on the file types I'm interested in. This is working
fine. I think I can combine this utility with NEST in a pretty elegant way
if I can get the indices set up and the mappings done correctly.
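
One way to connect a change watcher to Elasticsearch is to translate each file event into a bulk API action: an "index" action for created or modified files and a "delete" action for removed ones. A minimal sketch in Python rather than NEST; the event names, the (operation, id, document) tuple shape, and the index and type names are all hypothetical:

```python
import json


def bulk_body(events, index="docs", doc_type="file") -> str:
    """Turn (operation, doc_id, doc) events into a bulk API request body."""
    lines = []
    for op, doc_id, doc in events:
        meta = {"_index": index, "_type": doc_type, "_id": doc_id}
        if op in ("created", "modified"):
            # Indexing with an existing ID replaces the old document.
            lines.append(json.dumps({"index": meta}))
            lines.append(json.dumps(doc))
        elif op == "deleted":
            lines.append(json.dumps({"delete": meta}))
        else:
            raise ValueError("unknown operation: {!r}".format(op))
    # The bulk format is newline-delimited and must end with a newline.
    return "\n".join(lines) + "\n"


events = [
    ("created", "a1", {"file_name": "a.xlsx"}),
    ("deleted", "b2", None),
]
body = bulk_body(events)
print(body)
```

Batching events this way keeps the number of HTTP round trips low when many files change at once.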

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Ralf Schmitt) #2

Rian Stockbower <rstockbower@gmail.com> writes:

> I'm indexing MS Office, PDFs, and plain text files on a network share
> drive. In small scale testing, I've successfully used the fsriver and
> mapper-attachments plugin to index and search the files. The mapping was
> created automatically when I pointed fsriver at the directory of test data
> when I created the index.

Hi Rian,

you may want to take a look at es-nozzle [1][2], which will also allow
you to index documents from a network share and does so in a scalable
way. es-nozzle is open source.

> I'm pretty sure I need a more robust mapping to scale up and support
> faceted search (at some point in the future). The attributes I'm interested
> in are these:
>
>   • UNC path to file
>   • file name
>   • parent directory of file
>   • file type[1]
>   • last modified timestamp
>   • file contents[2]
>
> I think the UNC path to the file (or a hash thereof) should serve as the
> index of the file. This should make it easy to re-index specific documents
> individually in the index(?), with O(1) lookup characteristics?

es-nozzle uses the file path with the root directory stripped as _id.
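
The idea of a root-relative path as _id can be sketched as follows. This is not es-nozzle's code, only an illustration using Python's pathlib with a hypothetical server, share, and file:

```python
from pathlib import PureWindowsPath


def relative_id(unc_path: str, root: str) -> str:
    """Return the path below `root`, with forward slashes, for use as an ID."""
    relative = PureWindowsPath(unc_path).relative_to(PureWindowsPath(root))
    return relative.as_posix()


doc_id = relative_id(
    r"\\fileserver\share\finance\Budget.xlsx",  # hypothetical file
    r"\\fileserver\share",                      # hypothetical root
)
print(doc_id)
```

relative_to raises ValueError when the file is not below the given root, which is a useful guard against mixing up shares.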

> [2] mapper-attachments + fsriver stored the file contents as a
> base64-encoded string, and the index took up ~50MB on disk -- nearly the
> same size as the documents themselves. That doesn't seem very scalable to
> me. Do I need to extract the text in some way, instead of naively encoding
> the binary artifacts as well? (A problem to be solved later, perhaps.)

es-nozzle doesn't use the mapper-attachments plugin. It uses Tika to
extract text and stores only the extracted text in Elasticsearch. The
mapper-attachments plugin also uses Tika, so the extracted text should be
the same.
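
The difference between the two approaches can be sketched like this. The extractor is a stand-in passed in as a function (in practice it would be Tika or similar), and both field names are hypothetical:

```python
import base64


def attachment_document(raw: bytes) -> dict:
    """Attachment style: ship the whole file to the server as base64."""
    return {"file": base64.b64encode(raw).decode("ascii")}


def text_document(raw: bytes, extract_text) -> dict:
    """Extract on the client and send only the resulting text."""
    return {"file_contents": extract_text(raw)}


# Stand-in extractor for plain text files only; real formats need a
# real extractor such as Tika.
def decode_plain_text(raw: bytes) -> str:
    return raw.decode("utf-8")


raw = b"quarterly budget notes\n" * 100
att = attachment_document(raw)
txt = text_document(raw, decode_plain_text)
print(len(raw), len(att["file"]), len(txt["file_contents"]))
```

Base64 makes the payload roughly a third larger than the original bytes, while extracted text from binary formats is usually much smaller than the file it came from.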

> Taking this outside a small sandbox is where I'm having some trouble: the
> idea is to use ES to index the contents of our network share drives.
> There's about a million directories, 10-15 million individual documents,
> consuming 20-30TB space. Walking the directory trees (which is what fsriver
> does) won't scale very well, so I wrote a utility to watch for CRUD
> operations on the filetypes I'm interested in. This is working fine. I
> think I can combine this utility with NEST in a pretty elegant way if I can
> get the indices set up, and the mappings done correctly.

es-nozzle's predecessor has been used to index around 30 million
documents into brainbot's proprietary search engine. es-nozzle is based
on the same design, so it should be able to handle your requirements.

Please give it a try and let us know about any problems you run into!

[1] https://github.com/brainbot-com/es-nozzle
[2] http://brainbot.com/es-nozzle/doc/

--
Cheers
Ralf


