Common lines for multiple files

Effash · July 22, 2014, 8:52am

Hello everyone,

I have a lot of files, each with a lot of short lines (~45000 per file). Each
line consists of a keyword and some additional data.
I store each file and its metadata as an object: {index: "default", _type:
"file", id: filename, _source: {various metadata} }
I store each line as a child of its file:
body_mapping = {"line": {
    "_parent": {
        "type": "file"
    }
}
}

{"_index": "default",
 "_type": "line",
 "_id": line_number,
 "_parent": filename,
 "_source": {"keyword": keyword,
             "metadata"}
}
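
To make the indexing side concrete, here is a rough sketch of how such per-line documents could be generated. The helper name `line_actions`, the file name `file1.txt`, the single "metadata" field, and the assumption that each line is a keyword followed by a space and the rest of the data are all made up for illustration; the real line layout may differ. The dicts use the action format accepted by `elasticsearch.helpers.bulk`.

```python
def line_actions(filename, lines, index="default"):
    """Yield one bulk action per line, shaped like the document above.

    Assumption: each line is "<keyword> <rest of line>", split on the
    first space. The "metadata" field name is a placeholder.
    """
    for line_number, line in enumerate(lines, start=1):
        keyword, _, rest = line.strip().partition(" ")
        yield {
            "_index": index,
            "_type": "line",
            "_id": line_number,
            "_parent": filename,  # ties the line to its parent "file" document
            "_source": {"keyword": keyword, "metadata": rest.strip()},
        }


# Hypothetical two-line file, just to show the shape of the output.
actions = list(line_actions("file1.txt", ["foo 1 2", "bar x"]))
```

The resulting list (or the generator itself) could then be handed to `elasticsearch.helpers.bulk(es, actions)`, where `es` is an `Elasticsearch` client instance.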

My goal is to search across all my files by keyword:
{"query":
    {"query_string":
        {"query": keyword,
         "fields": ["keyword"]
        }
    }
}
But there is more to it: I want to search for a whole set of keywords taken
from a given file (all the lines of an existing file or of a new one) and
group the results by filename.
For example, the result would be:
{filename1: [{keyword: my_search_keyword, metadata_for_this_keyword_in_file1, _id: line_number},
             {keyword: my_search_keyword, metadata_for_this_keyword_in_file1, _id: line_number}, ...],
 filename2: [{keyword: my_search_keyword, metadata_for_this_keyword_in_file2, _id: line_number},
             {keyword: my_search_keyword, metadata_for_this_keyword_in_file2, _id: line_number}, ...],
 filename5: [{keyword: my_search_keyword, metadata_for_this_keyword_in_file5, _id: line_number},
             {keyword: my_search_keyword, metadata_for_this_keyword_in_file5, _id: line_number}, ...],
}
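
As a sketch of the client-side grouping this implies, the hits from several search responses could be folded into that shape as follows. `parent_of` and `group_by_filename` are hypothetical helpers, and where the parent id shows up in a hit is an assumption: depending on the Elasticsearch version and the fields requested, it may be a top-level `_parent` key or sit under `fields`. The sample response is invented.

```python
from collections import defaultdict


def parent_of(hit):
    """Return the parent file name of a "line" hit.

    Assumption: the parent id is either a top-level "_parent" key or
    under hit["fields"]["_parent"], possibly wrapped in a list.
    """
    if "_parent" in hit:
        return hit["_parent"]
    value = hit.get("fields", {}).get("_parent")
    return value[0] if isinstance(value, list) else value


def group_by_filename(responses):
    """Collect hits from several search responses into {filename: [lines]}."""
    grouped = defaultdict(list)
    for response in responses:
        for hit in response.get("hits", {}).get("hits", []):
            entry = dict(hit.get("_source", {}))  # keyword plus metadata
            entry["_id"] = hit["_id"]             # the line number
            grouped[parent_of(hit)].append(entry)
    return dict(grouped)


# Invented sample: one response with hits from two different files.
sample = [{"hits": {"hits": [
    {"_id": "3", "_parent": "file1", "_source": {"keyword": "foo"}},
    {"_id": "7", "fields": {"_parent": ["file2"]}, "_source": {"keyword": "foo"}},
    {"_id": "9", "_parent": "file1", "_source": {"keyword": "foo"}},
]}}]
result = group_by_filename(sample)
```

With `es.msearch`, the list to pass in would be the `"responses"` entry of the returned dict.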

Important point: There are a lot of collisions, keyword-wise.

At the moment I am using elasticsearch-py with the es.msearch function,
sending the query shown above once per keyword. However, this is quite slow,
so I suspect that my object design, my mapping, or my search strategy is wrong.
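
For reference, a minimal sketch of how the msearch request body is put together for a list of keywords. `build_msearch_body` is a hypothetical helper; the index name, type, and query come from the description above, and each search is a header dict followed by a body dict, which is the layout the multi search API expects.

```python
def build_msearch_body(keywords, index="default", doc_type="line"):
    """Build an msearch body: one header plus one query per keyword."""
    body = []
    for keyword in keywords:
        # Header: which index and type this individual search targets.
        body.append({"index": index, "type": doc_type})
        # Search body: the query_string query on the "keyword" field.
        body.append({
            "query": {
                "query_string": {"query": keyword, "fields": ["keyword"]}
            }
        })
    return body


# Hypothetical keywords, standing in for the lines of the input file.
body = build_msearch_body(["foo", "bar"])
# The list could then be sent with es.msearch(body=body).
```

With ~45000 keywords per file, this produces as many individual searches in a single request, which is worth keeping in mind when looking at the timings.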

Would you have any insight to offer? Thanks a lot!
