Hi everybody,
I want to extract URLs from my PDF files. I use the mapper-attachment
plugin to index them.
To be able to run regexp queries and extract all the URLs present in a
PDF file, I used the uax_url_email tokenizer:
curl -X PUT "localhost:9200/test" -d '{
  "settings" : {
    "index": {
      "analysis" : {
        "analyzer": {
          "default": {
            "type" : "custom",
            "tokenizer" : "uax_url_email",
            "filter" : ["standard", "lowercase", "stop"]
          }
        }
      }
    }
  }
}'
and the mapping:
curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{
  "attachment" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "fields" : {
          "title" : { "store" : "yes" },
          "file" : { "term_vector" : "with_positions_offsets",
                     "store" : "yes" }
        }
      }
    }
  }
}'
I indexed some PDF files. The problem is that for one file I get this (even
though the URLs in this file start with http://):
https://lh3.googleusercontent.com/-6uzhp-v0qFs/VOSfMU95byI/AAAAAAAAAUc/H4c6xvb54kg/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.17.19.png
For another file, I got this (it leaves the http://):
https://lh3.googleusercontent.com/-1rYIYWJJEbU/VOSfweFpgbI/AAAAAAAAAUk/bWzfst_uZUE/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.19.43.png
But the problem is that the URLs are not recognized completely; look at this:
Is it caused by the two-column layout of the PDF file?
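One way to separate the two possible causes would be to run a sample URL through the analyzer directly with the _analyze API, bypassing the PDF extraction entirely. This is only a minimal sketch: it assumes an Elasticsearch 1.x node on localhost:9200 (matching the curl syntax used above), and the sample sentence and URL are made up for illustration, not taken from the PDFs.

```shell
# Run a hypothetical sample sentence through the "default" analyzer
# defined in the settings of the "test" index, and pretty-print the tokens.
# Assumption: Elasticsearch 1.x, where the text is sent as the raw body.
curl -X GET "localhost:9200/test/_analyze?analyzer=default&pretty" \
  -d 'See http://www.example.com/path/page.html for details'
```

If the whole URL comes back as a single token here, the tokenizer is probably doing its job and the truncated URLs more likely come from the text extracted from the PDF (for example a line break in the middle of a URL). Newer Elasticsearch versions expect a JSON request body for this call instead of query parameters.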
So, what did I do wrong? How can I fix this and use regexp queries
successfully to extract all the URLs?
Thank you