Problem with using uax_url_email

Hi everybody,

I want to extract URLs from my PDF files. I use the mapper-attachments
plugin to index the PDF files.

In order to be able to run regexp queries and extract all the URLs
present in a PDF file, I used the uax_url_email tokenizer:

curl -X PUT "localhost:9200/test" -d '{

"settings" : {

"index": {
  "analysis" :{
    "analyzer": {
      "default": {
        "type" : "custom",
        "tokenizer" : "uax_url_email",
        "filter" : ["standard", "lowercase", "stop"]
      }
    }
  }
}

}

}'
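
To check that the tokenizer really keeps URLs and e-mail addresses as single tokens, I first run the analyzer on a sample string (a minimal sketch; I'm assuming the _analyze API of this version accepts the text as the request body):

curl -X GET "localhost:9200/test/_analyze?analyzer=default" -d 'See http://example.com/some/page?x=1 or write to me@example.com'

With uax_url_email, the URL and the address should each come back as one token instead of being split on / and @.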

and the mapping:

curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{

"attachment" : {

"properties" : {
  "file" : {
    "type" : "attachment",
    "fields" : {
      "title" : { "store" : "yes" },
      "file" : { "term_vector":"with_positions_offsets", 

"store":"yes" }

    }
  }
}

}
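
For reference, this is the kind of regexp query I want to run later to pull out URL tokens (just a sketch: the pattern uses Lucene's regexp syntax, it is lowercase because of the lowercase filter, and depending on the plugin version the content field may have to be addressed as file.file instead of file):

curl -X POST "localhost:9200/test/attachment/_search" -d '{
  "query" : {
    "regexp" : {
      "file" : "https?://.*"
    }
  }
}'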

I indexed some PDF files. The problem is that for one file I get this (even
though the URLs in this file start with http://):

https://lh3.googleusercontent.com/-6uzhp-v0qFs/VOSfMU95byI/AAAAAAAAAUc/H4c6xvb54kg/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.17.19.png
For another file, I get this (it keeps the http:// this time):

https://lh3.googleusercontent.com/-1rYIYWJJEbU/VOSfweFpgbI/AAAAAAAAAUk/bWzfst_uZUE/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.19.43.png
But the real problem is that the URLs are not recognized completely; look at this:

https://lh3.googleusercontent.com/-vsKUj5I9MiA/VOSgtyS3yWI/AAAAAAAAAUw/64lgO4gYSdI/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.22.32.png

Is it caused by the two-column layout of the PDF file?

https://lh4.googleusercontent.com/-c7n5-oMygRM/VOShm4hwnWI/AAAAAAAAAU4/CQNjTTctMnY/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.26.46.png

So, what did I do wrong? How can I fix this and use regexp queries
successfully to extract all the URLs?

Thank you


Hi,

For people having the same problem as me, here is the answer I received from
Pablo on the PT group:

About your problem: I believe this is a constraint of Apache Tika [1],
which is used by the mapper-attachments plugin.
I believe that a search on Tika's PDF limitations, or a question on their
list, will help you more than we can.
Anyway, maybe you want to ask on the main Elasticsearch list [2], which is
bigger than ours and where the Elasticsearch engineers are.

I am sorry for not being able to help you that much.

Cheers,
Pablo

[1] http://tika.apache.org/
[2] elasti...@googlegroups.com
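
To check whether the URLs are already broken at extraction time (before Elasticsearch sees them), one can run Tika directly on the PDF and inspect the plain-text output (a sketch; the jar and file names below are placeholders for whatever tika-app release and document you have):

# Dump the plain text Tika extracts from the PDF and look for URL fragments
java -jar tika-app-1.7.jar --text mydocument.pdf | grep -i http

If the URLs already come out split across lines here, the issue is in the Tika extraction (for example the two-column layout), not in the uax_url_email tokenizer.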
