The ES documentation here states this about the standard tokenizer:
The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.
Where can I find the definition of the exact set of punctuation symbols that get removed (e.g. hyphen, ampersand)?
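Failing a written definition, individual characters could presumably be probed through the `_analyze` API with `"tokenizer": "standard"`. The following is only a minimal sketch: the cluster address `http://localhost:9200` and the helper names `build_analyze_request`, `surviving_tokens`, and `analyze` are assumptions made for illustration, not anything taken from the documentation.

```python
import json
import urllib.request


def build_analyze_request(text, base_url="http://localhost:9200"):
    """Build a POST request for the _analyze API using the standard tokenizer.

    base_url is an assumed local cluster address; adjust it as needed.
    """
    body = json.dumps({"tokenizer": "standard", "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def surviving_tokens(response_json):
    """Return the token strings from a parsed _analyze response body."""
    return [entry["token"] for entry in response_json.get("tokens", [])]


def analyze(text, base_url="http://localhost:9200"):
    """Send the text to the cluster and return the tokens it produced.

    This needs a reachable Elasticsearch node, so it is not called here.
    """
    request = build_analyze_request(text, base_url)
    with urllib.request.urlopen(request) as response:
        return surviving_tokens(json.load(response))
```

Calling `analyze("foo-bar & baz")` against a running node and comparing the returned tokens with the input would show whether the hyphen and the ampersand survive, but that only tests samples one at a time; it does not give the authoritative list I am looking for.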