I run a problem validation platform, https://needgap.com, where people post their problems for startups to solve, so posts often contain URLs of products which could solve a problem or other resources for discussing it.
This often leads to broken URLs because a period ".", comma "," or semicolon ";" directly follows the URL in the text. But those characters, among several others that are not commonly used, are actually valid in a URL/URI according to the RFC[1] and can appear in real URLs[2].
Common sense says I should filter these characters out: expecting users to know the RFC is absurd, and making sure the URL works should be the priority. But my engineering education is giving me depression and nightmares about someone entering a URL like [2] in the comments and my validation breaking it.
So, what do you guys do for URL validation?
• Remove the less common characters from the URL?
If so, how? Which characters do you decide shouldn't live on your platform? If it is a dot ".", how many dots in a URL are too many for you?
(or)
• Follow the RFC? If so, how do you do it?
A demonic RegEx[3], or checking whether it's a valid URL by hitting the server?
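To make the first option concrete, here is a minimal sketch in Python of one possible middle ground: strip sentence punctuation only from the very end of a candidate URL, leave every dot inside it alone, then do a loose structural check with the standard library's `urllib.parse.urlsplit` rather than a regex. The function names and the `TRAILING` character set are my own assumptions for illustration, not anything the RFC prescribes, and the check does not prove RFC 3986 conformance or that the server exists.

```python
from urllib.parse import urlsplit

# Assumption: these are legal URI characters that are far more often
# sentence punctuation when they appear at the very end of a URL.
TRAILING = ".,;:!?"


def trim_trailing_punctuation(candidate: str) -> str:
    """Strip punctuation from the end only; dots inside the URL are untouched."""
    url = candidate.rstrip(TRAILING)
    # Drop a closing parenthesis only while it has no matching opener.
    while url.endswith(")") and url.count(")") > url.count("("):
        url = url[:-1].rstrip(TRAILING)
    return url


def looks_like_web_url(url: str) -> bool:
    """Loose structural check: http(s) scheme plus a non-empty host."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:  # e.g. malformed IPv6 brackets
        return False
    return parts.scheme in ("http", "https") and bool(host)


if __name__ == "__main__":
    text_url = "http://www.example.com/module.php/lib/lib.php."
    cleaned = trim_trailing_punctuation(text_url)
    print(cleaned, looks_like_web_url(cleaned))
```

The trade-off is explicit: a URL that genuinely ends in "." or ";" would be trimmed, but a URL like [2], with several dots in the path, survives unchanged.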
If this discussion goes well, we'll do email validation[4] next time!
[1]https://tools.ietf.org/html/rfc3986
[2]http://www.example.com/module.php/lib/lib.php
[3]https://stackoverflow.com/a/190405
[4]http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html