Validate URL Regex

G&G Podcast Host
Joined: 06/01/2006

I'm working on a regular expression for URL validation. The goal is to provide a better validator that will end up in Drupal and be used to validate URLs (like the fields we fill out for profiles and getting them to work with Flickr). If anyone has a minute, here is my proposed regex.

preg_match('
  /^                                   # Start at the beginning of the text
  (ftp|http|https):\/\/                # Look for just ftp, http, or https, then the scheme separator
  (                                    # Username:password combinations (optional)
    [\w\.\-\+]+                        # A username
    :{0,1}                             # An optional colon to separate the username and password
    [\w\.\-\+]*@                       # A password
  )?
  ([A-Za-z0-9\-\.]+)                   # The domain, limited to just allowed characters
  (:[0-9]+)?                           # Server port number
  (                                    # The path (optional)
    \/|                                # A forward slash
    \/([\w#!:\.\?\+=&%@!\-\/\(\)]+)    # Or a forward slash followed by a full path
  )?
  $/xi', $url);

Any suggestions for improvement? I'm looking for something at spec or close to spec. See http://www.ietf.org/rfc/rfc2396.txt, http://tools.ietf.org/html/rfc3305, http://tools.ietf.org/html/rfc1738, and http://www.w3.org/Addressing/URL/url-spec.txt for all the gory spec details.

Matt Farina
Geeks and God Former Co-Host
www.mattfarina.com

Joined: 11/28/2008

MF, I'll test it a bit. One thing, though: if you don't plan on using any of the matches from the parenthesized sections, you should put ?: directly after each opening paren. This tells the regex engine not to store a match for that set of parens, which makes it run a little more efficiently.

/Some (text)/
If the whole matches, a second match for the word 'text' is stored

/Some (?:text)/
If the whole matches, no second match for the word 'text' is stored.
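A quick way to see the difference is to look at the matches array that preg_match fills in. This is just a minimal sketch; the variable names are made up for illustration.

```php
<?php
// Capturing group: the parenthesized word is stored as element 1.
preg_match('/Some (text)/', 'Some text', $capturing);

// Non-capturing group: only the whole match is stored.
preg_match('/Some (?:text)/', 'Some text', $nonCapturing);

print_r($capturing);    // whole match plus one captured group
print_r($nonCapturing); // whole match only
```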

Joined: 11/28/2008
Only one oddity so far

For a query string to be allowed, your regex is forcing a path to be in place

http://www.domain.com/?test=taste
Passes

http://www.domain.com?test=taste
Fails
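For anyone who wants to reproduce this, here is a rough sketch using the first proposed pattern collapsed onto one line. The function name first_draft_matches is hypothetical and only there for the demonstration.

```php
<?php
// Hypothetical helper wrapping the first proposed pattern, written on one line.
function first_draft_matches(string $url): bool {
    $pattern = '/^(ftp|http|https):\/\/([\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?'
             . '([A-Za-z0-9\-\.]+)(:[0-9]+)?'
             . '(\/|\/([\w#!:\.\?\+=&%@!\-\/\(\)]+))?$/i';
    return preg_match($pattern, $url) === 1;
}

// The query string is only accepted when a slash comes before it.
var_dump(first_draft_matches('http://www.domain.com/?test=taste'));
var_dump(first_draft_matches('http://www.domain.com?test=taste'));
```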

G&G Podcast Host
Joined: 06/01/2006
Thanks

Thanks for the feedback. Here is an updated regex.

preg_match('
  /^                                       # Start at the beginning of the text
  (?:ftp|http|https):\/\/                  # Look for ftp, http, or https
  (?:                                      # Username:password combinations (optional)
    [\w\.\-\+]+                            # A username
    :{0,1}                                 # an optional colon to separate the username and password
    [\w\.\-\+]*@                           # A password
  )?
  (?:[a-z0-9\-\.]+)                        # The domain limiting it to just allowed characters
  (?::[0-9]+)?                             # Server port number
  (?:                                      # The path (optional)
    \/|                                    # a forward slash
    \/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|     # or a forward slash followed by a full path
    \?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)      # or a question mark followed by key value pairs
  )?$
  /xi', $url);

Matt Farina
Geeks and God Former Co-Host
www.mattfarina.com

Joined: 08/17/2007
Some regex tweaks

Here's a few cents from me...

(?:ftp|http|https)
can be shortened to
(?:ftp|https?)

The same can be done for this:
:{0,1}       # an optional colon to separate the username and password
which can be shortened to
:?
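If you want to convince yourself the shorter forms behave the same, a small comparison like the one below works. It is only a sketch: the test inputs and the user/pass wrapper around the colon are made up for the check.

```php
<?php
// Each entry: long form, short form, and a few inputs to compare them on.
$pairs = [
    ['/^(?:ftp|http|https):\/\/$/', '/^(?:ftp|https?):\/\/$/',
        ['ftp://', 'http://', 'https://', 'htt://', 'ftps://']],
    ['/^user:{0,1}pass$/', '/^user:?pass$/',
        ['userpass', 'user:pass', 'user::pass']],
];

$allAgree = true;
foreach ($pairs as [$long, $short, $inputs]) {
    foreach ($inputs as $input) {
        $a = preg_match($long, $input);
        $b = preg_match($short, $input);
        $allAgree = $allAgree && ($a === $b);
        printf("%-10s long=%d short=%d\n", $input, $a, $b);
    }
}
var_dump($allAgree);
```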

A couple of resources I rely heavily on - http://www.regular-expressions.info/tutorial.html and https://addons.mozilla.org/en-US/firefox/addon/2077

G&G Podcast Host
Joined: 06/01/2006
Thanks

Thanks for the firefox add on and tips. I'd like to get this regex worked out before the end of the year (if I have time).

Matt Farina
Geeks and God Former Co-Host
www.mattfarina.com

Joined: 08/17/2007
RE: Validate URL Regex

Try this out...

preg_match('
/^
(?:ftp|https?):\/\/                     # changed here
(?:
[\w\.\-\+]+
:?                                      # changed here
[\w\.\-\+]*@)?
(?:[a-z0-9\-\.]+)
(?::[0-9]+)?
(?:
\/+[\w#!:\.\?\+=&%@!\-\/\(\)]*|         # changed here
\?[\w#!:\.\?\+=&%@!\-\/\(\)]+)?$
/xi', $url);

G&G Podcast Host
Joined: 06/01/2006
More Complicated

Thanks for the help and feedback. I'll take a look at this in the next couple weeks when I'm not so busy. I've learned that there are more specs to take into account and more characters to allow... http://www.ietf.org/rfc/rfc3986.txt and maybe http://www.whatwg.org/specs/web-apps/current-work/...

It would be so nice if there were one spec to rule them all.

Matt Farina
Geeks and God Former Co-Host
www.mattfarina.com

G&G Podcast Host
Joined: 06/01/2006
valid_url regex update

Here is an update to my valid_url regular expression.

preg_match("
  /^                                                # Start at the beginning of the text
  (?:ftp|https?):\/\/                               # Look for ftp, http, or https
  (?:                                               # Userinfo (optional)
    (?:[\w\.\-\+%!$&'\(\)*,;=]+:)*
    [\w\.\-\+%!$&'\(\)*,;=]+@
  )?
  (?:[a-z0-9\-\.%]+)                                # The domain
  (?::[0-9]+)?                                      # Server port number (optional)
  (?:[\/\?][\w#!:\.\?\+=&%@!$'~*,;\/\(\)\[\]\-]*)?  # The path (optional)
$/xi", $url);

This is for http://drupal.org/node/124492 and to fix an issue on this site. Does anyone see a way to improve on it? The goal is for it to conform to RFC 3986.
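To try a pattern like this against a handful of URLs, something along these lines should do. It is a sketch rather than the final validator: the pattern is a condensed copy of the one in this post (with the path starting at a slash or question mark), and the sample URLs are made up.

```php
<?php
// Condensed copy of the valid_url pattern; a nowdoc avoids string escaping issues.
$pattern = <<<'REGEX'
/^
(?:ftp|https?):\/\/                                   # scheme
(?:(?:[\w.\-+%!$&'()*,;=]+:)*[\w.\-+%!$&'()*,;=]+@)?  # userinfo (optional)
[a-z0-9\-.%]+                                         # host
(?::[0-9]+)?                                          # port (optional)
(?:[\/?][\w#!:.?+=&%@$'~*,;\/()\[\]\-]*)?             # path, query, fragment (optional)
$/xi
REGEX;

// Made-up sample URLs for a quick smoke test.
$samples = [
    'http://example.com',
    'http://user:pass@example.com:8080/path?x=1#frag',
    'http://example.com?test=taste',
    'mailto:someone@example.com',
    'http://exa mple.com',
];

$results = [];
foreach ($samples as $url) {
    $results[$url] = preg_match($pattern, $url) === 1;
    printf("%-50s %s\n", $url, $results[$url] ? 'valid' : 'invalid');
}
```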

Matt Farina
Geeks and God Former Co-Host
www.mattfarina.com

Joined: 04/07/2010
There is a problem with this

Great work on this so far. I've been looking for a URL checker for something I'm currently working on.

I was testing this, and most of the issues I had with previous URL checkers were fine.

This fails on:

http://h.

My understanding is that this should fail validation. If it shouldn't, then that's fine. My URL checker function looks like this:

static function validateURL($url) {
  $url = trim($url);
  if ($url === "") { return false; }
  // Assume http:// when no supported scheme is given
  if (!preg_match("!^(?:https?://|ftp://)!i", $url)) { $url = "http://" . $url; }
  // Reject hosts that end in a dot, such as http://h.
  if (preg_match("!.*?//[a-z0-9\-\.%]+\.$!i", $url)) { return false; }
  if (!preg_match("
    /^                                                # Start at the beginning of the text
    (?:ftp|https?):\/\/                               # Look for ftp, http, or https
    (?:                                               # Userinfo (optional)
      (?:[\w\.\-\+%!$&'\(\)*,;=]+:)*
      [\w\.\-\+%!$&'\(\)*,;=]+@
    )?
    (?:[a-z0-9\-\.%]+)                                # The domain
    (?::[0-9]+)?                                      # Server port number (optional)
    (?:[\/\?][\w#!:\.\?\+=&%@!$'~*,;\/\(\)\[\]\-]*)?  # The path (optional)
  $/xi", $url)) {
    return false;
  }
  // Cross-check with PHP's built-in validator
  if (!filter_var($url, FILTER_VALIDATE_URL)) { return false; }
  return true;
}

Granted, there are probably a lot of issues with what I've got, and I'm more than happy to be told about them. The above seems to catch every case (that I can think of).

Joined: 04/07/2010
Found another failure ...

Your Regex also fails when using:

http://example.com/?s=&keyword=&url=http%3A%2F%2Fhello+world.com

I'm not sure your querystring section works properly

Joined: 07/15/2010
HTML5

Just a note: HTML5 browsers will have URL validators built into them, so this will only be necessary for security double-checks on the server, and may not need to be perfect.

BTW, according to RFC1738, the "/" is required after the hostname if there is a query string. It's weird, but according to my reading of RFC1738,

http://www.example.com?

is legal, but

http://www.example.com?value="something"

is not.

Joined: 07/19/2010
Regular expression to validate URL - javascript

Hi,

I need a regular expression (ER) to validate all possible URLs. I want to reject malformed URLs, for example: htp://, http:/www., ww.domain ...

It has to validate FTP(S) and HTTP(S) URLs. HTTP URLs can contain a query string, which should also be validated.

Until now I have the following:

var regexp = new RegExp(
	"^" +						// match the start of the URL
	"(((f|ht)tp(s)?):\\/\\/)?" +			// ftp, ftps, http and https protocols - optional
	"(www\\.)?" +					// www. - optional
	"([a-zA-Z0-9\\-]{1,}\\.)*" +			// subdomains - optional
	"(" +
		"[a-zA-Z0-9\\-]{2,}\\.[a-zA-Z0-9\\-]{2,4}" +	// first-level domain
		"(\\.[a-zA-Z0-9\\-]{2,4})?" +		// second level - optional
	")" +
	"(\\/|\\?)?" +					// / or ? to start directories and query strings - optional
	"$"
);

I'm not optimizing the size of the ER yet; the idea is to get it functional first. Once it is working, I want to optimize it.

Still missing: validating FTP login credentials and checking IP addresses. Any suggestions?

Greetings,

GuttoSP

G&G Podcast Host
Joined: 06/01/2006
html5 reliance fail!

@Arlen - There are 3 reasons why relying on html5 will fail for the near to mid future.

  1. A majority of browsers are not html5. Look at the IE6 market share. It's ancient and still one of the most widely used browsers.
  2. A lot of software that connects to sites and interacts with them is not browsers. It could be bots or people taking advantage of APIs. Those need to handle the validation as well.
  3. Security! Sometimes you need to know a url is a valid url and contains nothing malicious. Trusting users and clients is never a good thing. There will always be bad eggs who do bad things.
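As a rough illustration of the server-side check in points 2 and 3, something like this could sit behind whatever the browser does. It is only a sketch: the function name is_acceptable_url and the list of allowed schemes are assumptions for the example, and it leans on PHP's filter_var and parse_url rather than the regex from this thread.

```php
<?php
// Hypothetical helper: accept only well-formed URLs whose scheme is on an allow-list.
function is_acceptable_url(string $url, array $schemes = ['http', 'https', 'ftp']): bool {
    // First pass: PHP's built-in syntax check.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Second pass: never trust the client to send a scheme we expect.
    $scheme = parse_url($url, PHP_URL_SCHEME);
    return is_string($scheme) && in_array(strtolower($scheme), $schemes, true);
}

var_dump(is_acceptable_url('http://example.com/'));
var_dump(is_acceptable_url('javascript://example.com/%0Aalert(1)'));
var_dump(is_acceptable_url('not a url'));
```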

Matt Farina
Geeks and God Former Co-Host
www.mattfarina.com