I want to make sure people enter only URLs that I can throw into an <a> tag that will link properly. Will this do it? Any false positives or false negatives?

function is_valid_url($url) {
  // requires http://, https://, or www. up front, then a dotted host name,
  // an optional :port, and optional trailing slashes
  $pattern = "#^(http:\/\/|https:\/\/|www\.)(([A-Z0-9][A-Z0-9_-]*)(\.[A-Z0-9][A-Z0-9_-]*)+)(:(\d+))?(\/)*$#i";
  if (!preg_match($pattern, $url)) {
    return false;
  } else {
    return true;
  }
} // is_valid_url()
    10 days later

    It considers these valid URLs invalid:
    //example.com
    //example.com.
    http://example.com.

    I also recall a 63-character limit on each label, and that labels cannot begin or end with a hyphen. If so, then these are both invalid host names:
    http://-example.com
    http://example-.com
    yet your regex accepts the second (it happens to reject the first, since the pattern requires each label to start with a letter or digit).

    That said, have you considered that index.html is a valid URL, just that it is not an absolute URL, which is what people typically want to post on a forum? After all, it will still "link properly", though it may not be what people expect 😉
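
    To see these cases for yourself, here's a quick harness (assuming the is_valid_url() from your post is already defined):

    $tests = array(
        '//example.com',       // valid network-path reference, rejected
        'http://example.com.', // valid (trailing dot is the DNS root), rejected
        'http://example-.com', // invalid label, accepted
        'index.html',          // valid relative URL, rejected
    );
    foreach ($tests as $url) {
        echo $url . ' => ' . (is_valid_url($url) ? 'valid' : 'NOT valid') . "\n";
    }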

      Do you really want people to have to enter http://?

      MOST people know a URL, or copy and paste it from the browser, but as a more skilled person I'd like to be able to type in

      relatebase.com/index.php?Category=PHP

      and have it work. So, make the http:// optional.

      Also, do some parsing on the www. in the first label, to discover URLs which are equal, such as www.amazon.com and plain amazon.com - this might be of value to you.
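
      Something like this, maybe (off the top of my head, untested):

      // treat www.amazon.com and plain amazon.com as the same host
      function normalize_host($host) {
          return preg_replace('/^www\./i', '', $host);
      }
      echo normalize_host('www.amazon.com'); // amazon.com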

      I suggest you parse the entire string and return the parts as an array. See also:

      http://php.net/parse_url

      Hope that was linkable :-)
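
      For example (a rough sketch; note that parse_url() treats a scheme-less string as all path, so you'd want to prepend http:// first):

      $url = 'relatebase.com/index.php?Category=PHP';
      // parse_url() can't find the host without a scheme, so add one if missing
      if (!preg_match('#^https?://#i', $url)) {
          $url = 'http://' . $url;
      }
      print_r(parse_url($url));
      // Array ( [scheme] => http [host] => relatebase.com
      //         [path] => /index.php [query] => Category=PHP )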

        Also, do some parsing on the www. in the first label, to discover URLs which are equal, such as www.amazon.com and plain amazon.com - this might be of value to you.

        I note that it is not correct to assume that example.com and www.example.com are equivalent. They are two different host names.

          Technically, laserlight, you're right on that last point, and in some (annoying) cases one or the other won't resolve. My clients always ask me "do I put www in front of that?" so I always make sure that domain.com and www.domain.com are the same. However, I'd say that for 99.9% of all websites, www. vs. the plain domain are not built as separate sites with separate pages. I think this is a holdover from a previous era when the www. was more important and the web was more technical. Way back then you HAD to add www. to see the web pages.

          Samuel

            To clarify, my goal here is to make sure that anything that validates can be thrown directly into an <a> tag, and clicking that <a> tag will take you to the intended spot.

            I don't want people to type index.html or any other relative URL that would point to a file on my site.

            this code:

            function is_valid_url($url) {
              // requires http://, https://, or www. up front, then a dotted host name,
              // an optional :port, and optional trailing slashes
              $pattern = "#^(http:\/\/|https:\/\/|www\.)(([A-Z0-9][A-Z0-9_-]*)(\.[A-Z0-9][A-Z0-9_-]*)+)(:(\d+))?(\/)*$#i";
              if (!preg_match($pattern, $url)) {
                return false;
              } else {
                return true;
              }
            } // is_valid_url()
            
            $str = '//example.com';
            if (is_valid_url($str)) {
              echo 'YES, ' . $str . ' is a valid url';
            } else {
              echo 'NO!, ' . $str . ' is NOT a valid url';
            }
            

            produces this:

            NO!, //example.com is NOT a valid url

            which is good! I wouldn't want that to be considered valid, would I?

            I think the 63-char limit sounds good. To which sections does it apply? Is there an RFC for this?

            Also, it might be true that someone would want a URL with a filename or query parameters in it, like sfullman said.

            How would I modify the expression to make the www/http/https optional?

              here's a cool idea - have a text box w/a button next to it that says "Check URL" - then use some JavaScript (you'd have to write a little JavaScript), and when they check, have an iframe window just below the text box open up, displaying that URL - when they submit the form they know it's the page they want.
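
              Something like this, maybe (off the top of my head, untested, and the id names are made up):

              <form method="post">
              <input type="text" name="url" id="url">
              <input type="button" value="Check URL"
                onclick="document.getElementById('preview').src = document.getElementById('url').value;">
              <input type="submit" name="submit" value="submit">
              </form>
              <iframe id="preview" style="width:100%; height:300px;"></iframe>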

              Samuel

                sneakyimp wrote:

                I think the 63-char limit sounds good. To which sections does it apply? Is there an RFC for this?

                There certainly is. Two are relevant: one for URLs and one for host names.

                RFC3986
                A URI resolution implementation might use DNS, host tables, yellow pages, NetInfo, WINS, or any other system for lookup of registered names. However, a globally scoped naming system, such as DNS fully qualified domain names, is necessary for URIs intended to have global scope. URI producers should use names that conform to the DNS syntax, even when use of DNS is not immediately apparent, and should limit these names to no more than 255 characters in length. [§3.2.2]

                RFC1123
                Host software MUST handle host names of up to 63 characters and SHOULD handle host names of up to 255 characters.

                Whenever a user inputs the identity of an Internet host, it SHOULD be possible to enter either (1) a host domain name or (2) an IP address in dotted-decimal ("#.#.#.#") form. The host SHOULD check the string syntactically for a dotted-decimal number before looking it up in the Domain Name System. [§2.1]
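
                As a rough PHP sketch of those rules (each dot-separated label is 1 to 63 characters, letters and digits with interior hyphens only, and the whole name is at most 255 characters):

                function is_valid_hostname($host) {
                    if (strlen($host) > 255) {
                        return false;
                    }
                    foreach (explode('.', $host) as $label) {
                        // 1-63 chars, alphanumeric, hyphens only in the middle
                        if (!preg_match('/^[A-Z0-9]([A-Z0-9-]{0,61}[A-Z0-9])?$/i', $label)) {
                            return false;
                        }
                    }
                    return true;
                }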

                  23 days later

                  Them's some good facts, weedpacket. I'll get on this shortly.

                    Hm. I've been reading that first RFC. So COMPLICATED! The wordiness is almost like an exercise in obfuscation. It does provide a regular expression for 'breaking-down a well-formed URI reference into its components', but I'm not even sure what to do with it.

                    Would 'www' in this URL be considered 'scheme' or 'authority'?
                    www.mydomain.com

                    And forget about percent notation. I'm probably going to have to scale back my expectations here.
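
                    Dropping that Appendix B regex straight into preg_match() does seem easy enough, though (a sketch, as far as I can tell):

                    $rfc3986 = '!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!';
                    preg_match($rfc3986, 'www.mydomain.com', $m);
                    // $m[2] = scheme, $m[4] = authority, $m[5] = path
                    echo $m[5]; // www.mydomain.com -- no scheme, no authority: it's all "path"!

                    So apparently 'www' there is neither scheme nor authority - without http:// in front, the whole thing is just a path.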

                      10 months later

                      Well, it's been almost a year since I last posted here, but I improved my URL checker today. It's still not perfect, but it is improved. Here's the function in a script that lets you test URLs:

                      <?php

                      function isValidURL($url) {
                          // zero or more of http://, https://, www., or // up front (so the
                          // scheme is optional), then a dotted host, an optional :port, then
                          // any run of path/query/fragment characters
                          // (that trailing alternation could also be one character class)
                          $pattern = "#^(http:\/\/|https:\/\/|www\.|//)*(([A-Z0-9][A-Z0-9_-]*)(\.[A-Z0-9][A-Z0-9_-]*)+)(:(\d{1,5}))?([A-Z0-9_-]|\.|\/|\?|\#|=|&|%)*$#i";
                          return (bool) preg_match($pattern, $url);
                      }

                      if (isset($_POST['submit'])) {
                          $_POST['url'] = stripslashes($_POST['url']);
                          // escape the user's input before echoing it back out
                          $display = htmlspecialchars($_POST['url']);
                          if (isValidURL($_POST['url'])) {
                              echo '<div style="color:green;">' . $display . ' is valid</div>';
                          } else {
                              echo '<div style="color:red;">' . $display . ' is NOT valid</div>';
                          }
                      }
                      ?>
                      <html>
                      <body>
                      <form method="post">
                      <input type="text" name="url" value="<?= isset($_POST['url']) ? htmlspecialchars($_POST['url']) : '' ?>">
                      <input type="submit" name="submit" value="submit">
                      </form>
                      </body>
                      </html>
                      

                      You can test drive it here. Given my failure at comprehending the RFCs, I would appreciate any suggestions to improve the pattern I'm using. This part at the end is particularly ugly:

                      ([A-Z0-9_-]|\.|\/|\?|\#|=|&|%)*
