Hello,

I am currently working on a PHP script to parse a list of addresses. I need to ensure that each address has a specific format, as follows:

2 or 3 lines of text, followed by a blank line, followed by a phone number.

Basically, I am doing a search on yellowpages.com, copying and pasting the results into a form, and then filtering out the junk I don't want.

For the most part it works; however, I would like to make it better.
An example input would be:

Consider this an array where each "----" separates the values.

Business name One
address line 1
address line 2

(555) 555-1212


Business name Two
address
address

(555) 555-1212

Business name Three
address
address

(555) 555-1212

Business name Four
address
address

(555) 555-1212

Business name Five
address
address

(555) 555-1212

Business name Six
address line one

(555) 555-1212

The idea is that I want to be sure that what I have is actually an address, but I am not sure how to do it with regex.

edit by admin: do not post anything to the forum that you don't want search engines to find. You won't be able to remove it later.

Edit by me: The information posted is already available online. I retrieved it from yellowpages.com. I never post things I don't want found. (Besides that, what are you implying?)

    Looks like maybe the admin edit of your post means you need to re-edit it with some sample data so that we can see what the input data looks like (and what you want to search/filter for)?

      bump

      I placed some fake addresses in there, even though the ones I used before were freely available online.

        This might need some tweaking, but works with the test data:

        <?php
        $text = <<<EOD
        Business name One
        address line 1
        address line 2
        
        (555) 555-1212
        
        
        Business name Two
        address just 1
        
        (555) 555-1212
        
        Business name Three
        address
        address
        
        (555) 555-1212
        
        Business name Four
        address
        address
        
        (555) 555-1212
        
        This should not show up
        
        Business name Five
        address
        address
        
        (555) 555-1212
        
        Business name Six
        address line one
        
        (555) 555-1212
        EOD;
        
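        // Match 2 or 3 non-empty lines, a blank line, then a phone number.
        preg_match_all('/(?:[^\n]+\n){2,3}[ \t]*\n\(\d+\)[ \d\-]+(?=\n|$)/', $text, $matches);
        $addresses = $matches[0]; // index 0 holds the full matches
        printf("<pre>%s</pre>\n", print_r($addresses, true));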
        ?>
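
One thing that may need tweaking: the test above uses a heredoc, so every line ends in a bare "\n". Text pasted into a form and submitted by a browser normally arrives with "\r\n" line endings, and the pattern as written will not match that. A minimal sketch of normalizing the input first (normalize_newlines() is a hypothetical helper name, and $posted only simulates what a form might send):

```php
<?php
// Hypothetical helper: convert CRLF and lone CR line endings to LF.
function normalize_newlines(string $text): string
{
    return str_replace(["\r\n", "\r"], "\n", $text);
}

// Simulated form input with CRLF line endings (an assumption about the real data).
$posted = "Business name One\r\naddress line 1\r\naddress line 2\r\n\r\n(555) 555-1212\r\n\r\n"
        . "Business name Six\r\naddress line one\r\n\r\n(555) 555-1212";

$pattern = '/(?:[^\n]+\n){2,3}[ \t]*\n\(\d+\)[ \d\-]+(?=\n|$)/';

// Without normalization the blank line is "\r\n", which [ \t]*\n cannot match.
$rawCount = preg_match_all($pattern, $posted, $rawMatches);

// After normalization the same pattern finds both entries.
$cleanCount = preg_match_all($pattern, normalize_newlines($posted), $cleanMatches);

printf("raw: %d, normalized: %d\n", $rawCount, $cleanCount);
```

If your input already uses "\n" only, the helper leaves it unchanged, so it should be safe to call either way.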
        

          hmm
          that does indeed appear as though it would work...

          Could you explain the regex for me? I am familiar with PHP and Java, but I am a regex n00b.

          bows

            '/(?:[^\n]+\n){2,3}[ \t]*\n\(\d+\)[ \d\-]+(?=\n|$)/'
            
            '/         start of regex
            (?:        start of non-capturing sub-pattern
            [^\n]+     one or more of anything that is not a newline
            \n         newline
            )          end of sub-pattern
            {2,3}      match 2 or 3 occurrences of preceding sub-pattern
            [ \t]*\n   0 to any number of spaces/tabs followed by a newline
            \(\d+\)    "(" followed by 1 or more digits followed by ")"
            [ \d\-]+   1 or more occurrences of space, digit, or hyphen
            (?=        start look-ahead assertion
            \n|$       newline or end of string
            )          end of look-ahead assertion
            /'         end of regex
            

            More info: http://www.php.net/manual/en/reference.pcre.pattern.syntax.php
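
If it helps to see the breakdown above as working code, here is a rough sketch of a variant written with the x modifier, which lets the pattern carry its own comments. The named groups (name, address, phone) and the $records array are my own additions for illustration, not part of the original pattern. It accepts the same 2 or 3 lines, but treats the first line as the business name and the remaining one or two as address lines.

```php
<?php
$text = "Business name One\naddress line 1\naddress line 2\n\n(555) 555-1212\n\n"
      . "Business name Six\naddress line one\n\n(555) 555-1212";

// With the x modifier, whitespace and # comments outside character classes are ignored.
$pattern = '/
    (?<name>    [^\n]+ ) \n                   # first line, taken as the business name
    (?<address> [^\n]+ (?: \n [^\n]+ )? ) \n  # one or two address lines
    [ \t]* \n                                 # blank line, optional spaces or tabs
    (?<phone>   \(\d+\) [ \d\-]+ )            # "(digits)" then spaces, digits, hyphens
    (?= \n | $ )                              # must be followed by newline or end of string
/x';

// PREG_SET_ORDER gives one array per matched entry instead of one per group.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$records = [];
foreach ($matches as $m) {
    $records[] = [
        'name'    => $m['name'],
        'address' => explode("\n", $m['address']), // split address into its lines
        'phone'   => $m['phone'],
    ];
}

print_r($records);
```

Note that this only checks the shape of the text. Like the original pattern, it cannot tell whether a line is really a street address, and a block with more than three lines before the blank line will still match on its last three lines.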
