FOR ALL -- address parser -- please give feedback

sfullman

[P.S.] -- previously forgot to add this link, which makes the whole post stupid otherwise.

http://www.relatebase.com/development/misc/parse_address.php

OK, here's number two. Before this I put a phone # parser on this forum for general use, wanting feedback and error checking. This one is an address parser.

I haven't found anything like this out there. Additionally, there's a right and wrong strategy to reading an address and I'm not sure I've thought through all the angles here. But my strength is regex, and if anyone has any constructive contributions or changes, MAKE THEM and put a link on this thread.

I suppose I should (or someone should) look at the Post Office's standard for addressing. But realize that this function should eventually be "intelligent" so that if it's written:

125 S. Park St.

we know the parse is no doubt:
number:125
prefix-direction:S
name:Park
type:St (Street)

Whereas

125 South Park Street

should be flagged so you'd also be aware of:

prefix-direction: (none)
name:South Park
type:St (Street)
--OR--
prefix-direction: S
name: Park Street
type:(none)
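
The two cases above can be sketched in Python as a function that returns every plausible reading, so that more than one result acts as the "flag". This is only an illustration, not the code behind the link: the name `candidate_parses`, the tiny lookup tables, and the rule that a one-letter direction is unambiguous while a spelled-out one is not are all assumptions made for the example.

```python
# Hypothetical lookup tables; a real parser would need far longer lists.
DIRECTIONS = {"n": "N", "north": "N", "s": "S", "south": "S",
              "e": "E", "east": "E", "w": "W", "west": "W"}
STREET_TYPES = {"st": "St", "street": "St", "rd": "Rd", "road": "Rd",
                "ave": "Ave", "avenue": "Ave"}


def candidate_parses(address):
    """Return a list of plausible parses; more than one means 'ambiguous'."""
    tokens = address.split()
    if len(tokens) < 3 or not tokens[0].isdigit():
        return []  # non-standard input is outside this sketch
    number, rest = tokens[0], tokens[1:]
    norm = [t.rstrip(".").lower() for t in rest]

    # Pop the street type off the right-hand side, if there is one.
    has_type = norm[-1] in STREET_TYPES
    street_type = STREET_TYPES[norm[-1]] if has_type else None
    body = rest[:-1] if has_type else rest

    def parse(prefix, name_tokens, type_):
        return {"number": number,
                "prefix_direction": prefix,
                "name": " ".join(t.rstrip(".") for t in name_tokens),
                "type": type_}

    # No leading direction word (or nothing would be left for the name).
    if norm[0] not in DIRECTIONS or len(body) < 2:
        return [parse(None, body, street_type)]

    direction = DIRECTIONS[norm[0]]
    if len(norm[0]) == 1:
        # Assumption: an abbreviated direction ("S.") is read as a direction.
        return [parse(direction, body[1:], street_type)]

    # Spelled-out direction: return both readings listed in the post.
    return [parse(None, body, street_type),
            parse(direction, rest[1:], None)]
```

Calling `candidate_parses("125 S. Park St.")` gives a single parse, while `candidate_parses("125 South Park Street")` gives the two alternatives, and the caller can treat `len(result) > 1` as the ambiguity flag.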

TheDefender

This one is interesting, and has been tried many times in the past. I think the main problem is that there really is no set standard for addressing like there is for phone numbers.

I am wondering how the script would handle something like the following:

One Dell Parkway (Dell's address in Nashville, Tennessee)

Note, the number is written out, and there is no directional indicator...

Just more to think about.

sfullman

the way I pop things off the ends, starting with the right side and then going to the number on the left, saves this issue for last.

In the case of Dell's address, it would be put in the name, and there should be a flag in the returned array indicating non-standard -- easy because of lack of number at front.

The next thing to build in optionally would be a function which would read english language versions of a number, like:

read('Twelve Hundred'); //==1200

which could analyze this as a possible option

So the genius is not in the parsing, but in the vision to come up with the flags. If you know of an English language number converter function, let me know (these are all great ideas, you'd think there'd be a section where all of these could be found).
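
For what it's worth, a converter along the lines of `read('Twelve Hundred')` can be sketched briefly in Python. The name `read_number` and its behavior of returning `None` for any unrecognized word are assumptions for illustration; it handles only plain cardinal words, not ordinals or fractions.

```python
# Word tables built from space-separated strings to keep the sketch short.
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * (i + 2) for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}
SCALES = {"thousand": 1000, "million": 1000000}


def read_number(text):
    """Convert English number words to an int, or None if a word is unknown."""
    total = 0      # completed thousand/million groups
    current = 0    # the group being built
    seen = False   # did we see at least one number word?
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            return None  # not a number word, e.g. "Dell"
        seen = True
    return total + current if seen else None
```

With this, `read_number("Twelve Hundred")` returns 1200, and an address parser could try it on the leading word of something like "One Dell Parkway" to offer 1 as a possible street number.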

Sam

Weedpacket

1. 155 Gt. Sth. Rd.
2. 37 St. George St.
3. 12/5 Customs Street West
4. 27th Floor, AXA Building, Shortland Street
5. Shop 12, Kelston Shopping Centre, Aranui Road (or Pilkington Avenue, which the centre is also located on)
6. Corner Elliot and Wellesley Streets
7. Private Bag
8. PO Box 45332
9. Brancott Estate, RD1

I can think of two different kinds of delivery address: postal and physical. If a physical location is required, 7 and 8 are irrelevant, since the bit that follows those lines is the specification of a post office (and in 9, "RD1" is a postal rural delivery region), but if a postal address is required, then 6 is meaningless (there are two such corners, as it happens, and you have to go there to decide which is which). In my current environment, postal addresses are useless, and physical addresses are required. (Occasionally a PO Box number comes through and I get to break skulls*.)

If I recall rightly, the lines in Chinese addresses are traditionally written in general->specific order, the reverse of Western specific->general ordering.

What the Postal Service standards do (and I don't know if they vary from service to service or not - they probably do) is provide a means of parsing addresses to enable machine sorting - anything failing those standards is parsed as narrowly as possible to mechanically sort the item so that it is routed to the most specific possible post office, where local knowledge (i.e., hand-sorting) is applied to finish the route.

With that in mind, what you're suggesting (in its most general sense) is an expert system that embodies that local knowledge.

*By the pitch of my whining

TheDefender

Here's another:

Rural Route 2 (or RR2), Dunk Hill

(Several family members of mine in upstate New York have this exact address, along with everyone else on that road, and the local mail person knows who is who and delivers the mail to the right houses.)

This will be quite the function when you finish sam... 🙂

drawmack

HC1 Box 148
Blakslee PA

TheDefender

Bah, anyone who lives in the splendid beauty of the Poconos doesn't count anyhow. 😉 My family is in the Adirondacks, so I don't imagine they count for much either. 😃

sfullman

Lotsa people talking, nobody helping :-)

Drawmack, see my code for Rural routes, Rt1 Box 425 for example. What does HC stand for? All you'd do is add an if clause with that in the regex and you've got it.
Weed: I do catch P.O. Boxes. Also, your long post including info about Chinese addresses I think misses the point that we are assuming a subset of world addresses because we know the country already. I never meant this to catch every address in the world. What I want to know is
1) how many countries this'd work for in the current embodiment
2) how much more is being missed to make this marginally effective
Defender, your RR2(,) Dunk Hill would be caught if I modified the Rural route section, because I pull out many things to the right first like apartment number etc. Then I go back and look for a # to the left, and if not then I look for a P.O. Box or RRte. Currently it expects a "Box XX" after the route but that could be expanded to Alpha also.
For ALL: the genius here is not the parsing but
1) the ability to recognize the probability that an address is a certain format
2) what sensible flags to return which indicate where the problems might be, so that a duplicate check function or human can make the corrections from a list.

THAT is what I would appreciate more constructive ideas on. I put a lot of thought into strategy on this; if comments could address an elegant grouping of flags and partial strings for correction options, that would be worthwhile (not just showing me addresses that would get by :-)
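
To make the strategy described in this thread concrete (pop things off the right, then look for a number on the left, then fall back to a P.O. Box, and report flags), here is a rough Python sketch. It is not the code behind the link: the function name `parse_address`, the regexes, the short list of street types, and the flag names `no_leading_number` and `no_street_type` are all hypothetical.

```python
import re

# Hypothetical patterns; each list of keywords is deliberately short.
UNIT_RE = re.compile(
    r"[,\s]+(?:(?:apt|apartment|unit|suite|ste)\.?\s+|#\s*)(\w+)\s*$", re.I)
TYPE_RE = re.compile(
    r"\s+(st|street|rd|road|ave|avenue|pl|place|pkwy|parkway)\.?\s*$", re.I)
NUMBER_RE = re.compile(r"^\s*(\d+)\s+")
PO_BOX_RE = re.compile(r"^\s*p\.?\s*o\.?\s*box\s+(\w+)\s*$", re.I)


def parse_address(line):
    """Parse one address line from the outside in, collecting flags."""
    result = {"number": None, "name": None, "type": None,
              "unit": None, "po_box": None, "flags": []}
    rest = line.strip()

    # 1. Pop the unit/apartment off the right-hand side.
    m = UNIT_RE.search(rest)
    if m:
        result["unit"] = m.group(1)
        rest = rest[:m.start()]

    # 2. Pop the street type off the right-hand side.
    m = TYPE_RE.search(rest)
    if m:
        result["type"] = m.group(1)
        rest = rest[:m.start()]

    # 3. Look for a number on the left; if none, try a P.O. Box.
    m = NUMBER_RE.match(rest)
    if m:
        result["number"] = m.group(1)
        rest = rest[m.end():]
    else:
        box = PO_BOX_RE.match(rest)
        if box:
            result["po_box"] = box.group(1)
            return result
        result["flags"].append("no_leading_number")

    # 4. Whatever is left over is treated as the street name.
    result["name"] = rest.strip() or None
    if result["type"] is None:
        result["flags"].append("no_street_type")
    return result
```

For "One Dell Parkway" this puts "One Dell" in the name and sets `no_leading_number`, which is the non-standard flag described earlier; a caller could then present each flag alongside the partial strings as correction options.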

TheDefender

Hey Sam, slow down there skippy. The point is nobody probably really knew exactly what kind of help you were after, since you didn't explicitly say so in your original post. Heck, I didn't even realize you had any code in place until your follow-up. We're not mind readers. AND, giving you examples is probably the best way for us to help you think through strategies, since we don't have any code to go on... (but that's just one guy's opinion.)

sfullman

I never included the link to the function. Please everyone check this link out:

http://www.relatebase.com/development/misc/parse_address.php

Enter an address and just click submit; there's a link to the code. I really am sorry guys; thanks for being patient -- this thing is close to being finished, but again I'm looking for an elegant structure of flags.

This will be used for parsing addresses but also for comparing an entered address against potential duplicates, or perhaps telling the user their structure may be incorrect.

Many thanks and once again I'm sorry to leave that link out.

Sam

drawmack

Personally, if you can't catch the addresses that I want to enter, then your probability and flags are worthless to me no matter how well thought out they may be.

HC is Highway County and is commonly used in rural PA, as is Star Route, which is used when there are not enough people on a single road to list it: they clump a bunch of them together and call it a Route. Both Star Route and Route are acceptable forms of this address. Also, there are a lot of hyphenated roads in my area, for example Saylorsburg-Palmerton Rd., which goes between Saylorsburg and Palmerton (imagine that).

Everyone posting here understands that this is not an easy task - if it were easy we would have already done it. We are not criticising you, we are attempting to help you complete your project.

Another thing you may want to consider is getting the zip code database from the US Census Bureau; then you'll be able to check zip code against town and state with accuracy.

sfullman

Drawmack,

Please test the link again, I added the features you mentioned, and it wasn't hard to do with regex.

As far as correlation with zip codes, I agree that's helpful but it's still appropriate I think to recognize the parts of an address on their own merit.

Anyway give it a try, you can see the line that deals with route and HC, let me know if it should be changed. Does route include a box or does the postman just know everyone? 🙂

BTW my algorithm should find hyphenated Saylorsburg-Palmerton with no problem.

Sam

drawmack

Yeah, Route includes a box. Basically HC, Route and Rural Route are all the same thing, just different practices by different post offices. Ain't standards great?
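
Since HC, Route, Rural Route and Star Route are being treated as variants of one form, a single pattern can cover them. The Python sketch below is a guess at what such a line might look like, not the line in the linked code; the name `parse_route`, the canonical codes in `KINDS`, and the flags `no_box` and `no_route_number` are assumptions. The box accepts letters as well as digits.

```python
import re

# Hypothetical canonical codes for the spellings mentioned in this thread.
KINDS = {"rural route": "RR", "rr": "RR", "route": "RT", "rte": "RT",
         "rt": "RT", "hc": "HC", "star route": "STAR"}

ROUTE_RE = re.compile(
    r"^\s*(?P<kind>rural\s+route|star\s+route|route|rte|rr|rt|hc)"
    r"(?![a-z])\.?\s*(?P<route>\d+)?"          # optional route number
    r"(?:[,\s]+box\s+(?P<box>\w+))?"           # optional "Box XX"
    r"(?P<rest>.*)$",
    re.I)


def parse_route(line):
    """Return route parts and flags, or None if the line is not a route."""
    m = ROUTE_RE.match(line)
    if not m:
        return None
    # Collapse internal whitespace so "Rural  Route" still maps correctly.
    kind = KINDS[" ".join(m.group("kind").lower().split())]
    flags = []
    if m.group("route") is None:
        flags.append("no_route_number")
    if m.group("box") is None:
        flags.append("no_box")
    return {"kind": kind,
            "route": m.group("route"),
            "box": m.group("box"),
            "remainder": m.group("rest").strip(" ,"),
            "flags": flags}
```

"HC1 Box 148" and "Rt1 Box 425" both parse cleanly, while "RR2, Dunk Hill" comes back with `no_box` set and "Dunk Hill" as the remainder, so a human or a duplicate checker can see what is missing.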

Weedpacket

I think you'll want to include hyphens in the number, too.

267-273 The Mall

To indicate that the property in question covers a range of numbers along that thoroughfare.

It would also accommodate Unit-Street# forms:

5-12 Crummer Pl.

except in the cases where the writer wrote

5/12 Crummer Pl.

Unit 5, 12 Crummer Pl.

When checking for possible duplicate addresses, it seems reasonable to expect "271 The Mall" to be a potential duplicate of "267-273 The Mall", but probably not "270 The Mall" unless the numbering on The Mall is unconventional; "8 Crummer Pl." certainly shouldn't be a potential duplicate of 5-12 Crummer Pl., though.
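
One way to express that duplicate rule is sketched below in Python. It rests on assumptions that go beyond the post: the function name `number_matches` is made up, a hyphenated pair is treated as a range only when both ends have the same parity (same side of the street) and the first is lower, and anything else, such as "5-12", is read as Unit-Street# and compared on the street number alone.

```python
import re


def number_matches(entered, stored):
    """Could house number `entered` (int) be a duplicate of `stored` (str)?

    `stored` is either a plain number ("271") or a hyphenated pair
    ("267-273" or "5-12").
    """
    stored = stored.strip()
    m = re.fullmatch(r"(\d+)-(\d+)", stored)
    if not m:
        return stored == str(entered)

    low, high = int(m.group(1)), int(m.group(2))
    if low >= high or low % 2 != high % 2:
        # Assumption: mixed parity cannot be one side of a street,
        # so read the pair as unit-street and compare the street number.
        return entered == high

    # A genuine range: inside the bounds and on the same side of the street.
    return low <= entered <= high and entered % 2 == low % 2
```

Under these assumptions 271 matches "267-273", 270 does not, and 8 does not match "5-12". Unconventional numbering on a street would need a looser check.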

Incidentally, the particular subset of addressing formats I'm most familiar with is New Zealand's (I missed the bit about this being applicable to a "subset"); both "267-273 The Mall" and "5-12 Crummer Place" are compliant with machine-sorting standards; "5/12 Crummer Place" is not, due to the potential for OCR scanners to read the number as "5112".