Extracting address, city, zip, state,... from POI address column

Sun, 08/02/2009 - 8:44pm

turboccc

15 years

Hello POI Factory,

Most of you already knows my Extra_POI_Editor (EPE). Many of you are using 4-column CSV files to create POI files. It is good, but how better would it be if you could separate the address information from the 4th column into distinct columns?

Am I mad? Yes, doing this manually is a bit crazy, but what if you could do this automatically?

Let's see an example already supported by EPE. It works great when all the information is structured uniformely:

589 4th street, Ney York, NY, 212-333-4567
1089 Peachtree, Atlanta, GA, 245-122-4456
762 Pete street, Los Angeles, CA, 613-432-8753

Remember, this is the 4th column of your CSV, but because the address, city, state and phone information is all separated by commas, it becomes easy to use EPE to split the information into 4 different columns.

What about this more complex example:

589 4th street, New York, NY 10089 212-333-4567
1089 Peachtree, Atlanta, GA 23828 245-122-4456
762 Pete street, Los Angeles, CA 87234 613-432-8753

Now, it is not as fun because the state, postcode and phone are all together with no separator. Well, this can be done too and especially in the US or Canada. Why? Because the state is always 2 characters, the zip 5 digits and we know what makes a phone number.

The problem is not really those examples. The problem comes when you have different POI structured differently on different lines:

589 4th street, New York, NY 10089 212-333-4567
1089 Peachtree, GA
Virginia Beach
762 Pete street Los Angeles CA

What about those? Well, not everything can be done, but what if you could extract at least some of it? Then, you can inspect you output CSV and see what is missing in the new columns. If a program can do 75% or 90% of a file, isn't worth it?

This thread is all about telling me what do your POI files look like and telling me what you want me to do with them. If nobody want to split their 4th column, I will do something else. Otherwise, post me some examples and I will see what can be done no matter how crazy you think it would be. I have tons of ideas. I am just probing to see if there is interest. This will be a lot of work, so I am expecting a lot of interest before I dig into this.

Turbo

Sun, 08/02/2009 - 9:24pm

a_user

15 years

Quite a project

Turbo,

This is quite a project and if you can come close, it will save hours of work in updating older files. For an example of a large file that has all the problems you are looking for, try the Hampton Inn POI. It was evidently compiled from various state files made by individuals with varying degrees of success. Some entries have address, city, state, post code, and phone number while others are missing one or more elements.

Some entries have only a street address and no other information, while others may just be missing the phone number. Still others may have everything except the postal code. Some have the format State-address-city-State all separated by dashes or hyphens.

The format of the telephone number is not consistent when present either. Sometimes it has parens, sometimes dashes. The postal code also varies, sometimes it is 5 digits, other times it is the +4 variety which has a hyphen and 4 additional digits. This one file should provide just about all the testing you would need with the exception of line breaks.

--
ɐ‾nsǝɹ Just one click away from the end of the Internet

Sun, 08/02/2009 - 11:59pm

turboccc

15 years

Oh mannnnn!

Hi a_user,

Pretty nasty file, but exactly what I was looking for. smile

Phone number = Works good, but I do not catch the second phone number when present. This can be fixed easily.

State+zip: Got them all!

State alone: To be done, but some minor errors are to be expected.

City: To be done, but major errors are to be expected. Assuming a dash (-) as a separator would cause many errors.

I still have some ideas... I'll post back later.

Mon, 08/03/2009 - 9:45am

Gary A

16 years

No Commas

When I make my files, I use line break characters, not commas in the 4th column.

Here's an example: http://www.poi-factory.com/node/13301

Then, for something really different, there is this one with lots of info in the 4th column that is not address related: http://www.poi-factory.com/node/17672

But both display very nicely on my 660.

--
Tampa, FL - Garmin nüvi 660 (Software Ver 4.90), 2021.20 CN NA NT maps | Magellan Meridian Gold

Mon, 08/03/2009 - 11:15am

turboccc

15 years

.

Hi Gary A,

You file with line breaks is pretty easy to break into different columns. The line break becomes the character separator instead of commas. Also, your file is very uniform and complete. It seems there is no exception. Good job.

The second file is interesting. See an example below of the 4th column with line breaks:

2902 W Fletcher Ave
Tampa FL 33618
w3.aatampa-area.org
Mo-NS OS 8:15p
Tu-NS OD 8:00p
Th-NS OD 7:30p

A bit like your file, I would use the line breaks as separators with the following field order:

Address, (City+State+Zip), Link, Desc, Desc, Desc

When this is implemented in EPE, this would move opening hours into the description column.

Thank you for the feedback. This is exactly what I am looking for.

Mon, 08/03/2009 - 12:03pm

Gary A

16 years

Inconsistencies

You're right, its much harder to do what you are trying to do when your raw input data (in this case, col #4) is full of inconsistencies.

Is changing the www to w3 just to reduce the number of characters?

--
Tampa, FL - Garmin nüvi 660 (Software Ver 4.90), 2021.20 CN NA NT maps | Magellan Meridian Gold

Mon, 08/03/2009 - 2:22pm

phranc

16 years

The www isn't really

The www isn't really necessary. http://www.google.com is the same thing as http://google.com. Try it for most any address.

If some sites won't work, it's indicative of host/dns server problems.

--
Frank DriveSmart55 37.322760, -79.511267

Mon, 08/03/2009 - 2:58pm

turboccc

15 years

w3

LOL! No, I just edited the www link to w3 to remove the blue link when displayed in the post. smile

Mon, 08/03/2009 - 2:59pm

Felix Krull

15 years

Canadian postal codes

turboccc wrote:

Canadian postal codes are six-character alphanumeric. And they may or may not be represented with a space separating the field into two 3 character pieces.

You've set out quite a challenge for yourself. Many of us here will be rooting for you to succeed.

Mon, 08/03/2009 - 3:33pm

turboccc

15 years

Partial results

Here's a little example of what can be done. I used the Hampton Inn_USA.csv file from POI Factory.

You can see my partial results here:

http://turboccc.wikispaces.com/file/view/Hampton+Inn_USA_wit...

You can see the original address field in the 4th column. You can see the cleaned address in the 5th column. The other columns on the right are the extracted state, postcode, phone and city. When you see a * in the field, it indicates the information is coming from geocoding, otherwise, it comes from the original address field.

Like I said, not perfect, but not so bad. Cleaning the address field can be tricky and especially the left part of it: sometimes it contains the POI name, sometimes it is slightly altered, sometimes it is all different. At least, I try to remove the reduncdant information, but still I have work to do.

Comments welcomed.

Edit: Oh yeah: some cities with a * seems to be different from the one in the address field. I am not sure how far they are from the reality, but this just means we cannot rely entirely on geocoding.

Turbo

Mon, 08/03/2009 - 3:47pm

a_user

15 years

RE: Partial Results

From my playing in the file the city/location named in the POI name (Col C) is the name given on the Hilton (Hampton Inn) web site. I think they use the name of a major area rather than the postal address city for marketing purposes. People looking at the web site would be more tempted to look at Inns in a major city than one of the 'burbs.

When the location is then displayed on a map, they can see if it is in the area they want or need.

Oh - it is LOOKING GOOD!!!

--
ɐ‾nsǝɹ Just one click away from the end of the Internet

Tue, 08/04/2009 - 10:02am

turboccc

15 years

Beta test for state & postcode extraction

I released EPE v3.90.

When reading a CSV file, it will extract the State & Postcode from the address field (Col4) if those fields are empty.

To extract the City, start the Batch Geocoding under Tools. If EPE finds the City in the address field, it will extract it. Otherwise, it will take the geocoded one and add a *. It will do the State & Postcodes as well if the fields are still empty.

Warning: Sometimes, massive geocoding may crash the EPE application. I need to check this.

Here's how to test this:
------------------------

1-Set the CSV fields to:

Col1=Lon, Col2=Lat, Col3=Name, Col4=Address
Col5=City, Col6=State, Col7=Zip

It does not matter if Col5-Col7 do not exist when reading the CSV file. They will be filled when writing the M-Column CSV.

2- Read CSV
3- Inspect Phone+State+Postcode extraction if you want.
4- Tools-Batch Geocoding
5- File-Save As and select M-Column CSV as the file type. Choose a filename.
6- Open the CSV in Excel (or others) and see the changes.

Report feedback.

home | top of page | forum areas