Geonames: The only terrible choice we have

Table of Contents

  1. Loading your data
  2. Associating your data
  3. The mysterious admin1Codes.txt

When it comes to getting an accurate, up-to-date database of cities, towns, and what have you, really the best we can hope for is Geonames. They manage to keep things updated fairly regularly; however, there also seems to be a slight, just a slight, massive tidal wave of garbage all mixed in.

As they say on the BBC: I'm on a journey… to figure out how to get the cities, associated provinces, and postal codes from these data dumps, and associate them together. You'd think it'd be easy, you'd think the CSVs have all of the proper association information there, easy to manage, and of course that's what a sane person would think.

Well, I've got a surprise for you.

I'm not going to go super in-depth here all tutorial-style, but rather just outline some things I had to deal with and the solutions I came up with, in hopes they will help you, because they sure as hell were not easy to find.

Loading your data

Loading it is fairly straightforward for a lot of people, but for some people it isn't quite clear. A few things need to be considered:

  1. Regular utf8 columns in MySQL will not allow you to import some alternative names and other fields, because MySQL's utf8 only stores characters of up to 3 bytes.
  2. In some cases even utf8mb4 columns aren't "good enough" to import some alternative names because the alternative names themselves aren't "good enough" and are malformed Unicode.
  3. Most alternative names are either identical, misspellings, or bizarre stuff like airport codes, misspellings in other languages, and what seems to be some sort of sub-typing (like one for Moscow was "wsa     MOW", what the hell that means, I have no idea). You'd think the city in the native language/script would come first, but it usually doesn't, if it's there at all. Though I'm sure you'll enjoy all of the names in Gothic Unicode script, which are absolutely necessary since we live sometime in the Early Middle Ages.
  4. My suggestion is to just ignore the alternative names completely, but if you insist on importing them, use varbinary or blob.
  5. There are tons and tons of duplicates, with our Moscow example alone there are currently 7 versions of the exact same city, in the same country, and yes the feature type is the same.

The easiest way to do this is to create a table in MySQL identical to the schema they lay out for whatever dump you have, but maybe with varbinary instead of varchar (as mentioned above). In some cases you may want to add a little extra width too, since some columns go over the documented width.

From your MySQL client:

This general idea works with all of the dumps so long as the table is exactly the same.

Whoops, hold on, for some reason there are needless backslashes escaping the tabs from time to time, so you have to go in and replace those otherwise MySQL will freak out. Are these backslashes a part of the names of some cities or places, perhaps a part of a strange alphabet?

LOL, of course they're not, they're just randomly there.
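
The cleanup step can be sketched in a few lines of Python. This is only a sketch under a couple of assumptions: the function names and file paths are made up for illustration, and the only problem being handled is one or more backslashes sitting directly in front of a tab. It works on raw bytes so that malformed Unicode in the dump passes through untouched.

```python
import re

# One or more backslashes immediately followed by a tab character.
_BACKSLASH_TAB = re.compile(rb"\\+\t")


def strip_stray_backslashes(line: bytes) -> bytes:
    """Drop backslashes that directly precede a tab; leave everything else alone."""
    return _BACKSLASH_TAB.sub(b"\t", line)


def clean_dump(src_path: str, dst_path: str) -> None:
    """Copy a tab-separated dump line by line, removing backslash-escaped tabs."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(strip_stray_backslashes(line))
```

You'd run something like clean_dump("allCountries.txt", "allCountries.clean.txt") (both paths hypothetical) before importing, and load the cleaned file instead of the original.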

Associating your data

With the locations, you'd think with a name like geonameid that other dumps like postal would match up with it in some way, or they'd easily reference administrative divisions by ID… come on, you know they wouldn't do something that logical.

Geonames has the most horrendously, implausibly baffling, most terribly undocumented method of associating data I've seen in years; it's a wonder anyone uses it. It's made worse by the fact that those who have figured it out (which to me seems on par with breaking Egyptian Hieroglyphs) essentially keep it a secret as if posting about it on a forum will cause the Stasi to burst in their door and take their families away.

How the data are associated is pretty damn goofy and poorly thought out:

  • Countries are referenced by their ISO-3166-1 code, at least this is somewhat consistent, in fact GB is used rather than UK, so don't confuse them with the almost identical country top-level domains. SS and XK are also in use for the new nations of South Sudan and Kosovo; XK is a temporary code, since as of writing this Kosovo does not have an official one yet.
  • Cities and states/provinces: You'd think they'd follow the logical geonameid integer logic, but at this point I guess I don't need to say it's far dumber than that. Cities are associated with provinces by the country and admin1 columns. So what are these admin1 values? In some cases like US states they make sense, they're the standard abbreviations; in some cases they're area codes, and in some cases they're various other things. (See bottom of this post about where to get the admin1 values from).
  • Postal codes are matched primarily by country and admin1, admin2 (when they're properly filled, sometimes not), and the "place_name." However, if you're expecting place_name to match the spelling of the city in the other dump (such as O'Brien in the locations dump also being O'Brien in the postal codes dump), you are sorely mistaken, yet again. Instead spelling changes are haphazard, so it can be "O Brien" or "OBrien" but never the expected "O'Brien." I mean jeez, it's almost like they're running a psychological experiment on us.
  • As for other stuff, I don't have any information, my only concern was getting associated cities, provinces/states, countries, and postal codes, but it doesn't take a massive leap of faith to guess they're all fucked up too.
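
To make the matching rules above a bit more concrete, here's a rough Python sketch. The function names and the sample admin1 entry are hypothetical, and the normalization (fold case, strip accents, drop everything that isn't a letter or digit) is my own assumption about what's "close enough", not anything Geonames documents. It's aggressive, so expect the occasional false match.

```python
import unicodedata


def place_key(name):
    """Collapse spelling variants like "O'Brien", "O Brien" and "OBrien" to one key."""
    folded = unicodedata.normalize("NFKD", name).casefold()
    # Keep letters and digits only: drops apostrophes, spaces, and combining accents.
    return "".join(ch for ch in folded if ch.isalnum())


def match_key(country, admin1, place_name):
    """Key used to join a postal-code row to a city row."""
    return (country.upper(), admin1.upper(), place_key(place_name))


def province_for(country, admin1, admin1_names):
    """Look up a province/state name by (country, admin1); None if it's missing."""
    return admin1_names.get((country.upper(), admin1.upper()))


# Hypothetical sample data, for illustration only.
ADMIN1_NAMES = {("US", "CA"): "California"}
```

Build a dictionary keyed by match_key() from the locations dump, then look each postal code row up with the same function. Anything that still doesn't match is a candidate for manual cleanup.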

The mysterious admin1Codes.txt

So you want to associate your cities with the appropriate province/state? "Tough shit, asshole" say the administrators of Geonames, they make it basically impossible.

How, you say?

So where do they keep the states/provinces? Nowhere, they deleted them. Apparently they felt it was "confusing for a lot of users" because of various issues, so instead of fixing those, they simply removed admin1Codes.txt from their web site, but continue to reference it in the official readme.txt for how to import the data.

Brilliant! That's exactly how not to confuse people, reference a non-existent file which contains important information on which cities belong to which provinces/states in countries, just delete it and replace it with nothing, because having anything at all would be confusing.

I don't recommend using the one linked by "marc," it's missing a lot and it's badly encoded.

I went ahead and recreated a new admin1Codes.txt, I based it on combining several versions of admin1Codes.txt I found, being certain to actually use the same encoding type through the whole file and making sure all of the cities listed have a code.

admin1Codes.txt with names in "plain-text" Unicode
admin1Codes.txt with names in hex (for easy transport, but they're still Unicode bytes)

If you import this, be certain to use utf8mb4_unicode_ci columns, as I explain at great length in my post: Better Unicode support for MySQL (including emoji).

Disclaimer: I can't vouch for the accuracy of the names or whether or not obsolete ones still exist, but hell, it's better than providing you with nothing.

In case you are interested in contributing to this list, an example of the missing codes can be found here: missing.txt, if you manage to match them up, let me know. A few I started to manually do, but since I spent so much time on this already, I stopped. I'm also considering starting an API service like Geonames that uses proper association. I realise the job of getting accurate data isn't easy, but if they've got people manually entering things to properly set it (and they do) then why in the hell don't they maintain a logical, static association?

Better Unicode support for MySQL (including emoji)

When it comes to character support I think the only thing that should ever be used is Unicode. That's right, I said it. However, when it comes to support in MySQL, things get a little bit murky.

I never had too much of an issue using plain ol' utf8_general_ci, however when trying to add language support for Gothic (tested because it's rare), I ran into a serious issue:

Source: brainstuck.com

Son of a …

This issue is caused by the fact that MySQL's utf8 character set doesn't fully support UTF-8; it only supports characters of up to 3 bytes. If you want something more realistic you're going to need to have at least MySQL 5.5.3 and you're going to have to use utf8mb4, not regular utf8. Yes, seriously.
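
If you want to check whether a given string will trip over the 3-byte limit before you import it, a quick Python sketch like this works (the function names are mine, nothing official). Anything above U+FFFF takes 4 bytes in UTF-8, and that's exactly what plain utf8 in MySQL can't store.

```python
def utf8_width(ch):
    """Number of bytes a single character takes when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text):
    """True if any character is outside the BMP, i.e. takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


print(needs_utf8mb4("Москва"))        # False: Cyrillic is 2 bytes per character
print(needs_utf8mb4("\U00010330"))    # True: GOTHIC LETTER AHSA, 4 bytes
print(needs_utf8mb4("\U0001F600"))    # True: emoji, 4 bytes
```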

Make sure you read through all this before trying anything, because there are edge issues, especially with indices (indexes) which you may need to consider.

Also back up your data first.

Updating Database

I'll be working under the assumption that you want your entire database to be utf8mb4. If you don't then you'll have to adjust a bit, but seriously reconsider joining the 21st century if you're not using Unicode. I'm also assuming you want case insensitive text, and if you don't, replace utf8mb4_unicode_ci with utf8mb4_bin; most people want case insensitive text in most cases.

Update the default character set and collation for your database:

Updating Tables

First we need to change the default character set, this way when you add new columns in the future, or whatever, you don't need to worry about adding all of the character set specification:

Updating Table Columns

Now, you can convert one column at a time, and this may be what you wish to do if you require different character sets for your CHAR, VARCHAR, and TEXT columns, here's how you do that:

Now obviously you're going to want to make sure that you're converting to the same column type and length, etc, the above is for example only and if you copy/paste it, you may screw up your column schema. Essentially you're just using ALTER TABLE CHANGE on the column in order to change the character set to utf8mb4 and collation to utf8mb4_unicode_ci.

If on the other hand you just want to change the entire table at once, you can do:
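
For reference, the database, table, and column steps above can be sketched as a small Python helper that only builds the SQL text (nothing is sent to a server). The ALTER statements are standard MySQL syntax, but treat them as my reconstruction of the idea rather than the exact statements from this post. The names mydb, mytable, and mycolumn are placeholders, and the column type and attributes you pass in must match your real schema.

```python
CHARSET = "utf8mb4"
COLLATION = "utf8mb4_unicode_ci"


def quote_ident(name):
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def alter_database(db):
    """Default character set and collation for the whole database."""
    return (f"ALTER DATABASE {quote_ident(db)} "
            f"CHARACTER SET = {CHARSET} COLLATE = {COLLATION};")


def alter_table_default(table):
    """Default for columns added to the table later; existing columns are untouched."""
    return (f"ALTER TABLE {quote_ident(table)} "
            f"DEFAULT CHARACTER SET = {CHARSET} COLLATE = {COLLATION};")


def alter_column(table, column, col_type, attrs=""):
    """Convert a single column; col_type and attrs must repeat the existing definition."""
    col = quote_ident(column)
    parts = [
        f"ALTER TABLE {quote_ident(table)} CHANGE {col} {col} {col_type}",
        f"CHARACTER SET {CHARSET} COLLATE {COLLATION}",
    ]
    if attrs:
        parts.append(attrs)
    return " ".join(parts) + ";"


def convert_table(table):
    """Convert every character column in the table in one go."""
    return (f"ALTER TABLE {quote_ident(table)} "
            f"CONVERT TO CHARACTER SET {CHARSET} COLLATE {COLLATION};")


print(alter_database("mydb"))
print(alter_table_default("mytable"))
print(alter_column("mytable", "mycolumn", "VARCHAR(254)", "NOT NULL"))
print(convert_table("mytable"))
```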

Updating Table Indices

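When changing the character set, you may run into MySQL error 1071 on InnoDB ("Specified key was too long; max key length is 767 bytes"), or the same error with a limit of 1000 bytes on MyISAM.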

Ah, CRAP!

Oh snippy snap, snap, there are solutions:

If you're using InnoDB on MySQL 5.6.3 or higher you can enable innodb_large_prefix in your MySQL config file (more information in manual here), but if you aren't you can take a few steps to work it out the old way:

  1. Make note of the conflicting index, it's going to likely be one which is something like VARCHAR(255) or an index across multiple columns which includes VARCHAR. Make note of the index name, type, and which column(s) it crosses.
  2. In my own scenario, I had a lot of indexes which included some sort of VARCHAR(254) and an ID which was binary(20). Now it seems like 254+20 = 274, and hey that's less than 767 (or 1000) so what's the deal?

    Well, not so fast there, Professor.

    MySQL doesn't count literal bytes in VARCHAR when it comes to Unicode, rather every character is counted at its maximum potential byte size (wait, what?).

    So if the column is 254 and it's utf8 that means the potential length is (254 * 3) bytes, and with utf8mb4 it's (254 * 4). So really the length of the key you're trying to create is ((254 * 4) + 20) = 1036 bytes.

    InnoDB only allows a maximum of 255 characters for a column in an index with utf8 and 191 characters for utf8mb4 (767 bytes divided by 3 or 4 bytes per character).

    So if you need the entire column indexed, you aren't going to want to change the character set for that column(s), and instead I recommend changing all others one by one (as seen in the Table Columns section) rather than trying to convert the entire table. However, if you do not need the entire column to be indexed (and in certain cases I did not), drop the index:

    Then recreate it with the offending column(s) limited to 191:

    or if across multiple columns (assuming mycolumnb is not utf8 for example):

As long as the indices are the same, and in the same column order, you should receive the same benefits for the indices without worrying about redoing your queries.
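
The arithmetic from the steps above is easy to get wrong in your head, so here's a small Python sketch of it. The function names are mine. It only models the worst-case bytes per character (3 for utf8, 4 for utf8mb4, 1 for binary columns) and ignores any per-column overhead, so treat the numbers as a sanity check rather than an exact prediction of what MySQL will accept.

```python
# Worst-case bytes MySQL reserves per character for each character set.
MAX_BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4}


def column_key_bytes(length, charset=None):
    """Worst-case bytes one indexed column contributes; charset=None means binary."""
    return length * MAX_BYTES_PER_CHAR.get(charset, 1)


def index_key_bytes(columns):
    """Total worst-case key size for an index over (length, charset) columns."""
    return sum(column_key_bytes(length, charset) for length, charset in columns)


def max_prefix_chars(limit_bytes, charset):
    """Longest character prefix that fits within a byte limit."""
    return limit_bytes // MAX_BYTES_PER_CHAR[charset]


print(column_key_bytes(254, "utf8"))                      # 762
print(column_key_bytes(254, "utf8mb4"))                   # 1016
print(index_key_bytes([(254, "utf8mb4"), (20, None)]))    # 1036
print(max_prefix_chars(767, "utf8"))                      # 255
print(max_prefix_chars(767, "utf8mb4"))                   # 191
```

That last number is where the 191 in the recreated index comes from.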

Additional Notes and Considerations

If a column is not being used for search and case insensitivity isn't an issue, instead of using CHAR or VARCHAR, I suggest using BINARY and VARBINARY. Not only is comparison vastly faster, but there's also less to worry about as far as character set issues go, i.e. they don't matter. Furthermore, VARBINARY is literal length so the UTF-8 limitations described in the index section of this post do not apply, so you can get the full width for your index.

Additionally instead of using TEXT, use BLOB, for the same reasons, but also realise the same limitations apply, such as no fulltext searching.

In summation, if you don't need case insensitivity and you don't need fulltext search, consider BINARY, VARBINARY, and BLOB over CHAR, VARCHAR, and TEXT; it'll be a lot easier to deal with when it comes to Unicode.

You can learn more about this on my MySQL performance, using case insensitive columns post.

Database Connections

Depending on your programming language, you may need to specify which character set to use when connecting (you can also, in most cases, specify this in the configuration, see that section at the bottom). This can usually be done by sending a query such as SET NAMES utf8mb4 right after connecting.

Configuration

You can edit your my.cnf (or my.ini on Windows) and make these changes to the appropriate sections of the configuration file (applicable to MySQL 5.6, older versions may need adjusted configuration):

Properly escaping MySQL queries in PHP

I'm on various boards and such and from time to time people run into issues where they're trying to insert something into MySQL via a raw query and they inevitably run into that pesky apostrophe and the query dies.

Then almost always someone comes along to tell them that they need to use addslashes().

This is wrong.

Ideally you really want to use prepared statements (mysqli and PDO extensions), but let's assume for now you're throwing caution to the wind and you're going to do it the old fashioned way.

If you're using the mysql extension, you should use mysql_real_escape_string() around all of your variables which are not cast as integers. But actually, you shouldn't be using this function because mysql_* is deprecated, way deprecated. Instead you should be using…

mysqli which is faster, better, sexier, everything you want in a wom… extension. In this case we have the more logical name mysqli_escape_string() or you can use the back-to-goofiness-again method in the mysqli class, $mysqli->real_escape_string(), and it works the same way.

One issue is that with both of the above functions you have to actually be connected to the server to use them; that's because they escape based on your connection character set and some other stuff.

However assuming you're not too worried about potential unicode issues (I've yet to have any, supporting Serbian and Hungarian) you can always make your own function to escape based on what MySQL requires:

But there's always a potential danger in doing things yourself and I actually don't have proof the above is faster than the connection required escape functions, so just use prepared statements ideally.
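
To show what "just use prepared statements" buys you, here's a minimal sketch. Big caveat: it's Python with the built-in sqlite3 module standing in for MySQL, purely so it runs anywhere without a server. The table and function names are made up, and the placeholder syntax differs between drivers, but the principle is the same with mysqli or PDO: the value travels separately from the SQL text, so the apostrophe never gets a chance to break the query.

```python
import sqlite3


def insert_city(conn, name):
    """Insert a name using a bound parameter instead of string concatenation."""
    conn.execute("INSERT INTO cities (name) VALUES (?)", (name,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT)")

insert_city(conn, "O'Brien")                       # the pesky apostrophe
insert_city(conn, "x'); DROP TABLE cities; --")    # a classic injection attempt

rows = [row[0] for row in conn.execute("SELECT name FROM cities ORDER BY rowid")]
print(rows)
```

Both strings end up stored exactly as given, and the table is still there afterwards.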

What are HIPAA's encryption requirements?

There are a lot of assumptions about what HIPAA states when it comes to encryption, be it over the wire, files, whatever. The fact is that HIPAA makes absolutely no requirements for encryption*, just that if there's reasonable risk, it must have encryption. What kind of encryption? What sort of strength? It does not specify*.

So to break it down:

  • Does HIPAA require encryption? No, unless there's a reasonable risk something could be read, as in over a network or what have you
  • What sort of encryption does HIPAA require? Essentially anything.

My suggestions though are:

  • You should use encryption in as many places as possible, especially if devices are storing information. Almost all HIPAA data violations come from people losing laptops or whatever and the drives aren't encrypted. You can use something like TrueCrypt or even Windows EFS.
  • I suggest PGP since it's so widely implemented and available, and SSL for networks, etc since again, implementation is widely available. Where not available you can tunnel over things such as encrypted VPN connections as well.

* Source: HIPAA 45 CFR § 164.312(a)(2)(iv) and (e)(2)(ii).

By the way: IANAL/TINLA

So, what the hell is type casting anyway?

Casting is a way to take a liquid and mold it int… oh yeah

So casting is just a fancy way to refer to type conversion, that is where you change the "type" of a variable from one thing to another. For example changing a string to an integer.

How about some examples? Is that what you want?

OK, fine, you talked me into it. Here are some PHP examples:

So, who cares? What's the point?

Well, depending on what you're wanting to do, it's important to change the type, and this is especially true in languages where there is no dynamic typing (like C#) and it's still useful in languages with dynamic typing like PHP, because it allows for one to avoid potential issues with mathematics, concatenation, etc. Aside from math related things, in PHP I use (int) a lot to clean up variables for SQL queries for both safety and also so MySQL doesn't have to convert the types itself.
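
Here's the same idea sketched in Python rather than PHP (the to_int helper is hypothetical, just for illustration). One difference worth knowing: PHP's (int) quietly turns "12abc" into 12, while Python's int() refuses and raises an error, so the helper falls back to a default instead.

```python
def to_int(value, default=0):
    """Cast to int for use in a query; fall back to a default on garbage input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


count = int("42")           # string -> integer
total = count + 1           # arithmetic on the integer: 43
label = str(count) + "1"    # concatenation on the string: "421"

print(total, label)
print(to_int("12abc"))      # 0, because "12abc" is not a valid integer
```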

You can learn more about type casting in PHP specifically and why it's a great way to do certain things here: Casting int faster than intval in PHP.

Let them learn COBOL / PHP isn't evil

I received this in my inbox earlier:

What programming languages should a modern-day programmer have in his/her arsenal? (Quora)

OK, fine, now I'm forced to evangelize for PHP, this puts me in a really painful position, but since I'm apparently the only person reading this who can think for myself instead of freebasing whatever the Valley tells me to use, here we go…

Give me some of those Valley trends, baby

The general theme seems to be to either learn a pretty hardcore language like C or C++ which won't benefit most people right away these days, since there's almost no excuse to make classic applications anymore. I think if anything it will discourage some people from learning to program since they have to spend a lot of time learning to clean up garbage, compiling, debugging, etc. Way to ruin their fun by making them spend all that time on a language better suited for drivers than web or phone apps.

Promoting Java is also a thing for some reason, I thought we were trying to kill this language? It's still used by a lot of places, but so is COBOL. In fact there's still a ton of places that use COBOL, so why not promote it? Probably because it doesn't come with a hipster mustache and a really tall bicycle.

If it's about job security, automatically Python and Ruby were a terrible suggestion, same with Erlang. You might as well be one of those skinny guys promoting Lisp.

A huge one though is promoting Python (and sometimes Ruby), blindly suggesting it's the best way to go without consideration for how huge of a pain in the ass it is to start a project. The syntax of the language(s) is very easy and the language itself quite powerful, but also slower than other options, harder to get going, and not widely supported. Starting a project in Python is about as difficult as starting a car by putting the engine in the car first. Turnkey? Hell no. You can get used to it, take some shortcuts, etc, but really for a new person, it's a nightmare.

It's really a hipster language, and Monty Python isn't funny, I'm just saying, it really isn't, I mean, seriously.

That's unrelated to this topic, but since Python is named after it, I felt it was important for me to communicate that it's just … knights who say Ni? yeah, fucking falling over laughing. Monty Python films had a few snicker moments here and there, but it was mostly diarrhea (or diarrhoea). I liked Flying Circus much better, why don't many people talk about that?

Yes, I've seen all of the popular films, and no I didn't laugh. I didn't go into it expecting it to be about as funny as a hernia operation either. I had thought they would be funny since that's what people were saying, and after wasting about six hours of my life I realized: holy shit, I didn't laugh once. No, I mean that literally, I didn't laugh one time. A few smiles, sure, but not much else.

Anyway, where was I? Oh yeah, terrible ideas…

Some other promotions for assembly, as if it's 1977 or something.

In general though there was a lot of PHP hate spread through the entire thread, mostly that it was bad, but nobody ever saying why, it just is. That's a lot of bullshit. It's because PHP is widely used, widely available, and despite their claims PHP has made a massive amount of headway over the last few years, and is only getting better.

Many of the complaints people have about PHP are true... if you've fallen out of a time machine from 2004. Hating PHP is like hating MySQL, it's just easier to ignore the last decade and pretend nothing ever changes, then go on to promote your slower, less widely available, much cooler alternatives of Python and PostgreSQL.

It's just the toxic runoff coming from the Valley of essentially acting like Pookie for anything cool coming out of the Valley, Bay Area, etc. And hey, I've lived in the Bay Area, so that makes me an authority on everything there.

I don't mind the C# suggestion, I don't like the platform limitations. Yeah there's mono, but seriously, yeah, who cares. C# has a lot of things like static typing that I wish PHP had, but Hack from Facebook does add a lot of those features right back into PHP and many of those will be moved into core PHP over the next couple of years.

The blinding hatred of PHP out there causes people to promote things in a manner which can slow newcomers down. PHP sure isn't perfect and there are lots of things I'd change about PHP, but it's faster, extremely powerful, and most importantly easy as hell to get going.

I'm of the mind though if we're going to want to stop people from learning to program, then yes, let's promote Python, Ruby, Erlang (what the fuck are you promoting this for, do people making small sites really need message queues? Don't be an asshole.), and while we're at it Java. Languages which can be easy at face value, easy in syntax, but a pain in the ass to get going and deal with, not to mention slower. Except Java and Erlang, those can be pretty fast.

So reasons not to learn PHP?

  1. It's not really cool
  2. It's not the steam punk of languages like Python, so you don't get a stupid ass top hat with goggles and proclaim you're awesome
  3. It's making headway faster than most languages, some of which aren't even changing or improving at all any more.
  4. It's widely available, i.e. essentially everywhere, so you're not held hostage by host availability
  5. It will help you learn C-style syntax which you can more easily pass on to other languages like JavaScript (also used on the web), plus countless other languages like C, C++, Java, C#, etc

Python and Ruby aren't bad to have in your arsenal, but blindly suggesting them first, when C-style languages are king, is just ridiculous. Meanwhile the most popular web language is PHP, which is a C-style language, but oh no, don't use that, it's bad just because it's bad, I mean, no reasons listed here, it just is.

Anyway, now a choice, spend 10 seconds starting a PHP project or spend half an hour setting up a Python environment and prepping shit just to get coding, and I mean really coding, throwing things directly to the interpreter isn't how you make real projects, it's how you demonstrate the language without making it obvious how much of a pain in the ass it is.

I'll use the language best suited for the situation, I'm not going to blindly dislike something because a broader community of self-deluded permanent man-children hate it.

My choices of languages:

  • PHP
  • JavaScript / node.js
  • Ruby
  • C#

My choice of languages in 2004:

  • Perl
  • C++
  • PHP

My choice of languages in 1997:

  • Perl
  • C++
  • Visual Basic

Nope, shit never changes, I'll just use Python forever and tell everyone that's all I've ever loved.

I hope you can appreciate the irony of blind hatred and ignorance of modern PHP meanwhile essentially doing the same thing with Python. That's my point, when it's turned around, it's obvious how idiotic you look.