I have a little project written in Ruby on Rails called xmasgifts. It’s a really small application that helps my family coordinate Christmas wish lists with each other because we are geographically scattered, which makes it hard to keep up-to-date with who wants what, and more importantly, who’s gotten what for whom.
The basic gist is that you post items that you want, and part of the information that you can attach to an item is a URL, say, for the website where you can order the item or find more information about it. Last year I noticed that the kinds of URLs people were posting were really long, things like https://w1.yahoo.com/pubs/RP/RWD/oae_gemini.jsp?cds_page_id=30226&cds_mag_code=RWD&id=1165543849543&lsid=63412010495049968&vid=1&cds_response_key=I6MW0050&cds_mag_code=RWD
(That’s a non-working URL, BTW.) Well, it’s not very pretty to see that sort of thing displayed on somebody’s wish list, nor is it useful, because I typically do not care what somebody’s Java session ID was when they were viewing the page. It’s important information to store, of course, but not to see. One could make the argument, however, that it could conceivably be useful to know what the website is, like if it’s on amazon.com or panic.com or whatever.
So I set out to do a lil regex scrubbing and clean that bad boy up. Turns out it was a little trickier than I thought but I’m pretty sure I finally got it. If you’re already good at regular expressions then you probably have your own fu for prettifying URLs, but here’s what I’ve got:
(^https?:\/\/)?([^\/]+\.)?([^\/]+\.[^\/]+)(\/.*)?
If you grab the third capture group (that’s index 3 of the match result, since index 0 holds the entire match), you should get just the domain name. Again, if you’re already handy at regex, you probably don’t need this, but if you’re just starting, let’s walk through it to see what each piece does:
- (^https?:\/\/)?
- Here we’re trying to chop off the http:// portion of the URL. The ^ means “assert beginning of string” (strictly speaking, in Ruby it anchors at the beginning of a line), because if we’re going to find the http:// part, it’ll be right at the beginning. The first ? comes right after the ‘s’, and means “you may or may not see the preceding character”. This allows us to catch either http:// or https:// URLs. Then we have the colon and the two slashes (which are both escaped so that the regex parser doesn’t interpret them as special symbols). Finally, we wrap that whole thing in parentheses and follow it with a ?, which again means “you can find 0 or 1 of the preceding thing”. We do this because there may or may not actually be the http:// piece at all!
- ([^\/]+\.)?
- Now that we’ve chopped off the protocol (if it exists), we want to catch any bits of the address that aren’t the domain name. Sometimes a URL for a store will be something like wwws.shop.store.com and we really only care about the store.com part, so we want to ditch the rest. We start off by specifying a character class with the square brackets ([]). A character class is simply a set of characters. Now, whereas the ^ alone means assert beginning of string, as the first character inside of square brackets it means “not any of this stuff”. We set up the character class with all of the things we don’t want to see, which in this case is just a slash. The + sign then means “at least one of the preceding thing”, so we suck up as many characters that aren’t a slash as we can (as long as they are followed, at some point, by a period). Again, we finish off by wrapping the whole thing in parentheses followed by a ?, since the URL may or may not include anything in front of the domain name.
- ([^\/]+\.[^\/]+)
- Hot dang! Here’s the good stuff: the thing we actually want. Remember the discussion about [^\/]+ from the last pattern? We do the same thing here. We want to catch one or more characters that are not slashes, then a period, then one or more characters that are also not slashes. Because the previous group is greedy and has already swallowed as many period-terminated pieces as it can while still leaving this group something to match, what’s left over here is the domain name and the TLD, which is what we’ll want to print out to the user.
- (\/.*)?
- Finally we rip off the rest of the URL: everything after that / that might be after the domain name. Any directories, filenames, the query string… we don’t want to see any of it. So we say, “Hey, grab a slash, followed by 0 or more of any character, but you might see 0 or 1 of that whole thing.”
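To see the four pieces side by side, here’s a rough sketch that runs the pattern against a made-up URL and labels each capture group. The `url_pieces` helper, the hash keys, and the sample URL are my own names for illustration, not anything from the project; the pattern itself is the one above, just written with `%r{}` so the slashes don’t need escaping.

```ruby
# The pattern from the post, written with %r{...} so the slashes need no escaping.
URL_PATTERN = %r{(^https?://)?([^/]+\.)?([^/]+\.[^/]+)(/.*)?}

# Hypothetical helper: returns each captured piece under a descriptive name,
# or nil when the pattern does not match at all.
def url_pieces(url)
  match = URL_PATTERN.match(url)
  return nil unless match

  {
    protocol:   match[1], # "http://" or "https://", nil if absent
    subdomains: match[2], # everything in front of the domain, nil if absent
    domain:     match[3], # the part we want to display
    rest:       match[4]  # path and query string, nil if absent
  }
end

pieces = url_pieces("https://wwws.shop.store.com/catalog/item?id=42")
puts pieces[:domain]
```

For that sample URL the four pieces come out as `https://`, `wwws.shop.`, `store.com`, and `/catalog/item?id=42`, so the script prints `store.com`.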
Now, of course this is a regular expression, so you can use it in whatever language you want, but I said I was writing Ruby so here it is in that language:
- items.yml
- Big ol’ list of test URLs. If you have one that breaks this, please send it to me!
- item_test.rb
- Contains a loop and some assertions to run against the previous list of URLs.
- item.rb
- Look at the pretty_url method to see the regex in action.
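If you just want the flavor of it without opening those files, here’s a minimal plain-Ruby sketch of what a `pretty_url` method and a test loop along those lines might look like. It is not a copy of item.rb or item_test.rb: the `DOMAIN_PATTERN` and `SAMPLES` names, the fallback to the raw URL when nothing matches, and the sample URLs themselves are all assumptions on my part.

```ruby
# Plain-Ruby stand-in for the Rails model; the class layout and constant
# name are assumptions for illustration, not the contents of item.rb.
class Item
  DOMAIN_PATTERN = %r{(^https?://)?([^/]+\.)?([^/]+\.[^/]+)(/.*)?}

  attr_reader :url

  def initialize(url)
    @url = url
  end

  # Returns just the domain (capture group 3), or the URL unchanged
  # when the pattern finds nothing to extract.
  def pretty_url
    return nil if url.nil?

    match = DOMAIN_PATTERN.match(url)
    match ? match[3] : url
  end
end

# Made-up sample data standing in for items.yml: URL => expected display text.
SAMPLES = {
  "http://www.amazon.com/gp/product/12345" => "amazon.com",
  "https://panic.com/transmit/"            => "panic.com",
  "panic.com"                              => "panic.com"
}

SAMPLES.each do |url, expected|
  actual = Item.new(url).pretty_url
  raise "#{url}: expected #{expected}, got #{actual}" unless actual == expected
end
puts "all sample URLs passed"
```

Falling back to the original URL when there’s no match is just one choice; returning nil or an empty string would work too, depending on how the view displays it.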
Those files belong to a Rails project, so the tests etc. are set up to run inside of Rails, but there is nothing stopping you from extracting it out so that it doesn’t need Rails. And like I say, it’s just regex, so do what you will with it. :) Hope this helps somebody!