2015 Desktop

I took the opportunity of the new year to give my programming environments a fresh look and try out some new things. Check it out:

zsh

I’ve tried a few times before, but I’m making a real attempt to ditch bash and give zsh a try this year. Some cool things so far:

  • I set REPORTTIME=2 (sketched in the snippet after this list), which automatically reports how long a command took if it ran for more than 2 seconds (this is awesome! I used to prepend time manually, except of course when I forgot and wished I had!)
  • zsh-git-prompt for useful indicators for git repos (current branch, changed files, etc.)
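
For the curious, here’s a minimal sketch of the relevant ~/.zshrc lines; the clone path for zsh-git-prompt and the prompt layout are my assumptions, so adjust them to your own setup:

# report timing stats for any command that runs longer than 2 seconds
REPORTTIME=2

# assumes zsh-git-prompt is cloned to ~/zsh-git-prompt; the prompt function
# name comes from its README, so double-check it against your install
source ~/zsh-git-prompt/zshrc.sh
PROMPT='%~ $(git_super_status) %# '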

I’m using a fairly minimal config, eschewing oh-my-zsh and whatnot as I learn more about “stock” zsh. Follow along with my config files on GitHub.

Dark Theme

For the past couple years, I’ve used a light theme for Sublime Text. I’ve decided to try a darker theme hoping it’ll reduce eye strain when staring at code for long hours.

Menubar Apps

For reference, here’s what’s currently in my menubar.

  • Twitter - I still use the official Twitter client. I don’t love it, but don’t use it enough to look for any non-obvious replacements
  • Dropbox
  • TextExpander
  • 1Password
  • Freckle - New timer app, helps me keep track of hours for billing
  • Degrees - Weather
  • Plex - Media center
  • CCMenu - (Almost) real-time status of various CI builds (Drone, Jenkins, and Travis CI)
  • Fantastical - Quick viewing and entry of calendar items
  • Visitors - Real-time gauge of how many people are on a site, powered by Google Analytics
  • Usage - Lightweight way to keep track of how much time you spend in various apps
  • Creative Cloud - I have the Photography plan, $10 a month for Photoshop and Lightroom

Friday Link Review

Whew! It’s been a while. Here are some links for the New Year!

Web Scraping with pup and jq

I’m sure you’ve heard of jq by now; it’s the best way to deal with JSON data on the command line. I previously wrote about a nifty way to use jq to import JSON into Elasticsearch. It’s great for filtering, transforming, and otherwise munging JSON data, but 70% of the time I use it just to pretty print and syntax highlight JSON on the command line. Next time you have a big pile of JSON, try piping it to jq '.'.
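
For instance, with whatever JSON file you have lying around (the filename here is just a stand-in); the second form uses jq’s -C flag to force color output so the highlighting survives the pipe into less -R:

jq '.' some-data.json

jq -C '.' some-data.json | less -R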

pup is the spiritual cousin of jq, except for HTML. It has become my go-to tool for munging (and yes, even just viewing: pipe some HTML to pup -c) HTML on the command line.
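
For a quick, readable look at some markup, something along these lines works (the URL and selector are just placeholders):

curl -s https://example.com | pup -c 'body'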

I recently used the two together to scrape a local legal news site.

Disclaimer: Be nice when you do this! And know that even if you’re nice, most “webmasters” wouldn’t like you to do this. In this case it’s public data (sort of…)

I wanted to set up an automated alert when real estate was transferred on my street. Most people just read the “newspaper” every day to do this, but I have better things to do with my time (like write scripts to… er…)

There are three steps to this; let’s look at each in turn and then see how they connect together:

Getting the HTML (curl)

curl "https://www.legalnewsexample.com/realestate/index" > out.html

Extracting the HTML (pup)

In my case, the data we’re interested in is in a table, and the rows of interest have ids starting with “id”. We can use pup’s CSS3 selector support to extract just those elements:

cat out.html | pup 'tr[id^="id"] json{}' > out.json

We’re actually not asking much from pup in this case, just using its ability to subset some HTML and convert it to a JSON structure for our next step.
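
If you want to see the shape of what pup emits before writing a jq filter against it, peeking at the first row (or just its keys) helps. This is purely exploratory and not part of the pipeline:

jq '.[0]' out.json

jq '.[0] | keys' out.json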

Converting to JSON (jq)

cat out.json | jq '.[] | {"id": .children[0].children[0].href, "soldDate": .children[1].text, "seller": .children[2].text, "buyer": .children[3].text, "parcel": .children[4].text, "address": .children[5].children[0].children[0].text, "cityStateZip": .children[5].children[0].children[1].text, "county": .children[6].text, "saleAmount": .children[7].text|gsub("[\\$\\,]"; "")|tonumber}'

jq does the bulk of the work here. I looked at the JSON generated by pup and sussed out the structure. It’s a bit hard to read as one line, so let’s look at it formatted:

{
  "id":           .children[0].children[0].href,
  "soldDate":     .children[1].text,
  "seller":       .children[2].text,
  "buyer":        .children[3].text,
  "parcel":       .children[4].text,
  "address":      .children[5].children[0].children[0].text,
  "cityStateZip": .children[5].children[0].children[1].text,
  "county":       .children[6].text,
  "saleAmount":   .children[7].text | gsub("[\\$\\,]"; "") | tonumber
}

Most of the data we need is in the text of the elements, but in the case of the id, we’re actually using the href attribute of a link.

The saleAmount field uses some jq functions to transform a string like "$100,000.00" into a number like 100000 by stripping out the $ and , with a regular expression (gsub) and converting that string to a number (tonumber).
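
You can try that last transformation in isolation; the dollar amount below is made up, but it matches the format in the listings and prints 100000:

echo '"$100,000.00"' | jq 'gsub("[\\$\\,]"; "") | tonumber'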

Truth be told, this is a process of trial and error. I had 13 separate iterations of this in my bash history as I built up this long command string.

All Together Now

Once we have the steps in place, we can drop the intermediate files and replace those redirections and cats with pipes:

curl "https://www.legalnewsexample.com/realestate/index" | pup 'tr[id^="id"] json{}' | jq '.[] | {"id": .children[0].children[0].href, "soldDate": .children[1].text, "seller": .children[2].text, "buyer": .children[3].text, "parcel": .children[4].text, "address": .children[5].children[0].children[0].text, "cityStateZip": .children[5].children[0].children[1].text, "county": .children[6].text, "saleAmount": .children[7].text|gsub("[\\$\\,]"; "")|tonumber}'

Three small utilities, each doing what it does well, connected together via pipes to solve a problem. Unix!