How to Auto Create Issues in Jira From PHP

February 4th, 2011

Good morning again everyone. In an effort to always increase productivity and reduce friction, I have a new weapon in my workflow arsenal. Here is some background for context: we use Jira at Gravity for tracking issues and bugs. Since I'm not always on VPN and don't always have access to our network, managing my todos has been cumbersome. I've tried every todo app out there and always fail to use them for more than 2 days.

I finally saw a great article on just using a simple Todo.txt file in your Dropbox folder and working from that. It's been a dream and working out great. I can use all my fav shell commands and have amazing Quicksilver integration that I'll show in a little video soon.

Anyway, the final joy was having a cron job run every hour and automatically create Jira issues from my todo file for any item that has a tag. That helps me sync with Jira when I want others to know what I have on my plate at the moment. The script that calls this PHP file strips out the tag after it creates the issue.
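The tag handling described above could look roughly like this. It's a minimal sketch in Python rather than the PHP/shell combination I actually run, and the `@jira` tag and the `split_tagged_items` helper are hypothetical names picked for illustration; the call that actually creates the Jira issue is left out.

```python
import re

# Hypothetical tag; use whatever marker you put on items meant for Jira.
JIRA_TAG = "@jira"


def split_tagged_items(lines, tag=JIRA_TAG):
    """Return (summaries, cleaned_lines) for a list of todo lines.

    summaries     -- text of every tagged item, with the tag removed
    cleaned_lines -- all lines with the tag stripped, ready to write back
    """
    pattern = re.compile(r"\s*" + re.escape(tag) + r"\b")
    summaries = []
    cleaned = []
    for line in lines:
        if pattern.search(line):
            stripped = pattern.sub("", line).strip()
            summaries.append(stripped)  # would become the issue summary
            cleaned.append(stripped)    # tag removed so it is not filed twice
        else:
            cleaned.append(line.rstrip("\n"))
    return summaries, cleaned
```

Each entry in `summaries` would be handed to whatever creates the issue, and `cleaned` written back to the todo file afterwards.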

Here is a typical sample of my todo list

read the full entry on the new blog site

released to showcase article extraction

January 9th, 2011

This week I was happy to get up a little project called which shows you the 20 most viral sports articles right now. It used a combination of technologies behind the scenes but it relies on the newly released Goose Article Extractor from Labs. It's formatted for the iPhone and iPad as well so you can get a quick digest of what's happening in the sporting world right now. Goose is available on GitHub at:

Goose - Article Extractor now open source, same as Flipboard/Instapaper

December 21st, 2010

Today I'm releasing a project I worked on for It's an HTML Article Extractor ala Flipboard / Instapaper style. It will take an article, run some calculations on it and give you an Article object back with the text of the extracted Article as well as the main image that we think is relevant to the article.

The goal is to create an open source article extractor for use with open source applications, crawlers or academic NLP processing initiatives.

see more at the new blog:

This Blog is moving! to

December 6th, 2010

Created to use for my blogging articles from here on out. The old software was just too much of a pain to maintain so I started out from scratch with CodeIgniter.

Why you should never read comments on blogs

November 30th, 2010

Here is an article from 2005 on TechCrunch: 85% of College Students use FaceBook

And here are some insightful comments from TechCrunchers as always:

"I think that the Facebook is worse than pornography."

"I think Facebook is a really bad idea."

"Its getting old and boring."

"I thought Facebook was the dumbest thing in the world"

"Honestly, FaceBook isn't that great. It's like LiveJournal for retards."

"Opening Facebook for high school kids is just stupid. Facebook should only open to those who are in college or University."

"Anyone else notice how the intelligence of the comments plummeted? This is exactly why I don't use the facebook (or facebook now is it?)."

"Facebook sold out when it started connecting college and high school"

I could post more but it's pretty hilarious to see how wrong most people are in the comment world. Ever since DIGG went to shit I've been avoiding comments in blogs and articles for that reason.

Using the Python NLTK Bayesian Classifier for word sense disambiguation - 92% accuracy

November 30th, 2010

Today's article will be going over some basic word sense disambiguation using the NLTK toolkit in Python and Wikipedia. Word sense disambiguation is the process of trying to determine, when you mention the word "apple", whether you are talking about Apple the company or apple the fruit. I've read a few white papers on the subject and decided to try out some of my own tests to compare results. I also wanted to make it so no humans would be needed to seed the initial data set and it could be done with data openly available. There are many examples of classifiers out there but they all seem to focus on movie reviews, so I figured another example may be helpful. Trained NLP professionals will perhaps balk at the simplistic approach, but this is meant as more of an intro to NLTK and some of the things you can do with it.

I will demo this approach against real tweets, found by searching Twitter for the word "apple" and building a data set to test against. I suggest a winner-take-all vote between 3 classification/similarity metrics: the Jaccard coefficient, TF-IDF and a Bayesian classifier. If you run all three against an input tweet, whichever label pulls 2 or more votes wins, which gives you a reasonable level of confidence. This is probably not the fastest solution; my goal is accuracy over performance, without spending weeks developing involved solutions, but your mileage may vary.
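The voting idea can be sketched in a few lines. This is only an illustration: `majority_vote` is a name I made up, and each entry in `classifiers` is assumed to be a plain function that takes the tweet text and returns a label such as "company" or "fruit".

```python
from collections import Counter


def majority_vote(text, classifiers):
    """Return the label most classifiers agree on, or None without a majority.

    `classifiers` is a list of callables, each mapping text to a label.
    """
    votes = Counter(classify(text) for classify in classifiers)
    label, count = votes.most_common(1)[0]
    # With 3 voters this means 2 or more votes are needed to win.
    if count * 2 > len(classifiers):
        return label
    return None
```

With three voters and two labels somebody always wins; the `None` case only matters if a voter can abstain or there are more than two senses.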

Here is a sample tweet: "I guess I'm making a stop at the Apple store along with my quest to find an ugly sweater tomorrow. boo!" It's easy for a human to determine we're talking about Apple the company in this tweet; to a computer it's not so easy. We first need to find a dataset to seed our algorithms to compare against. Wikipedia has over 2 million ambiguous word definitions, so it's important not to require manual training for each word or we'd never get anywhere. My first idea is to look at Wikipedia itself. If you look at the disambiguation page for "apple" you can see there are a couple of entries of importance: Apple, Inc and apple the fruit. To seed my dataset I suggest grabbing each Wikipedia page and storing the complete text of that topic page, along with following each link in the first paragraph and storing the text from each link against the Apple company corpus. So we're grabbing the text of the topic page plus the text of every wiki link in its first paragraph. This is something you could easily script out using the openly available wiki dump pages, so the approach could be used for all the seed data for ambiguous words.
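As a rough illustration of the "follow each link in the first paragraph" step, here is a sketch using only Python's standard `html.parser`. The class name is hypothetical, and it assumes the first `<p>` element in the HTML really is the intro paragraph, which is not guaranteed for real Wikipedia markup. Fetching the pages and storing their text is left out.

```python
from html.parser import HTMLParser


class FirstParagraphLinks(HTMLParser):
    """Collect the href of every <a> inside the first <p> of a page."""

    def __init__(self):
        super().__init__()
        self.in_first_p = False
        self.done = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return  # the first paragraph has already been processed
        if tag == "p" and not self.in_first_p:
            self.in_first_p = True
        elif tag == "a" and self.in_first_p:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_first_p:
            self.in_first_p = False
            self.done = True
```

Feed it a page's HTML with `parser.feed(html)` and read `parser.links` to get the list of pages whose text should go into that sense's corpus.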

One file contains a corpus of text for Apple the company. I will do the same with apple the fruit, creating a second file with a corpus of apple the fruit terms by going to the apple wiki topic and following the links in the first paragraph as well. So we now have two corpuses of text that can be created programmatically. The next step is to take in a tweet, tokenize it and try to find some similarity between the tweet and each corpus. For our tokenization we'll grab all the unigrams as well as what NLTK determines to be the most significant bigrams. We'll apply Porter stemming to each word and use the WordPunctTokenizer to separate words from punctuation.
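To make the tokenization step concrete, here is a plain-Python sketch. It is deliberately simplified and is not the NLTK pipeline itself: it uses the same regular expression that NLTK's WordPunctTokenizer is built on, keeps every adjacent bigram instead of only the most significant ones, and skips stemming entirely. The function name is made up for the example.

```python
import re

# Same pattern NLTK's WordPunctTokenizer uses: runs of word characters,
# or runs of punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def extract_features(text):
    """Return a dict of unigram and bigram features for one piece of text."""
    tokens = [t.lower() for t in TOKEN_PATTERN.findall(text)]
    # Drop pure punctuation tokens, keep the words.
    words = [t for t in tokens if re.match(r"\w", t)]
    features = {word: True for word in words}
    # Simplification: every adjacent pair, not a scored "best bigrams" list.
    for first, second in zip(words, words[1:]):
        features[first + " " + second] = True
    return features
```

The `{feature: True}` shape mirrors the feature dictionaries NLTK classifiers are trained on, so a stemmer and a bigram scorer could be slotted in later without changing the output format.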

First we'll try to train a simple Naive Bayesian classifier using the NLTK toolkit to determine what label we should give a tweet: "company" or "fruit". We're going to take each blob of training data and use it to seed our classifier with unigrams and bigrams (two word combinations), using the NLTK classes to do some of the heavy lifting for us. We will also be Porter stemming each word to its root form, so "clapping" becomes just "clap". This minimizes the number of variants of each word in the corpus.
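For anyone curious what the classifier is doing under the hood, here is a tiny Naive Bayes written from scratch in plain Python. It is a stand-in for illustration, not NLTK's NaiveBayesClassifier: the class name is hypothetical, it works on raw token lists, and it uses simple add-one smoothing.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, labeled_tokens):
        """labeled_tokens: iterable of (list_of_tokens, label) pairs."""
        for tokens, label in labeled_tokens:
            self.label_counts[label] += 1
            self.feature_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def classify(self, tokens):
        """Return the label with the highest log probability."""
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.label_counts.items():
            score = math.log(doc_count / total_docs)  # prior
            total = sum(self.feature_counts[label].values())
            for token in tokens:
                count = self.feature_counts[label][token]
                score += math.log((count + 1) / (total + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Training it on the two corpora means one `(tokens, "company")` pair and one `(tokens, "fruit")` pair; the real thing would feed in stemmed unigrams and bigrams instead of raw tokens.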

Here is a sample file of around 100 random tweets I found with the word "apple" in them. We'll use this to see how well our classifier is doing. I also hand curated two training files just to verify how accurate our classifier is. We have the following training files available with tweets that were curated into fruit or company buckets. All I did was search "apple" on Twitter and grab the first tweets I could find; the tweets are not picked to increase accuracy, they are just random Apple company and apple fruit tweets.

Training files:

If you uncomment the line #run_classifier_tests(classifier) you'll see that, based on this training data, our trained classifier can guess the sense of a tweet with 92.13% accuracy. Not bad for a few hours of work. There are many improvements we can make to the classifier, such as clustering around the common hashtags used in tweets it was able to classify accurately, adding trigrams, playing around with other features found in tweets, trying out different classifiers, etc.
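The accuracy figure is just the share of hand-labeled tweets the classifier gets right. A sketch of that calculation, assuming `classify` is any function that maps tweet text to a label and `labeled_examples` is a list of (text, expected label) pairs; both names are mine, not from the script:

```python
def accuracy(classify, labeled_examples):
    """Fraction of (text, expected_label) pairs that classify() gets right."""
    if not labeled_examples:
        raise ValueError("need at least one labeled example")
    correct = sum(
        1 for text, expected in labeled_examples if classify(text) == expected
    )
    return correct / len(labeled_examples)
```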

Here is the complete classifier code:

If there is interest I'll post the Jaccard Coefficient script and TF-IDF ones as well. The Jaccard script was about 91-93 percent accurate as well.
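In the meantime, the Jaccard coefficient itself is simple enough to sketch here. This is not the script mentioned above, just the bare formula (size of the intersection over size of the union) applied to token sets, with a hypothetical `closest_sense` helper that picks the corpus a tweet overlaps with most.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient: |A intersect B| / |A union B| on token sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # avoid dividing by zero on two empty inputs
    return len(a & b) / len(a | b)


def closest_sense(tweet_tokens, corpora):
    """Return the label of the corpus with the highest Jaccard score.

    `corpora` maps a label (e.g. "company") to that corpus's tokens.
    """
    return max(corpora, key=lambda label: jaccard(tweet_tokens, corpora[label]))
```

Because the corpora are far bigger than a tweet, the raw scores will be tiny; only the comparison between the two senses matters.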

hit me up on twitter with any comments: @jimplush

** UPDATE **
O'Reilly led me to this PDF, which also discusses using Wikipedia for word sense disambiguation:
It also seems to conclude that this approach is accurate, and that its value will increase in the future as Wikipedia gets smarter and you retrain your classifiers.

New Twitter Account

November 8th, 2010

So I created a new Twitter account, should you care to follow: @jimplush

Fix when compiling the Redis php extension on OSX - mach-o, but wrong architecture

September 13th, 2010

If you happen to get the error:

PHP Warning: PHP Startup: Unable to load dynamic library '/Users/jim/Downloads/owlient-phpredis-2675d15/modules/' - dlopen(/Users/jim/Downloads/owlient-phpredis-2675d15/modules/, 9): no suitable image found. Did find:
/Users/jim/Downloads/owlient-phpredis-2675d15/modules/ mach-o, but wrong architecture in Unknown on line 0

this is how I fixed it:

Create a file anywhere containing the flags below, chmod 777 it, run it, then do your normal:

make clean
sudo make install


CFLAGS="-arch i386 -arch x86_64 -g -Os -pipe -no-cpp-precomp"
CCFLAGS="-arch i386 -arch x86_64 -g -Os -pipe"
CXXFLAGS="-arch i386 -arch x86_64 -g -Os -pipe"
LDFLAGS="-arch i386 -arch x86_64 -bind_at_load"

This is one reason I quit corporate coding

July 22nd, 2010

I got this instant msg from a buddy today at Panasonic:

today i got spanked for putting out issues of rolling stone that i was done with; apparently some female around here didn't appreciate the provocative pose of shakira on a previous cover
1:07 went to HR to complain


Example of Hadoop Python Streaming job script

May 21st, 2010

Here is a sample job script I got running to test out some Hadoop MapReduce jobs for our new cluster. You can put this in the same directory as the mapper/reducer files. The -file parameter will package up those files and send them to the task nodes in the cluster so you don't have to install them yourself.


# remove local output data
rm -rf /data/out/insights-output-traffic

# remove dfs output data
/data/hadoop/bin/hadoop dfs -rmr output-traffic*

# start hadoop job
/data/hadoop/bin/hadoop jar /data/hadoop/contrib/streaming/hadoop-0.20.1-streaming.jar \
-jobconf mapred.reduce.tasks=9 \
-mapper \
-reducer \
-file \
-file \
-input daytest/smallday/* \
-output output-traffic

# move output to local dir
/data/hadoop/bin/hadoop dfs -copyToLocal output-traffic /data/out
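For reference, the mapper and reducer in a Python streaming job are just programs that read lines on stdin and write tab-separated key/value lines on stdout. The sketch below is a generic counting example, not the actual traffic job: the function names and the "count by first field" logic are assumptions for illustration, and they are written as functions over any iterable of lines so they can be tried locally.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'key<TAB>1' using the first whitespace-separated field."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key.

    Hadoop streaming hands the reducer its input sorted by key, so all
    lines for one key arrive together and groupby() can batch them.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"
```

In the real scripts each function would be driven by `sys.stdin` with the results printed line by line; locally you can fake the shuffle step by sorting the mapper output before passing it to the reducer.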