Language and thought project

Tuesday, March 18, 2014

Future work idea

Learning

To begin, the algorithm gathers a set of Tweets based on a simple heuristic, such as containing the #Lakers hashtag. From this initial set, a list of the highest-frequency keywords is generated (discarding common words like "the", "and", etc.).
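
A minimal sketch of the keyword step, assuming Tweets are already available as plain strings. The stop-word list is a tiny hand-picked placeholder, and the names STOP_WORDS and top_keywords are made up for illustration:

```python
import re
from collections import Counter

# Assumed stop-word list; a real run would need a much longer one.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "for", "on", "at", "it"}

def top_keywords(tweets, n=10):
    """Return the n most frequent non-stop-word tokens as (token, count) pairs."""
    counts = Counter()
    for text in tweets:
        # Lowercase, and keep a leading '#' or '@' so hashtags and mentions
        # survive as their own tokens.
        for token in re.findall(r"[#@]?\w+", text.lower()):
            if token not in STOP_WORDS:
                counts[token] += 1
    return counts.most_common(n)
```

Calling top_keywords(tweets, 50) on the #Lakers set would give a first candidate keyword list to curate by hand.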

During the initial run of the algorithm, we find potential fans of a team based on things like:
Following the official @Lakers Twitter account
Following the official NBA account

From these users, we match each user's Tweets against our keyword list using cosine similarity. From this, we hope to find users who are more likely to be Lakers fans than the ones a different method would find.
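
Here is a rough sketch of that matching step, assuming each user's Tweets and the keyword list are both reduced to word-count dictionaries. The function names and the whitespace tokenization are assumptions, not settled design:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors (dicts)."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)

def rank_users(user_tweets, keywords):
    """Sort users by how closely their word usage matches the keyword list.

    user_tweets: dict of user name -> list of tweet strings
    keywords:    dict of keyword -> weight (for example its frequency)
    """
    scores = {}
    for user, tweets in user_tweets.items():
        words = Counter(w for t in tweets for w in t.lower().split())
        scores[user] = cosine_similarity(words, keywords)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The top of the ranked list is what we would then inspect by hand.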

Unfortunately, we can't think of a rigorous and automated way to determine whether a user is truly a Lakers fan. The best way would be to ask the person directly, but that would be extremely time consuming and unlikely to get many responses. The next best thing seems to be to manually inspect a user's profile and Tweets, and label them yes, no, or unknown.

After this run, we would like to see what this new set of highly likely fans says (Tweets) about their team. If we run the keyword frequency generator again on those Tweets, we may find an even more precise and accurate set of keywords to match on. Armed with such a list, we could rerun the algorithm (probably on a different subset of @Lakers followers).

Presumably, after the first round of manual curation, successive rounds should yield better results. We can keep a history of the keyword changes, and use some ranking algorithm that keeps the best keywords and discards the weak ones.

Some Python help

Using the "sort" method and the "sorted" function
http://stackoverflow.com/questions/12791923/using-sorted-in-python
http://stackoverflow.com/questions/2909652/how-to-sort-a-list-by-the-2nd-tuple-element-in-python-and-c-sharp
https://wiki.python.org/moin/HowTo/Sorting
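
A quick reminder of what those links cover, using a made-up list of (keyword, count) pairs:

```python
from operator import itemgetter

# Hypothetical (keyword, count) pairs, just for illustration.
pairs = [("kobe", 42), ("lakers", 97), ("showtime", 5)]

# sorted() returns a new list and leaves the original untouched.
by_count = sorted(pairs, key=lambda pair: pair[1], reverse=True)

# list.sort() sorts in place and returns None.
pairs.sort(key=itemgetter(1))
```

After this runs, by_count is ordered from highest to lowest count, while pairs itself has been reordered from lowest to highest.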

Twython, getting followers
http://stackoverflow.com/questions/19432202/twython-get-followers-list
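
A sketch of paging through followers, assuming a client object that exposes get_followers_list the way Twython does (it wraps Twitter's followers/list endpoint, which returns "users" and "next_cursor"). The function name fetch_followers and the max_users cap are my own, and rate limiting is not handled here:

```python
def fetch_followers(client, screen_name, max_users=1000):
    """Collect up to max_users followers of screen_name via cursor paging."""
    followers = []
    cursor = -1  # -1 asks the API for the first page
    while cursor != 0 and len(followers) < max_users:
        page = client.get_followers_list(
            screen_name=screen_name, count=200, cursor=cursor
        )
        followers.extend(page["users"])
        cursor = page["next_cursor"]  # 0 means there are no more pages
    return followers[:max_users]

# Assumed usage with real credentials (placeholders, not run here):
#   from twython import Twython
#   client = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
#   lakers_followers = fetch_followers(client, "Lakers")
```

Passing the client in makes it easy to try the paging logic with a stand-in object before spending real API calls.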

Saturday, March 15, 2014

Finding the right Twitter users

From users who follow the Lakers, get a subset (how? what criteria? random? how many?)

From these, get their tweets, and compute keyword statistics to find the most common words. This will be our golden set.

We want to find the "active" followers of the Lakers.

Algorithm:
Get the list of followers of @Lakers.
Find users with the most followers themselves.
How do we determine "active"? Candidates: most followers, most tweets, most recent activity.
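
One possible way to combine those three criteria is to rank users on each one and add up the ranks. This is only a sketch: followers_count and statuses_count are fields of Twitter's user object, but last_tweet_at is a hypothetical field we would have to fill in ourselves, and equal weighting of the criteria is an assumption:

```python
from datetime import datetime

def rank_by_activity(users, now):
    """Order users by the sum of their ranks on three activity criteria."""
    criteria = [
        lambda u: u["followers_count"],   # most followers
        lambda u: u["statuses_count"],    # most tweets
        # most recent activity: smaller age means a higher value
        lambda u: -(now - u["last_tweet_at"]).total_seconds(),
    ]
    totals = {u["screen_name"]: 0 for u in users}
    for criterion in criteria:
        ordered = sorted(users, key=criterion, reverse=True)
        for position, user in enumerate(ordered):
            totals[user["screen_name"]] += position  # 0 is the best rank
    # Lowest total rank first, so the most active users lead the list.
    return sorted(users, key=lambda u: totals[u["screen_name"]])
```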

Tuesday, March 4, 2014

Current design strategy

Get the rooting interest of the person who Tweeted.

Get the most frequently referenced team names, player names, etc.
Those are strong indicators of passion for a team.

Word associations
Lakers - Kobe Bryant
Lakers - Lakeshow
Lakers - lake show
Lakers - showtime
etc.

Some of these are basically synonyms, but that might not be worth exploring.
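
A small sketch of how the associations above could be used, treating every alias as evidence for the same team. The table only contains the Lakers entries listed above, and the names TEAM_ALIASES and teams_mentioned are made up:

```python
import re

# Aliases copied from the word associations above; other teams would be added the same way.
TEAM_ALIASES = {
    "Lakers": ["lakers", "kobe bryant", "lakeshow", "lake show", "showtime"],
}

def teams_mentioned(text):
    """Return the set of teams with at least one alias in the text."""
    lowered = text.lower()
    found = set()
    for team, aliases in TEAM_ALIASES.items():
        for alias in aliases:
            # \b keeps an alias from matching inside a longer word.
            if re.search(r"\b" + re.escape(alias) + r"\b", lowered):
                found.add(team)
                break
    return found
```

With this, "Lake Show tonight!" and "#LakeShow" both count as Lakers mentions without treating the synonyms separately.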

---
Another way to get a list of good users is to take the followers of certain Twitter accounts like @NFL, @NBA, @Lakers, etc.

---
Algorithm:

For each team:
    Find the keywords that are related to the team
        Team name, acronym, nicknames
        Also get the most commonly associated keywords
        (which might include player names)

For the followers of each team (like @Lakers, @Clippers, etc.)
    Find the followers who mention the keywords the most (maybe about 100)
    For each follower
        Get a collection of tweets
            Perhaps a certain number
            Perhaps a collection of tweets around the time of a game event
        See how many of the keywords associated with each team they use

Have a confidence score for each team that this person might root for.
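
One simple way to get such a score is each team's share of all keyword hits in a user's tweets. This is a sketch under that assumption, not a settled formula, and team_confidence is a made-up name:

```python
def team_confidence(tweets, team_keywords):
    """Return each team's share of all keyword hits in a user's tweets.

    tweets:        list of tweet strings from one user
    team_keywords: dict of team name -> list of lowercase keywords
    """
    hits = {team: 0 for team in team_keywords}
    for text in tweets:
        lowered = text.lower()
        for team, keywords in team_keywords.items():
            # Substring counting, so "lakers" also matches "#lakers".
            hits[team] += sum(lowered.count(k) for k in keywords)
    total = sum(hits.values())
    if total == 0:
        return {team: 0.0 for team in hits}  # no evidence for any team
    return {team: n / total for team, n in hits.items()}
```

The scores for one user sum to 1 whenever any keyword was found, which makes users with very different tweet volumes comparable.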

Collect the highest scoring individuals for a certain team
Manually read their Twitter feeds, determine if they are a fan of the team or not.

This generates a statistic: how many of the people our system found were actually fans of the given team?

Then, to compare:
From the followers of a certain team, randomly pick users.
Manually inspect these users, determine if they are a fan of the team they follow.

Get the same statistic for this set.

Compare statistics.
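
A sketch of the comparison, assuming each manually inspected user gets a label of "yes", "no", or "unknown" and that unknowns are left out of the rate. The function names and the fixed seed are assumptions:

```python
import random

def fan_rate(labels):
    """Fraction of inspected users labeled 'yes', ignoring 'unknown'."""
    decided = [label for label in labels if label in ("yes", "no")]
    if not decided:
        return 0.0
    return decided.count("yes") / len(decided)

def pick_baseline(followers, k, seed=0):
    """Randomly pick k followers to inspect as the comparison group."""
    # A fixed seed makes the sample reproducible between runs.
    return random.Random(seed).sample(followers, k)
```

If fan_rate for the system's top picks is clearly higher than fan_rate for the random baseline, the keyword matching is adding something over simply following the team account.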

Some design thoughts

Tomer touched on this during our presentation. We need to figure out the exact schema for our database.

Currently:
Id, Team 1, Team 2, Score 1, Score 2

Add:
User name
Time
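
As a rough sketch, the extended schema could look like this in SQLite. The database engine, the table name, the column names, and the sample row are all assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""
    CREATE TABLE tweets (
        id         INTEGER PRIMARY KEY,
        team1      TEXT,
        team2      TEXT,
        score1     INTEGER,
        score2     INTEGER,
        user_name  TEXT,   -- new: who tweeted
        tweeted_at TEXT    -- new: ISO 8601 timestamp of the tweet
    )
""")

# Placeholder row, not real data.
conn.execute(
    "INSERT INTO tweets (team1, team2, score1, score2, user_name, tweeted_at) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("Lakers", "Clippers", 101, 98, "example_user", "2014-03-04T19:30:00"),
)
row = conn.execute("SELECT user_name, tweeted_at FROM tweets").fetchone()
```

Storing the time as an ISO 8601 string keeps rows sortable by time with a plain ORDER BY.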

---
Other design decisions

How do we determine if a person is a fan of a certain team?

1. The person mentions one team more than any other team.

This probably indicates strong emotions. However, it might be possible for someone to tweet a lot about a team that they hate.

With this information in hand, we can analyze the moods of tweets and cross reference them with actual events.

----
Ideas

1. Predicting which athlete composed the tweets?
Doesn't seem that interesting
2. Predicting what a certain person's tweets would be, or creating a tweet someone would write?
Already done with tofu_product
3. Gambling line prediction
Analyzing the Twitter world to figure out the best gambling line.
Doesn't really have a strong NLP aspect to it.
Regardless, a seemingly cool idea.
https://www.cs.cmu.edu/~nasmith/papers/sinha+dyer+gimpel+smith.mlsa13.pdf
4.

Thursday, February 27, 2014

We presented today

It went pretty well. I didn't get asked any tough questions regarding the Watson presentation.

Dyer was less happy about our project presentation, though. He was especially upset when we mentioned "sentiment analysis", saying that it was low-hanging fruit: easy and not interesting. Generally I tend to agree. I figured we'd use sentiment analysis as just one aspect of our system.

Sentiment - happy
Likely rooting interest - winning
Increases evidence of rooting interest

Sentiment - happy
Likely rooting interest - losing
Sarcasm?
Team is tanking? (losing on purpose for better draft picks next year)

Tuesday, February 25, 2014

More database updates, researching other papers

I updated some of the code for the database API. It now supports getting tweets based on the "tweetId" field and the team name fields.

Actually, I may not have completely finished it, but it's like 99% there. I have to make sure I'm returning the object from the function.
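
For reference, a lookup like that might be sketched as follows. This assumes a SQLite table called tweets with tweetId, team1, and team2 columns, which is a guess at the layout rather than our actual code:

```python
import sqlite3

def get_tweets(conn, tweet_id=None, team=None):
    """Return rows matching a tweetId and/or a team name as a list of dicts."""
    clauses, params = [], []
    if tweet_id is not None:
        clauses.append("tweetId = ?")
        params.append(tweet_id)
    if team is not None:
        # The team may appear in either team column.
        clauses.append("(team1 = ? OR team2 = ?)")
        params.extend([team, team])
    sql = "SELECT * FROM tweets"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    cursor = conn.execute(sql, params)
    columns = [col[0] for col in cursor.description]
    # Return the result explicitly; without this the caller just gets None.
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```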

I wanted to start working on our truth data for game scores. Apparently ESPN has APIs now, but they don't expose the one for scores, presumably because it costs them money and there are licensing issues and whatnot. I looked for other sources and didn't find anything free. There were some paid options, but that's not happening.

I found some interesting and somewhat on-topic papers regarding Twitter analysis and live sports events. In particular, I've started reading a paper from 2011 by some people at Rice University. One of their basic mechanisms involves using a sliding window of Tweets. They detect when a new event occurs by measuring the rate at which the Tweets are coming in and comparing it to the rate at the beginning of the sliding window. I haven't finished reading the paper, but I think they also do some lexicon analysis. The same group also attempted some sentiment analysis as a follow-on. This could all be useful.

Some links:
http://arxiv.org/pdf/1106.4300v1.pdf
http://ceur-ws.org/Vol-720/Zhao.pdf
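
To get a feel for the sliding-window idea, here is a toy sketch. It is not the paper's actual algorithm, just my reading of the rate comparison: split the window in half and flag a possible event when the newer half has far more Tweets than the older half. The class name, window length, and ratio threshold are all assumptions, and timestamps are expected in increasing order:

```python
from collections import deque

class BurstDetector:
    """Flag a possible event when the recent tweet rate jumps."""

    def __init__(self, window_seconds=60, ratio=3.0):
        self.window = window_seconds
        self.ratio = ratio
        self.times = deque()
        self.start = None  # timestamp of the first tweet seen

    def add(self, timestamp):
        """Record a tweet time in seconds; return True if a burst is detected."""
        if self.start is None:
            self.start = timestamp
        self.times.append(timestamp)
        # Drop tweets that have slid out of the window.
        while self.times and self.times[0] <= timestamp - self.window:
            self.times.popleft()
        if timestamp - self.start < self.window:
            return False  # not a full window of history yet
        midpoint = timestamp - self.window / 2
        older = sum(1 for t in self.times if t <= midpoint)
        newer = len(self.times) - older
        # Compare the newer half of the window against the older half.
        return older > 0 and newer >= self.ratio * older
```

A steady trickle of Tweets never trips the detector, while a sudden cluster (say, right after a big basket) does once the newer half outweighs the older half by the chosen ratio.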

As an aside, there doesn't seem to be anything out there that tries to turn Tweets into actual score information. Of course such information comes with no guarantee of accuracy, but it would probably be pretty good if the system were designed with the right flexibility.

---
Tomer got a lot of the mechanics of the Tweet parser tool he found working. He's ready to start shoving stuff into the database. This led to some discussion about the database schema. Things like:

What if someone tweets about more than two teams?
No teams, only players?
etc

---
It looks like we'll do our presentations on Thursday.