Tuesday, March 18, 2014

Future work idea

Learning

To begin, the algorithm gathers a set of Tweets using a simple heuristic, such as containing the #Lakers hashtag. From this initial set, we generate a list of the highest-frequency keywords (discarding common words like "the", "and", etc.).
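A minimal sketch of the keyword step, assuming the Tweets are already available as plain strings. The stop-word list and the function name top_keywords are placeholders for illustration, not something we have settled on.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real run would need a fuller one.
STOP_WORDS = {"the", "and", "a", "an", "to", "of", "in", "is", "for", "on", "at", "it"}

def top_keywords(tweets, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop-word tokens across the tweets."""
    counts = Counter()
    for text in tweets:
        # Lowercase and keep word-like tokens, including #hashtags and @mentions.
        tokens = re.findall(r"[#@]?\w+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

# Made-up sample Tweets, only to show the call.
sample = [
    "Kobe is the best #Lakers",
    "Go #Lakers and Kobe",
    "The #Lakers game is on",
]
print(top_keywords(sample, n=2))  # [('#lakers', 3), ('kobe', 2)]
```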

During the initial run of the algorithm, we find potential fans of a team based on things like:
Following the official @Lakers Twitter account
Following the official NBA account

We then score each of these users against our keyword list using cosine similarity. We hope the highest-scoring users are more likely to be Lakers fans than users picked by a simpler method, such as following the account alone.
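The matching step could look roughly like this, assuming both the keyword list and a user's Tweets are reduced to term-frequency counts. The weights and sample terms are invented for the example.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as having no similarity
    return dot / (norm_a * norm_b)

# Hypothetical keyword weights and one user's terms.
keywords = Counter({"lakers": 3, "kobe": 2})
user_terms = Counter("lakers kobe kobe lunch".split())
score = cosine_similarity(keywords, user_terms)
print(round(score, 3))
```

A score near 1 means the user's vocabulary lines up with the keyword list; a score of 0 means no overlap at all.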

Unfortunately, we can't think of a rigorous and automated way to determine whether a user is truly a Lakers fan. The best way to determine that would be to ask the person directly, but that would be extremely time consuming and few would likely respond. The next best thing seems to be to manually inspect a user's profile and Tweets, and label them yes, no, or unknown.

After this run, we would like to see what this new set of highly-likely fans says (Tweets) about their team. If we run the keyword frequency generator again on those Tweets, we may find an even more precise and accurate set of keywords to match on. Armed with such a list, we could rerun the algorithm (probably on a different subset of @Lakers followers).

Presumably, after the first round of manual curation, successive rounds should yield better results. We can keep a history of the keyword changes, and use some ranking algorithm to keep the best keywords and drop the bad ones.
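One possible ranking, purely as an assumption: for each keyword, record per round how many matched users were manually confirmed as fans, then keep keywords whose overall precision clears a threshold. The history layout, the threshold, and the numbers are all hypothetical.

```python
def prune_keywords(history, min_precision=0.5):
    """history maps keyword -> list of (fans_matched, users_matched), one pair per round.

    Returns (keyword, precision) pairs at or above min_precision, best first.
    """
    kept = []
    for keyword, rounds in history.items():
        fans = sum(f for f, _ in rounds)
        matched = sum(m for _, m in rounds)
        precision = fans / matched if matched else 0.0
        if precision >= min_precision:
            kept.append((keyword, precision))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Invented curation results for two rounds.
history = {
    "kobe": [(8, 10), (9, 10)],
    "game": [(3, 10), (2, 10)],
    "lakeshow": [(5, 5)],
}
print(prune_keywords(history))  # "game" falls below the threshold and is dropped
```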


Some Python help

Using the "sort" and "sorted" functions
http://stackoverflow.com/questions/12791923/using-sorted-in-python
http://stackoverflow.com/questions/2909652/how-to-sort-a-list-by-the-2nd-tuple-element-in-python-and-c-sharp
https://wiki.python.org/moin/HowTo/Sorting
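A quick reminder of the two forms those links cover, using made-up (keyword, count) tuples: sorted() returns a new list, while list.sort() reorders the list in place.

```python
keyword_counts = [("lakers", 30), ("showtime", 4), ("kobe", 12)]

# sorted() leaves the original alone; sort by the 2nd tuple element, highest first.
by_count = sorted(keyword_counts, key=lambda pair: pair[1], reverse=True)

# list.sort() works in place; here alphabetically by the 1st element.
keyword_counts.sort(key=lambda pair: pair[0])

print(by_count)
print(keyword_counts)
```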

Twython, getting followers
http://stackoverflow.com/questions/19432202/twython-get-followers-list
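Follower lists come back in cursored pages (start at cursor -1, stop when next_cursor is 0), so the paging loop can be sketched without touching the network. Here fetch_page is a hypothetical callable; with Twython it might wrap something like twitter.get_followers_ids(screen_name="Lakers", cursor=cursor), but that wiring is an assumption and is not exercised below.

```python
def collect_follower_ids(fetch_page, max_pages=5):
    """Walk cursored pages of follower ids until the cursor runs out.

    fetch_page(cursor) must return a dict with "ids" and "next_cursor".
    max_pages caps the number of requests to stay under rate limits.
    """
    ids = []
    cursor = -1  # cursored endpoints start at -1
    for _ in range(max_pages):
        page = fetch_page(cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
        if cursor == 0:  # 0 means there are no further pages
            break
    return ids

# Stand-in for the real API: two fake pages keyed by cursor.
FAKE_PAGES = {
    -1: {"ids": [1, 2], "next_cursor": 7},
    7: {"ids": [3], "next_cursor": 0},
}
print(collect_follower_ids(lambda cursor: FAKE_PAGES[cursor]))  # [1, 2, 3]
```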


Saturday, March 15, 2014

Finding the right Twitter users

From users who follow the Lakers, get a subset (how? what criteria? random? how many?)

From these, get their tweets, and do the keyword statistics to get the most common words. This will be our golden set.

We want to find the "active" followers of the Lakers.

Algorithm:
Get the list of followers of @Lakers.
Find users with the most followers themselves.
How do we determine "active"? Candidates: most followers, most tweets, most recent activity.
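One way to combine those three signals, as a sketch only: drop users who have not tweeted recently, then rank the rest by follower count with tweet count as a tie-breaker. The field last_tweet_at, the 30-day cutoff, and the sample users are assumptions; followers_count and statuses_count mirror the names on Twitter user objects.

```python
from datetime import datetime, timedelta

def active_followers(users, now, max_idle_days=30, top_n=100):
    """Keep users who tweeted within max_idle_days, ranked by followers then tweets."""
    cutoff = now - timedelta(days=max_idle_days)
    recent = [u for u in users if u["last_tweet_at"] >= cutoff]
    recent.sort(key=lambda u: (u["followers_count"], u["statuses_count"]), reverse=True)
    return recent[:top_n]

# Invented users for illustration.
users = [
    {"screen_name": "a", "followers_count": 500, "statuses_count": 1000,
     "last_tweet_at": datetime(2014, 3, 14)},
    {"screen_name": "b", "followers_count": 9000, "statuses_count": 20,
     "last_tweet_at": datetime(2013, 1, 1)},
    {"screen_name": "c", "followers_count": 800, "statuses_count": 300,
     "last_tweet_at": datetime(2014, 3, 1)},
]
ranked = active_followers(users, now=datetime(2014, 3, 15))
print([u["screen_name"] for u in ranked])  # "b" is dropped as idle
```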


Tuesday, March 4, 2014

Current design strategy

Get the rooting interest of the person who Tweeted.

Get the most frequently referenced team names, player names, etc.
Those are strong indicators of passion for a team.

Word associations
Lakers - Kobe Bryant
Lakers - Lakeshow
Lakers - lake show
Lakers - showtime
etc.

Some of these are basically synonyms, but that might not be worth exploring.

---
Another way to get a list of good users is to take the followers of certain Twitter accounts like @NFL, @NBA, @Lakers, etc.

---
Algorithm:

For each team:
Find the keywords that are related to the team
Team name, acronym, nicknames
Also get the most commonly associated keywords
(which might include player names)

For the followers of each team (like @Lakers, @Clippers, etc)
Find the followers who mention the keywords the most (maybe 100 or so)
   For each follower
   Get a collection of tweets
      Perhaps a certain number
      Perhaps a collection of tweets around the time of a game event
      See how many of the keywords associated with each team they use
     
      Have a confidence score for each team that this person might root for.
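A rough sketch of the per-team confidence score, assuming it is simply each team's share of all keyword hits in a follower's tweets. The keyword sets below are illustrative (the Clippers set in particular is a placeholder), and multi-word nicknames like "lake show" would need phrase matching that this does not do.

```python
import re
from collections import Counter

# Illustrative keyword sets; real ones would come from the keyword step.
TEAM_KEYWORDS = {
    "Lakers": {"lakers", "kobe", "lakeshow", "showtime"},
    "Clippers": {"clippers"},
}

def team_confidence(tweets, team_keywords=TEAM_KEYWORDS):
    """Return {team: share of keyword hits} for one user's tweets."""
    hits = Counter()
    for text in tweets:
        tokens = re.findall(r"\w+", text.lower())
        for team, words in team_keywords.items():
            hits[team] += sum(1 for t in tokens if t in words)
    total = sum(hits.values())
    if total == 0:
        # No keyword mentions at all: no evidence for any team.
        return {team: 0.0 for team in team_keywords}
    return {team: hits[team] / total for team in team_keywords}

tweets = ["Kobe and the Lakers tonight", "showtime!", "Clippers won"]
print(team_confidence(tweets))  # {'Lakers': 0.75, 'Clippers': 0.25}
```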

Collect the highest scoring individuals for a certain team
   Manually read their Twitter feeds, determine if they are a fan of the team or not.

This generates a statistic: how many of the people our system found were actually fans of the given team?

Then, to compare:
From the followers of a certain team, randomly pick users.
Manually inspect these users, determine if they are a fan of the team they follow.

Get the same statistic for this set.

Compare statistics.
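The comparison might be sketched like this, assuming manual labels of "yes", "no", or "unknown" and that "unknown" is left out of the denominator (a design choice, not something we have decided). The follower names and label lists are invented.

```python
import random

def pick_baseline(followers, k, seed=0):
    """Randomly pick k followers for the manual-inspection baseline."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(followers, k)

def fan_rate(labels):
    """Share of 'yes' among decided labels; 'unknown' is excluded."""
    decided = [label for label in labels if label in ("yes", "no")]
    if not decided:
        return 0.0
    return decided.count("yes") / len(decided)

followers = ["u1", "u2", "u3", "u4", "u5", "u6"]
baseline_users = pick_baseline(followers, k=3)

# Made-up manual labels for the two groups.
system_labels = ["yes", "yes", "no", "yes", "unknown"]
random_labels = ["yes", "no", "no", "unknown", "no"]
print(fan_rate(system_labels), fan_rate(random_labels))  # 0.75 0.25
```

If the system's rate is not clearly higher than the random baseline's, the keyword scoring is not adding much over simply following the team account.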

Some design thoughts

Tomer touched on this during our presentation. We need to figure out the exact schema for our database.

Currently:
Id, Team 1, Team 2, Score 1, Score 2

Add:
User name
Time
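As a starting point for the schema discussion, here is a sketch using an in-memory SQLite database. The table name, column names, and column types are all guesses (in particular, whether the scores are numeric confidences or something else is still open), and the inserted row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tweet_scores (
        id INTEGER PRIMARY KEY,
        team1 TEXT NOT NULL,
        team2 TEXT NOT NULL,
        score1 REAL,        -- assumed numeric
        score2 REAL,        -- assumed numeric
        user_name TEXT,     -- proposed addition
        tweeted_at TEXT     -- proposed addition, stored as an ISO 8601 string
    )
    """
)
conn.execute(
    "INSERT INTO tweet_scores (team1, team2, score1, score2, user_name, tweeted_at) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("Lakers", "Clippers", 0.75, 0.25, "example_user", "2014-03-04T19:30:00"),
)
row = conn.execute("SELECT user_name, score1, tweeted_at FROM tweet_scores").fetchone()
print(row)
```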


---
Other design decisions

How do we determine if a person is a fan of a certain team?

1. The person mentions one team more than any other team.

This probably indicates strong emotions. However, it might be possible for someone to tweet a lot about a team that they hate.

With this information in hand, we can analyze the moods of tweets and cross reference them with actual events.

----
Ideas

1. Predicting which athlete composed the tweets?
Doesn't seem that interesting
2. Predicting what a certain person's tweets would be, or creating a tweet someone would write?
Already done with tofu_product
3. Gambling line prediction
Analyzing the Twitter world to figure out the best gambling line.
Doesn't really have a strong NLP aspect to it.
Regardless, a seemingly cool idea.
https://www.cs.cmu.edu/~nasmith/papers/sinha+dyer+gimpel+smith.mlsa13.pdf
4.