Learning
To begin, the algorithm gathers a set of Tweets using a simple heuristic, such as whether they use the #Lakers hashtag. From this initial set, we generate a list of the highest-frequency keywords (discarding common words like "the", "and", etc.).
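A minimal sketch of that keyword step, assuming the Tweets are already available as plain strings. The stop-word list and the name `top_keywords` are placeholders for illustration, not the project's actual code.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real run would use a much longer one.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "for", "on", "at", "it"}

def top_keywords(tweets, n=10):
    """Return the n most frequent non-stop-words across a list of tweet strings."""
    counts = Counter()
    for text in tweets:
        # Lowercase, and keep a leading # or @ so hashtags and mentions survive.
        for word in re.findall(r"[#@]?\w+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)

sample = [
    "Kobe and the #Lakers at home tonight",
    "The #Lakers need Kobe back",
    "Lakeshow! #Lakers",
]
print(top_keywords(sample, 2))
```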
During the initial run of the algorithm, we find potential fans of a team based on things like:
Following the official @Lakers Twitter account
Following the official NBA account
Among these users, we match each user's Tweets against our keyword list using cosine similarity. We hope this finds users who are more likely to be Lakers fans than a simpler method, such as picking followers at random, would find.
Unfortunately, we can't think of a rigorous and automated way to determine whether a user is truly a Lakers fan. The best way would be to ask the person directly, but that would be extremely time-consuming and unlikely to get responses. The next best thing seems to be to manually inspect a user's profile and Tweets and label them yes, no, or unknown.
After this run, we would like to see what this new set of highly-likely-to-be fans says (in their Tweets) about their team. If we run the keyword frequency generator again on those Tweets, we may find an even more precise and accurate set of keywords to match on. Armed with such a list, we could rerun the algorithm (probably on a different subset of @Lakers followers).
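For reference, cosine similarity over word counts can be sketched as below. This assumes both the keyword list and a user's Tweets are represented as bag-of-words counts; the sample numbers are made up.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical keyword weights and one user's words.
keywords = Counter({"lakers": 3, "kobe": 2, "lakeshow": 1})
user_words = Counter("go lakers kobe is back lakers".split())

score = cosine_similarity(keywords, user_words)
print(round(score, 3))
```

A score near 1 means the user's vocabulary lines up closely with the keyword list; a score of 0 means no overlap at all.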
Presumably, after the first round of manual curation, successive rounds should yield better results. We can keep a history of the keyword changes, and have some ranking algorithm that keeps the best ones and gets rid of the bad ones.
Tuesday, March 18, 2014
Some python help
Using the "sort" and "sorted" functions
http://stackoverflow.com/questions/12791923/using-sorted-in-python
http://stackoverflow.com/questions/2909652/how-to-sort-a-list-by-the-2nd-tuple-element-in-python-and-c-sharp
https://wiki.python.org/moin/HowTo/Sorting
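A quick reminder of the two forms, using made-up word counts. `sorted` returns a new list, while `list.sort` sorts in place.

```python
word_counts = [("kobe", 12), ("lakers", 30), ("showtime", 4)]

# sorted() returns a new list; key picks the second tuple element,
# and reverse=True gives descending order.
by_count = sorted(word_counts, key=lambda pair: pair[1], reverse=True)

# list.sort() sorts the existing list in place and returns None.
word_counts.sort(key=lambda pair: pair[0])

print(by_count)
print(word_counts)
```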
Twython, getting followers
http://stackoverflow.com/questions/19432202/twython-get-followers-list
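Follower lists come back one page at a time, with a cursor pointing at the next page. Below is a sketch of the paging loop only, written against a stand-in function so it runs without credentials. With Twython, `fetch_page` would presumably be the client's `get_followers_list` method; that wiring is an assumption and is not shown. The response shape (`users`, `next_cursor`, with a cursor of 0 meaning "no more pages") follows Twitter's cursored endpoints.

```python
def collect_followers(fetch_page, screen_name, max_users=1000):
    """Walk cursored pages until the cursor is 0 or max_users is reached."""
    users, cursor = [], -1  # -1 asks for the first page
    while cursor != 0 and len(users) < max_users:
        page = fetch_page(screen_name=screen_name, cursor=cursor, count=200)
        users.extend(page["users"])
        cursor = page["next_cursor"]
    return users[:max_users]

def fake_fetch_page(screen_name, cursor, count):
    """Stand-in for a real API call: two canned pages of followers."""
    pages = {
        -1: {"users": [{"screen_name": "fan1"}, {"screen_name": "fan2"}], "next_cursor": 7},
        7: {"users": [{"screen_name": "fan3"}], "next_cursor": 0},
    }
    return pages[cursor]

followers = collect_followers(fake_fetch_page, "Lakers")
print([u["screen_name"] for u in followers])
```

A real run would also have to respect the API's rate limits, which this loop ignores.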
Saturday, March 15, 2014
Finding the right twitter users
From users who follow the Lakers, get a subset (how? what criteria? random? how many?)
From these, get their tweets, and do the keyword statistics to get the most common words. This will be our golden set.
We want to find the "active" followers of the Lakers.
Algorithm:
Get the list of followers of @Lakers.
Find users with the most followers themselves.
How do we determine "active"? Candidates: most followers, most tweets, most recent activity.
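One way to combine those three signals into a single ranking, purely as a sketch. The formula and the `days_since_last_tweet` field are assumptions for illustration; `followers_count` and `statuses_count` are the field names Twitter uses on user objects.

```python
import math

def activity_score(user):
    """Higher is more active: log-damped follower and tweet counts, discounted by staleness."""
    recency = 1.0 / (1.0 + user["days_since_last_tweet"])
    volume = math.log1p(user["followers_count"]) + math.log1p(user["statuses_count"])
    return volume * recency

# Made-up users for illustration.
users = [
    {"name": "a", "followers_count": 5000, "statuses_count": 20000, "days_since_last_tweet": 0},
    {"name": "b", "followers_count": 100000, "statuses_count": 50, "days_since_last_tweet": 300},
    {"name": "c", "followers_count": 200, "statuses_count": 3000, "days_since_last_tweet": 2},
]

ranking = [u["name"] for u in sorted(users, key=activity_score, reverse=True)]
print(ranking)
```

The logarithm keeps one account with a huge follower count from dominating, and the recency factor pushes dormant accounts to the bottom.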
Tuesday, March 4, 2014
Current design strategy
Get the rooting interest of the person who Tweeted.
Get the most frequently referenced team names, player names, etc.
Those are strong indicators of passion for a team.
Word associations
Lakers - Kobe Bryant
Lakers - Lakeshow
Lakers - lake show
Lakers - showtime
etc.
Some of these are basically synonyms, but that might not be worth exploring.
---
Another way to get a list of good users is to take the followers of certain Twitter accounts like @NFL, @NBA, @Lakers, etc.
---
Algorithm:
For each team:
Finding the keywords that are related to a team
Team name, acronym, nicknames
Also get the most commonly associated keywords
(which might include player names)
For the followers of each team (like @Lakers, @Clippers, etc)
Find the followers who mention the keywords the most (maybe around 100)
For each follower
Get a collection of tweets
Perhaps a certain number
Perhaps a collection of tweets around the time of a game event
See how many of the keywords associated with each team they use
Have a confidence score for each team that this person might root for.
Collect the highest scoring individuals for a certain team
Manually read their Twitter feeds, determine if they are a fan of the team or not.
This generates a statistic: how many of the people our system found were actually fans of the given team?
Then, to compare:
From the followers of a certain team, randomly pick users.
Manually inspect these users, determine if they are a fan of the team they follow.
Get statistic for this set.
Compare statistics.
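The scoring and comparison steps above can be sketched as follows. This is one possible reading of "confidence score" (each team's share of a user's keyword hits), not a settled design. It only handles single-word keywords, so phrases like "lake show" would need extra matching logic. The keyword sets and labels are made up.

```python
def team_confidence(tweets, team_keywords):
    """For one user, return each team's share of all keyword hits in their tweets."""
    hits = {team: 0 for team in team_keywords}
    for text in tweets:
        words = text.lower().split()
        for team, keywords in team_keywords.items():
            hits[team] += sum(1 for w in words if w in keywords)
    total = sum(hits.values())
    return {team: (n / total if total else 0.0) for team, n in hits.items()}

def precision(labels):
    """Share of manually inspected users labeled 'yes'; 'no' and 'unknown' both count against."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "yes") / len(labels)

team_keywords = {
    "Lakers": {"lakers", "kobe", "lakeshow"},
    "Clippers": {"clippers", "lobcity"},
}
scores = team_confidence(["Kobe is back", "lakers lakers lakers", "clippers who"], team_keywords)

# Hypothetical manual labels for the two groups being compared.
system_precision = precision(["yes", "yes", "no", "yes"])
random_precision = precision(["yes", "no", "unknown", "no"])
print(scores, system_precision, random_precision)
```

If the system's precision is clearly higher than the random baseline's, the keyword matching is adding something beyond "follows the team account".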
Some design thoughts
Tomer touched on this during our presentation. We need to figure out the exact schema for our database.
Currently:
Id, Team 1, Team 2, Score 1, Score 2
Add:
User name
Time
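As a sketch, the extended table could look like this in sqlite. The column names and types are guesses based on the field list above, and the inserted row is made-up sample data.

```python
import sqlite3

# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Tweets (
        id        INTEGER PRIMARY KEY,
        team1     TEXT,
        team2     TEXT,
        score1    INTEGER,
        score2    INTEGER,
        user_name TEXT,   -- new: who tweeted
        time      TEXT    -- new: when, stored as an ISO 8601 string
    )
""")
conn.execute(
    "INSERT INTO Tweets (team1, team2, score1, score2, user_name, time) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("Lakers", "Clippers", 101, 98, "example_user", "2014-03-04T19:30:00"),
)
row = conn.execute("SELECT team1, score1, user_name FROM Tweets").fetchone()
print(row)
```

A fixed Team 1 / Team 2 layout cannot represent a tweet about three teams or about players only, so this shape may not survive those cases.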
---
Other design decisions
How do we determine if a person is a fan of a certain team?
1. The person mentions one team more than any other team.
This probably indicates strong emotions. However, it might be possible for someone to tweet a lot about a team that they hate.
With this information in hand, we can analyze the moods of tweets and cross reference them with actual events.
----
Ideas
1. Predicting which athlete composed the tweets?
Doesn't seem that interesting
2. Predicting what a certain person's tweets would be or
creating a tweet someone would write?
Already done with tofu_product
3. Gambling line prediction
Analyzing the twitter world to figure out the best gambling line.
Doesn't really have a strong NLP aspect to it.
Regardless, a seemingly cool idea.
https://www.cs.cmu.edu/~nasmith/papers/sinha+dyer+gimpel+smith.mlsa13.pdf
4.
Thursday, February 27, 2014
We presented today
It went pretty well. I didn't get asked any tough questions regarding the Watson presentation.
Dyer was less happy about our project presentation though. He was especially upset when we mentioned "sentiment analysis", saying that it was low hanging fruit that was easy and not interesting. Generally I tend to agree. I figured we'd use the sentiment analysis as one aspect of our system.
Sentiment - happy
Likely rooting interest - winning
Increases evidence of rooting interest
Sentiment - happy
Likely rooting interest - losing
Sarcasm?
Team is tanking? (losing on purpose for better draft picks next year)
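Written as a lookup, the two cases above look like this. The labels and the function name are my own shorthand, and every combination the notes don't cover is left explicitly undecided.

```python
def rooting_evidence(sentiment, team_result):
    """Map (tweet sentiment, how the presumed team is doing) to a rough evidence label."""
    if sentiment == "happy" and team_result == "winning":
        return "supports"   # happy while the team wins: more evidence of rooting interest
    if sentiment == "happy" and team_result == "losing":
        return "ambiguous"  # could be sarcasm, or a fan pleased that the team is tanking
    return "undecided"      # combinations not covered in the notes above

print(rooting_evidence("happy", "winning"))
print(rooting_evidence("happy", "losing"))
```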
Tuesday, February 25, 2014
More database updates, researching other papers
I updated some of the code for the database API, where it now supports getting tweets based on the "tweetId" field and the team name fields.
Actually, I may not have completely finished it, but it's about 99% there. I still have to make sure I'm returning the object from the function.
I wanted to start working on our truth data in regards to scores. Apparently ESPN has APIs now, but they don't expose the one for scores, presumably because it costs them money and there are licensing issues and what not. Looking for other sources, I didn't find anything. There were some paid things, but that's not happening.
I found some interesting and somewhat on topic papers regarding Twitter analysis and live sports events. In particular, I've started reading a paper from 2011 by some people at Rice University. One of their basic mechanisms involves using a sliding window of Tweets. They detect when a new event occurs by measuring the rate at which the Tweets are coming in, comparing it to the beginning of the sliding window. I haven't finished reading the paper, but they also do some lexicon analysis I think. The same group also attempted to do some sentiment analysis as a follow on. This could all be useful.
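To make the sliding-window idea concrete, here is a rough sketch of rate-based detection. It is my own simplification rather than the paper's exact method: it splits each window in half and flags a burst when the second half holds several times as many Tweets as the first. The window length, ratio, and minimum count are arbitrary.

```python
from bisect import bisect_left

def detect_bursts(timestamps, window=60.0, ratio=3.0, min_tweets=5):
    """Return the times (window ends) at which the tweet count in the second half of
    the window is at least `ratio` times the count in the first half.
    `timestamps` must be sorted, in seconds."""
    bursts = []
    for i, end in enumerate(timestamps):
        lo = bisect_left(timestamps, end - window)        # first tweet inside the window
        mid = bisect_left(timestamps, end - window / 2)   # first tweet in the second half
        first_half = mid - lo
        second_half = i + 1 - mid
        if second_half >= min_tweets and second_half >= ratio * max(first_half, 1):
            bursts.append(end)
    return bursts

# Synthetic data: one tweet every 10 s for 5 minutes, then one every 0.5 s.
steady = [10.0 * k for k in range(30)]
burst = [300.0 + 0.5 * k for k in range(20)]

print(detect_bursts(steady))
print(detect_bursts(steady + burst)[:3])
```

On the steady stream nothing is flagged; once the rate jumps, flags start a few seconds into the burst.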
Some links:
http://arxiv.org/pdf/1106.4300v1.pdf
http://ceur-ws.org/Vol-720/Zhao.pdf
As an aside, there doesn't seem to be anything out there that tries to turn Tweets into actual score information. Of course there's no guarantee such information would be accurate, but it would probably be pretty good if the system were designed with the right flexibility.
---
Tomer got a lot of the mechanics of the Tweet parser tool he found working. He's ready to start shoving stuff into the database. This led to some discussion about the database schema. Things like:
What if someone tweets about more than two teams?
No teams, only players?
etc
---
It looks like we'll do our presentations on Thursday.
Friday, February 21, 2014
Lots of news to cover
We switched from python-twitter to twython. It worked out of the box, and Tomer quickly generated a sample script to use the API. This was around Super Bowl time, so it was fun to look at tweets making fun of the Broncos.
I created a word frequency counter fairly trivially. At the moment it might not be the most useful thing.
We also have a text file with all the NBA team names, their cities, the three letter acronym, etc. It's a good start for searching for relevant tweets.
We've put all relevant code on my github page:
https://github.com/lawrencechang/twitter-events
I'm a bit worried because I have my Twitter API keys in the files in plain text. I might move them to a separate file and not upload that file.
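A sketch of the separate-file approach: keep the keys in a JSON file that never gets committed, and load it at startup. The file name and field names here are my own choice for illustration, and the demo writes placeholder values to a throwaway file.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("app_key", "app_secret", "oauth_token", "oauth_token_secret")

def load_credentials(path):
    """Read API credentials from a JSON file that is kept out of version control."""
    with open(path) as f:
        creds = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in creds]
    if missing:
        raise KeyError("missing credential fields: " + ", ".join(missing))
    return creds

# Demo with a throwaway file and placeholder values (not real keys).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "credentials.json")
    with open(path, "w") as f:
        json.dump({k: "placeholder" for k in REQUIRED_KEYS}, f)
    creds = load_credentials(path)

print(sorted(creds))
```

Listing the credentials file in .gitignore keeps it out of the repository. Keys that were already pushed should be regenerated, since they remain in the history.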
Presently I'm working on creating a database (using python and sqlite) schema and API to store the putative facts from the tweets. For example, if a tweet talks about the Lakers winning, they'll get an entry into the table. Later analysis will compare these "facts" to real truth data. By verifying or refuting, we hope to glean some information about the author of the tweet.
In terms of creating a middleware-style API, I'm having difficulty figuring out the right way to do it. I want to make manipulating the database as brainless as possible. I have create and delete functions, which create or delete a pre-defined table (called Tweets) in a database of your choice (by default, default.db, a file that will be saved to your working directory). However, once you've already created and started working with your database and table, I'm not sure how best to "connect" to it again at my API level. My next blog post will probably talk about what I did.
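One possible shape for this, as a sketch rather than what I ended up doing: wrap the connection in a small class, and make table creation idempotent so that "connecting again" is just opening the same file. The class name, method names, and columns are all hypothetical.

```python
import os
import sqlite3
import tempfile

class TweetStore:
    """Thin wrapper around a sqlite file holding the Tweets table."""

    def __init__(self, path="default.db"):
        # Opens the file if it exists, creates it otherwise.
        self.conn = sqlite3.connect(path)

    def create_table(self):
        # IF NOT EXISTS makes this safe to call on every start-up.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS Tweets ("
            "id INTEGER PRIMARY KEY, team1 TEXT, team2 TEXT, "
            "score1 INTEGER, score2 INTEGER)"
        )
        self.conn.commit()

    def delete_table(self):
        self.conn.execute("DROP TABLE IF EXISTS Tweets")
        self.conn.commit()

    def add(self, team1, team2, score1, score2):
        self.conn.execute(
            "INSERT INTO Tweets (team1, team2, score1, score2) VALUES (?, ?, ?, ?)",
            (team1, team2, score1, score2),
        )
        self.conn.commit()

    def count(self):
        return self.conn.execute("SELECT COUNT(*) FROM Tweets").fetchone()[0]

    def close(self):
        self.conn.close()

# Demo in a temporary directory: write a row, close, reopen, and the row is still there.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "default.db")
    store = TweetStore(path)
    store.create_table()
    store.add("Lakers", "Clippers", 101, 98)
    store.close()

    reopened = TweetStore(path)
    reopened.create_table()  # no-op, the table already exists
    rows_after_reopen = reopened.count()
    reopened.close()

print(rows_after_reopen)
```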
I imagine truth data being gathered in two ways. First, and most obvious, is to use a reputable source like ESPN or NBA or Yahoo, scraping their sites. The other idea was to do sort of a popularity regression on the facts table. The more popular a fact is, the more likely it is to be true. Hopefully.
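The popularity idea, in its simplest form, is a plurality vote over the claimed facts. The sketch below swaps an actual regression for a plain count, and the claims are made up.

```python
from collections import Counter

def most_popular_fact(claims):
    """Return the most frequently claimed fact and the share of claims that agree with it."""
    counts = Counter(claims)
    fact, votes = counts.most_common(1)[0]
    return fact, votes / len(claims)

# Each claim: (team1, team2, score1, score2) as extracted from one tweet.
claims = [
    ("Lakers", "Clippers", 101, 98),
    ("Lakers", "Clippers", 101, 98),
    ("Lakers", "Clippers", 110, 98),
    ("Lakers", "Clippers", 101, 98),
]
fact, support = most_popular_fact(claims)
print(fact, support)
```

The support value is only a rough confidence: a popular joke or a widely retweeted mistake would win the vote just as easily.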
Sunday, February 2, 2014
Starting off
Tomer and I are working on a project that will take Twitter feeds and extract event information from them.
We decided to use one of many existing Twitter API wrappers for python, this one called python-twitter: https://github.com/bear/python-twitter
The instructions described installation using pip, which I've heard of before. I decided to suck it up and go through linux-installing-stuff hell and see how it went. As a side note, I'm using UCLA's SEAS linux server, so I don't have admin privileges to install whatever I want wherever I want. This is a good restriction and provides a nice learning opportunity about using linux systems the right way.
I downloaded pip from here: https://pypi.python.org/pypi/pip#downloads . Because I couldn't just install pip on the global packages location, I had to install it in my local user account. This article helped explain it, particularly the section about using the --user option: http://docs.python.org/2/install/
I think the command I ran was something like this (in the pip folder):
python setup.py install --user
(I might have had to build first, python setup.py build)
It put itself in my home directory, under .local/bin. I added this to my path in my .profile file.
export PATH=~/.local/bin/:$PATH
I was able to run the instructions on python-twitter's page, which talked about installing the necessary software using a requirements.txt file. When doing this, I also added the --user option.
Finally, I built and installed python-twitter, using the --user option of course.
I ran both the test scripts. Most of the tests seem to output an ERROR or FAIL. One test, GetStatus, does seem to pass. The test run also takes over 6 minutes.
In addition, I'm unable to follow the simple API calls. I'm getting errors that read "AttributeError: 'Api' object has no attribute '_Api__auth'". Do I need to enter some credentials before using this?