Learning
To begin, the algorithm gathers a set of Tweets using a simple heuristic, such as whether they contain the #Lakers hashtag. From this initial set, we generate a list of the highest-frequency keywords (discarding common words like "the", "and", etc.).
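The keyword step can be sketched roughly as follows. This is only a minimal illustration, not the actual implementation: the function name `top_keywords`, the tiny `STOP_WORDS` set, and the simple regex tokenizer are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would need a fuller one.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "for", "on", "at", "it"}

def top_keywords(tweets, n=10):
    """Return the n most frequent non-stop-word tokens as (word, count) pairs."""
    counts = Counter()
    for text in tweets:
        # Lowercase and keep runs of letters, digits and apostrophes,
        # so "#Lakers" and "Lakers" both count as "lakers".
        for token in re.findall(r"[a-z0-9']+", text.lower()):
            if token not in STOP_WORDS:
                counts[token] += 1
    return counts.most_common(n)

sample = ["Kobe and the Lakers win #Lakers", "Lakers game tonight"]
print(top_keywords(sample, n=3))
```

Note that this tokenizer is naive: it would also split URLs and @mentions into pieces, so those would probably need to be filtered out before counting.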
During the initial run of the algorithm, we find potential fans of a team based on things like:
Following the official @Lakers Twitter account
Following the official NBA account
We then match each of these users against our keyword list using cosine similarity. We hope the users who score highest are more likely to be Lakers fans than those a simpler method, such as follower status alone, would find.
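As a rough sketch of the matching step, assume each user is represented by the words from their Tweets and the keyword list by a word-to-weight mapping. Both representations, and the names `cosine_similarity` and `rank_users`, are assumptions for illustration only.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts of word -> weight."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction, so treat it as no match
    return dot / (norm_a * norm_b)

def rank_users(user_tokens, keyword_weights):
    """Score every user against the keyword list, highest similarity first."""
    scores = {
        user: cosine_similarity(Counter(tokens), keyword_weights)
        for user, tokens in user_tokens.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data: keyword weights and already-tokenized Tweets per user.
keywords = {"lakers": 3, "kobe": 1}
users = {
    "fan_1": ["lakers", "kobe", "lakers"],
    "casual": ["nba", "playoffs", "lakers"],
    "other": ["celtics", "boston"],
}
for user, score in rank_users(users, keywords):
    print(user, round(score, 3))
```

Cosine similarity only compares the direction of the two vectors, so a user who Tweets a lot is not favored over one who Tweets a little, as long as the mix of words is similar.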
Unfortunately, we can't think of a rigorous and automated way to determine whether a user is truly a Lakers fan. The best way to find out would be to ask the person directly, but that would be extremely time-consuming and unlikely to get many responses. The next best thing seems to be to manually inspect a user's profile and Tweets, and label them yes, no, or unknown.
After this run, we would like to see what this new set of highly likely fans says (in their Tweets) about their team. If we run the keyword frequency generator again on those Tweets, we may find an even more precise and accurate set of keywords to match on. Armed with such a list, we could rerun the algorithm (probably on a different subset of @Lakers followers).
Presumably, after the first round of manual curation, successive rounds should yield better results. We can keep a history of the keyword changes, and use some ranking algorithm that keeps the best keywords and discards the bad ones.
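One possible way to rank keywords, purely as an assumption since no specific ranking algorithm has been chosen, is to score each keyword by the share of manually confirmed fans among the labeled users who used it, then drop keywords that fall below a threshold. The names `keyword_precision` and `prune_keywords` and the 0.5 threshold are hypothetical.

```python
from collections import defaultdict

def keyword_precision(user_keywords, labels):
    """For each keyword, the fraction of judged users (yes/no) using it who were labeled 'yes'."""
    confirmed = defaultdict(int)
    judged = defaultdict(int)
    for user, used in user_keywords.items():
        label = labels.get(user, "unknown")
        if label == "unknown":
            continue  # unknown users tell us nothing either way
        for keyword in set(used):
            judged[keyword] += 1
            if label == "yes":
                confirmed[keyword] += 1
    return {keyword: confirmed[keyword] / judged[keyword] for keyword in judged}

def prune_keywords(keywords, precision, threshold=0.5):
    """Keep keywords that meet the threshold, or that have no judged evidence yet."""
    return [k for k in keywords if k not in precision or precision[k] >= threshold]

# Hypothetical results of one round of manual curation.
labels = {"u1": "yes", "u2": "no", "u3": "yes", "u4": "unknown"}
used = {"u1": ["kobe", "game"], "u2": ["game"], "u3": ["kobe"], "u4": ["staples"]}
precision = keyword_precision(used, labels)
history = [["kobe", "game", "staples"]]            # keyword list from round 1
history.append(prune_keywords(history[-1], precision, threshold=0.6))
print(history[-1])
```

Appending each round's list to `history` keeps the record of keyword changes, so a keyword that was dropped too early can be spotted and restored later.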