Archive for March, 2012
We’re happy to announce a new TweetReach Pro plan level for our larger enterprise, agency and media customers – TweetReach Pro Ultimate! This plan level is perfect for anyone managing multiple products, clients or accounts.
Our most comprehensive and personalized plan level, TweetReach Pro Ultimate comes with:
- 50 Trackers
- Access to TweetReach Back, our 30-day complete historical archive
- A dedicated account manager to help you get exactly the data you need
- Unlimited snapshot reports
- Unlimited users and projects
- API access
With 50 Trackers in your account, you'll be able to monitor tweets about all of your campaigns, clients, products and events in real time. Each Tracker can monitor unlimited tweets about your topic, using up to 20 distinct search queries to be sure we're finding all relevant tweets.
TweetReach Back is our new historical analytics option. If you missed an important event or weren't able to set up a Tracker before campaign tweets went out, we can go back up to 30 days and analyze all tweets about your topic. This is a more comprehensive option than our simple snapshot report, with no tweet limits and the same in-depth metrics you see in a Tracker. Ultimate subscribers have access to up to 24 hours of TweetReach Back analysis each month.
A dedicated account manager will be available to answer all of your questions, from setting up tweet tracking, to interpreting metrics, to helping you improve next time.
We recently began rolling out a new look – and some new metrics – in our snapshot reports. As of today, all snapshot reports are now in the new format. Isn’t it so much nicer?
There's more information about the new snapshot report below, including a few frequently asked questions and an explanation of the metrics and our calculations. Or, skip all that and run a new and improved report right now!
How much does the new report cost?
As always, the quick snapshot report (up to 50 tweets) is free. The full snapshot report (up to 1500 tweets from the past week) is $20. The price has not changed.
How is the new report different from the old report?
First, it looks different. Way different and way better. Second, we’ve added some new metrics (details on those below). We’ve moved a few things around, but we haven’t removed anything from the old version of the snapshot report. The new version is just smarter and prettier than ever before.
What new metrics are included in the new report?
There are three major new sections in the new version of the TweetReach snapshot report. They are the Activity, Top Contributors and Top Tweets sections, explained below. There’s a more detailed explanation of all the report metrics here.
- Activity provides details about the tweets in this report, including a graphical timeline of when tweets were posted (times shown in UTC).
- Top Contributors shows you the top three contributors – participants whose tweets appear in this report. You’ll see the highest contributor for each of three influence dimensions: highest exposure, most retweeted, and most mentioned.
- Top Tweets shows the three most retweeted tweets in this report, with retweet counts for each.
Can I still see the old version of the report?
Yes, you can still access the old version. There’s a View Old Version link in the top right corner of the report.
So, how do I get one of these new reports?
Just go to TweetReach.com and give it a try. Run a new TweetReach report for free right now!
We’re excited to announce a new feature for TweetReach Pro subscribers – projects! Projects enable account holders to selectively share Trackers with their clients and colleagues, support multiple campaigns with one Pro subscription, and easily manage multiple users’ access.
You can use projects to:
- Group related Trackers and snapshot reports together
- Share select Trackers with clients or colleagues
- Manage user access and permissions
- Create guest access for one or more Trackers
Here in the United States, we’re right in the middle of the Republican primaries as the country tries to decide who the GOP nominee for President will be in our election later this year. One of the more interesting conversations around the 2012 Presidential election is the relationship between what people say on Twitter and what they do at the polls. Can we use Twitter conversations to predict election winners? Or, if they can’t predict results, what can tweets tell us about how potential voters feel about the candidates?
With Super Tuesday approaching and the GOP candidate field still wide open, we’ve been tracking tweets about the six top candidates for the Republican Presidential nomination since January 1 – Newt Gingrich, Jon Huntsman, Ron Paul, Rick Perry, Mitt Romney, and Rick Santorum. From those tweets, we built an interactive visualization of how Twitter talks about the GOP candidates, and how that relates to poll numbers over time.
Check out our interactive Republican primary Twitter tracker here or click on the screenshot below.
To create this visualization, we’re using a set of TweetReach Pro Trackers to track Twitter conversation about each of the candidates, along with our API to update the visualization daily. In the visualization, we’ve mapped the number of unique Twitter users talking about a candidate to the y-axis, polling results to the x-axis, and tweet volume to the circle radius. Polling data is from RealClearPolitics.
This post is by Jerry Chen, our Lead Engineer. Look for more in-depth technical posts like these in our TweetReach Tech category.
And of course, we did! Check out the TweetReach Academy Awards Explorer.
On the other end of the stack, we were revisiting Apache Cassandra. Since we last took a look at the datastore, it graduated from the Incubator, got counters, hit a 1.0 version milestone, and continued to capture the hearts (and columns) of millions. We knew our chart data would be broken down by a time component, so this project would be a great fit for Cassandra.
After a few sketches about what to show and how to show it, we decided to capture tweets containing any mention of the Oscars, and then break them down by a few categories and nominees. For each minute we would measure the volume about a particular nominee, and provide a slider so the user could view the exact volume at a particular minute in time.
But first, how is datta formed?
From the beginning
Our journey begins, as with many things on the Internet, with text. We wrote Flamingo to consume the Twitter Streaming API (and, later on, Gnip PowerTrack). Incoming tweets get appended to an event log, and, optionally, Resque jobs are scheduled based on subscriptions. Normally we use the latter for our larger pipeline (which includes search, OLAP, contributor and reach calculations), but for this special project we fork the events log and stream it to a separate server.
For moving log files around, there's Apache Flume, Facebook Scribe, and maybe even time-tested syslog (here's a great post by Urban Airship), but in the spirit of getting the job done, we can get away with tailing over SSH (and maybe wrapping that in a supervisor so it survives restarts):
nibbler$ tail -F /var/log/flamingod/events.log \
  | pv -l \
  | ssh -C parabox 'cat - >> /var/log/events.log'
(We use the capital -F flag so that tail follows symlinks even if their destination changes; pv is a great utility which will be explained shortly.) Meanwhile, on the destination server, we employ tail again and stream the events log into a Ruby script which reads from STDIN and performs the actual data insertion into Cassandra.
The schema is simple. For each tweet, we check whether it contains any matching terms. If it does, we extract the timestamp of the tweet, convert it into its minute-resolution "time bucket" format (YYYYMMDDHHmm) and insert it into Cassandra. The schema ends up like this:
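In outline (a sketch reconstructed from the examples below, so take the exact names with a grain of salt), it's one super column family keyed by term:

volume (super column family)
  "hugo"                    # row key: one row per tracked term
    "201202241201"          # super column: minute time bucket (YYYYMMDDHHmm)
      i64(tweet_id) => ""   # one column per matching tweet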
Optionally, we keep the available time buckets in a special super column called “index.” This is preferable to trying to list all the super columns under the row key. Thus, using the Ruby cassandra gem, an insertion looks like the following:
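A minimal sketch, assuming a client connected to our keyspace (the keyspace name and the bare term, bucket and tweet_id variables here are illustrative):

require 'cassandra'
client = Cassandra.new('Oscars')

# Record one matching tweet under its term and minute bucket, and note
# the bucket in the "index" super column so we can enumerate buckets later.
client.insert(:volume, term, {
  bucket  => { i64(tweet_id) => '' },
  'index' => { bucket => '' }
})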
i64() is a function that packs a 64-bit unsigned integer, in this case the tweet ID.
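One way to write that helper (this version is our sketch, not necessarily the original):

def i64(n)
  [n].pack('Q>')   # 64-bit unsigned, big-endian ('Q>' requires Ruby 1.9.3+)
end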
To get the volume at a given minute, count the columns:
>> client.count_columns(:volume, "hugo", "201202241201", :count => MAX_COLUMNS)
305
The default :count is 100, so if we have a magnitude greater than 100, it'll get capped; I've set MAX_COLUMNS to something suitably high.
Streaming Insertions with Ruby
The actual processing task is straightforward, but the script is optimized to do the least amount of work possible. This is the key to high throughput: don't waste time, and if you can correctly get away with skipping a line, skip it. Based on the nominees and terms we're filtering for, we define the group of regular expressions to match against, and then combine them, e.g. [/hugo/, /artist/] becomes /hugo|artist/. Using the combined regular expression as a first pass means not having to parse JSON unless we absolutely must.
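A sketch of that first pass (the term list and names here are illustrative):

require 'json'

TERM_PATTERNS = [/hugo/i, /artist/i]
ANY_TERM      = Regexp.union(TERM_PATTERNS)   # one combined pattern

STDIN.each_line do |line|
  next unless line =~ ANY_TERM   # cheap string scan; skip the JSON parse otherwise
  tweet = JSON.parse(line)
  # ... figure out which terms matched, compute the time bucket, insert
end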
The crux of the code uses the tweet's created_at timestamp (e.g. "Thu Mar 06 10:26:58 +0000 2008") to determine the time bucket, e.g. "200803061026". Since consecutive tweets are likely to be close in time, and perhaps in the same time bucket, we take the substring of the created_at timestamp up to the minute and memoize the time bucket. In other words, if both the current and last tweets have created_at strings beginning with "Thu Mar 06 10:26", we skip parsing the timestamp and reuse the last time bucket. While this may seem like a micro-optimization, it's with this mindset that we can maintain a processing rate of hundreds of tweets per second.
How do we measure performance? We could use Ruby's Benchmark module and measure timing between various points. For a larger picture by way of throughput, we write the insertion script to consume STDIN and use the incredibly handy Pipe Viewer utility, which reports throughput for anything being piped through it:
$ pv -l event.log | ruby insert.rb
26.3k 0:01:27 [303.4/s ] [===============>        ] 0:01:30
In this example, pv starts off by counting the lines (-l), and then keeps track of lines seen, the elapsed time and the rate. So far, 26.3k lines have been processed at a rate of about 303 lines per second, and pv estimates about 1:30 remaining.
It also works in streaming mode, which is how we use it with a live stream of tweets:
$ tail -F event.log | pv -l | ruby insert.rb
26.3k 0:01:27 [303.4/s ] [             <=>              ]
Meeting in the Middle
All in all, it was a whirlwind expedition with two great pieces of open source — Cassandra and d3 — the latter of which deserves its own blog post. Cassandra took a hearty portion of memory but barely broke a sweat handling both insertions and queries.
Oh and by the way, if you want to build visualizations like this or wrangle terabytes of data, we’re hiring!