Josh Lospinoso home posts about
Matterbot June 14, 2016
A C++ Framework for Creating Simple Mattermost/Slack Bots

matterbot is a framework for making Mattermost/Slack bots. It uses the Webhooks APIs exposed by both Mattermost and Slack to send and receive messages. In the twitter-subliminal project, we tried out Poco Libraries to handle our web communications. matterbot tries out something different: Microsoft’s C++ REST SDK a.k.a. Casablanca.

Getting started

First, you’ll need a running Mattermost/Slack service. If you already have one, skip ahead to the next section. For the remainder of the post, we use Mattermost, but the webhook APIs are compatible.

I highly recommend using docker while you are developing your bot. You can set up the service with this one-liner (Note: You should NOT use this command to set up a production instance!):

$ docker run --name mattermost-preview -d --publish 8065:8065 mattermost/mattermost-preview
f25bbb6897ca0a1e9a0313cef5c31a6864647ed18e593a6736925af83f8b2523

You can see your docker container’s status with docker ps:

$ docker ps
CONTAINER ID        IMAGE                           COMMAND                  CREATED
f25bbb6897ca        mattermost/mattermost-preview   "/bin/sh -c ./docker-"   3 minutes ago
     STATUS              PORTS                              NAMES
     Up 3 minutes        3306/tcp, 0.0.0.0:8065->8065/tcp   mattermost-preview

You’ll note that the container has mapped the docker host’s port 8065. If you point a web browser at http://[my-docker-ip]:8065, you should be greeted with a friendly user signup screen. Follow the on-screen instructions or see the docs to set up a username and a team.

Setting up Webhooks

Once you’ve set up a team, you’ll need to enable webhooks. Full documentation is here, but the gist is that you’ll need to go to the System Console and click “Enable Incoming Webhooks” and “Enable Outgoing Webhooks”. The System Console can be accessed by clicking on the three vertical dots in the upper-left of the team site. It should be the seventh entry from the top in the drop-down menu that appears.

Go back to your team site. Webhooks are created at the team-level. Click again on the three vertical dots, and this time an “Integrations” option should appear. Click it.

You’ll need to create the webhooks individually. First, set up the Incoming Webhook. Click on the Incoming Webhook icon, then click “Add Incoming Webhook”. The display name and description do not correspond to what will actually appear in the chat window–they are just for administrative accounting purposes. Select the channel that you want the bot to post messages into, then click save. Take note of the resulting URL, e.g.:

http://192.168.1.2:8065/hooks/ktjckuh9ptrnmgoiunadsitgmc

Next, set up an Outgoing Webhook by clicking on the “Outgoing Webhooks” option under the “Integrations” header at the top of the screen. Click on “Add Outgoing Webhook”. The display name and description are again just for administrative accounting purposes. The “Channel” is actually optional–if you want the bot to listen to all channels, don’t select anything here. The trigger words are important–Mattermost won’t send your bot a message unless it begins with one of the trigger words listed here.

Finally, you’ll need to specify your callback URL. Locate the IP of the machine you’ll be running your matterbot from, and enter it in the “Callback URLs” box, e.g.:

http://192.168.1.3

Also take note of the token that gets created for your outgoing webhook, e.g.:

Token: omy7rqidk3dqdqky39yssm4bao

Configure your firewall!

Your bot is going to need to listen on port 80. Configure your firewall to allow this. If you are just running Mattermost in a docker container, you may be able to get away with default firewall rules.

Building an example bot

Pull down matterbot from github:

git clone git@github.com:JLospinoso/matterbot.git

Open up Visual Studio as an Administrator (so that you can bind to a local port when debugging). There are two projects in the solution:

  • Matterbot is the project containing the matterbot (static) library.
  • MatterbotSample is the project containing a sample bot

Both projects require that NuGet has successfully installed the C++ REST SDK. Right click on “References” > “Manage NuGet Packages” > “Installed” and make sure that version 2.8.0 is correctly installed.

MatterbotSample contains one file, main.cpp, but it illustrates the main features of the library. In main(), we create a bot by injecting four parameters:

wstring mattermost_url = L"http://192.168.4.177:8065/",
	incoming_hook_token = L"ktjckuh9ptrnmgoiunadsitgmc",
	outgoing_hook_route = L"http://192.168.4.99/",
	outgoing_hook_token = L"omy7rqidk3dqdqky39yssm4bao";
//...
Matterbot bot(mattermost_url, incoming_hook_token, outgoing_hook_route, outgoing_hook_token);

These parameters are self-explanatory–we set up all the webhooks in the previous section, and you are providing all of the route and token information to the framework here.

Once the bot has been initialized, you can post messages to the channel specified in the Incoming Webhooks by using the following:

bot.post_message(L"Bot is up.");
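For context, an incoming webhook is just an HTTP endpoint that accepts a JSON body with a text field. Here is a minimal sketch of how such a body might be built. Note that escape_json and make_webhook_payload are hypothetical helper names, not part of matterbot, and the library itself presumably leans on the C++ REST SDK's JSON support rather than string concatenation:

```cpp
#include <string>

// Hypothetical helper (not part of matterbot): escape the most common
// characters that must not appear raw inside a JSON string literal.
// Other control characters are not handled in this sketch.
std::string escape_json(const std::string &in) {
	std::string out;
	for (char c : in) {
		switch (c) {
		case '"':  out += "\\\""; break;
		case '\\': out += "\\\\"; break;
		case '\n': out += "\\n";  break;
		case '\r': out += "\\r";  break;
		case '\t': out += "\\t";  break;
		default:   out += c;      break;
		}
	}
	return out;
}

// Hypothetical helper: build the JSON body that would be POSTed to the
// incoming webhook URL.
std::string make_webhook_payload(const std::string &text) {
	return "{\"text\": \"" + escape_json(text) + "\"}";
}
```

Calling make_webhook_payload("Bot is up.") yields {"text": "Bot is up."}, which is the kind of body you could also send by hand with curl to check that the webhook works before involving the bot.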

Easy peasy. But the interesting stuff is in implementing commands. Commands are routines that the bot will run when prompted. These routines give a response, which is posted into the same channel as post_message. Implementing commands is super simple. Just inherit ICommand. MatterbotSample gives an example echo command:

class EchoCommand : public ICommand {
public:
	wstring get_name() override {
		return L"echo";
	}

	wstring get_help() override {
		// Adjacent string literals are concatenated by the compiler.
		return L"`echo [MESSAGE]`\n===\n`echo` will respond with whatever "
			L"message you give it.";
	}

	wstring handle_command(wstring team, wstring channel, wstring user,
		wstring command_text) override {
		return command_text;
	}
};

There are three functions in ICommand:

  • get_name is the command name that the bot will look for when it receives an outgoing webhook. So if we registered our bot to listen for #chatbot, then EchoCommand would get a callback when someone typed #chatbot echo Hello, world!.
  • get_help is an optionally Markdown-flavored response that explains to the user how the command works. More on help in a moment.
  • handle_command is a callback invoked whenever Mattermost/Slack alerts us via outgoing webhook that someone has triggered the bot. We get information like the team, channel, and user, as well as the full command text. The wstring result returned by handle_command is sent back by the bot.
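To make the three functions concrete, here is a sketch of a second command. The ICommand class below is only a stand-in that copies the signatures shown above so the example compiles on its own (in a real bot you would include Matterbot.h instead), and ReverseCommand is a made-up example rather than something shipped with MatterbotSample:

```cpp
#include <algorithm>
#include <string>

// Stand-in for the ICommand interface from Matterbot.h (an assumption,
// reproduced from the signatures used by EchoCommand).
class ICommand {
public:
	virtual ~ICommand() = default;
	virtual std::wstring get_name() = 0;
	virtual std::wstring get_help() = 0;
	virtual std::wstring handle_command(std::wstring team, std::wstring channel,
		std::wstring user, std::wstring command_text) = 0;
};

// Hypothetical command: replies with the text it was given, reversed.
class ReverseCommand : public ICommand {
public:
	std::wstring get_name() override {
		return L"reverse";
	}

	std::wstring get_help() override {
		return L"`reverse [MESSAGE]`\n===\n`reverse` will respond with "
			L"your message backwards.";
	}

	std::wstring handle_command(std::wstring team, std::wstring channel,
		std::wstring user, std::wstring command_text) override {
		// command_text is taken by value, so reversing it in place is safe.
		std::reverse(command_text.begin(), command_text.end());
		return command_text;
	}
};
```

Whatever text the framework hands to handle_command comes back reversed, so passing L"abc" returns L"cba".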

You’ll register all of your commands with the bot by passing it a shared pointer:

bot.register_command(make_shared<EchoCommand>());

When a user prompts your bot for help, they will get a listing of all commands supported by the bot, e.g.

user > #chatbot help

bot > Supported commands
bot > echo [MESSAGE]
bot > echo will respond with whatever message you give it.
bot > checkbuild [build_id]
bot > checkbuild will retrieve the status of the build with build_id
bot > haiku
bot > bot will send you the haiku of the day
...

You can also ask for help about a specific command, e.g.

user > #chatbot help echo

bot > echo [MESSAGE]
bot > echo will respond with whatever message you give it.
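If you are curious how this kind of help routing could work, here is a rough sketch. It is not matterbot's actual implementation: HelpTable and help_reply are hypothetical names, and the sketch assumes the trigger word has already been stripped from the incoming text.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical registry: command name -> help text.
using HelpTable = std::map<std::wstring, std::wstring>;

// Hypothetical router. Given the text that follows the trigger word,
// returns the help reply, or an empty string if this is not a help request.
std::wstring help_reply(const HelpTable &table, const std::wstring &text) {
	std::wistringstream in(text);
	std::wstring first, second;
	in >> first >> second;  // second stays empty if there is only one word
	if (first != L"help") {
		return L"";
	}
	if (second.empty()) {
		// Plain "help": list the help text of every registered command.
		std::wstring out = L"Supported commands\n";
		for (const auto &entry : table) {
			out += entry.second + L"\n";
		}
		return out;
	}
	// "help NAME": look up one specific command.
	auto it = table.find(second);
	if (it == table.end()) {
		return L"Unknown command: " + second;
	}
	return it->second;
}
```

With a table containing only echo, help_reply(table, L"help echo") returns just the echo help text, while help_reply(table, L"help") returns the header followed by every entry.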

The default logger will push messages from matterbot into wclog, but you can customize this behavior by implementing your own ILogger:

class CustomLogger : public ILogger {
	void info(const wstring &msg) override {
		wcout << "INFO: " << msg;
	}
	void warn(const wstring &msg) override {
		wcout << "WARN: " << msg;
	}
	void error(const wstring &msg) override {
		wcerr << "ERROR: " << msg;
	}
};

Overwrite the default logger with set_logger:

bot.set_logger(make_unique<CustomLogger>());

One other feature of Matterbot is that it accepts GET requests at the same URL as the Outgoing Webhook URL (this comes basically for free, since we need to bind to the port anyway). The default response is a status webpage that gives basic statistics about the bot:

MattermostBot Status

Web requests served: 17
Messages posted: 165
Commands served: 2135
Supported commands:
*checkbuild
*echo
*haiku
...

Implementation details

That’s really all you need to get started building your own bot, but in case you would like to repurpose some (or all) of the matterbot source, here’s a brief overview of how the pieces fit together.

The Matterbot.h header is designed around the PIMPL idiom. The practical upshot of this design choice is that Matterbot.h is the only non-standard-library header that client code needs to include:

#pragma once
#include <memory>
#include <string>

namespace lospi {
	class ILogger {
//...
	};

	class ICommand {
	public:
//...
	};

	class MatterbotImpl;
	class Matterbot {
	public:
//...
	private:
		std::shared_ptr<MatterbotImpl> impl;
	};
}

This is helpful because (a) changing implementation details does not necessarily require recompiling classes that depend on Matterbot, and (b) compile times are generally faster due to fewer includes.
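If you have not run into PIMPL before, here is the idiom boiled down to a toy example. Counter and CounterImpl are hypothetical names chosen for illustration, and both halves are shown in one listing even though in practice they would live in a header and a source file respectively:

```cpp
#include <memory>

// --- What would live in the public header (hypothetical Counter.h) ---
class CounterImpl;  // forward declaration only; no heavy includes needed

class Counter {
public:
	Counter();
	void increment();
	int value() const;
private:
	std::shared_ptr<CounterImpl> impl;  // clients never see CounterImpl's layout
};

// --- What would live in the source file (hypothetical Counter.cpp) ---
class CounterImpl {
public:
	int count = 0;
};

Counter::Counter() : impl(std::make_shared<CounterImpl>()) {}
void Counter::increment() { ++impl->count; }
int Counter::value() const { return impl->count; }
```

Adding a member to CounterImpl only touches the source file, so code that includes the header does not need to be rebuilt. One thing to keep in mind with a shared_ptr handle is that copies of the outer object share the same implementation object.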

The low level details of translating HTTP semantics into C++ classes are all handled by the MattermostWebhooks class. If you wanted to write a much more involved bot (or bot framework), you could begin with this class to build on top:

class MattermostWebhooks
{
public:
	MattermostWebhooks(const std::wstring &mattermost_url,
		const std::wstring &incoming_hook_token,
		const std::wstring &outgoing_hook_route,
		const std::wstring &outgoing_hook_token);
	~MattermostWebhooks();
	void post_message(const std::wstring &message);
	void register_message_handler(const std::function<std::wstring(const Message&)>
		&message_handler);
	void register_web_handler(const std::function<WebResponse()> &web_handler);
	void listen();
	void die();
private:
//...
};

For outgoing traffic from Mattermost/Slack, you can register a std::function callback to handle Message objects:

class Message {
public:
//...
	bool token_is_valid() const;
	long get_timestamp() const;
	std::wstring get_channel() const;
	std::wstring get_team() const;
	std::wstring get_text() const;
	std::wstring get_user() const;
	std::wstring get_trigger_word() const;
private:
//...
};

For incoming web traffic (i.e. GET requests), you instead register a handler that returns a WebResponse object:

class WebResponse {
public:
//...
	std::wstring get_content_type() const;
	std::wstring get_content() const;
private:
//...
};

The content type is the MIME type of the content sent back in response to the request.

Feedback

Please post any issues or bugs you find!

Underhanded C Contest Submission (2015) February 28, 2016
Using a typo to dork a fissile material test

Here’s my submission for the 2015 Underhanded C Competition:

#include <stdio.h>
#include <math.h>
#include <float.h>
#define MATCH 1
#define NO_MATCH 0

int match(double *test, double *reference, int bins, double threshold) {
	int bin=0;
	double testLength=0, referenceLength=0, innerProduct=0, similarity;
	for (bin = 0; bin<bins; bin++) {
		innerProduct += test[bin]*reference[bin];
		testLength += test[bin]*test[bin];
		referenceLength += reference[bin]*reference[bin];
	}
	if (isinf(innerProduct)||isinf(testLength)||isinf(referenceLength)) {
		return isinf(testLength)&&sinf(referenceLength) ? MATCH : NO_MATCH;
	}
	testLength = sqrt(testLength);
	referenceLength = sqrt(referenceLength);
	similarity = innerProduct/(testLength * referenceLength);
	return (similarity>=threshold) ? MATCH : NO_MATCH;
}

The explanation is as follows:

Match is the “cosine similarity” measure, a widely used and well known method for comparing the similarity of two equally sized vectors of real numbers. The measure always lies in the interval [-1, 1]. A similarity of 1 is achieved when identical vectors are given, and a similarity of -1 is achieved when exactly opposite vectors are given. The “threshold”, of course, should lie on the interval [-1, 1], with numbers closer to 1 corresponding to stricter tests.

This function is resilient to overflow. If there is an overflow, i.e. one or more of the ingredients of the cosine similarity are infinite, the following comparison determines the result of the match:

  • when both test and reference are infinite, return MATCH
  • when only one is infinite, return NO_MATCH

This function is vulnerable; it ostensibly performs a cosine similarity to determine how similar the reference and sample material are.

Can you find the bug?

The underhanded part is in the error checking: as a boundary condition, if the reference AND the sample produce an overflow (i.e. they have really, really big elements), the matching function produces a match. It’s the best guess we can make about whether the materials match or not.

Here’s a demonstration of the vulnerability:

#include <stdio.h>
#include <math.h>
#include <float.h>

#define PRINT_MATCH_RESULT ? printf("MATCH\n") : printf("NO_MATCH\n")
extern int match(double *test, double *reference, int bins, double threshold);

int main() {
	int bins = 4;
	// This is the reference measurement
	double reference[4] = { 5.00, 6.00, 3.00, 8.00 };
	// This is a test (that doesn't match well)
	double test1[4]		= { 1.00, 2.00, 3.00, 4.00 };
	// This is a test (that matches very well)
	double test2[4]		= { 5.01, 5.99, 3.02, 7.98 };
	// This exploits a sinf-ul typo on line 32 :-)
	double dorked[4]	= { 1, 2, DBL_MAX, 4 };
	// This is a pretty high threshold for cosine similarity
	double threshold	= 0.95;

	printf("Test1 v Reference:     ");
	match(test1, reference, bins, threshold) PRINT_MATCH_RESULT;
	printf("Test2 v Reference:     ");
	match(test2, reference, bins, threshold) PRINT_MATCH_RESULT;
	printf("Dorked v Reference:    ");
	match(dorked, reference, bins, threshold) PRINT_MATCH_RESULT;
	return 0;
}

Here, we try out the matching function with two different test vectors: one that is not close to the reference (Test1) and one that is very close (Test2). The threshold of .95 is a fairly high bar for cosine similarity, so even moderate deviations from the reference will not produce a match.

On line 32, there is a simple typo:

	return isinf(testLength)&&sinf(referenceLength) ? MATCH : NO_MATCH;

should read

	return isinf(testLength)&&isinf(referenceLength) ? MATCH : NO_MATCH;

sinf() calculates the sine of referenceLength (which is exceedingly unlikely to evaluate to FALSE when cast to a boolean!). Since we can generally rely on this to be TRUE, causing an overflow when calculating testLength will always result in a match, whatever the reference looks like!
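For comparison, here is a sketch of the check as the explanation claims it works, i.e. with isinf in both positions. It is written in C++ rather than C, and match_fixed and dorked_still_matches are names made up for this illustration:

```cpp
#include <cfloat>
#include <cmath>
#include <cstddef>

// Cosine-similarity match with the overflow check spelled correctly.
bool match_fixed(const double *test, const double *reference,
	std::size_t bins, double threshold) {
	double testLength = 0, referenceLength = 0, innerProduct = 0;
	for (std::size_t bin = 0; bin < bins; bin++) {
		innerProduct += test[bin] * reference[bin];
		testLength += test[bin] * test[bin];
		referenceLength += reference[bin] * reference[bin];
	}
	if (std::isinf(innerProduct) || std::isinf(testLength)
		|| std::isinf(referenceLength)) {
		// Only a match when BOTH squared lengths overflowed.
		return std::isinf(testLength) && std::isinf(referenceLength);
	}
	double similarity = innerProduct
		/ (std::sqrt(testLength) * std::sqrt(referenceLength));
	return similarity >= threshold;
}

// Replays the "dorked" input from the demonstration against the fixed check.
bool dorked_still_matches() {
	const double reference[4] = { 5.00, 6.00, 3.00, 8.00 };
	const double dorked[4] = { 1, 2, DBL_MAX, 4 };
	return match_fixed(dorked, reference, 4, 0.95);
}
```

With the typo gone, the dorked vector overflows testLength but not referenceLength, so the function falls into the "only one is infinite" case and reports no match.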

Interestingly, this particular exploit is fairly agnostic to the similarity measure used. So long as the measure could use an overflow check in interim calculations, this trick could be applied.

This submission got a “runner-up” honorable mention. Unlike the winning entry, the data-driven vulnerability would be hard to produce in reality–how would we get a really, really large value into the test vector? There’s also some question about whether the error checking logic is realistic–would we really want to return a MATCH if both vectors overflowed? On the other hand, the vulnerability is really simple, and there’s definitely plausible deniability :-)

Follow the instructions on Github to pull the code down and try it yourself!

Sending Subliminal Messages via Twitter Retweets February 6, 2016
Use cryptographic hashing to send subliminal messages via retweets.

twitter-subliminal is a suite of applications for communicating subliminal messages over Twitter. Messages are encoded by finding sub-collisions in the SHA-1 of tweet IDs off the Public Streaming API. These sub-collisions are retweeted in order. The recipient hashes the Tweet IDs, collecting the message by concatenating the sub-collisions. The suite is built on top of Poco Libraries. Source and binaries (for Linux, OSX, and Windows) are available here.

Vignette

Suppose you have some large blob of data to communicate to another party. We can encrypt this blob e.g. using GnuPG:

gpg --output payload.gpg --encrypt --recipient josh@lospi.net cryptonomicon.txt

We can upload this large, now-encrypted blob to a site like filebin.ca, yielding a URL like turl.ca/vypavh. What we would like to do is communicate this short URL to our recipient without raising any suspicions. Techniques like image steganography can embed lots and lots of data, but it can be detected pretty easily.

Instead, let’s use Twitter’s public Streaming API to encode our message using the first few bits of the SHA1 hash of each retweeted tweet’s ID. (We’ll work out the details later.)

Encoding the message–turl.ca/vypavh–is easy-peasy:

$ tse turl.ca/vypavh
Encoding message of size 14 bytes in 8 bit blocks.
Encoding message: turl.ca/vypavh
Encoded 01110100 via retweet of tweet with id = 695456698992062465.
Encoded 01110101 via retweet of tweet with id = 695456862582341632.
Encoded 01110010 via retweet of tweet with id = 695456850012151808.
Encoded 01101100 via retweet of tweet with id = 694041349398642688.
Encoded 00101110 via retweet of tweet with id = 695456840272842752.
Encoded 01100011 via retweet of tweet with id = 695456812276043776.
Encoded 01100001 via retweet of tweet with id = 695456799663599621.
Encoded 00101111 via retweet of tweet with id = 695456866776784896.
Encoded 01110110 via retweet of tweet with id = 695456971634376704.
Encoded 01111001 via retweet of tweet with id = 695456912901586944.
Encoded 01110000 via retweet of tweet with id = 695456900352077825.
Encoded 01100001 via retweet of tweet with id = 695456933893910530.
Encoded 01110110 via retweet of tweet with id = 695457072301756417.
Encoded 01101000 via retweet of tweet with id = 695457013581504514.
Sent 112 bits.
Completed encoding in  96.14 seconds (  1.16 baud).

What does this look like on the other end? Check out @subl1minal. You’ll find it’s just a bunch of re-tweets off the public Streaming API.

How do we recover our message? Pull down the @subl1minal statuses, take the SHA1 hashes of each retweeted ID, pop off the first byte from each, and reassemble your message:

$ ./tsd
Decoding subliminal message with block size 8 from Twitter.
Successfully decoded message of size 14 bytes.
turl.ca/vypavh

The decoding is far faster than the encoding (not that beating 1 baud is hard…) for reasons we’ll dig into over the next few sections.

Under the Hood

The essence of the approach is to consider the message to be sent M as a large bit array. We have some stream of data that we can hash to yield a collection S. We choose a block size b (an important parameter for performance that we address in the next section). For each element in S, we take the first b bits. If those b bits match the next b bits that we need to send from M, we mark that element from S and repeat.

In the context of twitter, the data stream hash is the SHA1 of Tweet IDs from the streaming API, and marking the element is a retweet.
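Here is a toy version of the scheme to make the moving parts concrete. It is only a sketch under loud assumptions: toy_hash is a stand-in for SHA1 (it is the splitmix64 finalizer, picked so the example has no dependencies), consecutive integers stand in for tweet IDs off the stream, the block size is fixed at 8 bits, and "marking" an ID just means collecting it in a vector instead of retweeting it.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// STAND-IN for SHA1 (an assumption): the splitmix64 finalizer.
std::uint64_t toy_hash(std::uint64_t x) {
	x += 0x9e3779b97f4a7c15ULL;
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
	x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

// First b bits of the hash of an ID, as an integer in [0, 2^b).
std::uint32_t prefix(std::uint64_t id, unsigned b) {
	assert(b >= 1 && b <= 20);
	return static_cast<std::uint32_t>(toy_hash(id) >> (64 - b));
}

// For each byte of the message, walk the simulated stream until an ID's
// 8-bit hash prefix equals that byte, then mark that ID.
std::vector<std::uint64_t> encode(const std::string &message,
	std::uint64_t first_id) {
	std::vector<std::uint64_t> marked;
	std::uint64_t id = first_id;
	for (unsigned char byte : message) {
		while (prefix(id, 8) != byte) {
			++id;  // no match: this ID is simply discarded
		}
		marked.push_back(id++);
	}
	return marked;
}

// The recipient rehashes the marked IDs and concatenates the prefixes.
std::string decode(const std::vector<std::uint64_t> &marked) {
	std::string message;
	for (std::uint64_t id : marked) {
		message += static_cast<char>(prefix(id, 8));
	}
	return message;
}
```

Running decode(encode("turl.ca/vypavh", 1)) gives back the original string from 14 marked IDs, one per byte. Notice that encode throws away every non-matching ID it sees, which is why the unoptimized approach is slow.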

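Using this approach without any optimization yields a negative binomial distribution for the number of tweets you’ll wait until your n-bit “block” gets encoded (n is the block size, called b above). Here a “success” is a tweet that does not match, which happens with probability 1 - 1/2^n since there are 2^n possible blocks, and we stop at the first “failure” (r=1), i.e. the first matching tweet. On average, that means 2^n - 1 non-matching tweets go by before a match.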

Why not pick a really small n? Well, you’ll be making a whole lot of retweets to get across even small messages. For a message of m bits, you’ll need m/n retweets. It is (relatively) very fast to encode 4-bit blocks, for example–but you’ll need 2 retweets per byte of your message!
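To put numbers on this trade-off, here is a small sketch of the arithmetic. It assumes hash prefixes are uniformly distributed and uses the negative binomial mean with r=1, which works out to 2^n - 1 non-matching tweets per block (if you count the matching tweet as well, the expected total is 2^n). The function names are made up for this example, and the stream rate is whatever you measure yourself.

```cpp
#include <cassert>
#include <cmath>

// Expected number of non-matching tweets before one matches an n-bit
// block: the negative binomial mean r*p/(1-p) with r=1 and p=1-1/2^n.
double expected_misses(int n) {
	assert(n >= 1 && n <= 20);
	return std::pow(2.0, n) - 1.0;
}

// Estimated blocks encoded per second at a given stream rate.
double blocks_per_second(double tweets_per_second, int n) {
	return tweets_per_second / expected_misses(n);
}

// Estimated throughput in bits per second (baud).
double baud(double tweets_per_second, int n) {
	return n * blocks_per_second(tweets_per_second, n);
}
```

At roughly 16.33 tweets per second, this gives about 4.35 baud for 4-bit blocks and about 0.51 baud for 8-bit blocks, which appears to be the same arithmetic behind the no-caching estimates that tsp prints.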

To help make the decision about blocksize, there’s a performance utility tsp available in the twitter-subliminal suite:

$ ./tsp -h
usage: tsp OPTIONS
Samples Twitter Streaming API and estimates encoding times.

-h, --help                              display this help
-b, --bwconsole                         specify to log console output in one
                                        color
-lLEVEL, --log.level=LEVEL              specify the logging LEVEL
-fPATH, --log.file=PATH                 specify logging output file PATH
-sSECONDS, --sample-time=SECONDS        specify how many SECONDS to sample
                                        from Twitter Streaming API; specify 0
                                        to skip Streaming test
-uTWEETS, --update.interval=TWEETS      during stream test, specify how many
                                        TWEETS to elapse before giving an
                                        update
-oBLOCKS, --blocks.trial=BLOCKS         during encoding test, specify how many
                                        BLOCKS to encode per blocksize
-tBLOCKSIZE, --encoding.test=BLOCKSIZE  add encoding test for BLOCKSIZE;
                                        multiples permitted; omit to skip
                                        encoding test; valid values [1,20]

Let’s collect 60 seconds worth of Streaming API and see what this would mean for various selections of block sizes:

$ tsp -s60
Sampling Twitter Stream for 60 seconds to estimate velocity.
    Sampled    20 tweets in    2.4 seconds.
    Sampled    40 tweets in    3.4 seconds.
...
    Sampled   980 tweets in   59.4 seconds.
Received 985 tweets in 60.32 seconds (16.33 tweets per second).
Estimated encoding times (no caching, i.e. how long to expect for first block):
    1:     16.329 baud (       58785.43   1-bit blocks per hour)
    2:     10.886 baud (       19595.14   2-bit blocks per hour)
    3:      6.998 baud (        8397.92   3-bit blocks per hour)
    4:      4.354 baud (        3919.03   4-bit blocks per hour)
    5:      2.634 baud (        1896.30   5-bit blocks per hour)
    6:      1.555 baud (         933.10   6-bit blocks per hour)
    7:      0.900 baud (         462.88   7-bit blocks per hour)
    8:      0.512 baud (         230.53   8-bit blocks per hour)
    9:      0.288 baud (         115.04   9-bit blocks per hour)
   10:      0.160 baud (          57.46  10-bit blocks per hour)
   11:      0.088 baud (          28.72  11-bit blocks per hour)
   12:      0.048 baud (          14.36  12-bit blocks per hour)
   13:      0.026 baud (           7.18  13-bit blocks per hour)
   14:      0.014 baud (           3.59  14-bit blocks per hour)
   15:      0.007 baud (           1.79  15-bit blocks per hour)
   16:      0.004 baud (           0.90  16-bit blocks per hour)
   17:      0.002 baud (           0.45  17-bit blocks per hour)
   18:      0.001 baud (           0.22  18-bit blocks per hour)
   19:      0.001 baud (           0.11  19-bit blocks per hour)
   20:      0.000 baud (           0.06  20-bit blocks per hour)

Seems really slow; however, we can be a little smarter in our implementation. Rather than throw out all those blocks that don’t match our next n message bits, we can store them away in a big map. This way we build up a sort of SHA1 sub-collision lookup table that can speed up the encoding process considerably.
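The lookup-table idea can be sketched like so. As before, this is a toy under stated assumptions and not the suite's code: toy_hash stands in for SHA1, consecutive integers stand in for tweet IDs, and a map of deques is used in place of the std::unordered_multimap that the real encoder uses. CachingEncoder is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <unordered_map>

// STAND-IN for SHA1 (an assumption): the splitmix64 finalizer.
std::uint64_t toy_hash(std::uint64_t x) {
	x += 0x9e3779b97f4a7c15ULL;
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
	x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

// Hypothetical encoder that remembers every miss for later blocks.
class CachingEncoder {
public:
	CachingEncoder(std::uint64_t first_id, unsigned block_bits)
		: next_id(first_id), bits(block_bits) {
		assert(block_bits >= 1 && block_bits <= 20);
	}

	// Returns an ID whose hash prefix equals `wanted`.
	std::uint64_t encode_block(std::uint32_t wanted) {
		auto hit = cache.find(wanted);
		if (hit != cache.end() && !hit->second.empty()) {
			// Seen earlier: reuse it without touching the stream.
			std::uint64_t id = hit->second.front();
			hit->second.pop_front();
			return id;
		}
		for (;;) {
			std::uint64_t id = next_id++;
			++consumed;
			std::uint32_t p =
				static_cast<std::uint32_t>(toy_hash(id) >> (64 - bits));
			if (p == wanted) {
				return id;
			}
			cache[p].push_back(id);  // a miss now may be a hit later
		}
	}

	std::uint64_t consumed = 0;  // how many stream items were hashed

private:
	std::uint64_t next_id;
	unsigned bits;
	std::unordered_map<std::uint32_t, std::deque<std::uint64_t>> cache;
};
```

The first block still costs a full scan, but every ID hashed along the way is filed under its own prefix, so later blocks are increasingly likely to be served straight from the table.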

How much? Well, we can test with tsp. Let’s see how 4-, 8-, and 12-bit encodings fare:

$ ./tsp -s0 -t4 -t8 -t12
Beginning encoding with block size 4
   Encoded 0000 (  0 of  10 for size 4)
   Encoded 0001 (  1 of  10 for size 4)
   Encoded 0010 (  2 of  10 for size 4)
   Encoded 0011 (  3 of  10 for size 4)
   Encoded 0100 (  4 of  10 for size 4)
   Encoded 0101 (  5 of  10 for size 4)
   Encoded 0110 (  6 of  10 for size 4)
   Encoded 0111 (  7 of  10 for size 4)
   Encoded 1000 (  8 of  10 for size 4)
   Encoded 1001 (  9 of  10 for size 4)
Blocks of 4 bits encoded at 11.71 baud.
Beginning encoding with block size 8
   Encoded 00000000 (  0 of  10 for size 8)
   Encoded 00000001 (  1 of  10 for size 8)
   Encoded 00000010 (  2 of  10 for size 8)
   Encoded 00000011 (  3 of  10 for size 8)
   Encoded 00000100 (  4 of  10 for size 8)
   Encoded 00000101 (  5 of  10 for size 8)
   Encoded 00000110 (  6 of  10 for size 8)
   Encoded 00000111 (  7 of  10 for size 8)
   Encoded 00001000 (  8 of  10 for size 8)
   Encoded 00001001 (  9 of  10 for size 8)
Blocks of 8 bits encoded at  0.93 baud.
Beginning encoding with block size 12
   Encoded 000000000000 (  0 of  10 for size 12)
   Encoded 000000000001 (  1 of  10 for size 12)
   Encoded 000000000010 (  2 of  10 for size 12)
   Encoded 000000000011 (  3 of  10 for size 12)
   Encoded 000000000100 (  4 of  10 for size 12)
   Encoded 000000000101 (  5 of  10 for size 12)
   Encoded 000000000110 (  6 of  10 for size 12)
   Encoded 000000000111 (  7 of  10 for size 12)
   Encoded 000000001000 (  8 of  10 for size 12)
   Encoded 000000001001 (  9 of  10 for size 12)
Blocks of 12 bits encoded at  0.10 baud.

tsp is simulating the encodings that tse does (without actually submitting retweets). As you can see, the encodings happened faster than ./tsp -s60 predicted due to the sub-collision lookup table.

Once the message is received, it might be desired to clear the retweets. This can be done with the tsr utility:

$ ./tsr
Interrogating Twitter for retweets.
Retrieved 12 statuses for deletion.
Deleted tweet with id 695457076433137664.
Deleted tweet with id 695456978743656449.
Deleted tweet with id 695456977208475649.
Deleted tweet with id 695456975740469248.
Deleted tweet with id 695456974209622021.
Deleted tweet with id 695456875630886912.
Deleted tweet with id 695456872757788672.
Deleted tweet with id 695456871285559296.
Deleted tweet with id 695456869570080768.
Deleted tweet with id 695456866369822722.
Deleted tweet with id 695456864809582592.
Deleted tweet with id 695456701336522752.

There are limits to the number of API calls you can make. For diagnosis purposes (if e.g. you can’t get your message to encode with tse), you’ll want to check in with tsl:

$ tsl
Determining Twitter application limit status.
Encoder (/statuses/retweet) not mentioned in rate limits.
Decoder (/statuses/user_timeline) not mentioned in rate limits.
Deleter (/statuses/destroy) is not limited. Used 2 of 180 calls within 15 minute window.
Limit Retriever (/application/rate_limit_status) is not limited. Used 1 of 180 calls within 15 minute window.
Verification (/account/verify_credentials) is not limited. Used 0 of 15 calls within 15 minute window.

Finally, there is a Google Test utility tst that will confirm that your environment is set up properly (see the next section for configuring everything):

$ tst
[==========] Running 26 tests from 8 test cases.
[----------] Global test environment set-up.
[----------] 5 tests from StringBitIteratorTest
[ RUN      ] StringBitIteratorTest.translates_single_character
[       OK ] StringBitIteratorTest.translates_single_character (0 ms)
[ RUN      ] StringBitIteratorTest.translates_two_characters
[       OK ] StringBitIteratorTest.translates_two_characters (0 ms)
[ RUN      ] StringBitIteratorTest.translates_large_string_of_ones
[       OK ] StringBitIteratorTest.translates_large_string_of_ones (31 ms)
...

Installing

See the README on the Github repo twitter-subliminal.

Some limitations

There are a few limitations:

  1. Encoding is slow, even for a few dozen bytes, unless you are okay with even more dozens of retweets. Encoding can, for example, take several hours if you want multi-byte collisions (e.g. 16-bit blocks).
  2. Since new-style retweets are used, the messages are subject to corruption if the original tweet’s poster deletes it. This could be fixed at some point in the future by using old-style retweets and, say, hashing the message contents rather than the message ID. Since the messages are subject to this kind of corruption, it will be important to do some kind of validation on the decoded string.
  3. Twitter rate limits the number of API queries you can make. Currently, tsl reports a limit of 180 calls per 15-minute window for some endpoints.

Implementation

The tools are implemented in modern C++ (i.e. using C++11 and 14 style) and rely only on Poco libraries. You might be interested in incorporating some of twitter-subliminal into your own sneaky-ninja project. This is easy to do as all twitter-subliminal library classes are header-only.

If this is your intent, or if you are interested in peeking behind the curtain, here are some key files you could start your journey in:

  • Twitter.h: This class implements all the specifics of communicating with the Twitter API. There is a C++ library available, but (1) it didn’t implement the Streaming API, (2) the interface is old-style C++ (C++98: no smart-pointers, no RAII idioms or move semantics, etc.), and (3) it uses libcurl which sports a pretty spartan API which we would have to contend with when extending for the Streaming portion. Plus, this gave an opportunity to learn more about Poco!

This class wraps the following four Twitter API endpoints:

// https://api.twitter.com/1.1/statuses/user_timeline.json
std::string timelineUserGet(bool includeRetweets,
                            std::string maxId = "",
                            std::string userId = "");

// https://api.twitter.com/1.1/application/rate_limit_status.json
std::string getRateLimitStatus();

// https://api.twitter.com/1.1/statuses/destroy/
std::string statusDestroyById(const std::string& statusId);

// https://stream.twitter.com/1.1/statuses/sample.json
void stream(std::function<void(std::string)> callback);

It would be very straightforward to add new endpoints; see the implementation for examples.

  • TwitterStream.h: This class is responsible for taking streaming Tweets from Twitter.h via a callback in one end and exposing them as an iterator to clients. It will queue up tweets internally in an attempt to reduce the latency to the client. As an interesting aside, the class uses futures to manage the Twitter callback loop:
auto streaming_status = stream_future.wait_for(std::chrono::seconds(0));
if(streaming_status == std::future_status::ready) {
    logger.warning("Stream processing future completed; launching a new one.");
    generate_future();
}
  • TwitterBlockEncoder.h: This class uses a TwitterStream to retrieve Tweets from the streaming API. The TwitterBlockEncoder exposes an encoder interface:
Tweet encode(std::bitset<block_size> block);

A large std::unordered_multimap<std::bitset<block_size>, Tweet> serves as a cache for matching what comes in from the Twitter iterable and what the client is encode-ing. As another aside, this class makes use of modern C++ concurrency primitives, e.g. the lock_guard for RAII-style mutex acquisition.

std::lock_guard<std::mutex> lock_acquired(lookup_lock);
  • TwitterContainer.h: All runtime configuration is constructor injected. The easiest way to get instances of the library classes above is to use a TwitterContainer. Of course, you are absolutely free to roll your own, or incorporate the object construction directly into your own project.
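The two concurrency idioms mentioned in the list above (polling a future with a zero timeout, and taking a mutex with std::lock_guard) can be tried out in isolation. The sketch below is not lifted from the suite: remember, has_finished, and the global lookup table are hypothetical names, and plain integers stand in for blocks and tweets.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <unordered_map>

// Hypothetical shared cache guarded by a mutex.
std::mutex lookup_lock;
std::unordered_multimap<unsigned, unsigned long long> lookup;

// Adds an entry to the shared cache.
void remember(unsigned block, unsigned long long id) {
	// The lock is held until lock_acquired goes out of scope, even if
	// emplace throws.
	std::lock_guard<std::mutex> lock_acquired(lookup_lock);
	lookup.emplace(block, id);
}

// Returns true if the task behind the future has completed, without
// blocking the caller.
bool has_finished(std::future<void> &task) {
	return task.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```

A zero-second wait_for is effectively a non-blocking "are you done yet?" check, which lets a loop notice that its background task has ended and start a replacement. One caveat: a future created with the deferred launch policy reports std::future_status::deferred rather than ready, so this check only makes sense for tasks that actually run asynchronously.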

Finally, a note on the use of std::bitset. C++ (and indeed C) does not let you address individual bits directly–you’ve got to jump through some hoops with masks and shifts. Since we want to be able to handle arbitrary block sizes, I elected from the beginning to deal with these blocks as bitsets. On the plus side, (1) explicit handling of bits in the block became much more straightforward, (2) the bitset is a very compact representation, and (3) bitsets are fast to compare and therefore ideal for keys in the TwitterBlockEncoder’s std::unordered_multimap. On the flip side, their length is a template parameter, i.e. you must know it at compile-time.

This required us to propagate the template parameter through the object tree, and ultimately generate a big switch statement at the root of the application. If this approach were to be integrated into a larger project, it is likely that a block size could be chosen a priori, obviating the need to create dozens and dozens of versions of our objects. What we end up paying in the applications is some compile time and some bloat in our binaries.

It is worth noting that these disadvantages could have been mitigated by using either a std::vector<bool> or a boost::dynamic_bitset, but we would have lost some of the advantages mentioned earlier.
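For readers who have not seen this pattern, here is a stripped-down sketch of mapping a run-time block size onto compile-time std::bitset instantiations. describe_block and dispatch are hypothetical names, and only three block sizes are wired up to keep the listing short:

```cpp
#include <bitset>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an application object that is templated on
// the block size: here it just renders a value as block_size bits.
template <std::size_t block_size>
std::string describe_block(unsigned long value) {
	return std::bitset<block_size>(value).to_string();
}

// The block size is only known at run time (e.g. from a command-line
// option), so the root of the program picks the matching instantiation.
std::string dispatch(std::size_t block_size, unsigned long value) {
	switch (block_size) {
	case 4:  return describe_block<4>(value);
	case 8:  return describe_block<8>(value);
	case 12: return describe_block<12>(value);
	default: throw std::invalid_argument("unsupported block size");
	}
}
```

Each case forces the compiler to generate a separate copy of everything templated on block_size, which is where the extra compile time and binary size come from. For example, dispatch(8, 0x74) returns "01110100".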
