Abstract:

I would like to write a new BridgeDB Distributor. Bridge distribution via Twitter direct messages would be a new channel whose censorship would cause high collateral damage. I would therefore write a Twitter bot implemented as a BridgeDB Distributor, with an optional rate control mechanism. I would also prepare the way for other distributors by writing reusable code, and hopefully by starting work on another bot-like distributor.

1. What project would you like to work on?

I would like to write a new BridgeDB Distributor: a distributor that would act as a Twitter bot, responding to direct messages sent by Twitter users. I have been discussing this project with isis and sysrqb a bit, and I think I have a realistic plan for a useful new distributor, as well as ideas for possible further expansion, should the critical deliverables be met in good time.

The main plan is:

write a Twitter bot that responds to PMs (what Twitter calls "direct messages"):
- produce and populate a new bridge hashring
- be able to test hashring using fake bridge descriptors
- write Twitter bot: having received a direct message "get bridges", respond by giving out a few bridge lines: get bridges nearest to hashed(twitter_handle) in the twitter hashring
  - written using Python Twisted, by incorporating the RESTful Twitter API
  - isis wrote SSL certificate chain verification for Twisted; I would reuse this.
  - the new distributor would be a new class inheriting from bridgedb.Dist.Distributor
  - send the response in multiple messages if needed: if bridge lines in the response include fingerprints (and/or passwords), only ~2 bridge lines fit per message, so split on bridge line boundaries as needed
  - use multiple Twitter access tokens if possible, if we conclude that there might be problems with Twitter per-token POST rate limiting
  - as of now, we can see that Twitter only allows sending direct messages to those who follow the sender
    - some accounts get a feature that lets them choose whether to receive messages from people they do not follow
      - this is maybe follower-number-related
      - consider asking Twitter to enable this for the bot (the plan assumes this is not possible, but it would make things easier for the users)
  - we can catch "follow" events. So as of now, the plan is to implement the workflow/pattern as follows:
    - user clicks "follow" on our bot
    - bot catches "follow" event, immediately starts following user
    - user sends direct message requesting bridges
    - bot responds (and reminds user to unfollow the bot)
    - user unfollows bot
    - bot unfollows user
    - if user does not unfollow bot in some amount of time, consider automatically unfollowing user
    - this might be counter-intuitive for users who wish to re-request bridges later on
    - but there is a privacy/anonymity nuance here: it is a bad idea to expose the user <-> bot association publicly
- be able to parse messages with keywords matching pluggable transports (e.g. "get obfs3 bridges", "get fte bridges"), give appropriate bridges when possible
- provide meaningful error messages
- be able to 'talk' in several (most relevant) languages? (if at all needed, this should be a core feature.) I am thinking that the instructions (response to 'help') and error messages should be carefully crafted
  - does it even make sense to support multiple languages, would this not hinder overall user experience? Not sure.
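
A minimal sketch of the message handling described above: parsing "get bridges" / "get obfs3 bridges" requests, returning a meaningful error otherwise, and packing bridge lines into as few direct messages as possible. Everything here is an assumption for illustration: the function names, the list of known transports, and the 140-character limit are not taken from BridgeDB or from the Twitter API.

```python
import re

# Assumption: transports the bot recognises; the real list would come from BridgeDB.
KNOWN_TRANSPORTS = {"obfs3", "fte"}
# Assumption: per-message character limit; adjust to whatever Twitter enforces.
MAX_DM_LENGTH = 140

REQUEST_RE = re.compile(r"^\s*get\s+(?:(\w+)\s+)?bridges\s*$", re.IGNORECASE)


def parse_request(text):
    """Return (ok, value).

    For a valid request, ok is True and value is the transport name
    (or None for plain bridges). Otherwise ok is False and value is an
    error message suitable for sending back to the user.
    """
    match = REQUEST_RE.match(text)
    if match is None:
        return False, 'Sorry, I did not understand that. Try "get bridges".'
    transport = match.group(1)
    if transport is None:
        return True, None
    transport = transport.lower()
    if transport not in KNOWN_TRANSPORTS:
        known = ", ".join(sorted(KNOWN_TRANSPORTS))
        return False, 'Unknown transport "%s". Known: %s' % (transport, known)
    return True, transport


def split_into_messages(bridge_lines, limit=MAX_DM_LENGTH):
    """Pack bridge lines into messages of at most `limit` characters.

    Lines are kept whole; only a single line longer than `limit` is
    chopped into limit-sized pieces.
    """
    messages = []
    current = ""
    for line in bridge_lines:
        if not line:
            continue
        pieces = [line[i:i + limit] for i in range(0, len(line), limit)]
        for piece in pieces:
            candidate = piece if not current else current + "\n" + piece
            if len(candidate) <= limit:
                current = candidate
            else:
                messages.append(current)
                current = piece
    if current:
        messages.append(current)
    return messages
```

With three 60-character bridge lines and a 140-character limit, this yields two messages, which matches the "~2 bridge lines per message" estimate.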
write a rate limiting/control mechanism for the bot:
- this includes further discussion of whether it is needed, i.e. how easy it is to create new handles and use them to get bridges in bulk
- the idea here is to write a working rate limiting mechanism anyway, and be able to turn it on/off
- as of now, the plan would be to serve CAPTCHAs via Twitter Media CDN (we assume that if Twitter is not blocked, its CDNs are not, either)
  - Twitter direct messages can display images. What this would mean for the end user is that they would see the image in their message, and would respond directly (and privately), as if having a normal conversation
  - reuse parts of IPBasedDistributor, reCAPTCHA code, where applicable
  - how can we serve CAPTCHAs (e.g. if generated via GIMP) most efficiently? As of now, I do not see a scalable and decent way to serve them on-demand via Twitter CDN. Can we use some other CDN to 'proxy' the images? Overall, this does not sound like an elegant idea.
  - we can pre-upload everything to e.g. Twitter CDN. Alternatively, we can upload on demand and remember which images were already uploaded (in case this kind of 'cache' expires.)
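
A rough sketch of how a rate control mechanism that can be turned on and off might sit in front of the bot. It assumes a hypothetical `make_challenge` callable that returns a prompt (for example, the URL of an uploaded CAPTCHA image) and the expected answer; CAPTCHA generation and upload are deliberately left out. The class and method names are made up for illustration.

```python
class ChallengeGate(object):
    """Optional challenge-response step before a request is served."""

    def __init__(self, make_challenge, enabled=True):
        # make_challenge() -> (prompt, expected_answer); hypothetical callable.
        self.make_challenge = make_challenge
        self.enabled = enabled
        # handle -> (expected_answer, original_request)
        self.pending = {}

    def handle_message(self, handle, text):
        """Return ("serve", request) or ("challenge", prompt)."""
        if not self.enabled:
            # Rate control switched off: serve every request directly.
            return ("serve", text)
        pending = self.pending.pop(handle, None)
        if pending is not None:
            answer, request = pending
            if text.strip().lower() == answer.lower():
                # Correct answer: serve the request that triggered the challenge.
                return ("serve", request)
        else:
            # No challenge outstanding: this message is the request itself.
            request = text
        # First contact or wrong answer: issue a fresh challenge.
        prompt, answer = self.make_challenge()
        self.pending[handle] = (answer, request)
        return ("challenge", prompt)
```

Each solved challenge unlocks exactly one request here; whether a solved challenge should stay valid for longer is one of the open discussion points.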

Further expansion plans would be (at least a subset of these is very desirable, but outside of critical deliverable scope, unless we decide otherwise):

write code or refactor code in a way that would make it easy to be reused by other distributors
- for example, an XMPP+OTR distributor is highly desirable
  - it would be great to make sure the actual distributor and bot parts are mostly ready for it (i.e. would require as little change as possible). The actual XMPP+OTR handler will be more difficult, but can be done in pure Python
  - would probably run on a separate machine from bridgedb. bridgedb would handle only highly sanitized input: most likely, specific requests into the hashring, with an authentication token coming from the other machine/instance.
  - ideally: write all or part of the XMPP+OTR distributor
- discuss IRC distributor options and nuances (especially rate control). This is easier to implement in and of itself
  - ideally: choose which other distributor to write; if IRC distributor is meaningful, write all or part of it
- discuss WhatsApp distributor development options. WhatsApp censorship is undesirable
  - from the point of view of the summer project, the main question is, "how should we write code such that it can be later on reused by other distributors as much as possible?"
consider and, if needed, implement alternative rate control mechanisms for the Twitter Distributor
- text-based challenge-response?
consider including more info in direct message responses, e.g. links to Tor download mirrors (sometimes that is where the effective censorship 'bottleneck' is) and/or torrent magnet links?
make a definite conclusion whether rate control for Twitter distributor is needed. Perhaps it is needed "sometimes":
- for example, perhaps we can heuristically guess whether we need to be careful with a specific new handle. The simplest metrics would be age of account, number of tweets, and number of followers (the latter two are extremely rough and gameable metrics)
discuss, and ideally help with, bridgedb API access from within tor-launcher
- again, is rate limiting needed, and if yes, how to do it? If media needs to be served (e.g. captchas), can it be served from a broad array of possible CDNs?
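
The "sometimes" heuristic for the Twitter distributor mentioned above could look roughly like the sketch below. The thresholds are illustrative assumptions, not tuned values, and the function name is hypothetical. Account age is treated as the main signal; tweet and follower counts only matter when both are low, since they are rough and gameable.

```python
from datetime import datetime, timedelta

# Illustrative thresholds only; real values would need discussion and testing.
MIN_ACCOUNT_AGE = timedelta(days=30)
MIN_TWEETS = 10
MIN_FOLLOWERS = 5


def needs_rate_control(created_at, tweet_count, follower_count, now):
    """Guess whether a handle should go through rate control.

    created_at and now are naive datetime objects in the same timezone.
    """
    if now - created_at < MIN_ACCOUNT_AGE:
        # Fresh accounts are cheap to create in bulk: always be careful.
        return True
    # Older accounts are only suspicious if both weak signals are low.
    return tweet_count < MIN_TWEETS and follower_count < MIN_FOLLOWERS
```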

In terms of general code architecture, the idea would be to write a new generic bridgedb.Dist.Distributor with a hashring for 'handles'. IRC would reuse most of this; XMPP and other federated communications systems might inherit from it and expand/override it to have subrings per domain/network, etc. (This can also be done for Twitter. Do we need it? Probably not, but it is worth discussing and thinking about a bit.)
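
To make the 'handle' hashring idea concrete, here is a standalone sketch that does not use BridgeDB's actual classes. It assumes ring positions are computed with a keyed HMAC-SHA1 and that a request walks clockwise from the position of the hashed handle; the class name, method names, and fake bridge lines are all hypothetical.

```python
import bisect
import hashlib
import hmac


class HandleHashring(object):
    """Toy hashring mapping handles to nearby bridges."""

    def __init__(self, key):
        self.key = key          # secret bytes, so positions cannot be predicted
        self.positions = []     # sorted ring positions (hex strings)
        self.bridges = {}       # position -> bridge line

    def _position(self, value):
        digest = hmac.new(self.key, value.encode("utf-8"), hashlib.sha1)
        return digest.hexdigest()

    def insert(self, fingerprint, bridge_line):
        pos = self._position(fingerprint)
        if pos not in self.bridges:
            bisect.insort(self.positions, pos)
        self.bridges[pos] = bridge_line

    def get_bridges(self, handle, n=3):
        """Return up to n bridges nearest to the hashed handle."""
        if not self.positions:
            return []
        # Handles are compared case-insensitively (an assumption).
        start = bisect.bisect_left(self.positions,
                                   self._position(handle.lower()))
        count = min(n, len(self.positions))
        size = len(self.positions)
        return [self.bridges[self.positions[(start + i) % size]]
                for i in range(count)]


# Populate the ring with fake bridge descriptors for testing.
ring = HandleHashring(b"not-a-real-key")
for i in range(10):
    ring.insert("%040X" % i, "bridge 10.0.0.%d:443 %040X" % (i, i))
```

The same handle always maps to the same few bridges, so re-requesting from one account does not reveal more of the ring. A federated variant could keep one such ring per domain.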

Twitter distributor would subclass this handle-based distributor, and implement actual bot functionality via Twisted. Parts of existing code can be reused.
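
The follow/unfollow workflow from the main plan could be tracked with something as small as the sketch below. It is independent of Twisted and of the Twitter API: instead of calling any API, each method returns a list of actions that the bot would then carry out. The names and the timeout value are hypothetical.

```python
class FollowTracker(object):
    """Track which users the bot follows and when to unfollow them."""

    def __init__(self, timeout=3600):
        self.timeout = timeout   # seconds before an automatic unfollow
        self.following = {}      # handle -> time the bot started following

    def on_follow(self, handle, now):
        """User followed the bot: follow back so DMs can be exchanged."""
        self.following[handle] = now
        return [("follow", handle)]

    def on_unfollow(self, handle):
        """User unfollowed the bot: drop the association on our side too."""
        if self.following.pop(handle, None) is not None:
            return [("unfollow", handle)]
        return []

    def expire(self, now):
        """Unfollow users who did not unfollow the bot in time."""
        stale = [h for h, t in self.following.items()
                 if now - t >= self.timeout]
        for handle in stale:
            del self.following[handle]
        return [("unfollow", h) for h in sorted(stale)]
```

Returning actions rather than performing them keeps this part testable without network access; `expire` would be called periodically, for example from a Twisted looping call.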

Discussion points:

Can expand if needed. TL;DR: actual bot functionality and whether it is too clumsy; CAPTCHAs and whether they are needed (or something else?); languages; code architecture. (Later on: see further expansion plans.)

Rough timeline (to be discussed later on / made more concrete as needed):

write a working Twitter bot PoC with the user-follow->bot-follow->user-send-DM->bot-send-DM flow
- I think I should do this first, because otherwise we may run into snags / important nuances in this phase later on
=> as soon as possible, ideally in the coming days
=> done, at https://twitter.com/wfntestacct

generic 'handle' distributor (test out hashring / get familiar with this)
working Twitter bot as a subclassed 'handle' distributor
=> until June

rate control mechanism (first approximation of / something that we can test out)
=> until July (this is conservative; hopefully earlier; but allows for all sorts of snags, and delay from before, if any)

[27th June is mid-term evaluation deadline]

a working, clean, robust version of Twitter distributor, with deliverables and features as discussed with developers
=> until August (this is again a tad conservative, but allows for expansion, lots of refactoring and discussion, etc.)

tests, documentation (hopefully some or a lot of this before August)
=> 11th August [this is 'soft pencils down' date]

whatever we want from the expansion plans to fit into GSoC scope should go here (or earlier, provided all is good with the core deliverables)
=> 18th August [this is 'hard pencils down' date]

2. Point us to a code sample:

Working PoC for a bridge-distributor-twitter-bot: https://github.com/wfn/twidibot (as of now, it can be interacted with here: https://twitter.com/wfntestacct)

torsearch backend code from last year is at https://github.com/wfn/torsearch. (Most of the work probably went into the nasty bottleneck solutions (in the *.sql scripts) and the onionoo_api query logic.) More code samples are available if needed.

3. Why do you want to work with the Tor Project in particular?

I would like to continue with my efforts to help the Tor community, and to develop for the Tor Project.

4. Tell us about your experiences in free software development environments.

Besides Tor, nothing substantial in terms of free software development. Active free software user and supporter. Experience in using tools of the (open source) trade.

5. Will you be working full-time on the project for the summer?

My plan is (and I have spent time making sure this is possible) to be able to devote as much time to GSoC this summer as last summer (if not more.) I will be doing a "0.3"-time (basically quarter-time) job (which is mostly remote) for my faculty (light sysadmin/programming.) No academic obligations throughout the whole coding period, though.

6. Will your project need more work and/or maintenance after the summer ends? What are the chances you will stick around and help out with that and other related projects?

Yes, I think so. It should become a natural part of the BridgeDB codebase, and if all goes well, it will get deployed and actually used. The first $x months will probably require at least some (if not continuous) attention. In addition, further work on BridgeDB is definitely possible. In any case, no plans on disappearing!

7. What is your ideal approach to keeping everybody informed of your progress, problems, and questions over the course of the project?

Bi-weekly reports to @tor-dev, further discussion on @tor-dev (or with wider community, e.g. @tor-talk) or privately. Problems, etc. discussed via email; also via IRC for more synchronous discussion.

8. What school are you attending? What year are you, and what's your major/degree/focus? If you're part of a research group, which one?

Vilnius University, philosophy undergraduate, 3rd year.

9. Is there anything else we should know that will make us like your project more?

I would be very excited to be able to focus my efforts on Tor once again! I hope to continue to be involved.

This is the only GSoC project I am applying to.

Contact:

kostas at jakeliunas period com
XMPP: phistopheles at jabber period org
IRC: wfn
4096R/0E5DCE45