Here follows an accepted project proposal for GSoC'13, for the Tor Project.

(For a raw Markdown text version, see gsoc2013.md)

Also see the short intro on the tor-dev mailing list.

Abstract:

I'd like to create a more integrated and powerful system for searching and browsing the descriptor archives. (The current tools are very restrictive and the experience disjointed.) To do this, I'll write an archival browsing application in which the results themselves are interactive: each result field can act as a further search filter. Together with a search string input tool offering more filtering options, the application will provide a more cohesive archival browse & search experience and will be a more efficient tool.


1. What project would you like to work on?

I'm interested in one of the project ideas listed, namely, the Searchable Tor descriptor and Metrics data archive.

The current archival search incarnation (written using Java servlets) gets the job done if one simply wants to, e.g., look up a specific relay, or knows a specific date range in advance. However, the results are not really browseable: while one may click on descriptor IDs in the results to look up particular relays, or click on dates to get a consensus dump, the experience is not interactive. The only way to refine one's search is by changing the query string, which is itself very restrictive, and the search tools provided (relay search, consensus date-based search/dump, descriptor info lookup/dump) are integrated only to a limited extent.

It would make sense to be able to browse through archival data by continuously refining one's search filters. Clicking on data fields in the results should be semantically the same as entering search terms; both filtering approaches must be provided. (The current search query system allows a few filters to be entered, but it (i) greatly restricts their combined use due to efficiency constraints (what if I want to specify a longer non-contiguous interval of potential days of interest?); and (ii) supports only a limited number of such filters.) The results page must therefore be sensitive to the semantic contents of each descriptor, so that entering search terms and browsing through results becomes a uniform experience.

Hence the implementation I am proposing involves not only refined search tools and options, but a more integrated overall architecture. That was a mouthful; a more specific design follows: first a technical overview of the backend with a working proof of concept, then a list of the important user interface and architectural components, and finally a timeline.

a) backend / database - this is part of the PoC

I have written a minimal working backbone of the backend. The whole application will be in Python, with Flask as our (lightweight) framework. SQLAlchemy acts as an efficient ORM abstraction layer.
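To make the intended stack concrete, here is a minimal sketch of the wiring (connection string and names are placeholders, not the PoC's actual layout):

```python
# Minimal stack wiring sketch: Flask serves requests, SQLAlchemy provides a
# scoped ORM session over a local Postgres database. All names here (database
# URL, module layout) are illustrative assumptions.
from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.ext.declarative import declarative_base

app = Flask(__name__)

engine = create_engine('postgresql://tsweb:tsweb@localhost/tordescriptors')
Session = scoped_session(sessionmaker(bind=engine))
Base = declarative_base()  # ORM models (e.g. the Descriptor model) derive from this

@app.teardown_appcontext
def shutdown_session(exception=None):
    # Give the session back to the pool after each request.
    Session.remove()
```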

Archival data (consensuses, server descriptors) can be downloaded automatically from metrics.torproject.org. The most efficient way to keep the system updated is to rsync the remote 'recent' folder at metrics.torproject.org::metrics-recent; the data there does not need to be uncompressed (and rsync's binary diffs would be inefficient over compressed data). A simple crontab entry can rsync once per hour (relay descriptors are published every hour).

The data files are fed into a Postgres database using Stem's DescriptorReader. DescriptorReader can persist a list of processed file paths, so it skips files that have already been imported. Mapping Stem's descriptor objects onto the ORM is trivial and has been done in the proof of concept (see tsweb/importer.py).
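A rough sketch of the hourly import step follows (purely illustrative; the actual import lives in tsweb/importer.py and the model in tsweb/models.py, so the import path, paths and field names below are assumptions):

```python
# Hourly import sketch: read everything new under the rsync'ed 'recent' folder
# with Stem's DescriptorReader and map each server descriptor onto the ORM.
from stem.descriptor.reader import (DescriptorReader, load_processed_files,
                                    save_processed_files)

from tsweb.models import Descriptor  # the PoC's model (import path assumed)

DESCRIPTOR_DIR = '/srv/metrics-recent'           # local rsync target (assumed)
PERSISTENCE_FILE = '/srv/tsweb-processed-files'  # lets the reader skip old files
BATCH_SIZE = 100000                              # commit every N rows (see below)

def import_descriptors(session):
    try:
        processed = load_processed_files(PERSISTENCE_FILE)
    except IOError:
        processed = {}  # first run: nothing imported yet

    with DescriptorReader([DESCRIPTOR_DIR]) as reader:
        reader.set_processed_files(processed)
        count = 0
        # Assuming the target directory holds server descriptors; consensus
        # status entries will need their own mapping (see below).
        for desc in reader:
            session.add(Descriptor(fingerprint=desc.fingerprint,
                                   nickname=desc.nickname,
                                   address=desc.address,
                                   published=desc.published))
            count += 1
            if count % BATCH_SIZE == 0:
                session.commit()  # bound the memory footprint
        session.commit()
        save_processed_files(PERSISTENCE_FILE, reader.get_processed_files())
```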

This is where one of the two main expected efficiency/speed bottlenecks lies: importing the large volume of descriptor files (and doing that every hour) might need specific technical solutions. However, to the best of my understanding, the current solution works rather efficiently: I timed an import of a month's worth of descriptor data (323715 descriptors in total) at 838 seconds (see tsweb/models.py for the current Descriptor model). Three nuances here. Firstly, the number of data fields will eventually increase (e.g. extra-info data is not being parsed yet). Secondly, the database itself currently contains only just shy of a million descriptor entries, which is really not that much compared to the overall archives; continuous benchmarking of row insertion time is needed here. Finally, consensus data has not been imported yet (a separate mapping of consensus status entries to the ORM is needed). On the whole, though, if we follow the plan of rsync'ing and importing only modified/new descriptor data every hour, performance should not be an issue. I suspect that turning autocommit off greatly helps here.

(While at it: not committing hundreds of thousands of rows during an import eats a lot of memory; the current implementation commits every 100000 rows and triggers garbage collection, which results in an agreeable memory footprint (not committing at all during the import got the process killed by the OS). If this becomes an issue, dropping the ORM abstraction during import and doing raw SQL inserts should help both memory-wise and performance-wise, though the latter does not seem to be a problem at all and the former works fine now.)

The question of scalability to 100G+ of descriptor data, though looking good, remains to be answered: import speed should not be a problem, but a benchmark over all available archival data should be conducted soon.
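For the continuous benchmarking mentioned above, a tiny helper along these lines (not part of the PoC) would be enough to wrap both import runs and ad-hoc queries:

```python
# Minimal timing helper for import/query benchmarking (illustrative only).
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    start = time.time()
    yield
    print('%s took %.1f s' % (label, time.time() - start))

# Usage, e.g.:
#   with timed('monthly import'):
#       import_descriptors(session)
```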

Moving closer to the frontend/user, we find another potential performance bottleneck: extracting results from the database when querying with complex filters.

Initial benchmarking of data extraction (e.g. filtering over columns neither of which is a primary key) shows good backend response times. Scalability here is bound to how complex the ORM queries become, and how that affects their performance. Firstly, when building ORM queries, clauses should obviously be added only when they are truly needed. While ORM layers can be criticized as leaky abstractions that do not necessarily generate the most efficient SQL, my previous experience with SQLAlchemy shows that the generated queries, when inspected, are usually well constructed. In our case this means that mapping user-specified filter chains onto incremental ORM query chains should work well. I have observed that the JOINs produced make sense and are needed, etc.; however, I have not yet inspected the underlying structure of a truly deep ORM query object. So far I have only tried a very limited approach to incremental descriptor search (e.g. in the current frontend one can click a primitive link which adds a date range filter on top of the existing query). The ORM is responsive, but restricting the eventual user query/filter depth makes sense. Still, making a significant improvement over the constraints of the current relay search servlet is, I think, a very realistic goal. (I am planning to write a simple internal logging tool which records each query (its SQL representation, for example) together with its execution time; I will then be able to observe peaks, median times, etc. by grep/awk'ing a simple text file.)
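Two small sketches of what is meant here: building the ORM query incrementally from only the filters actually supplied, and hooking SQLAlchemy's engine events to log each emitted statement with its execution time (this is the 'simple internal logging tool' idea; all names are illustrative):

```python
# Incremental query building: clauses are appended only for filters the user
# actually supplied, so simple queries stay simple.
def build_query(session, nickname=None, address=None, date_from=None, date_to=None):
    query = session.query(Descriptor)
    if nickname:
        query = query.filter(Descriptor.nickname == nickname)
    if address:
        query = query.filter(Descriptor.address == address)
    if date_from:
        query = query.filter(Descriptor.published >= date_from)
    if date_to:
        query = query.filter(Descriptor.published <= date_to)
    return query

# Query timing log: record every statement and its execution time to a plain
# text file, to be grep/awk'ed later for peaks, medians, etc.
import logging
import time

from sqlalchemy import event
from sqlalchemy.engine import Engine

logging.basicConfig(filename='query-times.log', level=logging.INFO)

@event.listens_for(Engine, 'before_cursor_execute')
def _before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    context._query_start_time = time.time()

@event.listens_for(Engine, 'after_cursor_execute')
def _after_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    total = time.time() - context._query_start_time
    logging.info('%.4f s\t%s', total, statement)
```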

b) frontend

The web application is served via Flask (which in turn uses Werkzeug); there shouldn't be any noticeable scalability issues there. The abstracted ORM allows for quick changes in the frontend without breaking things.

The Jinja2 templating engine (which Flask uses) is employed, which allows for clean frontend code/markup and a nice way to refer to the underlying backend objects. (The current layout/frontend is very minimal, especially in light of what is planned. While no performance bottlenecks are expected in this area, a considerable amount of work will be needed for a cohesive user experience.)
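Continuing the skeleton sketched in section a), a results page boils down to a route like the following (template name and parameter names are assumptions, and the real route would of course go through the full search string parser):

```python
# A single illustrative route: GET parameters add clauses to the ORM query and
# the results are rendered through a Jinja2 template ('results.html' assumed).
from flask import request, render_template

@app.route('/search')
def search():
    query = Session.query(Descriptor)
    nickname = request.args.get('nickname')
    if nickname:
        query = query.filter(Descriptor.nickname == nickname)
    return render_template('results.html',
                           descriptors=query.limit(50).all(),
                           search_string=nickname or '')
```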

The user interface plan is elaborated upon in the section below.

c) user interface and architectural components:

i) Powerful search string input:

More keywords and more forgiving input parsing. For example, a query like this should be allowed and encouraged:

myRelaysNickname from 2011-07 to 2011-09 or from 2011-12-05 to 2011-12-25 or on 2013-05-01 or on 2013-05-03

Other identifiers can be specified alongside; the idea would be to interpret such a combination of different kinds of filters as an AND condition by default:

myRelaysNickname from 2010-01 to 2010-12-31 79.98.25.182
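A rough sketch of how such a free-form string could be tokenized into a filter list (illustrative only, not the PoC's parser; the '-flag' tokens are covered under point ii below):

```python
# Tokenize a free-form search string into (kind, value) filters. The grammar
# here is a simplification for illustration.
import re

DATE_RE = re.compile(r'^\d{4}-\d{2}(-\d{2})?$')       # YYYY-MM or YYYY-MM-DD
IP_RE = re.compile(r'^\d{1,3}(\.\d{1,3}){3}$')

def parse_search_string(s):
    filters = []
    tokens = s.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if (tok == 'from' and i + 3 < len(tokens) and tokens[i + 2] == 'to'
                and DATE_RE.match(tokens[i + 1]) and DATE_RE.match(tokens[i + 3])):
            filters.append(('date_range', (tokens[i + 1], tokens[i + 3])))
            i += 4
        elif tok == 'on' and i + 1 < len(tokens) and DATE_RE.match(tokens[i + 1]):
            filters.append(('date', tokens[i + 1]))
            i += 2
        elif tok == 'or':
            filters.append(('or', None))       # joins adjacent date filters
            i += 1
        elif tok.startswith('-'):
            filters.append(('flag', tok[1:]))  # e.g. -exit, -hsdir (see ii below)
            i += 1
        elif IP_RE.match(tok):
            filters.append(('address', tok))
            i += 1
        else:
            filters.append(('nickname', tok))
            i += 1
    return filters

# parse_search_string('myRelaysNickname from 2010-01 to 2010-12-31 79.98.25.182')
# -> [('nickname', 'myRelaysNickname'),
#     ('date_range', ('2010-01', '2010-12-31')),
#     ('address', '79.98.25.182')]
```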

Ideally, a simple contradiction check (e.g. 'from' < subsequent 'to') could be done client-side via Javascript. This would not be a priority, as merging browsing and searching is the main goal here.

Relay flags from consensuses are parsed by Stem. They should serve as possible filters alongside all the other data: the user should be able to specify them using e.g. "-hsdir" or "-exit". Here, we move on to another point:

ii) Archival/metrics data integration:

Relay flags are available via consensuses; consensuses refer to relay descriptors; so the data is already 'integrated' as far as the backend is concerned - it only remains to be intelligently extracted and displayed.

A query amounting to the first one above (entered via filters (see below) or via the string input), but with a flag specified -

myRelaysNickname from 2010-01 to 2010-12-31 79.98.25.182 -exit

- would look in the date range, intersect with the IP address, and join/intersect the result with the consensus status list: the descriptors would be checked for whether they had the Exit flag set during said interval(s) of time.
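A sketch of how such a query could be expressed incrementally in the ORM; StatusEntry is a hypothetical model for per-relay consensus status entries, and the column names (and how flags are stored) are open schema questions, assumed here for illustration:

```python
# Hypothetical: join descriptors with consensus status entries, restricting by
# nickname, IP address, date range and the Exit flag.
def example_query(session):
    return (session.query(Descriptor)
            .join(StatusEntry,
                  StatusEntry.descriptor_digest == Descriptor.digest)
            .filter(Descriptor.nickname == 'myRelaysNickname')
            .filter(Descriptor.address == '79.98.25.182')
            .filter(StatusEntry.valid_after.between('2010-01-01', '2010-12-31'))
            # Whether flags become boolean columns, an array or a separate table
            # is still to be decided; a boolean column is assumed here.
            .filter(StatusEntry.is_exit == True))
```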

Each relay descriptor in the results should include a link to the list of network statuses that reference it.

Obviously, some queries will generate huge result sets, and it might prove necessary to simply restrict them. However (this remains to be tested for scaling as well as possible), current database engines handle COUNT reasonably well, so it might be enough to generate a paginated results page. While browsing / clicking through pages, the user could decide to further restrict their search parameters / filters.
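Pagination itself would then be straightforward (a sketch, with an arbitrary page size; whether COUNT over very large result sets stays cheap enough is exactly what needs benchmarking):

```python
# Plain LIMIT/OFFSET pagination over an already-built ORM query.
PER_PAGE = 50

def paginate(query, page):
    total = query.count()  # let the database handle the COUNT
    rows = (query.order_by(Descriptor.published.desc())
                 .limit(PER_PAGE)
                 .offset((page - 1) * PER_PAGE)
                 .all())
    return rows, total
```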

This section does need expansion, however: the final list of additional data fields to be semantically evaluated (in the sense of them becoming potential filters which would produce different results, etc.) remains to be set. I will carefully go through the descriptor and directory specifications; this needs to be done as soon as possible.

iii) Clickable filters

As all the relevant data fields will already have been neatly placed in our ORM, actually generating clickable results is not that hard. We will not dump raw data (that might become an option later, like gitweb's raw links; at the very least it should be possible to link to a place where one can e.g. extract public keys), but rather construct the results from each field of interest. If there is a field for which we do not wish to provide filtering capabilities, we simply print it out. Otherwise, clicking on a field (e.g. a directory-assigned flag) will introduce a new parameter in the search string (it should always be appended via GET, as that allows for easy copy-pasting of URIs and link permanence).
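Generating such a link amounts to re-issuing the current search with one more term appended to the GET query string; a minimal sketch (the 'q' parameter name and /search route are assumptions):

```python
# Build the href for a clickable result field: current search string plus one
# more term, encoded as a GET parameter so the URI stays copy-pasteable.
try:
    from urllib import urlencode             # Python 2
except ImportError:
    from urllib.parse import urlencode       # Python 3

def filter_link(current_query, new_term, base_url='/search'):
    combined = (current_query + ' ' + new_term).strip()
    return base_url + '?' + urlencode({'q': combined})

# filter_link('myRelaysNickname', '-exit') -> '/search?q=myRelaysNickname+-exit'
```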

The trickier part (frontend-coding-wise) would be providing a nice display of the currently employed filters. The cheapest workaround would simply be to let the user edit the search string in the input box (which should of course always reflect the current set of filters in place). It might also make sense to generate a clickable coloured array of fields, with the possibility to change relationships between parameters (e.g. OR to AND). That sounds rather convoluted, and it remains to be seen whether it would add to interactivity effectively; I am reminded of an online regular expression construction and visualization tool. If done correctly, this could be a powerful addition.

iv) Overall process of search & browse / user experience:

1) start page = a simple input field, with examples / explanations (perhaps expandable, so as not to clutter). Very simple intuitive queries (simply enter an IP address, see what the system spits back) need to work, and the most intuitive relationships, e.g. "relayName on 2013-04", also need to work well. (It makes sense to allow both "YYYY-MM-DD" and "YYYY-MM" wherever possible; the current servlet system seems to restrict these formats out of performance worries. Again, quite a lot of query benchmarking needs to be done, and the overall incremental ORM query approach should be fully attempted to find its practical performance limits. A small sketch of such date handling follows at the end of this section.)

2) results page = an input field with data as before, possibly a clickable area to visually observe and refine the set of active filters, and results which contain the crucial info in one place: relays should include nicknames, fingerprints and IP addresses, all of them clickable. There is no need to architecturally distinguish results pages from single entry pages: there should be an option to select which data to return (perhaps including public keys, etc.), and if the current query evaluates to a single result, that result simply becomes the individual descriptor page, including more data fields by default. Hence clicking on a particular descriptor ID anywhere in the results should lead to a single descriptor's page with more info, but simply because the query evaluated to a single result - the underlying system need not distinguish the two.

On any results page, the user can of course remove any of the filters to get back to a larger sample (minimally, by simply changing the query string in the input field; ideally, via the aforementioned visualization tool a la [2]). They should also be able to navigate to a consensus / network status list page seamlessly: the filters should be able to codify such selections, so that users can themselves manipulate what type of results they see.
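The date handling mentioned in 1) could look roughly like this (illustrative only): either format expands into a concrete [start, end) range for the ORM query.

```python
# Accept both "YYYY-MM" and "YYYY-MM-DD" search terms and expand them into a
# half-open datetime range.
import calendar
from datetime import datetime, timedelta

def parse_date_term(term):
    if len(term) == 7:                                   # "YYYY-MM"
        start = datetime.strptime(term, '%Y-%m')
        days_in_month = calendar.monthrange(start.year, start.month)[1]
        return start, start + timedelta(days=days_in_month)
    start = datetime.strptime(term, '%Y-%m-%d')          # "YYYY-MM-DD"
    return start, start + timedelta(days=1)

# parse_date_term('2013-04') -> (datetime(2013, 4, 1), datetime(2013, 5, 1))
```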

d) Minimal set of deliverables:

e) Timeline / tasks and deliverables:

Until May 27th:

I have exams mid-June, so my idea is to do some work before that. Until May 27th (that's the official start if I understood correctly?), I want to have:

May 27th - June 3rd:

June 3rd - June 21st:

June 21st - July 1st:

July 1st - 8th

July 8th - 15th

July 15th - 22nd

July 22nd - 29th

July 29th 19 UTC: mid-term evaluations - start of submission period

August 2nd 19h UTC: eval submission deadline

July 29th - August 5th:

August 5th - 12th:

August 12th - 19th:

August 19th - September 9th (3 weeks):

September 9th - 16th

September 16th - 23rd 19h (code writing deadline)

2. Point us to a code sample:

The source code for the functioning proof of concept (descriptor import and search) is at the PoC.

More code available on request - this past year I've been intensely freelancing, mostly Python. (I've also submitted a bugfix for Stem today, but it's only 2 changed lines.. :) The test script attached is a few lines longer, but those are small quick thingies.)

3. Why do you want to work with the Tor Project in particular?

I'm becoming more and more convinced that 1. free speech and anonymity, and the degrees to which they are actualised in a given place/domain, affect human lives very directly. People get their heads chopped off because they post pro-uprising messages on Facebook in Syria. Tor usage in Iran spikes during its elections; the next elections are in mid-June. 2. (less dramatically,) technology can change lives, affect them directly, and empower people. Public-private key cryptography was, in my opinion, one of the more important technological achievements of the 20th century - perhaps more time will need to pass for this to resurface. Just as TrueCrypt's hidden volumes empower users (there is (usually) no way to prove there is a hidden volume), so (very obviously) does Tor. I've had the pleasure of using Tor, and I know people who use it to its full "I will send these important facts about my homeland" potential.

It is very interesting to realise these two points, and meanwhile to understand that I do care about people - I'm beginning to understand that one should cherish others who care and do something, and exercise my own dispositions and abilities in this regard. I'm young and naive, but I'd really like to participate in this climate and in this project in particular. I hope this will become the start of my continued involvement, participation and volunteering in the Tor Project in the future.

(that was still dramatic..)

4. Tell us about your experiences in free software development environments.

I have been using and supporting free software since my early high school days; however, I haven't written anything of significance for open source projects (save for the occasional bug report and some patches long ago). I am familiar with bug tracking software and version control systems and have used them extensively (especially the latter). Hopefully Tor will be one of the ways to start contributing to FOSS in a more decisive way. :)

5. Will you be working full-time on the project for the summer?

Yes, full-time - I won't need to do (and will be able to avoid) any freelancing or part-time work apart from developing for Tor.

6. Will your project need more work and/or maintenance after the summer ends? What are the chances you will stick around and help out with that and other related projects?

As per usual, it will have to be looked after, to continue observing how it scales to many users/visitors. I very much plan to stay put where I am, though - at the very least, I plan to be able to continue providing needed maintenance for it. My overall plan is to stick around Tor and contribute to other things - I'd like to imagine I'll be around for a long time!

7. What is your ideal approach to keeping everybody informed of your progress, problems, and questions over the course of the project?

IRC is for me a great tool to keep myself and others in the loop. It's a great tool to quickly discuss problems, plans, and to continually stay in touch with people. I'm available over XMPP and email, too - I plan to give (at the very least) bi-weekly summaries, and do them more frequently if need be. Mailing lists (tor-dev) are good for longer discussions, reports and so on. I plan to keep everyone interested updated over tor-dev / email.

8. What school are you attending? What year are you, and what's your major/degree/focus? If you're part of a research group, which one?

Here's the part where it'll maybe sound somewhat random - I'm a second-year philosophy undergraduate at Vilnius University, Lithuania. The most relevant courses in terms of technology were in logic. I have certain academic interests in areas only tangentially related to software development & engineering, it would seem. As far as programming is concerned, I've been programming since 9th grade, and have more than five years of paid freelance programming experience.

9. Is there anything else we should know that will make us like your project more?

Oh man am I late to submit this application! I'm really looking forward to working with you folks, though!


This is the only GSoC project I'm applying to.

Contact:

kostas at jakeliunas period com

XMPP: phistopheles at jabber period org

IRC: wfn


NOTES:

EDITS:

End of proposal body.


Karsten Loesing May 4, 2013, 9:33 a.m.

Hi Kostas,

thanks for this nicely written proposal! Don't worry at all about the formatting, it's the content that counts.

While reading over your proposal I was wondering how to integrate your proposed tool into the metrics ecosystem. It would sure be good to replace the relay search application and ExoneraTor and provide a more general interface for those use cases. But maybe we can go one step further and make your tool the new Onionoo front-end application. The advantage would be that we don't get a new system to maintain, and that Onionoo clients like Atlas could use the new functionality with little effort.

Let me explain this in more detail. Onionoo has a quite simple search interface that allows you to search for relays or bridges that have been running in the past 7 days. Its summary and details replies contain data only from the most recent consensus and only from the most recent server descriptor. These limitations are there, you guessed it, to provide reasonable search performance. What your tool could do is remove both limitations by allowing searches for relays or bridges that have been running at any time in the past (since 2007), and by providing descriptor details for any given time in the past. The idea would be that your tool does all the heavy lifting, including parsing complex search strings, so that the user interface only needs to present results.

When you look at the Onionoo protocol, you'll note that it further provides aggregate bandwidth and weights information per relay. But those documents are not searchable, so your tool wouldn't have to worry about them. These documents could still be provided by the current Onionoo code and imported into your database, for example.

Does that idea make any sense to you?

Thanks! Karsten


Kostas Jakeliunas May 5, 2013, 10:35 a.m.

Hi Karsten,

thanks for the pointer to Onionoo, I looked it over and also looked at Atlas; it's very nicely done. I'm still getting my feet wet with the whole array of Tor programs/projects.

I was actually thinking that it would make sense (in terms of reusability of components, maintainability, and maybe stability) to separate the backend from the frontend in the new system, such that the backend would receive queries, do all the hard work as you say, and return a standardized JSON response. So in general, I think the idea would probably be to kind of go in that direction in any case.

> But maybe we can go one step further and make your tool the new Onionoo (https://onionoo.torproject.org/) front-end application.

Just to clarify: by frontend here you probably mean the whole implementation, i.e. a backend + browser frontend speaking an Onionoo-derived protocol? So the project would aim to:

Do you think the Onionoo-like protocol that the backend would speak could be made backwards-compatible with the current Onionoo spec? I suppose it could and that would be part of the plan: Atlas could basically use the new backend just like that.

Of course the new protocol would also need to have new capabilities, and the new client-side frontend would use them: queries would have to allow for more complexity, and the results/output would include more fields, etc. (e.g. for each descriptor there could be a pointer/URI to a list of network statuses in which the descriptor was present, and so on.) But perhaps it would be possible to make it current-Onionoo-compatible - would that be the idea?

Introducing the new incarnation of the protocol into the rest of the ecosystem makes sense and would be very nice. At the very least, joining ExoneraTor etc. into one system makes sense (again, here it would help to separate backend and frontend, so that different frontends (smaller tools, e.g. ones that only check whether a relay is an exit) could be 'plugged in'), but designing a uniform protocol would also be great. I'm trying to get familiar with all the current tools in place, I'm still feeling very noobish, but hopefully that will eventually change. :)

I've to run off to meet with relatives / Mother's day, so the reply was somewhat hasty - let's talk soon

K.


Kostas Jakeliunas May 5, 2013, 11:11 p.m.

Also:

> When you look at the Onionoo protocol, you'll note that it further provides aggregate bandwidth and weights information per relay. But those documents are not searchable, so your tool wouldn't have to worry about them. These documents could still be provided by the current Onionoo code and imported into your database, for example.

I haven't done any benchmarking for the existing bw/weight aggregation routines (they're done by the database itself as I see (PL/pgSQL), and if done right, should be efficient), but I suppose the bw aggregation process could be done by the same (new) backend; would have to get/precalculate the weight data for all old entries, but this would only need to be done once, if I'm not mixing things up.

I think the good news is that the backend part of the project (even if the backend+frontend end up working as a single application) can be worked on while Onionoo-related decisions (whether to make it into the new standard protocol(+implementation)) are being made: the backend will in any case receive complex queries as a simple string I suppose (so all / absolute majority of parsing logic in the backend), and it can communicate the results back to the frontend client-side part in standardized JSON.

(I've also looked at and ran the Compass tool which communicates with Onionoo, a very nice tool indeed, and could also be used to test out backwards-compatibility of the new onionoo-speaking backend were it to happen, etc.)

But let me know if you had something else in mind, or if I should try being more specific when I try to map the to-be solution onto constraints, so to speak. But in short, I do not see why we should not aim for a standardized protocol.


Karsten Loesing May 6, 2013, 8:19 a.m.

Hi Kostas,

> thanks for the pointer to Onionoo, I looked it over and also looked at Atlas; it's very nicely done. I'm still getting my feet wet with the whole array of Tor programs/projects.

Yeah, there are confusingly many programs in the Tor ecosystem. If you want to sit back for an hour and get an overview about them, there's Roger's and Jake's 29C3 talk:

The Tor software ecosystem [29c3[preview]]

> I was actually thinking that it would make sense (in terms of reusability of components, maintainability, and maybe stability) to separate the backend from the frontend in the new system, such that the backend would receive queries, do all the hard work as you say, and return a standardized JSON response. So in general, I think the idea would probably be to kind of go in that direction in any case.

Okay, great!

> Just to clarify: by frontend here you probably mean the whole implementation, i.e. a backend + browser frontend speaking an Onionoo-derived protocol? So the project would aim to:

Hmm, you're right, my use of the term "front-end" was rather confusing. But your description matches what I meant pretty well.

> Do you think the Onionoo-like protocol that the backend would speak could be made backwards-compatible with the current Onionoo spec? I suppose it could and that would be part of the plan: Atlas could basically use the new backend just like that.

Yes, making it backwards-compatible would be really useful. Having said that, this requirement shouldn't prevent you from achieving your original goal. So, if you're at a point where you'd have to spend a lot of time on making your protocol backward-compatible, please ignore that and stay focused. We can always make it backward-compatible later. Or we might decide to deprecate that feature, adapt Atlas and the other Onionoo clients, and take out the feature two months later.

> Of course the new protocol would also need to have new capabilities, and the new client-side frontend would use them: queries would have to allow for more complexity, and the results/output would include more fields, etc. (e.g. for each descriptor there could be a pointer/URI to a list of network statuses in which the descriptor was present, and so on.) But perhaps it would be possible to make it current-Onionoo-compatible - would that be the idea?

Yes, that would be the idea. Adding new fields is possible, though we'll always have to keep document size in mind; if you download a details document for all running relays, size matters. We could also add new document types. In general, we should try to keep the protocol as simple as possible. But yes, that would be the idea.

> Introducing the new incarnation of the protocol into the rest of the ecosystem makes sense and would be very nice. At the very least, joining ExoneraTor etc. into one system makes sense (again, here it would help to separate backend and frontend, so that different frontends (smaller tools, e.g. ones that only check whether a relay is an exit) could be 'plugged in'), but designing a uniform protocol would also be great. I'm trying to get familiar with all the current tools in place, I'm still feeling very noobish, but hopefully that will eventually change. :)

Yes, making ExoneraTor a simple web front-end of Onionoo would be really cool!

> I've to run off to meet with relatives / Mother's day, so the reply was somewhat hasty - let's talk soon

Sure!

I'm also going to respond to your other reply here:

> I haven't done any benchmarking for the existing bw/weight aggregation routines (they're done by the database itself as I see (PL/pgSQL), and if done right, should be efficient), but I suppose the bw aggregation process could be done by the same (new) backend; would have to get/precalculate the weight data for all old entries, but this would only need to be done once, if I'm not mixing things up.

Onionoo's bandwidth and weights calculation only uses flat files. What you refer to is the metrics website that indeed uses PostgreSQL's arrays to aggregate bandwidth data. But anyway, we should probably keep this out of scope in order not to risk failing the original goal of making the search scale for years of data. Your project could simply ignore bandwidth and weights documents or let the original Onionoo answer requests for those.

> I think the good news is that the backend part of the project (even if the backend+frontend end up working as a single application) can be worked on while Onionoo-related decisions (whether to make it into the new standard protocol(+implementation)) are being made: the backend will in any case receive complex queries as a simple string I suppose (so all / absolute majority of parsing logic in the backend), and it can communicate the results back to the frontend client-side part in standardized JSON.

Agreed. Maybe we can extend the current search parameter to do complex queries and stay backward-compatible.

> (I've also looked at and ran the Compass tool which communicates with Onionoo, a very nice tool indeed, and could also be used to test out backwards-compatibility of the new onionoo-speaking backend were it to happen, etc.)

Sounds good.

> But let me know if you had something else in mind, or if I should try being more specific when I try to map the to-be solution onto constraints, so to speak. But in short, I do not see why we should not aim for a standardized protocol.

This is what I had in mind. Thanks for your very detailed replies!

Best, Karsten


Kostas Jakeliunas May 6, 2013, 12:58 p.m.

> Onionoo's bandwidth and weights calculation only uses flat files. What you refer to is the metrics website that indeed uses PostgreSQL's arrays to aggregate bandwidth data.

Ah, my bad, I hadn't looked at it closely. But yeah, I will keep in mind that it is possible to 'outsource' this data and not care about it for now.

Thanks for the video link!

I will now try to concentrate on the pre-May-27th list of tasks, in particular figuring out all the descriptor data fields etc. that will be needed in the future, wrapping all DB import and query commands in timeit, and so on.

Thanks for all your replies!


Karsten Loesing May 7, 2013, 10:32 a.m.

> Onionoo's bandwidth and weights calculation only uses flat files. What you refer to is the metrics website that indeed uses PostgreSQL's arrays to aggregate bandwidth data.

> Ah, my bad, I hadn't looked at it closely. But yeah, I will keep in mind that it is possible to 'outsource' this data and not care about it for now.

Actually, you're not to blame here. I need to document these things better. I just made a start and put up a diagram showing how the various metrics tools fit together: https://metrics.torproject.org/tools.html -- not really related to your proposal, but I thought it can't hurt to mention it here.