Here follows an accepted project proposal for GSoC'13, for the Tor Project.

(For a raw Markdown text version, see gsoc2013.md)

Also see the short intro on the tor-dev mailing list.

Abstract:

I'd like to create a more integrated and powerful system for searching and browsing the descriptor archives. (The current tools are very restrictive and the experience disjointed.) To do this, I'll write an archival browsing application in which the results themselves are interactive: each result field can act as a further search filter. Together with a search string input tool offering more filtering options, the application will provide a more cohesive archival browse & search experience and will be a more efficient tool.


1. What project would you like to work on?

I'm interested in one of the project ideas listed, namely, the Searchable Tor descriptor and Metrics data archive.

The current archival search incarnation (written using Java servlets) gets the job done if one simply wants to, e.g., look up a specific relay, or knows a specific date range in advance. However, the results are not really browseable: while one may click on descriptor IDs in the results to look up particular relays, or click on dates to get a consensus dump, the experience is not interactive. The only way to refine one's search is by changing the query string, which is itself very restrictive, and the search tools provided (relay search, consensus date-based search/dump, descriptor info lookup/dump) are integrated only to a limited extent.

It would make sense to be able to browse through archival data by continuously refining one's search filters. Clicking on data fields in the results should be semantically the same as entering search terms; both filtering approaches must be provided. (The current search query system allows a few filters to be entered, but it (i) greatly restricts their combined use due to efficiency constraints (what if I want to specify a longer non-contiguous interval of potential days of interest?); and (ii) supports only a limited number of such filters.) The results page must therefore be sensitive to the semantic contents of each descriptor, so that entering search terms and browsing through results becomes a uniform experience.

Hence the implementation I am proposing involves not only refined search tools and options, but a more integrated overall architecture. That was a mouthful; a more specific design follows: first a technical overview of the backend with a working proof of concept, then a list of the important user interface and architectural components, and finally a timeline.

a) backend / database - this is part of the PoC

I have written a minimal working backbone of the backend. The whole application will be in Python, with Flask as our (lightweight) framework. SQLAlchemy acts as an efficient ORM abstraction layer.
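To make the intended stack concrete, here is a minimal sketch of the wiring (connection string and names are placeholders, not the PoC's actual layout):

```python
# Minimal stack wiring sketch: Flask serves requests, SQLAlchemy provides a
# scoped ORM session over a local Postgres database. All names here (database
# URL, module layout) are illustrative assumptions.
from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.ext.declarative import declarative_base

app = Flask(__name__)

engine = create_engine('postgresql://tsweb:tsweb@localhost/tordescriptors')
Session = scoped_session(sessionmaker(bind=engine))
Base = declarative_base()  # ORM models (e.g. the Descriptor model) derive from this

@app.teardown_appcontext
def shutdown_session(exception=None):
    # Give the session back to the pool after each request.
    Session.remove()
```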

Archival data (consensuses, server descriptors) can be downloaded automatically from metrics.torproject.org. The most efficient way to keep the system updated is to rsync the remote 'recent' folder at metrics.torproject.org::metrics-recent; the data there does not need to be uncompressed (and rsync's binary diffs would be inefficient over compressed data). A simple crontab entry can rsync once per hour (relay descriptors are published every hour).

The data files are fed into a Postgres database using Stem's DescriptorReader. DescriptorReader can persist a list of processed file paths, so it skips files that have already been imported. Mapping Stem's descriptor objects onto the ORM is trivial and has been done in the proof of concept (see tsweb/importer.py).
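A rough sketch of the hourly import step follows (purely illustrative; the actual import lives in tsweb/importer.py and the model in tsweb/models.py, so the import path, paths and field names below are assumptions):

```python
# Hourly import sketch: read everything new under the rsync'ed 'recent' folder
# with Stem's DescriptorReader and map each server descriptor onto the ORM.
from stem.descriptor.reader import (DescriptorReader, load_processed_files,
                                    save_processed_files)

from tsweb.models import Descriptor  # the PoC's model (import path assumed)

DESCRIPTOR_DIR = '/srv/metrics-recent'           # local rsync target (assumed)
PERSISTENCE_FILE = '/srv/tsweb-processed-files'  # lets the reader skip old files
BATCH_SIZE = 100000                              # commit every N rows (see below)

def import_descriptors(session):
    try:
        processed = load_processed_files(PERSISTENCE_FILE)
    except IOError:
        processed = {}  # first run: nothing imported yet

    with DescriptorReader([DESCRIPTOR_DIR]) as reader:
        reader.set_processed_files(processed)
        count = 0
        # Assuming the target directory holds server descriptors; consensus
        # status entries will need their own mapping (see below).
        for desc in reader:
            session.add(Descriptor(fingerprint=desc.fingerprint,
                                   nickname=desc.nickname,
                                   address=desc.address,
                                   published=desc.published))
            count += 1
            if count % BATCH_SIZE == 0:
                session.commit()  # bound the memory footprint
        session.commit()
        save_processed_files(PERSISTENCE_FILE, reader.get_processed_files())
```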

This is where one of the two main expected efficiency/speed bottlenecks lies: importing the large volume of descriptor files (and doing that every hour) might need specific technical solutions. However, to the best of my understanding, the current solution works rather efficiently: I timed an import of a month's worth of descriptor data (323715 descriptors in total) at 838 seconds (see tsweb/models.py for the current Descriptor model). Three nuances here. Firstly, the number of data fields will eventually increase (e.g. extra-info data is not being parsed yet). Secondly, the database itself currently contains only just shy of a million descriptor entries, which is really not that much compared to the overall archives; continuous benchmarking of row insertion time is needed here. Finally, consensus data has not been imported yet (a separate mapping of consensus status entries to the ORM is needed). On the whole, though, if we follow the plan of rsync'ing and importing only modified/new descriptor data every hour, performance should not be an issue. I suspect that turning autocommit off greatly helps here.

(While at it: not committing hundreds of thousands of rows during an import eats a lot of memory; the current implementation commits every 100000 rows and triggers garbage collection, which results in an agreeable memory footprint (not committing at all during the import got the process killed by the OS). If this becomes an issue, dropping the ORM abstraction during import and doing raw SQL inserts should help both memory-wise and performance-wise, though the latter does not seem to be a problem at all and the former works fine now.)

The question of scalability to 100G+ of descriptor data, though looking good, remains to be answered: import speed should not be a problem, but a benchmark over all available archival data should be conducted soon.
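For the continuous benchmarking mentioned above, a tiny helper along these lines (not part of the PoC) would be enough to wrap both import runs and ad-hoc queries:

```python
# Minimal timing helper for import/query benchmarking (illustrative only).
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    start = time.time()
    yield
    print('%s took %.1f s' % (label, time.time() - start))

# Usage, e.g.:
#   with timed('monthly import'):
#       import_descriptors(session)
```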

Moving closer to the frontend/user, we find another potential performance bottleneck: extracting results from the database when querying with complex filters.

Initial benchmarking of data extraction (e.g. filtering over columns neither of which is a primary key) shows good backend response times. Scalability here is bound to how complex the ORM queries become, and how that affects their performance. Firstly, when building ORM queries, clauses should obviously be added only when they are truly needed. While ORM layers can be criticized as leaky abstractions that do not necessarily generate the most efficient SQL, my previous experience with SQLAlchemy shows that the generated queries, when inspected, are usually well constructed. In our case this means that mapping user-specified filter chains onto incremental ORM query chains should work well. I have observed that the JOINs produced make sense and are needed, etc.; however, I have not yet inspected the underlying structure of a truly deep ORM query object. So far I have only tried a very limited approach to incremental descriptor search (e.g. in the current frontend one can click a primitive link which adds a date range filter on top of the existing query). The ORM is responsive, but restricting the eventual user query/filter depth makes sense. Still, making a significant improvement over the constraints of the current relay search servlet is, I think, a very realistic goal. (I am planning to write a simple internal logging tool which records each query (its SQL representation, for example) together with its execution time; I will then be able to observe peaks, median times, etc. by grep/awk'ing a simple text file.)
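Two small sketches of what is meant here: building the ORM query incrementally from only the filters actually supplied, and hooking SQLAlchemy's engine events to log each emitted statement with its execution time (this is the 'simple internal logging tool' idea; all names are illustrative):

```python
# Incremental query building: clauses are appended only for filters the user
# actually supplied, so simple queries stay simple.
def build_query(session, nickname=None, address=None, date_from=None, date_to=None):
    query = session.query(Descriptor)
    if nickname:
        query = query.filter(Descriptor.nickname == nickname)
    if address:
        query = query.filter(Descriptor.address == address)
    if date_from:
        query = query.filter(Descriptor.published >= date_from)
    if date_to:
        query = query.filter(Descriptor.published <= date_to)
    return query

# Query timing log: record every statement and its execution time to a plain
# text file, to be grep/awk'ed later for peaks, medians, etc.
import logging
import time

from sqlalchemy import event
from sqlalchemy.engine import Engine

logging.basicConfig(filename='query-times.log', level=logging.INFO)

@event.listens_for(Engine, 'before_cursor_execute')
def _before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    context._query_start_time = time.time()

@event.listens_for(Engine, 'after_cursor_execute')
def _after_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    total = time.time() - context._query_start_time
    logging.info('%.4f s\t%s', total, statement)
```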

b) frontend

The web application is served via Flask (which in turn uses Werkzeug); there shouldn't be any noticeable scalability issues there. The abstracted ORM allows for quick changes in the frontend without breaking things.

The Jinja2 templating engine (which Flask uses) is employed, which allows for clean frontend code/markup and a nice way to refer to the underlying backend objects. (The current layout/frontend is very minimal, especially in light of what is planned. While no performance bottlenecks are expected in this area, a considerable amount of work will be needed for a cohesive user experience.)
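Continuing the skeleton sketched in section a), a results page boils down to a route like the following (template name and parameter names are assumptions, and the real route would of course go through the full search string parser):

```python
# A single illustrative route: GET parameters add clauses to the ORM query and
# the results are rendered through a Jinja2 template ('results.html' assumed).
from flask import request, render_template

@app.route('/search')
def search():
    query = Session.query(Descriptor)
    nickname = request.args.get('nickname')
    if nickname:
        query = query.filter(Descriptor.nickname == nickname)
    return render_template('results.html',
                           descriptors=query.limit(50).all(),
                           search_string=nickname or '')
```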

The user interface plan is elaborated upon in the section below.

c) user interface and architectural components:

i) Powerful search string input:

More keywords and more forgiving input parsing. For example, a query like this should be allowed and encouraged:

myRelaysNickname from 2011-07 to 2011-09 or from 2011-12-05 to 2011-12-25 or on 2013-05-01 or on 2013-05-03

Other identifiers can be specified alongside; the idea would be to interpret such a combination of different kinds of filters as an AND condition by default:

myRelaysNickname from 2010-01 to 2010-12-31 79.98.25.182
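A rough sketch of how such a free-form string could be tokenized into a filter list (illustrative only, not the PoC's parser; the '-flag' tokens are covered under point ii below):

```python
# Tokenize a free-form search string into (kind, value) filters. The grammar
# here is a simplification for illustration.
import re

DATE_RE = re.compile(r'^\d{4}-\d{2}(-\d{2})?$')       # YYYY-MM or YYYY-MM-DD
IP_RE = re.compile(r'^\d{1,3}(\.\d{1,3}){3}$')

def parse_search_string(s):
    filters = []
    tokens = s.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if (tok == 'from' and i + 3 < len(tokens) and tokens[i + 2] == 'to'
                and DATE_RE.match(tokens[i + 1]) and DATE_RE.match(tokens[i + 3])):
            filters.append(('date_range', (tokens[i + 1], tokens[i + 3])))
            i += 4
        elif tok == 'on' and i + 1 < len(tokens) and DATE_RE.match(tokens[i + 1]):
            filters.append(('date', tokens[i + 1]))
            i += 2
        elif tok == 'or':
            filters.append(('or', None))       # joins adjacent date filters
            i += 1
        elif tok.startswith('-'):
            filters.append(('flag', tok[1:]))  # e.g. -exit, -hsdir (see ii below)
            i += 1
        elif IP_RE.match(tok):
            filters.append(('address', tok))
            i += 1
        else:
            filters.append(('nickname', tok))
            i += 1
    return filters

# parse_search_string('myRelaysNickname from 2010-01 to 2010-12-31 79.98.25.182')
# -> [('nickname', 'myRelaysNickname'),
#     ('date_range', ('2010-01', '2010-12-31')),
#     ('address', '79.98.25.182')]
```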

Ideally, a simple contradiction check (e.g. 'from' < subsequent 'to') could be done client-side via Javascript. This would not be a priority, as merging browsing and searching is the main goal here.

Relay flags from consensuses are parsed by Stem. They should serve as possible filters alongside all the other data: the user should be able to specify them using e.g. "-hsdir" or "-exit". Here, we move on to another point:

ii) Archival/metrics data integration:

Relay flags are available via consensuses; consensuses refer to relay descriptors; so the data is already 'integrated' as far as the backend is concerned - it only remains to be intelligently extracted and displayed.

A query amounting to the first one above (entered via filters (see below) or via the string input), but with a flag specified -

myRelaysNickname from 2010-01 to 2010-12-31 79.98.25.182 -exit

- would look in the date range, intersect with the IP address, and join/intersect the result with the consensus status list: the descriptors would be checked for whether they had the Exit flag set during said interval(s) of time.
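A sketch of how such a query could be expressed incrementally in the ORM; StatusEntry is a hypothetical model for per-relay consensus status entries, and the column names (and how flags are stored) are open schema questions, assumed here for illustration:

```python
# Hypothetical: join descriptors with consensus status entries, restricting by
# nickname, IP address, date range and the Exit flag.
def example_query(session):
    return (session.query(Descriptor)
            .join(StatusEntry,
                  StatusEntry.descriptor_digest == Descriptor.digest)
            .filter(Descriptor.nickname == 'myRelaysNickname')
            .filter(Descriptor.address == '79.98.25.182')
            .filter(StatusEntry.valid_after.between('2010-01-01', '2010-12-31'))
            # Whether flags become boolean columns, an array or a separate table
            # is still to be decided; a boolean column is assumed here.
            .filter(StatusEntry.is_exit == True))
```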

Each relay descriptor in the results should include a link to the list of network statuses that reference it.

Obviously, some queries will generate huge result sets, and it might prove necessary to simply restrict them. However (this remains to be tested for scaling as well as possible), current database engines handle COUNT reasonably well, so it might be enough to generate a paginated results page. While browsing / clicking through pages, the user could decide to further restrict their search parameters / filters.
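Pagination itself would then be straightforward (a sketch, with an arbitrary page size; whether COUNT over very large result sets stays cheap enough is exactly what needs benchmarking):

```python
# Plain LIMIT/OFFSET pagination over an already-built ORM query.
PER_PAGE = 50

def paginate(query, page):
    total = query.count()  # let the database handle the COUNT
    rows = (query.order_by(Descriptor.published.desc())
                 .limit(PER_PAGE)
                 .offset((page - 1) * PER_PAGE)
                 .all())
    return rows, total
```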

This section does need expansion, however: the final list of additional data fields to be semantically evaluated (in the sense of them becoming potential filters which would produce different results, etc.) remains to be set. I will carefully go through the descriptor and directory specifications; this needs to be done as soon as possible.

iii) Clickable filters

As all the relevant data fields will already have been neatly placed in our ORM, actually generating clickable results is not that hard. We will not dump raw data (that might become an option later, like gitweb's raw links; at the very least it should be possible to link to a place where one can e.g. extract public keys), but rather construct the results from each field of interest. If there is a field for which we do not wish to provide filtering capabilities, we simply print it out. Otherwise, clicking on a field (e.g. a directory-assigned flag) will introduce a new parameter in the search string (it should always be appended via GET, as that allows for easy copy-pasting of URIs and link permanence).
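Generating such a link amounts to re-issuing the current search with one more term appended to the GET query string; a minimal sketch (the 'q' parameter name and /search route are assumptions):

```python
# Build the href for a clickable result field: current search string plus one
# more term, encoded as a GET parameter so the URI stays copy-pasteable.
try:
    from urllib import urlencode             # Python 2
except ImportError:
    from urllib.parse import urlencode       # Python 3

def filter_link(current_query, new_term, base_url='/search'):
    combined = (current_query + ' ' + new_term).strip()
    return base_url + '?' + urlencode({'q': combined})

# filter_link('myRelaysNickname', '-exit') -> '/search?q=myRelaysNickname+-exit'
```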

The trickier part (frontend-coding-wise) would be providing a nice display of the currently employed filters. The cheapest workaround would simply be to let the user edit the search string in the input box (which should of course always reflect the current set of filters in place). It might also make sense to generate a clickable coloured array of fields, with the possibility to change relationships between parameters (e.g. OR to AND). That sounds rather convoluted, and it remains to be seen whether it would add to interactivity effectively; I am reminded of an online regular expression construction and visualization tool. If done correctly, this could be a powerful addition.

iv) Overall process of search & browse / user experience:

1) start page = a simple input field, with examples / explanations (perhaps expandable, so as not to clutter). Very simple intuitive queries (simply enter an IP address, see what the system spits back) need to work, and the most intuitive relationships, e.g. "relayName on 2013-04", also need to work well. (It makes sense to allow both "YYYY-MM-DD" and "YYYY-MM" wherever possible; the current servlet system seems to restrict these formats out of performance worries. Again, quite a lot of query benchmarking needs to be done, and the overall incremental ORM query approach should be fully attempted to find its practical performance limits. A small sketch of such date handling follows at the end of this section.)

2) results page = an input field with data as before, possibly a clickable area to visually observe and refine the set of active filters, and results which contain the crucial info in one place: relays should include nicknames, fingerprints and IP addresses, all of them clickable. There is no need to architecturally distinguish results pages from single entry pages: there should be an option to select which data to return (perhaps including public keys, etc.), and if the current query evaluates to a single result, that result simply becomes the individual descriptor page, including more data fields by default. Hence clicking on a particular descriptor ID anywhere in the results should lead to a single descriptor's page with more info, but simply because the query evaluated to a single result - the underlying system need not distinguish the two.

On any results page, the user can of course remove any of the filters to get back to a larger sample (minimally, by simply changing the query string in the input field; ideally, via the aforementioned visualization tool a la [2]). They should also be able to navigate to a consensus / network status list page seamlessly: the filters should be able to codify such selections, so that users can themselves manipulate what type of results they see.
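The date handling mentioned in 1) could look roughly like this (illustrative only): either format expands into a concrete [start, end) range for the ORM query.

```python
# Accept both "YYYY-MM" and "YYYY-MM-DD" search terms and expand them into a
# half-open datetime range.
import calendar
from datetime import datetime, timedelta

def parse_date_term(term):
    if len(term) == 7:                                   # "YYYY-MM"
        start = datetime.strptime(term, '%Y-%m')
        days_in_month = calendar.monthrange(start.year, start.month)[1]
        return start, start + timedelta(days=days_in_month)
    start = datetime.strptime(term, '%Y-%m-%d')          # "YYYY-MM-DD"
    return start, start + timedelta(days=1)

# parse_date_term('2013-04') -> (datetime(2013, 4, 1), datetime(2013, 5, 1))
```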

d) Minimal set of deliverables:

e) Timeline / tasks and deliverables:

Until May 27th:

I have exams mid-June, so my idea is to do some work before that. Until May 27th (that's the official start if I understood correctly?), I want to have:

May 27th - June 3rd:

June 3rd - June 21st:

June 21st - July 1st:

July 1st - 8th

July 8th - 15th

July 15th - 22nd

July 22nd - 29th

July 29th 19 UTC: mid-term evaluations - start of submission period

August 2nd 19h UTC: eval submission deadline

July 29th - August 5th:

August 5th - 12th:

August 12th - 19th:

August 19th - September 9th (3 weeks):

September 9th - 16th

September 16th - 23rd 19h (code writing deadline)

2. Point us to a code sample:

The source code for the functioning proof of concept (descriptor import and search) is at the PoC.

More code available on request - this past year I've been intensely freelancing, mostly Python. (I've also submitted a bugfix for Stem today, but it's only 2 changed lines.. :) The test script attached is a few lines longer, but those are small quick thingies.)

3. Why do you want to work with the Tor Project in particular?

I'm becoming more and more convinced that 1. free speech and anonymity, and the degrees to which they are actualised in a given place/domain, affect human lives very directly. People get their heads chopped off because they post pro-uprising messages on Facebook in Syria. Tor usage in Iran spikes during its elections; the next elections are in mid-June. 2. (less dramatically,) technology can change lives, affect them directly, and empower people. Public-private key cryptography was, in my opinion, one of the more important technological achievements of the 20th century - perhaps more time will need to pass for this to resurface. Just as TrueCrypt's hidden volumes empower users (there is (usually) no way to prove there is a hidden volume), so (very obviously) does Tor. I've had the pleasure of using Tor, and I know people who use it to its full "I will send these important facts about my homeland" potential.

It is very interesting to realise these two points, and meanwhile to understand that I do care about people - I'm beginning to understand that one should cherish others who care and do something, and exercise my own dispositions and abilities in this regard. I'm young and naive, but I'd really like to participate in this climate and in this project in particular. I hope this will become the start of my continued involvement, participation and volunteering in the Tor Project in the future.

(that was still dramatic..)

4. Tell us about your experiences in free software development environments.

I have been using and supporting free software since my early high school days; however, I haven't written anything of significance for open source projects (save for the occasional bug report and some patches long ago). I am familiar with bug tracking software and version control systems and have used them extensively (especially the latter). Hopefully Tor will be one of the ways to start contributing to FOSS in a more decisive way. :)

5. Will you be working full-time on the project for the summer?

Yes, full-time - I won't need to do (and will be able to avoid) any freelancing or part-time work apart from developing for Tor.

6. Will your project need more work and/or maintenance after the summer ends? What are the chances you will stick around and help out with that and other related projects?

As per usual, it will have to be looked after, to continue observing how it scales to many users/visitors. I very much plan to stay put where I am, though - at the very least, I plan to be able to continue providing needed maintenance for it. My overall plan is to stick around Tor and contribute to other things - I'd like to imagine I'll be around for a long time!

7. What is your ideal approach to keeping everybody informed of your progress, problems, and questions over the course of the project?

IRC is for me a great tool to keep myself and others in the loop. It's a great tool to quickly discuss problems, plans, and to continually stay in touch with people. I'm available over XMPP and email, too - I plan to give (at the very least) bi-weekly summaries, and do them more frequently if need be. Mailing lists (tor-dev) are good for longer discussions, reports and so on. I plan to keep everyone interested updated over tor-dev / email.

8. What school are you attending? What year are you, and what's your major/degree/focus? If you're part of a research group, which one?

Here's the part where it'll maybe sound somewhat random - I'm a second-year philosophy undergraduate at Vilnius University, Lithuania. The most relevant courses in terms of technology were in logic. I have certain academic interests in areas only tangentially related to software development & engineering, it would seem. As far as programming is concerned, I've been programming since 9th grade, and have more than five years of paid freelance programming experience.

9. Is there anything else we should know that will make us like your project more?

Oh man am I late to submit this application! I'm really looking forward to working with you folks, though!


This is the only GSoC project I'm applying to.

Contact:

kostas at jakeliunas period com

XMPP: phistopheles at jabber period org

IRC: wfn


NOTES:

EDITS:

End of proposal body.


Karsten Loesing May 4, 2013, 9:33 a.m.

Hi Kostas,

thanks for this nicely written proposal! Don't worry at all about the formatting, it's the content that counts.

While reading over your proposal I was wondering how to integrate your proposed tool into the metrics ecosystem. It would sure be good to replace the relay search application and ExoneraTor and provide a more general interface for those use cases. But maybe we can go one step further and make your tool the new Onionoo front-end application. The advantage would be that we don't get a new system to maintain, and that Onionoo clients like Atlas could use the new functionality with little effort.

Let me explain this in more detail. Onionoo has a quite simple search interface that allows you to search for relays or bridges that have been running in the past 7 days. Its summary and details replies contain data only from the most recent consensus and only from the most recent server descriptor. These limitations are there, you guessed it, to provide reasonable search performance. What your tool could do is remove both limitations by allowing searches for relays or bridges that have been running at any time in the past (since 2007), and by providing descriptor details for any given time in the past. The idea would be that your tool does all the heavy lifting, including parsing complex search strings, so that the user interface only needs to present results.

When you look at the Onionoo protocol, you'll note that it further provides aggregate bandwidth and weights information per relay. But those documents are not searchable, so your tool wouldn't have to worry about them. These documents could still be provided by the current Onionoo code and imported into your database, for example.

Does that idea make any sense to you?

Thanks! Karsten


Kostas Jakeliunas May 5, 2013, 10:35 a.m.

Hi Karsten,

thanks for the pointer to Onionoo, I looked it over and also looked at Atlas; it's very nicely done. I'm still getting my feet wet with the whole array of Tor programs/projects.

I was actually thinking that it would make sense (in terms of reusability of components, maintainability, and maybe stability) to separate the backend from the frontend in the new system, such that the backend would receive queries, do all the hard work as you say, and return a standardized JSON response. So in general, I think the idea would probably be to kind of go in that direction in any case.

> But maybe we can go one step further and make your tool the new Onionoo (https://onionoo.torproject.org/) front-end application.

Just to clarify: by frontend here you probably mean the whole implementation, i.e. a backend + browser frontend speaking an Onionoo-derived protocol? So the project would aim to:

Do you think the Onionoo-like protocol that the backend would speak could be made backwards-compatible with the current Onionoo spec? I suppose it could and that would be part of the plan: Atlas could basically use the new backend just like that.

Of course the new protocol would also need to have new capabilities, and the new client-side frontend would use them: queries would have to allow for more complexity, and the results/output would include more fields, etc. (e.g. for each descriptor there could be a pointer/URI to a list of network statuses in which the descriptor was present, and so on.) But perhaps it would be possible to make it current-Onionoo-compatible - would that be the idea?

Introducing the new incarnation of the protocol into the rest of the ecosystem makes sense and would be very nice. At the very least, joining ExoneraTor etc. into one system makes sense (again, here it would help to separate backend and frontend, so that different frontends (smaller tools, e.g. ones that only check whether a relay is an exit) could be 'plugged in'), but designing a uniform protocol would also be great. I'm trying to get familiar with all the current tools in place, I'm still feeling very noobish, but hopefully that will eventually change. :)

I've to run off to meet with relatives / Mother's day, so the reply was somewhat hasty - let's talk soon

K.


Kostas Jakeliunas May 5, 2013, 11:11 p.m.

Also:

> When you look at the Onionoo protocol, you'll note that it further provides aggregate bandwidth and weights information per relay. But those documents are not searchable, so your tool wouldn't have to worry about them. These documents could still be provided by the current Onionoo code and imported into your database, for example.

I haven't done any benchmarking for the existing bw/weight aggregation routines (they're done by the database itself as I see (PL/pgSQL), and if done right, should be efficient), but I suppose the bw aggregation process could be done by the same (new) backend; would have to get/precalculate the weight data for all old entries, but this would only need to be done once, if I'm not mixing things up.

I think the good news is that the backend part of the project (even if the backend+frontend end up working as a single application) can be worked on while Onionoo-related decisions (whether to make it into the new standard protocol(+implementation)) are being made: the backend will in any case receive complex queries as a simple string I suppose (so all / absolute majority of parsing logic in the backend), and it can communicate the results back to the frontend client-side part in standardized JSON.

(I've also looked at and ran the Compass tool which communicates with Onionoo, a very nice tool indeed, and could also be used to test out backwards-compatibility of the new onionoo-speaking backend were it to happen, etc.)

But let me know if you had something else in mind, or if I should try being more specific when I try to map the to-be solution onto constraints, so to speak. But in short, I do not see why we should not aim for a standardized protocol.


Karsten Loesing May 6, 2013, 8:19 a.m.

Hi Kostas,

> thanks for the pointer to Onionoo, I looked it over and also looked at Atlas; it's very nicely done. I'm still getting my feet wet with the whole array of Tor programs/projects.

Yeah, there are confusingly many programs in the Tor ecosystem. If you want to sit back for an hour and get an overview about them, there's Roger's and Jake's 29C3 talk:

The Tor software ecosystem [29c3[preview]]

> I was actually thinking that it would make sense (in terms of reusability of components, maintainability, and maybe stability) to separate the backend from the frontend in the new system, such that the backend would receive queries, do all the hard work as you say, and return a standardized JSON response. So in general, I think the idea would probably be to kind of go in that direction in any case.

Okay, great!

> Just to clarify: by frontend here you probably mean the whole implementation, i.e. a backend + browser frontend speaking an Onionoo-derived protocol? So the project would aim to:

Hmm, you're right, my use of the term "front-end" was rather confusing. But your description matches what I meant pretty well.

> Do you think the Onionoo-like protocol that the backend would speak could be made backwards-compatible with the current Onionoo spec? I suppose it could and that would be part of the plan: Atlas could basically use the new backend just like that.

Yes, making it backwards-compatible would be really useful. Having said that, this requirement shouldn't prevent you from achieving your original goal. So, if you're at a point where you'd have to spend a lot of time on making your protocol backward-compatible, please ignore that and stay focused. We can always make it backward-compatible later. Or we might decide to deprecate that feature, adapt Atlas and the other Onionoo clients, and take out the feature two months later.

> Of course the new protocol would also need to have new capabilities, and the new client-side frontend would use them: queries would have to allow for more complexity, and the results/output would include more fields, etc. (e.g. for each descriptor there could be a pointer/URI to a list of network statuses in which the descriptor was present, and so on.) But perhaps it would be possible to make it current-Onionoo-compatible - would that be the idea?

Yes, that would be the idea. Adding new fields is possible, though we'll always have to keep document size in mind; if you download a details document for all running relays, size matters. We could also add new document types. In general, we should try to keep the protocol as simple as possible. But yes, that would be the idea.

> Introducing the new incarnation of the protocol into the rest of the ecosystem makes sense and would be very nice. At the very least, joining ExoneraTor etc. into one system makes sense (again, here it would help to separate backend and frontend, so that different frontends (smaller tools, e.g. ones that only check whether a relay is an exit) could be 'plugged in'), but designing a uniform protocol would also be great. I'm trying to get familiar with all the current tools in place, I'm still feeling very noobish, but hopefully that will eventually change. :)

Yes, making ExoneraTor a simple web front-end of Onionoo would be really cool!

> I've to run off to meet with relatives / Mother's day, so the reply was somewhat hasty - let's talk soon

Sure!

I'm also going to respond to your other reply here:

> I haven't done any benchmarking for the existing bw/weight aggregation routines (they're done by the database itself as I see (PL/pgSQL), and if done right, should be efficient), but I suppose the bw aggregation process could be done by the same (new) backend; would have to get/precalculate the weight data for all old entries, but this would only need to be done once, if I'm not mixing things up.

Onionoo's bandwidth and weights calculation only uses flat files. What you refer to is the metrics website that indeed uses PostgreSQL's arrays to aggregate bandwidth data. But anyway, we should probably keep this out of scope in order not to risk failing the original goal of making the search scale for years of data. Your project could simply ignore bandwidth and weights documents or let the original Onionoo answer requests for those.

> I think the good news is that the backend part of the project (even if the backend+frontend end up working as a single application) can be worked on while Onionoo-related decisions (whether to make it into the new standard protocol(+implementation)) are being made: the backend will in any case receive complex queries as a simple string I suppose (so all / absolute majority of parsing logic in the backend), and it can communicate the results back to the frontend client-side part in standardized JSON.

Agreed. Maybe we can extend the current search parameter to do complex queries and stay backward-compatible.

> (I've also looked at and ran the Compass tool which communicates with Onionoo, a very nice tool indeed, and could also be used to test out backwards-compatibility of the new onionoo-speaking backend were it to happen, etc.)

Sounds good.

> But let me know if you had something else in mind, or if I should try being more specific when I try to map the to-be solution onto constraints, so to speak. But in short, I do not see why we should not aim for a standardized protocol.

This is what I had in mind. Thanks for your very detailed replies!

Best, Karsten


Kostas Jakeliunas May 6, 2013, 12:58 p.m.

> Onionoo's bandwidth and weights calculation only uses flat files. What you refer to is the metrics website that indeed uses PostgreSQL's arrays to aggregate bandwidth data.

Ah, my bad, I hadn't looked at it closely. But yeah, I will keep in mind that it is possible to 'outsource' this data and not care about it for now.

Thanks for the video link!

I will now try to concentrate on the pre-May-27th list of tasks, in particular figuring out all the descriptor data fields etc. that will be needed in the future, wrapping all DB import and query commands in timeit, and so on.

Thanks for all your replies!


Karsten Loesing May 7, 2013, 10:32 a.m.

> Onionoo's bandwidth and weights calculation only uses flat files. What you refer to is the metrics website that indeed uses PostgreSQL's arrays to aggregate bandwidth data.

> Ah, my bad, I hadn't looked at it closely. But yeah, I will keep in mind that it is possible to 'outsource' this data and not care about it for now.

Actually, you're not to blame here. I need to document these things better. I just made a start and put up a diagram showing how the various metrics tools fit together: https://metrics.torproject.org/tools.html -- not really related to your proposal, but I thought it can't hurt to mention it here.