The next incarnation of the Internet will liberate both the content
and the CPU cycles from the actual hardware that performs storage
and computation. That is, both the data and the compute power
will be "virtualized" even further away from the hardware its been
traditionally bound to. The popular P2P file trading systems
already hint at what distributed storage might look like.
Efforts such as ZeroInstall show that one might be able to
run operating systems without actually having to 'install' them,
and 'Stateless Linux' shows how one might be able to access one's
"desktop" from any handily available keyboard and monitor.
Distributed computing efforts such as SETI@Home hint at how
CPU cycles can be extracted from the vast repository of idle
computers attached to the net.
Keeping this article up-to-date is difficult. A lot has happened
since the first draft of this article: one-fourth of "the next few
decades" has already gone by. Some links below may be dead,
and some statements may appear quaint. See also the 2001
draft of this paper.
- Eternity Service
- Ross Anderson described the Eternity Service
as a distributed filesystem that could survive damage to its storage
infrastructure, by analogy to how the Internet can survive
damage to its network.
The Eternity Service prototype and related concepts, such as
FreeNet, eMule and GriPhiN, all provide ways of publishing
information on distributed networks. Each technology enables a
user's home computer to participate in a broader network
to supply distributed storage. If you think about it, this is
very different from the de facto Internet of today, where web
pages are firmly rooted to the web servers that serve them up.
If you are reading this web page near the turn of the century,
chances are good that your browser fetched it off of the web
server I run at home. Chances are also good that you got it off
some caching proxy. I know my ISP runs one. The caching proxy
stores a copy of the web page ... for a while. Not very long.
If my server dies, chances are you won't see my page either.
The caching proxy helps with bandwidth costs for my Internet
provider, but doesn't help me much.
...
But I know that my life
would be a lot better if I didn't actually have to be the sysadmin for the
server I run at home (I hate replacing broken disk drives, etc.).
I would like it much better if I could just
publish this page, period, and not worry about maintaining the server
or about doing backups.
Just publish it on FreeNet or Publius.
If everyone's home computer were automatically a node/server on
Publius, and if Publius required zero system administration,
then I, as a writer/publisher, would be very happy. I could just
write these thoughts, and not worry about the computing
infrastructure to make sure that you can read this.
We conclude that the eternity service is an important component
of Gelernter's Manifesto, even though he sadly fails to name it as an
important, contributing technology.
A crucial component of this idea is that of 'zero administration':
the ultimate system must be so simple that any PC connected to the
net could become a node, a part of the distributed storage infrastructure.
The owner of a PC (e.g. my mom) should not have to give it much
thought: if it's hooked up to the Internet, it's a part of the system.
Aspects:
- What type of storage is it focused on: public, private, or
commercial? Each has different characteristics:
I want my private storage to be accessible from
anywhere, to endure even if the network/servers are damaged.
But I want it to remain private, to stay in my possession.
I want my public writings to be robust against network
damage as well, and I also want them to be hard-to-censor.
I might want to be able to engage in anonymous speech,
so that I could, for example, blast the sitting president
(or the RIAA) without fear of getting in trouble for it.
The third, "commercial storage" would be a system that allowed
me to access commercial content from anywhere, for a fee.
This is the system that the RIAA is failing to build,
failing to support: a way to get at the music that I paid for,
wherever I might be.
- Does it provide eternity? Will a file get stored forever,
or can it vanish? There are two types of eternity: protection
against censorship, and protection against apathy.
- Censorship Protection: content cannot be (easily)
removed by authorities objecting to the content,
e.g. political speech, state secrets, bomb-making plans.
- Apathy Protection: no one cares about the content at this
time, and thus, it will slowly get purged from various
caches and stores until the last copy disappears forever.
Note that one can implement censorship protection, and still
not get apathy protection: FreeNet works like this.
One can also implement a system that is censorable (so
that the sysadmins can explicitly purge spam), and still
get apathy protection: as long as a file is not actively
hunted down and terminated, it will stick around forever.
These are orthogonal concepts.
- Provides anonymity protections to poster, if desired.
This would allow whistle-blowers and political rabble-rousers
to remain anonymous without fear of intimidation/reprisal.
This would also allow posters of spam, viruses and other
criminal content to remain anonymous and beyond the reach
of the law.
- Allows censorship of content by editor or network operator.
This would allow police authorities to remove child pornography
or other objectionable content. This would also allow copyright
holders or their agents to remove content. This also allows
the removal of old, out-of-date content and a general cleanup
of e.g. spam or viruses that have clogged the system.
- Identifies the downloader. This can potentially enable payment
for downloads, or otherwise hook into a subscription service.
- Provides file download popularity statistics. These are of
interest for a variety of reasons, both legitimate and nefarious.
- Appears to the operating system as a filesystem. Thus, for
example, I could put a binary into it, and then run that binary
on my desktop. ZeroInstall tries to do
this.
- Versioning/Version Control (Gelernter's "Lifestreams").
Can I get earlier versions of my file? Is my file tagged
with date meta-info? Can I get an earlier draft of this
paper?
- Support for extended file attributes; storage/serving of
file meta-data along with the file. Can I mark up the file
with info that is important to me, such as where I was
(geographically) when I last looked at it? Can I categorize
it in many different ways? E.g. if it's a hospital bill,
can I put it in my "hospital" folder, as well as my "finances"
folder? Note that folders do not need to be literally folders:
they could in fact be fancy search queries: as long as the
object responds to the query, it's a part of that folder.
This is how a given file might be in many folders at once.
(A small sketch of such query-based folders appears after this list.)
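To make the "folders are really queries" idea concrete, here is a
minimal sketch, assuming a Linux filesystem with user extended
attributes enabled; the attribute name "user.categories" is an
invented convention for illustration, not an existing standard:

    import os

    def tag(path, *categories):
        """Attach category tags to a file as an extended attribute."""
        os.setxattr(path, b"user.categories", ",".join(categories).encode())

    def folder(query, paths):
        """A 'folder' is just a query: every file whose tags match belongs to it."""
        for path in paths:
            try:
                tags = os.getxattr(path, b"user.categories").decode().split(",")
            except OSError:
                continue                  # file carries no tags at all
            if query in tags:
                yield path

    # The same bill shows up in both "folders" at once.
    tag("hospital-bill.pdf", "hospital", "finances")
    print(list(folder("hospital", ["hospital-bill.pdf"])))
    print(list(folder("finances", ["hospital-bill.pdf"])))

A real system would of course index the tags rather than scanning
every file, but the point stands: folder membership is the result of
a query, not a location on disk.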
See also:
- Search and Query
- Gelernter goes on at length about content addressable memory,
about how it should be possible to retrieve information from
the Eternity Service based on its content, and not based on its file
name. Search is important not only for finding a
needle-in-the-haystack on the Internet, but also for finding
the mislaid file on one's own computer. In a different but still
important sense, querying is used, for example, to report your
bank balance, out of the sea of other transactions and investments
and accounts one may have. The importance and centrality
of search and data sharing for general application development
is further discussed in the
Why-QOF
web page.
What are the pieces that are needed, and available?
- Natural language query parsers.
Gnome Storage
is looking to provide natural language query for desktop
applications.
- Distributed databases and distributed query. DNS (the Domain Name
System) is a distributed database for performing IP address lookup.
Unfortunately, there is no straightforward generalization to
arbitrary data. LDAP (the lightweight directory access protocol)
in theory can handle more generic data, but it remains difficult
to set up and use.
- My personal entry on this chart is
QOF, the goal of which
is to make it trivial for programmers to work with persistent,
globally-unique, versionable, queryable OOP-type 'objects'.
- Massively scalable search already has a proof-of-concept with
Google.
Curiously, though, the Google page rank is the result
of a carefully hand-tuned and highly proprietary algorithm.
This indicates that search by content alone is not enough;
search-by-content has to be ranked to provide results that
are meaningful to users. And it seems that it's the ranking,
and not the search, that is the hard part.
- Google focuses on free-text search. If you want prices,
you need Froogle (http://www.google.com/froogle).
Google is useless for binaries: if you want binary content,
you go to specialized sites:
rpmfind.net to locate RPM's,
tucows to locate shareware, or mp3.com or scour.net to find audiovisual
content. Each of these systems is appallingly poor at what it does:
the RPM Spec file is used to build the rpmfind directories, but doesn't
really contain adequate information.
The mp3 and shareware sites are essentially built by hand: that
part of the world doesn't even have the concept of an LSM to classify
and describe content! (LSM is a machine-readable format used by
metalab.unc.edu to classify the content of packages in its software
repository.)
- Searchable meta-data, and automatic time and (geographic)
place tagging of a file when it's created, viewed and edited.
If I created a file while I was drinking coffee in a
coffee-house, I want it tagged, so that I can find it later
when I go searching for the words "coffee house, 2 months ago".
If I happened to create three versions of that file,
I'd like to be able to call up each: there should have been
(semi-)automatic file versioning, a "continuous backup"
of sorts. A Wayback Machine for my personal data.
(A tiny sketch of this kind of continuous versioning follows.)
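As a minimal sketch of the "continuous backup" idea: every time a
file is saved, squirrel away a timestamped, content-addressed copy,
so that any earlier draft can be recalled later. The archive layout
and function names here are invented for illustration, not taken
from any existing tool:

    import hashlib, json, os, shutil, time

    ARCHIVE = os.path.expanduser("~/.snapshots")   # hypothetical archive location

    def snapshot(path, place=None):
        """Record a timestamped, content-addressed copy of 'path'."""
        os.makedirs(ARCHIVE, exist_ok=True)
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        copy = os.path.join(ARCHIVE, digest)
        if not os.path.exists(copy):               # identical content is stored once
            shutil.copy2(path, copy)
        entry = {"file": os.path.abspath(path), "sha1": digest,
                 "time": time.time(), "place": place}
        with open(os.path.join(ARCHIVE, "log.json"), "a") as log:
            log.write(json.dumps(entry) + "\n")

    def versions(path):
        """List every recorded version of 'path', oldest first."""
        with open(os.path.join(ARCHIVE, "log.json")) as log:
            entries = [json.loads(line) for line in log]
        return [e for e in entries if e["file"] == os.path.abspath(path)]

A real implementation would hook into the filesystem (e.g. via
inotify) instead of being called by hand, and would record far richer
meta-data than a single "place" string.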
Here are some additional references:
- gPulp provides a framework
for distributed searching. Derived from Gnutella Next Generation.
See the
Wired article. gPulp is run by a European consortium/standards body
that costs real money to join. They seem to be working on specs, not
implementations. The main spec is a P2P 'data discovery
protocol'.
- LSM's, Name Spaces and Self-Describing Objects
- There is another way to look at the problem of searching and finding
an object based on its content, rather than its 'unique identifier'.
Filenames/filepaths/URL's are essentially unique identifiers that
locate an object. Unfortunately, they only reference it, and maybe
provide only the slimmest of additional data. For example, in Unix,
the file system only provides the filename, owner, read/write
privileges, modification/access times. By looking at the file
suffix one can guess the mime-type, maybe: .txt .ps .doc .texi .html
.exe and so on. File 'magic' can also help guess at the content.
URL's don't even provide that much, although the HTTP/1.1 specification
describes a number of optional header fields that provide similar
information. See, for example,
Towards the Anti-Mac
or The Anti-Mac
Interface for some discussion of this problem.
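To see just how thin that per-file information is, here is a minimal
sketch that collects everything a stock Unix filesystem (plus
suffix-based guessing) will tell you about a file; note that nothing
in the result says what the file is actually about:

    import mimetypes, os, stat, time

    def describe(path):
        """All the meta-data a classic Unix filesystem really gives you."""
        st = os.stat(path)
        return {
            "name": path,
            "owner_uid": st.st_uid,
            "mode": stat.filemode(st.st_mode),             # e.g. '-rw-r--r--'
            "modified": time.ctime(st.st_mtime),
            "guessed_type": mimetypes.guess_type(path)[0], # from the suffix only
        }

    print(describe("report.doc"))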
What is really needed is an infrastructure for more closely defining
the content of a 'file' in both machine-readable and human-understandable
terms. At the very least, there is the concept of mime-types. Web-page
designers can use the <meta> tags to define some additional
info about an object. With the growth in popularity of XML, there is some
hope that the XML DTD's can be used to understand the type of object.
There is the semi-forgotten, semi-ignored concept of 'object naming'
and 'object trading brokers' as defined by CORBA, which attempt to match
object requests to any object that might fill that request, rather
than to an individually named object. Finally, there are sporadic attempts
to classify content: LSM's used by metalab.unc.edu, RPM Spec files used by
rufus.w3.org, deb's used by the Debian distribution. MP3's have an
extremely poor content description mechanism: one can store the name of the
artist, the title, the year and the genre. But these are isolated examples
with no unifying structure.
Unfortunately, Gelernter is right: there is no all-encompassing object
description framework or proposal in existence that can fill these needs.
We need something more than a mime-type, and something less than a free-text
search engine, to help describe and locate an object. The system must be
simple enough to use everywhere: one might desire to build it into the
filesystem, in the same way that 'owner' and 'modification date' are file
attributes. It will have to become a part of the 'finder', such as the
Apple Macintosh Finder or
Nautilus, the
Eazel finder. It must be general enough
to describe non-ASCII files, so that search engines (such as Google) could
perform intelligent searches for binary content. Today, Google can
neither classify nor return content based on LSM's, RPM's, deb's, or the MP3
artist/title/genre fields.
- distributed.net and SETI@home
- distributed.net
runs a distributed RC5-64 cracking / Golomb ruler effort.
Seti@Home
runs a distributed search of radio telescope data for interesting
sources of extraterrestrial electromagnetic data. Both of these
efforts are quite popular with the general public: they have built
specialized clients/screen-savers that have chewed through a
quadrillion trillion CPU cycles. Anyone who is happy running
a distributed.net client, or a seti@home client might be happy
running a generic client for performing massively parallel
computations. Why limit ourselves to SETI and cypher cracking?
Any problem that requires lots of CPU cycles to solve could,
in theory, benefit from this kind of distributed computing.
These high-CPU-usage problems need not be scientific
in nature.
A good example of a non-science high-cpu-cycle application is
the animation/special effects rendering needed for Hollywood
movies.
The problem may not even be commercial or require that
many CPU cycles: distributed gaming servers,
whether role-playing games, shoot-em-ups, or civilization/war
games currently require dedicated servers with good bandwidth
connections, administered by knowledgeable sysadmins.
The gotcha is that there is currently no distributed computing
client that is 'foolproof': providing generic services,
easy to install and operate, and hard for a
cracker/hacker to subvert. There are no easy programming APIs.
(This may be changing now; see BOINC, below.)
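As a minimal sketch, assuming a hypothetical coordinator at
work.example.org that hands out self-contained work units over HTTP
(the URL and endpoints are invented; the point is only the
fetch-compute-report loop that distributed.net, SETI@Home and BOINC
clients all share):

    import json, time, urllib.request

    COORDINATOR = "http://work.example.org"        # hypothetical project server

    def compute(unit):
        """Stand-in for the real science: burn CPU on the downloaded work unit."""
        return sum(x * x for x in unit["numbers"])

    while True:
        # 1. Fetch a work unit from the coordinator.
        with urllib.request.urlopen(COORDINATOR + "/get_work") as reply:
            unit = json.load(reply)
        # 2. Chew through it locally, using otherwise-idle CPU cycles.
        result = compute(unit)
        # 3. Report the result back and ask for more.
        report = json.dumps({"id": unit["id"], "result": result}).encode()
        urllib.request.urlopen(COORDINATOR + "/put_result", data=report)
        time.sleep(1)                              # be polite to the server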
Other clients:
- BOINC, the software underlying SETI@Home,
see listing below.
- Xenoservers, reference below.
- Climate Dynamics at RAL.
- United Devices
Purely commercial, totally proprietary.
- PVM & MPI are older technologies, optimized for cluster
and parallel computing. They are rather heavyweight,
demanding of bandwidth, and unable to deal with clients
that come and go (unreliable clients).
- Folding@Home
is attempting to solve protein folding problems with pure-custom
software.
- Popular Power attempted to
pay for CPU cycles, as did
Process Tree Network. Both are now defunct.
- Cosm attempted to define distributed
computing APIs. Defunct.
- ERights and Sandbox Applets
- Java still seems to be a technology waiting to fulfill its promise.
However, it (and a number of other interpreters) does have one
tantalizing concept built in: the sandbox, the chroot jail,
the honeypot. Run an unsafe program in the chrooted jail,
and we pretty much don't care what the program does, as long
as we bothered to put some caps on its CPU and disk usage.
Let it go berserk. But unfortunately, the chroot jail is
a sysadmin concept that takes brains and effort to install.
It's not something that your average Red Hat or Debian
install script sets up. Hell, we have to chroot named and
httpd and dnetc and so on by hand. We are still a long way
off from being able to publish a storage and CPU-cycle playground
on our personal computers that others could make use of as they
wished. It is not until these sorts of trust and
erights systems are set up
that the kind of computing that Gelernter talks about is possible.
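To show how much manual plumbing this takes today, here is a minimal
sketch of the "caps on CPU and disk usage" idea using nothing but
stock Unix facilities (chroot plus resource limits). It must be run
as root, the jail directory is assumed to already contain whatever
binaries the command needs, and it is nowhere near a real security
boundary:

    import os, resource, subprocess

    def run_confined(cmd, jail_dir, cpu_seconds=60, max_file_mb=10):
        """Run an untrusted command in a chroot jail with CPU and file-size caps."""
        def confine():
            os.chroot(jail_dir)                    # filesystem jail
            os.chdir("/")
            # Cap total CPU time and the largest file the process may create.
            resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
            cap = max_file_mb * 1024 * 1024
            resource.setrlimit(resource.RLIMIT_FSIZE, (cap, cap))
            os.setgid(65534)                       # drop privileges to 'nobody'
            os.setuid(65534)
        return subprocess.run(cmd, preexec_fn=confine)

    # e.g. run_confined(["/bin/busybox", "sh", "-c", "echo berserk"], "/srv/jail")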
- Streaming Media & Broadcast: Bandwidth Matters
- The naivest promise of 'digital convergence' is that soon, you'll
watch TV on your computer. Or something like that. There are
a multitude of blockers for the roll-out of these kinds of services,
and one of them is bandwidth strain put on the broadcaster and the
intervening Internet backbone. Given the way that people
(or rather, operating systems and software applications) use the
Internet today, if a thousand people want to listen to or view
a streaming media broadcast, then the server must send out a
thousand duplicate, identical streams. This puts a huge burden
on the server as well as nearby routers.
The traditional proposed solution
for this problem is MBONE, but MBONE has yet to see widespread
deployment. (MBONE is the Internet 'multicast backbone' which
allows a broadcast server to serve up one packet, and then have
Internet routers make copies of the packet as it gets sent to
receiving clients. Clients receive packets by 'subscribing'
to 'channels'.)
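For concreteness, "subscribing to a channel" looks roughly like this
at the socket level. A minimal receive-only sketch; the multicast
group and port are arbitrary examples, since any real broadcast would
publish its own:

    import socket
    import struct

    GROUP = "239.1.2.3"   # example multicast "channel"
    PORT = 5004

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    # Subscribe to the group; multicast-aware routers replicate packets toward us.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, sender = sock.recvfrom(1500)
        print(len(data), "bytes from", sender)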
There are two other approaches to distributing the bandwidth
load: ephemeral file server and distributed
streaming. Both leverage the idea that if
someone else is receiving the same data that you want,
then they can rebroadcast the data to you. The difference
between these two is whether you get the data in order, and
possibly whether you keep a permanent copy of it.
In either case, you get your data in "chunks" or pieces,
rather than as a whole.
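The "chunks" themselves are simple: the file is cut into fixed-size
pieces and each piece is identified by a hash, so that peers can
advertise, request and verify individual pieces independently of
their order. A minimal sketch (the 256KiB piece size is just a
typical choice, not a requirement):

    import hashlib

    PIECE_SIZE = 256 * 1024          # 256 KiB pieces

    def make_manifest(path):
        """Split a file into pieces; return a list of (index, sha1) pairs.
        Peers exchange pieces by index and verify each against its hash."""
        manifest = []
        with open(path, "rb") as f:
            index = 0
            while True:
                piece = f.read(PIECE_SIZE)
                if not piece:
                    break
                manifest.append((index, hashlib.sha1(piece).hexdigest()))
                index += 1
        return manifest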
For streamed media, e.g. a radio broadcast, it is assumed that
you are listening as it is broadcast, rather than waiting for
a download to "finish", and then listening to it. For
streamed media, the data must arrive in order, and must arrive in a
timely manner. I don't know of any examples at this time.
An ephemeral file server, by contrast, can (and usually will)
deliver data out-of-order (sometimes called "scatter-gather").
A good example might be
BitTorrent, which only shares the
file that you are currently downloading, instead of sharing all
of your files. It is "ephemeral" in the sense that sharing usually
stops shortly after download completes. BitTorrent explicitly
delivers chunks of the data out of order: the goal is to
make sure that everyone has something to share, rather than,
e.g. everyone having the first half but not the second half of
a file. "Ephemeral" does not mean short-term: torrents can
(and do) exist for months: they exist as long as a file is
popular, and as long as at least one client is up on the net.
Equally interesting are the things that BitTorrent doesn't do
or guarantee: for starters, there is no 'eternity':
if there are no clients offering the file, it is effectively gone.
BitTorrent keeps neither a master index of files offered
nor even a searchable index of offered torrents. One must
locate the torrent one wants in some other fashion: e.g. through
web pages or traditional search engines. In the same vein,
it's not a file system: there is no hierarchy of files that are
kept or can be browsed. The goal of BitTorrent really is to
balance the network load in a distributed fashion.
To summarize the technical points:
- The search problem: Can the user browse a list of available
content? Can the user search for particular content?
(BitTorrent relies on web pages and web search engines to
solve these problems.)
- The peer discovery problem: Once a particular bit of
content has been identified, how does a client discover
the other clients that are ready to share?
- BitTorrent and PDTP solve this problem by having a
tracker for each offered file. Clients register with the tracker
and tell it what chunks of the file they already have;
the tracker responds with a list of clients that might
have the chunks we don't yet have. Clients keep the
tracker up-to-date as the download proceeds. Conceptually,
there is one tracker per offered file. Note, however,
that the tracker is vulnerable: if it goes down, new
clients are shut out. (A toy tracker sketch appears after
this list of technical points.)
- Swarmcast uses a Forward-Error
Correction (FEC) algorithm to create packets that occupy
a data space that is orders of magnitude larger than the
offered file. Thus, the receiver can reconstruct the
whole file after having received only a very small
portion of the total packets in the space. The
use of FEC encoding eliminates the need for a
chunk tracker: all packets in the data space are
"guaranteed" to contain data that the client does not
yet have. This is by encoding in a very large data
space: the probability that the client receives data
that it already has is equal to the ratio of the
file size to the data space size; this ratio can be
made arbitrarily small. For example, a 10MB file encoded
into a 10GB packet space gives a duplicate probability of
only about 0.1%. (It's kind of like a hologram:
you need only some of it to reproduce the whole.)
The downside to this approach is that it is CPU-intensive,
and it can inflate the total number of bytes that need
to be delivered by a fair amount. The upside is that it
can roll encryption and encoding into one.
- The streaming problem. For streaming to work, data must
be delivered in order. (BitTorrent doesn't do that.)
- Bandwidth allocation/balancing between peers. BitTorrent
tries to load-balance by using a tit-for-tat strategy:
a client will only offer chunks to those clients that
are sending chunks to it. For streaming media, this
strategy clearly can't work: sharing must be transitive,
not reciprocal.
- The 'dropped frames' problem: The viewer/receiver of a
real-time stream must be able to get data in a timely
manner, so that they can watch their show/movie without
interruption. The viewer is potentially willing to trade
disproportionate amounts of upload bandwidth in exchange
for guaranteed download bandwidth. The receiver is
mostly interested in having multiple redundant streaming
servers handy.
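As promised above, a toy sketch of the tracker bookkeeping: one
tracker per file, clients announce which chunk indices they hold, and
the tracker answers with peers holding chunks the asking client still
lacks. In-memory and single-process only, purely to illustrate the
idea (a real tracker speaks a network protocol, of course):

    class Tracker:
        """One tracker per offered file: maps each peer to the chunks it holds."""
        def __init__(self, total_chunks):
            self.total_chunks = total_chunks
            self.peers = {}                   # peer address -> set of chunk indices

        def announce(self, peer, chunks):
            """A client registers (or refreshes) the set of chunks it already has."""
            self.peers[peer] = set(chunks)

        def find_sources(self, peer):
            """Return peers holding at least one chunk that 'peer' is still missing."""
            missing = set(range(self.total_chunks)) - self.peers.get(peer, set())
            return [other for other, have in self.peers.items()
                    if other != peer and have & missing]

    # Example: peer B is missing chunk 2, which only peer A has.
    t = Tracker(total_chunks=3)
    t.announce("10.0.0.1:6881", [0, 1, 2])    # peer A has everything
    t.announce("10.0.0.2:6881", [0, 1])       # peer B still needs chunk 2
    print(t.find_sources("10.0.0.2:6881"))    # -> ['10.0.0.1:6881']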
I am not yet aware of any generally available streaming-media
reflectors, other than those based on MBONE.
- Swarmcast, now defunct, may have been
the first to use a scatter-gather
type algorithm for delivering a file by chopping it up into
chunks. (Swarmcast predates BitTorrent.) GPL license.
- BitTorrent, described below,
is an 'ephemeral fileserver', serving up files in a
distributed fashion for the few moments that they are
popular and being actively downloaded by others.
- PDTP is a distributed file system,
with hierarchical directories, and it also offers network
load balancing through distributed file-piece delivery.
- The Internet for the Rest of Us
- To understand the future, it is sometimes useful to look at the
past. Remember UUCP? It used to tie the Unix world together, as
did BITNET for the VAXes and Crays, or the VM network for
mainframes. They were all obsoleted by the IP protocols of
the Internet. But for a long time, they lived side-by-side,
even attached to the Internet through gateways.
The ideas that powered these
networks were subsumed into, became a part of the Internet:
The King is Dead, Long Live the King! The spread of the types
of technologies that Gelernter talks about will be evolutionary,
not revolutionary.
Similarly, remember 'The Computer for the Rest of Us'?
Well, before the web exploded, Marc Andreessen used to talk about
'The Internet for the Rest of Us'. Clearly, some GUI slapped
on the Internet would make it far more palatable, as opposed to the
'command-line' of telnet and ftp. But a web browser is not just
a pretty GUI slapped on telnet or ftp, and if it had been, the
WWW still wouldn't exist (what happened to 'gopher'? Simple:
no pictures, no 'home pages'). The success of the WWW
needed a new, simple, easy technology: HTTP and hyperlinks, to
make it go. The original HTTP and HTML were dirt-simple, and that
was half the power of the early Internet. Without this simplicity
and ease of use, the net wouldn't have happened.
What about 'the rest of us'? It wasn't just technology that made
the Internet explode, it was what the technology could do. It
allowed (almost) anyone to publish anything at a tiny fraction
of the cost of traditional print/radio/TV publishing. It gave
power to the people. It was a fundamentally democratic movement
that was inclusive, that allowed anyone to participate, not just
the rich, the powerful, or the members of traditional media
establishments. In a bizarrely different
way, it is these same forces that power music file trading:
even if the
music publishing industry hadn't fallen asleep at the wheel, it
would still be democratization that drives file traders. Rather than listening
to what the music industry wants me to listen to, I can finally listen
to what I want to listen to. At long last, I am able to match
the artist to the artist's work, rather than listening to the radio and
scratching my head: 'gee, I liked that song, but what the hell was
the name of the artist?' Before Napster, I didn't know what
music CD to buy, even when I wanted to buy one. I wasn't hip enough to
have friends who knew the names of the cool bands, the CDs that
were worth buying. Now, finally, I know the names of the bands
that I like. Napster gave control back to the man in the street.
Similarly, the final distributed storage/computation infrastructure
will have to address similar populist goals: it must be inclusive,
not exclusive. Everyone must be able to participate. It must
be for 'the rest of us'.
- Commercialization
- Like the early days of the net, the work of volunteers drove the
phenomenon. Only later did it become commercialized. Unlike then,
we currently have a Free Software community that is quite conscious
of its own existence. It's a more powerful force. Once the
basic infrastructure gets built, large companies will come
to make use of and control that infrastructure. But meanwhile,
we, as engineers, can build it.
I guess the upshot of this little diatribe is that Gelernter talks
about his changes in a revolutionary manner, leading us to believe
that the very concept of an operating system will have to be
re-invented. He is wrong. The very concept of an operating system
*will* be reinvented, someday. In the meanwhile, we have a perfectly
evolutionary path from here to there, based not only on present
technologies and concepts, but, furthermore, based on the principles
of free software.