The next incarnation of the Internet will liberate both the content
and the CPU cycles from the actual hardware that performs storage
and computation. That is, both the data and the compute power
will be "virtualized" even further away from the hardware its been
traditionally bound to. The popular P2P file trading systems
already hint at what distributed storage might look like.
Efforts such as ZeroInstall show that one might be able to
run operating systems without actually having to 'install' them,
and 'Stateless Linux' shows how one might be able to access one's
"desktop" from any handily available keyboard and monitor.
Distributed computing efforts such as SETI@Home hint at how
CPU cycles can be extracted from the vast repository of idle
computers attached to the net.
Keeping this article up-to-date is difficult. A lot has happened
since the first draft of this article: one-fourth of "the next few
decades" has already gone by. Some links below may be dead,
and some statements may appear quaint. See also the 2001
draft of this paper.
- Eternity Service
- Ross Anderson described the Eternity Service
as a distributed filesystem that could survive damage to its storage
infrastructure, by analogy to how the Internet can survive
damage to its network.
The Eternity Service prototype and related concepts, such as
FreeNet, eMule and GriPhiN, all provide ways of publishing
information on distributed networks. Each technology enables a
user's home computer to participate in a broader network
to supply distributed storage. If you think about it, this is
very different from the de facto Internet of today, where web
pages are firmly rooted to the web servers that serve them up.
If you are reading this web page near the turn of the century,
chances are good that your browser fetched it off of the web
server I run at home. Chances are also good that you got it off
some caching proxy. I know my ISP runs one. The caching proxy
stores a copy of the web page ... for a while. Not very long.
If my server dies, chances are you won't see my page either.
The caching proxy helps with bandwidth costs for my Internet
provider, but doesn't help me much.
...
But I know that my life
would be a lot better if I didn't actually have to be the sysadmin for the
server I run at home (I hate replacing broken disk drives, etc.).
I would like it much better if I could just
publish this page, period, and not worry about maintaining the server
or about doing backups.
Just publish it on FreeNet or Publius.
If everyone's home computer were automatically a node/server on
Publius, and if Publius required zero system administration,
then I, as a writer/publisher, would be very happy. I could just
write these thoughts, and not worry about the computing
infrastructure to make sure that you can read this.
We conclude that the eternity service is an important component
of Gelernter's Manifesto, even though he sadly fails to name it as an
important, contributing technology.
A crucial component of this idea is that of 'zero administration':
the ultimate system must be so simple that any PC connected to the
net could become a node, a part of the distributed storage infrastructure.
The owner of a PC (e.g. my mom) should not have to give it much
thought: if it's hooked up to the Internet, it's a part of the system.
Aspects:
- What type of storage is it focused on: public, private, or
commercial? Each has different characteristics:
I want my private storage to be accessible from
anywhere, to endure even if the network/servers are damaged.
But I want it to remain private, to stay in my possession.
I want my public writings to be robust against network
damage as well, and I also want them to be hard-to-censor.
I might want to be able to engage in anonymous speech,
so that I could, for example, blast the sitting president
(or the RIAA) without fear of getting in trouble for it.
The third, "commercial storage" would be a system that allowed
me to access commercial content from anywhere, for a fee.
This is the system that the RIAA is failing to build,
failing to support: a way to get at the music that I paid for,
wherever I might be.
- Does it provide eternity? Will a file get stored forever,
or can it vanish? There are two types of eternity: protection
against censorship, and protection against apathy.
- Censorship Protection: content cannot be (easily)
removed by authorities objecting to the content,
e.g. political speech, state secrets, bomb-making plans.
- Apathy Protection: no one cares about the content at this
time, and thus, it will slowly get purged from various
caches and stores until the last copy disappears forever.
Note that one can implement censorship protection, and still
not get apathy protection: FreeNet works like this.
One can also implement a system that is censorable (so
that the sysadmins can explicitly purge spam), and still
get apathy protection: as long as a file is not actively
hunted down and terminated, it will stick around forever.
These are orthogonal concepts.
- Provides anonymity protections to poster, if desired.
This would allow whistle-blowers and political rabble-rousers
to remain anonymous without fear of intimidation/reprisal.
This would also allow posters of spam, viruses and other
criminal content to remain anonymous and beyond the reach
of the law.
- Allows censorship of content by editor or network operator.
This would allow police authorities to remove child pornography
or other objectionable content. This would also allow copyright
holders or their agents to remove content. This also allows
the removal of old, out-of-date content and a general cleanup
of e.g. spam or viruses that have clogged the system.
- Identifies the downloader. This can potentially enable payment
for downloads, or otherwise hook into a subscription service.
- Provides file download popularity statistics. These are of
interest for a variety of reasons, both legitimate and nefarious.
- Appears to the operating system as a filesystem. Thus, for
example, I could put a binary into it, and then run that binary
on my desktop. ZeroInstall tries to do
this.
- Versioning/Version Control (Gelernter's "Lifestreams").
Can I get earlier versions of my file? Is my file tagged
with date meta-info? Can I get an earlier draft of this
paper?
- Support for extended file attributes; storage/serving of
file meta-data along with the file. Can I mark up the file
with info that is important to me, such as where I was
(geographically) when I last looked at it? Can I categorize
it in many different ways? E.g. if it's a hospital bill,
can I put it in my "hospital" folder, as well as my "finances"
folder? Note that folders do not need to be literally folders:
they could in fact be fancy search queries: as long as the
object responds to the query, it's a part of that folder.
This is how a given file might be in many folders at once.
(A small sketch of such query-based folders appears after this list.)
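To make the "folders are really queries" idea concrete, here is a
minimal sketch, assuming a Linux filesystem with user extended
attributes enabled; the attribute name "user.categories" is an
invented convention for illustration, not an existing standard:

    import os

    def tag(path, *categories):
        """Attach category tags to a file as an extended attribute."""
        os.setxattr(path, b"user.categories", ",".join(categories).encode())

    def folder(query, paths):
        """A 'folder' is just a query: every file whose tags match belongs to it."""
        for path in paths:
            try:
                tags = os.getxattr(path, b"user.categories").decode().split(",")
            except OSError:
                continue                  # file carries no tags at all
            if query in tags:
                yield path

    # The same bill shows up in both "folders" at once.
    tag("hospital-bill.pdf", "hospital", "finances")
    print(list(folder("hospital", ["hospital-bill.pdf"])))
    print(list(folder("finances", ["hospital-bill.pdf"])))

A real system would of course index the tags rather than scanning
every file, but the point stands: folder membership is the result of
a query, not a location on disk.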
See also:
- Search and Query
- Gelernter goes on at length about content addressable memory,
about how it should be possible to retrieve information from
the Eternity Service based on its content, and not based on its file
name. Search is important not only for finding a
needle-in-the-haystack on the Internet, but also for finding
the mislaid file on one's own computer. In a different but still
important sense, querying is used, for example, to report your
bank balance, out of the sea of other transactions and investments
and accounts one may have. The importance and centrality
of search and data sharing for general application development
is further discussed in the
Why-QOF
web page.
What are the pieces that are needed, and available?
- Natural language query parsers.
Gnome Storage
is looking to provide natural language query for desktop
applications.
- Distributed databases and distributed query. DNS (the Domain Name
System) is a distributed database for performing IP address lookup.
Unfortunately, there is no straightforward generalization to
arbitrary data. LDAP (the lightweight directory access protocol)
in theory can handle more generic data, but it remains difficult
to set up and use.
- My personal entry on this chart is
QOF, the goal of which
is to make it trivial for programmers to work with persistent,
globally-unique, versionable, queryable OOP-type 'objects'.
- Massively scalable search already has a proof-of-concept with
Google.
Curiously, though, the Google page rank is the result
of a carefully hand-tuned and highly proprietary algorithm.
This indicates that search by content alone is not enough;
search-by-content has to be ranked to provide results that
are meaningful to users. And it seems that it's the ranking,
and not the search, that is the hard part.
- Google focuses on free-text search. If you want prices,
you need Froogle (http://www.google.com/froogle).
Google is useless for binaries: if you want binary content,
you go to specialized sites:
rpmfind.net to locate RPM's,
tucows to locate shareware, or mp3.com or scour.net to find audiovisual
content. Each of these systems is appallingly poor at what it does:
the RPM Spec file is used to build the rpmfind directories, but doesn't
really contain adequate information.
The mp3 and shareware sites are essentially built by hand: that
part of the world doesn't even have the concept of an LSM to classify
and describe content! (LSM is a machine-readable format used by
metalab.unc.edu to classify the content of packages in its software
repository.)
- Searchable meta-data, and automatic time and (geographic)
place tagging of a file when it's created, viewed and edited.
If I created a file while I was drinking coffee in a
coffee-house, I want it tagged, so that I can find it later
when I go searching for the words "coffee house, 2 months ago".
If I happened to create three versions of that file,
I'd like to be able to call up each: there should have been
(semi-)automatic file versioning, a "continuous backup"
of sorts. A Wayback Machine for my personal data.
(A tiny sketch of this kind of continuous versioning follows.)
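As a minimal sketch of the "continuous backup" idea: every time a
file is saved, squirrel away a timestamped, content-addressed copy,
so that any earlier draft can be recalled later. The archive layout
and function names here are invented for illustration, not taken
from any existing tool:

    import hashlib, json, os, shutil, time

    ARCHIVE = os.path.expanduser("~/.snapshots")   # hypothetical archive location

    def snapshot(path, place=None):
        """Record a timestamped, content-addressed copy of 'path'."""
        os.makedirs(ARCHIVE, exist_ok=True)
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        copy = os.path.join(ARCHIVE, digest)
        if not os.path.exists(copy):               # identical content is stored once
            shutil.copy2(path, copy)
        entry = {"file": os.path.abspath(path), "sha1": digest,
                 "time": time.time(), "place": place}
        with open(os.path.join(ARCHIVE, "log.json"), "a") as log:
            log.write(json.dumps(entry) + "\n")

    def versions(path):
        """List every recorded version of 'path', oldest first."""
        with open(os.path.join(ARCHIVE, "log.json")) as log:
            entries = [json.loads(line) for line in log]
        return [e for e in entries if e["file"] == os.path.abspath(path)]

A real implementation would hook into the filesystem (e.g. via
inotify) instead of being called by hand, and would record far richer
meta-data than a single "place" string.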
Here are some additional references:
- gPulp provides a framework
for distributed searching. Derived from Gnutella Next Generation.
See the
Wired article. gPulp is run by a European consortium/standards body
that costs real money to join. They seem to be working on specs, not
implementations. The main spec is a P2P 'data discovery
protocol'.
- LSM's, Name Spaces and Self-Describing Objects
- There is another way to look at the problem of searching and finding
an object based on its content, rather than its 'unique identifier'.
Filenames/filepaths/URL's are essentially unique identifiers that
locate an object. Unfortunately, they only reference it, and maybe
provide only the slimmest of additional data. For example, in Unix,
the file system only provides the filename, owner, read/write
privileges, modification/access times. By looking at the file
suffix one can guess the mime-type, maybe: .txt .ps .doc .texi .html
.exe and so on. File 'magic' can also help guess at the content.
URL's don't even provide that much, although the HTTP/1.1 specification
describes a number of optional header fields that provide similar
information. See, for example,
Towards the Anti-Mac
or The Anti-Mac
Interface for some discussion of this problem.
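To see just how thin that per-file information is, here is a minimal
sketch that collects everything a stock Unix filesystem (plus
suffix-based guessing) will tell you about a file; note that nothing
in the result says what the file is actually about:

    import mimetypes, os, stat, time

    def describe(path):
        """All the meta-data a classic Unix filesystem really gives you."""
        st = os.stat(path)
        return {
            "name": path,
            "owner_uid": st.st_uid,
            "mode": stat.filemode(st.st_mode),             # e.g. '-rw-r--r--'
            "modified": time.ctime(st.st_mtime),
            "guessed_type": mimetypes.guess_type(path)[0], # from the suffix only
        }

    print(describe("report.doc"))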
What is really needed is an infrastructure for more closely defining
the content of a 'file' in both machine-readable and human-understandable
terms. At the very least, there is the concept of mime-types. Web-page
designers can use the <meta> tags to define some additional
info about an object. With the growth in popularity of XML, there is some
hope that the XML DTD's can be used to understand the type of object.
There is the semi-forgotten, semi-ignored concept of 'object naming'
and 'object trading brokers' as defined by CORBA, which attempt to match
object requests to any object that might fill that request, rather
than to an individually named object. Finally, there are sporadic attempts
to classify content: LSM's used by metalab.unc.edu, RPM Spec files used by
rufus.w3.org, deb's used by the Debian distribution. MP3's have an
extremely poor content description mechanism: one can store the name of the
artist, the title, the year and the genre. But these are isolated examples
with no unifying structure.
Unfortunately, Gelernter is right: there is no all-encompassing object
description framework or proposal in existence that can fill these needs.
We need something more than a mime-type, and something less than a free-text
search engine, to help describe and locate an object. The system must be
simple enough to use everywhere: one might desire to build it into the
filesystem, in the same way that 'owner' and 'modification date' are file
attributes. It will have to become a part of the 'finder', such as the
Apple Macintosh Finder or
Nautilus, the
Eazel finder. It must be general enough
to describe non-ASCII files, so that search engines (such as Google) could
perform intelligent searches for binary content. Today, Google can
neither classify nor return content based on LSM's, RPM's, deb's, or the MP3
artist/title/genre fields.
- distributed.net and SETI@home
- distributed.net
runs a distributed RC5-64 cracking / Golomb ruler effort.
Seti@Home
runs a distributed search of radio telescope data for interesting
sources of extraterrestrial electromagnetic data. Both of these
efforts are quite popular with the general public: they have built
specialized clients/screen-savers that have chewed through a
quadrillion trillion CPU cycles. Anyone who is happy running
a distributed.net client, or a seti@home client might be happy
running a generic client for performing massively parallel
computations. Why limit ourselves to SETI and cypher cracking?
Any problem that requires lots of CPU cycles to solve could,
in theory, benefit from this kind of distributed computing.
These high-CPU-usage problems need not be scientific
in nature.
A good example of a non-science high-cpu-cycle application is
the animation/special effects rendering needed for Hollywood
movies.
The problem may not even be commercial or require that
many CPU cycles: distributed gaming servers,
whether role-playing games, shoot-em-ups, or civilization/war
games currently require dedicated servers with good bandwidth
connections, administered by knowledgeable sysadmins.
The gotcha is that there is currently no distributed computing
client that is 'foolproof': providing generic services,
easy to install and operate, and hard for a
cracker/hacker to subvert. There are no easy programming APIs.
(This may be changing now; see BOINC, below.)
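As a minimal sketch, assuming a hypothetical coordinator at
work.example.org that hands out self-contained work units over HTTP
(the URL and endpoints are invented; the point is only the
fetch-compute-report loop that distributed.net, SETI@Home and BOINC
clients all share):

    import json, time, urllib.request

    COORDINATOR = "http://work.example.org"        # hypothetical project server

    def compute(unit):
        """Stand-in for the real science: burn CPU on the downloaded work unit."""
        return sum(x * x for x in unit["numbers"])

    while True:
        # 1. Fetch a work unit from the coordinator.
        with urllib.request.urlopen(COORDINATOR + "/get_work") as reply:
            unit = json.load(reply)
        # 2. Chew through it locally, using otherwise-idle CPU cycles.
        result = compute(unit)
        # 3. Report the result back and ask for more.
        report = json.dumps({"id": unit["id"], "result": result}).encode()
        urllib.request.urlopen(COORDINATOR + "/put_result", data=report)
        time.sleep(1)                              # be polite to the server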
Other clients:
- BOINC, the software underlying SETI@Home,
see listing below.
- Xenoservers, reference below.
- Climate Dynamics at RAL.
- United Devices
Purely commercial, totally proprietary.
- PVM & MPI are older technologies, optimized for cluster
and parallel computing. They are rather heavyweight,
demanding of bandwidth, and unable to deal with clients
that come and go (unreliable clients).
- Folding@Home
is attempting to solve protein folding problems with pure-custom
software.
- Popular Power attempted to
pay for CPU cycles, as did
Process Tree Network. Both are now defunct.
- Cosm attempted to define distributed
computing APIs. Defunct.
- ERights and Sandbox Applets
- Java still seems to be a technology waiting to fulfill its promise.
However, it (and a number of other interpreters) does have one
tantalizing concept built in: the sandbox, the chroot jail,
the honeypot. Run an unsafe program in the chrooted jail,
and we pretty much don't care what the program does, as long
as we bothered to put some caps on its CPU and disk usage.
Let it go berserk. But unfortunately, the chroot jail is
a sysadmin concept that takes brains and effort to install.
It's not something that your average Red Hat or Debian
install script sets up. Hell, we have to chroot named and
httpd and dnetc and so on by hand. We are still a long way
off from being able to publish a storage and CPU-cycle playground
on our personal computers that others could make use of as they
wished. It is not until these sorts of trust and
erights systems are set up
that the kind of computing that Gelernter talks about is possible.
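To show how much manual plumbing this takes today, here is a minimal
sketch of the "caps on CPU and disk usage" idea using nothing but
stock Unix facilities (chroot plus resource limits). It must be run
as root, the jail directory is assumed to already contain whatever
binaries the command needs, and it is nowhere near a real security
boundary:

    import os, resource, subprocess

    def run_confined(cmd, jail_dir, cpu_seconds=60, max_file_mb=10):
        """Run an untrusted command in a chroot jail with CPU and file-size caps."""
        def confine():
            os.chroot(jail_dir)                    # filesystem jail
            os.chdir("/")
            # Cap total CPU time and the largest file the process may create.
            resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
            cap = max_file_mb * 1024 * 1024
            resource.setrlimit(resource.RLIMIT_FSIZE, (cap, cap))
            os.setgid(65534)                       # drop privileges to 'nobody'
            os.setuid(65534)
        return subprocess.run(cmd, preexec_fn=confine)

    # e.g. run_confined(["/bin/busybox", "sh", "-c", "echo berserk"], "/srv/jail")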
- Streaming Media & Broadcast: Bandwidth Matters
- The naivest promise of 'digital convergence' is that soon, you'll
watch TV on your computer. Or something like that. There are
a multitude of blockers for the roll-out of these kinds of services,
and one of them is bandwidth strain put on the broadcaster and the
intervening Internet backbone. Given the way that people
(or rather, operating systems and software applications) use the
Internet today, if a thousand people want to listen to or view
a streaming media broadcast, then the server must send out a
thousand duplicate, identical streams. This puts a huge burden
on the server as well as nearby routers.
The traditional proposed solution
for this problem is MBONE, but MBONE has yet to see widespread
deployment. (MBONE is the Internet 'multicast backbone' which
allows a broadcast server to serve up one packet, and then have
Internet routers make copies of the packet as it gets sent to
receiving clients. Clients receive packets by 'subscribing'
to 'channels'.)
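For concreteness, "subscribing to a channel" looks roughly like this
at the socket level. A minimal receive-only sketch; the multicast
group and port are arbitrary examples, since any real broadcast would
publish its own:

    import socket
    import struct

    GROUP = "239.1.2.3"   # example multicast "channel"
    PORT = 5004

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    # Subscribe to the group; multicast-aware routers replicate packets toward us.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, sender = sock.recvfrom(1500)
        print(len(data), "bytes from", sender)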
There are two other approaches to distributing the bandwidth
load: ephemeral file server and distributed
streaming. Both leverage the idea that if
someone else is receiving the same data that you want,
then they can rebroadcast the data to you. The difference
between these two is whether you get the data in order, and
possibly whether you keep a permanent copy of it.
In either case, you get your data in "chunks" or pieces,
rather than as a whole.
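The "chunks" themselves are simple: the file is cut into fixed-size
pieces and each piece is identified by a hash, so that peers can
advertise, request and verify individual pieces independently of
their order. A minimal sketch (the 256KiB piece size is just a
typical choice, not a requirement):

    import hashlib

    PIECE_SIZE = 256 * 1024          # 256 KiB pieces

    def make_manifest(path):
        """Split a file into pieces; return a list of (index, sha1) pairs.
        Peers exchange pieces by index and verify each against its hash."""
        manifest = []
        with open(path, "rb") as f:
            index = 0
            while True:
                piece = f.read(PIECE_SIZE)
                if not piece:
                    break
                manifest.append((index, hashlib.sha1(piece).hexdigest()))
                index += 1
        return manifest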
For streamed media, e.g. a radio broadcast, it is assumed that
you are listening as it is broadcast, rather than waiting for
a download to "finish", and then listening to it. For
streamed media, the data must arrive in order, and must arrive in a
timely manner. I don't know of any examples at this time.
An ephemeral file server, by contrast, can (and usually will)
deliver data out-of-order (sometimes called "scatter-gather").
A good example might be
BitTorrent, which only shares the
file that you are currently downloading, instead of sharing all
of your files. It is "ephemeral" in the sense that sharing usually
stops shortly after download completes. BitTorrent explicitly
delivers chunks of the data out of order: the goal is to
make sure that everyone has something to share, rather than,
e.g. everyone having the first half but not the second half of
a file. "Ephemeral" does not mean short-term: torrents can
(and do) exist for months: they exist as long as a file is
popular, and as long as at least one client is up on the net.
Equally interesting are the things that BitTorrent doesn't do
or guarantee: for starters, there is no 'eternity':
if there are no clients offering the file, it is effectively gone.
BitTorrent keeps neither a master index of files offered
nor even a searchable index of offered torrents. One must
locate the torrent one wants in some other fashion: e.g. through
web pages or traditional search engines. In the same vein,
it's not a file system: there is no hierarchy of files that are
kept or can be browsed. The goal of BitTorrent really is to
balance the network load in a distributed fashion.
To summarize the technical points:
- The search problem: Can the user browse a list of available
content? Can the user search for particular content?
(BitTorrent relies on web pages and web search engines to
solve these problems.)
- The peer discovery problem: Once a particular bit of
content has been identified, how does a client discover
the other clients that are ready to share?
- BitTorrent and PDTP solve this problem by having a
tracker for each offered file. Clients register with the tracker
and tell it what chunks of the file they already have;
the tracker responds with a list of clients that might
have the chunks we don't yet have. Clients keep the
tracker up-to-date as the download proceeds. Conceptually,
there is one tracker per offered file. Note, however,
that the tracker is vulnerable: if it goes down, new
clients are shut out. (A toy tracker sketch appears after
this list of technical points.)
- Swarmcast uses a Forward-Error
Correction (FEC) algorithm to create packets that occupy
a data space that is orders of magnitude larger than the
offered file. Thus, the receiver can reconstruct the
whole file after having received only a very small
portion of the total packets in the space. The
use of FEC encoding eliminates the need for a
chunk tracker: all packets in the data space are
"guaranteed" to contain data that the client does not
yet have. This is by encoding in a very large data
space: the probability that the client receives data
that it already has is equal to the ratio of the
file size to the data space size; this ratio can be
made arbitrarily small. For example, a 10MB file encoded
into a 10GB packet space gives a duplicate probability of
only about 0.1%. (It's kind of like a hologram:
you need only some of it to reproduce the whole.)
The downside to this approach is that it is CPU-intensive,
and it can inflate the total number of bytes that need
to be delivered by a fair amount. The upside is that it
can roll encryption and encoding into one.
- The streaming problem. For streaming to work, data must
be delivered in order. (BitTorrent doesn't do that.)
- Bandwidth allocation/balancing between peers. BitTorrent
tries to load-balance by using a tit-for-tat strategy:
a client will only offer chunks to those clients that
are sending chunks to it. For streaming media, this
strategy clearly can't work: sharing must be transitive,
not reciprocal.
- The 'dropped frames' problem: The viewer/receiver of a
real-time stream must be able to get data in a timely
manner, so that they can watch their show/movie without
interruption. The viewer is potentially willing to trade
disproportionate amounts of upload bandwidth in exchange
for guaranteed download bandwidth. The receiver is
mostly interested in having multiple redundant streaming
servers handy.
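As promised above, a toy sketch of the tracker bookkeeping: one
tracker per file, clients announce which chunk indices they hold, and
the tracker answers with peers holding chunks the asking client still
lacks. In-memory and single-process only, purely to illustrate the
idea (a real tracker speaks a network protocol, of course):

    class Tracker:
        """One tracker per offered file: maps each peer to the chunks it holds."""
        def __init__(self, total_chunks):
            self.total_chunks = total_chunks
            self.peers = {}                   # peer address -> set of chunk indices

        def announce(self, peer, chunks):
            """A client registers (or refreshes) the set of chunks it already has."""
            self.peers[peer] = set(chunks)

        def find_sources(self, peer):
            """Return peers holding at least one chunk that 'peer' is still missing."""
            missing = set(range(self.total_chunks)) - self.peers.get(peer, set())
            return [other for other, have in self.peers.items()
                    if other != peer and have & missing]

    # Example: peer B is missing chunk 2, which only peer A has.
    t = Tracker(total_chunks=3)
    t.announce("10.0.0.1:6881", [0, 1, 2])    # peer A has everything
    t.announce("10.0.0.2:6881", [0, 1])       # peer B still needs chunk 2
    print(t.find_sources("10.0.0.2:6881"))    # -> ['10.0.0.1:6881']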
I am not yet aware of any generally available streaming-media
reflectors, other than those based on MBONE.
- Swarmcast, now defunct, may have been
the first to use a scatter-gather
type algorithm for delivering a file by chopping it up into
chunks. (Swarmcast predates BitTorrent.) GPL license.
- BitTorrent, described below,
is an 'ephemeral fileserver', serving up files in a
distributed fashion for the few moments that they are
popular and being actively downloaded by others.
- PDTP is a distributed file system,
with hierarchical directories, and it also offers network
load balancing through distributed file-piece delivery.
- The Internet for the Rest of Us
- To understand the future, it is sometimes useful to look at the
past. Remember UUCP? It used to tie the Unix world together, as
did BITNET for the VAXes and Crays, or the VM network for
mainframes. They were all obsoleted by the IP protocols of
the Internet. But for a long time, they lived side-by-side,
even attached to the Internet through gateways.
The ideas that powered these
networks were subsumed into, became a part of the Internet:
The King is Dead, Long Live the King! The spread of the types
of technologies that Gelernter talks about will be evolutionary,
not revolutionary.
Similarly, remember 'The Computer for the Rest of Us'?
Well, before the web exploded, Marc Andreessen used to talk about
'The Internet for the Rest of Us'. Clearly, some GUI slapped
on the Internet would make it far more palatable, as opposed to the
'command-line' of telnet and ftp. But a web browser is not just
a pretty GUI slapped on telnet or ftp, and if it had been, the
WWW still wouldn't exist (what happened to 'gopher'? Simple:
no pictures, no 'home pages'). The success of the WWW
needed a new, simple, easy technology: HTTP and hyperlinks, to
make it go. The original HTTP and HTML were dirt-simple, and that
was half the power of the early Internet. Without this simplicity
and ease of use, the net wouldn't have happened.
What about 'the rest of us'? It wasn't just technology that made
the Internet explode, it was what the technology could do. It
allowed (almost) anyone to publish anything at a tiny fraction
of the cost of traditional print/radio/TV publishing. It gave
power to the people. It was a fundamentally democratic movement
that was inclusive, that allowed anyone to participate, not just
the rich, the powerful, or the members of traditional media
establishments. In a bizarrely different
way, it is these same forces that power music file trading:
even if the
music publishing industry hadn't fallen asleep at the wheel, it
would still be democratization that drives file traders. Rather than listening
to what the music industry wants me to listen to, I can finally listen
to what I want to listen to. At long last, I am able to match
the artist to the artist's work, rather than listening to the radio and
scratching my head: 'gee, I liked that song, but what the hell was
the name of the artist?' Before Napster, I didn't know what
music CD to buy, even when I wanted to buy one. I wasn't hip enough to
have friends who knew the names of the cool bands, the CDs that
were worth buying. Now, finally, I know the names of the bands
that I like. Napster gave control back to the man in the street.
Similarly, the final distributed storage/computation infrastructure
will have to address similar populist goals: it must be inclusive,
not exclusive. Everyone must be able to participate. It must
be for 'the rest of us'.
- Commercialization
- Like the early days of the net, the work of volunteers drove the
phenomenon. Only later did it become commercialized. Unlike then,
we currently have a Free Software community that is quite conscious
of its own existence. It's a more powerful force. Once the
basic infrastructure gets built, large companies will come
to make use of and control that infrastructure. But meanwhile,
we, as engineers, can build it.
I guess the upshot of this little diatribe is that Gelernter talks
about his changes in a revolutionary manner, leading us to believe
that the very concept of an operating system will have to be
re-invented. He is wrong. The very concept of an operating system
*will* be reinvented, someday. In the meanwhile, we have a perfectly
evolutionary path from here to there, based not only on present
technologies and concepts, but, furthermore, based on the principles
of free software.