Researcher access is currently not available pending redesign. This material has been retained for reference and was current information as of late 2002.

Data Available

The Internet Archive contains over 100 terabytes of compressed data. This data is collected in collaboration with Alexa Internet. Alexa sends its crawlers out into the web roughly once every 2 months, retrieves copies of virtually everything it encounters, and donates a copy of this data to the Internet Archive. During periods of particular interest, such as a presidential election or extraordinary breaking news, relevant sites are crawled more frequently, roughly every 2 to 8 hours.

The Internet Archive began archiving data in 1996. The archive grows at a rate of approximately 70 megabytes per second. A data pool of this magnitude offers a myriad of research ideas worth exploring and we encourage you to do so!

Archive Infrastructure

The Archive data is stored on approximately 150 desktop computers, each containing four 160 GB hard drives. These drives are mounted at /0, /1, /2, and /3. In general, drives /1, /2, and /3 are filled to capacity with the archived files. Drive /0, however, is only half used. The other half (~77 GB) is reserved as temporary space that can be used for data manipulation. The temporary space is not always located on drive /0 itself, but the alias /0/.final/tmp always refers to the actual temporary space on the host.

Each computer host has a name in the general form ia00###, where ### can be in the range 100 - 177, 200 - 277, or 300 - 337. The digits ### refer to the physical location of the machine within the Archive computer cluster. The computers are situated on rows of racks in the San Francisco Mission District Facility. The first digit of ### refers to the rack, the second to the shelf, and the third to the machine on the shelf. The entire listing of hosts is stored in the environment variable $ARCS. Subset listings of machines are stored in the environment variables $rack1, $rack2, and $rack3, which list the machines numbered 100 - 177, 200 - 277, and 300 - 337, respectively.
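
As an illustration of the naming scheme, the following Python sketch splits a host name into its rack, shelf, and machine digits and checks the number against the ranges listed above. The function name parse_host and the RACK_RANGES table are hypothetical helpers, not part of any Archive tool, and the check is purely numeric; it assumes nothing about which individual numbers within a range are actually in use.

```python
import re

# Host-number ranges as listed in the text. The keys mirror the
# $rack1, $rack2, and $rack3 environment variable names.
RACK_RANGES = {
    "rack1": (100, 177),
    "rack2": (200, 277),
    "rack3": (300, 337),
}

_HOST_RE = re.compile(r"ia00([0-9])([0-9])([0-9])")


def parse_host(name):
    """Return (rack, shelf, machine) for a host name such as 'ia00213'.

    Raises ValueError if the name is not of the form ia00### or if the
    number falls outside the ranges listed in the text.
    """
    match = _HOST_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not of the form ia00###: {name!r}")
    rack, shelf, machine = (int(digit) for digit in match.groups())
    number = rack * 100 + shelf * 10 + machine
    if not any(low <= number <= high for low, high in RACK_RANGES.values()):
        raise ValueError(f"host number {number} is outside the listed ranges")
    return rack, shelf, machine


print(parse_host("ia00213"))  # rack 2, shelf 1, machine 3
```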

Research.archive.org houses the personal files of the users on the system. Each user has access to the directory /home/<login> for file storage. Since research.archive.org is NFS mounted on all of the hosts, a user's home directory is always accessible from any remote host in the cluster as if the home directory were physically stored on each individual host. Altering files in a home directory from one remote host immediately affects the files seen from every other host, because each host mounts the very same (and only) research.archive.org host.

Individual hosts can be accessed using the remote shell (rsh) UNIX command. The hosts in the cluster have an auto-authenticating script, so the secure shell (ssh) command is unnecessary. Access to the hosts is limited depending on the type of user account that is held. User accounts directly on research.archive.org have access to all of the machines located in $rack1.
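
A minimal Python sketch of running one command on each host in a rack is given below. It assumes that the environment variable (named rack1 here, following $rack1 above) holds a whitespace-separated list of host names; that format is an assumption, as are the helper names hosts_from_env, rsh_command, and run_on_hosts. The only external command used is rsh in its usual "rsh host command" form.

```python
import os
import subprocess


def hosts_from_env(var="rack1", environ=None):
    """Return the host names stored in an environment variable.

    Assumption: the variable holds a whitespace-separated list of names.
    """
    environ = os.environ if environ is None else environ
    return environ.get(var, "").split()


def rsh_command(host, remote_command):
    """Build the argument list for running remote_command on host via rsh."""
    return ["rsh", host, remote_command]


def run_on_hosts(hosts, remote_command):
    """Run remote_command on each host in turn and collect its output."""
    results = {}
    for host in hosts:
        completed = subprocess.run(
            rsh_command(host, remote_command),
            capture_output=True,
            text=True,
        )
        results[host] = completed.stdout
    return results


# Show the commands that would be issued, using a made-up host list
# instead of the real $rack1 contents.
example_env = {"rack1": "ia00100 ia00101 ia00102"}
for host in hosts_from_env("rack1", example_env):
    print(" ".join(rsh_command(host, "df /0")))
```

Calling run_on_hosts(hosts_from_env("rack1"), "df /0") from research.archive.org would then report the disk usage of /0 on each machine, assuming the account has access to those hosts.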

How the Data is Stored

All of the archived web data is stored in ARC and DAT files. The ARC files contain the actual archived documents (html, gif, jpeg, ps, etc.) each preceded by some header information about the document. These archived files are individually compressed and individually accessible. There are a number of AV data mining tools provided for this purpose.

Each ARC file has a corresponding DAT file. The DAT files contain meta-information about each document: the outward links that the document contains, the document file format, the document size, etc.

ARC and DAT files are indexed with CDX files. Each host provides an index, complete.cdx, located in /0/tmp/. This index may be joined against path_index.txt, located in the same directory, for the full path of the ARC file containing the archived document.
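
The join described above might be sketched in Python as follows. The file layouts are not specified here, so this sketch makes several assumptions: that CDX records are whitespace-separated, that one column of each record names the ARC file, and that each line of path_index.txt holds an ARC file name followed by its full path. The column position must be read off the legend of the real file; the helper names and the sample records are invented for illustration only.

```python
def load_path_index(lines):
    """Map ARC file name -> full path.

    Assumption: each line has two whitespace-separated columns,
    the ARC file name followed by its full path.
    """
    paths = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            paths[parts[0]] = parts[1]
    return paths


def join_cdx(cdx_lines, paths, arc_column):
    """Yield (fields, full_path) for each CDX record.

    arc_column is the zero-based position of the ARC file name within a
    record. full_path is None when the name is missing from the index.
    """
    rows = iter(cdx_lines)
    next(rows, None)  # skip the legend on the first line
    for line in rows:
        fields = line.split()
        if len(fields) <= arc_column:
            continue  # blank or short line
        yield fields, paths.get(fields[arc_column])


# Invented sample data; real files will have a different legend and columns.
sample_cdx = [
    "legend: url date arcfile",
    "example.com/ 20020101000000 sample-arc-1",
    "example.org/ 20020102000000 sample-arc-2",
]
sample_index = ["sample-arc-1 /1/sample-arc-1.arc.gz"]

paths = load_path_index(sample_index)
for fields, full_path in join_cdx(sample_cdx, paths, arc_column=2):
    print(fields[0], full_path)
```

On a real host the two inputs would be open file objects for complete.cdx and path_index.txt in /0/tmp/. Loading the path index into a dictionary keeps the join to a single pass over the larger CDX file.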

In addition to the indices located on each host, the archive also contains an archive-wide index split across 6 remote hosts. These are aliased as index1 - index6. The CDX file on each of these hosts is located in /0/wayback.cdx.gz and is formatted slightly differently from the other CDX files located on each remote host. Refer to the legend on the first line of any CDX file for information on how to interpret the data.
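
Because the two kinds of CDX file differ in layout, it can help to look at the legend before processing a file. The Python sketch below returns the first line of a CDX file, whether it is plain text or gzip-compressed. The helper names open_cdx and read_legend are hypothetical, and the sketch deliberately does not interpret the legend, since its format is not described here.

```python
import gzip


def open_cdx(path):
    """Open a CDX file as text, handling a .gz suffix transparently."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def read_legend(path):
    """Return the first line of a CDX file, which holds its legend."""
    with open_cdx(path) as handle:
        return handle.readline().rstrip("\n")


# Example calls, to be run on a host where the files exist:
#   read_legend("/0/wayback.cdx.gz")     on one of index1 - index6
#   read_legend("/0/tmp/complete.cdx")   on an ordinary host
```

Only the first line is read, so the call stays cheap even on the large archive-wide index.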

