Citizendia

Web archiving is the process of collecting portions of the Web and ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the public. Due to the massive size of the Web, web archivists typically employ web crawlers for automated collection. The largest web archiving organization based on a crawling approach is the Internet Archive, which strives to maintain an archive of the entire Web. National libraries, national archives and various consortia of organizations are also involved in archiving culturally important Web content.

Collecting the Web

Web archivists generally archive all types of web content including HTML web pages, style sheets, JavaScript, images, and video. They also archive metadata about the collected resources such as access time, MIME type, and content length. This metadata is useful in establishing authenticity and provenance of the archived collection.
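As an illustration of the kind of metadata involved, the following Python sketch builds a record for one collected resource. The function name `build_metadata`, the field names, and the SHA-256 digest are assumptions made for this example and are not taken from any particular archiving tool.

```python
import hashlib
from datetime import datetime, timezone
from email.message import Message


def build_metadata(url, headers, body, accessed_at=None):
    """Return a metadata record for one collected resource (hypothetical layout)."""
    accessed_at = accessed_at or datetime.now(timezone.utc)
    # Use the stdlib header parser so that parameters such as
    # "; charset=utf-8" are separated from the MIME type itself.
    msg = Message()
    msg["Content-Type"] = headers.get("Content-Type", "application/octet-stream")
    return {
        "url": url,
        "access_time": accessed_at.isoformat(),
        "mime_type": msg.get_content_type(),
        "content_length": len(body),
        # Assumption: a digest is kept so the stored bytes can be verified later.
        "sha256": hashlib.sha256(body).hexdigest(),
    }
```

A record like this, stored next to the resource, lets a later reader check when the copy was made and whether the stored bytes have changed since.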

Methods of collection

Remote harvesting

The most common web archiving technique uses web crawlers to automate the process of collecting web pages. Web crawlers typically view web pages in the same manner that users with a browser see the Web, and therefore provide a comparatively simple method of remotely harvesting web content. Web crawlers used for web archiving include Heritrix, the Internet Archive's crawler, which was specially designed for web archiving.
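The crawling technique itself can be sketched in a few lines of Python. This is a minimal illustration and not the code of any particular archiving crawler. The `fetch` callable is a hypothetical stand-in for an HTTP client, and the single-host restriction and `max_pages` limit are assumptions made for the example. Real crawlers also handle binary resources, politeness delays, and much more.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href/src targets a browser would follow or load."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl of a single host; returns {url: body}.

    `fetch` is a hypothetical callable: url -> HTML text, or None on failure.
    """
    host = urlparse(seed).netloc
    queue, seen, archive = deque([seed]), {seed}, {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue  # unreachable resource: nothing to store
        archive[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            target = urldefrag(urljoin(url, link)).url
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return archive
```

Passing the fetcher in as a parameter keeps the sketch testable without network access; for instance, `crawl("http://example.org/", pages.get)` walks an in-memory dictionary of pages.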

On-demand

There are numerous services that may be used to archive web resources "on-demand", using web crawling techniques:

Database archiving

Database archiving refers to methods for archiving the underlying content of database-driven websites. It typically requires the extraction of the database content into a standard schema, often using XML. Once stored in that standard format, the archived content of multiple databases can then be made available using a single access system. This approach is exemplified by the DeepArc and Xinq tools developed by the Bibliothèque nationale de France and the National Library of Australia respectively. DeepArc enables the structure of a relational database to be mapped to an XML schema, and the content exported into an XML document. Xinq then allows that content to be delivered online. Although the original layout and behavior of the website cannot be preserved exactly, Xinq does allow the basic querying and retrieval functionality to be replicated.
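The general idea of exporting relational content to XML can be illustrated with a short Python sketch. It does not reproduce DeepArc or its schema mapping. It assumes an SQLite database and a hypothetical `database/table/row/column` element layout chosen only for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_database(conn):
    """Export every user table of an SQLite database into one XML element."""
    root = ET.Element("database")
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' "
        "AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        table_el = ET.SubElement(root, "table", name=table)
        # Table names come from sqlite_master, so quoting them is sufficient.
        quoted = '"{}"'.format(table.replace('"', '""'))
        cursor = conn.execute("SELECT * FROM " + quoted)
        columns = [d[0] for d in cursor.description]
        for row in cursor:
            row_el = ET.SubElement(table_el, "row")
            for column, value in zip(columns, row):
                col_el = ET.SubElement(row_el, "column", name=column)
                if value is None:
                    # Assumption: NULL is marked with an attribute, not text.
                    col_el.set("null", "true")
                else:
                    col_el.text = str(value)
    return root
```

The resulting element can be serialized with `ET.tostring(root, encoding="unicode")`. Once several databases are exported to the same layout, one access tool can query all of them.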

Transactional archiving

Transactional archiving is an event-driven approach that collects the actual transactions which take place between a web server and a web browser. It is primarily used as a means of preserving evidence of the content which was actually viewed on a particular website, on a given date. This may be particularly important for organizations which need to comply with legal or regulatory requirements for disclosing and retaining information.

A transactional archiving system typically operates by intercepting every HTTP request to, and response from, the web server, filtering each response to eliminate duplicate content, and permanently storing the responses as bitstreams. A transactional archiving system requires the installation of software on the web server, and cannot therefore be used to collect content from a remote website.
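The intercept, deduplicate, and store steps can be sketched as a piece of WSGI middleware in Python. This is an illustration under stated assumptions and not a description of any commercial product. The class name `TransactionArchiver` is hypothetical, responses are deduplicated by SHA-256 digest, and storage is an in-memory dictionary where a real system would write to permanent storage.

```python
import hashlib
from datetime import datetime, timezone


class TransactionArchiver:
    """WSGI middleware that records each request/response pair (sketch)."""

    def __init__(self, app):
        self.app = app
        self.bodies = {}        # digest -> response bytes, stored only once
        self.transactions = []  # one log entry per intercepted request

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status
            return start_response(status, headers, exc_info)

        result = self.app(environ, recording_start_response)
        try:
            # Simplification: the whole response is buffered before sending.
            body = b"".join(result)
        finally:
            if hasattr(result, "close"):
                result.close()

        digest = hashlib.sha256(body).hexdigest()
        # Duplicate filtering: identical bitstreams share one stored copy.
        self.bodies.setdefault(digest, body)
        self.transactions.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "method": environ.get("REQUEST_METHOD"),
            "path": environ.get("PATH_INFO"),
            "status": captured.get("status"),
            "digest": digest,
        })
        return [body]
```

Because the middleware wraps the application inside the server process, it sees exactly what was served. This is also why the approach cannot be applied to a remote website.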

Examples of commercial transactional archiving software include:

Difficulties and limitations

Crawlers

Web archives which rely on web crawling as their primary means of collecting the Web are influenced by the difficulties of web crawling:

The Web is so large that crawling a significant portion of it takes a large amount of technical resources.

The Web is changing so fast that portions of a website may change before a crawler has even finished crawling it.

General limitations

Not only must web archivists deal with the technical challenges of web archiving, they must also contend with intellectual property laws. Peter Lyman (2002) states that "although the Web is popularly regarded as a public domain resource, it is copyrighted; thus, archivists have no legal right to copy the Web." Some publicly accessible web archives, such as WebCite and the Internet Archive, allow content owners to hide or remove archived content that they do not want the public to have access to. Other web archives are only accessible from certain locations or have regulated usage. WebCite also cites on its FAQ a recent lawsuit against the caching mechanism, which Google won.

See also

Archive: a collection of historical records, and also the location in which these records are kept
Archive site: a website that stores information on, or the actual webpages from, the past for anyone to view
Digital preservation: the management of digital information over time
Heritrix: the Internet Archive's web crawler, which was specially designed for web archiving
Internet Archive (IA): a nonprofit organization dedicated to maintaining an online library and archive
Library of Congress National Digital Library Program (NDLP): a program assembling a digital library of reproductions of primary source materials
National Digital Information Infrastructure and Preservation Program: a national strategic program led by the Library of Congress to preserve digital content
UK Web Archiving Consortium (UKWAC): a consortium of six leading UK institutions working collaboratively on a pilot operation archiving selected UK websites
Web crawler (also known as a web spider or web robot)
WebCite: a service that archives webpages on demand; authors can subsequently cite the archived webpages through WebCite in addition to citing the original URL of the webpage
Virtual artifact (VA): an immaterial object that exists in the human mind or in a digital environment, for example the Internet
© 2009 citizendia.org; parts available under the terms of GNU Free Documentation License, from http://en.wikipedia.org