Citizendia

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. An alternate name for the process in the context of search engines designed to find web pages on the Internet is Web indexing.

Popular engines focus on the full-text indexing of online, natural language documents[1]; media types such as video and audio[2] and graphics[3][4] are also searchable.

Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time.

Web 3.0, based on technologies such as the Semantic Web and Website Parse Template, hopes to provide the next generation of search engines with more intelligent parsing and indexing technologies.

Indexing

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

Index Design Factors

Major factors in designing a search engine's architecture include:

Merge factors 
How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms. [5]
Storage techniques 
How to store the index data, that is, whether information should be data compressed or filtered.
Index size 
How much computer storage is required to support the index.
Lookup speed 
How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.
Maintenance 
How the index is maintained over time[6].
Fault tolerance 
How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning, and schemes such as hash-based or composite partitioning[7], as well as replication.

Index Data Structures

Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors. Types of indices include:

Suffix tree 
Figuratively structured like a tree, supports linear time lookup. Built by storing the suffixes of words. Used for searching for patterns in DNA sequences and clustering. A major drawback is that the storage of a word in the tree may require more storage than storing the word itself. [8] An alternate representation is a suffix array, which is considered to require less virtual memory and supports data compression such as the BWT algorithm.
Trie 
An ordered tree data structure that is used to store an associative array where the keys are strings. Regarded as faster than a hash table but less space-efficient. The suffix tree is a type of trie. Tries support extendable hashing, which is important for search engine indexing. [9]
Inverted index 
Stores a list of occurrences of each atomic search criterion[10], typically in the form of a hash table or binary tree[11][12].
Citation index 
Stores citations or hyperlinks between documents to support citation analysis, a subject of bibliometrics.
Ngram index 
Stores sequences of data of length n to support other types of retrieval or text mining. [13]
Term document matrix 
Used in latent semantic analysis, stores the occurrences of words in documents in a two-dimensional sparse matrix.
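
To make one of these structures concrete, the following is a minimal sketch of a suffix array, the alternate representation mentioned under the suffix tree entry. It assumes a small in-memory string and uses a naive construction that sorts the suffixes directly; the function names are illustrative only, and a production index would use a far more efficient construction algorithm.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of `text`, in sorted suffix order.

    Naive construction: it compares whole suffixes, so it only suits small inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return every offset at which `pattern` occurs, using two binary searches."""
    width = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + width]

    # Lower bound: first suffix whose prefix is not smaller than the pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose prefix is greater than the pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


text = "banana"
suffix_array = build_suffix_array(text)
matches = find_occurrences(text, suffix_array, "ana")
```

Because all suffixes that begin with the pattern sit next to each other in sorted order, a lookup needs only the two binary searches rather than a scan of the whole text.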

Challenges in Parallelism

A major challenge in the design of search engines is the management of parallel computing processes. There are many opportunities for race conditions and coherent faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a web crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully-synchronized, distributed, parallel architecture. [14]
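
The producer-consumer model can be sketched on a single machine with two threads and a queue. This is an illustration under simplifying assumptions rather than a description of any real engine: the crawler is simulated by a list of documents, the index is a plain dictionary, and one lock guards both updates and queries. All names are hypothetical.

```python
import queue
import threading


def run_pipeline(documents):
    """Crawl (produce) and index (consume) concurrently; return the index and its lock."""
    pending = queue.Queue()          # hand-off buffer between crawler and indexer
    index = {}                       # word -> set of document ids
    index_lock = threading.Lock()    # guards every read and write of `index`
    done = object()                  # sentinel marking the end of the crawl

    def crawler():
        # Producer: stands in for a web crawler fetching documents.
        for doc_id, text in documents:
            pending.put((doc_id, text))
        pending.put(done)

    def indexer():
        # Consumer: takes crawled documents and adds their words to the index.
        while True:
            item = pending.get()
            if item is done:
                break
            doc_id, text = item
            for word in text.lower().split():
                with index_lock:
                    index.setdefault(word, set()).add(doc_id)

    threads = [threading.Thread(target=crawler), threading.Thread(target=indexer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return index, index_lock


def search(index, index_lock, word):
    """Query under the same lock, so a lookup never sees a half-applied update."""
    with index_lock:
        return set(index.get(word, ()))


index, index_lock = run_pipeline([(1, "the cow says moo"), (2, "the cat and the hat")])
result = search(index, index_lock, "the")
```

A single lock keeps the sketch simple but serializes queries and updates. Distributed designs have to solve the same coordination problem across machines, which is where the incoherency risks described above come from.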

Inverted indices

Many search engines incorporate an inverted index when evaluating a search query to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct access to find the documents associated with each word in the query in order to retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:

Inverted Index
Word    Documents
the     Document 1, Document 3, Document 4, Document 5
cow     Document 2, Document 3, Document 4
says    Document 5
moo     Document 7

This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a boolean index. Such an index determines which documents match a query but does not rank matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of a word in each document. [15] Position information enables the search algorithm to identify word proximity to support searching for phrases; frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval.
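
The difference between a boolean index and one that also records positions can be shown with a small sketch. It assumes whitespace-separated, lower-cased words and a corpus small enough to hold in memory; the sample documents and function names are made up for illustration.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each word to {document id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def boolean_match(index, word):
    """Boolean lookup: which documents contain the word at all."""
    return set(index.get(word, {}))


def term_frequency(index, word, doc_id):
    """Frequency lookup: how often the word occurs in one document."""
    return len(index.get(word, {}).get(doc_id, []))


def phrase_match(index, first, second):
    """Documents in which `second` appears immediately after `first`."""
    hits = set()
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(position + 1 in following for position in positions):
            hits.add(doc_id)
    return hits


documents = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "moo says the cow",
}
index = build_positional_index(documents)
```

Here `boolean_match` uses only the information a boolean index holds, while `term_frequency` and `phrase_match` depend on the stored positions. That extra data is what makes ranking and phrase search possible, at the cost of a larger index.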

The inverted index is a sparse matrix, since not all words are present in each document. To reduce computer storage memory requirements, it is stored differently from a two-dimensional array. The index is similar to the term document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a distributed hash table. [16]

Inverted indices can be programmed in several computer programming languages. [17][18]

Index Merging

The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing[19], where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.

After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time consuming, and so this process is commonly split up into two parts, the development of a forward index and a process which sorts the contents of the forward index into the inverted index. The inverted index is so named because it is an inversion of the forward index.
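
The merge and rebuild operations can be sketched with two in-memory dictionaries. This is a simplification that assumes everything fits in memory (a real merge would combine an in-memory batch with an index on disk), and the function and variable names are hypothetical. The forward mapping is kept so that, when a document is updated, its stale entries can be removed before the new words are added.

```python
def merge(index, forward, batch):
    """Merge a batch of new or updated documents into an existing index.

    index:   word -> set of document ids (the inverted index)
    forward: document id -> set of words (used to undo old entries)
    batch:   document id -> text of the new or updated document
    """
    for doc_id, text in batch.items():
        # Updating old content: remove the document's previous entries first.
        for word in forward.get(doc_id, ()):
            index[word].discard(doc_id)
            if not index[word]:
                del index[word]
        # Adding content: parse into words and record the document under each.
        words = set(text.lower().split())
        forward[doc_id] = words
        for word in words:
            index.setdefault(word, set()).add(doc_id)


def rebuild(index, forward, batch):
    """A rebuild is a merge that first deletes the contents of the index."""
    index.clear()
    forward.clear()
    merge(index, forward, batch)


index, forward = {}, {}
merge(index, forward, {1: "the cow", 2: "the cat"})
merge(index, forward, {1: "a dog"})   # document 1 is updated, not duplicated
```

After the second call, document 1 is no longer listed under "the" or "cow", which is the check for old versus new content that the indexer must perform.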

The Forward Index

The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Forward Index
Document      Words
Document 1    the, cow, says, moo
Document 2    the, cat, and, the, hat
Document 3    the, dish, ran, away, with, the, spoon

The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck. [20] The forward index is sorted to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
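
The conversion by sorting can be sketched directly, using the three documents from the forward index table. The sketch assumes the whole forward index fits in memory and drops duplicate (word, document) pairs; a large engine would sort in batches on disk.

```python
from itertools import groupby
from operator import itemgetter

forward_index = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}


def invert(forward_index):
    """Turn a document -> words mapping into a word -> documents mapping."""
    # Flatten to (word, document) pairs; the set removes repeated pairs.
    pairs = {(word, doc) for doc, words in forward_index.items() for word in words}
    inverted = {}
    # Sorting by word puts all pairs for one word side by side.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        inverted[word] = [doc for _, doc in group]
    return inverted


inverted_index = invert(forward_index)
```

The only real work is the `sorted` call; the grouping that follows just collects neighbouring pairs that share a word.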

Compression

Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk. [21] Consider the following scenario for a full text, Internet search engine.

Given this scenario, an uncompressed index (assuming a non-conflated, simple, index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone, more than the average free disk space of 25 personal computers. This space requirement may be even larger for a fault-tolerant distributed storage architecture. Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression.
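
The arithmetic behind these figures can be checked with a short calculation. This is only a sketch: it assumes decimal gigabytes (10^9 bytes), and the compression ratio used at the end is a made-up illustration value, not a measured one.

```python
PAGES = 2_000_000_000            # web pages in the scenario
WORD_ENTRIES = 500_000_000_000   # word entries in the uncompressed index
BYTES_PER_WORD = 5               # 1 byte per character, 5 characters per word
GIGABYTE = 10**9                 # assumption: decimal gigabytes


def index_size_gb(word_entries=WORD_ENTRIES, bytes_per_word=BYTES_PER_WORD):
    """Uncompressed index size in gigabytes."""
    return word_entries * bytes_per_word / GIGABYTE


def compressed_size_gb(ratio):
    """Size after compression; `ratio` is hypothetical (0.25 means one quarter)."""
    return index_size_gb() * ratio


words_per_page = WORD_ENTRIES // PAGES      # implied average words per page
uncompressed = index_size_gb()              # total for the whole index
per_computer = uncompressed / 25            # share of each of the 25 computers
quarter_size = compressed_size_gb(0.25)     # illustrative compressed size
```

The figures imply an average of 250 words per page and about 100 gigabytes of free space on each of the 25 personal computers.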

Notably, large scale search engine designs incorporate the cost of storage as well as the costs of electricity to power the storage. Thus compression is a measure of cost.
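
The text does not say which compression techniques search engines use. As an assumed example, one widely described approach for a word's document list is to store the gaps between sorted document numbers and write each gap with a variable number of bytes. The sketch below assumes positive integer document ids; the function names are illustrative.

```python
def encode_postings(doc_ids):
    """Encode document ids as gaps, each written as a variable-length byte sequence."""
    out = bytearray()
    previous = 0
    for doc_id in sorted(doc_ids):
        gap = doc_id - previous
        previous = doc_id
        # Emit 7 bits per byte, low bits first; the high bit means "more follows".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Reverse encode_postings: rebuild the sorted list of document ids."""
    doc_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids


postings = [3, 130, 131, 100000]
encoded = encode_postings(postings)
decoded = decode_postings(encoded)
```

The four ids take 6 bytes here instead of the 16 bytes that fixed 32-bit integers would need. The saving is paid for with the extra encoding and decoding work, which is the tradeoff between storage and processing noted in this section.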

Document Parsing

Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called tokens, and so, in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization. It is also sometimes called word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.

Natural language processing, as of 2006, is the subject of continuous research and technological improvement. Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementation of which is commonly kept as a corporate secret.

Challenges in Natural Language Processing

Word Boundary Ambiguity 
Native English speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a multilingual indexer. In digital form, the texts of other languages such as Chinese, Japanese or Arabic represent a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).
Language Ambiguity 
To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.
Diverse File Formats 
In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.
Faulty Storage 
The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. Binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.

Tokenization

Unlike literate human adults, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly called a tokenizer or parser or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.

During tokenization, the parser identifies sequences of characters which represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify entities such as email addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
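
A minimal tokenizer along these lines can be sketched with a regular expression. The patterns below are deliberately loose assumptions, not validated grammars for URLs or email addresses, and only a few of the characteristics listed above are recorded. The field names are illustrative.

```python
import re

# Alternatives are tried in order, so URLs and emails win over plain words.
TOKEN_PATTERN = re.compile(
    r"(?P<url>https?://\S+)"
    r"|(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<word>\w+)"
    r"|(?P<punct>[^\w\s])"
)


def case_of(text):
    """Classify letter case as upper, lower, proper, mixed, or none (no letters)."""
    if not any(character.isalpha() for character in text):
        return "none"
    if text.islower():
        return "lower"
    if text.isupper():
        return "upper"
    if text.istitle():
        return "proper"
    return "mixed"


def tokenize(document):
    """Return one record per token with its kind, case, offset, length, and line."""
    tokens = []
    for line_number, line in enumerate(document.splitlines(), start=1):
        for match in TOKEN_PATTERN.finditer(line):
            tokens.append({
                "text": match.group(),
                "kind": match.lastgroup,      # name of the alternative that matched
                "case": case_of(match.group()),
                "offset": match.start(),      # character offset within the line
                "length": len(match.group()),
                "line": line_number,
            })
    return tokens


tokens = tokenize("The cow says MOO.\nMail bob@example.com or see http://example.org/x")
```

The sketch has limits: a URL followed directly by a full stop would swallow the punctuation, and splitting on word characters does not work for languages that do not separate words with whitespace.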

Language Recognition

If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language; many of the subsequent steps are language dependent (such as stemming and part of speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in natural language processing. Finding which language the words belong to may involve the use of a language recognition chart.
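
As a toy illustration only, a document's language can be guessed by counting how many of its words appear in a short list of very common words for each candidate language. The three word lists below are tiny hand-picked assumptions, and real language recognition relies on much richer statistics.

```python
# Hypothetical, deliberately tiny lists of common function words.
STOPWORDS = {
    "english": {"the", "and", "of", "is", "to", "in"},
    "german": {"der", "die", "und", "das", "ist", "nicht"},
    "french": {"le", "la", "et", "les", "des", "est"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    words = text.lower().split()
    scores = {
        language: sum(word in common for word in words)
        for language, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)   # ties go to the first language listed
    return best if scores[best] > 0 else None


guess = guess_language("the cow is in the field")
```

A guess made this way would then select the language-dependent steps that follow, such as stemming. Short or mixed-language documents can easily be misclassified by so simple a rule.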

Format Analysis

If the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain formatting information in addition to textual content. For example, HTML documents contain HTML tags, which specify formatting information such as new line starts, bold emphasis, and font size or style. If the search engine were to ignore the difference between content and 'markup', extraneous information would be included in the index, leading to poor search results. Format analysis is the identification and handling of the formatting content embedded within documents which controls the way the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, and text preparation. The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary with very little information disclosed, while others are well documented. Common, well-documented file formats that many search engines support include HTML, ASCII text files, Adobe's Portable Document Format (PDF), PostScript (PS), LaTeX, Usenet formats, XML and derivatives such as RSS, SGML, multimedia metadata formats such as ID3, Microsoft Word, Microsoft Excel, Microsoft PowerPoint, and Lotus Notes.

Options for dealing with various formats include using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format, and writing a custom parser.
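
As a sketch of the custom-parser option for one format, the following strips HTML markup with Python's standard `html.parser` module so that only text content reaches the tokenizer. It assumes reasonably well-formed HTML and treats the contents of `script` and `style` elements as non-content; the class and function names are illustrative.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content while ignoring tags, scripts, and style sheets."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0   # greater than zero while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_markup(html):
    """Return the text content of an HTML document as one space-separated string."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


page = (
    "<html><head><style>b{font-size:12px}</style></head>"
    "<body><p>The <b>cow</b> says<br>moo</p>"
    "<script>var x = 1;</script></body></html>"
)
text = strip_markup(page)
```

Joining the fragments with spaces is a simplification: a word that is split by inline markup would be indexed as separate tokens.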

Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document; this step may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:

Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:

Section Recognition

Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side-sections which do not contain primary material (that which the document is about). For example, this article displays a side menu with links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and search quality may be degraded due to the mixed content and improper word proximity. Two primary problems are noted:

Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. If the search engine does not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the noscript tag to ensure that the web page is indexed properly. At the same time, this fact can also be exploited to cause the search engine indexer to 'see' different content than the viewer.

Meta Tag Indexing

Specific documents often contain embedded meta information such as author, keywords, description, and language. For HTML pages, the meta tag contains keywords which are also included in the index. Earlier Internet search engine technology would only index the keywords in the meta tags for the forward index; the full document would not be parsed. At that time full-text indexing was not as well established, nor was the hardware able to support such technology. The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization.[27]
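The keyword-only approach described above can be sketched as follows. This is a minimal illustration, assuming keywords are supplied as a comma-separated content attribute on a meta element named "keywords"; the names MetaKeywordParser and build_forward_index are hypothetical and chosen for this example only. The document body is deliberately never tokenized.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects keywords from <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values may be None when an attribute has no value.
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )


def build_forward_index(pages):
    """Map each document id to its meta keywords.

    pages is a dict of document id -> HTML source (an assumed input shape).
    """
    index = {}
    for doc_id, html in pages.items():
        parser = MetaKeywordParser()
        parser.feed(html)
        parser.close()
        index[doc_id] = parser.keywords
    return index
```

Because the index contains only what the author declared, nothing in this scheme checks the keywords against the page content.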

As the Internet grew through the 1990s, many brick-and-mortar corporations went 'online' and established corporate websites. The keywords used to describe webpages (many of which were corporate-oriented webpages similar to product brochures) changed from descriptive to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for specific search queries. The fact that these keywords were subjectively specified led to spamdexing, which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict of interest with the business goal of designing user-oriented websites which were 'sticky', the customer lifetime value equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research into full-text indexing technologies.

In desktop search, many solutions incorporate meta tags to give authors a way to further customize how the search engine indexes content from various files when that information is not evident from the file content. Desktop search is more under the control of the user, while Internet search engines must focus more on the full-text index.

References

  1. ^ Clarke, C., Cormack, G.: Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System. TechRep MT-95-01, University of Waterloo, February 1995.
  2. ^ Stephen V. Rice, Stephen M. Bailey. Searching for Sounds. Comparisonics Corporation. May 2004. Verified Dec 2006
  3. ^ Charles E. Jacobs, Adam Finkelstein, David H. Salesin. Fast Multiresolution Image Querying. Department of Computer Science and Engineering, University of Washington. 1995. Verified Dec 2006
  4. ^ Lee, James. Software Learns to Tag Photos. MIT Technology Review. November 09, 2006. Pg 1-2. Verified Dec 2006. Commercial external link
  5. ^ Brown, E. W.: Execution Performance Issues in Full-Text Information Retrieval. Computer Science Department, University of Massachusetts at Amherst, Technical Report 95-81, October 1995.
  6. ^ Cutting, D., Pedersen, J.: Optimizations for dynamic inverted index maintenance. Proceedings of SIGIR, 405-411, 1990.
  7. ^ Linear Hash Partitioning. MySQL 5.1 Reference Manual. Verified Dec 2006
  8. ^ Gusfield, Dan [1997] (1999). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press. ISBN 0-521-58519-8.
  9. ^ trie, Dictionary of Algorithms and Data Structures, U.S. National Institute of Standards and Technology (http://www.nist.gov).
  10. ^ Black, Paul E., inverted index, Dictionary of Algorithms and Data Structures (http://www.nist.gov/dads), U.S. National Institute of Standards and Technology (http://www.nist.gov). Oct 2006. Verified Dec 2006.
  11. ^ C. C. Foster, Information retrieval: information storage and retrieval using AVL trees, Proceedings of the 1965 20th national conference, p. 192-205, August 24-26, 1965, Cleveland, Ohio, United States
  12. ^ Landauer, W. I.: The balanced tree and its utilization in information retrieval. IEEE Trans. on Electronic Computers, Vol. EC-12, No. 6, December 1963.
  13. ^ Google Ngram Datasets for sale at LDC Catalog
  14. ^ Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. OSDI. 2004.
  15. ^ Grossman, Frieder, Goharian. IR Basics of Inverted Index. 2002. Verified Dec 2006.
  16. ^ Tang, Hunqiang. Dwarkadas, Sandhya. "Hybrid Global Local Indexing for Efficient Peer to Peer Information Retrieval". University of Rochester. Pg 1. http://ftp.cs.rochester.edu/~sarrmor/publications/eSearch-NSDI04.pdf
  17. ^ [1] - inverted index written in Haskell
  18. ^ [2] - inverted index written in Lisp
  19. ^ Tomasic, A., et al.: Incremental Updates of Inverted Lists for Text Document Retrieval. Short Version of Stanford University Computer Science Technical Note STAN-CS-TN-93-1, December, 1993.
  20. ^ Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Stanford University. 1998. Verified Dec 2006.
  21. ^ H. S. Heaps. Storage analysis of a compression coding for a document database. INFOR, 10(1):47-61, February 1972.
  22. ^ Murray, Brian H. Sizing the Internet. Cyveillance, Inc. Pg 2. July 2000. Verified Dec 2006.
  23. ^ Blair Bancroft. Word Count:A Highly Personal-and Probably Controversial-Essay on Counting Words. Personal Website. Verified Dec 2006.
  24. ^ The Unicode Standard - Frequently Asked Questions. Verified Dec 2006.
  25. ^ Storage estimates. Verified Dec 2006.
  26. ^ Average Total Hard Drive Size by Global Region, February 2008. Verified May 2008.
  27. ^ Berners-Lee, T., "Hypertext Markup Language - 2.0", RFC 1866, Network Working Group, November 1995.
  28. ^ Krishna Nareddy. Indexing with Microsoft Index Server. MSDN Library. Microsoft Corporation. January 30, 1998. Verified Dec 2006. Note that this is a commercial, external link.

[Anatomy of a search engine http://infolab.stanford.edu/~backrub/google.html] [Google n-gram information retriever http://n-gram-patterns.sourceforge.net/]


© 2009 citizendia.org; parts available under the terms of GNU Free Documentation License, from http://en.wikipedia.org