
A hash function is any well-defined procedure or mathematical function for turning some kind of data into a relatively small integer that may serve as an index into an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.

Hash functions are mostly used to speed up table lookup or data comparison tasks --- such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.

Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomizing functions, error correcting codes, and cryptographic hash functions. Although these concepts overlap to some extent, each has its own uses and requirements. The HashKeeper database maintained by the National Drug Intelligence Center, for instance, is more aptly described as a catalog of file fingerprints than of hash values.

A typical hash function at work

Applications

Hash tables

Hash functions are mostly used in hash tables, to quickly locate a data record (for example, a dictionary definition) given its search key (the headword). Specifically, the hash function is used to map the search key to the index of a slot in the table where the corresponding record is supposedly stored.

In general, a hashing function may map several different keys to the same hash value. Therefore, each slot of a hash table contains (implicitly or explicitly) a set of records, rather than a single record. For this reason, each slot of a hash table is often called a bucket, and hash values are also called bucket indices.

Thus, the hash function only hints at the record's location --- it only tells where one should start looking for it. Still, in a half-full table, a good hash function will typically narrow the search down to only one or two entries.
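The bucket idea can be sketched in a few lines of Python. This is only an illustration: the class name BucketTable, the default of 8 slots, and the use of Python's built-in hash() are assumptions made for the example, not part of any particular library.

```python
class BucketTable:
    """Toy hash table: each slot holds a list (bucket) of (key, record) pairs."""

    def __init__(self, n_slots=8):
        self.slots = [[] for _ in range(n_slots)]

    def _index(self, key):
        # The hash value only says which bucket to start looking in.
        return hash(key) % len(self.slots)

    def put(self, key, record):
        bucket = self.slots[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, record)
                return
        bucket.append((key, record))

    def get(self, key):
        # Search only the one bucket the hash function points to.
        for k, record in self.slots[self._index(key)]:
            if k == key:
                return record
        raise KeyError(key)


table = BucketTable()
table.put("hash", "to chop and mix")
table.put("bucket", "a slot holding a set of records")
```

A lookup such as table.get("hash") scans a single bucket rather than the whole table, which is why short buckets matter.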

In the Java programming language, for example, hash functions are central to the way the language allows objects to be stored in hashing collections such as the standard HashMap and HashSet classes. To this end, Java's Object parent class provides a hashCode() method mandating the generation of a 32-bit integer hash value.

Finding duplicate records

To find duplicated records in a large unsorted file, one may use a hash function to map each file record to an index into a table T, and collect in each bucket T[i] a list of the numbers of all records with the same hash value i. Once the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then be found by scanning every list T[i] which contains two or more record numbers, fetching those records, and comparing them. With a table of appropriate size, this method is likely to be much faster than any alternative approach (such as sorting the file and comparing all consecutive pairs).
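A minimal sketch of this procedure, assuming the records fit in memory as a Python list and using the built-in hash() as the hash function (the function name find_duplicates and the default table size are hypothetical):

```python
from collections import defaultdict


def find_duplicates(records, n_buckets=1024):
    """Return groups of record numbers whose records are identical."""
    table = defaultdict(list)              # T[i] -> list of record numbers
    for number, record in enumerate(records):
        table[hash(record) % n_buckets].append(number)

    groups = []
    for numbers in table.values():
        if len(numbers) < 2:
            continue                       # a lone record cannot be a duplicate
        # Sharing a bucket does not imply equality, so compare the records.
        found = []                         # groups of equal records in this bucket
        for number in numbers:
            for group in found:
                if records[group[0]] == records[number]:
                    group.append(number)
                    break
            else:
                found.append([number])
        groups.extend(g for g in found if len(g) > 1)
    return sorted(groups)


records = ["alice", "bob", "carol", "bob", "alice", "dave"]
duplicates = find_duplicates(records)
```

Only records that share a bucket are ever compared with each other, so the number of comparisons stays small when the table is large enough.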

Finding similar records

Hash functions can also be used to locate table records whose key is similar, but not identical, to a given key; or pairs of records in a large file which have similar keys. For that purpose, one needs a hash function that maps similar keys to hash values that differ by at most m, where m is a small integer (say, 1 or 2). If one builds a table T of all record numbers, using such a hash function, then similar records will end up in the same bucket, or in nearby buckets. Then one need only check the records in each bucket T[i] against those in buckets T[i+k], where k ranges between -m and m.
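As an illustration, suppose the keys are numbers and two keys count as similar when they differ by at most some tolerance (both assumptions are made only for this sketch). Dividing each key by the tolerance and rounding down gives hash values that differ by at most 1 for similar keys, so m = 1:

```python
from collections import defaultdict


def similar_pairs(keys, tolerance):
    """Return index pairs (a, b), a < b, with |keys[a] - keys[b]| <= tolerance."""
    m = 1                                   # how many neighbouring buckets to inspect

    def h(key):
        # Keys within `tolerance` of each other get hash values at most 1 apart.
        return int(key // tolerance)

    table = defaultdict(list)               # T[i] -> record numbers
    for number, key in enumerate(keys):
        table[h(key)].append(number)

    pairs = set()
    for i, numbers in table.items():
        for k in range(-m, m + 1):          # buckets T[i-m] .. T[i+m]
            for a in numbers:
                for b in table.get(i + k, ()):
                    if a < b and abs(keys[a] - keys[b]) <= tolerance:
                        pairs.add((a, b))
    return sorted(pairs)


keys = [100, 103, 250, 107, 248]
pairs = similar_pairs(keys, tolerance=5)
```

Keys 103 and 107 fall into different buckets (20 and 21), which is why the neighbouring buckets have to be inspected as well.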

Finding similar substrings

The same techniques can be used to find equal or similar stretches in a large collection of strings, such as a document repository or a genomic database. In this case, the input strings are broken into many small pieces, and a hash function is used to detect potentially equal pieces, as above.

The Rabin-Karp algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.
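The following is a sketch of the Rabin-Karp idea with a polynomial rolling hash. The base and modulus are arbitrary choices made for the example, and every hash match is confirmed by a direct string comparison:

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_003):
    """Return the start indices of every occurrence of `pattern` in `text`."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    high = pow(base, m - 1, modulus)        # weight of the window's first character
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        w_hash = (w_hash * base + ord(text[i])) % modulus

    matches = []
    for start in range(n - m + 1):
        # Equal hashes only suggest a match; confirm by comparing the strings.
        if w_hash == p_hash and text[start:start + m] == pattern:
            matches.append(start)
        if start < n - m:
            # Roll the window: drop text[start], append text[start + m].
            w_hash = (w_hash - ord(text[start]) * high) % modulus
            w_hash = (w_hash * base + ord(text[start + m])) % modulus
    return matches
```

Updating the window hash takes constant time per position, so the full string comparison runs only where the hashes agree.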

Geometric hashing

This principle is widely used in computer graphics, computational geometry and many other disciplines, to locate close pairs of points in the plane or in three-dimensional space, similar shapes in a list of shapes, similar images in an image database, and so on. In these applications, the set of all inputs is some sort of metric space, and the hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array with two or more indices (called a bucket grid), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing or the grid method.
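A sketch of the grid method for points in the plane, assuming Euclidean distance and square cells whose side equals the search radius (the function name close_pairs is hypothetical):

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def close_pairs(points, radius):
    """Return index pairs (a, b), a < b, of 2-D points at most `radius` apart."""

    def cell(p):
        # The hash value is an index tuple: the grid cell containing the point.
        return (floor(p[0] / radius), floor(p[1] / radius))

    grid = defaultdict(list)
    for number, p in enumerate(points):
        grid[cell(p)].append(number)

    pairs = set()
    for (cx, cy), numbers in grid.items():
        # Points within `radius` must lie in the same or an adjacent cell.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for a in numbers:
                for b in grid.get((cx + dx, cy + dy), ()):
                    if a < b and dist(points[a], points[b]) <= radius:
                        pairs.add((a, b))
    return sorted(pairs)


points = [(0.0, 0.0), (0.5, 0.5), (3.0, 3.0), (3.2, 2.9), (10.0, 10.0)]
near = close_pairs(points, radius=1.0)
```

Each point is compared only against points in its own cell and the eight surrounding cells, instead of against every other point.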

Properties

Determinism

To serve its purpose, a hash function must be fast and deterministic --- meaning that two identical or equivalent inputs must generate the same hash value.

Uniformity

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of collisions --- pairs of inputs that are mapped to the same hash value --- increases. Basically, if some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.

Note that the hash values need only be uniformly distributed, not random in any sense. Thus, a good randomizing function is usually good for hashing, but the converse need not be true.

Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.

In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records. (Ideally, no bucket should have more than one record; but a small number of collisions is virtually inevitable, even if n is much larger than m; see the birthday paradox.)
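To see why collisions are hard to avoid, one can compute the probability that at least two of m records land in the same slot, under the idealized assumption that every record is hashed independently and uniformly over the n slots:

```python
def collision_probability(m, n):
    """Probability that m uniformly hashed records share at least one of n slots."""
    if m > n:
        return 1.0                      # more records than slots: collision is certain
    p_distinct = 1.0
    for i in range(m):
        # The (i+1)-th record must avoid the i slots already taken.
        p_distinct *= (n - i) / n
    return 1.0 - p_distinct
```

With m = 23 and n = 365, the classic birthday setting, the result is already slightly above one half, even though the table is almost sixteen times larger than the number of records.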

Variable range

In many applications, the range of hash values may be different for each run of the program, or may change along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function which takes two parameters --- the input data z, and the number n of allowed hash values.
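A small sketch of such a two-parameter function, assuming integer keys and using the remainder z mod n (the names bucket_index and resize are made up for the example):

```python
def bucket_index(z, n):
    """Hash the integer z into the range 0 .. n-1 for the current table size n."""
    return z % n


def resize(buckets, new_n):
    """Redistribute integer keys into a table with `new_n` buckets."""
    new_buckets = [[] for _ in range(new_n)]
    for bucket in buckets:
        for z in bucket:
            # The same key may land in a different bucket once n changes.
            new_buckets[bucket_index(z, new_n)].append(z)
    return new_buckets


small = [[] for _ in range(4)]
for z in [3, 7, 11, 12]:
    small[bucket_index(z, 4)].append(z)
big = resize(small, 8)
```

With 4 buckets the keys 3, 7 and 11 all collide; after the expansion to 8 buckets, key 7 moves to a bucket of its own.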

Data normalization

In some applications, the input data may contain features that are irrelevant for comparison purposes. When looking up a personal name, for instance, it may be desirable to ignore the distinction between upper and lower case letters. For such data, one must use a hash function that is compatible with the data equivalence criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value.
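One common way to meet this requirement is to reduce each input to a canonical form before hashing it. The sketch below assumes, purely for illustration, that names are equivalent when they differ only in letter case or in spacing:

```python
def normalize(name):
    """Map equivalent spellings of a name to one canonical form."""
    # Assumed equivalence: ignore case and leading, trailing, or repeated spaces.
    return " ".join(name.split()).casefold()


def name_hash(name, n):
    # Hash the canonical form, so equivalent names always share a hash value.
    return hash(normalize(name)) % n
```

Here name_hash("Ada Lovelace", 101) and name_hash("  ada   LOVELACE ", 101) return the same value within one run of the program, because both inputs normalize to "ada lovelace".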

Continuity

A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values.

Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash functions, and other related concepts. Continuity is undesirable for hash functions only in some applications, such as hash tables that use linear search.

Hash function algorithms

The choice of a hashing function depends strongly on the nature of the input data, and on their probability distribution in the intended application.

Injective and perfect hashing

The ideal hashing function should be injective --- that is, it should map each valid input to a different hash value. Such a function would directly locate the desired entry in a hash table, without any additional search.

An injective hash function whose range is all integers between 0 and n−1, where n is the number of valid inputs, is said to be perfect. Besides providing single-step lookup, a perfect hash function also results in a compact hash table, without any vacant slots.

Unfortunately, injective and perfect hash functions exist only in very few special situations (such as mapping month names to the integers 0..11); and even then they are often too complicated or expensive to be of practical use. Indeed, hash functions are typically required to map a large set of valid potential inputs to a much smaller range of hash values --- and therefore cannot be injective.

Hashing uniformly distributed data

If the inputs are bounded-length strings (such as telephone numbers, car license plates, invoice numbers, etc.), and each input may independently occur with uniform probability, then a hash function need only map roughly the same number of inputs to each hash value. For instance, suppose that each input is an integer z in the range 0 to N−1, and the output must be an integer h in the range 0 to n−1, where N is much larger than n. Then the hash function could be h = z mod n (the remainder of z divided by n), or h = (z × n) ÷ N (the value z scaled down by n/N and truncated to an integer) --- or many other formulas.
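The two formulas translate directly into code. The function names below are made up for the example; z, n and N have the meanings given in the text:

```python
def hash_mod(z, n):
    """h = z mod n: the remainder of z divided by n."""
    return z % n


def hash_scale(z, n, N):
    """h = (z * n) / N truncated: z scaled down by n/N."""
    return (z * n) // N
```

For N = 1000 and n = 10, either function sends exactly 100 of the 1000 possible inputs to each of the 10 hash values. They differ in which inputs share a value: hash_mod(123, 10) is 3, while hash_scale(123, 10, 1000) is 1.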

Hashing data with other distributions

These simple formulas will not do if the input values are not equally likely, or are not independent. For instance, most patrons of a supermarket will live in the same geographic area, so their telephone numbers are likely to begin with the same 3 to 4 digits. In that case, if n is 10000 or so, the division formula (z × n) ÷ N, which depends mainly on the leading digits, will generate a lot of collisions; whereas the remainder formula z mod n, which is quite sensitive to the trailing digits, may still yield a fairly even distribution of hash values.
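The effect is easy to reproduce with made-up data. The sketch below assumes seven-digit telephone numbers that all begin with 555; the specific numbers are invented for the example:

```python
N = 10**7          # seven-digit numbers, 0 .. N-1
n = 10_000         # table size

# Hypothetical customers: every number starts with 555, only the last digits vary.
numbers = [5_550_000 + 37 * i for i in range(200)]

by_division = {(z * n) // N for z in numbers}    # driven by the leading digits
by_remainder = {z % n for z in numbers}          # driven by the trailing digits
```

For this data the division formula spreads the 200 numbers over only 8 distinct hash values, while the remainder formula gives each number a hash value of its own.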

When the data values are long (or variable-length) character strings --- such as personal names, web page addresses, or mail messages --- their distribution is usually very uneven, with complicated dependencies. For example, text in any natural language has highly non-uniform distributions of characters, and character pairs, very characteristic of the language. For such data, it is prudent to use a hash function that depends on all characters of the string --- and depends on each character in a different way.
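One simple way to make every character count, each in a different way, is a polynomial hash in which a character's weight depends on its position. The sketch below contrasts it with a plain character sum; the base 31 is an arbitrary choice for the example:

```python
def string_hash(s, n, base=31):
    """Polynomial hash: every character contributes, weighted by its position."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % n
    return h


def sum_hash(s, n):
    """Weaker alternative: every character contributes in the same way."""
    return sum(ord(ch) for ch in s) % n
```

Because sum_hash ignores character order, anagrams such as "listen" and "silent" always collide. With n = 1009, string_hash gives these two strings different values.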

Special-purpose hash functions

In many such cases, one can design a special-purpose (heuristic) hash function that yields many fewer collisions than a good general-purpose hash function. For example, suppose that the input data are file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc., with mostly sequential numbers. For such data, a function that extracts the numeric part k of the file name and returns k mod n would be nearly optimal. Needless to say, a function that is exceptionally good for a specific kind of data may have dismal performance on data with a different distribution.
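A sketch of such a function for the file names in the example. It assumes the names follow the pattern FILE, digits, .CHK exactly, and rejects anything else:

```python
import re


def file_hash(name, n):
    """Special-purpose hash for names like FILE0042.CHK: numeric part mod n."""
    match = re.fullmatch(r"FILE(\d+)\.CHK", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return int(match.group(1)) % n


names = [f"FILE{k:04d}.CHK" for k in range(100)]
slots = [file_hash(name, 128) for name in names]
```

The 100 sequential names occupy 100 different slots of a 128-slot table, so there are no collisions at all.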

Hashing with checksum functions

One can obtain good general-purpose hash functions for string data by adapting certain checksum or fingerprinting algorithms. Some of those algorithms will map arbitrarily long string data z, with any typical real-world distribution --- no matter how non-uniform and dependent --- to a fixed-length bit string, with a fairly uniform distribution. This string can be interpreted as a binary integer k, and turned into a hash value by the formula h = k mod n.

This method will produce a fairly even distribution of hash values, as long as the hash range size n is small compared to the range of the checksum function. Bob Jenkins' LOOKUP3 algorithm[1] uses a 32-bit checksum. A 64-bit checksum should provide adequate hashing for tables of any feasible size.
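The formula h = k mod n can be tried with any checksum at hand. The sketch below uses CRC-32 from Python's standard zlib module simply because it is readily available; it is a stand-in for illustration, not the LOOKUP3 algorithm mentioned above:

```python
import zlib


def checksum_hash(data: bytes, n: int) -> int:
    """Interpret a 32-bit checksum as an integer k and return k mod n."""
    k = zlib.crc32(data)        # unsigned 32-bit value in Python 3
    return k % n
```

A call such as checksum_hash(b"some record", 1000) yields a bucket index in the range 0 to 999.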

Hashing with cryptographic hash functions

Some cryptographic hash functions, such as MD5, have even stronger uniformity guarantees than checksums or fingerprints, and thus can provide very good general-purpose hashing functions. However, the uniformity advantage may be too small to offset their much higher cost.
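The same reduction works for a cryptographic digest. This sketch uses MD5 from Python's standard hashlib module, assuming it is enabled in the local Python build; the function name md5_hash is made up for the example:

```python
import hashlib


def md5_hash(data: bytes, n: int) -> int:
    """Reduce the 128-bit MD5 digest of `data` to a hash value in 0 .. n-1."""
    digest = hashlib.md5(data).digest()          # 16 bytes = 128 bits
    k = int.from_bytes(digest, "big")
    return k % n
```

Computing the digest costs far more than a simple remainder or polynomial hash, which is the trade-off mentioned above.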

Audio identification

Main article: Acoustic fingerprint

For audio identification [2], such as finding out whether an MP3 file matches one of a list of known items, one could use a conventional hash function such as MD5, but this would be very sensitive to highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or implementations, or changes in volume. Using something like MD5 is useful as a first pass to find exactly identical files, but a more advanced algorithm is required to find all items that would nonetheless be interpreted as identical by a human listener. Though they are not common, hashing algorithms do exist that are robust to these minor differences. There is a service called MusicBrainz which creates a fingerprint for an audio file and matches it against its online, community-driven database.

Origins of the term

The term "hash" comes by way of analogy with its standard meaning in the physical world, to "chop and mix". Donald Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953, and that Robert Morris used the term in a survey paper in CACM which elevated the term from technical jargon to formal terminology. [3]

In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are then "mixed" with one another using carefully chosen mathematical functions. The range ("hash value") is made to be a definite size, 160 bits (which may be either smaller or larger than the domain), through the use of modular division.

See also

Universal hashing, cryptography, cryptographic hash function, HMAC, geometric hashing, distributed hash table, perfect hash function, linear hashing, rolling hash, Rabin-Karp algorithm, Zobrist hashing, Bloom filter, hash table, hash list, hash tree, coalesced hashing, transposition table, list of hash functions.

Notes

  1. ^  In the remainder of this article, the term function is used to refer to algorithms as well as the functions they compute.

References

  1. ^ Jenkins, Bob (September 1997), Hash Functions, “Algorithm Alley”, Dr. Dobb's Journal, <http://www.ddj.com/184410284>
  2. ^ Haitsma, Jaap; Kalker, Ton; Oostveen, Job, Robust Audio Hashing for Content Identification
  3. ^ Knuth, Donald (1973), The Art of Computer Programming, volume 3, Sorting and Searching, pp. 506-542

Dictionary

hash function

-noun

  1. (computing) an algorithm that generates a numeric, or fixed-size character output from a variable-sized piece of text or other data; used in database table queries, cryptography and in error-checking
© 2009 citizendia.org; parts available under the terms of GNU Free Documentation License, from http://en.wikipedia.org