Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, is the process of marking up each word in a text as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags.
Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus, developed at Brown University by Henry Kucera and Nelson Francis in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).
The Brown Corpus was painstakingly "tagged" with part-of-speech markers over many years. A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. For example, an article followed by a noun can occur, but an article followed by a verb (arguably) cannot. The program got about 70% of the tags correct. Its results were repeatedly reviewed and corrected by hand, and later users sent in errata, so that by the late 1970s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree).
This corpus has been used for innumerable studies of word frequency and of part of speech, and inspired the development of similar "tagged" corpora in many other languages. Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS and VOLSUNGA. By 2005, however, it had been superseded by larger corpora such as the 100-million-word British National Corpus.
For some time, part-of-speech tagging was considered an inseparable part of natural language processing, because there are certain cases where the correct part of speech cannot be decided without understanding the semantics or even the pragmatics of the context. Resolving tags through such higher-level analysis is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.
In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus), and making a table of the probabilities of certain sequences. For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can of course be used to benefit from knowledge about following words.
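As a rough illustration of the counting described above, the following Python sketch estimates tag-pair probabilities from a tiny hand-tagged corpus and uses them to choose a tag for "can" after an article. The four sentences, the tag names, and the helper functions are invented for this example; they are not taken from the Brown Corpus or from any particular tagger.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus (hypothetical data, for illustration only).
TAGGED_SENTENCES = [
    [("the", "ART"), ("can", "NOUN"), ("fell", "VERB")],
    [("the", "ART"), ("old", "ADJ"), ("dog", "NOUN"), ("can", "MODAL"), ("run", "VERB")],
    [("a", "ART"), ("dog", "NOUN"), ("can", "MODAL"), ("bark", "VERB")],
    [("the", "ART"), ("two", "NUM"), ("dogs", "NOUN"), ("ran", "VERB")],
]


def transition_probabilities(sentences):
    """Estimate P(next_tag | previous_tag) by counting adjacent tag pairs."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        for prev_tag, next_tag in zip(tags, tags[1:]):
            pair_counts[prev_tag][next_tag] += 1
    return {
        prev_tag: {tag: n / sum(counts.values()) for tag, n in counts.items()}
        for prev_tag, counts in pair_counts.items()
    }


def emission_probabilities(sentences):
    """Estimate P(word | tag) by counting how often each tag labels each word."""
    word_counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            word_counts[tag][word] += 1
    return {
        tag: {word: n / sum(counts.values()) for word, n in counts.items()}
        for tag, counts in word_counts.items()
    }


TRANSITIONS = transition_probabilities(TAGGED_SENTENCES)
EMISSIONS = emission_probabilities(TAGGED_SENTENCES)


def best_tag_after(prev_tag, word):
    """Pick the tag maximizing P(tag | prev_tag) * P(word | tag)."""
    scores = {
        tag: TRANSITIONS.get(prev_tag, {}).get(tag, 0.0) * words[word]
        for tag, words in EMISSIONS.items()
        if word in words
    }
    return max(scores, key=scores.get)


print(TRANSITIONS["ART"])           # noun 0.5, adjective 0.25, number 0.25
print(best_tag_after("ART", "can"))  # prints NOUN
```

In this toy data "can" is more often a modal overall, but no article is ever followed by a modal, so the pair table tips the decision toward the noun reading.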
More advanced ("higher order") HMMs learn the probabilities not only of pairs, but of triples or even larger sequences. So, for example, if you've just seen an article and a verb, the next item may be very likely a preposition, article, or noun, but much less likely another verb.
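A minimal sketch of the "triples" idea, assuming a handful of made-up tag sequences: instead of conditioning on one preceding tag, the table is keyed by the two preceding tags. The sequences and the function name are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical tag sequences, one list per sentence.
TAG_SEQUENCES = [
    ["ART", "NOUN", "VERB", "ART", "NOUN"],
    ["ART", "NOUN", "VERB", "PREP", "ART", "NOUN"],
    ["ART", "ADJ", "NOUN", "VERB", "ART", "NOUN"],
]


def trigram_probabilities(sequences):
    """Estimate P(tag3 | tag1, tag2) by counting consecutive tag triples."""
    triple_counts = defaultdict(Counter)
    for tags in sequences:
        for t1, t2, t3 in zip(tags, tags[1:], tags[2:]):
            triple_counts[(t1, t2)][t3] += 1
    return {
        context: {tag: n / sum(counts.values()) for tag, n in counts.items()}
        for context, counts in triple_counts.items()
    }


TRIGRAMS = trigram_probabilities(TAG_SEQUENCES)
# After a noun and a verb, only an article or a preposition was ever observed.
print(TRIGRAMS[("NOUN", "VERB")])
```

With so little data most triples are never observed at all, which is the sparsity problem that makes larger corpora or estimation tricks necessary for higher-order models.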
When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. The combination with highest probability is then chosen. The European group developed CLAWS, a tagging program that did exactly this, and achieved accuracy in the 93-95% range.
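The enumeration strategy can be sketched in a few lines of Python. This is not the CLAWS implementation; the probability tables below are invented numbers, `LEXICAL` and `TRANSITION` are hypothetical names, and unseen tag pairs are given a small arbitrary weight (`UNSEEN`).

```python
from itertools import product

# Hypothetical weights for each word's possible tags.
LEXICAL = {
    "the": {"ART": 1.0},
    "can": {"NOUN": 0.2, "MODAL": 0.7, "VERB": 0.1},
    "rust": {"NOUN": 0.6, "VERB": 0.4},
}

# Hypothetical probabilities of one tag following another.
TRANSITION = {
    ("START", "ART"): 0.8,
    ("ART", "NOUN"): 0.9, ("ART", "MODAL"): 0.01, ("ART", "VERB"): 0.01,
    ("NOUN", "MODAL"): 0.4, ("NOUN", "VERB"): 0.4, ("NOUN", "NOUN"): 0.2,
    ("MODAL", "VERB"): 0.9, ("MODAL", "NOUN"): 0.05,
    ("VERB", "NOUN"): 0.5, ("VERB", "VERB"): 0.05,
}
UNSEEN = 0.001  # assumed weight for tag pairs missing from the table


def enumerate_taggings(words):
    """Score every possible tag sequence and return them best-first."""
    options = [list(LEXICAL[word]) for word in words]
    scored = []
    for tags in product(*options):
        prob = 1.0
        prev = "START"
        for word, tag in zip(words, tags):
            prob *= TRANSITION.get((prev, tag), UNSEEN) * LEXICAL[word][tag]
            prev = tag
        scored.append((prob, tags))
    scored.sort(reverse=True)
    return scored


ranked = enumerate_taggings(["the", "can", "can", "rust"])
print(len(ranked))   # 1 * 3 * 3 * 2 = 18 candidate taggings
print(ranked[0][1])  # ('ART', 'NOUN', 'MODAL', 'VERB')
```

The number of candidates is the product of the number of tags per word, so it grows exponentially with the length of an ambiguous run.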
It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural language parsing [1], that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns will approach 90% accuracy, because many words are unambiguous.
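That baseline is simple enough to sketch directly. The training pairs, the tag names, and the `PROPN` label for unknown words are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical (word, tag) training pairs.
TRAINING = [
    ("the", "ART"), ("can", "MODAL"), ("can", "MODAL"),
    ("can", "NOUN"), ("dog", "NOUN"), ("runs", "VERB"),
]


def train_baseline(tagged_words):
    """Map each known word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(words, model, unknown_tag="PROPN"):
    """Tag known words with their most common tag, unknowns as proper nouns."""
    return [(word, model.get(word.lower(), unknown_tag)) for word in words]


MODEL = train_baseline(TRAINING)
print(tag_baseline(["The", "can", "Fido"], MODEL))
# [('The', 'ART'), ('can', 'MODAL'), ('Fido', 'PROPN')]
```

Note that this baseline ignores context entirely, so it would tag "can" as a modal even in "the can".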
CLAWS pioneered the field of HMM-based part-of-speech tagging, but was quite expensive since it enumerated all possibilities. It sometimes had to resort to backup methods when there were simply too many (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech).
In 1987, Steve DeRose and Ken Church independently developed dynamic programming algorithms to solve the same problem in vastly less time. Their methods were similar to the Viterbi algorithm known for some time in other fields. DeRose used a table of pairs, while Church used a table of triples and an ingenious method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (actual measurement of triple probabilities would require a much larger corpus). Both methods achieved accuracy over 95%. DeRose's 1990 dissertation at Brown University included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.
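To show why dynamic programming helps, here is a minimal Viterbi-style sketch over a table of tag pairs. It is a generic textbook formulation, not a reconstruction of DeRose's or Church's programs, and the probability tables are invented numbers with hypothetical names.

```python
# Hypothetical weights for each word's possible tags.
LEXICAL = {
    "the": {"ART": 1.0},
    "can": {"NOUN": 0.2, "MODAL": 0.7, "VERB": 0.1},
    "rust": {"NOUN": 0.6, "VERB": 0.4},
}

# Hypothetical probabilities of one tag following another.
TRANSITION = {
    ("START", "ART"): 0.8,
    ("ART", "NOUN"): 0.9, ("ART", "MODAL"): 0.01, ("ART", "VERB"): 0.01,
    ("NOUN", "MODAL"): 0.4, ("NOUN", "VERB"): 0.4, ("NOUN", "NOUN"): 0.2,
    ("MODAL", "VERB"): 0.9, ("MODAL", "NOUN"): 0.05,
    ("VERB", "NOUN"): 0.5, ("VERB", "VERB"): 0.05,
}
UNSEEN = 0.001  # assumed weight for tag pairs missing from the table


def viterbi(words):
    """Return (probability, tags) for the most likely tag sequence."""
    # best[tag] holds the best-scoring path that ends in `tag` so far.
    best = {"START": (1.0, ())}
    for word in words:
        new_best = {}
        for tag, lexical_weight in LEXICAL[word].items():
            # Keep only the single best way of reaching `tag` at this word.
            new_best[tag] = max(
                (prob * TRANSITION.get((prev, tag), UNSEEN) * lexical_weight,
                 path + (tag,))
                for prev, (prob, path) in best.items()
            )
        best = new_best
    return max(best.values())


probability, tags = viterbi(["the", "can", "can", "rust"])
print(tags)  # ('ART', 'NOUN', 'MODAL', 'VERB')
```

Because only the best path into each tag is kept at every word, the work grows linearly with sentence length (times the square of the number of tags per word) rather than exponentially, while still finding the same maximum that full enumeration would.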
These findings were surprisingly disruptive to the field of natural language processing. The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part-of-speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. CLAWS, DeRose's and Church's methods did fail for some of the known cases where semantics is required, but those proved negligibly rare. This convinced many in the field that part-of-speech tagging could usefully be separated out from the other levels of processing; this in turn simplified the theory and practice of computerized language analysis, and encouraged researchers to find ways to separate out other pieces as well. Markov models are now the standard method for part-of-speech assignment.
The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. It is, however, also possible to bootstrap using "unsupervised" tagging. Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word use, and derive part-of-speech categories themselves. For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones. With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights.
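The observation about similar contexts can be made concrete with a small sketch. It covers only the first step (measuring how alike two words' contexts are), not a full tagset-induction procedure; the four untagged sentences, the neighbour-count features, and the use of cosine similarity are all choices made for this illustration.

```python
import math
from collections import Counter, defaultdict

# Tiny untagged corpus (hypothetical sentences).
SENTENCES = [
    "the dogs eat the food",
    "a dog will eat a bone",
    "the cats eat a fish",
    "an owl will eat the mouse",
]


def context_vectors(sentences):
    """Count each word's immediate left and right neighbours."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            vectors[tokens[i]][("L", tokens[i - 1])] += 1
            vectors[tokens[i]][("R", tokens[i + 1])] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[feature] for feature, count in u.items() if feature in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


VECTORS = context_vectors(SENTENCES)
print(cosine(VECTORS["the"], VECTORS["a"]))    # clearly above zero
print(cosine(VECTORS["the"], VECTORS["eat"]))  # 0.0: no shared contexts
```

Grouping words whose context vectors are close (for instance with a clustering algorithm) would be the next step toward inducing categories; that step is omitted here.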
These two categories (taggers trained on a pre-tagged corpus and unsupervised taggers) can be further subdivided into rule-based, stochastic, and neural approaches. Some current major algorithms for part-of-speech tagging include the Viterbi algorithm, the Brill tagger, and the Baum-Welch algorithm (also known as the forward-backward algorithm). Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm.