In computer science and linguistics, parsing (more formally: syntactic analysis) is the process of analyzing a sequence of tokens to determine grammatical structure with respect to a given (more or less) formal grammar. Computer science (or computing science) is the study and the Science of the theoretical foundations of Information and Computation and their Linguistics is the scientific study of Language, encompassing a number of sub-fields In Computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens In Formal semantics, Computer science and Linguistics, a formal grammar (also called formation rules) is a precise description of a Formal A parser is thus one of the components in an interpreter or compiler, where it captures the implied hierarchy of the input text and transforms it into a form suitable for further processing (often some kind of parse tree, abstract syntax tree or other hierarchical structure) and normally checks for syntax errors at the same time. A compiler is a Computer program (or set of programs that translates text written in a computer language (the source language) into another A parse tree or concrete syntax tree is an (ordered rooted tree that represents the syntactic structure of a string according to some In Computer science, an abstract syntax tree (AST or just syntax tree, is a tree representation of the Syntax of some Source code The parser often uses a separate lexical analyser to create tokens from the sequence of input characters. In Computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens Parsers may be programmed by hand or may be semi-automatically generated (in some programming language) by a tool (such as Yacc) from a grammar written in Backus-Naur form. The Computer program yacc is a Parser generator developed by Stephen C In Computer science, Backus–Naur Form ( BNF) is a Metasyntax used to express Context-free grammars that is a formal way to describe Formal
Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of inflected languages, such as the Romance languages or Latin. In Grammar, inflection or inflexion is the way language handles grammatical relations and relational categories such as tense, mood, voice The Romance languages (sometimes referred to as Romanic languages, or Neolatin languages) are a branch of the Indo-European language family comprising all Latin ( lingua Latīna, laˈtiːna is an Italic language, historically spoken in Latium and Ancient Rome.
Parsers can also be constructed as executable specifications of grammars in functional programming languages. Frost, Hafiz and Callaghan [1] have built on the work of others to construct a set of higher-order functions (called parser combinators) which allow polynomial time and space complexity top-down parser to be constructed as executable specifications of ambiguous grammars containing left-recursive productions. In Mathematics and Computer science, higher-order functions or '''functionals''' are functions which do at least one of the following In mathematics and functional programming Higher Order functions (HOF are defined as the functions that can take functions as their input and can also produce functions as their output The X-SAIGA site has more about the algorithms and implementation details.
Contents |
In some machine translation and natural language processing systems, human languages are parsed by computer programs. Machine translation, sometimes referred to by the abbreviation Natural language processing ( NLP) is a subfield of Artificial intelligence and Computational linguistics. Human sentences are not easily parsed by programs, as there is substantial ambiguity in the structure of human language. Syntactic ambiguity is a property of sentences which may be reasonably interpreted in more than one way or reasonably interpreted to mean more than one thing In order to parse natural language data, researchers must first agree on the grammar to be used. Grammar is the field of Linguistics that covers the Rules governing the use of any given natural language. The choice of syntax is affected by both linguistic and computational concerns; for instance some parsing systems use lexical functional grammar, but in general, parsing for grammars of this type is known to be NP-complete. Lexical functional grammar (LFG is a Grammar framework in Theoretical linguistics, a variety of Generative grammar. In Computational complexity theory, the Complexity class NP-complete (abbreviated NP-C or NPC) is a class of problems having two properties Head-driven phrase structure grammar is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn Treebank. Head-driven phrase structure grammar (HPSG is a highly lexicalized non-derivational Generative grammar theory developed by Carl Pollard and Ivan Sag A treebank or parsed corpus is a Text corpus in which each sentence has been parsed i Shallow parsing aims to find only the boundaries of major constituents such as noun phrases. Shallow Parsing (also chunking, "light parsing" is an analysis of a sentence which identifies the constituents ( Noun groups Another popular strategy for avoiding linguistic controversy is dependency grammar parsing. Dependency grammar (DG is a class of syntactic theories developed by Lucien Tesnière.
Most modern parsers are at least partly statistical; that is, they rely on a corpus of training data which has already been annotated (parsed by hand). Statistics is a mathematical science pertaining to the collection analysis interpretation or explanation and presentation of Data. This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts. (See machine learning. Machine learning is a subfield of Artificial intelligence that is concerned with the design and development of Algorithms and techniques that allow computers to "learn" ) Approaches which have been used include straightforward PCFGs (probabilistic context free grammars), maximum entropy, and neural nets. A stochastic context-free grammar ( SCFG; also probabilistic context-free grammar, PCFG) is a Context-free grammar in which each production is The principle of maximum entropy is a postulate about a universal feature of any Probability assignment on a given set of Propositions ( Events hypotheses Traditionally the term neural network had been used to refer to a network or circuit of biological neurons. Most of the more successful systems use lexical statistics (that is, they consider the identities of the words involved, as well as their part of speech). In Grammar, a lexical category (also word class, lexical class, or in traditional grammar part of speech) is a linguistic category of words (or However such systems are vulnerable to overfitting and require some kind of smoothing to be effective. For the machine learning concept see Overfitting (machine learning In Statistics, overfitting is fitting a Statistical model
Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties as with manually-designed grammars for programming languages. As mentioned earlier some grammar formalisms are very computationally difficult to parse; in general, even if the desired structure is not context-free, some kind of context-free approximation to the grammar is used to perform a first pass. In Formal language theory, a context-free language is a language generated by some Context-free grammar. Algorithms which use context-free grammars often rely on some variant of the CKY algorithm, usually with some heuristic to prune away unlikely analyses to save time. The Cocke-Younger-Kasami (CYK algorithm (alternatively called CKY determines whether a string can be generated by a given Context-free grammar and if so how it In Computer science, a heuristic algorithm or simply a Heuristic is an Algorithm that ignores whether the solution to the problem can be proven (See chart parsing. A chart parser is a type of parser suitable for ambiguous grammars (including grammars of Natural languages. ) However some systems trade speed for accuracy using, eg, linear-time versions of the shift-reduce algorithm. Bottom-up parsing (also known as shift-reduce parsing) is a strategy for analyzing unknown data relationships that attempts to identify the most fundamental units first and A somewhat recent development has been parse reranking in which the parser proposes some large number of analyses, and a more complex system selects the best option. It is normally branching of one part and its subparts
The most common use of a parser is as a component of a compiler or interpreter. A compiler is a Computer program (or set of programs that translates text written in a computer language (the source language) into another This parses the source code of a computer programming language to create some form of internal representation. In Computer science, source code (commonly just source or code) is any sequence of statements or declarations written in some Human-readable A programming language is an Artificial language that can be used to write programs which control the behavior of a machine particularly a Computer. Programming languages tend to be specified in terms of a context-free grammar because fast and efficient parsers can be written for them. In Formal language theory, a context-free grammar ( CFG) is a grammar in which every production rule is of the form V &rarr Parsers are written by hand or generated by parser generators. A compiler-compiler or compiler generator is a tool that creates a parser, interpreter, or Compiler from some form of formal description
Context-free grammars are limited in the extent to which they can express all of the requirements of a language. Informally, the reason is that the memory of such a language is limited. The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out.

The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.
The first stage is the token generation, or lexical analysis, by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. In Computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens In Computing, regular expressions provide a concise and flexible means for identifying strings of text of interest such as particular characters words or patterns of characters For example, a calculator program would look at an input such as "12*(3+4)^2" and split it into the tokens 1,2, *, (, 3, +, 4, ), ^, and 2, each of which is a meaningful symbol in the context of an arithmetic expression. The parser would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated.
The next stage is parsing or syntactic analysis, which is checking that the tokens form an allowable expression. This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear. In Formal language theory, a context-free grammar ( CFG) is a grammar in which every production rule is of the form V &rarr However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with attribute grammars. An Attribute grammar is a formal way to define attributes for the productions of a Formal grammar, associating these attributes to values
The final phase is semantic parsing or analysis, which is working out the implications of the expression just validated and taking the appropriate action. A compiler is a Computer program (or set of programs that translates text written in a computer language (the source language) into another In the case of a calculator or interpreter, the action is to evaluate the expression or program; a compiler, on the other hand, would generate some kind of code. Attribute grammars can also be used to define these actions.
The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways:
Another important distinction is whether the parser generates a leftmost derivation or a rightmost derivation (see context-free grammar). In Formal language theory, a context-free grammar ( CFG) is a grammar in which every production rule is of the form V &rarr LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation (although usually in reverse).
Some of the parsers that use top-down parsing include:
Some of the parsers that use bottom-up parsing include:
See also the comparison of parser generators. In Computer science generating strings is one of the names given to the process of creating a set of strings from some collection of rules A chart parser is a type of parser suitable for ambiguous grammars (including grammars of Natural languages. A compiler-compiler or compiler generator is a tool that creates a parser, interpreter, or Compiler from some form of formal description In Natural language processing, deterministic parsing refers to Parsing Algorithms that are deterministic in the sense that given the input the Parser In Computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens Shallow Parsing (also chunking, "light parsing" is an analysis of a sentence which identifies the constituents ( Noun groups This is a list of notable parsing systems Chart Key The following key describes the columns used in the comparison chart below