Citizendia

A diagram of the operation of a typical multi-language, multi-target compiler.
A diagram of the operation of a typical multi-language, multi-target compiler.

A compiler is a computer program (or set of programs) that translates text written in a computer language (the source language) into another computer language (the target language). Computer programs (also software programs, or just programs) are instructions for a Computer. A programming language is an Artificial language that can be used to write programs which control the behavior of a machine particularly a Computer. The original sequence is usually called the source code and the output called object code. In Computer science, source code (commonly just source or code) is any sequence of statements or declarations written in some Human-readable In Computer science, object code, or an object file, is the representation of code that a Compiler or Assembler generates by processing Commonly the output has a form suitable for processing by other programs (e. g. , a linker), but it may be a human-readable text file. A text file (sometimes spelled "textfile" is a kind of Computer file that is structured as a sequence of lines.

The most common reason for wanting to translate source code is to create an executable program. In Computing, an executable (file causes a computer "to perform indicated tasks according to encoded instructions," as opposed to a file that only contains The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a lower level language (e. In computing a high-level programming language is a Programming language with strong abstraction from the details of the computer g. , assembly language or machine language). See the terminology section below for information regarding inconsistent use of the terms assembly and assembler Machine code or machine language is a system of instructions and data executed directly by a Computer 's Central processing unit. A program that translates from a low level language to a higher level one is a decompiler. A decompiler is the name given to a Computer program that performs the reverse operation to that of a Compiler. A program that translates between high-level languages is usually called a language translator, source to source translator, or language converter. A language rewriter is usually a program that translates the form of expressions without a change of language. In Mathematics, Computer science and Logic, rewriting covers a wide range of potentially non-deterministic methods of replacing subterms of a

A compiler is likely to perform many or all of the following operations: lexical analysis, preprocessing, parsing, semantic analysis, code generation, and code optimization. In Computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens In Computer science, a preprocessor is a program that processes its input data to produce output that is used as input to another program In Computer science and Linguistics, parsing, or more formally syntactic analysis, is the process of analyzing a sequence of tokens to In Computer science, code generation is the process by which a Compiler 's code generator converts some internal representation of Source code In Computing, optimization is the process of modifying a system to make some aspect of it work more efficiently or use fewer resources

Contents

History

Software for early computers was exclusively written in assembly language for many years. Higher level programming languages were not invented until the benefits of being able to reuse software on different kinds of CPUs started to become significantly greater than the cost of writing a compiler. The very limited memory capacity of early computers also created many technical problems when implementing a compiler. Computer data storage, often called storage or memory, refers to Computer components devices and recording media that retain digital

Towards the end of the 1950s, machine-independent programming languages were first proposed. Subsequently, several experimental compilers were developed. The first compiler was written by Grace Hopper, in 1952, for the A-0 programming language. Rear Admiral Grace Murray Hopper ( December 9 1906 – January 1 1992) was an American Computer scientist and United The A-0 system, written by Grace Hopper in 1951 and 1952 for the UNIVAC I, was the first Compiler ever developed for an electronic computer The FORTRAN team led by John Backus at IBM is generally credited as having introduced the first complete compiler, in 1957. Fortran (previously FORTRAN) is a general-purpose, procedural, imperative Programming language that is especially suited to John Warner Backus ( December 3, 1924 – March 17, 2007) was an American Computer scientist. International Business Machines Corporation abbreviated IBM and nicknamed "Big Blue", is a multinational Computer Technology COBOL was an early language to be compiled on multiple architectures, in 1960. COBOL (ˈkoʊbɒl is one of the oldest programming languages still in active use [1]

In many application domains the idea of using a higher level language quickly caught on. Because of the expanding functionality supported by newer programming languages and the increasing complexity of computer architectures, compilers have become more and more complex. A programming language is an Artificial language that can be used to write programs which control the behavior of a machine particularly a Computer.

Early compilers were written in assembly language. The first self-hosting compiler — capable of compiling its own source code in a high-level language — was created for Lisp by Hart and Levin at MIT in 1962. The term Self-hosting was coined to refer to the use of a Computer program as part of the Toolchain or Operating system that produces new versions of that Lisp (or LISP) is a family of Computer Programming languages with a long history and a distinctive fully parenthesized syntax [2] Since the 1970s it has become common practice to implement a compiler in the language it compiles, although both Pascal and C have been popular choices for implementation language. Pascal is an influential imperative and procedural Programming language, designed in 1968/9 and published in 1970 by Niklaus Wirth as a small tags please moot on the talk page first! --> In Computing, C is a general-purpose cross-platform block structured Building a self-hosting compiler is a bootstrapping problem -- the first such compiler for a language must be compiled either by a compiler written in a different language, or (as in Hart and Levin's Lisp compiler) compiled by running the compiler in an interpreter. Bootstrapping is a term used in Computer science to describe the techniques involved in writing a Compiler (or assembler) in the target Programming In Computer science, an interpreter normally means a Computer program that executes, i

Compilers in education

Compiler construction and compiler optimization are taught at universities as part of the computer science curriculum. Compiler optimization is the process of tuning the output of a Compiler to minimize or maximize some attribute of an Executable computer program Computer science (or computing science) is the study and the Science of the theoretical foundations of Information and Computation and their Such courses are usually supplemented with the implementation of a compiler for an educational programming language. An educational programming language is a Programming language that is designed primarily as a Learning instrument and not so much as a tool for writing real-world A well-documented example is Niklaus Wirth's PL/0 compiler, which Wirth used to teach compiler construction in the 1970s. Niklaus Emil Wirth (b February 15, 1934) is a Swiss computer scientist, best known for designing several Programming languages including There are at least two programming languages known as PL/0. One is a subset of IBM's general purpose Programming language PL/I. [3] In spite of its simplicity, the PL/0 compiler introduced several influential concepts to the field:

  1. Program development by stepwise refinement (also the title of a 1971 paper by Wirth[4])
  2. The use of a recursive descent parser
  3. The use of EBNF to specify the syntax of a language
  4. A code generator producing portable P-code
  5. The use of T-diagrams in the formal description of the bootstrapping problem

Compiler output

One method used to classify compilers is by the platform on which the generated code they produce executes. A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures (or a non-recursive equivalent where each such Procedure In Computer science, Extended Backus–Naur Form (EBNF is a Metasyntax notation used to express Context-free grammars that is a formal way to describe In Computer science, code generation is the process by which a Compiler 's code generator converts some internal representation of Source code Bootstrapping is a term used in Computer science to describe the techniques involved in writing a Compiler (or assembler) in the target Programming In Computing, a platform describes some sort of Hardware architecture or Software framework (including Application frameworks, that allows This is known as the target platform.

A native or hosted compiler is one whose output is intended to directly run on the same type of computer and operating system as the compiler itself runs on. The output of a cross compiler is designed to run on a different platform. A cross compiler is a Compiler capable of creating Executable code for a platform other than the one on which the compiler is run Cross compilers are often used when developing software for embedded systems that are not intended to support a software development environment. An embedded system is a special-purpose Computer system designed to perform one or a few dedicated functions often with Real-time computing constraints

The output of a compiler that produces code for a virtual machine (VM) may or may not be executed on the same platform as the compiler that produced it. In Computer science, a virtual machine (VM is a Software implementation of a machine (computer that executes programs like a real machine For this reason such compilers are not usually classified as native or cross compilers.

Compiled versus interpreted languages

Higher-level programming languages are generally divided for convenience into compiled languages and interpreted languages. A compiled language is a Programming language whose implementations are typically Compilers (translators which generate Machine code from In Computer programming an interpreted language is a Programming language whose implementation often takes the form of an interpreter. However, there is rarely anything about a language that requires it to be exclusively compiled, or exclusively interpreted. The categorization usually reflects the most popular or widespread implementations of a language — for instance, BASIC is thought of as an interpreted language, and C a compiled one, despite the existence of BASIC compilers and C interpreters.

In a sense, all languages are interpreted, with "execution" being merely a special case of interpretation performed by transistors switching on a CPU. In Electronics, a transistor is a Semiconductor device commonly used to amplify or switch electronic signals Modern trends toward just-in-time compilation and bytecode interpretation also blur the traditional categorizations. In Computing, just-in-time compilation ( JIT) also known as dynamic translation, is a technique for improving the runtime performance of a Computer Bytecode is a term which has been used to denote various forms of Instruction sets designed for efficient execution by a software interpreter as well as being suitable

There are exceptions. Some language specifications spell out that implementations must include a compilation facility; for example, Common Lisp. Common Lisp, commonly abbreviated CL, is a dialect of the Lisp Programming language, published in ANSI standard document Information Other languages have features that are very easy to implement in an interpreter, but make writing a compiler much harder; for example, APL, SNOBOL4, and many scripting languages allow programs to construct arbitrary source code at runtime with regular string operations, and then execute that code by passing it to a special evaluation function. SNOBOL ( String Oriented Symbolic Language) is a Computer Programming language developed between 1962 and 1967 at AT&T Bell Laboratories To implement these features in a compiled language, programs must usually be shipped with a runtime library that includes a version of the compiler itself. In Computer programming, a runtime library is a special Program library used by a Compiler, to implement functions built into a Programming language

Hardware compilation

The output of some compilers may target hardware at a very low level. Hardware is a general term that refers to the physical artifacts of a Technology. For example a Field Programmable Gate Array (FPGA) or structured Application-specific integrated circuit (ASIC). FPGAs should not be confused with the Flip-chip pin grid array, a form of integrated circuit packaging Such compilers are said to be hardware compilers or synthesis tools because the programs they compile effectively control the final configuration of the hardware and how it operates; the output of the compilation are not instructions that are executed in sequence - only an interconnection of transistors or lookup tables. A silicon compiler is a Software system that takes a user's specifications and automatically generates an Integrated circuit (IC For example, XST is the Xilinx Synthesis Tool used for configuring FPGAs. Similar tools are available from Altera, Synplicity, Synopsys and other vendors.

Compiler design

The approach taken to compiler design is affected by the complexity of the processing that needs to be done, the experience of the person(s) designing it, and the resources (eg, people and tools) available.

A compiler for a relatively simple language written by one person might be a single, monolithic piece of software. When the source language is large and complex, and high quality output is required the design may be split into a number of relatively independent phases, or passes. Having separate phases means development can be parceled up into small parts and given to different people. It also becomes much easier to replace a single phase by an improved one, or to insert new phases later (eg, additional optimizations).

The division of the compilation processes in phases (or passes) was championed by the Production Quality Compiler-Compiler Project (PQCC) at Carnegie Mellon University. The Production Quality Compiler-Compiler Project (or PQCC was a long-term project led by William Wulf at Carnegie Mellon University to produce an industrial-strength Carnegie Mellon University (also known as CMU) is a private Research University in Pittsburgh, Pennsylvania, United This project introduced the terms front end, middle end (rarely heard today), and back end.

All but the smallest of compilers have more than two phases. However, these phases are usually regarded as being part of the front end or the back end. The point at where these two ends meet is always open to debate. The front end is generally considered to be where syntactic and semantic processing takes place, along with translation to a lower level of representation (than source code).

The middle end is usually designed to perform optimizations on a form other than the source code or machine code. This source code/machine code independence is intended to enable generic optimizations to be shared between versions of the compiler supporting different languages and target processors.

The back end takes the output from the middle. It may perform more analysis, transformations and optimizations that are for a particular computer. Then, it generates code for a particular processor and OS.

This front-end/middle/back-end approach makes it possible to combine front ends for different languages with back ends for different CPUs. A programming language is an Artificial language that can be used to write programs which control the behavior of a machine particularly a Computer. Practical examples of this approach are the GNU Compiler Collection, LLVM, and the Amsterdam Compiler Kit, which have multiple front-ends, shared analysis and multiple back-ends. The GNU Compiler Collection (usually shortened to GCC) is a set of Compilers produced for various Programming languages by the GNU Project The Low Level Virtual Machine, generally known as LLVM, is a Compiler infrastructure written in C++, which is designed for Compile-time The Amsterdam Compiler Kit (ACK is a fast lightweight and Retargetable compiler suite and toolchain written by Andrew Tanenbaum and Ceriel Jacobs

One-pass versus multi-pass compilers

Classifying compilers by number of passes has its background in the hardware resource limitations of computers. Compiling involves performing lots of work and early computers did not have enough memory to contain one program that did all of this work. So compilers were split up into smaller programs which each made a pass over the source (or some representation of it) performing some of the required analysis and translations.

The ability to compile in a single pass is often seen as a benefit because it simplifies the job of writing a compiler and one pass compilers are generally faster than multi-pass compilers. In Computer programming, a one-pass compiler is a Compiler that passes through the Source code of each Compilation unit only once A multi-pass compiler is a type of Compiler that processes the Source code or Abstract syntax tree of a program several times Many languages were designed so that they could be compiled in a single pass (e. g. , Pascal). Pascal is an influential imperative and procedural Programming language, designed in 1968/9 and published in 1970 by Niklaus Wirth as a small

In some cases the design of a language feature may require a compiler to perform more than one pass over the source. For instance, consider a declaration appearing on line 20 of the source which affects the translation of a statement appearing on line 10. In this case, the first pass needs to gather information about declarations appearing after statements that they affect, with the actual translation happening during a subsequent pass.

The disadvantage of compiling in a single pass is that it is not possible to perform many of the sophisticated optimizations needed to generate high quality code. Compiler optimization is the process of tuning the output of a Compiler to minimize or maximize some attribute of an Executable computer program It can be difficult to count exactly how many passes an optimizing compiler makes. For instance, different phases of optimization may analyse one expression many times but only analyse another expression once.

Splitting a compiler up into small programs is a technique used by researchers interested in producing provably correct compilers. Proving the correctness of a set of small programs often requires less effort than proving the correctness of a larger, single, equivalent program.

While the typical multi-pass compiler outputs machine code from its final pass, there are several other types:

Front end

The front end analyzes the source code to build an internal representation of the program, called the intermediate representation or IR. In Computing, an intermediate representation (IR is a Data structure that is constructed from input data to a program, and from which part or all of the It also manages the symbol table, a data structure mapping each symbol in the source code to associated information such as location, type and scope. In Computer science, a symbol table is a Data structure used by a language translator such as a Compiler or interpreter, where each Identifier This is done over several phases, which includes some of the following:

  1. Line reconstruction. Languages which strop their keywords or allow arbitrary spaces within identifiers require a phase before parsing, which converts the input character sequence to a canonical form ready for the parser. When applied to Computer languages stropping refers to the method used to mark letter sequences as having a special property (most often being a keyword or certain The top-down, recursive-descent, table-driven parsers used in the 1960s typically read the source one character at a time and did not require a separate tokenizing phase. Top-down parsing is a strategy of analyzing unknown data relationships by hypothesizing general Parse tree structures and then considering whether the known fundamental structures A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures (or a non-recursive equivalent where each such Procedure Atlas Autocode, and Imp (and some implementations of Algol and Coral66) are examples of stropped languages whose compilers would have a Line Reconstruction phase. Atlas Autocode (AA was a Programming language developed around 1965 at Manchester University for the Atlas Computer. Edinburgh IMP is a development of ATLAS Autocode, initially developed around 1966-1969 at Edinburgh University, Scotland. Algol (β Per / Beta Persei known colloquially as the Demon Star, is a bright Star in the Constellation Perseus. CORAL ( C omputer O n-line R eal-time A pplications L anguage is a Programming language originally developed in 1964 at
  2. Lexical analysis breaks the source code text into small pieces called tokens. In Computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens Each token is a single atomic unit of the language, for instance a keyword, identifier or symbol name. In Computer programming, a keyword is a Word or Identifier that has a particular meaning to the Programming language. In Computer science, Identifiers ( IDs) are lexical tokens that name entities. The musical instrument is spelled Cymbal. A symbol is something --- such as an object, Picture, written word a sound a piece The token syntax is typically a regular language, so a finite state automaton constructed from a regular expression can be used to recognize it. In Computing, regular expressions provide a concise and flexible means for identifying strings of text of interest such as particular characters words or patterns of characters This phase is also called lexing or scanning, and the software doing lexical analysis is called a lexical analyzer or scanner. In Computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens
  3. Preprocessing. In Computer science, a preprocessor is a program that processes its input data to produce output that is used as input to another program Some languages, e. g. , C, require a preprocessing phase which supports macro substitution and conditional compilation. tags please moot on the talk page first! --> In Computing, C is a general-purpose cross-platform block structured A macro (from the Greek 'μάκρο' for long or far in Computer science is a rule or Pattern that specifies how a certain input sequence (often a sequence Typically the preprocessing phase occurs before syntactic or semantic analysis; e. g. in the case of C, the preprocessor manipulates lexical tokens rather than syntactic forms. However, some languages such as Scheme support macro substitutions based on syntactic forms.
  4. Syntax analysis involves parsing the token sequence to identify the syntactic structure of the program. In Computer science and Linguistics, parsing, or more formally syntactic analysis, is the process of analyzing a sequence of tokens to In Computer science and Linguistics, parsing, or more formally syntactic analysis, is the process of analyzing a sequence of tokens to This phase typically builds a parse tree, which replaces the linear sequence of tokens with a tree structure built according to the rules of a formal grammar which define the language's syntax. A parse tree or concrete syntax tree is an (ordered rooted tree that represents the syntactic structure of a string according to some In Formal semantics, Computer science and Linguistics, a formal grammar (also called formation rules) is a precise description of a Formal The parse tree is often analyzed, augmented, and transformed by later phases in the compiler.
  5. Semantic analysis is the phase in which the compiler adds semantic information to the parse tree and builds the symbol table. A parse tree or concrete syntax tree is an (ordered rooted tree that represents the syntactic structure of a string according to some This phase performs semantic checks such as type checking (checking for type errors), or object binding (associating variable and function references with their definitions), or definite assignment (requiring all local variables to be initialized before use), rejecting incorrect programs or issuing warnings. In Computer science, a type system defines how a Programming language classifies values and expressions into '''types''', how it can Several object binding times exist in Object oriented systems Semantic analysis usually requires a complete parse tree, meaning that this phase logically follows the parsing phase, and logically precedes the code generation phase, though it is often possible to fold multiple phases into one pass over the code in a compiler implementation. In Computer science and Linguistics, parsing, or more formally syntactic analysis, is the process of analyzing a sequence of tokens to In Computer science, code generation is the process by which a Compiler 's code generator converts some internal representation of Source code

Back end

The term back end is sometimes confused with code generator because of the overlapped functionality of generating assembly code. In Computer science, code generation is the process by which a Compiler 's code generator converts some internal representation of Source code Some literature uses middle end to distinguish the generic analysis and optimization phases in the back end from the machine-dependent code generators.

The main phases of the back end include the following:

  1. Analysis: This is the gathering of program information from the intermediate representation derived from the input. Compiler optimization is the process of tuning the output of a Compiler to minimize or maximize some attribute of an Executable computer program Typical analyses are data flow analysis to build use-define chains, dependence analysis, alias analysis, pointer analysis, escape analysis etc. Data-flow analysis is a technique for gathering information about the possible set of values calculated at various points in a Computer program. A Use-Definition Chain ( UD Chain) is a Data structure that consists of a use U a Variable, and all the definitions D of that variable that can reach In Compiler theory, dependence analysis produces execution-order constraints between statements/instructions Alias analysis is a technique in Compiler theory, used to determine if a storage location may be accessed in more than one way In Computer science pointer analysis, or points-to analysis, is a Static code analysis technique that establishes which Pointers or heap references In programming language Compiler optimization theory escape analysis is a method for determining the dynamic scope of Pointers It is related to Pointer analysis Accurate analysis is the basis for any compiler optimization. The call graph and control flow graph are usually also built during the analysis phase. A call graph (also known as a call Multigraph) is a directed graph that represents calling relationships between Subroutines in a Computer program In computer science a control flow graph (CFG is a representation, using graph notation of all paths that might be traversed through a program during
  2. Optimization: the intermediate language representation is transformed into functionally equivalent but faster (or smaller) forms. Compiler optimization is the process of tuning the output of a Compiler to minimize or maximize some attribute of an Executable computer program Popular optimizations are inline expansion, dead code elimination, constant propagation, loop transformation, register allocation or even automatic parallelization. In Computing, inline expansion, or inlining, is a Compiler optimization that replaces a function Call site with the body of the Callee In Compiler theory, dead code elimination is a Compiler optimization that removes code that does not affect the program In Compiler theory, constant folding and constant propagation are related optimization techniques used by many modern compilers In Compiler theory, loop optimization plays an important role in improving cache performance making effective use of parallel processing capabilities and reducing overheads In Compiler optimization, register allocation is the process of Multiplexing a large number of target program Variables onto a small number of Automatic parallelization, also auto parallelization, autoparallelization, parallelization, or //ization (shorthand the last two of which imply
  3. Code generation: the transformed intermediate language is translated into the output language, usually the native machine language of the system. In Computer science, code generation is the process by which a Compiler 's code generator converts some internal representation of Source code Machine code or machine language is a system of instructions and data executed directly by a Computer 's Central processing unit. This involves resource and storage decisions, such as deciding which variables to fit into registers and memory and the selection and scheduling of appropriate machine instructions along with their associated addressing modes (see also Sethi-Ullman algorithm). When generating code for arithmetic expressions the Compiler has to decide which is the best way to translate the expression in terms of number of instructions used as well

Compiler analysis is the prerequisite for any compiler optimization, and they tightly work together. For example, dependence analysis is crucial for loop transformation. In Compiler theory, dependence analysis produces execution-order constraints between statements/instructions In Compiler theory, loop optimization plays an important role in improving cache performance making effective use of parallel processing capabilities and reducing overheads

In addition, the scope of compiler analysis and optimizations vary greatly, from as small as a basic block to the procedure/function level, or even over the whole program (interprocedural optimization). In Computing, a basic block is code that has one entry point (i Interprocedural optimization (IPO is a Compiler technique used in Computer programming to improve performance in programs containing many frequently used functions Obviously, a compiler can potentially do a better job using a broader view. But that broad view is not free: large scope analysis and optimizations are very costly in terms of compilation time and memory space; this is especially true for interprocedural analysis and optimizations.

The existence of interprocedural analysis and optimizations is common in modern commercial compilers from HP, IBM, SGI, Intel, Microsoft, and Sun Microsystems. International Business Machines Corporation abbreviated IBM and nicknamed "Big Blue", is a multinational Computer Technology Silicon Graphics Inc (commonly initialised to SGI, historically sometimes referred to as Silicon Graphics Computer Systems or SGCS) is a company Microsoft Corporation is an American multinational Computer technology Corporation, which rose to dominate the Home computer Sun Microsystems Inc ( is a multinational vendor of Computers computer components Computer software, and Information technology services The open source GCC was criticized for a long time for lacking powerful interprocedural optimizations, but it is changing in this respect. The GNU Compiler Collection (usually shortened to GCC) is a set of Compilers produced for various Programming languages by the GNU Project Another good open source compiler with full analysis and optimization infrastructure is Open64, which is used by many organizations for research and commercial purposes. Open64 is an Open source, optimizing Compiler for the Intel IA-64 (Itanium AMD Opteron and Intel IA-32e architecture

Due to the extra time and space needed for compiler analysis and optimizations, some compilers skip them by default. Users have to use compilation options to explicitly tell the compiler which optimizations should be enabled.

Related techniques

Assembly language is not a high-level language and a program that compiles it is more commonly known as an assembler, with the inverse program known as a disassembler. See the terminology section below for information regarding inconsistent use of the terms assembly and assembler A disassembler is a Computer program that translates Machine language into Assembly language —the inverse operation to that of an assembler.

A program that translates from a low level language to a higher level one is a decompiler. A decompiler is the name given to a Computer program that performs the reverse operation to that of a Compiler.

A program that translates between high-level languages is usually called a language translator, source to source translator, language converter, or language rewriter. In Mathematics, Computer science and Logic, rewriting covers a wide range of potentially non-deterministic methods of replacing subterms of a The last term is usually applied to translations that do not involve a change of language.

See also

Notes

  1. ^ IP: "The World's First COBOL Compilers" -- 12 June 1997
  2. ^ T. Hart and M. Levin "The New Compiler", AIM-39 CSAIL Digital Archive - Artificial Intelligence Laboratory Series
  3. ^ "The PL/0 compiler/interpreter"
  4. ^ Book description at the ACM Digital Library

References

External links

Dictionary

compiler

-noun

  1. (computing) A computer program which reads source code and outputs assembly code or executable code.
  2. A person who compiles.
© 2009 citizendia.org; parts available under the terms of GNU Free Documentation License, from http://en.wikipedia.org