Citizendia

A decompiler is a computer program that performs the reverse operation to that of a compiler. That is, it translates a file containing information at a relatively low level of abstraction (usually designed to be computer readable rather than human readable) into a form having a higher level of abstraction (usually designed to be human readable).

Introduction

The term "decompiler" is most commonly applied to a program which translates executable programs (the output from a compiler) into source code in a (relatively) high level language which, when compiled, will produce an executable whose behavior is the same as the original executable program. By comparison, a disassembler translates an executable program into assembly language (and an assembler could be used to assemble it back into an executable program).

Decompilation is the act of using a decompiler, although the term, when used as a noun, can also refer to the output of a decompiler. It can be used for the recovery of lost source code, and is also useful in some cases for computer security, interoperability and error correction.[1] The success of decompilation depends on the amount of information present in the code being decompiled and the sophistication of the analysis performed on it. The bytecode formats used by many virtual machines (such as the Java Virtual Machine or the .NET Framework Common Language Runtime) often include extensive metadata and high-level features that make decompilation quite feasible. Machine language typically has much less metadata, and is therefore much harder to decompile.

Some compilers and post-compilation tools produce obfuscated code (that is, they attempt to produce output that is very difficult to decompile). This is done to make it more difficult to reverse engineer the executable.

Design

Decompilers can be thought of as composed of a series of phases, each of which contributes specific aspects of the overall decompilation process.

Loader

The first decompilation phase is the loader, which parses the input machine code or intermediate language program's binary file format. The loader should be able to discover basic facts about the input program, such as the architecture (Pentium, PowerPC, etc.) and the entry point. In many cases, it should be able to find the equivalent of the main function of a C program, which is the start of the user-written code. This excludes the runtime initialization code, which should not be decompiled if possible.
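As a rough illustration of what a loader discovers, the following minimal sketch reads the architecture and entry point from a file header. It assumes a 32-bit little-endian ELF input, which the text above does not specify; the function name load_elf32_header and the sample bytes are hypothetical.

```python
import struct

# Assumption: the input is a 32-bit little-endian ELF file.
# Other formats (PE, Mach-O, ...) would need their own parsers.
ELF_MACHINES = {3: "x86", 20: "PowerPC"}  # EM_386 and EM_PPC in the ELF specification

def load_elf32_header(data: bytes) -> dict:
    """Return the architecture and entry point recorded in an ELF32 header."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 1 or data[5] != 1:
        raise ValueError("this sketch only handles 32-bit little-endian ELF")
    # After the 16-byte e_ident field:
    # e_type (u16), e_machine (u16), e_version (u32), e_entry (u32)
    e_type, e_machine, e_version, e_entry = struct.unpack_from("<HHII", data, 16)
    return {
        "architecture": ELF_MACHINES.get(e_machine, f"unknown ({e_machine})"),
        "entry_point": e_entry,
    }

# A hand-built header fragment with made-up values, for demonstration only.
sample = (b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
          + struct.pack("<HHII", 2, 3, 1, 0x08048000))
info = load_elf32_header(sample)
print(info["architecture"], hex(info["entry_point"]))
```

The entry point found this way normally refers to the runtime initialization code, so locating the equivalent of main takes further analysis that this sketch does not attempt.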

Disassembly

The next logical phase is the disassembly of machine code instructions into a machine independent intermediate representation (IR). For example, the Pentium machine instruction

   mov    eax, [ebx+0x04]

might be translated to the IR

   eax := m[ebx+4];

Idioms

Idiomatic machine code sequences are sequences of code whose combined semantics is not immediately apparent from the instructions' individual semantics. Either as part of the disassembly phase, or as part of later analyses, these idiomatic sequences need to be translated into known equivalent IR. For example, the x86 assembly code:

   cdq                    ; edx is set to the sign-extension of eax
   xor    eax, edx
   sub    eax, edx

could be translated to

   eax := abs(eax);
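To see why this sequence computes an absolute value, the three instructions can be simulated on 32-bit values. The sketch below assumes 32-bit registers and two's-complement arithmetic; the function names are illustrative only.

```python
MASK = 0xFFFFFFFF  # registers are assumed to be 32 bits wide

def abs_idiom(eax: int) -> int:
    """Simulate the cdq / xor / sub sequence on a 32-bit value."""
    eax &= MASK
    edx = MASK if eax & 0x80000000 else 0   # cdq: edx becomes all ones if eax is negative
    eax ^= edx                              # xor eax, edx: flips every bit when negative
    eax = (eax - edx) & MASK                # sub eax, edx: subtracting -1 adds one
    return eax

def to_signed(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value

for n in (5, -5, 0, -123456):
    assert to_signed(abs_idiom(n)) == abs(n)
```

When eax is negative, the xor and sub together perform a two's-complement negation; when it is non-negative, edx is zero and both instructions leave eax unchanged. The most negative 32-bit value has no positive counterpart, so the sequence returns it unchanged.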

Some idiomatic sequences are machine independent; some involve only one instruction. For example, xor eax, eax clears the eax register (sets it to zero). This can be implemented with a machine independent simplification rule, such as a xor a = 0.
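A simplification rule of this kind might be applied to IR expressions as follows. The sketch assumes that expressions are stored as nested tuples of the form (operator, left, right) and have no side effects; this representation is chosen for illustration and is not prescribed by the text.

```python
def simplify(expr):
    """Apply the rule a xor a = 0 bottom-up to a tuple-based expression tree."""
    if not isinstance(expr, tuple):
        return expr                      # a leaf: register name or constant
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "xor" and left == right:
        return 0                         # identical operands cancel out
    return (op, left, right)

print(simplify(("xor", "eax", "eax")))                  # 0
print(simplify(("add", "ebx", ("xor", "ecx", "ecx"))))  # ('add', 'ebx', 0)
```

Because the rule compares whole subexpressions rather than register names, it works for any operand and on any architecture.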

In general, it is best to delay detection of idiomatic sequences, if possible, to later stages that are less affected by instruction ordering. For example, the instruction scheduling phase of a compiler may insert other instructions into an idiomatic sequence, or change the ordering of instructions in the sequence. A pattern matching process in the disassembly phase would probably not recognize the altered pattern. Later phases group instruction expressions into more complex expressions, and modify them into a canonical (standardized) form, making it more likely that even the altered idiom will match a higher level pattern later in the decompilation.

Program analysis

Various program analyses can be applied to the IR. In particular, expression propagation combines the semantics of several instructions into more complex expressions. For example,

   mov   eax,[ebx+0x04]
   add   eax,[ebx+0x08]
   sub   [ebx+0x0C],eax

could result in the following IR after expression propagation:

   m[ebx+12] := m[ebx+12] - (m[ebx+4] + m[ebx+8]);

The resulting expression is more like a high level language, and has also eliminated the use of the machine register eax. Later analyses may eliminate the ebx register.
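A minimal sketch of expression propagation over straight-line code follows. It assumes that each register definition is used only by the instructions shown, that no intervening store changes the memory being read, and that the IR is held as nested tuples; the names REGISTERS, substitute, propagate and show are hypothetical.

```python
REGISTERS = {"eax", "ebx", "ecx", "edx"}

def substitute(expr, defs):
    """Replace registers in expr by the expressions currently known for them."""
    if isinstance(expr, tuple):
        return (expr[0],) + tuple(substitute(e, defs) for e in expr[1:])
    return defs.get(expr, expr)

def propagate(statements):
    """Forward-substitute register definitions into later statements."""
    defs, output = {}, []
    for dest, rhs in statements:
        rhs = substitute(rhs, defs)
        if dest in REGISTERS:
            defs[dest] = rhs             # remember the definition, emit nothing yet
        else:
            output.append((substitute(dest, defs), rhs))
    return output

def show(expr, top=True):
    """Print a tuple expression in the IR notation used in this article."""
    if not isinstance(expr, tuple):
        return str(expr)
    if expr[0] == "m":
        return f"m[{show(expr[1])}]"
    text = f"{show(expr[1], False)} {expr[0]} {show(expr[2], False)}"
    return text if top else f"({text})"

def mem(offset):
    return ("m", ("+", "ebx", offset))

statements = [
    ("eax", mem(4)),                      # mov eax,[ebx+0x04]
    ("eax", ("+", "eax", mem(8))),        # add eax,[ebx+0x08]
    (mem(12), ("-", mem(12), "eax")),     # sub [ebx+0x0C],eax
]
result = propagate(statements)
for dest, rhs in result:
    print(f"{show(dest)} := {show(rhs)};")
# m[ebx + 12] := m[ebx + 12] - (m[ebx + 4] + m[ebx + 8]);
```

A production decompiler would have to check these assumptions with data flow analysis before substituting, since propagating a definition past a conflicting store or into a second use changes the meaning of the program.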

Type analysis

A good machine code decompiler will perform type analysis. Here, the way registers or memory locations are used results in constraints on the possible type of the location. For example, an and instruction implies that the operand is an integer; programs do not use such an operation on floating point values (except in special library code) or on pointers. An add instruction results in three constraints, since the operands may be both integer, or one integer and one pointer (with integer and pointer results respectively; the third constraint comes from the ordering of the two operands when the types are different).
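The three constraints generated by an add instruction can be written out explicitly. The sketch below is a simplified illustration that uses only two type names, "int" and "ptr"; real type analyses work with much richer type lattices, and the function names are hypothetical.

```python
def add_constraints():
    """The three (left, right, result) type combinations allowed for an add."""
    return [
        ("int", "int", "int"),
        ("ptr", "int", "ptr"),
        ("int", "ptr", "ptr"),
    ]

def infer_add(left=None, right=None, result=None):
    """Return the constraints still compatible with the types known so far."""
    known = (left, right, result)
    return [c for c in add_constraints()
            if all(k is None or k == t for k, t in zip(known, c))]

print(infer_add(left="ptr"))    # [('ptr', 'int', 'ptr')]
print(infer_add(result="int"))  # [('int', 'int', 'int')]
print(infer_add(result="ptr"))  # two candidates remain
```

Knowing one type can settle the others: if the left operand is a pointer, the right operand must be an integer and the result a pointer. If only the result is known to be a pointer, two possibilities remain, and evidence from other instructions is needed to choose between them.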

Various high level expressions can be recognized which trigger recognition of structures or arrays. However, it is difficult to distinguish many of the possibilities, because of the freedom that machine code or even some high level languages such as C allow with casts and pointer arithmetic.

The example from the previous section could result in the following high level code:

   struct T1 *ebx;
   struct T1 {
       int v0004;
       int v0008;
       int v000C;
   };
   ebx->v000C -= ebx->v0004 + ebx->v0008;

Structuring

The penultimate decompilation phase involves structuring of the IR into higher level constructs such as while loops and if/then/else conditional statements. For example, the machine code

   xor eax, eax
l0002:
   or  ebx, ebx
   jge l0003
   add eax,[ebx]
   mov ebx,[ebx+0x4]
   jmp l0002    
l0003:
   mov [0x10040000],eax

could be translated into:

   eax = 0;
   while (ebx < 0) {
       eax += ebx->v0000;
       ebx = ebx->v0004;
   }
   v10040000 = eax;

Unstructured code is more difficult to translate into structured code than already structured code. Solutions include replicating some code, or adding boolean variables. See chapter 6 of [2].
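One common starting point for recognizing loops is to search the control flow graph for back edges, that is, edges that return to a block still being explored. The sketch below applies a depth-first search to a graph modelled on the machine code example above. It is not necessarily the method described in [2], and the block names "entry" and "body" are made up for the unlabeled parts of the code.

```python
def find_back_edges(cfg, entry):
    """Return edges (src, dst) whose target is still on the depth-first search stack."""
    back_edges, on_stack, visited = [], set(), set()

    def visit(node):
        visited.add(node)
        on_stack.add(node)
        for succ in cfg.get(node, []):
            if succ in on_stack:
                back_edges.append((node, succ))   # jumps back to an enclosing block
            elif succ not in visited:
                visit(succ)
        on_stack.discard(node)

    visit(entry)
    return back_edges

# Control flow of the example: l0002 tests ebx, "body" jumps back to l0002.
cfg = {
    "entry": ["l0002"],
    "l0002": ["body", "l0003"],
    "body": ["l0002"],
    "l0003": [],
}
print(find_back_edges(cfg, "entry"))  # [('body', 'l0002')]
```

The target of the back edge, l0002, is the loop header. Because the exit test sits in the header, the loop can be emitted as a while loop.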

Code generation

The final phase is the generation of the high level code in the back end of the decompiler. Just as a compiler may have several back ends for generating machine code for different architectures, a decompiler may have several back ends for generating high level code in different high level languages.
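The idea of interchangeable back ends can be sketched by emitting the same structured loop in two output languages. The representation of the loop (a condition string plus a list of assignments) and the names emit_c, emit_pascal and BACK_ENDS are assumptions made for this illustration.

```python
def emit_c(cond, body):
    """Emit a while loop in C syntax."""
    lines = [f"while ({cond}) {{"]
    lines += [f"    {dest} = {expr};" for dest, expr in body]
    lines.append("}")
    return "\n".join(lines)

def emit_pascal(cond, body):
    """Emit the same loop in Pascal syntax."""
    lines = [f"while {cond} do", "begin"]
    lines += [f"  {dest} := {expr};" for dest, expr in body]
    lines.append("end;")
    return "\n".join(lines)

# One entry per supported output language.
BACK_ENDS = {"c": emit_c, "pascal": emit_pascal}

loop = ("ebx < 0", [("eax", "eax + 1")])
for name, emit in BACK_ENDS.items():
    print(f"-- {name} --")
    print(emit(*loop))
```

Only the final emission step differs between languages; everything up to structuring is shared, mirroring the way a compiler shares its front end across several target architectures.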

Just before code generation, it may be desirable to allow interactive editing of the IR, perhaps using some form of graphical user interface. This would allow the user to enter comments, and non-generic variable and function names. However, these are almost as easily entered in a post-decompilation edit. The user may want to change structural aspects, such as converting a while loop to a for loop. These are less readily modified with a simple text editor, although source code refactoring tools may assist with this process. The user may need to enter information that failed to be identified during the type analysis phase, e.g. modifying a memory expression to an array or structure expression. Finally, incorrect IR may need to be corrected, or changes made to cause the output code to be more readable.

Legality

The majority of computer programs are covered by copyright laws. Although the precise scope of what is covered by copyright differs from region to region, copyright law generally provides the author (the programmer(s) or employer) with a collection of exclusive rights to the program. These rights include the right to make copies, including copies made into the computer's RAM. Since the decompilation process involves making multiple such copies, it is generally prohibited without the authorization of the copyright holder. However, because decompilation is often a necessary step in achieving software interoperability, copyright laws in both the United States and Europe permit decompilation to a limited extent.

In the United States, the copyright fair use defense has been successfully invoked in decompilation cases. For example, in Sega v. Accolade, the court held that Accolade could lawfully engage in decompilation in order to circumvent the software locking mechanism used by Sega's game consoles.[3]

In Europe, the 1991 Software Directive explicitly provides for a right to decompile in order to achieve interoperability. The result of a heated debate between, on the one side, software protectionists, and, on the other, academics as well as independent software developers, Article 6 permits decompilation only if a number of conditions are met:

In addition, Article 6 prescribes that the information obtained through decompilation may not be used for other purposes and that it may not be given to others.

Overall, the decompilation right provided by Article 6 is interesting, as it codifies what is claimed to be common practice in the software industry. Few European lawsuits are known to have emerged from the decompilation right. This could be interpreted as meaning either one of two things: 1) the decompilation right is not used frequently and the decompilation right may therefore have been unnecessary, or 2) the decompilation right functions well and provides sufficient legal certainty not to give rise to legal disputes. In a recent report regarding implementation of the Software Directive by the European member states, the European Commission seems to support the second interpretation.

In popular culture

In Star Trek: Voyager, The Doctor equates "decompilation" of his program to his death, although in reality this is not a destructive procedure.

References

  1. ^ "Why Decompilation"
  2. ^ C. Cifuentes. Reverse Compilation Techniques. PhD thesis, Queensland University of Technology, 1994. (available as compressed postscript)
  3. ^ The Legality of Decompilation
  4. ^ B. Czarnota and R. J. Hart, Legal protection of computer programs in Europe: a guide to the EC directive. 1991, London: Butterworths.


© 2009 citizendia.org; parts available under the terms of GNU Free Documentation License, from http://en.wikipedia.org