SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 25(7), 811–829 (JULY 1995)
Decompilation of Binary Programs

CRISTINA CIFUENTES∗ AND K. JOHN GOUGH
(email: cifuente@fit.qut.edu.au gough@fit.qut.edu.au)
School of Computing Science, Queensland University of Technology, GPO Box 2434, Brisbane QLD 4001, Australia
SUMMARY

The structure of a decompiler is presented, along with a thorough description of the different modules that form part of a decompiler, and the type of analyses that are performed on the machine code to regenerate high-level language code. The phases of the decompiler have been grouped into three main modules: front-end, universal decompiling machine, and back-end. The front-end is a machine-dependent module that performs the loading, parsing and semantic analysis of the input program, as well as generating an intermediate representation of the program. The universal decompiling machine is a machine- and language-independent module that performs data and control flow analysis of the program based on the intermediate representation, and the program’s control flow graph. The back-end is a language-dependent module that deals with the details of the target high-level language.

In order to increase the readability of the generated programs, a decompiling system has been implemented which integrates a decompiler, dcc, and an automatic signature generator, dccSign. Signatures for libraries and compilers are stored in a database that is read by the decompiler; thus, the generated programs can make use of known library names, such as WriteLn() and printf().

dcc is a decompiler for the Intel 80286 architecture and the DOS operating system. dcc takes as input binary programs from a DOS environment and generates C programs as output. Sample code produced by this decompiler is given.

KEY WORDS: decompiler; reverse compiler; compiler signature; library signature; i80286; C language
INTRODUCTION

A decompiler, or reverse compiler, is a program that attempts to perform the inverse process of the compiler: given an executable program compiled in any high-level language, the aim is to produce a high-level language program that performs the same function as the executable program. Thus, the input is machine dependent, and the output is language dependent.

Several practical problems are faced when writing a decompiler. The main problem derives from the representation of data and instructions in the Von Neumann architecture: they are indistinguishable. Thus, data can be located in between instructions, as in many implementations of indexed jump (case) tables. This representation, together with the practice of self-modifying code, makes it hard to decompile a binary program.

Another problem is the great number of subroutines introduced by the compiler and the linker. The compiler will always include start-up subroutines that set up its environment,
∗ Present address: Department of Computer Science, University of Tasmania, GPO Box 252C, Hobart TAS 7001, Australia. (Email: C.N.Cifuentes@cs.utas.edu.au)

CCC 0038–0644/95/070811–19
© 1995 by John Wiley & Sons, Ltd.

Received April 1994
Revised January 1995
and runtime support routines whenever required. These routines are normally written in assembler and in most cases are untranslatable into a higher-level representation. Also, most operating systems do not provide a mechanism to share libraries; consequently, binary programs are self-contained and library routines are bound into each binary image. Library routines are written either in the language the compiler was written in or in assembler. This means that a binary program contains not only the routines written by the programmer, but a great number of other routines linked in by the linker. As an example of the number of extra subroutines present in a binary program, a ‘hello world’ program compiled in C generates 23 different procedures. The same program compiled in Pascal generates more than 40 procedures. All we are really interested in is the one procedure, main.

Despite the above-mentioned limitations, there are several uses for a decompiler, including two major software areas: maintenance of code and software security. From a maintenance point of view, a decompiler is an aid in the recovery of lost source code, the migration of applications to a new hardware platform, the translation of code written in an obsolete language into a newer language, the structuring of old code written in an unstructured way (e.g. ‘spaghetti’ code), and a debugging tool that helps in finding and correcting bugs in an existing binary program. From a security point of view, a binary program can be checked for the existence of malicious code (e.g. viruses) before it is run for the first time on a computer; and in safety-critical systems where the compiler is not trusted, the binary program can be validated to do exactly what the original high-level language program intended to do, thus verifying the output of the compiler.

Different attempts at writing decompilers have been made in the last 20 years. Due to the amount of information lost in the compilation process, all of these experimental decompilers have limitations in one way or another if they are to regenerate high-level language (HLL) code, including decompilation of assembly files1,2,3,4,5 or object files with or without symbolic debugging information,6,7 simplified high-level language,1 and the requirement of the compiler’s specification.8,9 Assembly programs have helpful data information in the form of symbolic text, such as data segments, data and type declarations, subroutine names, subroutine entry points, and subroutine exit statements. All this information can be collected in a symbol table, and the decompiler would not need to address the problem of separating data from instructions, or the naming of variables and subroutines. Object files with debugging information contain the program’s symbol table as constructed by the compiler. Given the symbol table, it is easy to determine which memory locations are instructions, as there is certainty about which memory locations represent data. In general, object files contain more information than binary files. Finally, the requirement to have access to the compiler’s specifications is impractical, as these specifications are not normally disclosed by compiler manufacturers, or do not even exist.
Our decompiler, dcc, differs from previous decompilation projects in several ways; it analyses binary programs rather than assembler or object files, performs idiom∗ analysis to capture the essence of a sequence of instructions with a special meaning, performs data flow analysis on registers and condition codes to eliminate them, and structures the program’s control flow graph into a generic set of high-level structures that can be accommodated into different high-level languages, eliminating as much as possible the use of the goto statement.

The rest of this paper is structured in the following way: a thorough description of the structure of a decompiler, followed by the description of our implementation of an

∗ An idiom is a sequence of instructions that forms a logical entity and has a meaning that cannot be derived by considering the primary meanings of the individual instructions.
binary program → Front-end (machine dependent) → UDM (analysis) → Back-end (language dependent) → HLL program

Figure 1. Decompiler modules
automatic decompiling system, and conclusions. The paper is followed by the definitions of graph theoretical concepts used throughout the paper (Appendix I), and sample output from different phases of the decompilation of a program (Appendix II).
THE DECOMPILER STRUCTURE

A decompiler can be structured in a similar way to a compiler, that is, as a series of modules that deal with machine- or language-dependent features. Three main modules are required: a machine-dependent module that reads in the program, loads it into virtual memory and parses it (the front-end), a machine- and language-independent module that analyses the program in memory (the universal decompiling machine), and a language-dependent module that writes formatted output for the target language (the back-end) (see Figure 1). This modular representation makes it easier to write decompilers for different machine/target language pairs, by writing different front-ends for different machines, and different back-ends for different target languages. This result is true in theory, but in practical applications it is always limited by the generality of the intermediate language used.
The front-end

The front-end module deals with machine-dependent features and produces a machine-independent representation. It takes as input a binary program for a specific machine, loads it into virtual memory, parses it, and produces an intermediate representation of the program, as well as the program’s control flow graph (see Figure 2).

The parser disassembles code starting at the entry point given by the loader, and follows
binary program → Loader → Parser → Semantic analysis → low-level intermediate code + control flow graph

Figure 2. Front-end phases
the instructions sequentially until a change in flow of control is met. Flow of control is changed due to a conditional, unconditional or indexed branch, or a procedure call, in which case the target branch label(s) start new instruction paths to be disassembled. A path is finished by a return instruction or program termination. All instruction paths are followed in a recursive manner. Problems are introduced by machine instructions that use indexed or indirect addressing modes. To handle these, heuristic methods are implemented. For example, while disassembling code, the parser must check for sequences of instructions that represent a multiway branch (e.g. a switch statement), normally implemented as an indexed jump on a jump table.10,11 Finally, the intermediate code is generated and the control flow graph is built.
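To make the recursive traversal concrete, the following is a minimal sketch in C, assuming a machine-specific decode() routine and our own Insn/IClass types; it is an illustration of the technique, not dcc's actual parser.

/* Minimal sketch of recursive-traversal disassembly; decode() and the
 * instruction classification are assumed, and are not dcc's code. */
#include <stdbool.h>
#include <stdint.h>

#define IMAGE_SIZE 0x10000

typedef enum { I_FALLTHROUGH, I_JUMP, I_JCOND, I_CALL, I_RET, I_INDEXED } IClass;

typedef struct {
    IClass   class;     /* kind of control transfer             */
    int      length;    /* size of the instruction in bytes     */
    uint32_t target;    /* branch or call target, if any        */
} Insn;

extern Insn decode(const uint8_t *image, uint32_t off);   /* machine specific    */
static bool visited[IMAGE_SIZE];                           /* bytes already parsed */

/* Follow one instruction path until a return or an already-visited byte. */
static void parse_path(const uint8_t *image, uint32_t off)
{
    while (off < IMAGE_SIZE && !visited[off]) {
        Insn i = decode(image, off);
        visited[off] = true;

        switch (i.class) {
        case I_RET:                    /* path ends here                     */
            return;
        case I_JUMP:                   /* continue at the branch target      */
            off = i.target;
            break;
        case I_JCOND:                  /* two paths: target and fall-through */
        case I_CALL:
            parse_path(image, i.target);
            off += i.length;
            break;
        case I_INDEXED:                /* jump table: the heuristics would
                                          enumerate its entries here         */
            return;
        default:                       /* ordinary instruction               */
            off += i.length;
            break;
        }
    }
}

Starting with parse_path(image, entry_point) visits exactly the byte ranges reachable as code, which is what allows data interspersed between instructions to be left untouched.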
Two levels of intermediate code are required: a low-level representation that resembles the assembler of the machine, and a higher-level representation that resembles statements of a high-level language. The initial level, or low-level intermediate code, is a simple m : 1 mapping of machine instructions to assembler mnemonics. Compound instructions (such as rep movs) are represented by a unique low-level intermediate instruction (e.g. rep_movs). The second level, or high-level intermediate code, is generated by the interprocedural data flow analysis, explained later, and maps n : 1 low-level to high-level instructions. The front-end generates a low-level representation.
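Purely as an illustration of the two levels (the type and field names below are our own assumptions, not dcc's internal representation), the low level can be pictured as one record per assembler mnemonic and the high level as records that carry expressions instead of registers:

/* Illustrative sketch of the two intermediate-code levels; the names are
 * ours and do not reflect dcc's internals. */

typedef struct Expr Expr;        /* expression tree built by the data flow analysis */

typedef enum { LOW_MOV, LOW_ADD, LOW_CMP, LOW_JCOND, LOW_CWD,
               LOW_DIV, LOW_MOD, LOW_REP_MOVS /* compound rep movs */ } LowOp;

typedef struct {                 /* low level: roughly one record per machine instruction */
    LowOp op;
    int   dst, src;              /* register numbers or stack offsets */
} LowIcode;

typedef enum { HL_ASGN, HL_JCOND, HL_CALL, HL_RET } HighOp;

typedef struct {                 /* high level: operands are expressions */
    HighOp op;
    Expr  *lhs, *rhs;
} HighIcode;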
The semantic analysis phase performs idiom analysis and type propagation. Idioms are replaced by an appropriate functionally equivalent intermediate instruction. For example, Figure 3 illustrates two different idioms: the one on the left-hand side is a negation of a long variable, represented in this case by the registers dx:ax. The idiom on the right-hand side is the prologue code of a high-level language procedure. In this case, space for 6 bytes is being reserved on the stack for local variables. There are a number of different idioms
neg dx                 push bp
neg ax                 mov bp, sp
sbb dx, 0              sub sp, 6

    ⇓                      ⇓

neg dx:ax              enter 6,0

Figure 3. Sample idioms and their transformation
widely known in the compiler community, and the decompiler must code them in order to generate clearer high-level code.
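As a standalone, hedged sketch of how the prologue idiom of Figure 3 might be recognized over an array of low-level intermediate instructions (the types and opcode names are illustrative assumptions, not dcc's internals):

/* Sketch of idiom recognition on low-level intermediate code. */
#include <stdbool.h>

typedef enum { LOW_PUSH, LOW_MOV, LOW_SUB, LOW_ENTER } LowOp;
typedef enum { R_AX, R_DX, R_BP, R_SP, R_IMM } Operand;

typedef struct {
    LowOp   op;
    Operand dst, src;
    int     imm;                 /* immediate value when src == R_IMM */
} LowIcode;

/* Recognize the procedure-prologue idiom of Figure 3:
 *     push bp / mov bp, sp / sub sp, N   ==>   enter N, 0
 * Returns true and rewrites the first instruction when the idiom matches. */
static bool match_enter_idiom(LowIcode *code, int i, int n)
{
    if (i + 2 >= n)
        return false;
    if (code[i].op     == LOW_PUSH && code[i].dst     == R_BP &&
        code[i + 1].op == LOW_MOV  && code[i + 1].dst == R_BP &&
                                      code[i + 1].src == R_SP &&
        code[i + 2].op == LOW_SUB  && code[i + 2].dst == R_SP &&
                                      code[i + 2].src == R_IMM) {
        code[i].op  = LOW_ENTER;          /* one equivalent instruction...   */
        code[i].imm = code[i + 2].imm;    /* ...carrying the number of bytes */
        /* instructions i+1 and i+2 would be marked dead and removed */
        return true;
    }
    return false;
}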
Type information is propagated after the idioms have been recognized. For example, if a long local variable is found at stack offsets -1 to -4, all references to [bp-2] and [bp-4] must be merged into references to such a long variable, e.g. [bp-2]:[bp-4]. Other type information can be propagated in the same way, such as fields (offsets) of a record.
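For illustration only, the merge can be recorded in a table of locals so that later references to either half resolve to the same variable; the structures below are hypothetical and do not mirror dcc's symbol table:

/* Hypothetical sketch of long-variable merging during type propagation. */
#define MAX_LOCALS 64

typedef enum { T_WORD, T_LONG } LocalType;

typedef struct {
    int       off;       /* most negative bp offset, e.g. -4 for [bp-2]:[bp-4] */
    int       size;      /* 2 for a word, 4 for a long                         */
    LocalType type;
} Local;

static Local locals[MAX_LOCALS];
static int   nlocals;

/* Once an idiom (e.g. the long negation of Figure 3) reveals that the words
 * at [bp-2] and [bp-4] form one long, register a single 4-byte local ...   */
static int add_long_local(int off_lo)
{
    locals[nlocals].off  = off_lo;
    locals[nlocals].size = 4;
    locals[nlocals].type = T_LONG;
    return nlocals++;
}

/* ... and map any later reference [bp+off] to that local, so that both
 * halves are folded into references to the same long variable.           */
static int local_for_offset(int off)
{
    for (int i = 0; i < nlocals; i++)
        if (off >= locals[i].off && off < locals[i].off + locals[i].size)
            return i;
    return -1;            /* not a known local */
}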
An optimization phase is performed on the control flow graph as well. Due to the nature of machine code instructions, the compiler might need to introduce intermediate branches in an executable program, because there is no machine instruction capable of branching more than a certain maximum distance in bytes (architecture dependent). An optimization pass over the control flow graph removes this redundancy, by replacing the target branch location of all conditional or unconditional jumps that branch to an unconditional jump (and, recursively, any chains of such branches) with the final target basic block. While performing this process, some basic blocks are no longer referenced, as they were used only for intermediate branches. These nodes must be eliminated from the graph as well.
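A minimal sketch of this jump-chain elimination, assuming our own basic-block representation rather than dcc's:

/* Sketch of jump-chain elimination on the control flow graph. */
typedef struct BB BB;
struct BB {
    int  kind;            /* BB_JUMP marks a block holding only an unconditional jump */
    int  nsucc;           /* number of out-edges (1 or 2)                             */
    BB  *succ[2];         /* successor basic blocks                                   */
    int  refs;            /* number of in-edges                                       */
};

#define BB_JUMP 1

/* Follow a chain of jump-only blocks to its final destination. */
static BB *final_target(BB *b)
{
    while (b->kind == BB_JUMP && b->nsucc == 1 && b->succ[0] != b)
        b = b->succ[0];
    return b;
}

/* Redirect every out-edge past intermediate jump blocks; blocks whose
 * reference count drops to zero can then be removed from the graph.    */
static void remove_jump_chains(BB **blocks, int n)
{
    for (int i = 0; i < n; i++)
        for (int e = 0; e < blocks[i]->nsucc; e++) {
            BB *t = final_target(blocks[i]->succ[e]);
            if (t != blocks[i]->succ[e]) {
                blocks[i]->succ[e]->refs--;    /* the intermediate block loses an in-edge */
                t->refs++;
                blocks[i]->succ[e] = t;
            }
        }
}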
The universal decompiling machine

The universal decompiling machine (UDM) is an intermediate module that is totally machine and language independent. It deals with flow graphs and the intermediate representation of the program, and performs all the flow analysis the input program needs (see Figure 4).
Data flow analysis

The aim of the data flow analysis phase is to transform the low-level intermediate representation into a higher-level representation that resembles a HLL statement. It is therefore necessary to eliminate the concept of condition codes (or flags) and registers, as these concepts do not exist in high-level languages, and to introduce the concept of expressions, as these can be used in any HLL program. For this purpose, the technology of compiler optimization has been appropriated.

The first analysis is concerned with condition codes. Some condition codes are used only by hand-crafted assembly code instructions, and thus are not translatable to a high-level representation. Therefore, condition codes are classified into two groups: HLCC, the set of condition codes that are likely to have been generated by a compiler (e.g. overflow, carry), and NHLCC, the set of condition codes that are likely to have been generated by assembly code (e.g. trap, interrupt). The HLCC set is the one that can
control flow graph + low-level intermediate code → Data flow analysis → Control flow analysis → high-level intermediate code + structured control flow graph

Figure 4. UDM phases
be analysed further. Instructions that use condition codes in the NHLCC set mean that the subroutine is most likely not high-level, and it is therefore flagged as such; assembler is all that can be generated for these subroutines.

A use/definition, or reaching definition, analysis is performed on condition codes. In this way, for a given use of a condition code c at instruction j, the use/definition chain (ud-cc()) determines which instruction(s) defined this condition. In practical cases, there is always only one instruction that defined the condition(s) used in the instruction at j (i.e. ud-cc(c) = {i}). This set can be computed by a forward-flow any-path algorithm.12,13 Once the set of defined/used instructions is known (i.e. (i, j)), these low-level instructions can be replaced by a unique conditional high-level instruction that is semantically equivalent to the given instructions. Let us consider the example in Figure 5, which illustrates this point. Instruction 2 uses the sign (SF) and zero (ZF) condition codes. These flags were previously defined by instruction 1, which also defines the carry flag (CF). Given that this instruction defines all flags used by the conditional jump, they can be merged into one high-level instruction that compares the registers for the greater condition; i.e. JCOND (ax > bx).
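As a sketch of this merging step (with our own instruction record, not dcc's intermediate code), the pair (i, j) delivered by ud-cc can be folded as follows:

/* Sketch of folding a flag-defining instruction into its conditional jump,
 * as in Figure 5. */
typedef enum { CC_SF = 1, CC_ZF = 2, CC_CF = 4 } CondCode;     /* flag bit mask */
typedef enum { LOW_CMP, LOW_JCOND, HL_JCOND_EXPR } Op;

typedef struct {
    Op  op;
    int def_cc, use_cc;   /* flags defined / used by this instruction        */
    int opnd1, opnd2;     /* operands of the cmp, carried into the JCOND     */
    int cond;             /* relational operator of the jump (>, <, ==, ...) */
    int dead;             /* set when the instruction is subsumed            */
} Icode;

/* Given ud-cc(j) = {i}: if instruction i defines every flag that the
 * conditional jump j uses, fold the pair into one high-level JCOND. */
static void merge_cmp_jcond(Icode *code, int i, int j)
{
    if (code[i].op == LOW_CMP && code[j].op == LOW_JCOND &&
        (code[j].use_cc & ~code[i].def_cc) == 0) {
        code[j].op    = HL_JCOND_EXPR;        /* JCOND (opnd1 <cond> opnd2)  */
        code[j].opnd1 = code[i].opnd1;
        code[j].opnd2 = code[i].opnd2;
        code[i].dead  = 1;                    /* the cmp is no longer needed */
    }
}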
The second analysis is concerned with registers. The elimination of temporary registers leads to the elimination of intermediate instructions, by replacing several low-level instructions with one high-level instruction that does not make use of the intermediate register. As with condition codes, some machine instructions are hand-crafted by assembler programmers, and are untranslatable to a higher representation. We therefore classify the low-level instructions into two sets: HLI, the set of instructions that are representable in a high-level language (e.g. mov, add), and NHLI, the set of instructions that are likely to be generated only by assembly code (e.g. cli, ins). This analysis is concerned only with instructions from the HLI set. Instructions from the NHLI set belong to subroutines
1  cmp ax, bx    ; def: SF,ZF,CF
2  jg  labZ      ; use: SF,ZF      ; ud-cc(SF,ZF) = {1}

        ⇓

JCOND (ax > bx)

Figure 5. Condition code example
that are likely to have been written in assembler, or are untranslatable, and are therefore flagged as such; assembler is generated for them.

Two preliminary analyses are required for the elimination of registers. A definition/use analysis on registers is needed to determine how many uses there are for a definition of a register. Note that register variables are not eliminated by this analysis, as they represent local variables and thus are required in the final output program. This analysis can be solved as a backward-flow any-path problem.12,13 Also, an interprocedural live register analysis is required, to ascertain which registers are live on entrance to and on exit from a basic block. This analysis is also solved by known algorithms.12,13
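As a reminder of the standard technique (a generic sketch rather than dcc's implementation), live registers can be computed per basic block with the usual backward data flow iteration, where live_in = use ∪ (live_out - def) and live_out is the union of the successors' live_in sets:

/* Generic sketch of iterative live-register analysis; register sets are
 * bit masks and the basic-block layout is our own assumption. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_SUCC 2

typedef struct {
    uint32_t use, def;            /* registers used before defined / defined here */
    uint32_t live_in, live_out;   /* computed liveness sets                       */
    int      nsucc;
    int      succ[MAX_SUCC];      /* indices of successor blocks                  */
} BasicBlock;

static void live_register_analysis(BasicBlock *bb, int n)
{
    bool changed = true;
    while (changed) {                        /* iterate to a fixed point          */
        changed = false;
        for (int i = n - 1; i >= 0; i--) {   /* backward problem: reverse order   */
            uint32_t out = 0;
            for (int s = 0; s < bb[i].nsucc; s++)
                out |= bb[bb[i].succ[s]].live_in;
            uint32_t in = bb[i].use | (out & ~bb[i].def);
            if (in != bb[i].live_in || out != bb[i].live_out) {
                bb[i].live_in  = in;
                bb[i].live_out = out;
                changed = true;
            }
        }
    }
}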
The method to eliminate registers has been named forward substitution: by performing a forward substitution of the symbolic contents of a defined register that is used only once into the instruction that uses it, the temporary register is eliminated, the temporary instruction that defined the register is also eliminated, and the final instruction that used the register is now defined in terms of an expression. Let us consider a modulo 10 example, described in Figure 6. The registers si and di are register variables in this example. Register tmp is a virtual register introduced by the parser whenever a DIV instruction is found, as DIV is equivalent to two low-level instructions: a division and a modulus. Because they use the same registers as operands, and they redefine these registers, a tmp register is introduced
..  ...              ; other code here
28  MOV ax, di       ; ASGN ax, di        ; du(ax) = {30}
29  MOV bx, 0Ah      ; ASGN bx, 0Ah       ; du(bx) = {32,33}
30  CWD              ; ASGN dx:ax, ax     ; du(ax) = {31}, du(dx) = {31}
31  MOV tmp, dx:ax   ; ASGN tmp, dx:ax    ; du(tmp) = {32,33}
32  DIV bx           ; ASGN ax, tmp / bx  ; du(ax) = {}
33  MOD bx           ; ASGN dx, tmp % bx  ; du(dx) = {34}
34  MOV si, dx       ; ASGN si, dx
..  ...              ; other code here, no use of ax

        ⇓

ASGN si, di % 0Ah

Figure 6. Simple expression example
to hold the initial value of dx:ax. The substitution up to instruction 32 is illustrated in the following code:
28  MOV ax, di       ; ASGN ax, di
29  MOV bx, 0Ah      ; ASGN bx, 0Ah
30  CWD              ; ASGN dx:ax, di      (substitute 28->30)
31  MOV tmp, dx:ax   ; ASGN tmp, di        (substitute 30->31)
32  DIV bx           ; ASGN ax, tmp / bx   (eliminate instruction)
Instruction 32 defines a register that is not going to be used; therefore this instruction is redundant and must be eliminated. Any uses of the registers on the right-hand side of the redundant instruction need to be backpatched to reflect the non-existence of the instruction. After this step, the code would look like this:
29  MOV bx, 0Ah      ; ASGN bx, 0Ah       ; du(bx) = {33}
31  MOV tmp, dx:ax   ; ASGN tmp, di
33  MOD bx           ; ASGN dx, tmp % bx
34  MOV si, dx       ; ASGN si, dx
and the final substitutions would give the final result in instruction 34:
29  MOV bx, 0Ah      ; ASGN bx, 0Ah
31  MOV tmp, dx:ax   ; ASGN tmp, di
33  MOD bx           ; ASGN dx, di % 0Ah   (substitute 31->33) (substitute 29->33)
34  MOV si, dx       ; ASGN si, di % 0Ah   (substitute 33->34)
A complete algorithm that describes this data flow analysis can be found in References 14 and 15.
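A simplified sketch of forward substitution over expression trees follows; the types are our own illustration of the idea, and not the full algorithm of References 14 and 15:

/* Sketch of forward substitution over high-level intermediate code. */
#include <stddef.h>

typedef struct Expr Expr;
struct Expr {
    enum { E_REG, E_CONST, E_BINOP } kind;
    int   reg;               /* register number for E_REG        */
    long  value;             /* constant value for E_CONST       */
    char  op;                /* '+', '/', '%', ... for E_BINOP   */
    Expr *left, *right;
};

typedef struct {
    int   dead;              /* instruction eliminated           */
    int   def_reg;           /* register defined (ASGN lhs)      */
    int   n_uses;            /* from the definition/use analysis */
    Expr *rhs;               /* right-hand side expression       */
} HLIcode;

/* Replace every occurrence of register r inside expression e by rhs. */
static Expr *subst(Expr *e, int r, Expr *rhs)
{
    if (e == NULL)
        return NULL;
    if (e->kind == E_REG && e->reg == r)
        return rhs;
    if (e->kind == E_BINOP) {
        e->left  = subst(e->left,  r, rhs);
        e->right = subst(e->right, r, rhs);
    }
    return e;
}

/* If instruction i defines a register that is used exactly once, at
 * instruction j, substitute i's expression into j and eliminate i;
 * repeated application yields the steps shown for Figure 6. */
static void forward_substitute(HLIcode *code, int i, int j)
{
    if (code[i].n_uses == 1 && !code[i].dead) {
        code[j].rhs  = subst(code[j].rhs, code[i].def_reg, code[i].rhs);
        code[i].dead = 1;
    }
}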
Control flow analysis

The control flow analyser structures the control flow graph into generic high-level control structures that are available in most languages. These are conditionals (if..then[..else]), the multiway branch (case), and different types of loops (while(), repeat..until(), and the endless loop). Different methods have been specified in the literature to structure graphs, most of them dealing with the elimination of goto statements from the graph by the introduction of new variables,16,17 code replication,18,19,20 or the use of multilevel exits.21,22 Both the introduction of new variables and code replication modify the apparent semantics of the program, and are therefore not desirable when decompiling binary programs, given that we want to decompile the code ‘as is’. The use of multilevel exit statements is not supported by commonly used languages (e.g. Pascal, C), and thus cannot be part of the generic set of high-level control constructs that can be generated. We developed an algorithm that structures the graph into the set of generic high-level control structures and, whenever it determines that a particular subgraph is not one of the generic constructs, uses a goto. Note that the minimum number of gotos is always used. This algorithm is described in Reference 23.
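As a highly simplified illustration of one structuring decision (the data structures are ours; follow-node computation, loop structuring and case structuring are omitted, and the complete algorithm is in Reference 23), a two-way header node can be classified as follows:

/* Simplified sketch: classifying a two-way conditional node. */
#include <stddef.h>

typedef struct Node Node;
struct Node {
    Node *then_succ, *else_succ;   /* the two out-edges of the conditional */
    Node *follow;                  /* node where the branches rejoin       */
    enum { UNSTRUCTURED, IF_THEN, IF_THEN_ELSE } kind;
};

/* A two-way header one of whose branches is the join point is an if..then
 * (negating the condition if necessary); when both branches later meet at
 * a common follow node it is an if..then..else; anything else falls back
 * to goto generation. */
static void structure_if(Node *hdr, Node *common_follow)
{
    if (hdr->else_succ == common_follow || hdr->then_succ == common_follow) {
        hdr->kind   = IF_THEN;
        hdr->follow = common_follow;
    } else if (common_follow != NULL) {
        hdr->kind   = IF_THEN_ELSE;
        hdr->follow = common_follow;
    } else {
        hdr->kind   = UNSTRUCTURED;    /* a goto will be generated */
    }
}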
[Figure 7 shows two control flow graphs. Before the transformation, a node jcond ((si * 5) == 50) branches to a printf() node when false and to a second node jcond (di < si) when true; the second node branches to the same printf() node when true and to other code when false. After the transformation, a single node jcond (((si * 5) != 50) or (di < si)) branches to the printf() node or to the other code.]

Figure 7. Short-circuit evaluation graph
A second structuring phase can be implemented to check for short-circuit evaluation graphs. These graphs can be transformed into simpler graphs that hold two or more compound conditions in the one basic block, rather than requiring the generation of high-level code that uses two or more nested if..then statements. Figure 7 illustrates an example of a compound or condition. The top basic block checks for the equality of (si * 5) and 50. If this is false, a printf() node is reached. On the other hand, if the equality is true, a second condition is checked: di < si. If this condition is true, the same printf() basic block is reached, otherwise some other code is reached. Rather than generating C code that requires a goto for these conditions (as they are not properly nested), the two conditions can be merged into a compound or condition that negates the first condition (since the printf() node is reached whenever the first condition is false) and leaves the second condition as it is (given that this condition also reaches printf() when it is true). The intermediate basic block and edges are removed from the graph. The complete structuring algorithm is described in References 15 and 24.
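The effect of the merge can be seen in the C code that would correspond to Figure 7; the fragment below is illustrative only, with si, di and the message strings standing in for the real operands and calls:

/* Illustrative fragment of the Figure 7 transformation at the C level. */
#include <stdio.h>

static void before_merge(int si, int di)
{
    /* the unmerged graph forces a goto, as the conditions are not properly nested */
    if ((si * 5) == 50) {
        if (di < si)
            goto print_it;
        puts("other code");
        return;
    }
print_it:
    printf("condition holds\n");
}

static void after_merge(int si, int di)
{
    /* the merged graph: first condition negated, compound or condition */
    if (((si * 5) != 50) || (di < si))
        printf("condition holds\n");
    else
        puts("other code");
}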
The back-end

The back-end module is language dependent, as it deals with the target high-level language. This module optionally restructures the graph into control constructs available only in the particular target language, and then generates code for this language (see Figure 8).

The restructuring phase is optional and aims at structuring the graph even further, so that control structures that are available in the target language, but not present in the generic set of control structures of the structuring algorithm described previously, are utilized. For instance, if the target language is Ada, multilevel exits are allowed. After the graph has been structured, multilevel exits look like a loop with abnormal goto exits. The restructuring phase can check the target destination of each goto, and determine whether an exit(i) statement is suitable instead. Another example is the for loop; such a loop is equivalent to a while()
structured control flow graph + high-level intermediate code → Restructuring → HLL code generation → HLL program

Figure 8. Back-end phases
loop that makes use of an induction variable. In this case, the induction variable needs to be found.
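A hedged sketch of this induction-variable test, using our own loop representation rather than dcc's restructurer:

/* Sketch of recovering a for loop from a while() loop. */
#include <stdbool.h>

typedef struct {
    int  var;                /* variable assigned: x = ...                 */
    bool is_increment;       /* true when of the form x = x + c            */
} Stmt;

typedef struct {
    Stmt *init;              /* statement immediately before the loop      */
    int   cond_var;          /* variable tested in the while() condition   */
    Stmt *body;              /* loop body statements                       */
    int   nbody;
    bool  is_for;            /* set when the loop can be emitted as for()  */
} WhileLoop;

/* A while() whose condition tests a variable that is initialized just
 * before the loop and incremented as the last statement of the body can
 * be emitted as:  for (init; cond; increment) { body minus increment }  */
static void restructure_for(WhileLoop *w)
{
    if (w->nbody == 0 || w->init == NULL)
        return;
    Stmt *last = &w->body[w->nbody - 1];
    if (w->init->var == w->cond_var &&
        last->var == w->cond_var && last->is_increment)
        w->is_for = true;
}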
The final phase is the HLL code generation, which generates code for the target HLL based on the control flow graph an
