I built something like this long ago. The easy part - I thought - would be to separate data and code by evaluating every branch twice (once for the path taken and once for the path not taken) and building up a backlog of the branches not yet explored. Every step along the way you'd mark the byte at the instruction pointer as an opcode and the other bytes of that instruction as operands, and whatever remained unmarked after a complete run would be the static data present in the binary.
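Roughly, the scheme looks like this. A minimal sketch using a made-up one-byte-opcode ISA (not x86); all opcodes, names and encodings here are invented purely for illustration:

    /* Minimal sketch of the backlog scheme, on a toy ISA (not x86). */
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_HALT = 0, OP_NOP = 1, OP_JMP = 2, OP_JZ = 3 };  /* invented opcodes */
    #define IMG_SIZE 16

    int main(void) {
        /* tiny "binary": a few instructions followed by data bytes */
        uint8_t img[IMG_SIZE] = { OP_JZ, 6, OP_NOP, OP_JMP, 2, 0,
                                  OP_HALT, 0xDE, 0xAD, 0xBE, 0xEF };
        uint8_t is_code[IMG_SIZE] = { 0 };

        int backlog[IMG_SIZE];      /* branch targets not yet taken */
        int n = 0;
        backlog[n++] = 0;           /* entry point */

        while (n > 0) {
            int ip = backlog[--n];
            while (ip >= 0 && ip < IMG_SIZE && !is_code[ip]) {
                uint8_t op = img[ip];
                is_code[ip] = 1;                      /* mark opcode byte    */
                if (op == OP_HALT) break;
                if (op == OP_NOP)  { ip += 1; continue; }
                if (ip + 1 >= IMG_SIZE) break;
                is_code[ip + 1] = 1;                  /* mark operand byte   */
                if (op == OP_JMP)  { ip = img[ip + 1]; continue; }
                if (op == OP_JZ)   {
                    backlog[n++] = img[ip + 1];       /* taken path, later   */
                    ip += 2;                          /* fall through now    */
                    continue;
                }
                break;                                /* unknown: stop       */
            }
        }
        /* whatever was never marked is presumed to be static data */
        for (int i = 0; i < IMG_SIZE; i++)
            printf("%2d: %02x  %s\n", i, img[i], is_code[i] ? "code" : "data");
        return 0;
    }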
If only.
The easiest way to throw such a scheme off is to have an instruction jump into the middle of an operand so that the code has a dual meaning: one reading gives you one disassembly, another a completely different one.
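A classic x86 example of the trick, shown as raw bytes (hand-checked for 32-bit code; the 0xFF byte serves double duty as the jmp's operand and as the first byte of the instruction that actually executes next):

    static const unsigned char overlap[] = {
        0xEB, 0xFF, /* jmp rel8 with rel8 = -1: the target is offset 1,      */
                    /* i.e. the instruction jumps into its own operand       */
        0xC0,       /* executed reading, from offset 1: FF C0 = inc eax      */
                    /* a linear sweep instead continues at offset 2, where   */
                    /* C0 starts a shift-group instruction (e.g. ror r/m8,   */
                    /* imm8) that swallows whatever bytes follow and loses   */
                    /* sync with the real instruction stream                 */
    };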
If you then try to convert the result to a pseudo-C construct (a function with a stack frame and the corresponding unwinding of that stack frame at the end of the function) then it no longer works...
The short version of the above: there is a 1:1 correspondence between C source code and the un-optimized output of a compiler. That coupling gets a lot looser once you start optimizing, and when you use hand-crafted assembly as the input to a reverse-compiler the output may simply no longer be functionally equivalent, because the mapping might not exist at all - and that can be a hard situation to detect.
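To make the optimization point concrete, a hypothetical example: the un-optimized output of a function like the one below maps back to the source almost statement for statement, but at higher optimization levels many compilers replace the whole loop with a closed-form expression, leaving no loop for a decompiler to recover.

    /* illustrative only: compilers commonly rewrite this loop as a
     * closed-form multiplication at -O2, so the optimized binary may
     * contain no loop at all */
    unsigned sum_upto(unsigned n) {
        unsigned s = 0;
        for (unsigned i = 0; i < n; i++)
            s += i;
        return s;
    }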
Eleven or twelve years ago I was presenting a paper at the European Joint Conferences on Theory and Practice of Software; while I was there I went to a talk by Gogul Balakrishnan and spent some time talking to him about decompilation. I remember him telling me that the problem you describe is simply not possible to solve for x86.
FWIW, jumping into the middle of an instruction is not actually a problem (even though I do realize it trips up a number of disassemblers: I actually published a proof-of-concept obfuscating assembler that was designed to specifically break all the best disassemblers at the time, and it leaned heavily on that trick), as you just need to build your own graph of the execution of the program: don't try to mark regions of memory, as that will screw you.
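The "graph, not regions" idea boils down to keying every decoded instruction by its start address, so two decodings that overlap in memory are simply two different nodes; nothing has to own a byte range exclusively. A sketch under that assumption (names invented, not taken from any particular tool):

    #include <stdint.h>

    /* one node per (start address, decoding); overlapping instructions
     * become distinct nodes with distinct keys */
    struct insn_node {
        uint32_t addr;       /* start address = the node's key             */
        uint8_t  len;        /* bytes this particular decoding spans       */
        uint32_t succ[2];    /* fall-through and/or branch-target address  */
        int      nsucc;
    };

    /* the disassembler then walks addresses reachable from the entry
     * point, looking nodes up (or creating them) in a map keyed by addr,
     * instead of painting "code" or "data" over byte ranges */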
What really gets you quickly are computed jumps: even super popular disassemblers often hiccup on simple switch tables, as there is no way to really know for sure which cases were proven to be impossible by the compiler and left out of the checks. And one could imagine a case where a jump is computed based on the output of a potentially undecidable function... you essentially need to be able to prove things about programs to even ask the question, and those proofs almost always fail due to incompleteness :(.
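For illustration, a plain switch like the one below is typically lowered (with optimization) to a bounds check plus an indirect jump through a table of code addresses; if the compiler has already proven some values impossible, the check can be narrowed or dropped, so a disassembler has no reliable way to know where the table ends.

    int classify(int tag) {
        switch (tag) {          /* often becomes: compare/branch + indirect
                                   jump through a table of code addresses  */
        case 0:  return 10;
        case 1:  return 20;
        case 2:  return 30;
        case 3:  return 40;
        default: return -1;
        }
    }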
Ah yes, switches, I totally forgot about that bit but you're absolutely right, those were impossible to get right, especially with the optimizer on. The end result that best matched the code would be a jumble of ifs and gotos rather than a switch, and because you don't know the bounds you're going to have to make a bunch of assumptions about the contents of the addresses in the jump table. Bad luck to you if one of the jump tables (you can get more than one of them, for disjoint sets of case statements that cluster together) happens to be shorter than you think and is followed by other address pointers.
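Roughly what that "jumble of ifs and gotos" ends up looking like, using the four-case switch from the sketch in the previous comment (labels and constants invented for illustration):

    int classify_recovered(int tag) {
        if (tag != 0) goto not0;
        return 10;
    not0:
        if (tag != 1) goto not1;
        return 20;
    not1:
        if (tag != 2) goto not2;
        return 30;
    not2:
        if (tag != 3) goto no_case;
        return 40;
    no_case:
        return -1;
    }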
Decompilation for C is a very hard problem. Conceptually super easy but extremely hard if not impossible to implement. The Java folks have it so much easier in this respect.
The only one that had me stumped for a bit was 'undesirable' (for 'undecidable'), but even there it left only one possibility. The rest I didn't even notice, I read what you meant and now that you point them out I see them. Funny thing, the brain.
I have a rule when I write reports: don't send one on the same day. Let it sit for a day and then re-read it, from paper if possible. That's when I spot the typos and the weird sentences. If I proofread right away I just read what I think it should say rather than what it really says. Frustrating!
I'd definitely just prefer using the disassembled output as it's infinitely more readable once you've grokked it. Nice project though; in a field as monopolised as reversing, you can never have too many tools!
Well, this is open source, IDA is not. And while the features are far fewer, if the author keeps working on it the gap will shrink with time. (Given the vast feature set of IDA, it's unlikely the gap will ever be closed completely, but it might become good enough for some people's purposes.)