Measuring the quality and effectiveness of binary analysis tools is challenging, particularly with regard to the accuracy of the initial disassembly on which further analysis depends. Generating a correct disassembly is a recognized hard problem: common code structures such as jump tables, object hierarchies, and pointer manipulation can all mislead these tools. Moreover, malware authors employ obfuscation, packing, and code rewriting to thwart analysis tool chains. Several tool chains, including IDA Pro, Ghidra, angr, Radare, and Ddisasm, attempt to address these challenges with varying degrees of success.
Comparing the effectiveness of these tools is itself difficult. Historically, such comparisons have examined the instructions each tool produces. This approach is limited, however, because a tool may make legitimate choices about which instructions to include in its disassembly, so two differing outputs can both be defensible. We propose truth boundaries, termed "min-truth" and "max-truth," as a baseline for measuring disassembly accuracy, and we discuss the strengths and limitations of this approach.
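
To make the idea of truth boundaries concrete, the following minimal sketch shows one way a tool's output could be checked against a pair of bounds. It rests on an assumption that goes beyond what is stated here: min-truth is read as the set of instruction addresses every correct disassembly must contain, and max-truth as the set of addresses a correct disassembly may contain. The names `BoundsReport` and `check_against_bounds`, and the example addresses, are hypothetical and purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundsReport:
    """Result of comparing one tool's instruction addresses to the bounds."""
    missing: frozenset   # required by min-truth but not reported by the tool
    spurious: frozenset  # reported by the tool but not allowed by max-truth
    optional: frozenset  # reported, allowed by max-truth, not required by min-truth

    @property
    def within_bounds(self) -> bool:
        # A tool is within bounds when min-truth <= tool output <= max-truth.
        return not self.missing and not self.spurious


def check_against_bounds(tool, min_truth, max_truth) -> BoundsReport:
    """Compare a set of instruction addresses against assumed truth bounds."""
    tool, min_truth, max_truth = map(frozenset, (tool, min_truth, max_truth))
    if not min_truth <= max_truth:
        raise ValueError("min-truth must be a subset of max-truth")
    return BoundsReport(
        missing=min_truth - tool,
        spurious=tool - max_truth,
        optional=(tool & max_truth) - min_truth,
    )


# Hypothetical example: three required instructions plus two addresses
# (for instance, padding) that a tool may legitimately include or omit.
min_truth = {0x1000, 0x1002, 0x1005}
max_truth = min_truth | {0x1008, 0x1009}

report_a = check_against_bounds({0x1000, 0x1002, 0x1005, 0x1008}, min_truth, max_truth)
report_b = check_against_bounds({0x1000, 0x1002, 0x100C}, min_truth, max_truth)
```

Under this reading, the first hypothetical tool is within bounds even though it reports an instruction that another correct tool might omit, while the second both misses a required instruction and reports one outside the allowed set. Separating the two kinds of error is what distinguishes this check from a direct instruction-by-instruction comparison between tools.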