The Future of Reverse Engineering: Seeing Beyond the Bytes
- Stephanie Domas
- Jul 15
- 3 min read
Updated: Jul 16
For years, I've been wrestling with a fundamental problem in reverse engineering: how do we truly understand a massive binary blob? Our brains, incredible as they are, simply aren't wired to process gigabytes of ones and zeros. We've been stuck between two extremes: the overwhelming flexibility of hex views and the rigid, often brittle, nature of disassemblies. There's a huge gap in how we conceptualize and analyze this information, often treating reverse engineering more as an art than a science.
This is why I've been pouring my efforts into Dynamic Binary Visualization. My core belief is that we can translate computationally difficult tasks into visual problems – problems our brains are uniquely equipped to solve. Think about it: our visual cortex is a powerhouse for processing 3D and spatial information. Why aren't we leveraging that for binary analysis?
I'm not the first to explore this space. Greg Conti's work on digraphs, treating pairs of sequential bytes as XY coordinates, revealed remarkable visual patterns in data, even without knowing its structure. Aldo Cortesi took it further with Hilbert curves, preserving locality and using entropy visualization to highlight critical areas like packed malware or encryption keys.
Building on these foundational ideas, I developed CantorDust (also known as Cancer Dust, depending on the day!). This software is my attempt to push the boundaries of interactive binary analysis, making it truly independent of format.
Here's how CantorDust helps us see what's really going on:
Visualizing in Higher Dimensions: We're extending Conti's 2D digraphs into 3D. This gives us even more depth and insight into the underlying structure of the data.
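The core of the 3D extension can be sketched in a few lines: each run of three consecutive bytes becomes a point in a 256×256×256 cube, and dense clusters in that cloud hint at structure (ASCII text, machine code, compressed data all cluster differently). This is a minimal illustration of the mapping, not CantorDust's actual implementation.

```python
def trigraph_points(data: bytes):
    """Map each run of three consecutive bytes to an (x, y, z)
    coordinate in a 256^3 cube. Plotting the resulting point cloud
    makes different data types visually distinguishable."""
    return [(data[i], data[i + 1], data[i + 2])
            for i in range(len(data) - 2)]

# Printable ASCII clusters in one small corner of the cube,
# while compressed or encrypted data fills it uniformly.
points = trigraph_points(b"hello world")
```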
Entropy Mapping: By visualizing entropy, we can immediately spot regions of high randomness – tell-tale signs of encrypted data, packed executables, or obfuscated code. Conversely, low entropy can point us to things like entry points or unencrypted strings. I've used this to find hidden images and even key checker algorithms within applications like Notepad.exe.
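A sliding-window Shannon entropy scan is the standard way to build this kind of map; the sketch below (window and step sizes are arbitrary choices, not CantorDust's) scores each window in bits per byte, where values near 8.0 flag likely encryption or compression and values near 0 flag padding or repetitive data.

```python
import math
from collections import Counter

def window_entropy(data: bytes, window: int = 256, step: int = 128):
    """Shannon entropy (bits/byte) over sliding windows.
    ~8.0 suggests encrypted/compressed regions; low values suggest
    padding, headers, or plain text."""
    scores = []
    for start in range(0, max(len(data) - window + 1, 1), step):
        chunk = data[start:start + window]
        counts = Counter(chunk)
        n = len(chunk)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        scores.append((start, h))
    return scores
```

Plotting these scores against offset gives the entropy map: packed or encrypted regions stand out as plateaus near the 8-bit ceiling.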
Binary Classification with Bayesian Statistics: This is where it gets powerful. We can train CantorDust with samples of known data types – say, x86 code, a custom instruction set, or a proprietary file format. Using a naive Bayes classifier and n-gram models, the software can then statistically identify similar regions in unknown binaries. No more guessing; we're classifying based on learned patterns.
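The idea can be sketched as a toy byte-bigram naive Bayes classifier with Laplace smoothing; the class names and n-gram size here are illustrative assumptions, and CantorDust's actual models are more sophisticated.

```python
import math
from collections import Counter

class NGramNaiveBayes:
    """Toy naive Bayes over byte bigrams (n=2), Laplace-smoothed.
    Train on known samples (e.g. x86 code, bitmap data), then score
    unknown regions against each learned model."""
    def __init__(self):
        self.models = {}  # label -> (bigram Counter, total bigrams)

    def train(self, label: str, sample: bytes):
        counts, total = self.models.get(label, (Counter(), 0))
        grams = [sample[i:i + 2] for i in range(len(sample) - 1)]
        counts.update(grams)
        self.models[label] = (counts, total + len(grams))

    def classify(self, region: bytes) -> str:
        def log_likelihood(label):
            counts, total = self.models[label]
            # Sum of smoothed log-probabilities of each bigram.
            return sum(
                math.log((counts[region[i:i + 2]] + 1) / (total + 256 * 256))
                for i in range(len(region) - 1))
        return max(self.models, key=log_likelihood)
```

In practice you would slide a window across the unknown binary and classify each window, painting the file with its most likely data type.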
Probabilistic Parsing: Forget rigid grammar definitions. We can generate call graphs for arbitrary binary files purely based on statistical patterns. This means we can analyze unknown instruction sets and understand function calls and connectivity without needing a complete, predefined understanding of the binary's grammar.
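Once a call-like opcode has been inferred statistically, the graph-building step itself is simple. The sketch below hard-codes x86's `E8 rel32` encoding as a stand-in for an inferred opcode; the statistical inference step, and the `call_graph` helper itself, are illustrative assumptions rather than CantorDust's implementation.

```python
import struct
from collections import defaultdict

def call_graph(code: bytes, base: int = 0, call_op: int = 0xE8):
    """Build caller-site -> target edges assuming a single call opcode
    followed by a relative 32-bit operand (x86's E8 used here as a
    stand-in for a statistically inferred opcode)."""
    edges = defaultdict(list)
    i = 0
    while i + 5 <= len(code):
        if code[i] == call_op:
            rel = struct.unpack_from("<i", code, i + 1)[0]
            # Relative calls are resolved from the end of the instruction.
            edges[base + i].append(base + i + 5 + rel)
            i += 5
        else:
            i += 1
    return dict(edges)
```

Clustering the resulting edges by target then recovers function boundaries and connectivity, with no grammar for the instruction set required up front.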
I put these techniques to the test with a real-world case study: dissecting an Intel BIOS firmware image. My goal was to find an unsigned splash screen module and identify a CRC checksum routine. Using CantorDust, I was able to:
Extract firmware images from a BIOS update executable.
Identify custom compressed modules within the firmware using Bayesian classification, a task that conventional tools failed at because of the lack of standard headers.
Visually pinpoint the splash screen image inside those modules, even though the bitmap header had been stripped.
Locate the CRC checksum routines by leveraging probabilistic parsing to find debug printf functions and then tracing their callers. This led me directly to the validate_rbu_crc function, giving me the CRC offset, magic number, and algorithm needed to repackage the firmware.
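The repackaging step above boils down to recomputing the checksum over the modified image and writing it back at the discovered offset. The helper below is a hedged sketch: standard CRC-32 stands in for the firmware's actual algorithm, and the 4-byte little-endian field layout is an assumption for illustration.

```python
import binascii

def patch_crc(image: bytearray, crc_offset: int) -> bytearray:
    """Recompute the checksum over everything except the stored CRC
    field and write it back in place. Real firmware may use a
    different polynomial, field width, or coverage region."""
    body = image[:crc_offset] + image[crc_offset + 4:]
    crc = binascii.crc32(bytes(body)) & 0xFFFFFFFF
    image[crc_offset:crc_offset + 4] = crc.to_bytes(4, "little")
    return image
```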
The result? This entire process, which would have taken me around 37 hours using traditional methods (even with prior knowledge!), was completed in about 9 hours with CantorDust. That's a massive reduction in time and effort.
The implications are significant. If we can reduce complex programs to fundamental, visually distinct patterns, it changes how we approach malware detection, making traditional signature-based methods less effective. It also simplifies exploitation, especially on architectures with limited gadgets, by reducing programs to simple, repetitive operations.
Ultimately, this work challenges our very definition of "code." If malicious and benign code can look identical at the instruction stream level, and if the program's behavior is determined by a data table rather than unique instructions, then perhaps programming itself is evolving. We need new ways to understand and analyze binary information to keep pace with the ever-changing landscape of data.
I encourage you to explore these ideas further. If you're wondering "where is the repo?", you sadly won't find it. (You may find one GitHub repo, but it's not the correct one.) While this work was performed as a passion/hobby project in my free time, the terms of my employment contract did not allow me to keep the IP, and my employer at the time chose not to open source it. The repo you do find is their take on the original idea, rebuilt as an IDA plugin, while my original development was a standalone application.
Christopher Domas (@xoreaxeaxeax)
