How did you get to reverse the files? by protoman3000 at 7:17 AM EDT on July 11, 2020
Hello everybody,
vgmstream is awesome, but I'm also intrigued by what's happening to make this work.
I would like to know how the contributors to vgmstream got to understand the various file formats that vgmstream is able to decode.
I'm not talking about reading a table of the file memory-map in some wiki and then writing the vgmstream plugin for that. I'm asking about how to understand the file by yourself, effectively reversing the file format without any given specification.
E.g. if a new game comes out and uses a completely unknown file format for their music, how do you approach this and reverse it?
There is no definitive approach to that. There can't be. I think it's a combination of experience with already existing formats and experimentation or even brute-force. Oftentimes you can change some unknown values in files and observe the results in-game. Some values even make sense just by looking at them. For example, if you're dealing with audio streams and see a common frequency (like 22050 or 32000 etc.) somewhere in the header, you already know what those bytes mean. You then keep looking in the neighboring bytes to find other commonly used stuff.
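As a rough illustration of the "look for a common frequency" idea, here is a minimal Python sketch. The function name `find_sample_rates`, the list of rates, and the 0x200-byte search window are all assumptions made for the example, not anything vgmstream does.

```python
import struct

# Common sample rates; this list is an assumption, extend it as needed.
COMMON_RATES = {8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000}

def find_sample_rates(data, limit=0x200):
    """Return (offset, endianness, value) for 32-bit fields that look like a sample rate."""
    hits = []
    end = min(len(data), limit)
    for off in range(0, end - 3):
        # Try both byte orders, since the platform's endianness is unknown.
        for tag, fmt in (("LE", "<I"), ("BE", ">I")):
            (value,) = struct.unpack_from(fmt, data, off)
            if value in COMMON_RATES:
                hits.append((off, tag, value))
    return hits
```

A hit is only a hint: the bytes right after it are where you would then look for things like a channel count.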
It involves toying around in hex editors, or writing small tools that help you understand the format better or run tests more quickly than jumping around in a hex editor manually. Some formats are just variations of others where only the header changes. Others might be already known files in disguise or in non-standard container formats, as is very often the case with Vorbis-compressed WAV files.
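One kind of "small tool" for spotting known files in disguise is a signature scanner. The sketch below searches a blob for a handful of well-known magic numbers; the table is deliberately short and the function name `scan_signatures` is made up for this example. Short signatures can match by accident, so treat hits as leads to check by hand.

```python
# A few well-known magic numbers; far from complete.
SIGNATURES = {
    b"OggS": "Ogg page",
    b"RIFF": "RIFF container",
    b"fLaC": "FLAC stream",
    b"\x1f\x8b\x08": "gzip (deflate)",
    b"\x28\xb5\x2f\xfd": "Zstandard frame",
    b"\x04\x22\x4d\x18": "LZ4 frame",
}

def scan_signatures(data):
    """Return a sorted list of (offset, description) for every signature found."""
    hits = []
    for magic, name in SIGNATURES.items():
        pos = data.find(magic)
        while pos != -1:
            hits.append((pos, name))
            pos = data.find(magic, pos + 1)
    return sorted(hits)
```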
It gets more complicated when compression and obfuscation come into play. Not every compression scheme can be immediately identified by some byte sequence, in which case it can take ages to reverse engineer the format until someone discovers the compression method by luck or by extensive research/testing. In those cases it's usually better to decompile the application (when possible) to observe what's going on. Alternatively you can look around in the game EXE to find strings and stuff that can at least point you in the right direction. Or even better, if the game folder contains a commonly used DLL for compression handling and whatnot.
First you need to know some basic audio file concepts, like what sample rate, channels, endianness, or codecs are. Also how to read and use a hex editor.
Then you start doing simpler before harder stuff, everything is just stepping stones, so to speak. So when tackling harder things you go "oh, this is just like --- and ---".
It helps if you know some programming, though it's not 100% mandatory. Those formats are made by programmers, so you can often guess what they are doing and why.
You also need to be at least a bit crafty and perseverant, as a person. I may not be able to tackle one format today (=I don't know enough yet), but in a year I could (seriously, that happens).
As for my actual methods to handle somewhat-complex new things:
- take a bunch of files (NOT ONE SINGLE FILE, THAT'S MUCH HARDER, people don't seem to get this), preferably from multiple platforms and multiple games. This decreases researching time by *a lot*.
- compare files via hex editor and find possible header (base) values. Like a 0xAC44 (44100), that's sample rate. Or a number that is often 2, sometimes 1, that's surely channels. While a block of data that always changes is probably codec data. But some value that is always 0x100 may be the position where data starts, or interleave, or something else.
- figure out codec data. Some data has a particular "look" but identifying that comes from experience, but also from common sense. A new game would use current codecs, and an old game old codecs, console games use console codecs, etc. Sometimes you can try a bunch of possible codecs until one works.
- as I get to understand (most of) the structure I'll usually make a simple .txt detailing: at position 0x00 is the sample rate, at position 0x100 data, etc. Then you take this info and program vgmstream to read and play the file (there is .txth for quick tests too).
- test a bunch of files (AGAIN, NOT A SINGLE ONE) and see if they sound ok. If not, go back to prev steps and figure out missing parts again.
- Some games just use common engines or platform's default formats, so it helps to have some knowledge of related things.
- for new, unknown things (like a new codec, or encrypted data) some advanced tricks are needed. If nothing resembling sample rate/etc is found, the header may be elsewhere, or data encrypted. If the files are mp3-small, then it could be a new codec similar to mp3. If the game has a "vorbis.dll", surely it's some Ogg/Vorbis obfuscation or variation. You can overcome those alone with experience, or reading other people's open source code and matching, but you may end up needing to decompile the executable and painstakingly try to guess what's going on in the CPU.
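The "compare a bunch of files" step can be partly automated. This is only a sketch under simple assumptions: every file has its header at offset 0, and `classify_offsets` is a hypothetical helper name. It reports, for each byte offset, whether the value is the same in all files or varies.

```python
def classify_offsets(headers, length=0x40):
    """For each offset, list the distinct byte values seen across all files."""
    # Never read past the end of the shortest file.
    length = min([length] + [len(h) for h in headers])
    report = []
    for off in range(length):
        values = sorted({h[off] for h in headers})
        kind = "constant" if len(values) == 1 else "varies"
        report.append((off, kind, values))
    return report
```

Constant offsets tend to be magic IDs or fixed settings, while an offset that only ever holds 1 or 2 is a good channel-count candidate. With a single file everything looks "constant", which is one more reason to collect several.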
That's it pretty much. One format may take 10-20min to add (literally) or 10-20h (also literally), for complex cases possibly much more.
And again, people that go "here is one 10kb file for this extremely complex format that took me 1 second to upload, please spend 10+ hours to add it, I didn't bother to try anything first, kthanksbye" are just WTF. Sometimes not even uploading anything!
Here's my memory of dealing with simple schemes a while ago, I strayed into advice by the end, too.
If I have a general idea of how games are put together, and how audio engines work, then I'll know where to look for the audio data. A lot of common audio encodings are visually recognizable when looking at a hex dump, headers may appear as interruptions in this pattern. A tool that shows word, byte or nybble distribution can help choose among different similar-looking PCM or ADPCM codes.
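A nybble distribution tool of the kind mentioned here can be very small. The following is a minimal sketch with a made-up function name (`nibble_histogram`); how to read the resulting shape for a given codec is left to experience, as the post says.

```python
from collections import Counter

def nibble_histogram(data):
    """Count how often each 4-bit value (0..15) occurs, high and low nybbles combined."""
    counts = Counter()
    for b in data:
        counts[b >> 4] += 1
        counts[b & 0x0F] += 1
    return [counts.get(n, 0) for n in range(16)]
```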
So then I'd try to decode a chunk of audio based on what it looks like, and work from what it sounds like: Is it just noise? Can I hear something under the noise? Is it garbled? Are there just occasional glitches? Wrong frequency? Does stereo sound off? Does it seem to cut off, repeat, or skip?
The cycle might then go back to visual; looking at the waveform in an audio editor might reveal a pattern in the timing or amplitude of glitches and repeats, or show how the signal wanders off DC, etc. Looking at the input data corresponding to errors could show new headers, the distance between errors could inform on the structure of interleave, etc. So adjust the decoder and repeat.
From more of a reverse engineering angle, dumping strings from the binary can show codec names, or debug symbols can give hints and maybe even match open source code. There are code patterns to look for a decoder in disassembly, like 16-bit clamping which may have 0x7fff and 0x8000 nearby.
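Dumping strings from a binary does not need special tooling. Here is a minimal sketch that pulls out runs of printable ASCII, similar in spirit to the Unix `strings` utility; the name `dump_strings` and the default minimum length of 4 are choices made for this example.

```python
import re

def dump_strings(data, min_len=4):
    """Return (offset, text) for every run of printable ASCII at least min_len long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]
```

Searching the results for codec or library names is usually the first thing to try.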
For bank file structure, look in the hex for obvious offsets near the start or end of the file, see where they point (relative to file start, or the current block, etc): Is it systematically near what looks like a sample start (especially after padding or silence), or another header? Figure out where samples and other headers are and search for those offsets in this or other files. Take note of apparent tables of structures (similar or increasing numbers seeming to repeat), if part of that structure doesn't seem to be obvious metadata try to guess what it could be for: If it varies it could be an aspect of each sample, if constant it might describe an aspect of the file structure that wasn't obvious.
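A crude way to shortlist "obvious offsets" near the start of a file is to list aligned 32-bit values that point somewhere later inside the same file. This sketch assumes offsets are relative to the file start, 4-byte aligned, and little-endian by default; none of that is guaranteed for a real format, and `offset_candidates` is a hypothetical name.

```python
import struct

def offset_candidates(data, table_span=0x100, endian="<"):
    """List (position, value) for aligned 32-bit values that could be offsets into the file."""
    size = len(data)
    hits = []
    for pos in range(0, min(table_span, size) - 3, 4):
        (value,) = struct.unpack_from(endian + "I", data, pos)
        # Keep values that point past their own position but still inside the file.
        if pos < value < size:
            hits.append((pos, value))
    return hits
```

A run of increasing values is a typical sign of an offset table; the next step is to look at what sits at each target.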
For all of this you need to know common patterns that might show up, so just reading tables on wikis is a good place to start! Importantly, try to think about why those patterns would have been used, to develop intuition for finding new patterns yourself.
I'm interested to hear from others, too!
[edit] I see there were a few posts before mine, thanks! I want to re-emphasize bnnm's point: Multiple files are key! More data is always better for understanding and testing. Errors will often reveal themselves when they manifest differently in different files.
Like bnnm, I'd also stress the importance of comparing as many files as possible. One thing I tend to do is to print out a single-row hex dump of some region of all the files I've got on hand. That way you get a good bird's-eye view of how and where data usually differs as well as possible values that may appear.
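The single-row dump could look something like the sketch below. It takes a mapping of file name to file contents (an assumption, to keep the example free of disk access) and returns one aligned line per file for the chosen region; `hex_rows` is a made-up name. `bytes.hex` with a separator needs Python 3.8 or later.

```python
def hex_rows(files, start=0, length=16):
    """files: mapping of name -> bytes. Returns one aligned hex line per file."""
    width = max(len(name) for name in files)
    rows = []
    for name, data in sorted(files.items()):
        chunk = data[start:start + length]
        rows.append(f"{name:<{width}}  {chunk.hex(' ')}")
    return rows
```

Printing the rows under each other makes columns that never change, or that only take a few values, stand out immediately.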
Aside from looking at the data in a hex dump, opening files as raw audio in Audacity can also be quite helpful to determine whether a file contains audio data or something else entirely. Although that could also be misleading with compressed or encrypted files, obviously.
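If you would rather script the "open as raw audio" test, you can wrap the unknown bytes in a WAV header with Python's standard `wave` module and listen to the result in any player. This is a sketch: the defaults (44100 Hz, stereo, 16-bit) are guesses to tweak, and the data is treated as little-endian signed PCM because that is what 16-bit WAV stores.

```python
import io
import wave

def raw_to_wav(raw, sample_rate=44100, channels=2, sample_width=2):
    """Wrap raw bytes in a PCM WAV container and return the WAV file contents."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)   # bytes per sample
        w.setframerate(sample_rate)
        w.writeframes(raw)
    return buf.getvalue()
```

Write the returned bytes to a `.wav` file and retry with other parameter guesses if it sounds like noise or plays at the wrong speed.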
Was in the process of doing some disk cleaning and found an old bit of text I had written for this thread originally. Figure I might as well post it instead of just trashing it... Maybe someone will find it to be... something.
I guess, for me, it all started around 2010, when I was dabbling in a lot of DAE (Digital Audio Extraction, which is a specific term that basically refers to CDDA extraction with an optical drive usually).
One day someone brought it to my attention that one of the discs I had ripped from the Final Fantasy VIII soundtrack cut off early by some samples. She had the same issue with her own rip and the same drive model as me and had already rectified it by adjusting the read-offset for the drive, so I re-ripped my own copy of the CD in question with her suggested drive read-offset adjustment.
But from my side it still didn't look good - there were supposed to be null (digitally silent) samples at the end of the disc image - yet, I was still seeing what appeared to be real, non-zero data at the end. At least, Adobe Audition didn't show any null samples. I sent over the new rip to her, and was told it DID look good.. in Sony Sound Forge. I was stumped, because that basically meant one of the programs/waveforms had to be lying.
Eventually, I got myself a decent hex editor and read up on the WAVE format to figure it out once and for all. Sure enough, Sony Sound Forge was in fact showing the "digital" truth, and as it turns out, Adobe Audition interpolates the waveform, and there's no way to disable it or even configure it.
Finally discovering a decent hex editor (HxD) got me quite intrigued about format structures and just being able to see pure data like that. I had done some dynamic web pages (a la PHP) in the past, so I had some very basic programming knowledge already, and had also done my first Tool-Assisted Speedrun in 2010, from which I learned a lot about data types. It was enough for me to write some scripts to manipulate and analyze WAV files. Nothing that hadn't already been done much better by others before, but it was fun writing, and definitely a learning experience.