Bio 0 "sound" folder discoveries (and a sample extractor) by Nisto at 11:55 PM EDT on April 27, 2014
Hi all. I made some discoveries about the "sound" folder from Bio0 (Resident Evil 0). I don't know if any of this was already known, but I hope it will be helpful to some anyway. The "demo" folder, which seems to have some more samples, is still a big unknown to me, so the script I wrote will not work on any other files from the game. That said, the files in that folder appear to have essentially the same layout of audio data as in the "sound" directory, only they seem to be compressed/encrypted. So I hope we can figure out how to decompress/decrypt those one day.
.arc = these can contain just about anything, but for the "sound" directory, all of them contain only "pool", "proj" and "sdir" files, respectively
.sam = samples
.son = sequencer data.. I think
If you're curious, you can extract the internal files from the .arc files with MarkGrass' excellent tool, BioFAT (SVN6 anyway).
The sdir files contain DSP header data (such as sample rate and coefficients) and some other data. They contain two tables, which are terminated by 0xFFFFFFFF. As for the pool and proj files, I can't figure those out. If someone knows though, enlighten me/us!
sdir structure:
Table 1 (each entry is 32 bytes)
16       No idea.
16       Reserved? This is always 0x0000.
u32      Sample's offset in the .sam file.
32       Reserved? This is always 0x00000000.
16       No idea. This is always 0x3C00.
u16      Sample rate.
u32      Total number of RAW samples.
u32      Loop start offset?
u32      Loop end offset?
u32      Pointer to entry for this sample in the next table.

Table 2 (each entry is 40 bytes)
u16      Size of the first chunk of the entry. This is always 0x0008.
48       Not sure what's stored here.
u16[16]  Coefficients.
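For anyone who wants to poke at table 1 without the PHP script, here is a minimal Python sketch based only on the field list above. Big-endian byte order is an assumption (it is not stated anywhere in this thread), and so is detecting the end of the table by a leading 0xFFFFFFFF. The field names and `parse_table1` are made up for illustration and are not taken from the extraction script.

```python
import struct

# One table 1 entry (32 bytes), following the field list above.
# ASSUMPTION: big-endian (">"); the thread does not state the byte order.
ENTRY = struct.Struct(">HHIIHHIIII")

# Hypothetical field names, in the same order as the list above.
FIELDS = ("unknown0", "reserved0", "sam_offset", "reserved1", "unknown1",
          "sample_rate", "num_samples", "loop_a", "loop_b", "table2_ptr")

def parse_table1(data):
    """Return one dict per entry, stopping at the 0xFFFFFFFF terminator."""
    entries = []
    pos = 0
    while pos + ENTRY.size <= len(data):
        # ASSUMPTION: the terminator sits where the next entry would start.
        if data[pos:pos + 4] == b"\xff\xff\xff\xff":
            break
        entries.append(dict(zip(FIELDS, ENTRY.unpack_from(data, pos))))
        pos += ENTRY.size
    return entries
```

Usage would be something like `parse_table1(open("bgm_0a.sdir", "rb").read())`, after which each entry's `sam_offset` and `sample_rate` can be used to cut the sample out of the matching .sam file.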
The sample extraction script can be downloaded here.
It's written in PHP (I know, it's a dumb language to use for a task like this, but it's so easy to use), so you'll need PHP. I'm sorry if it's any trouble. Getting the script running is not a lengthy process, even if you're unfamiliar with PHP. Simply get the binary package (windows.php.net for Windows users), unpack it anywhere, then, in a command prompt/terminal use the CD command to change the current working directory to the directory in which you extracted the binaries.
To actually run the script:
php [path...]script_name.php [path...]bio0_sound_dir
Example:
php C:\SamExtract.php C:\bio0\sound
If you come across a bug, please do notify me! I honestly wasn't sure if I should share the script in the first place, due to the harsh criticism it may or may not receive. This is the first time I've shared the full source code of a project, so please go easy on me.
Finally, some important things to note...
1: The first six samples of voice_1.sam contain only null bytes, and very few of them. As a result, your player (foobar2000 in my case) may crash if you try to open them, so please avoid these samples if possible!
2: The script won't add any loop values to the header [yet] as I'm not sure if those are standard loop start/end offsets in the sdir table (or if they're even loop values at all). Mainly because some of them don't make sense to me, like here for example -- the start offset (assuming that's what it is) is higher than the end offset(?) in a lot of cases. I also don't think they're offset+length pairs. So, if anyone figures it out, certainly do let me know, or post an update of the script even.
I don't know what those files look like. Although, if they're only .samp files without any complementary files (like .arc in the case of Bio0), then I don't think I can help you. Can you upload one of them?
Easy-peasy. This will extract the Mario samples as well: SampExtract.php
Use it just as described in the OP, only, supply the path to the directory with the Mario sdir/samp files instead, obviously.
The other files seem to contain names for each FX or something.. You'll have to figure that out yourself. But I hope simple numbered filenames will do.
Also, I noticed that some of these files, vgmstream simply will not play. You can likely use DSPADPCM (from the Wii/NGC SDK) to properly convert those to WAV though.. if you don't mind not preserving the format.
Anyway, yeah, those will do until loops are discovered. But with Super Paper Mario already rippable with BrawlBox that MIGHT help, because it ports some sounds from PM:TTYD.
EDIT: I think I found where the loop offsets would be. 2 offsets after the number of samples, I think.
Well, a while ago, I ripped some sound effects from SPM with BrawlBox. I discovered the loop because the sample for the sound where Mario is blasting off (or when an item is being thrown by the audience) has the same loop as a similar sound in SPM, and found the loop there. That loop was 7588, I converted it to hexadecimal, and discovered the loop offset.
EDIT: I don't know what the offset right after the sample offset is, but it could be for when the loop points are larger than FF FF in hexadecimal (?), although I don't know of any samples whose loops would last longer than that.
When you say "that loop was 7588", do you mean it had 7588 samples? Or bytes? Or something else?
Either way I think it's just a coincidence. The two variables marked in the picture in the OP are most definitely "connected". Notice how, if the right variable is 0, the left is always 0 as well. We need to figure this out! :)
7588 samples, I meant. I don't think you're getting what I mean.
If the number of samples in the song is "Total number of RAW samples", then I'm guessing the loop point is both "Loop start offset?" and "Loop end offset?", but it's after the number of RAW samples. Get that into decimal, and you got yourself your loop samples. The loop point is offset (h) 16 and 17 (dunno about 14 and 15) in HxD.
It doesn't work that way. A loop start offset cannot be the same as the loop end offset (unless what you actually meant to say is that "loop end offset" could be the last sample, which is possible, but then again what about the other 4-byte value?). The values will be different if there's a loop context. I know you meant that you think the loop values are 2 bytes, and at 0x16, but again, it doesn't make sense, because when the right value is 0, the left will be 0 too. And the right variable IS larger than 0xFFFF for some samples (just look through the sdir file, and you'll find some), so I doubt the values are 2 bytes.
I'm 99% sure these values are 4 bytes. But they don't seem to be standard start/end offset, nibble-expressed loop values (because, as mentioned, the left value will sometimes be larger than the right, and obviously vice versa..).
Since I'm having a hard time explaining, the highlighted part is the loop offset in hexadecimal. Keep your eye on each row. If the sample has 00 00 there, it probably either loops from the beginning, or doesn't loop at all. I don't know if there is a value to allow it to loop, and I think the "number of RAW samples" may be the same value as "Loop end", because for most samples I can't find identical values. Although the sample I highlighted, I believe, was one of the channels for "The Final Hall" ambience at the room before you fight Grodus.
Once again: I know you meant that. But I simply do not think the loop values would be 2 bytes.
Here's something interesting for you... https://dl.dropboxusercontent.com/u/48454461/img/bgm_0a.PNG
This is bgm_0a.sdir from Biohazard 0. The structure of the sdir format between Bio0 and PMTTYD is identical.
EDIT: I think I figured it out. They're indeed 4-byte values. Both are expressed in raw samples (unlike the std, which is expressed in nibbles). The left is the start offset, and the right is the length of the loop.
Take the example above the entry that's marked: it has 0x133A5 raw samples. Adding the two values to the right (0x108EF and 0x2AB5) becomes.. 0x133A4!
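To spell the arithmetic out, here is a tiny Python check using only the three values quoted above. `loop_end` is just an illustrative helper name. Note that the sum is one less than the total sample count, which would fit a loop that runs up to the last sample.

```python
def loop_end(start, length):
    """Last sample of the loop, with start and length both in raw samples."""
    return start + length

total_samples = 0x133A5            # "Total number of RAW samples" of the entry
end = loop_end(0x108EF, 0x2AB5)    # the two values to the right of it

print(hex(end))                    # 0x133a4
print(end == total_samples - 1)    # True
```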
Now I just need to figure out a proper way to convert raw samples to nibbles.. (samples / 1.75) * 2 kinda works, but it'll be off by one or two nibbles in most cases :/
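A possible reason the 1.75 approach drifts: in standard Nintendo DSP ADPCM, data is stored in 8-byte frames of 16 nibbles, where the first 2 nibbles are a frame header and the remaining 14 hold samples. The sketch below uses the frame-based conversion commonly used for that layout. It has not been tested against these sdir values or the extraction script, so treat it as an assumption, and the function names are made up.

```python
SAMPLES_PER_FRAME = 14   # sample nibbles per 8-byte DSP ADPCM frame
NIBBLES_PER_FRAME = 16   # 2 header nibbles + 14 sample nibbles

def sample_to_nibble(sample):
    """Nibble address of a raw sample index, skipping each frame's header."""
    frame, remainder = divmod(sample, SAMPLES_PER_FRAME)
    return frame * NIBBLES_PER_FRAME + remainder + 2

def approx_nibble(sample):
    """The (samples / 1.75) * 2 approximation, truncated to an integer."""
    return int(sample / 1.75 * 2)

# Compare both conversions for the first few samples.
for s in (0, 13, 14, 27, 28):
    print(s, sample_to_nibble(s), approx_nibble(s))
```

The approximation ignores the 2 header nibbles inside the current frame, which would explain being off by a nibble or two.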
And anyway, I actually think vgmstream doesn't support looping when the start offset is larger than however many null nibbles there are at the start for some reason (I could be wrong though - I don't know what a lot of the DSP values really are to be honest, like what's coefficients? predictor? scale?), so I'm not sure what to do here..
Bio Rebirth uses the same format, as well, only the data is stored in the SND archive files.
For reference, the SON, SDIR, etc data is created by Factor 5's MusyX SDK. If you want to reverse-engineer it, I support you 100%... more information could probably be found in the Dolphin/Revolution SDKs.
Oh, and that version of biofat is highly outdated (SVN 6). biofat is now hosted at RHDN.NET
http://www.romhacking.net/utilities/1019/
...update coming soon, with support for both Bio0 and Rebirth reimplemented.
Oh, okay. I guess I'm gonna have a look at those SND files on it. Maybe I can modify the script to support those as well. Assuming they're like the sdir files?
Yup, I know SVN6 is old, but it's the last version that supports Bio0 and 4 files, so... Anyway, good to hear you'll reimplement them!
Do you have the MusyX SDK? Where can I find it? (If you wanna go PM, I'm on VGMdb, The Horror Is Alive, FFShrine and some other places)
The loop sample number is two bytes right of the loop start point. Add those two and you get your loop end. You won't always have the same loop end point as the number of samples.
The script has been revamped. It can now extract samples from at least:
Biohazard (Resident Evil)
Biohazard 0 (Resident Evil 0)
Paper Mario: The Thousand-Year Door
Star Fox Adventures
A format must now be specified though - just pass the filepath of the script as an argument to PHP (without additional parameters) for usage instructions.
Loop values are still not added, and the issue with some DSPs not playing remains (not an issue with Bio1 and Bio0 though, all samples play), so feel free to contribute any fixes or ideas if you can!
Also, I noticed something new about the sdir format. The bytes between offsets 0x04 and 0x08 (probably two 16-bit values) in the second table seem to be non-zero only when there's a loop context. So perhaps they have to do with one of the possible loop context header values.. ? The 2 bytes before these two values do not appear to be related to looping context. I don't know, maybe someone more experienced with audio formats can give a hint as to what the values may be?
Updated the script. Just a small bugfix (nothing that affected the extraction process, just some undefined variables that were being passed to the custom error function), as well as an addition to the tested games. The link is the same.
If anyone has used it with any other games besides the ones listed in the script/usage instructions I'd appreciate it if you could document it here.
Started reverse engineering the game a few days ago.. Happy to say I am making progress in locating the various routines used to read the ALZ files (e.g. those in the "demo" folder). The value at 0x1 is a 32-bit little-endian value, which is the decompressed size. And I'm pretty sure at this point that the byte at 0x0 is a version byte or a decompression type/level, as each type (0, 1 and 2 that I can see so far) goes through separate branches.
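As a small illustration of the header described above, here is a Python sketch that reads the type byte at 0x0 and the 32-bit little-endian decompressed size at 0x1. The sample bytes are the ones RangerRus quotes later in this thread; `read_alz_header` is a made-up name, and nothing here touches the compressed data itself.

```python
import struct

def read_alz_header(data):
    """Return (type, decompressed_size) from the first 5 bytes of an ALZ file."""
    if len(data) < 5:
        raise ValueError("need at least 5 bytes for an ALZ header")
    alz_type = data[0]                            # byte at 0x0
    (size,) = struct.unpack_from("<I", data, 1)   # little-endian u32 at 0x1
    return alz_type, size

header = bytes.fromhex("0279050000")
print(read_alz_header(header))   # (2, 1401), i.e. type 2, size 0x579
```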
However, now I've learned that the data in at least option_e.lz (option.lz on the Japanese release) might also be encrypted/encoded somehow. Here's a sample of the decompressed data (I just snagged this from RAM):
EDIT2: Oh, I'm dumb... It's not encoded! It's just custom code points for each character after all... But it didn't seem to make sense to me at first. Blah. Anyway, I will post a list of the codepoints later on. It's not related to any sound data, but hey, it might help in the future maybe.. I thought it'd be easier to first look for data that I know it'll read, and when it'll read it. Now it seems I just need to understand how the whole decompression routine works and perhaps write a script or something.
EDIT3: Here is a list of character codepoints: https://dl.dropboxusercontent.com/u/48454461/misc/bio0char.txt. Also, I just learned the hard way, that, even though there are a whole bunch of texture files containing glyphs, all characters (except in menus and whatnot) are actually hardcoded into the main executable! You can find the data at 0x258FCC in the executable of the US release. It seems they use a format similar to what is demonstrated in the Dolphin SDK.
I have managed to write a decoder for the format in Python, but... there's something wrong. Some codepoints (in the case of option_e.lz) are not in their right place and I have tried to locate the error in my code for hours without luck.
I am wondering if someone here is willing to have a look at the assembly code (PowerPC) and try to locate the error? I have an IDA Pro database file with lots of comments that I am willing to share, and I can share the Python script too.
To give a perspective, here is a comparison of the in-game decoded data (RAM) and the result of the Python script both converted from the original codepoints to ASCII text:
So I finally figured out why it didn't come out right. I hope I am not the only one excited for this, as it took me over a week of practically non-stop reverse engineering and coding. Oh yeah, and I should mention that right now it only supports type 2 ALZ files. I plan on implementing support for type 1 soon, don't worry (it doesn't seem like any ALZ file in Biohazard 0 is type 1 anyway).
And for anyone interested, here is a Python converter for the custom codepoints (mainly seen in message.arc). It currently only supports ASCII characters, but anyone can feel free to add some of the remaining characters to it and post an update (although tedious, it should be easy to do even for non-programmers). Keep in mind that it does not expect any offsets at the beginning of the file (like option_e.lz has) though, so to get it to properly convert the codepoints you must trim any data that's not relevant from the beginning/end of the input file.
Hi Nisto, thank you very much. The Resident Evil 0 ALZ format is finally solved; I have been waiting more than four years for this. I want to change the enemy data, but there is a problem: although your tool can unpack ALZ files, they cannot be compressed back. I very much hope you can come up with a compression tool. Once again, thank you for your hard work.
@shenghua8848: I managed to get back in contact with Mark Grass (developer of "Biofat"). No promises, but I think if I shared my source code with him, that he might be able to write a compressor. Modding is not really my area of interest, so I am not motivated enough to figure out how to compress it back (it can't be as easy as doing everything in the reverse order? :P), but I think it is Mark's. As I mentioned on XeNTaX, I've never worked with compression algorithms before, so it's difficult for me. But fingers crossed. :)
@Nisto: I know the compression process is difficult, but right now no one except you and MarkGrass is willing to do this, and I hope you and MarkGrass can come up with a compression tool. I have been waiting for a long time, and we will not forget your efforts.
Don't think there will be any use of it, and I haven't even been able to test it, but here is an update with support for type 1, and some code optimizations:
EDIT: removed -- 0.2 contains an integer reading bug, please see page 3 for a link to version 0.3!
Still no support for compression. Trying to get a response from Mark to see if he's willing to try.
@Nisto: alzdec03 is perfect, I have not found any obvious problems. If a compression function could be added, it would be complete.
LZ compression by RangerRus at 12:03 AM EST on January 11, 2015
Hi, Nisto. Thanks for your work on this subject. Would you mind explaining in a few words how the sliding window works in this 'a'lz, and what the bits mean? For example:
02 79 05 00 00 01 02 04 18 a6 1d 10 d1 5f d1 9f
{0x00} = 02 - type
{0x01-0x04} = 0x579 - decompressed size
{0x05-...} = {01,02,04,18...} - compressed data
How do I properly read the bits in the compressed data part?
It compiles with at least GCC 4.8.1 (MinGW) on Windows. Haven't tried with other compilers as I only have GCC.
For those wondering, this version only adds some additional safety checks and messages.
Mark Grass hasn't responded in over a week and it seems like he's read my PM at this point, so I don't know, I guess he isn't up for trying to implement a compressor. I hope someone else will give it a try.
Also, I would appreciate any feedback on the source code as this is one of the first C programs I've written. If I am doing something wrong/redundant/whatever, I would love to know how I can improve it. Thanks!
I have updated the codepoint converter to support all characters in all versions. Each version changes the character mappings, so I've had to implement a table for each release (trial, JP, US), hence the now enormous script file size by the way. This has taken me a great amount of effort to compile as each character has required me to manually extract it "by eye" (naturally some automation and OCR was involved, but it has still added up to a lot of time due to proofreading among other things), since each codepoint just maps to a glyph and that's it (there's no consistency with standard encodings or anything). So here's to hoping it'll be put to good use.
Maybe in the future I'll add the ability to convert characters back to codepoints. If there's any requests for that...
Terrible news (and good for some I guess)... The codepoints (at least in the JP release) change for some reason at various places. For example, if you save at the first typewriter (cabin), you'll notice two characters are off in the converted text (列車2程個写) when comparing to what the actual game shows (列車2等個室). This is not an error in the character mappings in the converter (I have triple checked the two codepoints in question). So I have two theories:
1: the game loads the font textures needed (apparently it's only in the US release that the codepoint textures aren't loaded) based on the characters in either each string or each file
2: the game loads the needed font textures for the option menu (which is where I've been iterating each codepoint) when the menu is loaded
So I figured, one way I might be able to figure it out is by replacing the codepoints in the actual game image... This made me remember that there is actually a way to replace compressed files with uncompressed data, for those who want to mod this game. Just set the first byte to 0 and the integer (4 bytes) at 0x1 to the (uncompressed) size (remember that it's little-endian). Obviously you might be limited in what you can do with this method, but it's a start. Also, this is actually supported by the game: it does check if the data is uncompressed.
So anyway, I've manually replaced the codepoints within the actual image and.. the game still shows the unexpected characters, so I guess it must be somehow related to which "area" is currently loaded. So I guess there's no accurate way to convert the codepoints without knowing which set of characters is to be used, or by guessing :shrug:
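The "store it uncompressed" trick can be sketched in a few lines of Python. This assumes the raw data follows immediately after the 5-byte header (at 0x5, the same place the compressed data starts in the other types), which is not spelled out above; `wrap_uncompressed` is a made-up name.

```python
import struct

def wrap_uncompressed(payload):
    """Prefix raw data with a type 0 ALZ header: 0x00 + little-endian u32 size."""
    # ASSUMPTION: the payload starts right after the header, at offset 0x5.
    return b"\x00" + struct.pack("<I", len(payload)) + payload

wrapped = wrap_uncompressed(b"abc")
print(wrapped.hex(" "))   # 00 03 00 00 00 61 62 63
```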
The last few days, I've been working on the original ALZ decompression tool I wrote, to also support compressing, since it's been requested once or twice. Now I truly understand the algorithm (and compression concepts in general) and have finally finished writing a successor with compression support. It can (de)compress any known ALZ type (0-2), and the word compression algorithm even seems to be more efficient.
Blargh. Now it's the first thing I see.. Having been awake for nearly 24 hours, I was quite tired last night though (<-[open] that's a legitimate excuse and not something I'm making up !! [close]->), so I hope it's forgiven.. :P
I'm not sure what you're asking. Are you saying that you never tested alzdec, or the new tool? alzdec could decompress any ALZ type as well, so alz-tool shouldn't really bring up anything new from the original game data. Basically, all this tool brings is compression support.
If you're wondering about modding in general, I can confirm that the game itself reads any output from alz-tool without problems.