rtcvb32: Well taking a raw look, there's an awful lot of zeros. About what i expected.
clarry: That doesn't answer the question. What are the zeros there for, what do they represent? In all likelihood they are a part of something and not just "empty unused space." Zero is valid data too! And it happens to be very common data.
Zeros near other values certainly, but by itself generally not.
I am assuming an object has something like 100 properties, and when a property is blank it's stored as 0. Putting all of them there whether they are used or not might or might not work.
clarry: Anyway, turns out I own the game and decided to take a look inside. While reverse engineering that binary format in 15 minutes is beyond my capabilities, I can make this observation: the file contains lots of recurring references to things like Chandelier, WallTorch, LightEffect, Teleporter, Destination, SetOfTiles*, SpikesSet*. Thousands of these. Which suggests to me that the file contains the entire scene with all its objects losslessly serialized.
Most likely they are named references to objects, which are probably going to be in the sharedassets. Levels, from what I see, are every page and scene in the game, from the game menu and options to each level in the game. As such, each object is literally just a tile or object placed in space.
At least if my experience with Morrowind ESP files is anything to go by.
These details would have to include 3d points, scale, rotation, zoom, bounds checking model, and any other details it thinks it needs, including scripts (AI and otherwise). 99% of the mapped landscape/level layout is going to be static tiles/blocks which require no interaction.
clarry: Level15 appears to contain approximately 10k of such objects. I took a quick peek at a playthrough of the game on YouTube, and while it is not very easy to get the sense of the levels' scale, I can tell that they are quite large and can easily consist of that many tiles (and objects & effects contained within said tiles).
If all objects with all their properties and relations and references to assets are serialized in binary, the size looks quite sensible. A few kilobytes per object times 10k is how you get to a few tens of megabytes.
Yes, but you are still getting 200-300 0's in a row, which is why you'd go from 62MB to 1.5MB when compressing.
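To put a rough number on the "long runs of zeros compress well" point, here is a minimal sketch. The record layout (100 four-byte properties per object, only three of them set) is invented for illustration and is not the real level format, so the ratio it prints says nothing about the actual 62MB file.

```python
import zlib

# Hypothetical layout: 100 four-byte properties per object, mostly zero.
# This is an assumption for the demo, not the game's real format.
PROPS_PER_OBJECT = 100
OBJECT_COUNT = 10_000

def make_object(index: int) -> bytes:
    record = bytearray(PROPS_PER_OBJECT * 4)             # all zeros by default
    record[0:4] = index.to_bytes(4, "little")            # object id
    record[4:8] = (index % 7).to_bytes(4, "little")      # some type tag
    record[200:204] = (index * 31 % 1000).to_bytes(4, "little")  # one stray property
    return bytes(record)

raw = b"".join(make_object(i) for i in range(OBJECT_COUNT))
packed = zlib.compress(raw, 9)

ratio = len(raw) / len(packed)
print(f"raw: {len(raw)} bytes, deflated: {len(packed)} bytes, ratio: {ratio:.1f}x")
```

The exact ratio depends entirely on the made-up layout; the point is only that records which are mostly zero padding shrink dramatically under a general-purpose compressor.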
clarry: While I personally would not implement a game this way, I don't see anything wrong with it per se. The zeroes would be part of these serializations -- possibly including any padding bytes, if the goal is to make these files fast and easy to read & write without any marshalling (i.e. just copy the blob -- possibly even mmap the file and start using the memory as-is!).
Disk space is cheap and losslessly dumping objects in and out is fast and simple. Developer time is *not* cheap and I can see why someone would not spend it optimizing the data files, especially given how well entropy coding compresses it for distribution anyway.
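A small sketch of the "just copy the blob, or mmap the file" idea described above. The 64-byte record (id, type tag, x/y/z, then zero padding) and the `level.bin` file name are hypothetical; nothing here is taken from Unity's actual serialization.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: id, type tag, x/y/z position, then
# 44 bytes of zero padding so every record is exactly 64 bytes.
RECORD = struct.Struct("<II3f44x")

def write_records(path, records):
    with open(path, "wb") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))

def read_record(mapped, index):
    # No parsing pass: compute the offset and unpack straight from the mapping.
    return RECORD.unpack_from(mapped, index * RECORD.size)

records = [(i, i % 3, float(i), 0.0, 1.5) for i in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "level.bin")
    write_records(path, records)
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        file_size = len(mapped)
        sample = read_record(mapped, 500)

print(file_size, sample)
```

Random access to record 500 needs no decoding of the 499 records before it, which is the speed and simplicity being traded against the padded file size.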
Space may be cheaper now, but extracting a 50MB archive to 530MB is still disconcerting, especially if it's, say, an SSD with limited writes.
Now if Unity had done what a number of engines do, which is to let you access a zip file directly, then the 5-10% of space lost to inefficiency vs, say, 7z would be insignificant.
Sorry, Space and efficiency is one of my big OCD's.
Ever written your own huffman code, store and load it and compress/decompress data on the fly?
clarry: Nope. But I have reverse engineered data files in games, including writing a decompressor for the RLE compression in Killing Time and using that to uncompress & extract all the data files out of the archive.
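For readers who haven't met RLE, here is a toy run-length codec. It assumes a simple stream of (count, value) byte pairs; that scheme is an assumption made for this example and is not the format Killing Time actually uses, which isn't described in this thread.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode as (count, value) byte pairs, with runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    """Expand (count, value) byte pairs back into the original bytes."""
    if len(data) % 2:
        raise ValueError("truncated RLE stream")
    out = bytearray()
    for i in range(0, len(data), 2):
        count, value = data[i], data[i + 1]
        out.extend(bytes([value]) * count)
    return bytes(out)

sample = b"\x00" * 300 + b"AB" + b"\x00" * 10
encoded = rle_encode(sample)
print(len(sample), "->", len(encoded))
```

Long zero runs collapse to a couple of bytes each, while isolated bytes like `A` and `B` double in size, which is the usual weakness of this naive pair scheme.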
Don't see the need to write my own huffman implementation given that there are plenty of free & open source implementations out there (including easy to use libraries). Also, if I wanted entropy coding, I would not waste time on old huffman. Arithmetic coding is much better.
It can be fun to implement (though the sliding window is much simpler), but storing any kind of data means you're creating some impromptu format that is unique and not recognized by any other software; at least that is what I was aiming towards. So the 'Unity internal format' that eric5h5 was scoffing at me about... well, you get the idea.
Regardless, arithmetic coding is a terrible name for it, as it doesn't do math or arithmetic at all; instead it just creates additional symbols that have a hidden length element attached to them. So if you encoded Hello and Helol, the compression on Helol wouldn't be useful at all, while it would still try to store the data for 1 L and 2 L's (unless it's smart and knows it never finds 2 L's in a row).
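I'd think arithmetic coding could just multiply values together (and you use modulus to restore them); with enough symbols together the combined value is smaller than storing the individual symbols (32^5 vs 26^5, i.e. five letters at 5 bits each cover 32^5 combinations when only 26^5 are needed).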
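The multiply-and-modulus idea can be sketched as plain base-26 packing. This is only an illustration of the 32^5 vs 26^5 arithmetic, assuming a lowercase-only alphabet; the `pack` and `unpack` helpers are made up for this example.

```python
import string

ALPHABET = string.ascii_lowercase   # assumed alphabet: 26 lowercase letters
BASE = len(ALPHABET)

def pack(word: str) -> int:
    """Multiply the running value by 26 and add each letter's index."""
    value = 0
    for ch in word:
        value = value * BASE + ALPHABET.index(ch)
    return value

def unpack(value: int, length: int) -> str:
    """Peel letters back off with divmod, last letter first."""
    chars = []
    for _ in range(length):
        value, digit = divmod(value, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))

word = "hello"
packed = pack(word)
fixed_bits = 5 * len(word)                           # 5 bits per letter
packed_bits = (BASE ** len(word) - 1).bit_length()   # bits for any 5-letter word

print(packed, unpack(packed, len(word)), fixed_bits, packed_bits)
```

Packing this way saves one bit over five 5-bit symbols (24 vs 25), and "hello" and "helol" cost exactly the same. Actual arithmetic coders go a step further and give likelier symbols a larger share of the number range, which this sketch does not attempt.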