An Introduction to Video Game Data Mining

Hi, I’m Ehm. One of my main hobbies is searching through video game data, looking for unused and hidden assets. While trying to decide what to write about first on Source Gaming, I eventually landed on a tutorial/introduction/opinion piece of sorts. What better to write about than something you’re familiar with?

I personally classify hacking/reverse engineering and data mining as two different things. I jokingly describe myself on Twitter as a “video game archaeologist”, but I kind of mean it. Archaeologists use tools to find things, and what I do isn’t too different from that. Data mining, in a very literal sense, is like picking away at game data. Hacking and reverse engineering are a much more involved process, however, and often require a high level of technical knowledge and logic. That’s not to say that data mining is a super easy process, but I believe that what I do is not special. I took some programming classes throughout high school and college, and it was never something that really clicked for me. That’s why I firmly believe that anyone can data mine with some time and practice. Some people are not cut out to do reverse engineering though, and I’m one of them.

I don’t remember exactly how I found The Cutting Room Floor over half a decade ago, but I do remember two distinct events that defined my interest in discovering secrets in games. The first was a level editor for New Super Mario Bros., originally written by Ninji. I stumbled upon this editor back in 2009, when Nintendo DS emulators and flashcarts were quite abundant. The notion of creating and playing custom levels in a (nearly) brand new Mario game was incredibly appealing. After all, Super Mario Maker wouldn’t exist until many years later. The second key event for me was sifting through text dumps of Harry Potter and the Sorcerer’s/Philosopher’s Stone and Chamber of Secrets for Game Boy Color. A couple of beloved turn-based RPGs from my childhood made for a good start on TCRF. A friend had done some hacking for both games, and gave me complete text dumps. I played through the games again while noting which lines appeared and which didn’t. However, there was a slight issue I didn’t realize at the time. The UK English and US English text was intertwined, and being Canadian, I naturally played through using only the UK English setting. Some lines were only used in the US English setting, so I mistakenly documented them as unused. A shaky start for sure, but a decent one I think!

There are a few things that I think are important to data mining successfully.

Equip yourself with the right tools. A hex editor is absolutely essential. I personally use HxD, but there are plenty of great editors available online. I’ve seen many people determine a file type based on its extension (.png, .txt, .bin). Developers can put any extension they want on a file, so most of the time the extension is meaningless. A hex editor allows you to see the raw data of a file, and get a proper look at what’s actually under the hood.

Know what you’re looking at. A PNG file will always have “‰PNG” and “IHDR” near the beginning of the file (or in hex, 89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52). If you find a .png file without this data at the beginning, it’s likely either not a PNG file, or it has compression or encryption applied. Over time, you’ll become familiar with different file headers and data patterns, and it will be much easier to ID things quickly in a hex editor.

Be resourceful. If you’re looking into a fairly popular game, chances are someone out there has looked into it already. Look online for useful tools or information about the game’s data.

Be persistent and patient. If you don’t find anything interesting, move on and check back later. I’ve revisited numerous games and discovered new stuff. You might’ve overlooked something in the past, or maybe you learned something new that led to a fresh discovery.
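If you’d rather not eyeball every file in a hex editor, the PNG header check described above can be roughed out in a few lines of Python. This is only a minimal sketch, not a proper file identifier: the function names (hex_preview, looks_like_png, check_file) are made up for illustration, and it only knows about the one 16-byte pattern mentioned in this article.

```python
# The 16 bytes quoted in the article: the 8-byte PNG signature, followed by
# the length (00 00 00 0D) and name ("IHDR") of the first chunk.
PNG_HEADER = bytes.fromhex("89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52")


def hex_preview(data: bytes, count: int = 16) -> str:
    """Format the first `count` bytes the way a hex editor would show them."""
    return " ".join(f"{byte:02X}" for byte in data[:count])


def looks_like_png(data: bytes) -> bool:
    """Return True if the data starts with the PNG signature and IHDR chunk."""
    return data.startswith(PNG_HEADER)


def check_file(path: str) -> bool:
    """Read only the start of a file and test it, ignoring its extension."""
    with open(path, "rb") as handle:
        return looks_like_png(handle.read(len(PNG_HEADER)))
```

Calling check_file on something like a hypothetical "mystery.bin" returns True when the file starts with the PNG header, no matter what its extension says. A False result doesn’t tell you what the file is; as noted above, it may be a different format entirely, or a PNG with compression or encryption applied.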

Sometimes developers will talk about their time making games, and what kind of experiences and challenges they faced. The full picture isn’t always revealed though, and that’s why I think data mining is such an important aspect of telling those stories. It’s fascinating to get some insight into what could’ve been included in a game. Similar to behind-the-scenes features for films, you get sort of a “backstage access” vibe from peering into a game’s data. Unused assets will often tell a story about the game’s development in some way. Sometimes, you might even find a hidden developer message that will literally tell you about the game’s development. Even something as simple as a project name used throughout a game’s data can reveal part of how a developer works, once many are compiled together.

Don’t be afraid to give data mining a try on your own. It’s easy to find suitable tools and help online, and you might just make a big discovery one day in one of your favourite games. What are you waiting for? Get digging!