The code behind localizing Japanese games
Sara Leen wrote about her work as a Japanese to English localization programmer for XSEED Games. I was interested in learning more and Sara was kind enough to share her experience.
She modifies games and builds toolchains so translators can work in their tool of choice… Excel.
It’s a unique form of software development that involves debugging, low-level programming, and reverse engineering.
So how do games store text?
Source code
To celebrate the 10th anniversary of the gravity flipping platformer VVVVVV, Terry Cavanagh open sourced it. If we dig a little, we learn that dialog and menu text are stored directly in the source code.
After searching for “I wonder why” we find all the dialog in Scripts.cpp, mixed in with script commands.
We could prepare it for a translator by:
- Identifying code that looks like dialogue or menu text
- Ignoring code like cutscene(), squeak(player), and endtext
- Extracting text with line numbers into a spreadsheet
Line Number | Text |
---|---|
559 | I wonder why the ship |
560 | teleported me here alone? |
566 | I hope everyone else |
567 | got out ok… |
After a translator edits the spreadsheet the translated text needs to be reinserted into the source code. Sara recommends building tools to do this.
If the project you’re working is at any sort of scale that makes that an hours long job? Do not do it manually. Please.
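The extract and reinsert steps above could be automated along these lines. This is only a minimal sketch, not the tooling Sara uses: it assumes every piece of dialog is a double-quoted string literal on its own line, and the names `extract_rows`, `write_sheet`, `reinsert`, and the `COMMAND` heuristic are all hypothetical. The sample lines are made up in that assumed shape and are not the actual contents of Scripts.cpp.

```python
import csv
import io
import re

# Assumption: a script line is one quoted string literal, optionally followed by a comma.
STRING_LITERAL = re.compile(r'^\s*"(?P<text>[^"]*)",?\s*$')
# Assumption: a command is a single word, optionally followed by parenthesized arguments.
# One-word dialog such as "Yes" would be misclassified, so review the output by hand.
COMMAND = re.compile(r'^\w+(\(.*\))?$')


def extract_rows(source_lines):
    """Return (line_number, text) pairs for literals that look like dialog."""
    rows = []
    for number, line in enumerate(source_lines, start=1):
        match = STRING_LITERAL.match(line)
        if match and not COMMAND.match(match["text"]):
            rows.append((number, match["text"]))
    return rows


def write_sheet(rows):
    """Serialize the rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Line Number", "Text"])
    writer.writerows(rows)
    return buffer.getvalue()


def reinsert(source_lines, translated_rows):
    """Replace the literal on each listed line with its translation."""
    result = list(source_lines)
    for number, text in translated_rows:
        original = result[number - 1]
        match = STRING_LITERAL.match(original)
        if match is None:
            raise ValueError(f"line {number} is not a string literal")
        start, end = match.span("text")
        result[number - 1] = original[:start] + text + original[end:]
    return result


# Made-up sample in the assumed shape.
source = [
    '"squeak(player)",',
    '"I wonder why the ship",',
    '"teleported me here alone?",',
    '"endtext",',
]
rows = extract_rows(source)
sheet = write_sheet(rows)
patched = reinsert(source, [(2, "Pourquoi le vaisseau")])
```

A real tool would also need to escape quotes in the translated text and cope with translations that need more lines than the original.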
In some games the text is spread across tons of files. Finding the dialog and menu text becomes very difficult and important context can be missing.[1]
Sometimes you won’t even know which character is speaking unless you actually play through the game while you’re working on the script. And obviously this poses some interesting challenges because in general Japanese is not as context heavy and context apparent as English is. You can’t always tell who’s talking just by the tone of their voice.
Embedding dialog in the code also makes it more difficult to support multiple languages. You might need a separate build of the game for every language you support.
There’s an open pull request that creates a script file system for VVVVVV if you want to check it out. Let’s take a look at that approach in another game.
Script files
Corpse Party is a horror game that describes cutscenes, dialog, and gameplay behavior in script files. These live outside the source code, in compressed archives[2] stored separately from the executable.
Here’s an example of a plain text script[3] with Japanese dialogue.
SCRIPT4,0
fade:in,5,red
showface:01_kenshiro_e.bmp
showtext:"ケンシロウ","お前はもう死んでいる ","","",
showface:none
wait:20,OPEN
end
END_SCRIPT
Scripts can have commands like:
- Fading screen in/out
- Playing music
- Moving characters
- Displaying dialogue
They provide context for translators and let game developers describe cutscenes and behavior without modifying source code.
If you’re lucky you might be able to modify the script and have the game load it in.
SCRIPT4,0
fade:in,5,red
showface:01_kenshiro_e.bmp
showtext:"Kenshiro","You are already dead.","","",
showface:none
wait:20,OPEN
end
END_SCRIPT
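A swap like the one above can be scripted once the format is known. Here is a minimal sketch, assuming a showtext line looks exactly like the example (the command name, a colon, then comma-separated double-quoted fields). The real Corpse Party format may well differ, and `translate_showtext` is a hypothetical helper.

```python
import re

# Matches one double-quoted field, capturing its contents.
FIELD = re.compile(r'"([^"]*)"')


def translate_showtext(line, translations):
    """Replace the quoted fields of a showtext line using a lookup table."""
    if not line.startswith("showtext:"):
        return line  # every other command is left untouched

    def swap(match):
        original = match.group(1)
        # Look up the text without surrounding spaces; unknown text is kept as-is.
        return '"' + translations.get(original.strip(), original) + '"'

    return FIELD.sub(swap, line)


translations = {
    "ケンシロウ": "Kenshiro",
    "お前はもう死んでいる": "You are already dead.",
}
japanese = 'showtext:"ケンシロウ","お前はもう死んでいる ","","",'
english = translate_showtext(japanese, translations)
```

Keying the lookup on the original text keeps the sketch short, but a real pipeline would more likely key on file and line so that identical lines can be translated differently depending on context.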
Text encoding[4]
So far we’ve assumed that text can be replaced and it will just work. If the game encodes text in UTF-8 that might be true.
But some games use less common formats like Shift JIS or even custom made encodings. Sara explains how to handle these:
As long as you know the format of the script file, you’re usually going to be fine. But some games hard code their text and others have been in binary formats that have been [assembled] in some way.
And when that happens, you often have to take a look at the code and figure out exactly what this format is which may or may not be documented and basically do the process in reverse. Unassemble it.
The way that encodings work . . . is that the text is interpreted from its binary or hexadecimal format into characters we can understand. And since there’s a lot of different encodings out there, sometimes you don’t know exactly what you’re going to get. . . If a game was made for an older system, it may have a completely custom encoding where a value like “2” means “a” and in this situation just replacing the text you have no idea what the game’s going to output.
If the encoding is very specific for the game or the font simply doesn’t have other symbols, you might get complete gibberish in what font was already there. In Japanese, it’s called mojibake.
If a document’s text encoding differs from the one your text editor assumes, the text will look like the screenshot below. For example, you could have a Shift JIS encoded file that your text editor reads as UTF-8.
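You can reproduce this kind of mismatch in a few lines. The sketch below uses Python’s built-in codecs and the line of dialogue from the script example; `errors="replace"` is there because bytes that are invalid in the guessed encoding would otherwise raise an exception.

```python
original = "お前はもう死んでいる"

# The bytes as a Shift JIS encoded file would store them.
shift_jis_bytes = original.encode("shift_jis")

# Wrong guess: read the Shift JIS bytes as UTF-8.
# Bytes that are not valid UTF-8 become the replacement character U+FFFD.
read_as_utf8 = shift_jis_bytes.decode("utf-8", errors="replace")

# The opposite mistake: read UTF-8 bytes as Shift JIS.
utf8_bytes = original.encode("utf-8")
read_as_shift_jis = utf8_bytes.decode("shift_jis", errors="replace")

# Right guess: the text survives the round trip.
read_correctly = shift_jis_bytes.decode("shift_jis")

print(read_as_utf8)
print(read_as_shift_jis)
print(read_correctly)
```

Neither wrong guess loses the underlying bytes. The file is intact and only the interpretation is off, which is why reopening it with the right encoding fixes the display.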
As a part of their work localization programmers can replace characters in a game’s font with characters in the target language.
For example, the Japanese character「か」might be replaced with the English letter R. If this is done without translating the text, gibberish is displayed. There isn’t a one-to-one mapping between individual Japanese characters and English letters, so the result looks like a random mashing of keys.
The fan translation community calls this Cavespeak. Can you imagine an ancient civilization understanding the nonsense below?
Variable vs fixed-width fonts
Even encodings with all the right characters can have issues with text rendering:
Shift JIS largely supports English letters but because Japanese text tends to include characters that are all the same width, it doesn’t often account for the fact that in English you have very thin letters like the letter I. And so when you simply insert the text into those games you might get a situation where all of the letters are spaced really far from each other. And that does not look good.
You can see an example of character width issues in this screenshot of an early version of the Mother 3 fan translation. The game assumed characters would be the same width and height.
This occurs in commercial productions too. There’s another issue with this translation but I’ll leave that as an exercise for the reader.
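The spacing problem is easy to see with numbers. The sketch below compares where each letter gets drawn when every character advances by one fixed cell versus when each glyph carries its own width. All pixel values are invented for illustration, and `fixed_positions` and `variable_positions` are hypothetical helpers.

```python
# Hypothetical glyph widths in pixels; anything not listed uses DEFAULT_WIDTH.
GLYPH_WIDTHS = {"I": 3, "i": 3, "l": 3, "W": 11, "m": 11}
DEFAULT_WIDTH = 8
CELL_WIDTH = 12  # a fixed cell sized for a full-width Japanese character
SPACING = 1      # gap between neighbouring glyphs


def fixed_positions(text, cell=CELL_WIDTH):
    """Every character advances by the same cell width."""
    return [index * cell for index in range(len(text))]


def variable_positions(text):
    """Every character advances by its own width plus a small gap."""
    positions = []
    x = 0
    for ch in text:
        positions.append(x)
        x += GLYPH_WIDTHS.get(ch, DEFAULT_WIDTH) + SPACING
    return positions


fixed = fixed_positions("Ill")        # [0, 12, 24]
variable = variable_positions("Ill")  # [0, 4, 8]
```

With fixed cells the three thin letters span 24 pixels before the last one is even drawn. With per-glyph widths they fit in 8, which is why fan translators often end up patching in a variable-width font routine.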
Graphical fonts
Some games use graphical fonts instead of TrueType or OpenType fonts.
A graphical font is an image file that includes every character that will be required by the game. This is combined with a table that contains the geometry of each character so the game can draw the characters properly.
Graphical fonts help give games a unique style.
If you’re translating a game into another language you’ll need a graphical font that includes images for every single character. In the example above, we only have letters from the English alphabet. To support a language like French we need to add letters with accent marks to the font.
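To make the image-plus-table idea concrete, here is a minimal sketch of a glyph table and a check for characters the font cannot draw. It assumes the glyphs sit in a regular grid of equal cells, which is only one possible layout. `Glyph`, `build_grid_table`, and `missing_glyphs` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Rectangle of one character inside the font image, in pixels."""
    x: int
    y: int
    width: int
    height: int


def build_grid_table(characters, columns, cell_width, cell_height):
    """Assume glyphs are packed left to right, top to bottom, in equal cells."""
    table = {}
    for index, ch in enumerate(characters):
        row, col = divmod(index, columns)
        table[ch] = Glyph(col * cell_width, row * cell_height, cell_width, cell_height)
    return table


def missing_glyphs(text, table):
    """List the non-space characters in text that the font has no image for."""
    return sorted({ch for ch in text if ch not in table and not ch.isspace()})


# An English-only font: 26 capital letters in two rows of 13.
table = build_grid_table("ABCDEFGHIJKLMNOPQRSTUVWXYZ", columns=13,
                         cell_width=8, cell_height=8)

n_glyph = table["N"]                         # second row, first column
missing = missing_glyphs("\u00c9COLE", table)  # the text "ÉCOLE"
```

Running a check like `missing_glyphs` over the whole translated script before touching the font image tells the artist exactly which characters still need to be drawn.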
So why use graphical fonts at all?
They give artists full control over how fonts look. Chrono Trigger used graphical fonts in its original release which were replaced with TrueType fonts on the PC. You can see how they look out of place compared to the original fonts.
The PC version was later patched to improve the font and graphics.
Translating graphical assets
Most games embed text in assets like textures and logos.
In a best case scenario, the original layers of the texture are preserved. For example the text on a sign would be on a separate layer from the physical board the text is written on. This allows localizers to change the text in Photoshop without redrawing the asset. However, they’ll still need to pick a font or draw text that matches the aesthetic of the original version.
That might sound simple but take a look at the logo below. Nearly everything had to be created from scratch except for the symbol in the background. Translating graphical assets is not always a simple text replacement job.
Asset re-insertion
Assets like script files and graphics are often encrypted and stored in archives. Professional localizers can browse game source code to learn how to extract and reinsert assets.
For fans this work is much harder. Usually they only have access to compiled binaries and must read assembly code to reverse engineer games. Sara got her start with fan translations and explains the process:
Initially you would have a debugger that works with the game and you would have to step through step by step until you actually see the texts being drawn. And then you basically start stepping back from there to see how the text was loaded. And so it’s all a process of finding first something that uses the function you need to edit and then working your way back.
Usually it’s going to be in some sort of container you don’t have any specifications for. . . some encoding you don’t know. It might even be encrypted and obviously all of the functions for you to fix that are right there, but how do you put it back in? It’s easier to extract something than it is to create an archive that works exactly as the one before.
It definitely is much easier to get a file that’s already encrypted or encoded or compressed open than it is to make a new one because you have to worry about the size of the file. You have to worry about getting the optimization of the compression right. You mess up one part of the encryption key and everything’s wrong.
The difficulty of this process means that text extraction and insertion can be very involved for fan translations. If you want to learn more you might like this article about disassembling a visual novel engine.
There are also tools like QuickBMS that can extract and reinsert data using many different formats.
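To show why putting a file back is harder than pulling it out, here is a sketch of a completely made-up archive format: a magic number, a file count, an index of names with offsets and sizes, then zlib-compressed data. No real game is claimed to use this layout, and `pack` and `unpack` are hypothetical helpers.

```python
import struct
import zlib

MAGIC = b"TOY0"  # invented 4-byte signature for this toy format


def pack(files):
    """Build an archive from {name: bytes}, computing every offset from scratch."""
    blobs = {name: zlib.compress(data) for name, data in files.items()}
    # Each index entry: 2-byte name length, the name, 4-byte offset, 4-byte size.
    index_size = sum(2 + len(name.encode("utf-8")) + 8 for name in blobs)
    offset = len(MAGIC) + 4 + index_size
    index = b""
    body = b""
    for name, blob in blobs.items():
        encoded = name.encode("utf-8")
        index += struct.pack("<H", len(encoded)) + encoded
        index += struct.pack("<II", offset, len(blob))
        body += blob
        offset += len(blob)
    return MAGIC + struct.pack("<I", len(blobs)) + index + body


def unpack(archive):
    """Read the index, then decompress each entry it points at."""
    if archive[:4] != MAGIC:
        raise ValueError("not a toy archive")
    (count,) = struct.unpack_from("<I", archive, 4)
    cursor = 8
    files = {}
    for _ in range(count):
        (name_length,) = struct.unpack_from("<H", archive, cursor)
        cursor += 2
        name = archive[cursor:cursor + name_length].decode("utf-8")
        cursor += name_length
        offset, size = struct.unpack_from("<II", archive, cursor)
        cursor += 8
        files[name] = zlib.decompress(archive[offset:offset + size])
    return files


original = pack({
    "script.txt": "お前はもう死んでいる".encode("shift_jis"),
    "menu.txt": b"START",
})
files = unpack(original)
files["script.txt"] = b"You are already dead."
rebuilt = pack(files)
```

Extraction only has to follow the index. Reinsertion has to regenerate it, because the translated file compresses to a different size and every offset after it shifts. With an undocumented format you also have to match whatever compression settings, padding, or checksums the game expects.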
An ideal case
Before we wrap up, I want to give you an example of how games can store text with localization in mind.
A Short Hike loads dialog from CSV files that can be edited right inside Excel. This allows fans to easily insert their own translations.
Here’s an example of an English to Korean translation.
LineCode | LineText | OriginalText | Comment | Speaker | StoryNode |
---|---|---|---|---|---|
line:aeab6a | 난 인터넷이 이해가 안 돼. | i just don’t get the internet | GoatExtra | Original | i just don’t get the internet |
line:74bfc8 | 믿지도 마. | don’t trust it | GoatExtra | Original | don’t trust it |
line:5e700f | 절대로. | never will | GoatExtra |
Consider doing something similar if you’re making a game! Note that even this game only supports a limited set of fonts. To write a Russian translation the developer would need to add a Cyrillic font.
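Loading dialog this way takes very little code. The sketch below assumes a CSV with the LineCode, LineText, and OriginalText columns from the table above. Falling back to the original text when a translation is empty is my own assumption, not necessarily how A Short Hike behaves, and `load_lines` is a hypothetical helper.

```python
import csv
import io


def load_lines(csv_text):
    """Map each LineCode to its translated text, falling back to the original."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = {}
    for row in reader:
        # An empty LineText means the line has not been translated yet.
        lines[row["LineCode"]] = row["LineText"] or row["OriginalText"]
    return lines


# Two rows from the table above; the second has its translation left blank.
sample = (
    "LineCode,LineText,OriginalText\n"
    "line:74bfc8,믿지도 마.,don’t trust it\n"
    "line:5e700f,,never will\n"
)
lines = load_lines(sample)
```

Because the game looks lines up by a stable code rather than by position, translators can reorder or partially fill the sheet without breaking anything.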
The end…?
I hope you enjoyed this little tour of localization programming! Listen to the full conversation with Sara if you’d like to hear more details.
To learn more you can also check out Legends of Localization. Their article on translating personal pronouns paints a picture of just how complicated localization can be.
Thanks again to Sara for sharing her knowledge and reviewing this post.
See you next time!
–Jeremy
[1] You can read more about the challenges of translating Japanese games without full context in this article about translating JRPGs at XSEED Games. Translators can miss a lot if they don’t see their work running inside the game!
[2] Corpse Party uses the Heart of Crown encoding format which is supported by QuickBMS.
[3] In the original script files for Corpse Party everything was in Japanese text including the commands. This is very unusual. In most cases, the commands are either in English or written in Japanese using English characters. This is called Romaji. For example, the Japanese word for music (音楽) is written as “ongaku” in Romaji.
[4] Here’s a great article to learn more about encoding.