Internationalization
Introduction
PCGen will be attempting internationalization in the 6.0+ releases (probably not in 6.0 itself). To support this we have created a Yahoo group called pcgen_international, dedicated to the development of non-English language content for PCGen.
Requests for translations, problems with translations, drafts of translated data sets or other translated content, and anything else related to the internationalization of PCGen should be discussed in that group.
Anybody interested in helping with translations into other languages, as well as anyone who simply wants to follow the discussion, is welcome to join the group. If you would like to join our internationalization efforts, please make yourself known on that list.
This page is an attempt to gather the information discussed on the list.
Goal
The goal is to allow a user who does not speak English to use the program. There are many things to consider to reach this goal.
First, the installation process should be documented in many languages. The program UI should be translated, and so should the data. Skills should be sorted in the UI and in the output sheets according to the natural ordering of the language. The output sheets should also be in the user's language, as should the documentation.
(The language of bug reports, and their translation, also need to be considered.)
Saved characters are usually reusable with a new version of the program, and i18n should not change that. (Technical note: the saved elements should stay the same, and should probably be keys of some sort; I do not know the details of the .pcg file format.)
Imperial versus metric units (already handled in the 12/2011 version; at least there is an option for it).
Tools for i18n
To facilitate internationalization and localization, some tools would be welcome. These tools are not necessary for the user, but they are a great help to maintainers.
- a tool to test whether some entries are not translated (maybe a page that provides statistics); it should probably treat UI and data separately (a sketch of such a check appears after this list)
- keeping a glossary for each language could help (some terms appear in many places)
- a tool to eliminate (or point out) lines that are no longer used in the main bundle file (it is quite big and seems to contain text that is never displayed)
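As a rough illustration of the first tool, a small utility could compare the base UI bundle with a translated one, reporting keys missing from the translation (and, as a related check, keys that exist only in the translation and are therefore probably obsolete). This is only a sketch; the bundle file names are placeholders, not the actual PCGen file names.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    // Sketch only: compares a base bundle with a translated bundle and
    // reports missing and probably-obsolete keys. File names are illustrative.
    public class BundleCoverage
    {
        public static void main(String[] args) throws IOException
        {
            Properties base = load("LanguageBundle.properties");
            Properties translated = load("LanguageBundle_fr.properties");

            for (String key : base.stringPropertyNames())
            {
                if (!translated.containsKey(key))
                {
                    System.out.println("Untranslated: " + key);
                }
            }
            for (String key : translated.stringPropertyNames())
            {
                if (!base.containsKey(key))
                {
                    System.out.println("Possibly obsolete: " + key);
                }
            }
        }

        private static Properties load(String fileName) throws IOException
        {
            Properties props = new Properties();
            try (FileInputStream in = new FileInputStream(fileName))
            {
                props.load(in);
            }
            return props;
        }
    }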
Choices yet to be made
Application Packaging
As of December 2011, the application and the translations of parts of the UI are bundled together.
The question is whether the application should be bundled this way, or whether there should be language-specific bundles.
Pros of an all-in-one bundle
- user can change the language on the fly (see the sketch after this list)
- no need for explanations on how to install a new language set
Cons
- file could get quite big
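As a rough illustration of the all-in-one approach, switching languages on the fly amounts to reloading the resource bundle for the newly selected locale; a minimal sketch, with a hypothetical bundle name:

    import java.util.Locale;
    import java.util.ResourceBundle;

    // Sketch only: with all translations shipped inside the application,
    // changing the language amounts to reloading the bundle for the new
    // locale. The bundle base name is hypothetical.
    public class UiStrings
    {
        private static ResourceBundle bundle =
                ResourceBundle.getBundle("LanguageBundle", Locale.getDefault());

        public static void setLocale(Locale locale)
        {
            ResourceBundle.clearCache();
            bundle = ResourceBundle.getBundle("LanguageBundle", locale);
        }

        public static String get(String key)
        {
            return bundle.getString(key);
        }
    }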
Gendered class names
In some languages, class names can be gendered; for example, the feminine of priest is priestess in English (Cleric is the actual class name, this is just to illustrate the point). Do we want to use gendered forms for class names, or do we stick to whatever single name the class has?
The goals in more detail for PCGen
Default language after clean install
The language should be the system language. If the user does not understand the default language, it will be a pain for them to find the option to change it. It may be even better if the user is prompted with a choice at the start of the program (or during installation) in order to simplify this process. The prompt might be displayed only if the system language is not available in PCGen.
After a clean install (no preference files):
- PCGen determines the system language
- if translations exist (and are believed good enough), switch to that language; if not, display a dialog asking the user which language to use
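A minimal sketch of that startup logic, assuming a hard-coded list of supported languages and a placeholder prompt method (neither exists in PCGen in this form):

    import java.util.Arrays;
    import java.util.List;
    import java.util.Locale;

    public class StartupLanguage
    {
        // Hypothetical list of languages that have a usable translation.
        private static final List<String> SUPPORTED = Arrays.asList("en", "fr", "de", "it");

        public static Locale chooseInitialLocale()
        {
            Locale system = Locale.getDefault();
            if (SUPPORTED.contains(system.getLanguage()))
            {
                // A translation exists for the system language: use it silently.
                return system;
            }
            // Otherwise ask the user which supported language to use.
            return promptUserForLocale(SUPPORTED);
        }

        private static Locale promptUserForLocale(List<String> supported)
        {
            // Placeholder: in the real application this would be a dialog.
            return Locale.ENGLISH;
        }
    }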
Published language priority
- Data that is not translated into a language, for example books not yet translated, should stay in the language it was first published in.
Books are almost never translated immediately by publishers, and some are never translated at all. PCGen is also short on staff. All of this means that translations will probably arrive well after the original data. For a user who only uses translated material, this is not a problem. For a user who mixes translated and untranslated material, it should not become a problem either.
Using different data languages
- The data should be the same as the English data, to avoid having to duplicate data corrections.
Having separate data files in several languages could be one way to do i18n.
Cons
- might need language-specific bundles
- a character created with one language set cannot be viewed by someone using a different language set
- multiple corrections are needed when fixing a data bug (and if the translated data can be generated automatically, why not generate it on the fly instead)
Separating data and UI translations
- The interface language and the data language should be separate, to allow people with books in a different language than their own to use the program more easily.
For a user who speaks only one language, this is no advantage. A user who uses books that are not in the language they are most comfortable with would set the UI and the data to different languages. The same applies to someone who does not use translated data but prefers the program itself in a language they are more comfortable with.
Custom content and localization
- It should be possible to use another language in a data set, to allow non-English speakers to develop custom content in their own language. That seems to be a problem when combined with separating the UI language from the data language. One way to avoid it is to provide an easy reference for creating custom content.
This might be hard to achieve in combination with the other points.
Technical Aspects
Language choice UI
There are already options to change the language and the unit system used for distances.
It might be best to have a drop-down list rather than radio buttons.
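A minimal sketch of such a drop-down in Swing; the set of locales offered is purely illustrative:

    import java.util.Locale;
    import javax.swing.JComboBox;

    public class LanguageChooser
    {
        public static JComboBox<String> createLanguageCombo()
        {
            // Illustrative list; the real list would come from the available translations.
            Locale[] available = {Locale.ENGLISH, Locale.FRENCH, Locale.GERMAN, Locale.ITALIAN};
            JComboBox<String> combo = new JComboBox<>();
            for (Locale locale : available)
            {
                // Show each language in its own language, e.g. "Deutsch", "italiano".
                combo.addItem(locale.getDisplayLanguage(locale));
            }
            return combo;
        }
    }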
Formatting Issue
In Japanese, the usual Arabic numerals are almost always used, especially in RPGs. In fact, translating numbers is not needed; what is needed is formatting numbers.
That means that 10000 is formatted as 10,000 in English and 10 000 in French. In Japanese, it would be either 10,000 or 1万.
In Java, this is usually done by using the NumberFormat class.
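For example (the exact grouping separators come from the locale data shipped with the Java runtime):

    import java.text.NumberFormat;
    import java.util.Locale;

    public class NumberFormatDemo
    {
        public static void main(String[] args)
        {
            int value = 10000;
            // Typically "10,000" for US English, "10 000" for French,
            // and "10,000" for Japanese.
            System.out.println(NumberFormat.getIntegerInstance(Locale.US).format(value));
            System.out.println(NumberFormat.getIntegerInstance(Locale.FRENCH).format(value));
            System.out.println(NumberFormat.getIntegerInstance(Locale.JAPANESE).format(value));
        }
    }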
Sorting Issue
Sorting is usually language dependent, meaning that the list of skills, for example, is not sorted the same way in different languages, even if the names are the same. The differences are usually minor between languages that use the Latin alphabet.
The internationalization contributors brought up sorting again: another project would be to look into the GUI/core and replace all the sorting of items with internationalized sorting methods (Collators).
- Use a Collator to sort lists.
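A small example of the difference, using hypothetical French skill names:

    import java.text.Collator;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Locale;

    public class SkillSortDemo
    {
        public static void main(String[] args)
        {
            // A naive String sort would put "Équitation" after "Survie",
            // because 'É' has a higher code point than the plain letters.
            List<String> skills = Arrays.asList("Survie", "Équitation", "Art de la magie");

            Collator collator = Collator.getInstance(Locale.FRENCH);
            skills.sort(collator);

            // Prints [Art de la magie, Équitation, Survie]
            System.out.println(skills);
        }
    }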
Unique Identifier
Apparently there are already unique identifiers in use for items, so the idea would be to have a file with that identifier as the key and the translation as the value.
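As a rough sketch, such a file could use Java's .properties format, with the unique identifier as the key and the translated name as the value. The keys and class name below are invented for illustration.

    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Properties;

    public class ItemNameTranslations
    {
        private final Properties names = new Properties();

        // Example file contents (invented keys):
        //   SKILL.Spellcraft=Art de la magie
        //   CLASS.Cleric=Prêtre
        public void load(String fileName) throws IOException
        {
            try (FileReader reader = new FileReader(fileName))
            {
                names.load(reader);
            }
        }

        // Returns the translated name, or the untranslated name if none exists.
        public String nameFor(String itemKey, String fallback)
        {
            return names.getProperty(itemKey, fallback);
        }
    }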
Tom’s technical proposal
So looking back at the earlier thread, I never actually came back with my suggestion, so here is my perspective on localization and how it should be done. This has been made easier by some changes we have in the 6.x line, and having the new UI (with what I believe to be a lot more isolation of the code that displays items) is a huge boost to making this practical.
First a few base facts as background:
- For [almost] every item (Class, Skill, etc.) there is a unique identifier, generally referred to as a Key. Note that if the "KEY" token is not used, then the Key is the name (the first entry on the line in the data). (There is an exception to this we'll call problem #1.)
- There are basically 4 types of things that need translation:
- (2a) item names
- (2b) constants (like spell schools)
- (2c) variables (like meters/feet/etc.)
- (2d) Strings (like descriptions)
- [if anyone can think of more, let me know]
- Most (but not all) tokens are "unique" or otherwise "addressable" in that they can only occur once per object. (Has anyone noticed that the tokens have started to be specific in the test code & docs about whether they overwrite, add, etc.? This is one reason why, and yes, I've been slowly trying to make progress on this even in the 2007-2009 work.) [The exception to this is problem #2.]
A few principles about l10n:
- We must not make producing a data set materially more complicated than it is today (no requirement to put %L10NNAME% type gunk into data)
- We should target an ability to tell us if l10n is complete for any given data set
With that:
- Following from #1, almost everything we have in the data today is "addressable". By that, I mean that the OUTPUTNAME for a Skill called "FooBar" can be uniquely called out. The name hierarchy is something like: SKILL//FooBar//OUTPUTNAME (exceptions are problems #1 and #2, to be addressed later)
Given that, we can actually set principle #1 to "The data remains unchanged" (again, except for problem #2). The entire data set is produced (assume US English for a moment). We then have a unique file for l10n that has things like:
- SKILL:FooBar|Oobarf-ay
- SKILL:FooBar:OUTPUTNAME|ooBarOutF-ay
- etc.
This (to the first order) covers 2a, 2d.
For 2b items, we simply have to expand the list of items (SKILL, SPELL) that we are familiar with, so we get things like:
- SPELLSCHOOL:Divination|Ivinationd-ay
2c is a bit more complicated, but I can't believe it's all that bad given it's a thing many applications already do (and it's a known thing)
Each file could be named in such a way that it identifies its l10n, e.g.:
- srd_de_ch.l10n
- ...and probably placed in a L10N subfolder of the initial dataset.
- (Note: I'd recommend we be clear on where these go, AND ALSO require that NO PCC FILES (I'd recommend we say no PCC or LST, but the formal limit would be no PCC) are recognized if they are in the L10N folder... so that the initial directory parse [which is probably one of the slower parts of our boot process] can immediately ignore the l10n directory and not have to look through the file list looking for .PCC files... alternatively, we could have an l10n folder that is parallel to the data folder, but that then adds complication in needing new preferences to point at multiple l10n folders and requires more complicated structure within that l10n folder to identify WHICH files in that folder map to which datasets... because we can't simply say a given English word will always translate the same way - English is way too overloaded for that.)
I would recommend we keep things in a small subset of files, and not try to do 1:1 for each data file (that would produce a lot of file sprawl) - but that's not my call.
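As a rough sketch of how such an l10n file could be consumed (the pipe-separated format follows the examples above; the reader class itself is hypothetical, not an agreed design):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class L10nOverrides
    {
        // Maps an addressable item (e.g. "SKILL:FooBar" or
        // "SKILL:FooBar:OUTPUTNAME") to its translated text.
        private final Map<String, String> overrides = new HashMap<>();

        public void load(String fileName) throws IOException
        {
            for (String line : Files.readAllLines(Paths.get(fileName)))
            {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#"))
                {
                    continue; // skip blank lines and comments
                }
                int sep = line.indexOf('|');
                if (sep > 0)
                {
                    overrides.put(line.substring(0, sep), line.substring(sep + 1));
                }
            }
        }

        // Returns the translation, or the untranslated fallback if none exists.
        public String translate(String address, String fallback)
        {
            return overrides.getOrDefault(address, fallback);
        }
    }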
Addressing principle #2:
- Since we have a set of items we know would need to be covered (names, certain tokens), we should be able to load those into memory, and load the l10n file into memory and compare. This should be able to produce warnings of 2 kinds:
- (W1) Errors where the base data set contains things that are not translated
- (W2) Errors where the translation file attempts to translate things not in the base data set
- I can't imagine that utility is all that hard to write (just would need to make the list of todos)
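A sketch of such a utility, assuming the two sets of addressable keys have already been collected from the loaded data set and from the l10n file:

    import java.util.Set;
    import java.util.TreeSet;

    public class L10nCoverageCheck
    {
        public static void report(Set<String> dataKeys, Set<String> translationKeys)
        {
            // W1: things in the data set that have no translation.
            Set<String> untranslated = new TreeSet<>(dataKeys);
            untranslated.removeAll(translationKeys);
            for (String key : untranslated)
            {
                System.out.println("W1 (not translated): " + key);
            }

            // W2: translations that point at nothing in the data set.
            Set<String> orphaned = new TreeSet<>(translationKeys);
            orphaned.removeAll(dataKeys);
            for (String key : orphaned)
            {
                System.out.println("W2 (not in data set): " + key);
            }
        }
    }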
This brings us to problems #1 and #2:
Problem #1: Non-unique names
- We glossed over this in 6.x, but the truth is that Spell names are not unique. Some of the *RDs have duplicate names (not all, but I forget which). Same is true for languages.
- (1a) Spells can theoretically be differentiated by evaluating the "TYPE" token for Divine, Arcane, or Psionic (those are "magical" items in our code).
- (1b) Languages can theoretically be differentiated between "Spoken" and "Written" as those are "magical" types.
I believe both of those forms of magic are on our backlog of FREQs to clean up... and the reason they are on the cleanup list is really as much for L10N as it is to just clean up the overuse of TYPE.
Problem #2: Non-unique tokens
- There are only a few tokens that are not unique. DESC is one of them, if I recall correctly. The probable solution here is to simply give an identifier to each token. This might require a change to LST syntax, something like:
- DESC*Overall:x
- DESC*Second:x
- (Note: I'm not sure this syntax works or is by any means "good". DESC:OVERALL|x might be better as it avoids potential issues with using * as a reserved character - take this as a sketch of what would have to happen to the DESC token, not as a full-blown proposal)
Note that this naming of each DESC item (and other reusable items), while it breaks the "can't change the data" principle, actually helps as much as it hurts... it would give us the ability to do things like:
- DESC:.CLEARID.Overall
- or things similar to that, which is actually a neat benefit for the small overhead of pain it puts into datasets. (Which, by the way, could be converted to whatever we decide on anyway with our nifty converter, so this doesn't seem all that bad)
So in my mind, the question really is: Is going slightly more than half way good enough? There are areas where we could do translation, and some areas where we need some pretty material core code changes to support it.
Translation files
Hi Tom,
Welcome BACK!!!
I don't mind the solutions put forth. Unique identifiers for DESC are no more absurd than the unique identifiers we'll need in order to use more than one CHOOSER on a single line. Plus, the ability to clear off a section of DESC would be awesome.
I'd vote for as complete a job as possible.
Here is where I'm not understanding - where are we going to put the translation stuff?
The choices as I see them:
- one translation file (it can get really big)
- one translation file per data file (apparently Tom advises against this)
- grouping several translations into each file