Internationalization

From PCGen Wiki

Introduction

PCGen releases from 6.0 onward will be attempting internationalization (though probably not in 6.0 itself). Internationalization is the process of changing software so that it can be used in any language and any locale. It is often abbreviated i18n. It is needed in PCGen because the software was not designed from the ground up to be usable in a language other than English. Localization, often abbreviated l10n, is the process of creating the translations for a particular language; for that to be possible, the software must first be internationalized. To deal with both, a Yahoo group called pcgen_international has been created.

Requests for translations, problems with translations, and drafts of translated data sets or other translated content, as well as anything else that comes up regarding the internationalization of PCGen, are to be discussed in that group.

Anybody interested in helping out with translations into other languages, as well as those who are just interested in the discussion, please join the group. Those interested in joining our efforts on the internationalization of PCGen, please make yourselves known on that list.

This page is an attempt to gather the information discussed on the list.

There is a JIRA issue to group the work on i18n: http://jira.pcgen.org/browse/CODE-733

Goal

The goal is to allow a user who doesn’t speak English to use the program in his native language with the specifics of his locale. This is an ideal goal, as it is not possible to provide every localization with the low number of volunteers. What is needed is to provide the means for people to create localizations; for that, PCGen must be fully internationalized. Once this is done, the number of localizations will slowly grow.

The PCGen interface, especially the parts that are data set independent, is already internationalized and partially translated. Updating those translations is one of the needed steps but not the only one. Another element is already internationalized: the use of imperial or metric distances, for which there is also an option.

Other steps are needed to attain internationalization.

  • First, the installation process should be documented in many languages, even if it only consists of expanding a compressed file and then using a script to launch the program.
  • All program UI strings should be translatable, and so should the data strings. It seems that the non-data strings are already translatable.
  • The skills should be sorted in the UI and in the output sheet according to the natural sorting of the language.
  • The output sheets should be in the user’s language too.
  • If a user creates his own translation, it should be possible to add the language in the option pane without having to compile the program (or worse, having to change some code!). From a technical point of view, it also means that the translations should be external to the jar, or that a simple way to add a translation should be provided.
  • Saved characters are usually reusable with a new version of the program, and i18n should not change that. (Technical: the saved elements should stay the same, and should probably be keys of some sort. I have no idea what the pcg files are.)
  • It would be better for the weights and measures system to be initially set to a sensible value based on the first chosen locale rather than to a fixed one. It also needs to be verified that the imperial and metric system options actually work.

There are also other elements that would improve usability, mainly having help and documentation available in the user’s language. This is usually a lot of work but doesn’t need any change in PCGen. A related matter is being able to handle bug reports in another language. To do that, the project needs people who would translate the bug reports into English before the development team can handle them.

Tools for internationalization and localization

To facilitate internationalization and localization, some tools would be welcome. These tools are not necessary for the user, but are a great help to maintainers.

For internationalization:

  • a way to determine which strings are translatable and which are not. Some IDEs and some code quality libraries provide this kind of tool.
  • a tool to eliminate (or point at) lines that are no longer used in the main bundle file (it is quite big and seems to contain text that is never displayed)

For localization:

  • If a specific way of storing the data translations is used, a tool to produce and update a more standard file would be needed. In free software projects, gettext files (POT, PO) are often used. In Java, ResourceBundles are often used. Another standard format is XLIFF. Such a tool would add newly needed translations to a file in the standard format and flag old translations. From the standard format, it would then update the translations in the specific format file. This is needed because there are tools that ease and speed up translation, like Virtaal, ResourceBundle editors, or poedit, and they need a standard file format to be usable.
  • a tool to test whether some entries are not translated (maybe a page that would provide statistics); it might be more useful if UI and data are reported separately
  • keeping a glossary for each language could help (some terms are found in many places)
  • A way to detect incorrect token translations (like using Dager instead of Dagger) is welcome. One way to do this is to generate errors or warnings when loading the translation file. It would be good to be able to generate this list of errors without having to launch PCGen; it would also be a nice addition to automated testing.

Choices yet to be made

Application Packaging

As of December 2011, the application and the translations of some parts of the UI are bundled together.

The question is whether the application should be bundled this way, or whether there should be language-specific bundles.

Pros of an all-in-one bundle

  • user can change language on the fly
  • no need for explanations on how to install a new language set

Cons

  • the file could get quite big, but text usually compresses really well.

Gendered class names

In some languages, class names can have a grammatical gender; for example, the feminine of priest is priestess in English (the class is actually named Cleric; this is just to illustrate the point). Do we want to use those gendered forms in the class names, or do we stick to whatever the class is named?

The goals in more detail for PCGen

Default language after clean install

The language should be the system one. If the user doesn’t understand the default language, it will be a pain for him to find the option to change it. It may be even better if the user is prompted with a choice at the start of the program (or at install) in order to simplify this process. The prompt might be displayed only if the system language doesn’t exist in PCGen. As noted before, the weights and measures system needs to be chosen by the user at first install, with the default choice depending on the language/locale selected. After first launch or install, the option can be changed in the preferences panel.

After clean install (no preference files):

  1. PCGen determines the system language
  2. If a translation exists (and is believed to be good enough), switch to that language; if not, display a prompt asking the user which language to use.
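
The two steps above can be sketched in Java as follows. This is only an illustration and not PCGen code: the class name StartupLanguage, the method choose and the set of translated languages are all hypothetical, and "good enough" is reduced to "present in the set".

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of PCGen.
class StartupLanguage {

    // Returns the language to switch to, or an empty Optional when the
    // user has to be asked because no suitable translation exists.
    static Optional<Locale> choose(Locale system, Set<String> translated) {
        String language = system.getLanguage();
        if (translated.contains(language)) {
            return Optional.of(Locale.forLanguageTag(language));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption: the language codes of the translations judged good enough.
        Set<String> translated = Set.of("en", "fr");
        String message = choose(Locale.getDefault(), translated)
                .map(locale -> "Switching to " + locale.getDisplayLanguage(locale))
                .orElse("No translation found: asking the user which language to use");
        System.out.println(message);
    }
}
```

Returning an empty Optional rather than silently falling back to English keeps the "ask the user" case explicit for the caller.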

Published language priority

  • Data that is not translated into a language, for example books not yet translated, should stay in the language it was first published in.

Books are almost never translated immediately by publishers, and some are never translated at all. PCGen is also short on staff. All of this means that translations will probably arrive long after the original. For a user who only uses translated material, it will not be a problem. For a user who mixes translated and untranslated material, it shouldn’t become one.

Using different data languages

  • The data should be the same as the English data to avoid having to duplicate data corrections.

Having data in several languages could be a way to do i18n.

Cons

  • might need language specific bundles
  • a character used with one language cannot be seen by someone with a different language set
  • multiple corrections needed when correcting a bug (and if automatic generation of the translation is done, why not do it on the fly)

This approach should probably be avoided.

Separating data and UI translations

  • The interface language and the data language should be separate, to make the program simpler to use for people whose books are in a different language than their own.

For a user who speaks only one language, this is neither an advantage nor a problem. A user whose books are not in the language he is most comfortable with would set the UI and the data to different languages. The same goes for someone who doesn’t use translated data but prefers to have the program in a language he is more comfortable with.

Custom content and localization

  • It should be possible to use another language in a data set, to allow non-English speakers to develop custom content in their own language. That seems to be a problem when combined with a data language separate from the UI language. A way to avoid it is to provide an easy reference for creating custom content.

This might be hard to achieve in combination with the other points.

(I might be wrong on this point. Masaru20100) At the moment, when creating custom content, the English tokens are used. For a non-English speaker, this is a problem; that’s why, ideally, he should be able to create custom content using his own language.

The problem this causes is that if the user changes the language of the program, the custom content either doesn’t work anymore, or all possible translations have to be checked, which would cause massive slowness in the software.

As this seems infeasible, the user should be provided with a way to easily find the English tokens needed for his custom content.

Indicating what is translated

It would be good to give the user a visual indication of how much of a data set is translated into his language. It is not really a needed feature but could help him: if he knew that not everything is translated, he would be less surprised.

Technical Aspects

Language choice UI

There are already options to change the language and the weights and measures system used.

It might be best to have a drop-down list rather than radio buttons. New translations should appear in the list.

Formatting Issue

Martijn pointed out that numbers should be translatable. In Japanese, the classic number system is almost always used, especially in RPGs. In fact, translating numbers is not needed; what is needed is formatting numbers.

That means that 10000 gets formatted 10,000 in English, 10 000 in French. In Japanese, it would either be 10,000 or 1万. There is even a locale to obtain 一万 if needed.

In Java, this is usually done by using the NumberFormat class with the appropriate locale.
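
As a minimal sketch of that approach, the following formats the same value for three locales. The class and method names are made up for the example. The exact character used as the French grouping separator (a kind of non-breaking space) depends on the locale data shipped with the JDK, so the sketch does not rely on it.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Illustrative only; the class name is hypothetical.
class NumberFormatting {

    // Formats a whole number with the grouping rules of the given locale.
    static String format(long value, Locale locale) {
        return NumberFormat.getIntegerInstance(locale).format(value);
    }

    public static void main(String[] args) {
        Locale[] locales = {Locale.US, Locale.FRANCE, Locale.JAPAN};
        for (Locale locale : locales) {
            // Prints the language tag followed by the formatted value.
            System.out.println(locale.toLanguageTag() + ": " + format(10000, locale));
        }
    }
}
```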

Sorting Issue

Sorting is usually language dependent, meaning that the list of skills, for example, is not sorted the same way in different languages, even if the names are the same. The differences are usually minor between languages that use the Latin alphabet.

Actually, the international guys brought up the sorting thing again - another project would be to look into the GUI/core and replacing all the sorting of items with internationalized methods of sorting (Collators).

Technically, Java already provides ways of sorting based on locale: the Collator classes.
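
A small sketch of what using a Collator looks like, with a few French skill names chosen for the example (they are not taken from a PCGen data set). Accented capitals are written as Unicode escapes so the source stays plain ASCII. With the plain String ordering, names starting with an accented E end up after every unaccented name; with a French Collator they sort next to the other names starting with E.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only; the class name is hypothetical.
class SkillSorting {

    // Returns a copy of the names sorted by the rules of the given locale.
    static List<String> sort(List<String> names, Locale locale) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Collator.getInstance(locale));
        return sorted;
    }

    public static void main(String[] args) {
        // \u00C9 is a capital E with an acute accent.
        List<String> skills = List.of("Natation", "\u00C9vasion", "Escalade", "\u00C9quitation");

        List<String> naive = new ArrayList<>(skills);
        naive.sort(null); // natural String order, by code unit
        System.out.println("String order:   " + naive);
        System.out.println("Collator order: " + sort(skills, Locale.FRENCH));
    }
}
```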

Unique Identifier

Apparently there are already unique identifiers in use for items, so the idea would be to have a file with that id as the key and the translation as the value.
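
To illustrate the idea, here is a sketch that uses the standard Java properties format as the key/value file. This is an assumption made for the example, not a decision taken for PCGen, and the class name and sample entry are hypothetical. A key without a translation is returned unchanged, so untranslated items simply stay in their original language.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only; not PCGen code.
class KeyedTranslations {
    private final Properties translations = new Properties();

    // Loads "key=translation" lines. Note that in the properties format,
    // keys containing spaces, ':' or '=' would have to be escaped.
    static KeyedTranslations fromText(String content) {
        KeyedTranslations result = new KeyedTranslations();
        try {
            result.translations.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    // Returns the translation, or the key itself when none is available.
    String translate(String key) {
        return translations.getProperty(key, key);
    }

    public static void main(String[] args) {
        KeyedTranslations french = fromText("Dagger=Dague\n");
        System.out.println(french.translate("Dagger")); // translated entry
        System.out.println(french.translate("Club"));   // no entry: key is kept
    }
}
```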

Output Sheet and Data Language

The question is: should output sheets be locale dependent? If they are, when a sheet is corrected for one locale, the correction will need to be made in all languages. Having languages taken into account within the same sheet makes it more complicated.

Gender, race and level example

There are also other problems, as illustrated by gender, race and level.

In English, the gender choices are Male, Female, Neuter, Unknown. In an English statblock, there is usually a line like: Gender Race Level N, as in “Female drow cleric 3” for a drow noble [1]. As of this writing, the software outputs “Female Drow Noble Cleric3” or “Female Drow Noble Cleric 3” instead in most statblock sheets. The Unknown gender, and maybe the Neuter one too, should probably not be output as is in statblock sheets. On my system, with the preferences set to use the system language, the translation of Female is used instead.

In French, it is Mâle, Femelle, Neuter, Inconnu. The same drow noble is “Drow, prêtre 3” (prêtre is the translation of cleric) [2]. I remember reading “Drow (f), prêtre 3”, where the sex is indicated by a single letter in parentheses. It could also have been “Drow, guerrière 3”, where guerrière is the feminine of fighter. As you can see, the order is different.

In Japanese, it looks like “ドラウ(女性)の貴族の3レベル・クレリック” [3]. ドラウ is drow, 女性 means woman, の is the possessive, 貴族 is noble, レベル is level, クレリック is cleric. That would read like drow(female)’s noble’s 3rd level cleric. When the gender is unknown, it is not mentioned; the goblin is such an example [4].

In Italian, it seems to be “Drow nobile femmina Chierico 3”, i.e. Drow noble female Cleric 3. [5]

I think that in German, the word for Female as a standalone choice and the word for female as a modifier (as in “female something”) are not written the same.

Note that my examples use Pathfinder because I do not know of translations of the SRD/RSRD online.

This is only the first element of the stat block, and it already seems complicated. And there was no example of a right-to-left language, like Arabic or Hebrew. One thing that seems common is that the gender value used for display and the one used in stat blocks are different, whatever the language, and it is the same for race. For a standard character sheet, it seems that the UI value can be reused as is. In the case of gender, there are already two (three?) tokens, GENDER.SHORT and GENDER.LONG (and GENDER?). GENDER.SHORT, which has the same value as GENDER.LONG, could be changed to be the output value. A new value could be introduced, maybe something like GENDER.STATBLOCK. That still doesn’t handle the problem of the order of gender/race/class levels.
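
One possible way to deal with the ordering problem, shown here only as a sketch and not as something PCGen does, is to keep one pattern per language and let java.text.MessageFormat place the elements. The class name, the pattern table and the choice of placeholders are all assumptions made for this example; the English and French patterns simply reproduce the two statblock lines quoted above.

```java
import java.text.MessageFormat;
import java.util.Map;

// Illustrative only; not PCGen code.
class StatBlockLine {

    // Hypothetical per-language patterns:
    // {0} = gender, {1} = race, {2} = class, {3} = level.
    static final Map<String, String> PATTERNS = Map.of(
            "en", "{0} {1} {2} {3}",
            "fr", "{1}, {2} {3}"); // gender left out, as in the French example

    // Builds the first statblock line, falling back to the English pattern.
    static String format(String language, String gender, String race,
                         String className, int level) {
        String pattern = PATTERNS.getOrDefault(language, PATTERNS.get("en"));
        // The level is passed as text so it is not reformatted by locale.
        return MessageFormat.format(pattern, gender, race, className,
                String.valueOf(level));
    }

    public static void main(String[] args) {
        System.out.println(format("en", "Female", "drow", "cleric", 3));
        // \u00EA is an e with a circumflex accent.
        System.out.println(format("fr", "Femelle", "Drow", "pr\u00EAtre", 3));
    }
}
```

A pattern can leave out a placeholder entirely, which covers languages where the gender is not printed, but it does not address gendered class names such as guerrière.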

Data translation and output sheet values locale

The issue is which language to use when outputting the gender or other localized fields. At the moment, the data is in English, but the output is a mix of English and whatever language the UI is set to use. My primary example is Gender. Once a data language is introduced, it might make sense to use this language in the output sheets rather than the UI one (if they differ).

Tom’s technical proposal

So looking back at the earlier thread, I never actually came back with my suggestion, so here is my perspective on localization and how it should be done. This has been made easier by some changes we have in the 6.x line, and having new UI (with what I believe to be a lot more isolation of code that does display of items) is a huge boost here to making this practical.

First a few base facts as background:

  1. For [almost] every item (Class, Skill, etc.) there is a unique identifier, generally referred to as a Key. Note that if the "KEY" token is not used, then the Key is the name (the first entry on the line in the data). (There is an exception to this we'll call problem #1.)
  2. There are basically 4 types of things that need translation:
    1. (2a) item names
    2. (2b) constants (like spell schools)
    3. (2c) variables (like meters/feet/etc.)
    4. (2d) Strings (like descriptions)
    5. [if anyone can think of more, let me know]
  3. Most (but not all) tokens are "unique" or otherwise "addressable" in that they can only occur once per object. (Anyone notice that the tokens have started to be specific in the test code & docs about whether they overwrite, add, etc. - this is one reason why - and yes, I've been slowly trying to make progress on this even in the 2007-2009 work.) (The exception to this is problem #2.)

A few principles about l10n:

  1. We must not make producing a data set materially more complicated than it is today (no requirement to put %L10NNAME% type gunk into data)
  2. We should target an ability to tell us if l10n is complete for any given data set

With that:

  • Following from #1, almost everything we have in the data today is "addressable". By that, I mean that the OUTPUTNAME for a Skill called "FooBar" can be uniquely called out. The name hierarchy is something like: SKILL//FooBar//OUTPUTNAME (exceptions are problems #1,2 to be addressed later)

Given that, we can actually set principle #1 to "The data remains unchanged" (again, except for problem #2). The entire data set is produced (assume US English for a moment). We then have a unique file for l10n that has things like:

  • SKILL:FooBar|Oobarf-ay
  • SKILL:FooBar:OUTPUTNAME|ooBarOutF-ay
  • etc.

This (to the first order) covers 2a, 2d.

For 2b items, we simply have to expand the list of items (SKILL, SPELL) that we are familiar with, so we get things like:

  • SPELLSCHOOL:Divination|Ivinationd-ay
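
Reading such a file could look like the sketch below. The format is only the one proposed in this message (an address, a "|", then the translation); nothing here exists in PCGen, and the class name and error handling are assumptions for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only; parses the proposed "ADDRESS|translation" lines.
class L10nFileParser {

    // Maps each address (e.g. "SKILL:FooBar:OUTPUTNAME") to its translation.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines are ignored
            }
            // Split at the first '|' only; the translation may contain more.
            int bar = trimmed.indexOf('|');
            if (bar <= 0) {
                throw new IllegalArgumentException("Missing address or '|' in: " + line);
            }
            entries.put(trimmed.substring(0, bar), trimmed.substring(bar + 1));
        }
        return entries;
    }

    public static void main(String[] args) {
        Map<String, String> entries = parse(List.of(
                "SKILL:FooBar|Oobarf-ay",
                "SKILL:FooBar:OUTPUTNAME|ooBarOutF-ay",
                "SPELLSCHOOL:Divination|Ivinationd-ay"));
        entries.forEach((address, text) -> System.out.println(address + " -> " + text));
    }
}
```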

2c is a bit more complicated, but I can't believe it's all that bad given it's a thing many applications already do (and it's a known thing)

Each file could be named in such a way that it identifies its l10n, e.g.:

  • srd_de_ch.l10n
  • ...and probably placed in a L10N subfolder of the initial dataset.
  • (Note I'd recommend we be clear on where these go, AND ALSO require that NO PCC FILES (I'd recommend we say no PCC or LST, but the formal limit would be no PCC) are recognized if they are in the L10N folder... so that the initial directory parse [which is probably one of the slower parts of our boot process] can immediately ignore the l10n directory and not have to look through the file list looking for .PCC files... alternately, we have a l10n folder that is parallel to the data folder, but that then adds complication in needing new preferences to point at multiple l10n folders and requires more complicated structure within that l10n structure to identify WHICH files in that folder map to which datasets... because we can't simply say a given English word will always translate the same way - English is way too overloaded for that.)
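
For illustration, a file name following the srd_de_ch.l10n example could be derived from a Java Locale as sketched below. The naming rule (data set name, language, optional lower-cased country, ".l10n" extension) is inferred from that single example and is an assumption, as are the class and method names.

```java
import java.util.Locale;

// Illustrative only; the naming rule is inferred from "srd_de_ch.l10n".
class L10nFileName {

    // Builds "<dataset>_<language>[_<country>].l10n" in lower case.
    static String forLocale(String dataset, Locale locale) {
        StringBuilder name = new StringBuilder(dataset)
                .append('_')
                .append(locale.getLanguage());
        if (!locale.getCountry().isEmpty()) {
            name.append('_').append(locale.getCountry().toLowerCase(Locale.ROOT));
        }
        return name.append(".l10n").toString();
    }

    public static void main(String[] args) {
        // Swiss German, as in the example file name.
        System.out.println(forLocale("srd", Locale.forLanguageTag("de-CH")));
        System.out.println(forLocale("srd", Locale.FRENCH));
    }
}
```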

I would recommend we keep things in a small subset of files, and not try to do 1:1 for each data file (that would produce a lot of file sprawl) - but that's not my call.

Addressing principle #2:

  • Since we have a set of items we know would need to be covered (names, certain tokens), we should be able to load those into memory, and load the l10n file into memory and compare. This should be able to produce warnings of 2 kinds:
  • (W1) Errors where the base data set contains things that are not translated
  • (W2) Errors where the translation file attempts to translate things not in the base data set
  • I can't imagine that utility is all that hard to write (just would need to make the list of todos)
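
At its core such a utility is two set differences. The sketch below assumes that the addresses needing translation and the addresses found in the l10n file have already been collected into sets; how they are collected, and all the names used, are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative only; not PCGen code.
class L10nCompleteness {

    // (W1) Addresses present in the base data set but missing from the l10n file.
    static Set<String> untranslated(Set<String> dataKeys, Set<String> l10nKeys) {
        Set<String> result = new TreeSet<>(dataKeys);
        result.removeAll(l10nKeys);
        return result;
    }

    // (W2) Addresses in the l10n file that do not exist in the base data set.
    static Set<String> unknown(Set<String> dataKeys, Set<String> l10nKeys) {
        Set<String> result = new TreeSet<>(l10nKeys);
        result.removeAll(dataKeys);
        return result;
    }

    public static void main(String[] args) {
        Set<String> data = Set.of("EQUIPMENT:Dagger", "EQUIPMENT:Club");
        Set<String> l10n = Set.of("EQUIPMENT:Dgager", "EQUIPMENT:Club"); // note the typo
        System.out.println("W1 not translated: " + untranslated(data, l10n));
        System.out.println("W2 not in data:    " + unknown(data, l10n));
    }
}
```

A misspelled address shows up in both lists at once, which makes it easy to spot.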

This brings us to problems #1 and #2:

<a id="problem1">Problem #1</a>: Non-unique names

  • We glossed over this in 6.x, but the truth is that Spell names are not unique. Some of the *RDs have duplicate names (not all, but I forget which). Same is true for languages.
  • (1a) Spells can theoretically be differentiated by evaluating the "TYPE" token for Divine, Arcane, or Psionic (those are "magical" items in our code).
  • (1b) Languages can theoretically be differentiated between "Spoken" and "Written" as those are "magical" types.

I believe both of those forms of magic are things on our backlog of FREQs to clean up... and the reason they are on the cleanup list is really L10N (as much as it is just to clean up the overuse of TYPE)

<a id="problem2">Problem #2</a>: Non-unique tokens

  • There are only a few tokens that are not unique. DESC is one of them, if I recall correctly. The probable solution here is to simply give an identifier to each token. This might require a change to LST, something like:
  • DESC*Overall:x
  • DESC*Second:x
  • (Note: I'm not sure this syntax works or is by any means "good". DESC:OVERALL|x might be better as it avoids potential issues with using * as a reserved character - take this as a principle of what would have to happen to the DESC token, not as a full-blown proposal)

Note that this naming of each DESC item (And other reusable items), while it breaks the "can't change the data" principle actually helps as much as it hurts... it would give us the ability to do things like:

  • DESC:.CLEARID.Overall
  • or things similar to that, which is actually a neat benefit for the small overhead of pain it puts into datasets. (Which, by the way, could be converted to whatever we decide on anyway with our nifty converter, so this doesn't seem all that bad)

So in my mind, the question really is: Is going slightly more than half way good enough? There are areas where we could do translation, and some areas where we need some pretty material core code changes to support it.

Translation files

Hi Tom,

Welcome BACK!!!

I don't mind the solutions put forth. Unique Identifiers for DESC is no more absurd than the Unique Identifiers we'll be needing to use more than one CHOOSER on a single line. Plus, the ability to clear off a Section of DESC would be awesome.

I'd vote for as complete a job as possible.

Here is where I'm not understanding - where are we going to put the translation stuff?

thpr wrote:

> Here is where I'm not understanding - where are we going to put the translation stuff?

Idea #1 (bad idea IMHO) is to put it in a separate directory: data/srd/... l10n/srd/...

The problem with that is that you get into synchronization issues... so any time files/directory names change in one place they have to change in another. That's a contract on the data developer. (It would also require multiple l10n directories, since we support multiple data directories, so it's a bunch of code)

Idea #2: Implicit Subdirectories: data/srd/*.lst data/srd/l10n/*.l10n

That ensures that the l10n directory is associated with the dataset.

Idea #3: Explicit designation: data/srd/srd.pcc contains LOCALIZATION:l10n/srd.l10n then: data/l10n/srd.l10n

(would support more than one .l10n file)


--- In pcgen_international@yahoogroups.com, Martijn Verburg <martijnverburg@...> wrote:

> > (2) We should target an ability to tell us if l10n is complete for any given data set
>
> Not sure what you mean by this?

If there is an object called "Dagger" we need to ensure the file contains: EQUIPMENT:Dagger|Aggerd-ay

If it contains: EQUIPMENT:Dgager|Aggerd-ay

That is also an error (just like the "unconstructed reference" items are errors). By capturing both type 1 and type 2 errors (things that aren't translated as well as things that shouldn't have been translated) we are capturing the vast majority of the simple problems.

Also need translation

Page numbers: either the original book is referenced, which doesn’t help people who have the translation, where the page is not the same, or the translated book is referenced and the number needs to be localized. Usually a data set represents an individual book, but sometimes compilations of several books are published. That means that several data sets would correspond to a single book. I know the French publisher of Pathfinder does this. I have no idea what the content is or how it is organized, but it might work to change the description to include something like chapter X of book Z.

The data sets point to the English books. Should the translation point to the localized one, if it exists?

See also