Internationalization

From PCGen Wiki
Revision as of 16:07, 11 December 2011

Introduction

PCGen will be attempting internationalization for the 6.0+ releases. To coordinate this work we have created a Yahoo group called pcgen_international (http://groups.yahoo.com/group/pcgen_international), which is dedicated to the development of non-English language content for PCGen.

Requests for translations, problems with translations, drafts of translated data sets or other translated content, and anything else related to the internationalization of PCGen should be discussed in that group.

Anybody interested in helping with translations into other languages, as well as those who are simply interested in the discussion, is encouraged to join the group. If you would like to join our internationalization efforts, please make yourself known on that list.

Sorting Issue

The international contributors have raised the sorting issue again. Another project would be to review the GUI/core and replace all sorting of items with internationalized methods of sorting (Collators).
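As a minimal sketch of what that replacement looks like, java.text.Collator supplies a locale-aware Comparator that can stand in for String's natural ordering wherever the GUI sorts display names (the skill names and locale below are illustrative, not from any data set):

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollatorSortSketch {
    // Sort display names with a locale-aware Collator instead of the naive
    // String ordering, which sorts accented characters after 'Z'.
    public static List<String> sortForLocale(List<String> names, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.SECONDARY); // case-insensitive, accents still ordered
        names.sort(collator); // Collator implements Comparator<Object>
        return names;
    }

    public static void main(String[] args) {
        List<String> skills = new ArrayList<>(Arrays.asList("Zauberei", "Ätherkunde", "Akrobatik"));
        // A German collator sorts "Ätherkunde" with the A's, not after "Zauberei".
        System.out.println(sortForLocale(skills, Locale.GERMAN));
    }
}
```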


New Ideas meriting thought

PER Masura20100:

I thought a bit about it, and I feel that if this is done it would need to meet the following requirements:

  • The data should be the same as the English data, to avoid having to duplicate data corrections.
  • Data that has not been translated into a language (for example, books not yet translated) should stay in the language it was first released in.
  • The interface language and the data language should be separate, so that people with books in a language other than their own can use the program more easily.
  • It should be possible to use another language in a data set, so that non-English speakers can develop custom content in their own language. That seems to be a problem when combined with keeping the interface language separate from the data language; a way to avoid it is to provide easy reference material for creating custom content.
  • Use of a collator to sort lists.

PER Tom:

Looking back at the earlier thread, I never actually came back with my suggestion, so here is my perspective on localization and how it should be done. This has been made easier by some changes in the 6.x line, and the new UI (which I believe isolates much more of the code that displays items) is a huge boost toward making this practical.

First a few base facts as background:

  • (1) For almost every item (Class, Skill, etc.) there is a unique identifier, generally referred to as a Key. Note that if the "KEY" token is not used, then the Key is the name (the first entry on the line in the data). (There is an exception to this we'll call problem #1.)
  • (2) There are basically 4 types of things that need translation:
  • (2a) item names
  • (2b) constants (like spell schools)
  • (2c) variables (like meters/feet/etc.)
  • (2d) Strings (like descriptions)
  • [if anyone can think of more, let me know]
  • (3) Most (but not all) tokens are "unique" or otherwise "addressable", in that they can only occur once per object. (Has anyone noticed that the tokens have started to be specific in the test code & docs about whether they overwrite, add, etc.? This is one reason why - and yes, I've been slowly trying to make progress on this even in the 2007-2009 work.) (The exception to this is problem #2.)
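Fact (1) can be sketched in code. The tab-separated layout below matches LST data lines, but the helper itself is hypothetical, not a real PCGen method:

```java
public class KeyResolutionSketch {
    // Fact (1) above: an object's unique identifier is the value of its KEY
    // token if one is present; otherwise it is the name, i.e. the first
    // entry on the (tab-separated) LST data line.
    public static String resolveKey(String lstLine) {
        String[] tokens = lstLine.split("\t");
        for (String token : tokens) {
            if (token.startsWith("KEY:")) {
                return token.substring("KEY:".length());
            }
        }
        return tokens[0]; // no KEY token: the name doubles as the Key
    }
}
```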

A few principles about l10n:

  • (1) We must not make producing a data set materially more complicated than it is today (no requirement to put %L10NNAME% type gunk into data)
  • (2) We should aim for the ability to tell whether l10n is complete for any given data set

With that:

  • Following from #1, almost everything we have in the data today is "addressable". By that, I mean that the OUTPUTNAME for a Skill called "FooBar" can be uniquely called out. The name hierarchy is something like: SKILL//Foo//OUTPUTNAME (the exceptions are problems #1 and #2, addressed later)

Given that, we can actually set principle #1 to "The data remains unchanged" (again, except for problem #2). The entire data set is produced (assume US English for a moment). We then have a unique file for l10n that has things like:

  • SKILL:FooBar|Oobarf-ay
  • SKILL:FooBar:OUTPUTNAME|ooBarOutF-ay
  • etc.
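A minimal loader for this hypothetical format could be as simple as the following. The split on '|' between the addressable path and the translation is an assumption taken from the examples above, not settled syntax:

```java
import java.util.HashMap;
import java.util.Map;

public class L10nTableSketch {
    // Build a lookup table from l10n lines such as
    //   SKILL:FooBar|Oobarf-ay
    //   SKILL:FooBar:OUTPUTNAME|ooBarOutF-ay
    // Key = the addressable path left of '|', value = the translation.
    public static Map<String, String> parse(String... lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            int pipe = line.indexOf('|');
            if (pipe < 0) {
                continue; // malformed line; a real loader would warn
            }
            table.put(line.substring(0, pipe), line.substring(pipe + 1));
        }
        return table;
    }

    // Look up a translation, falling back to the untranslated base value.
    public static String translate(Map<String, String> table, String address, String fallback) {
        return table.getOrDefault(address, fallback);
    }
}
```

The fallback behavior also matches Masura20100's point above: untranslated items simply keep the language they were first released in.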

This (to the first order) covers 2a, 2d.

For 2b items, we simply have to expand the list of items (SKILL, SPELL) that we are familiar with, so we get things like:

  • SPELLSCHOOL:Divination|Ivinationd-ay

2c is a bit more complicated, but I can't believe it's all that bad, given that it's something many applications already do (and it's a well-understood problem)
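For a sense of how 2c might look, java.text.MessageFormat already handles locale-sensitive number formatting. The patterns and the feet-to-meters conversion below are illustrative assumptions, not a proposal for the actual l10n file format:

```java
import java.text.MessageFormat;
import java.util.Locale;

public class UnitFormatSketch {
    // Format a distance using a per-locale pattern. English data stays in
    // feet; a German pattern shows the same value in meters (1 ft = 0.3048 m).
    // In practice the patterns would come from the l10n file, not be inlined.
    public static String formatRange(Locale locale, double feet) {
        if ("de".equals(locale.getLanguage())) {
            MessageFormat mf = new MessageFormat("{0,number,#.#} Meter", locale);
            return mf.format(new Object[] { feet * 0.3048 });
        }
        MessageFormat mf = new MessageFormat("{0,number,#} feet", locale);
        return mf.format(new Object[] { feet });
    }
}
```

Note that the locale passed to MessageFormat also controls the decimal separator, so a German user sees "9,1" rather than "9.1" with no extra work.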

Each file could be named in such a way that it identifies its l10n, e.g.:

  • srd_de_ch.l10n
  • ...and probably placed in a L10N subfolder of the initial dataset.
  • (Note: I'd recommend we be clear on where these go, and also require that NO PCC files (I'd recommend we say no PCC or LST, but the formal limit would be no PCC) are recognized if they are in the L10N folder. That way the initial directory parse [which is probably one of the slower parts of our boot process] can immediately ignore the l10n directory and not have to look through its file list for .PCC files. Alternatively, we could have an l10n folder that is parallel to the data folder, but that adds complication: it requires new preferences to point at multiple l10n folders, and more structure within the l10n folder to identify WHICH files map to which datasets... because we can't simply say a given English word will always translate the same way - English is far too overloaded for that.)
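Deriving the locale from such a file name could look like the sketch below. The dataset_language_country naming scheme is inferred from the single "srd_de_ch.l10n" example above, not an established convention, and dataset names containing underscores would need a firmer rule:

```java
import java.util.Locale;

public class L10nFileNameSketch {
    // Parse "srd_de_ch.l10n" into a Locale: strip the extension, then treat
    // the trailing underscore-separated parts as language and country codes.
    public static Locale localeFromFileName(String fileName) {
        String base = fileName.substring(0, fileName.lastIndexOf('.'));
        String[] parts = base.split("_");
        if (parts.length >= 3) {
            return new Locale(parts[parts.length - 2],
                              parts[parts.length - 1].toUpperCase(Locale.ROOT));
        }
        if (parts.length == 2) {
            return new Locale(parts[1]);
        }
        return Locale.ROOT; // no locale suffix recognized
    }
}
```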

I would recommend we keep things in a small subset of files, and not try to do 1:1 for each data file (that would produce a lot of file sprawl) - but that's not my call.

Addressing principle #2:

  • Since we have a set of items we know need to be covered (names, certain tokens), we should be able to load those into memory, load the l10n file into memory, and compare the two. This should be able to produce warnings of 2 kinds:
  • (W1) Warnings where the base data set contains things that are not translated
  • (W2) Warnings where the translation file attempts to translate things not in the base data set
  • I can't imagine that utility is all that hard to write (it would just need to build the list of todos)
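The core of that utility is two set differences. In the sketch below, the string sets stand in for whatever list of addressable keys the loader actually produces:

```java
import java.util.Set;
import java.util.TreeSet;

public class L10nCoverageSketch {
    // (W1) addressable items in the base data set with no translation
    public static Set<String> untranslated(Set<String> baseKeys, Set<String> translatedKeys) {
        Set<String> missing = new TreeSet<>(baseKeys);
        missing.removeAll(translatedKeys);
        return missing;
    }

    // (W2) entries in the translation file that address nothing in the base data
    public static Set<String> orphaned(Set<String> baseKeys, Set<String> translatedKeys) {
        Set<String> extra = new TreeSet<>(translatedKeys);
        extra.removeAll(baseKeys);
        return extra;
    }
}
```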

This brings us to problems #1 and #2:

Problem #1: Non-unique names

  • We glossed over this in 6.x, but the truth is that Spell names are not unique. Some of the *RDs have duplicate names (not all, but I forget which). Same is true for languages.
  • (1a) Spells can theoretically be differentiated by evaluating the "TYPE" token for Divine, Arcane, or Psionic (those are "magical" items in our code).
  • (1b) Languages can theoretically be differentiated between "Spoken" and "Written" as those are "magical" types.

I believe both of those forms of magic are on our backlog of FREQs to clean up... and the reason they are on the cleanup list is as much for L10N as it is to clean up the overuse of TYPE.

Problem #2: Non-unique tokens

  • There are only a few tokens that are not unique. DESC is one of them, if I recall correctly. The probable solution here is simply to give an identifier to each token. This might require a change to LST syntax. Something like:
  • DESC*Overall:x
  • DESC*Second:x
  • (Note: I'm not sure this syntax works or is by any means "good". DESC:OVERALL|x might be better, as it avoids potential issues with using * as a reserved character - take this as an illustration of what would have to happen to the DESC token, not as a full-blown proposal)

Note that this naming of each DESC item (and other repeatable items), while it breaks the "can't change the data" principle, actually helps as much as it hurts... it would give us the ability to do things like:

  • DESC:.CLEARID.Overall
  • or things similar to that, which is actually a neat benefit for the small overhead of pain it puts into datasets. (Which, by the way, could be converted to whatever we decide on anyway with our nifty converter, so this doesn't seem all that bad)

So in my mind, the question really is: Is going slightly more than halfway good enough? There are areas where we could do translation today, and some areas where we need some pretty material core code changes to support it.

  • END TOM Thoughts