Rules Persistence System

Background

This document is primarily intended to communicate the design of the PCGen Rules Persistence System.

This document provides a detailed overview of the architecture of a specific portion of PCGen. The overall architecture and further details of other subsystems and processes are provided in separate documents available on the Architecture page.

Key Terms

Loader
A Loader is a class responsible for loading a specific file type within the persistent form of either the PCGen Game Mode or a specific book ("Campaign").
Token
A Token is a piece of code that parses small parts of a file so the appropriate information can be loaded into the Rules Data Store.
Reference
A Reference is a holding object produced when a reference to an object is encountered in the data, and subsequently loaded with the underlying object when data load is complete.


Overview

This document describes the Rules Persistence System, and provides guidance on how to interact with the interface/API of the Rules Persistence System.

Architectural Design

It is worth pointing out at this stage that our LST language is - from a Computer Science perspective - a Domain Specific Language.

PCGen is (for the most part) strongly typed and highly structured, so we should take our guidance from languages like Java and C++, not from languages like Perl or JavaScript. We also have a very strong incentive to get our data loading "correct", meaning we should be able to catch errors up front at LST load. So we really *want* the benefits of a "compile" step, not "parse on the fly" behavior.

That overall observation gives us the ability to look at a few different aspects of how compilers work. Specifically:

  • Compilers must be able to parse the source files into a format that can be processed internally.
  • Compilers must consider "reference before construction" (addressed by building symbol tables during compilation - i.e. "How do you know that a variable reference refers to a declared variable?").

PCGen will have to address both of those items to successfully process files.


Parsing source files

Architectural Discussion

Most compilers make multiple passes over the structure of information in their source files. There may be pre-processors, etc. This is (often) facilitated by a parsing system that produces an object tree (via lex/yacc or JavaCC or equivalent), which is then processed by multiple different visitors, each of which can depend on information gleaned by the previous one. (This is actually how the new formula system parses formulas - it uses a specific JavaCC syntax.)

It is very difficult for PCGen to do a similar form of analysis. Our files are not conducive to being parsed by a tree-building system, due to the inconsistent nature of many of the LST tokens. An early version of such a parser from 2008 or so - never placed into a public repository - struggled with the exceptions and lack of consistent "reserved characters" and "separator characters" that are usually major highlights of a structured programming language.

However, we really *want* the benefits of a "compile" step, even though we can't build a tree. Therefore, we have currently designed the system to do a more linear parse of the files, while (for the most part) doing a strong validation of input.

Determining file format

We start with the concept of a File Loader. Knowing the file format is critical to determining which of the few dozen loaders should be used. We therefore have a set of rules so we "know" which file loader to apply to a given file.

For the first pass of a load, which loads the game mode files from disk, we know the precise format of each file, because the file names are highly rigid. "miscinfo.lst" is a strictly required file name in a game mode. (There is one and only one of that file; it must have that name and must not be in a sub-directory of the game mode for the game mode to be valid.) Therefore, the game mode loading code can hard-code a sequence of lookups that ties a specific loader to a specific file name.

In the second pass of a load, we are looking for PCC files. While this is no longer strict on the exact file name, we ARE strict on the file suffix (it must be PCC). This again allows us to infer the nature of the file we are processing and to build a strict association between the file name and the file format.

In the third pass of a load, we are data driven. We are loading contents as defined by a PCC file. Here, there is no longer a file name format. Rather, the contents of the PCC file use a specific key:value syntax that defines the format of each referenced file. The PCC file might contain "TEMPLATE:rsrd_templates.lst", for example, which indicates that the file "rsrd_templates.lst" is to be processed as a "TEMPLATE" file. There is no strict requirement that these items end in ".lst", although that is certainly a convention and well enough understood that exceptions would probably be a bit mind-bending to everyone.

Parsing an LST/PCC file

Each LST file type has an associated *Loader class within the pcgen.persistence.lst or pcgen.rules.persistence package. Spells, for example, are loaded using the SpellLoader class. In general, the pcgen.persistence.lst classes are the older system and pcgen.rules.persistence.* is the newer system for loading LST files. Within a file loader, we parse the file line by line. In most files, lines are independent (each line represents a separate object).
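
As a rough sketch of this line-oriented approach (this is not the actual PCGen loader code; the class and method names here are hypothetical), a loader of this kind amounts to a loop over the lines of a file, treating the first tab-separated entry as the object identity and the remaining entries as tokens:

  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.List;

  /** Hypothetical sketch of a line-oriented LST loader. */
  public class SimpleLstLoader {

      /** Reads a file and processes each non-comment line independently. */
      public void loadFile(Path lstFile) throws IOException {
          List<String> lines = Files.readAllLines(lstFile);
          for (String line : lines) {
              String trimmed = line.trim();
              // Blank lines and '#' comment lines carry no data.
              if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                  continue;
              }
              parseLine(trimmed);
          }
      }

      /** The first tab-separated entry names the object; the rest are tokens. */
      private void parseLine(String line) {
          String[] entries = line.split("\t");
          System.out.println("Object entry: " + entries[0]);
          for (int i = 1; i < entries.length; i++) {
              // Each remaining entry is a NAME:VALUE token for that object.
              System.out.println("  token: " + entries[i]);
          }
      }

      public static void main(String[] args) throws IOException {
          new SimpleLstLoader().loadFile(Path.of(args[0]));
      }
  }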

There are three major file formats we are dealing with:

Command-based
The first set are individual commands that will occur on a single object. This occurs, for example, in the "miscinfo.lst" file. Each line is processed and loaded into the GameMode object. Most of the Game Mode files are of this form, as are the PCC files. The GLOBALMODIFIER file in the data directory also operates this way. Since this can be seen as a slightly degenerate form of an object-based load (see below), it is not discussed in any detail in this document.
Object-Based
This set of files creates one object for each line (or the line represents a modification of an existing object). The majority of our LST files in the data directory are processed this way, as are the stats and checks files in a Game Mode. This is discussed in more detail below.
Batch-based
The CLASS and KIT files are a major exception to the object-based description above, since they are blocks of information, with a new "CLASS:x" or "STARTKIT" line representing the split to a new item. Investigation of the loading of those files is currently left as an exercise for the reader. (This should eventually be documented here, as it is relevant to the future direction of the system.)


Object-based file loading

For the majority of our files, the first entry on a line represents the ownership and behavior for that line. This can take a few formats, but in general takes one of these two forms:

PREFIX:DisplayName
PREFIX:Key.MODIFICATION

The PREFIX may be empty/missing depending on the file type, or may be something like ALIGNMENT: to indicate an alignment. A prefix is used in files that can define more than one type of object (e.g. stats and checks used to share a file when they were stored in the game mode).

The DisplayName is the starting name of the object.

For a modification (or any reference to the object), the KEY MUST be used. If no KEY: token is provided, then the DisplayName serves as the KEY.

The MODIFICATION is .COPY=x, .MOD, or .FORGET.

.COPY
Allows a data file to copy an existing object. This .COPY entry need not worry about file load order (see below). The value preceding the .COPY string identifies the object to be copied. This identifier is the KEY (or KEY and CATEGORY) of the object to be copied. The identifier for the copied object is placed after an equals sign that follows the .COPY String, e.g.: Dodge.COPY=MyDodge
.MOD
Allows a data file to modify an existing object. This .MOD entry need not worry about file load order (see below). All .MOD entries will be processed after all .COPY entries, regardless of the source file. The value preceding the .MOD string identifies the object to be modified. This identifier is the KEY (or KEY and CATEGORY) of the object to be modified. If more than one .COPY token produces an object with the same identifier, then a duplicate object error will be generated.
.FORGET
Allows a data file to remove an existing object from the Rules Data Store. This .FORGET entry need not worry about file load order (see below). All .FORGET entries will be processed after all .COPY and .MOD entries, regardless of the source file. The value preceding the .FORGET string identifies the object to be removed from the Rules Data Store.

Data Persistence File Load Order Independence

This section provides specific clarity on the Order of Operations during file loading.

When files are loaded, lines are processed in the order they appear in the file, unless the line is a MODIFICATION. Modification lines are processed after normal loading is complete. In particular, for all files of a given format (e.g. TEMPLATE):

  • All DisplayName (base) lines are processed before ANY .COPY is processed.
  • All .COPY items are processed before any .MOD items are processed.
  • All .MOD items are processed before any .FORGET items are processed.

Strictly, this ordering is Base/Copy/Mod/Forget per object type, so it does not prohibit parallelism between file types during file load. This order of operations is necessary so that a second file can perform a .COPY or .MOD on the contents of another file. It is also important to recognize that .COPY occurs before .MOD, which requires careful consideration of which items should appear on the original line versus in a .MOD line, as they are not always equivalent.
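
As a rough illustration of this phased processing (again, not the actual PCGen implementation; the class and method names are hypothetical), a loader can bucket lines as it reads them and only apply the buckets once every file of the format has been read:

  import java.util.ArrayList;
  import java.util.List;

  /**
   * Hypothetical sketch of the base/.COPY/.MOD/.FORGET ordering: all base
   * lines are applied first, then all copies, then all modifications, then
   * all removals, regardless of which file they came from.
   */
  public class PhasedLineProcessor {

      private final List<String> baseLines = new ArrayList<>();
      private final List<String> copyLines = new ArrayList<>();
      private final List<String> modLines = new ArrayList<>();
      private final List<String> forgetLines = new ArrayList<>();

      /** Called once per line, in file order, during the initial read. */
      public void addLine(String firstEntry) {
          if (firstEntry.contains(".COPY")) {
              copyLines.add(firstEntry);
          } else if (firstEntry.endsWith(".MOD")) {
              modLines.add(firstEntry);
          } else if (firstEntry.endsWith(".FORGET")) {
              forgetLines.add(firstEntry);
          } else {
              baseLines.add(firstEntry);
          }
      }

      /** Called after every file of this format has been read. */
      public void processAll() {
          baseLines.forEach(line -> System.out.println("construct: " + line));
          copyLines.forEach(line -> System.out.println("copy:      " + line));
          modLines.forEach(line -> System.out.println("modify:    " + line));
          forgetLines.forEach(line -> System.out.println("forget:    " + line));
      }
  }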

Source Information

There is one additional exception to the file processing as described above. If a line starts with a SOURCE*: token, then that line is processed as "persistent information" for that file. All items on that line will be applied to ALL items in the file. This should be limited to just source information that needs to be universally applied to included objects.


Tokens

Subsequent entries on a line represent tags/tokens on that object to give it information and behavior within PCGen.

In general, the format of a token is:

NAME:VALUE

The list of available tokens is specific to a given data persistence file type. This allows features to be limited to certain objects to avoid nonsensical situations (e.g. you can't assign material components to a Race). A collection of Global tags that can be used in nearly all data persistence files is also available.

The exact processing occurs within the plugins that are loaded to process each token. Each Token Class is stored in a separate file/class, independent of the core of PCGen, to allow each token to be independently updated, removed, or otherwise manipulated without altering or impacting other Tokens.

This also forces the Token Classes to be fairly simple, which makes them easy to test, modify, and understand (as they are effectively atomic to the processing of a specific token). One goal of the PCGen Rules Persistence System is to ensure that all of the parsing of LST files is done within the Tokens and not in the core of PCGen. This makes adding new tags to the LST files reasonably painless (though changes to the core or export system may also be required to add the desired functionality).

Individual Token files are in the pcgen.plugin.lsttokens package. Many rely on abstract classes provided in pcgen.rules.persistence.token. When PCGen is launched, the JARs within the Plugin directory are parsed for their contents. This actually happens in the gmgen.pluginmgr.JARClassLoader class. As one of many operations that take place during the import, each Class is analyzed to determine if it is a persistence Token (a persistence Token is defined as a non-abstract Class that implements the LstToken interface). When a persistence Token is found, it is imported into the TokenLibrary or TokenStore.
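
A minimal sketch of that discovery step might look like the following. The LstToken interface shown here is a stand-in defined locally for compilability (the real interface lives in the PCGen codebase), and the helper class and method names are hypothetical:

  import java.lang.reflect.Modifier;

  /** Hypothetical sketch of persistence-token discovery via reflection. */
  public class TokenDiscovery {

      /** Stand-in for the real PCGen interface, defined here for compilability. */
      public interface LstToken {
          String getTokenName();
      }

      /** A persistence Token is a non-abstract class implementing LstToken. */
      public static boolean isPersistenceToken(Class<?> candidate) {
          return LstToken.class.isAssignableFrom(candidate)
                  && !candidate.isInterface()
                  && !Modifier.isAbstract(candidate.getModifiers());
      }

      public static void register(Class<?> candidate) throws ReflectiveOperationException {
          if (isPersistenceToken(candidate)) {
              // Reflection is used because the core has no compile-time
              // dependency on the individual token classes.
              LstToken token = (LstToken) candidate.getDeclaredConstructor().newInstance();
              System.out.println("Registering token: " + token.getTokenName());
          }
      }
  }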

Discussion

As with any architecture, there are tradeoffs in having a plugin system. The first of these is in code association within the PCGen system. Due to the plugin nature (and the use of reflection) there are certain use-associations which cannot be made within an Integrated Development Environment (IDE) such as Eclipse. For example, it is impossible to find where a TemplateToken is constructed by automated search, as it is constructed by a Class.newInstance() call.

Another quirk of the plugin system is that it occasionally requires full rebuilds of the code in order to ensure the core code and the plugins are "in sync" on their functionality. This is reasonably rare, but it is a result of the lack of a hard dependency tree in the code (really, the same problem IDEs have in determining usage).

There are also some great advantages to a plugin system.

By using reflection to import and then inspect the classes, some associations can be made automatically and do not require translation tables. By having all of the information directly within the Token Classes, a 'contract' to update multiple locations in the code (or parameter files) is avoided. There is also a minimal amount of indirection (the indirection introduced by TokenStore's Token map is very easy to understand).

The addition of a Token Class to the Plugin JAR will allow the new Token to be parsed. This makes adding new tags to the LST files reasonably painless (actually having them perform functions in the PCGen core is another matter :) ).

Also, keeping each Token in an individual class keeps the Token Classes very simple, which makes them easy to test, modify, and understand (as they are effectively atomic to the processing of a specific token).

In the future, we may also be able to defer some loading of plugins until after the game mode has loaded, allowing us to only activate and load those tokens relevant for a specific game mode. Specifically, it would be nice to not have to process any ALIGNMENT based tokens in MSRD, for example (and to have them all automatically be errors as well). This need may be mitigated by the more data driven design we are working to develop.

Future Work

It would be nice if there were a method of forcing the isolation without having a slew of JAR files... sunsetting the need to update pluginbuild.xml when a new token is created would be nice as well. So there is probably an architectural choice here that involves the tradeoff between separate tokens, token discovery, the contract to update pluginbuild.xml, and modularity.

Identifying the Token

In determining which token is used, two items are relevant: first, the name of the token; second, the Class of object processed by the token. If two tokens that share the same name and processed class are found during plugin load, an error is thrown during PCGen startup.

How are token conflicts resolved? If two tokens have the same key (String before the : in the LST file), AND implement the same persistence Token Interface (e.g. PCClassLSTToken), then an error will be reported by the TokenStore class when the plugin JAR files are loaded.

The TokenStore is the older method of storing the tokens. In this case, a token must be an exact match on both the name (case insensitive - but by convention tokens are capitalized in the LST files) and the class of object being processed. The TokenStore effectively holds a Map<Class, Map<String, Token>>.

For the TokenLibrary, more flexibility is allowed. If an object such as a Language is being processed, the system will first look for tokens that match Language.class exactly. If that fails, the system will use reflection on Language.class to determine the parent class and see if a token of the appropriate NAME exists at that level. This is repeated until a relevant token plugin is found or the token is determined to be invalid.

This lookup starts within the TokenLibrary. Within the TokenLibrary exist multiple TokenFamily objects. Each version of PCGen can have its own TokenFamily. This allows tokens that support backwards compatibility to be contained separately from the primary tokens.
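
A minimal sketch of that class-hierarchy fallback, using a hypothetical lookup class and map rather than the real TokenLibrary/TokenFamily code, might look like this:

  import java.util.HashMap;
  import java.util.Map;

  /**
   * Hypothetical sketch of the fallback lookup: try the exact class first,
   * then walk up the superclass chain until a token with the requested name
   * is found or the hierarchy is exhausted.
   */
  public class HierarchyTokenLookup {

      /** token name -> (processed class -> token instance); illustrative only */
      private final Map<String, Map<Class<?>, Object>> tokens = new HashMap<>();

      public void addToken(String name, Class<?> processedClass, Object token) {
          tokens.computeIfAbsent(name, key -> new HashMap<>()).put(processedClass, token);
      }

      public Object getToken(String name, Class<?> objectClass) {
          Map<Class<?>, Object> byClass = tokens.get(name);
          if (byClass == null) {
              return null; // no token of that name at all
          }
          // Walk e.g. Language.class -> its parent -> ... until a match is found.
          for (Class<?> current = objectClass; current != null; current = current.getSuperclass()) {
              Object token = byClass.get(current);
              if (token != null) {
                  return token;
              }
          }
          return null; // the name exists, but not for this class hierarchy
      }
  }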

In some cases there are both Global tags and "local" tags that have the same key (e.g. "TEMPLATE"). As described above, the "local" key (one that is specific to a certain type of LST file) takes priority over the Global Token. This is the case with TEMPLATE, as the Global tag processing takes place in a call to PObjectLoader.parseTagLevel(), far below the PCClass-specific processing that takes place early in PCClassLoader.parseClassLine().

Future need: Interface Tokens

The current system does suffer from a number of issues. Some of our current "global" tokens really aren't global. They may be global insofar as an item is "granted" to a PC, but they would fail on other object types. In other situations, we have begun to move away from the heavyweight and complicated CDOMObject/PObject toward more lightweight objects, but we want to share behavior (and load tokens) there as well.

The existing TokenLibrary system has a few weaknesses relative to that new desire. As we rely more on interfaces than on direct inheritance, the TokenLibrary lookup will begin to fail.

We therefore need some infrastructure to load tokens based on the available interfaces on an object as well. Note that this will produce an ambiguity we will need to resolve. For example, if there is a REACH token that is appropriate for both CDOMObject.class and SomeInterface.class, then we need a bright-line rule as to which token will apply (or if sharing a name between hard-class based tokens and interface tokens produces an error).

Token processing order

In general, all tokens are processed in the order they are encountered.

One exception is CATEGORY: in Ability, which must be on the original line (illegal on COPY/MOD lines), and which is processed by the Loader.

Being processed in the order they are encountered does not mean that they are applied to the PC in the order in which they appear on the given line. That order of operations is defined within the core.

Subtokens

Some tags have complex behavior that significantly differs based on the first argument in the value of the tag. In order to simplify tag parsing and Token code, these Tokens implement a Sub-token structure, which delegates parsing of the tag value to a Token specialized to the first argument in the value of the tag.

This design is primarily intended to separate out the code for different subtokens. It makes it possible to add new subtokens without altering existing code, provides increased flexibility for developers, and ensures that unexpected side effects from code changes don't impact other features of PCGen.

Note that it is legal for a subtoken to be valid in only a single object type (such as a Race), even if the "primary" token is accepted universally. This greatly simplifies the restriction of subtokens to individual file types without placing a burden on the primary token to establish legal values. Resolution of those restrictions is handled entirely within the LoadContext and its supporting classes.
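
The following is a minimal sketch of that delegation pattern, using hypothetical interface and class names rather than the actual PCGen token API:

  import java.util.HashMap;
  import java.util.Map;

  /**
   * Hypothetical sketch of subtoken delegation: the primary token looks at
   * the first argument of its value and hands the remainder to a subtoken
   * specialized for that argument.
   */
  public class PrimaryTokenSketch {

      public interface SubToken {
          boolean parse(String remainingValue);
      }

      private final Map<String, SubToken> subTokens = new HashMap<>();

      public void addSubToken(String firstArgument, SubToken subToken) {
          subTokens.put(firstArgument, subToken);
      }

      /** A value of "FIRSTARG|rest" is routed to the FIRSTARG subtoken. */
      public boolean parse(String value) {
          int pipe = value.indexOf('|');
          String first = (pipe == -1) ? value : value.substring(0, pipe);
          String rest = (pipe == -1) ? "" : value.substring(pipe + 1);
          SubToken subToken = subTokens.get(first);
          if (subToken == null) {
              return false; // unknown subtoken: report a parse failure
          }
          return subToken.parse(rest);
      }
  }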

Re-entrant tokens

There are a few tokens that allow you to drill into a separate object and then apply another token. In Equipment for example:

PART:1|...

In this case the ... above is another token. This means that the token will have a second ':' used as a separator. In general (though not universally), an embedded ':' used as a separator indicates a re-entrant token.
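
A minimal sketch of splitting such a value (the token name and value used here are purely illustrative) looks like this:

  /**
   * Hypothetical sketch of handling a re-entrant value such as
   * "1|SOMETOKEN:subvalue": the leading argument selects the sub-object
   * (the part, in the Equipment example) and the remainder is itself a
   * NAME:VALUE token parsed against that sub-object.
   */
  public class ReentrantValueSplit {

      public static void main(String[] args) {
          String value = "1|SOMETOKEN:subvalue";

          int pipe = value.indexOf('|');
          String partId = value.substring(0, pipe);
          String embedded = value.substring(pipe + 1);

          // The embedded entry has its own ':' separator, which is why a
          // second ':' in the original tag usually signals a re-entrant token.
          int colon = embedded.indexOf(':');
          String embeddedName = embedded.substring(0, colon);
          String embeddedValue = embedded.substring(colon + 1);

          System.out.println("part = " + partId);
          System.out.println("embedded token = " + embeddedName + ", value = " + embeddedValue);
      }
  }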

Prerequisite Tags

Currently the Prerequisite tags are an exception to the parsing system. The Prerequisite tags have a prefix of "PRE" and are followed by the Prerequisite name, e.g. PREFEAT. This means that the Prerequisite tags do not follow the traditional method of having a unique name before the colon. Also, Prerequisite tags can have a leading ! to negate the Prerequisite.

In order to address this different token definition system, the PreCompatibilityToken class provides a wrapper into the new PCGen 5.16+ token syntax.
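
As an illustration of why these tags need special handling (the real PreCompatibilityToken logic is more involved; the helper below is hypothetical), recognizing a prerequisite tag amounts to stripping an optional leading '!' and checking for a name that begins with PRE:

  /** Hypothetical sketch of recognizing prerequisite tags in a tag stream. */
  public class PrereqTagSketch {

      public static void main(String[] args) {
          classify("!PREFEAT:1,Dodge");
          classify("PRERACE:1,Elf");
          classify("TYPE:Humanoid");
      }

      private static void classify(String tag) {
          boolean negated = tag.startsWith("!");
          String name = negated ? tag.substring(1) : tag;
          int colon = name.indexOf(':');
          String key = (colon == -1) ? name : name.substring(0, colon);

          if (key.startsWith("PRE")) {
              // Prerequisite tags share the PRE prefix rather than a unique key.
              System.out.println(tag + " -> prerequisite " + key + (negated ? " (negated)" : ""));
          } else {
              System.out.println(tag + " -> ordinary token " + key);
          }
      }
  }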

Class Wrapped Token

A ClassWrappedToken provides compatibility for previously allowed bad behavior in data files.

Many Class tokens in PCGen versions up to 5.14 ignored the class level, so they are technically Class tags and not CLASSLEVEL tags. Yet PCGen 5.14 allows those tags to appear on class level lines. This is a bit deceptive to users, in that the effect will always be on the class and will not appear at the specified level.

Unfortunately, one cannot simply remove support for using CLASS tokens on CLASSLEVEL lines, because if they are used at level 1, then they are equivalent to appearing on a CLASS line. Certainly, the data monkeys use it that way. For example, Blackguard in RSRD advanced uses EXCHANGELEVEL on the first level line.

Therefore, the entire ClassWrappedToken system is a workaround for data monkeys using CLASS tokens on CLASSLEVEL lines, and it should only work on level one; otherwise, expectations for when the token will take effect are not set correctly.

Future Work

This should eventually be removed, so that it is immediately clear from reading the data where a token is legal and where it is not.

Format of the value of a token

In most cases, we use a vertical pipe to separate different components of a VALUE. Each Token can process the exact contents and load the appropriate information into the Rules Data Store.

The format of each token is described in the PCGen documentation.

Unparsing

Adding output to the persistence system provides the ability to reuse the Rules Persistence System in a data file editor, as well as the runtime system. This sharing of code helps to guarantee the integrity of the data file editor. Such a structure also facilitates unit testing, as the Rules Persistence System can be tested independently of the core code.

All tokens loaded into the TokenLibrary (but not those in the TokenStore) have the ability to both "parse" and "unparse" information for the Rules Persistence System. Parsing is the act of reading a token value from a data persistence file and placing it into the internal rules data structure. Unparsing is the act of reading the internal data structure and writing out the appropriate syntax into a data persistence file.

In addition to other benefits, this parse/unparse structure allows Tokens to be tested without major dependence on other components of PCGen. These tests are found in the plugin.lsttokens package of the code/src/utest source directory.

Shared Persistence System with (future) Editor

The data persistence system should be usable for both a data file editor and the runtime character generation program.

The significant investment made in ensuring that persistent data is read without errors should be reused across both a data file editor and the runtime system. Consolidation reduces the risk of error and ensures that the editor will always be up to date (a problem that led to the removal of the previous editor). In addition, editing capabilities (e.g. editing data in place) that are not available today can be added once a full-capability editor is available.

Tokens may overwrite previous values or add to the set of values for that tag. In the case of an editor, it is critically important not to lose information that would later be overwritten in a runtime environment. A simple example would be the use of a .MOD to alter the number of HANDS on a Race. This alteration should be maintained in the file that contained the .MOD, and the value (or unspecified default) in the original Race should not be lost. This is done by tracking the exact changes that occur during data load. This ability to handle changes is fully explained in the Load Commit Subsystem.

Unparsing in practice

The File Loaders separate out the tags in an input file and call the parse method on the appropriate Tokens. In order to unparse a loaded object back to the data persistence syntax, all Tokens that could be used on the given object type must be called (this makes unparse a bit more CPU intensive than parse).

Unparsing a particular object requires delegating the unparse to all tokens and subtokens to see whether they were used. Because all tokens are called when unparsing an object, it is important that tokens properly indicate when they are not used. This is done by returning null from the unparse method of the Token.

Some tokens can be used more than once in a given object (e.g. BONUS), and thus must be capable of indicating each of the values for the multiple tag instances. Since Tokens do not maintain state, the unparse method must only be called a single time to get all of the values; thus, the unparse method returns an array of String objects to indicate the list of values for each instance of the tag being unparsed.

Tokens should not include the name of the tag in the unparsed result. Just as the token is not responsible for removing/ignoring the name of the tag in the value passed into the parse method, it does not prepend the name of the tag to the value(s) returned from the unparse method. (This also happens to simplify the conversion and compatibility systems.)
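
The following sketch illustrates those conventions with a HANDS-style token; the RaceLike class and its accessors are illustrative stand-ins rather than PCGen API:

  /**
   * Hypothetical sketch of the unparse conventions: return null when the
   * token was never used on the object, otherwise return one String per tag
   * instance, without the tag name prefix.
   */
  public class HandsTokenSketch {

      /** Illustrative stand-in for an object in the Rules Data Store. */
      public static class RaceLike {
          private Integer hands; // null means "never set by data"

          public Integer getHands() {
              return hands;
          }

          public void setHands(int hands) {
              this.hands = hands;
          }
      }

      public String getTokenName() {
          return "HANDS";
      }

      /** Reads "HANDS:2" style input; the loader strips the tag name first. */
      public boolean parse(RaceLike race, String value) {
          try {
              race.setHands(Integer.parseInt(value));
              return true;
          } catch (NumberFormatException e) {
              return false; // signals a data error at LST load time
          }
      }

      /** Returns null if the token was never used; values only otherwise. */
      public String[] unparse(RaceLike race) {
          Integer hands = race.getHands();
          if (hands == null) {
              return null;
          }
          return new String[]{hands.toString()};
      }
  }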

Further Reading

To understand more about how PCGen handles the reference of an object before it is constructed, see CDOM References Concept Document

To understand more about how PCGen handles communicating information from the tokens to the Rules Data Store, see Load Commit Subsystem