Rules Persistence System
Revision as of 21:05, 25 February 2018

Background

This document is primarily intended to communicate the design of PCGen Rules Persistence System.

This document provides a detailed overview of the architecture of a specific portion of PCGen. The overall architecture and further details of other subsystems and processes are provided in separate documents available on the Architecture page.

Key Terms

Loader
A Loader is a class used to load a specific file type within the persistent form of either the PCGen Game Mode or a specific book ("Campaign").
Token
A Token is a piece of code that parses small parts of a file so the appropriate information can be loaded into the Rules Data Store.
Reference
A Reference is a holding object produced when a reference to an object is encountered in the data, and subsequently loaded with the underlying object when data load is complete.


Overview

This document describes the Rules Persistence System, and provides guidance on how to interact with the interface/API of the Rules Persistence System.


Architectural Design

It is worth pointing out that our LST language is, from a Computer Science perspective, a Domain Specific Language.

PCGen is (for the most part) strongly typed and highly structured, so we should take our guidance from languages like Java and C++, not from languages like Perl or JavaScript. Also, we have a very high incentive to get our data loading "correct", so we should be able to catch errors up front at LST load. We really *want* the benefits of a "compile" step, not "parse on the fly".

That overall observation gives us the ability to look at a few different aspects of how compilers work. Specifically:

  • Compilers must be able to parse the source files into a format that can be processed internally.
  • Compilers must consider "reference before construction" (we call it building symbol tables during compilation - i.e. "How do you know that variable reference was a declared variable?").

PCGen will have to address both of those items to successfully process files.


Parsing source files

Architectural Discussion

Most compilers do multiple passes at the structure of information in their source files. There may be pre-processors, etc. This is (often) facilitated by a parsing system that produces an object tree (via lex/yacc or JavaCC or equivalent), which is then processed with multiple different visitors, each of which can depend on information gleaned by the previous one. (This is actually how the new formula system parses formulas; it uses a specific JavaCC syntax.)

It is very difficult for PCGen to do a similar form of analysis. Our files are not conducive to being parsed by a tree-building system, due to the inconsistent nature of many of the LST tokens. An early version of such a parser from 2008 or so - never placed into a public repository - struggled with the exceptions and lack of consistent "reserved characters" and "separator characters" that are usually major highlights of a structured programming language.

However, we really *want* the benefits of a "compile" step, even though we can't build a tree. Therefore, we have currently designed the system to do a more linear parse of the files, while (for the most part) doing a strong validation of input.

Determining file format

We start with the concept of a File Loader. Knowing the file format is critical to understand which of the few dozen loaders should be used. We therefore have a set of rules so we "know" what file loader to apply to a given file.

For the first pass of a load, which is loading the game mode files from disk, we know the precise format of the file, because the file names are highly rigid. "miscinfo.lst" is a strictly required file name in a game mode. (There is one and only one of that file and it must have that name and not be in a sub-directory of the game mode for the game mode to be valid). Therefore, the code can hard-code this into a sequence of lookup processes in the game mode directly that ties a specific loader to a specific file name.

In the second pass of a load, we are looking for PCC files. While this is no longer strict on the exact file name, we ARE strict on the file suffix (must be PCC). This again allows us to infer the nature of the file we are processing, allowing us to build a strict association between the file name and the file format.

In the third pass of a load, we are now data driven. We are loading contents as defined by a PCC file. Here, there is no longer a file name format. Rather, the contents of the PCC file have a specific key:value syntax that defines the format of each file. The PCC file might contain "TEMPLATE:rsrd_templates.lst" for example, which indicates that the file "rsrd_templates.lst" is to be processed as a "TEMPLATE" file. There is no strict requirement that these items end in ".lst", although that is certainly a convention and well enough understood that exceptions would probably be a bit mind-bending to everyone.
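The three-pass loader selection described above can be sketched as follows. This is an illustrative assumption of the structure, not PCGen's actual API; the class and loader names here are invented for the example:

```java
import java.util.Map;

// Illustrative sketch of three-pass loader selection.
// The loader names are assumptions, not PCGen's actual classes.
public class LoaderResolver {

    // Pass 1: game mode files are identified by their rigid file names.
    private static final Map<String, String> GAME_MODE_LOADERS = Map.of(
            "miscinfo.lst", "GameModeMiscInfoLoader",
            "statsandchecks.lst", "StatsAndChecksLoader");

    public static String resolve(String fileName, String pccKey) {
        String byName = GAME_MODE_LOADERS.get(fileName.toLowerCase());
        if (byName != null) {
            return byName;                      // pass 1: exact file name
        }
        if (fileName.toLowerCase().endsWith(".pcc")) {
            return "CampaignLoader";            // pass 2: strict file suffix
        }
        if (pccKey != null) {
            return pccKey + "Loader";           // pass 3: key from the PCC file, e.g. TEMPLATE
        }
        throw new IllegalArgumentException("No loader for " + fileName);
    }
}
```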


Parsing an LST/PCC file

Each LST file type has an associated *Loader class within the pcgen.persistence.lst or pcgen.rules.persistence package. Spells, for example, are loaded using the SpellLoader class. In general, the pcgen.persistence.lst items are older, and pcgen.rules.persistence.* is the newer system for loading LST files. Within a file loader, we parse the file line by line. In most files, lines are independent (each line represents a separate object).

There are three major file formats we are dealing with:

Command-based
The first set consists of individual commands that occur on a single object. This occurs, for example, in the "miscinfo.lst" file. Each line is processed and loaded into the GameMode object. Most of the Game Mode files are of this form, as are the PCC files. The GLOBALMODIFIER file in the data directory also operates this way. Since this can be seen as a slightly degenerate form of an object-based load (see below), this is not discussed in any detail in this document.
Object-Based
This set of files creates one object for each line (or the line represents a modification of an existing object). The majority of our LST files in the data directory are processed this way, as are the stats and checks files in a Game Mode. This is discussed in more detail below.
Batch-based
The CLASS and KIT files are a major exception to the object-based description above, since they are blocks of information with a new "CLASS:x" or "STARTKIT" line representing the split to a new item. Investigation of the loading of those files is currently left as an exercise for the reader. ***This should actually be included as it is relevant to future direction


Object-based file loading

For the majority of our files, the first entry on a line represents the ownership and behavior for that line. This can take a few formats, but in general takes one of these two forms:

PREFIX:DisplayName
PREFIX:Key.MODIFICATION

The PREFIX may be empty/missing depending on the file type. The PREFIX may be something like ALIGNMENT: to indicate an alignment. This is done in files that can define more than one format. (e.g. Stats and checks used to be shared when they were stored in the game mode)

The DisplayName is the starting name of the object.

For a modification (or any reference to the object), the KEY MUST be used. If no KEY: token is provided, then the DisplayName serves as the KEY.

The MODIFICATION is one of COPY=x, MOD, or FORGET.

.COPY
Allows a data file to copy an existing object. This .COPY entry need not worry about file load order (see below). The value preceding the .COPY string identifies the object to be copied. This identifier is the KEY (or KEY and CATEGORY) of the object to be copied. The identifier for the copied object is placed after an equals sign that follows the .COPY String, e.g.: Dodge.COPY=MyDodge
.MOD
Allows a data file to modify an existing object. This .MOD entry need not worry about file load order (see below). All .MOD entries will be processed after all .COPY entries, regardless of the source file. The value preceding the .MOD string identifies the object to be modified. This identifier is the KEY (or KEY and CATEGORY) of the object to be modified. If more than one .COPY token produces an object with the same identifier, then a duplicate object error will be generated.
.FORGET
Allows a data file to remove an existing object from the Rules Data Store. This .FORGET entry need not worry about file load order (see below). All .FORGET entries will be processed after all .COPY and .MOD entries, regardless of the source file. The value preceding the .FORGET string identifies the object to be removed from the Rules Data Store.
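A minimal sketch of splitting a line's first entry into its prefix, key, and modification under the syntax described above. The Entry record and parse logic are illustrative assumptions, not PCGen's actual code:

```java
// Illustrative sketch of parsing PREFIX:Key.MODIFICATION first entries;
// not PCGen's actual parsing code.
public class FirstEntryParser {

    public record Entry(String prefix, String key, String modification, String copyTarget) {}

    public static Entry parse(String firstEntry) {
        String prefix = null;
        String rest = firstEntry;
        int colon = rest.indexOf(':');
        if (colon >= 0) {                        // optional PREFIX:, e.g. ALIGNMENT:
            prefix = rest.substring(0, colon);
            rest = rest.substring(colon + 1);
        }
        if (rest.endsWith(".MOD")) {
            return new Entry(prefix, rest.substring(0, rest.length() - 4), "MOD", null);
        }
        if (rest.endsWith(".FORGET")) {
            return new Entry(prefix, rest.substring(0, rest.length() - 7), "FORGET", null);
        }
        int copy = rest.indexOf(".COPY=");       // e.g. Dodge.COPY=MyDodge
        if (copy >= 0) {
            return new Entry(prefix, rest.substring(0, copy), "COPY",
                    rest.substring(copy + ".COPY=".length()));
        }
        return new Entry(prefix, rest, null, null);  // a plain DisplayName line
    }
}
```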


Data Persistence File Load Order Independence

This provides specific clarity on the Order of Operations during file loading.

When files are loaded, they are processed in order as the lines appear in the file, unless the line is a MODIFICATION. If it is a modification, it is processed after normal loading is complete. Note this means ALL FILES of a given format (e.g. TEMPLATE) are loaded with their DisplayName lines processed before ANY .COPY is processed. All .COPY items are processed before any .MOD items are processed. All .MOD items are processed before any .FORGET items are processed. (Note that strictly this is Base/Copy/Mod/Forget by object type, it doesn't strictly inhibit parallelism between file types during file load). This order of operations is necessary so that a second file can perform a .COPY or .MOD on the contents of another file. It is also important to recognize that .COPY occurs before .MOD, which gives strict consideration to what items may want to appear on the original line vs in a .MOD line as they are not always equivalent.
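The pass ordering above can be sketched as follows; this is a simplified illustration of the sequencing (base lines, then .COPY, then .MOD, then .FORGET), not PCGen's actual load code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the four-pass ordering described above.
public class LoadOrder {
    public static List<String> process(List<String> lines) {
        List<String> order = new ArrayList<>();
        // Pass 1: all base (DisplayName) lines from all files of this type.
        for (String line : lines) {
            if (!line.contains(".COPY") && !line.contains(".MOD") && !line.contains(".FORGET")) {
                order.add(line);
            }
        }
        // Pass 2: all .COPY lines, regardless of source file.
        for (String line : lines) if (line.contains(".COPY")) order.add(line);
        // Pass 3: all .MOD lines, after every .COPY has been processed.
        for (String line : lines) if (line.contains(".MOD")) order.add(line);
        // Pass 4: all .FORGET lines, after every .COPY and .MOD.
        for (String line : lines) if (line.contains(".FORGET")) order.add(line);
        return order;
    }
}
```

Because all base lines are processed first, a .COPY or .MOD in one file can safely target an object defined in another file of the same type.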


Source Information

There is one additional exception to the file processing as described above. If a line starts with a SOURCE*: token, then that line is processed as "persistent information" for that file. All items on that line will be applied to ALL items in the file. This should be limited to just source information that needs to be universally applied to included objects.

Tokens

Subsequent entries on a line represent tags/tokens on that object to give it information and behavior within PCGen.

In general, the format of a token is:

NAME:VALUE

The list of available tokens is specific to a given data persistence file type. This allows features to be limited to certain objects to avoid nonsensical situations (e.g. you can't assign material components to a Race). A collection of Global tags that can be used in nearly all data persistence files is also available.
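The NAME:VALUE split itself is simple, and can be sketched as follows; the class name is an illustrative assumption, and the exact processing of VALUE is left to the individual token plugin, as described below:

```java
// Minimal sketch of separating a token entry into NAME and VALUE.
public class TokenSplit {
    public static String[] split(String token) {
        int colon = token.indexOf(':');  // only the first ':' separates NAME from VALUE
        if (colon < 0) {
            throw new IllegalArgumentException("Not a NAME:VALUE token: " + token);
        }
        return new String[] { token.substring(0, colon), token.substring(colon + 1) };
    }
}
```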

The exact processing occurs within the plugins that are loaded to process each token. Each Token Class is stored in a separate file/class, independent of the core of PCGen, to allow each token to be independently updated, removed, or otherwise manipulated without altering or impacting other Tokens.

This also forces the Token Classes to be fairly simple, which makes them easy to test, modify, and understand (as they are effectively atomic to the processing of a specific token). One goal of the PCGen Rules Persistence System is to ensure that all of the parsing of LST files is done within the Tokens and not in the core of PCGen. This makes adding new tags to the LST files reasonably painless (though changes to the core or export system may also be required to add required functionality).

Individual Token files are in the pcgen.plugin.lsttokens package. Many may rely on abstract classes provided in pcgen.rules.persistence.token. When PCGen is launched, JARs that are within the Plugin directory are parsed for their contents. This actually happens in the gmgen.pluginmgr.JARClassLoader Class. As one of many operations that takes place during the import, each Class is analyzed to determine if it is a persistence Token (a persistence Token is defined as a non-abstract Class that implements the LstToken interface). When a persistence Token is found, it is imported into the TokenLibrary or TokenStore.
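The discovery test applied to each class found in a plugin JAR can be sketched as follows. The LstToken interface here is a stand-in with an invented method, not PCGen's actual interface, and the scanner class is hypothetical:

```java
import java.lang.reflect.Modifier;

// Illustrative sketch of persistence-Token discovery via reflection.
public class PluginScanner {

    public interface LstToken { String getTokenName(); }

    // Example token class, as one might appear inside a plugin JAR.
    public static class ReachToken implements LstToken {
        public String getTokenName() { return "REACH"; }
    }

    // A persistence Token is a non-abstract class implementing LstToken.
    public static boolean isPersistenceToken(Class<?> clazz) {
        return LstToken.class.isAssignableFrom(clazz)
                && !clazz.isInterface()
                && !Modifier.isAbstract(clazz.getModifiers());
    }

    // PCGen historically used Class.newInstance(); the modern equivalent is shown.
    public static LstToken instantiate(Class<?> clazz) {
        try {
            return (LstToken) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```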

Discussion

As with any architecture, there are tradeoffs in having a plugin system. The first of these is in code association within the PCGen system. Due to the plugin nature (and the use of reflection) there are certain use-associations which cannot be made within an Integrated Development Environment (IDE) such as Eclipse. For example, it is impossible to find where a TemplateToken is constructed by automated search, as it is constructed by a Class.newInstance() call.

One quirk of the plugin system is that it occasionally requires full rebuilds of the code in order to ensure the core code and the plugins are "in sync" on their functionality. This is reasonably rare, but is a result of the lack of a hard dependency tree in the code (really, the same problem IDEs have in determining usage).

There are also some great advantages to a plugin system.

By using reflection both to import the classes and to inspect them, some associations can be made automatically, and do not require translation tables. By having all of the information directly within the Token Classes, a 'contract' to update multiple locations in the code (or parameter files) is avoided. There is also a minimal amount of indirection (the indirection introduced by TokenStore's Token map is very easy to understand).

The addition of a Token Class to the Plugin JAR will allow the new Token to be parsed. This makes adding new tags to the LST files reasonably painless (actually having it perform functions in the PCGen core is another matter :) )

Also, keeping each Token in an individual class keeps the Token Classes very simple, which makes them easy to test, modify, and understand (as they are effectively atomic to the processing of a specific token).

In the future, we may also be able to defer some loading of plugins until after the game mode has loaded, allowing us to only activate and load those tokens relevant for a specific game mode. Specifically, it would be nice to not have to process any ALIGNMENT based tokens in MSRD, for example (and to have them all automatically be errors as well). This need may be mitigated by the more data driven design we are working to develop.

Future Work

It would be nice if there were a method of forcing the isolation without having a slew of JAR files... sunsetting the need to update pluginbuild.xml when a new token is created would be nice as well. So there is probably an architectural choice here that involves the tradeoff between separate tokens, token discovery, the contract to update pluginbuild.xml, and modularity.

Identifying the Token

In determining which token is used, two items are relevant: first, the name of the token; second, the Class of object processed by the token. If two tokens are found during plugin load that share the same name and class processed, an error is thrown during PCGen startup.

How are token conflicts resolved? If two tokens have the same key (String before the : in the LST file), AND implement the same persistence Token Interface (e.g. PCClassLSTToken), then an error will be reported by the TokenStore class when the plugin JAR files are loaded.

The TokenStore is the older method of storing the tokens. In these cases, the tokens must be an exact match to both the name (case insensitive - but by convention they are capitalized in the LST files) and the class of object being processed. The TokenStore effectively has a Map<Class, Map<String, Token>>.

For the TokenLibrary, more flexibility is allowed. If an object like a Language is being processed, then the system will first look for tokens that match Language.class exactly. If that fails, then the system will use reflection on Language.class to determine the parent class and see if a token of the appropriate NAME exists at that level. This is repeated until a relevant token plugin is found or the token is determined to be invalid.
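The class-hierarchy walk can be sketched as follows. This is an illustrative assumption of the lookup structure (shown here over ordinary JDK classes), not PCGen's actual TokenLibrary code:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of hierarchy-walking token lookup;
// effectively a Map<Class, Map<String, token>> as in TokenStore.
public class TokenLookup {

    private final Map<Class<?>, Map<String, String>> tokens = new HashMap<>();

    public void register(Class<?> cls, String name, String token) {
        tokens.computeIfAbsent(cls, k -> new HashMap<>()).put(name.toUpperCase(), token);
    }

    // Walk from the exact class up through its superclasses until a match is found.
    public String lookup(Class<?> cls, String name) {
        for (Class<?> c = cls; c != null; c = c.getSuperclass()) {
            Map<String, String> byName = tokens.get(c);
            if (byName != null && byName.containsKey(name.toUpperCase())) {
                return byName.get(name.toUpperCase());
            }
        }
        return null;  // token is invalid for this object type
    }
}
```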

This lookup starts within the TokenLibrary. Within that TokenLibrary exist multiple TokenFamily objects. Each version of PCGen can have its own TokenFamily. This allows tokens that support backwards compatibility to be contained separately from the primary tokens.

In some cases there are both Global tags and "local" tags that have the same key (e.g. "TEMPLATE"). As described above, the "local" key (one that is specific to a certain type of LST file) takes priority over the Global Token. This is the case with TEMPLATE, as the Global tag processing takes place in a call to PObjectLoader.parseTagLevel(), far below the PCClass-specific processing that takes place early in PCClassLoader.parseClassLine().

Future need: Interface Tokens

The current system does suffer from a number of issues. Some of our current "global" tokens really aren't global. They may be global in as much as an item is "granted" to a PC, but would fail on other object types. For other situations, we have begun to move away from the heavyweight and complicated CDOMObject/PObject into a more lightweight object, but we want to share behavior (and load tokens) there as well.

The existing TokenLibrary system has a few weaknesses with that new desire. As we rely more on interfaces than direct inheritance, TokenLibrary will begin to fail.

We therefore need some infrastructure to load tokens based on the available interfaces on an object as well. Note that this will produce an ambiguity we will need to resolve. For example, if there is a REACH token that is appropriate for both CDOMObject.class and SomeInterface.class, then we need a bright-line rule as to which token will apply (or if sharing a name between hard-class based tokens and interface tokens produces an error).

Token processing order

In general, all tokens are processed in the order they are encountered.

One exception is CATEGORY: in Ability, which must be on the original line (illegal on COPY/MOD lines), and which is processed by the Loader.

Being processed in the order they are encountered does not mean that they are applied to the PC in the order in which they appear on the given line. That order of operations is defined within the core.

Subtokens

Some tags have complex behavior that significantly differs based on the first argument in the value of the tag. In order to simplify tag parsing and Token code, these Tokens implement a Sub-token structure, which delegates parsing of the tag value to a Token specialized to the first argument in the value of the tag.

This design is primarily intended to separate out the code for different subtokens, which makes it possible to add new subtokens without altering existing code. This provides increased flexibility for developers, and ensures that unexpected side effects from code changes don't impact other features of PCGen.

The LoadContext is capable of processing subtokens for a given Token. Any token which delegates to subtokens can call processSubToken(T, String, String, String) from LoadContext in order to delegate to subtokens. This delegation returns a boolean value to indicate success (true) or failure (false) of the delegation. The exact cause of any failure is reported to the Logging utility.

Note that it is legal for a subtoken to only be valid in a single object type (such as a Race), even if the "primary" token is accepted universally. This greatly simplifies the restriction of subtokens to individual file types without producing burden on the primary token to establish legal values. Resolution of those restrictions is handled entirely within the LoadContext and its supporting classes.
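A hypothetical sketch of how a primary token might delegate to subtokens keyed on the first argument of its value. The registry layout, interface, and ADD subtoken names are illustrative, not PCGen's actual API:

```java
import java.util.Map;

// Illustrative sketch of subtoken delegation.
public class SubTokenDemo {

    interface SubToken { boolean parse(String value); }

    // Subtokens registered under (parentToken, subTokenName).
    private static final Map<String, SubToken> SUB_TOKENS = Map.of(
            "ADD|LANGUAGE", v -> !v.isEmpty(),
            "ADD|SKILL", v -> !v.isEmpty());

    // Delegate e.g. "ADD:LANGUAGE|Elven" to the LANGUAGE subtoken with value "Elven".
    public static boolean processSubToken(String parent, String tokenValue) {
        int pipe = tokenValue.indexOf('|');
        String subName = pipe < 0 ? tokenValue : tokenValue.substring(0, pipe);
        String subValue = pipe < 0 ? "" : tokenValue.substring(pipe + 1);
        SubToken sub = SUB_TOKENS.get(parent + "|" + subName);
        return sub != null && sub.parse(subValue);  // false signals failure to the caller
    }
}
```

An unknown subtoken simply fails the delegation, which is how restriction of subtokens to particular file types can be enforced without burdening the primary token.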

Re-entrant tokens

There are a few tokens that allow you to drill into a separate object and then apply another token. In Equipment for example:

PART:1|...

In this case the ... above is another token. This means that the token will have a second ':' used as a separator. In general (though not universally), an embedded ':' used as a separator indicates a re-entrant token.
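A re-entrant parse can be sketched as follows; the record, method name, and the DAMAGE inner token in the example are illustrative assumptions, not PCGen's actual Equipment code:

```java
// Illustrative sketch of a re-entrant parse: the value after the part number
// is itself a complete NAME:VALUE token.
public class ReentrantDemo {

    public record Parsed(int partNumber, String innerName, String innerValue) {}

    // e.g. "PART:1|DAMAGE:1d6" -> part 1, inner token DAMAGE with value 1d6
    public static Parsed parsePart(String token) {
        String value = token.substring(token.indexOf(':') + 1);  // strip "PART:"
        int pipe = value.indexOf('|');
        int part = Integer.parseInt(value.substring(0, pipe));
        String inner = value.substring(pipe + 1);
        int colon = inner.indexOf(':');  // the second ':' that marks re-entrancy
        return new Parsed(part, inner.substring(0, colon), inner.substring(colon + 1));
    }
}
```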

Format of the value of a token

In most cases, we use a vertical pipe to separate different components of a VALUE. Each Token can process the exact contents and load the appropriate information into the Rules Data Store.

The format of each token is described within the PCGen documentation.