Docs: ERPSS Portability Conventions
The Event-Related Potential Software System (ERPSS) is a set of programs and data formats which allow versatile analysis of neurophysiological data. Acquisition of data, signal averaging, plotting, signal processing, and statistical analysis are among the capabilities of the system. The software is written primarily in the C programming language, and performs best in a UNIX operating system environment. The ERPSS design represents an evolution of many previous programs and data formats, and incorporates suggestions and analysis techniques proposed by many users over a period of more than ten years.
ERPSS data are stored in a standardized format that allows flexible analysis after collection and consistent manipulation by various programs. The user interface is relatively simple and consistent; the programs are written so that they can be easily ported to various machines, and the data are automatically adjusted to accommodate differences in the underlying hardware.
This document discusses the issues involved in selecting the data formats and establishing the programming conventions that allow ERPSS data to be easily ported from one machine to another without any consideration by the user of the state of the data, and includes a guide to writing machine-independent software for the ERPSS system.
The Event-Related Potential Software System (ERPSS) programs and data formats were designed to operate in a UNIX/Linux environment, and are written in C to ease accommodation of different operating system and machine hardware environments. Since various versions of UNIX/Linux differ in a number of respects, great attention has been given to potential differences in the environments. Methods to deal with the underlying hardware differences were also considered in detail.
The result of this design and experimentation is a set of data formats, coding conventions, and library routines that enable the software to function on various machines running different versions of UNIX/Linux. In addition, the data collected using ERPSS programs on one machine can be transported to another machine by way of tape, disk, serial line, or network and still be processed and examined without the user having to consider the particular machine on which the system is operating. This document describes the rationale, considerations, and conventions established to attain the machine independence of the ERPSS.
The ERPSS data formats were selected to optimize three main objectives, these being:
Portability of certain standardized
data files between machines having different processors and operating systems.
This implies that different ordering of bytes in a short word as well as different
ordering of words in a long word must be accommodated. In addition, different
compilers can allocate members of structures differently, and the system calls
can also differ significantly between machines.
Fast execution of native code on the particular hardware being used when
processing standardized ERPSS data files.
User transparency in processing ERPSS data which has been transported from
another machine with possible "swab corruption" of the data file, wherein
the bytes in a short word (16 bits) may or may not have been swapped.
This diagram shows the ordering of bytes in short words, and short words in long words. The critical features are that storing or retrieving data in a sequential byte by byte fashion accesses different parts of a short word on different machines, and that accessing sequential short words in a long word also refers to different parts of the long word. In particular, the 80386 short word format has the least significant byte of a short at the same address as the short itself ("little endian"), while the SPARC format has the most significant byte of a short at the address of the short and the least significant byte at the address of the short plus one ("big endian"). For the long word formats, the SPARC has the most significant short word first in memory storage order, whereas the 80386 stores the least significant short word and then the most significant short in ascending memory order.
As can be seen, native data formats differ between machines. When binary data files consisting of bytes, short words, and long words are interchanged between machines, it is possible that both long and short integers may be interpreted incorrectly. To further complicate the situation, serial transfer of these types of data can cause the bytes in each short word to become reversed due to the way native character pointers access data in memory.
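The layout differences described above can be observed directly on any machine. The following is a minimal sketch, not part of ERPSS, that probes the native ordering at run time. The function names are hypothetical, and the fixed-width types from <stdint.h> are assumed to stand in for the 16-bit short and 32-bit long words discussed here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 when the least significant byte of a
 * 16-bit short sits at the short's own address (80386 style), else 0. */
static int low_byte_first(void)
{
    uint16_t probe = 0x0102;
    unsigned char bytes[sizeof probe];

    memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x02;
}

/* Hypothetical helper: returns 1 when the least significant short of a
 * 32-bit long word comes first in memory (80386 style), else 0. */
static int low_word_first(void)
{
    uint32_t probe = 0x00010002UL;
    uint16_t words[2];

    assert(sizeof words == sizeof probe);
    memcpy(words, &probe, sizeof probe);
    return words[0] == 0x0002;
}
```

On an 80386-class machine both probes should return 1, and on a SPARC both should return 0, matching the two layouts contrasted in the text.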
To allow fast native code processing of the various ERPSS data files and portability of these files between machines, a number of conventions and procedures were established for writing code which creates, reads, writes, or alters these machine-independent data files. A "canonical" data order was selected as a reference for the alterations required in the code for machines which store data in different native formats.
The canonical format chosen was that employed by the Intel family of processors ("little endian"), since that format has a certain intuitive appeal; the ordering of bytes in a short word and short words in a long word is consistent with the storage order of the data in memory.
When software is to be transported to a new machine, it is recompiled with various changes being made in the code as conditioned by certain parameters. Standardized system calls are employed to equalize the capabilities of the different operating systems; these are contained in the libesys.a library. The conventions, procedures, and definitions established to maintain data portability as well as code portability between different machines and operating systems are contained in the following:
The emachdep.h header file is included in every routine in the ERPSS system (linked in at compile time via libesys.a). This file characterizes the environment in which the code will be running. It contains C preprocessor definitions for the operating system which is in use and for various arrangements of data in the native machine formats, as well as other ERPSS parameters and constants. The following tokens are defined based on the type of machine the code is for:
First, the token CBYTEREV is defined if the C compiler
on the machine allocates char data types in a structure definition in
"hi byte then lo byte" of a short word order, opposite the "canonical"
order described above. This is often the case when the BYTEREV token
is defined, as described next.
The BYTEREV token is defined when the native machine format stores bytes
in a short word opposite to the canonical order.
The token WORDREV
is defined if the most significant short word in a long word
precedes the least significant short word, also opposite to the
canonical form.
The token SMALL
is used to reduce the size of data arrays on machines
with severe address space limitations (e.g. PDP11).
These environment tokens are conditionally defined for the machines SUNSPARC, PDP11, IS68K, and VAX.
The token __emachdep_h, which is defined to flag prior
inclusion of emachdep.h.
Tokens which define the number of bytes of 8 bits constituting
char, short, and long integer data types, which must be 1, 2, and 4
respectively. These parameters are checked in the swabopen routine
described below.
Type definitions, dependent upon the operating system and C compiler,
to enable shorthand notation for data types (e.g. u_short for unsigned short)
as well as glossing over various compiler deficiencies and incompatibilities
(e.g. void type unsupported).
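To make the role of these tokens concrete, the fragment below sketches how a header in the style of emachdep.h might be organized. It is an illustration under stated assumptions, not the real file: the mapping from machine token to ordering tokens is inferred from the known byte orders of those machines, defining CBYTEREV alongside BYTEREV is an assumption, and the EX_ size names and the ex_sizes_ok function are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical sketch in the style of emachdep.h; NOT the real file. */
#ifndef __emachdep_h
#define __emachdep_h            /* flags prior inclusion */

/* Assumed mapping from machine token to ordering tokens. */
#if defined(SUNSPARC) || defined(IS68K)
#define BYTEREV                 /* high byte of a short comes first */
#define CBYTEREV                /* assumed to accompany BYTEREV here */
#define WORDREV                 /* high short of a long comes first */
#elif defined(PDP11)
#define WORDREV
#define SMALL                   /* limited address space */
#endif                          /* VAX: already in canonical order */

/* Hypothetical names for the required sizes, in 8-bit bytes. */
#define EX_CHAR_BYTES  1
#define EX_SHORT_BYTES 2
#define EX_LONG_BYTES  4

#endif /* __emachdep_h */

/* Returns 1 when the given sizes match the required 1, 2, and 4. */
static int ex_sizes_ok(size_t char_bytes, size_t short_bytes,
                       size_t long_bytes)
{
    assert(CHAR_BIT == 8);      /* bytes are assumed to be 8 bits wide */
    return char_bytes == EX_CHAR_BYTES
        && short_bytes == EX_SHORT_BYTES
        && long_bytes == EX_LONG_BYTES;
}
```

A caller would pass sizeof(char), sizeof(short), and sizeof(long). Note that on many current 64-bit systems long occupies 8 bytes, so such a check would fail there and a 32-bit type would have to stand in for the long word.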
Also included in every ERPSS file, it contains tokens for:
Byte-reversal indicators BOCANON and BOSWABBED.
These indicators are defined as short
integers, and constitute the first element of any
ERPSS machine-independent data file.
The short integer BOCANON is written when a data file is created. If
the data become swab corrupted in some manner (such as a tar between
two different machines) this indicator will appear as BOSWABBED when
read as a short. This is used by the swabopen
routine and other programs to determine
whether a data file needs to be "swabbed" to put it back into canonical
form (see below).
Magic numbers for different types of ERPSS machine-independent data files.
Each type of file has a different number (a short word) which is used to
identify the type of data file automatically as well as to verify that
appropriate data are being supplied to a program or routine in the ERPSS
system.
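The way the byte-reversal indicator and the magic number work together can be sketched as follows. The numeric values below are placeholders chosen for illustration, since the actual ERPSS values are not given here; all EX_ and ex_ names are hypothetical. The only property relied upon is that the swabbed indicator is the canonical indicator with its two bytes exchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values for illustration only; not the real ERPSS values. */
#define EX_BOCANON   0x1234u
#define EX_BOSWABBED 0x3412u    /* EX_BOCANON with its bytes exchanged */

enum ex_order { EX_CANON, EX_SWABBED, EX_NOT_ERPSS };

/* Exchange the two bytes of a short word. */
static uint16_t ex_swab16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Classify a file by the first short read from it. */
static enum ex_order ex_check_order(uint16_t first_short)
{
    assert(ex_swab16(EX_BOCANON) == EX_BOSWABBED);
    if (first_short == EX_BOCANON)
        return EX_CANON;
    if (first_short == EX_BOSWABBED)
        return EX_SWABBED;
    return EX_NOT_ERPSS;
}

/* Returns 1 when the indicator is recognized and the magic number, after
 * any needed swabbing, identifies the wanted file type. */
static int ex_is_type(uint16_t first_short, uint16_t magic_as_read,
                      uint16_t wanted_magic)
{
    enum ex_order order = ex_check_order(first_short);

    if (order == EX_NOT_ERPSS)
        return 0;
    if (order == EX_SWABBED)
        magic_as_read = ex_swab16(magic_as_read);
    return magic_as_read == wanted_magic;
}
```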
Also included in every ERPSS file, it contains:
Standardized directory pathnames for certain parameter files in the ERPSS
system (e.g. the /usr/erpss/lib/graphicdevs database).
To enable ERPSS routines to employ standardized system calls when operating in various environments, the library "libesys.a" was developed. It contains ERPSS versions of routines to open files, seek to particular places, and the like. The code for each of these calls is conditioned upon the operating system defined in the emachdep.h header file and attempts to emulate a particular function using the specific calls available with each operating system.
The swabopen subroutine is contained in the libu.a ERPSS library and is one of the routines which maintain the canonical format for ERPSS machine-independent data files. It is used to open ERPSS machine-independent files and checks the first short word in the file to determine whether it is BOCANON or BOSWABBED. In combination with the file type argument, either EF_SHORT or EF_CHAR, the data file is corrected if necessary to maintain canonical form on the disk before being read or written by the user program. More details on the operation of swabopen (E3u) can be obtained from its ERPSS manual page as well as from the descriptions of the different types of ERPSS machine-independent data files below.
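The correction step for a file of short words might look roughly like the sketch below. This is not the real swabopen, whose interface is documented in its manual page; ex_make_canonical is a hypothetical stand-in that takes the expected indicator value as a parameter and assumes the canonical order on disk is low byte then high byte.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the correction step; not the real swabopen.
 * fp must be open for update in binary mode. canon is the indicator
 * value expected as the first short, stored low byte then high byte.
 * Returns 0 if already canonical, 1 if swabbed back, -1 on error. */
static int ex_make_canonical(FILE *fp, uint16_t canon)
{
    unsigned char buf[512], tmp;
    unsigned char lo = (unsigned char)(canon & 0xFF);
    unsigned char hi = (unsigned char)(canon >> 8);
    long pos = 0;
    size_t n, i;

    assert(fp != NULL);
    if (fseek(fp, 0L, SEEK_SET) != 0 || fread(buf, 1, 2, fp) != 2)
        return -1;
    if (buf[0] == lo && buf[1] == hi)
        return 0;                       /* already canonical */
    if (buf[0] != hi || buf[1] != lo)
        return -1;                      /* indicator not recognized */

    for (;;) {
        if (fseek(fp, pos, SEEK_SET) != 0)
            return -1;
        n = fread(buf, 1, sizeof buf, fp) & ~(size_t)1; /* whole shorts */
        if (n == 0)
            break;
        for (i = 0; i < n; i += 2) {    /* exchange each byte pair */
            tmp = buf[i];
            buf[i] = buf[i + 1];
            buf[i + 1] = tmp;
        }
        if (fseek(fp, pos, SEEK_SET) != 0 || fwrite(buf, 1, n, fp) != n)
            return -1;
        pos += (long)n;
    }
    return fflush(fp) == 0 ? 1 : -1;
}
```

Rewriting the file in place keeps the disk copy canonical, so later opens find the indicator already correct and return immediately.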
To attain portability, reference data formats were identified and are maintained for data files in the ERPSS system. Three types of data files with different uses are currently employed in the ERPSS which are designed to be portable between machines. These include:
Data files consisting predominantly of short words with headers which can
contain any of char, short, and long data types.
These files are normally stored on disk and are never supplied as input to
a program through a pipe unless the data are generated on the same machine.
This is the format used for most data files in the system.
Data files consisting of sequences of short words which can either arise
from a process on the same machine which supplies the sequence via a pipe,
or as disk data files which can potentially be transferred between different
machines with possible "swab corruption" and then supplied to a program
via a pipe (redirection of input). The "plotfilter" image files are an
example of this type of file.
Data files consisting of sequences of characters which always are created
as disk files. Compressed raster images are an example of this type of
file.
These three types of files are treated slightly differently to overcome the possible swab corruption and differing native data formats. The following sections describe in detail how the files are treated to attain machine independence while allowing reasonably fast native code on various machines.
This first type of ERPSS data file consists of ERP data stored as arrays of short words, along with headers containing descriptive information as well as other binary parameters associated with the data. These headers can contain char, short, or long data types. These ERP data files are always kept in canonical form on disk; the first element is the byte-swap indicator short word. This type of ERPSS file is always opened by the swabopen (E3u) routine with the EF_SHORT argument, which swabs the file to put it in canonical form, if necessary. Even if the ordering of bytes in a short word is correct, the different hardware formats for character and long word variables on different machines imply that further correction may be needed.
Special library routines are always used to read, write, or create these data files and can thus perform the appropriate corrections on the char and long word data types in the header(s) as the data are read from or written to the disk file. The objective is to always maintain the data in canonical form in the disk file, and adjust the header information as the record is read into main memory. This allows native machine code generated by the specific compiler to reference the header data without regard for the type of host machine under which the ERPSS software is running. The subroutines used to read and write the header data are compiled anew on each machine and use the emachdep.h machine type tokens to generate code appropriate to the specific hardware and operating system environment.
Correction of character arrays containing string data in the header(s) is accomplished by swapping bytes for those arrays if the BYTEREV token is defined in the emachdep.h file when the data are read into memory. The same character array swabbing is performed if the data are subsequently written out to disk, thus maintaining the canonical form on disk. This maneuver allows native character pointer incrementation to access successive characters within the arrays. This procedure requires that character array data begin on even memory boundaries and contain even numbers of elements. In addition, the adjusting subroutines are simplified and run faster if all such character arrays are contiguous in the header structure.
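The character array correction just described can be sketched as follows. The function names are hypothetical and this is not the ERPSS library code. The sketch relies on the stated requirement that the arrays are contiguous and have an even total length.

```c
#include <assert.h>
#include <stddef.h>

/* Exchange the two bytes of every short-sized pair in place. */
static void ex_swab_pairs(char *p, size_t nbytes)
{
    size_t i;
    char tmp;

    assert(nbytes % 2 == 0);    /* arrays must have an even length */
    for (i = 0; i + 1 < nbytes; i += 2) {
        tmp = p[i];
        p[i] = p[i + 1];
        p[i + 1] = tmp;
    }
}

/* Applied to the contiguous character arrays of a header both after
 * reading and before writing. Does nothing unless BYTEREV is defined. */
static void ex_adjust_char_arrays(char *arrays, size_t nbytes)
{
#ifdef BYTEREV
    ex_swab_pairs(arrays, nbytes);
#else
    (void)arrays;
    (void)nbytes;
#endif
}
```

Because the same exchange is applied on input and on output, the in-memory copy is natively ordered while the disk copy stays canonical.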
Character type data elements in the header structure which are not part of character arrays need special consideration if the CBYTEREV token is defined. In this case the header structure definition is conditionally dependent upon the CBYTEREV token and reverses the order of definition of these variables, thus maintaining their absolute position in the actual binary header. Again, this requires that there be an even number of such allocations and that they begin on an even boundary.
Long word variables in the headers are treated in a manner similar to character array data, except that the conditional adjustment is controlled by the WORDREV token in the emachdep.h file, and words are swapped in each longword variable. Again, these variables are most easily corrected if they appear contiguously in the headers.
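The long word adjustment can be sketched in the same spirit. The names are hypothetical, and uint32_t is assumed to stand in for the 32-bit long word. Exchanging the two 16-bit halves of the value is equivalent to exchanging the two short words in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Exchange the two short words of a 32-bit long word. */
static uint32_t ex_swap_words(uint32_t v)
{
    return (v << 16) | (v >> 16);
}

/* Applied to the contiguous long words of a header both after reading
 * and before writing. Does nothing unless WORDREV is defined. */
static void ex_adjust_longs(uint32_t *longs, size_t count)
{
    assert(longs != NULL || count == 0);
#ifdef WORDREV
    {
        size_t i;

        for (i = 0; i < count; i++)
            longs[i] = ex_swap_words(longs[i]);
    }
#else
    (void)longs;
    (void)count;
#endif
}
```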
Hence, the conventions for constructing header structures for these types of data files include:
The first element should be a short word which contains the BOCANON
value when the data are first written and created.
The second element should be a short word containing a unique "magic number"
identifying the type of the ERPSS data file.
All long word variables in the header should be segregated and appear
contiguously in the structure.
All character array variables should be segregated and appear contiguously
in the structure, have even number of elements, and begin on short word
boundaries.
All isolated character variables should be conditioned on the CBYTEREV
token and the order of definition reversed if it is defined. There should
be an even number of these variables, preferably appearing contiguously
in the structure.
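A header laid out according to these conventions might look like the sketch below. The structure, its field names, and the array lengths are hypothetical and chosen only to illustrate the rules; fixed-width types are assumed for the short and long words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_SUBJECT_LEN 16       /* even lengths, as the rules require */
#define EX_COMMENT_LEN 32

/* Hypothetical header; field names and sizes are illustrative only. */
struct ex_header {
    uint16_t byte_order;        /* 1st: byte-swap indicator (BOCANON) */
    uint16_t magic;             /* 2nd: file-type magic number */

    uint32_t n_samples;         /* long words kept together */
    uint32_t start_time;

    char subject[EX_SUBJECT_LEN];   /* char arrays kept together */
    char comment[EX_COMMENT_LEN];

#ifdef CBYTEREV                 /* isolated chars: even count, and the */
    char gain_code;             /* order of definition is reversed     */
    char n_channels;
#else
    char n_channels;
    char gain_code;
#endif
};

/* Checks the layout rules that can be verified from offsets alone. */
static void ex_check_header_layout(void)
{
    assert(offsetof(struct ex_header, byte_order) == 0);
    assert(offsetof(struct ex_header, magic) == 2);
    assert(offsetof(struct ex_header, start_time)
           == offsetof(struct ex_header, n_samples) + sizeof(uint32_t));
    assert(offsetof(struct ex_header, subject) % 2 == 0);
    assert(EX_SUBJECT_LEN % 2 == 0 && EX_COMMENT_LEN % 2 == 0);
    assert(offsetof(struct ex_header, comment)
           == offsetof(struct ex_header, subject) + EX_SUBJECT_LEN);
}
```

Grouping each kind of member lets an adjusting routine treat the long words and the character arrays as two contiguous regions instead of visiting fields one at a time.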
If one adheres to these conventions when designing data headers, the subroutines which maintain canonical form on the disk and are always used to read or write the data can be simplified. Their action for this type of data file can be summarized as follows:
                Action of Adjusting Subroutines When
Data type     Read Into Memory            Written to Disk
--------------------------------------------------------------------------
char          (controlled by CBYTEREV)    (controlled by CBYTEREV)
short         (corrected by swabopen)     (no adjustment required)
long          Swap words if WORDREV       Swap words if WORDREV
char array    Swap bytes if BYTEREV       Swap bytes if BYTEREV
Graphic image files stored in virtual form constitute another type of machine-independent ERPSS data file. These can be displayed on different types of graphic devices by supplying them as input to various plot filters, which translate the device-independent commands and virtual (normalized) coordinates to commands and coordinates appropriate for the specific graphic device. They are basically a serial stream of short words, but differ from the ERP data files in that no real header precedes the data, and the same sequence can be piped to a filter process as well as exist as disk-resident files. Since these files can be piped into a filter from another process or via redirection of the standard input of a filter program, it is not possible to guarantee that the data are not swab-corrupted when received by the filter process. Hence, the filter process must be prepared to swap bytes for each short word retrieved from the stream or file.
This situation is handled by starting each such stream with the BOCANON byte swap indicator, a plot-filter image file ERPSS file ID, and a third "process id" when the stream is first written. If the data are captured in a saved disk file and become swab-corrupted, they could be restored to canonical form by opening the file using swabopen (E3u), supplying the data type argument EF_SHORT, and then passing them off to the filter process. However, since these image files can also be redirected into an instance of a filter program, the filters themselves must check the first short in the file and swab all subsequent data if it indicates the data have been swab corrupted (first short is BOSWABBED). Thus, these data files need not exist on disk in canonical form at all times, and programs which create a pipe to a filter process and open multiple saved plot-filter image files should do so using swabopen (E3u) so that all data supplied to the filter process arrives in the same format.
Hence, the filter process needs only to check the initial byte swab indicator and the ERPSS file ID and subsequently swab all data coming in if so indicated. This can be done efficiently by swabbing a whole buffer at a time as the input is read from the pipe. This approach allows the filter programs to use simple macros for retrieving data from the buffer without having to check a flag and possibly swab each short word as it is taken from the buffer.
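The buffer-at-a-time approach can be sketched as follows. The macro and function names are hypothetical and do not come from the ERPSS filters. The sketch assumes the caller has already examined the first short of the stream and recorded whether it was swabbed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro: fetch the next short and advance the cursor. */
#define EX_NEXT_SHORT(cursor) (*(cursor)++)

/* Swab every short in a freshly read buffer, once, if the stream's
 * first short showed it to be swab-corrupted. */
static void ex_fix_buffer(uint16_t *buf, size_t nshorts, int swabbed)
{
    size_t i;

    assert(buf != NULL || nshorts == 0);
    if (!swabbed)
        return;
    for (i = 0; i < nshorts; i++)
        buf[i] = (uint16_t)((buf[i] << 8) | (buf[i] >> 8));
}

/* Example consumer: once the buffer is fixed, the macro needs no flag
 * test per short word. */
static unsigned long ex_sum(const uint16_t *buf, size_t nshorts)
{
    const uint16_t *cursor = buf;
    unsigned long sum = 0;

    while (nshorts-- > 0)
        sum += EX_NEXT_SHORT(cursor);
    return sum;
}
```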
There is one final detail regarding image files that explains the third element, the "process id". Filters are usually invoked by application programs that send data to the filter via the interconnecting pipe. The filters are responsible for handling the hardware device, possibly needing to prevent simultaneous use by two users and/or using a shared resource. To enable the application program to determine whether the filter has been successfully invoked, the application program pauses after invoking the filter, waiting for one of two signals indicating success or failure. In this case, the "process id" that is passed through the pipe is that of the parent, application program, and is non-zero, indicating that these signals should be sent. However, since filters can also receive data from a saved image file via redirection of their standard input streams (in which case no signal can be sent), this "process id" can be set to zero, indicating to the filter program that no signals need be sent.
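On the filter side, the decision described above reduces to a single test of the "process id" element. The sketch below is illustrative only: the document does not name the two signals, so SIGUSR1 and SIGUSR2 are assumed here, and ex_report_startup is a hypothetical helper using the POSIX kill call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical filter-side helper. SIGUSR1 (success) and SIGUSR2
 * (failure) are assumptions; the real signals are not named here.
 * Returns 0 if no signal was wanted, 1 if one was sent, -1 on error. */
static int ex_report_startup(pid_t invoker, int started_ok)
{
    if (invoker == 0)
        return 0;               /* saved image file: nobody is waiting */
    assert(invoker > 0);
    return kill(invoker, started_ok ? SIGUSR1 : SIGUSR2) == 0 ? 1 : -1;
}
```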
Data files consisting primarily of sequences of characters (bytes) are most efficiently handled by starting each such file with the BOCANON byte swab indicator, which is written to the file in canonical order, defined to be low byte then high byte. When such files are opened after possible swab-corruption, it is sufficient to check the order of bytes of the byte-swap indicator as it is read a character at a time. Since a swab-corrupted character file may be in a form appropriate for a machine on which BYTEREV is defined, these files are not always in canonical form on the disk. However, they can always be opened for reading using swabopen (E3u) with the file type argument EF_CHAR. Swabopen will determine whether or not swabbing the file is necessary to achieve the same order of characters out of the file as when they were written into it.
It is possible to intersperse other data types in the character file stream; if they are always written into the file in canonical order (from lowest byte to most significant byte order) and retrieved in the same manner, the other data type will be properly extracted from the stream. Currently, only the compressed raster image files are of this type.