Docs: ERPSS Portability Conventions


ABSTRACT

The Event-Related Potential Software System (ERPSS) is a set of programs and data formats that allow versatile analysis of neurophysiological data. Acquisition of data, signal averaging, plotting, signal processing, and statistical analysis are among the capabilities of the system. The software is written primarily in the C programming language, and performs best in a UNIX operating system environment. The ERPSS design represents an evolution of many previous programs and data formats, and incorporates suggestions and analysis techniques proposed by many users over a period of more than ten years.

ERPSS data are stored in a standardized format that allows flexible analysis after collection and consistent manipulation by various programs. The user interface is relatively simple and consistent; the programs are written so that they can be easily ported to various machines, and the data are automatically adjusted to accommodate differences in the underlying hardware.

This document discusses the issues involved in selecting the data formats and establishing the programming conventions that allow ERPSS data to be easily ported from one machine to another without any consideration by the user of the state of the data, and includes a guide to writing machine-independent software for the ERPSS system.

INTRODUCTION

The Event-Related Potential Software System (ERPSS) programs and data formats were designed to operate in a UNIX/Linux environment, and are written in C to ease accommodation of different operating system and machine hardware environments. Since the various versions of UNIX/Linux differ in a number of respects, great attention has been given to potential differences between these environments. Methods to deal with the underlying hardware differences were also considered in detail.

The result of this design and experimentation is a set of data formats, coding conventions, and library routines that enable the software to function on various machines running different versions of UNIX/Linux. In addition, the data collected using ERPSS programs on one machine can be transported to another machine by way of tape, disk, serial line, or network and still be processed and examined without the user having to consider the particular machine on which the system is operating. This document describes the rationale, considerations, and conventions established to attain the machine independence of the ERPSS.

PORTABILITY CONSIDERATIONS

Goals

The ERPSS data formats were selected to optimize three main objectives, these being:

As an example of the problems involved in overcoming the differences in machine representations of data, consider the following byte orderings for short (16 bit) and long (32 bit) integer data on two machines: the Sun SPARC, and the Intel 80386 family of machines.

This diagram shows the ordering of bytes in short words, and short words in long words. The critical features are that storing or retrieving data in a sequential byte by byte fashion accesses different parts of a short word on different machines, and that accessing sequential short words in a long word also refers to different parts of the long word. In particular, the 80386 short word format has the least significant byte of a short at the same address as the short itself ("little endian"), while the SPARC format has the most significant byte of a short at the address of the short and the least significant byte at the address of the short plus one ("big endian"). For the long word formats, the SPARC has the most significant short word first in memory storage order, whereas the 80386 stores the least significant short word and then the most significant short word in ascending memory order.

As can be seen, native data formats differ between machines. When binary data files consisting of bytes, short words, and long words are interchanged between machines, both long and short integers may be interpreted incorrectly. To further complicate the situation, serial transfer of these types of data can cause the bytes in each short word to become reversed (the file becomes "swab-corrupted") because of the way native character pointers access data in memory.
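The effect can be illustrated with a minimal C sketch. The function names are illustrative and are not part of ERPSS. The same bytes, taken in ascending address order, yield different values depending on whether they are assembled in the 80386 (little endian) or the SPARC (big endian) order.

```c
#include <stdint.h>

/* Assemble a 16-bit short from two bytes in ascending address order,
   least significant byte first (80386 style). */
uint16_t short_from_le(const unsigned char b[2])
{
    return (uint16_t)(b[0] | (b[1] << 8));
}

/* The same two bytes, most significant byte first (SPARC style). */
uint16_t short_from_be(const unsigned char b[2])
{
    return (uint16_t)((b[0] << 8) | b[1]);
}

/* Assemble a 32-bit long, least significant byte first. */
uint32_t long_from_le(const unsigned char b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Assemble a 32-bit long, most significant byte first. */
uint32_t long_from_be(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```

For the byte sequence 0x78 0x56 0x34 0x12, the little endian reading of the long is 0x12345678, while the big endian reading of the same bytes is 0x78563412.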

Canonical Order and Portability Conventions

To allow fast native code processing of the various ERPSS data files, and portability of these files between machines, a number of conventions and procedures were established for writing code that creates, reads, writes, or alters these machine-independent data files. A "canonical" data order was selected as a reference for the alterations required in the code for machines that store data in different native formats.

The canonical format chosen was that employed by the Intel family of processors ("little endian"), since that format has a certain intuitive appeal; the ordering of bytes in a short word and short words in a long word is consistent with the storage order of the data in memory.

When software is to be transported to a new machine, it is recompiled, with changes made in the code as conditioned by certain parameters. Standardized system calls are employed to equalize the capabilities of the different operating systems; these are contained in the libesys.a library. The conventions, procedures, and definitions established to maintain data portability as well as code portability between different machines and operating systems are contained in the following:

The emachdep.h File

This header file is included in every routine in the ERPSS system. It characterizes the environment in which the code will be running. It contains C preprocessor definitions for the operating system in use and for various arrangements of data in the native machine formats, as well as other ERPSS parameters and constants. The following tokens are defined based on the type of machine the code is for:

These environment tokens are conditionally defined for the machines SUNSPARC, PDP11, IS68K, and VAX.

The emagic.h File

Also included in every ERPSS file, it contains tokens for:

The edirs.h File

Also included in every ERPSS file, it contains:

The libesys.a Library

To enable ERPSS routines to employ standardized system calls when operating in various environments, the library "libesys.a" was developed. It contains ERPSS versions of routines to open files, seek to particular places, and the like. The code for each of these calls is conditioned upon the operating system defined in the emachdep.h header file and attempts to emulate a particular function using the specific calls available with each operating system.

The Swabopen Routine

This subroutine is contained in the libu.a ERPSS library and constitutes one of the routines which maintain the canonical format for ERPSS machine-independent data files. It is used to open ERPSS machine-independent files and checks the first short word in the file to determine whether it is BOCANON or BOSWABBED. In combination with the file type argument, either EF_SHORT or EF_CHAR, the data file is corrected if necessary to maintain canonical form on the disk before being read or written by the user program. More details on the operation of swabopen (E3u) can be obtained from its ERPSS manual page as well as from the descriptions of the different types of ERPSS machine-independent data files below.
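The indicator check can be sketched as follows. This is not the swabopen source: the numeric values of BOCANON and BOSWABBED are not given in this document, so the values below are placeholders, and the assumption that BOSWABBED is BOCANON with its two bytes exchanged is an inference from the names. The function name classify_indicator is likewise hypothetical.

```c
#include <stdint.h>

/* Placeholder values, assumed for illustration only. The real constants
   are defined in the ERPSS headers. */
#define BOCANON   0x0102u
#define BOSWABBED 0x0201u

enum swab_state { SWAB_CANON, SWAB_SWABBED, SWAB_UNKNOWN };

/* Classify a file from its first two bytes, given in file order.
   Canonical order is low byte first, then high byte. */
enum swab_state classify_indicator(const unsigned char first[2])
{
    uint16_t value = (uint16_t)(first[0] | (first[1] << 8));

    if (value == BOCANON)
        return SWAB_CANON;      /* bytes are in canonical order */
    if (value == BOSWABBED)
        return SWAB_SWABBED;    /* each byte pair has been exchanged */
    return SWAB_UNKNOWN;        /* not an ERPSS machine-independent file */
}
```

Because the two bytes are assembled explicitly rather than read through a native short pointer, the check gives the same answer on any host.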

ERPSS Machine-Independent Data Files

To attain portability, reference data formats were identified and are maintained for data files in the ERPSS system. Three types of data files with different uses are currently employed in the ERPSS which are designed to be portable between machines. These include:

These three types of files are treated slightly differently to overcome possible swab corruption and differing native data formats. The following sections describe in detail how the files are treated to attain machine independence while allowing reasonably fast native code on various machines.

ERP Data Files

This first type of ERPSS data file consists of ERP data stored as arrays of short words, along with headers containing descriptive information as well as other binary parameters associated with the data. These headers can contain char, short, or long data types. These ERP data files are always kept in canonical form on disk; the first element is a byte-swap indicator short word. This type of ERPSS file is always opened by the swabopen (E3u) routine with the EF_SHORT argument, which swabs the file to put it in canonical form, if necessary. Even if the ordering of bytes in a short word is correct, the different hardware formats for character and long word variables on different machines imply that further correction may be needed.

Special library routines are always used to read, write, or create these data files and can thus perform the appropriate corrections on the char and long word data types in the header(s) as the data are read from or written to the disk file. The objective is to always maintain the data in canonical form in the disk file, and adjust the header information as the record is read into main memory. This allows native machine code generated by the specific compiler to reference the header data without regard for the type of host machine under which the ERPSS software is running. The subroutines used to read and write the header data are compiled anew on each machine and use the emachdep.h machine type tokens to generate code appropriate to the specific hardware and operating system environment.

Correction of character arrays containing string data in the header(s) is accomplished by swapping bytes for those arrays if the BYTEREV token is defined in the emachdep.h file when the data are read into memory. The same character array swabbing is performed if the data are subsequently written out to disk, thus maintaining the canonical form on disk. This maneuver allows native character pointer incrementation to access successive characters within the arrays. This procedure requires that character array data begin on even memory boundaries and contain even numbers of elements. In addition, the adjusting subroutines are simplified and run faster if all such character arrays are contiguous in the header structure.
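A minimal sketch of such a byte swap is given below. The function name is hypothetical; in the ERPSS routines the call would be compiled in only when BYTEREV is defined. The even-length requirement stated above appears here as an explicit check.

```c
#include <stddef.h>

/* Exchange each adjacent pair of bytes in place.
   Returns 0 on success, or -1 if n is odd (the array is left untouched). */
int swab_char_array(char *a, size_t n)
{
    size_t i;

    if (n % 2 != 0)
        return -1;
    for (i = 0; i < n; i += 2) {
        char tmp = a[i];
        a[i] = a[i + 1];
        a[i + 1] = tmp;
    }
    return 0;
}
```

Applying the function twice restores the original contents, which is why the same routine can serve both when the data are read into memory and when they are written back to disk.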

Character type data elements in the header structure which are not part of character arrays need special consideration if the CBYTEREV token is defined. In this case the header structure definition is conditionally dependent upon the CBYTEREV token and reverses the order of definition of these variables, thus maintaining their absolute position in the actual binary header. Again, this requires that there be an even number of such allocations and that they begin on an even boundary.
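The conditional definition might look like the sketch below. The structure and member names are invented for illustration and do not come from the ERPSS headers. The two single-character members form one even-sized pair that starts on an even boundary, and their declaration order is reversed when CBYTEREV is defined.

```c
#include <stddef.h>

/* Hypothetical header fragment, for illustration only. */
struct header_sketch {
#ifdef CBYTEREV
    char second_flag;   /* declared first so that each name still refers */
    char first_flag;    /* to the same byte of the binary header         */
#else
    char first_flag;
    char second_flag;
#endif
    short count;        /* short data need no such treatment */
};
```

Code that refers to first_flag or second_flag by name is then unaffected by the host type, since only the declaration differs.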

Long word variables in the headers are treated in a manner similar to character array data, except that the conditional adjustment is controlled by the WORDREV token in the emachdep.h file, and words are swapped in each longword variable. Again, these variables are most easily corrected if they appear contiguously in the headers.
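Swapping the two short words of a long word can be sketched as below. The function name is hypothetical; in the ERPSS routines the adjustment would apply only when WORDREV is defined.

```c
#include <stdint.h>

/* Exchange the upper and lower 16-bit words of a 32-bit long word. */
uint32_t swap_words(uint32_t v)
{
    return (v << 16) | (v >> 16);
}
```

As with the byte swap, the operation is its own inverse, so one routine covers both the read and the write direction.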

Hence, the conventions for constructing header structures for these types of data files include:

If one adheres to these conventions when designing data headers, the subroutines that maintain canonical form on disk, and that are always used to read or write the data, can be simplified. Their action for this type of data file can be summarized as follows:


                         Action of Adjusting Subroutines When
     Data type      Read Into Memory              Written to Disk
     ----------------------------------------------------------------------
     char           (controlled by CBYTEREV)      (controlled by CBYTEREV)
     short          (corrected by swabopen)       (no adjustment required)
     long           Swap words if WORDREV         Swap words if WORDREV
     char array     Swap bytes if BYTEREV         Swap bytes if BYTEREV


Graphic Image Files

Graphic image files stored in virtual form constitute another type of machine-independent ERPSS data file. These can be displayed on different types of graphic devices by supplying them as input to various plot filters, which translate the device-independent commands and virtual (normalized) coordinates to commands and coordinates appropriate for the specific graphic device. They are basically a serial stream of short words, but differ from the ERP data files in that no real header precedes the data, and the same sequence can be piped to a filter process as well as exist as disk-resident files. Since these files can be piped into a filter from another process or via redirection of the standard input of a filter program, it is not possible to guarantee that the data are not swab-corrupted when received by the filter process. Hence, the filter process must be prepared to swap bytes for each short word retrieved from the stream or file.

This situation is handled by starting each such stream with the BOCANON byte swap indicator, a plot-filter image file ERPSS file ID, and a third "process id" when the stream is first written. If the data are captured in a saved disk file and become swab-corrupted, they could be restored to canonical form by opening the file using swabopen (E3u), supplying the data type argument EF_SHORT, and then passing them off to the filter process. However, since these image files can also be redirected into an instance of a filter program, the filters themselves must check the first short in the file and swab all subsequent data if it indicates the data have been swab corrupted (first short is BOSWABBED). Thus, these data files need not exist on disk in canonical form at all times, and programs which create a pipe to a filter process and open multiple saved plot-filter image files should do so using swabopen (E3u) so that all data supplied to the filter process arrives in the same format.

Hence, the filter process needs only to check the initial byte swab indicator and the ERPSS file ID and subsequently swab all data coming in if so indicated. This can be done efficiently by swabbing a whole buffer at a time as the input is read from the pipe. This approach allows the filter programs to use simple macros for retrieving data from the buffer without having to check a flag and possibly swab each short word as it is taken from the buffer.
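The buffer-at-a-time approach can be sketched as follows. The function and macro names are illustrative and are not taken from the ERPSS filter sources. The buffer is corrected once, after which a simple macro retrieves successive short words with no per-word test.

```c
#include <stddef.h>
#include <stdint.h>

/* Exchange the two bytes of every short word in the buffer, in place.
   A filter would call this once per buffer, and only if the stream's
   byte swap indicator showed the data to be swab-corrupted. */
void swab_buffer(uint16_t *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        buf[i] = (uint16_t)((buf[i] << 8) | (buf[i] >> 8));
}

/* Retrieve the next short word and advance the pointer. */
#define NEXT_SHORT(p) (*(p)++)
```

The cost of the swab is paid once per buffer rather than once per access, which keeps the retrieval macro trivial.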

There is one final detail regarding image files that explains the third element, the "process id". Filters are usually invoked by application programs that send data to the filter via the interconnecting pipe. The filters are responsible for handling the hardware device, possibly needing to prevent simultaneous use by two users and/or using a shared resource. To enable the application program to determine whether the filter has been successfully invoked, the application program pauses after invoking the filter, waiting for one of two signals indicating success or failure. In this case, the "process id" that is passed through the pipe is that of the parent, application program, and is non-zero, indicating that these signals should be sent. However, since filters can also receive data from a saved image file via redirection of their standard input streams (in which case no signal can be sent), this "process id" can be set to zero, indicating to the filter program that no signals need be sent.

Files Consisting of Sequences of Characters

Data files consisting primarily of sequences of characters (bytes) are most efficiently handled by starting each such file with the BOCANON byte swap indicator, which is written to the file in canonical order, defined to be low byte then high byte. When such files are opened after possible swab corruption, it is sufficient to check the order of the bytes of the byte swap indicator as it is read a character at a time. Since a swab-corrupted character file may be in a form appropriate for a machine on which BYTEREV is defined, these files are not always in canonical form on the disk. However, they can always be opened for reading using swabopen (E3u) with the file type argument EF_CHAR. Swabopen will determine whether or not swabbing the file is necessary to achieve the same order of characters out of the file as when they were written into it.

It is possible to intersperse other data types in the character file stream; if they are always written into the file in canonical order (from least significant byte to most significant byte) and retrieved in the same manner, the other data types will be properly extracted from the stream. Currently, only the compressed raster image files are of this type.
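A minimal sketch of this technique for a 32-bit value is shown below. The function names are hypothetical and a memory buffer stands in for the file. Because the value is split into bytes and reassembled with shifts, the result does not depend on the byte order of the host.

```c
#include <stdint.h>

/* Write a 32-bit value into a character stream in canonical order,
   least significant byte first. Returns the position after the value. */
unsigned char *put_long_canon(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
    p[2] = (unsigned char)((v >> 16) & 0xFF);
    p[3] = (unsigned char)((v >> 24) & 0xFF);
    return p + 4;
}

/* Read the value back in the same order. Returns the position after it. */
const unsigned char *get_long_canon(const unsigned char *p, uint32_t *v)
{
    *v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
         ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return p + 4;
}
```

Characters written before and after the value are unaffected, so a stream such as 'A', a long, 'B' is read back as the same three items on any machine.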


© 2005 UCSD ERP Lab