Docs: ERPSS System Tour - Page 2

Data Files and Formats

While there are many different types and variations of data files in the ERPSS system, there are three with which one will almost certainly have to deal in the course of acquiring and analyzing any ERP data using the ERPSS programs - raw files, log files, and "averaged" or erpss data files. These three types of data files are mentioned extensively in the following expositions; each stores data in "binary" form by default, and each has a special program that allows one to examine and possibly edit that type of data file.

One might, at this point, wonder why the ERPSS design employs binary data files instead of ASCII files, given that binary data are difficult to transport between different machine environments, difficult to edit, and impossible to view with standard utilities. The two main answers are storage space and speed; storing data in ASCII would certainly more than double the storage requirements for data files, and converting to and from ASCII between each step of processing could be unbearably slow.
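The storage argument can be made concrete with a rough sketch. It assumes, purely for illustration, that each sample is a 16-bit signed integer; the text does not state the sample width ERPSS actually uses, and the sample values below are invented.

```python
import struct

# Hypothetical sample values; 16-bit signed storage is an assumption.
samples = [-2048, -1317, 1000, -45, 1999, 2047]

# Binary form: two bytes per sample, little-endian.
binary = struct.pack("<%dh" % len(samples), *samples)

# ASCII form: decimal digits plus one separator between samples.
ascii_text = " ".join(str(s) for s in samples).encode("ascii")

binary_size = len(binary)
ascii_size = len(ascii_text)
ratio = ascii_size / binary_size

# Reading the binary form back is a single unpack, with no text parsing.
restored = list(struct.unpack("<%dh" % len(samples), binary))
```

For these six values the binary form occupies 12 bytes and the ASCII form 30, which is consistent with the "more than double" estimate above. The exact ratio depends on the magnitude of the values and on the separator chosen.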

In any case, the portability problem for these binary files has been solved in the ERPSS environment, and the special programs necessary to edit and view the various files are available. In addition, keeping the data in binary form assists in maintaining the integrity of the data files, as only certain programs and routines can directly manipulate the data.

The three essential types of ERPSS data files arise from the process of collecting and analyzing ERP data. When one digitizes data during an experiment, two files are produced - a raw file containing all the EEG, event, and response information, and a log file, which contains a summary of the event occurrences and their times in the associated raw file. Although the raw file contains all the information that is recorded when an experiment is conducted, it can be quite large, making it slow to read and impractical to store on disk for long periods. The log file, being only a summary, is thus useful for assigning various events to different categories in preparation for averaging the EEG data from the raw file, as well as for certain behavioral analyses that are independent of the EEG data. The result of averaging the raw file is an averaged data file, or simply erpss data file. Such files can also arise from various manipulations of other data files. For example, one may summarize data by deleting certain channels and/or averages from the original data file, form difference waveforms, lump across conditions, or form "grand" averages across subjects. Each of these types of operations on erpss data files results in a new erpss data file.
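Two of the manipulations mentioned above, difference waveforms and "grand" averages, can be sketched on plain lists of numbers. This is only an illustration of the arithmetic, not of any ERPSS program: the function names are hypothetical, and the grand average is assumed here to be an unweighted point-by-point mean across subjects, which the text does not specify.

```python
def difference_wave(a, b):
    """Point-by-point difference a - b of two averaged waveforms."""
    if len(a) != len(b):
        raise ValueError("waveforms must have the same number of points")
    return [x - y for x, y in zip(a, b)]


def grand_average(waveforms):
    """Unweighted point-by-point mean across subjects (an assumption)."""
    if not waveforms:
        raise ValueError("at least one waveform is required")
    n_points = len(waveforms[0])
    if any(len(w) != n_points for w in waveforms):
        raise ValueError("waveforms must have the same number of points")
    return [sum(column) / len(waveforms) for column in zip(*waveforms)]


# Invented values for one channel in two conditions and two subjects.
attended = [0.0, 2.0, 4.0, 2.0]
unattended = [0.0, 1.0, 1.0, 1.0]
attention_effect = difference_wave(attended, unattended)

subject_1 = [0.0, 2.0, 4.0, 2.0]
subject_2 = [2.0, 4.0, 6.0, 4.0]
across_subjects = grand_average([subject_1, subject_2])
```

In both cases the result has the same shape as its inputs, which is why each such operation can be stored as a new data file of the same type.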

Raw Files

The ERPSS approach to recording data is to continuously sample one or more channels of EEG data at a predetermined sampling rate and store these data, along with information about the occurrence of various experimental events, for subsequent analysis. This approach has a number of advantages, including the ability to re-average data as many times as desired with various epoch lengths, (digitally altered) sampling rates, artifact detection, correction, or rejection criteria, and event classifications. It does, however, entail more data storage than might be required for other approaches when relevant experimental events are few and far between, and additional processing steps are required. Nonetheless, most users agree that the advantages far outweigh the disadvantages in a research environment.

Raw files are typically generated by the operation of the program Acquisition on PCs running a multi-threaded OS. After the data are collected, they are transferred to UNIX machines for analysis. Raw files consist of a "header", an initial record containing various descriptions and recording parameters, followed by multiple data records. Although the data records following the header are separate, together they hold the consecutive samples of one continuous recording. Each data record, in addition to the actual EEG data, has a short prefix containing information about the various events that occurred during the time period covered by the record. To achieve maximum performance, only those tasks that are essential during the digitization of data are performed during that phase. As a result, data are stored in multiplexed form, in the order in which they are retrieved from the analog-to-digital converter, and the various channels must be separated during subsequent processing of the raw data. All programs that employ raw data files invoke standardized routines to demultiplex the data channels and extract discrete segments (or epochs) from multiple records of the continuous representation in the raw file.
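The demultiplexing and epoch extraction described above can be sketched as follows. The sketch assumes that "multiplexed" means the samples are interleaved channel by channel at each time point (channel 0, channel 1, ..., then the next time point), and it ignores the header and the event prefix of each record. The actual raw file layout is not given in this text, and the function names are hypothetical.

```python
def demultiplex(records, n_channels):
    """Join the data records and split the interleaved samples by channel."""
    interleaved = [value for record in records for value in record]
    if len(interleaved) % n_channels != 0:
        raise ValueError("sample count is not a multiple of the channel count")
    # Every n_channels-th value, starting at offset ch, belongs to channel ch.
    return [interleaved[ch::n_channels] for ch in range(n_channels)]


def extract_epoch(channels, event_sample, pre, post):
    """Return samples [event_sample - pre, event_sample + post) per channel."""
    start = event_sample - pre
    stop = event_sample + post
    if start < 0 or stop > len(channels[0]):
        raise ValueError("epoch extends beyond the recorded data")
    return [channel[start:stop] for channel in channels]


# Two invented records, two channels, two time points per record.
records = [[0, 100, 1, 101], [2, 102, 3, 103]]
channels = demultiplex(records, n_channels=2)

# This epoch spans the boundary between the two records.
epoch = extract_epoch(channels, event_sample=2, pre=1, post=1)
```

Because the records are joined before the channels are separated, an epoch can straddle a record boundary without any special handling.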

The program rawfile (E1) is used to examine and graphically display raw data files. One can peruse the data quite flexibly, extracting data segments of arbitrary length for display. One can specify that the data be digitally re-sampled at a lower effective rate than that at which the data were recorded and/or be filtered with a wide range of digital filters. Data around arbitrary sample points can be selected for display, or one can search for specific events. One main use of rawfile is to determine criteria for artifact detection and rejection that will be used when the data are subsequently averaged. Rawfile can also be used just to print various parameters contained in the header of a raw file, to help identify an unknown raw file. Below is a sample of the graphical display produced by rawfile.

Log Files

When data are recorded by Acquisition, a log file associated with the raw file may also be produced. This file summarizes the events in the corresponding raw file and their times of occurrence, and also contains a condition code and a set of eight flags for each event. These four elements (the event, its time, the corresponding condition code, and the flags) constitute a log entry or item in a log file. Log files also include an initial header containing some of the descriptions and parameters specified during digitization and creation of the file. Event identities are represented by numerical codes called "event codes" that denote stimuli, responses by the subject, or "meta" events such as pauses in the collection of data (pause marks) or requests to delete stretches of data (delete marks), etc. Each time is simply the sample at which the event occurred. The condition code is really just an extension of the event code. It is useful when one presents the same stimuli with the same event codes to a subject, but the state or condition of the subject has changed (e.g. direction of attention). Of the eight flags associated with each log item, four are pre-assigned and denote that the entry is "deleted", was rejected because the data were artifactual, could not be recovered from the raw file due to data errors, or was not properly recorded due to technical malfunction. The four remaining flags can be used as desired by the experimenter to record other arbitrary information.
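A log item as described above can be modelled with a small data structure. This is a sketch only: the text says that four of the eight flags are pre-assigned and four are free, but it does not say which bit carries which meaning, so the bit positions and all names below are hypothetical.

```python
from dataclasses import dataclass
from enum import IntFlag


class LogFlag(IntFlag):
    # Bit positions are assumptions; only the four meanings come from the text.
    DELETED = 0x01
    ARTIFACT = 0x02
    DATA_ERROR = 0x04
    MALFUNCTION = 0x08
    # Four flags left to the experimenter.
    USER_1 = 0x10
    USER_2 = 0x20
    USER_3 = 0x40
    USER_4 = 0x80


PREASSIGNED = (LogFlag.DELETED | LogFlag.ARTIFACT
               | LogFlag.DATA_ERROR | LogFlag.MALFUNCTION)


@dataclass
class LogItem:
    event_code: int
    sample: int            # time of the event, as a sample number
    condition_code: int
    flags: LogFlag = LogFlag(0)

    def is_usable(self):
        """True when none of the four pre-assigned flags is set."""
        return not (self.flags & PREASSIGNED)


def format_flags(flags):
    """Render the flags as two hexadecimal digits."""
    return "0X%02X" % int(flags)


clean = LogItem(event_code=31, sample=419, condition_code=1)
rejected = LogItem(event_code=31, sample=419, condition_code=1,
                   flags=LogFlag.ARTIFACT | LogFlag.USER_1)
```

Keeping the user flags separate from the pre-assigned ones means that an experimenter's own annotations never make an item look deleted or rejected.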

The program logfile (E1) is used to examine and (optionally) edit log files. It prints lists of log items, summarizes event occurrences, and finds sections of the log file that are delimited by changes in condition codes or by pause and/or delete marks. It is possible to search for specific event codes or for event codes occurring in quite arbitrary contexts; the log event selector system can be employed for this purpose (see les (E4)). Logfile can also be useful for correcting mistakes (such as an incorrect condition code) that were made during data acquisition. Below is a sample of the output of logfile:


  *******************  Header Information for Log File: log  *******************

  Sampling rate (Hz): 250.00      Creation time: Thu Oct 24 14:15:12 1996

  Flags (hex):        0x0000      Dig. Prog. vers.:    1  Time comp. factor:   0
  # Dig. Chans:           32      Raw rec. size pts: 128  Raw rec. log base 2: 7
  
  Sub. Desc: PW 10-24-96
  Exp. Desc: SWITCH2 -EXPERIMENT 
  Lab Name:  STM-VER Lab
  Host Name: ISLAND
  File Desc: Raw Data
  
  *****************************  66731 total items  *****************************
  
  Type h(elp) for list of implemented commands
  
  logfile: l 0 10
    Item   Event   Cond.  Flags  Sample Time  Prev. IEI
   Number   Code   Code   (Hex)   (seconds)     (msec)
        0    57       1    0X00        0.088     88.000
        1    65       1    0X00        0.092      4.000
        2     1       1    0X00        0.168     76.000
        3     4       1    0X00        1.168   1000.000
        4    31       1    0X00        1.676    508.000
        5    34       1    0X00        1.680      4.000
        6   603       1    0X00        1.736     56.000
        7    34       1    0X00        1.756     20.000
        8    31       1    0X00        1.792     36.000
        9   289       1    0X00        1.820     28.000
       10    34       1    0X00        1.840     20.000
  logfile: 
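The two time columns in the listing above follow from the sample numbers and the sampling rate shown in the header. The sketch below reproduces the first four rows. The sample numbers are back-computed from the displayed times at 250 Hz rather than read from a log file, the function names are hypothetical, and treating the first interval as measured from sample 0 is an assumption based on the first row.

```python
def sample_to_seconds(sample, rate_hz):
    """Convert a sample number to seconds from the start of recording."""
    return sample / rate_hz


def inter_event_intervals_ms(samples, rate_hz):
    """Interval in msec between each event and the one before it.

    The first interval is measured from sample 0 (an assumption).
    """
    intervals = []
    previous = 0
    for sample in samples:
        intervals.append((sample - previous) * 1000.0 / rate_hz)
        previous = sample
    return intervals


RATE_HZ = 250.0
# Back-computed from the displayed times of items 0 to 3.
samples = [22, 23, 42, 292]

times = [sample_to_seconds(s, RATE_HZ) for s in samples]
intervals = inter_event_intervals_ms(samples, RATE_HZ)
```

At 250 Hz one sample lasts 4 msec, which is why every value in both columns is a multiple of 4 msec.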

Logfile can also be used in conjunction with binlist (E4) files that are generated using the corresponding log file. Binlist files are discussed later, but suffice it to say that it is often useful to simultaneously display log file entries and the associated categories or "bins" into which the data might be averaged.

ERP Data Files

Data files constitute the primary format in which averaged ERP data are stored in the ERP software system. They are binary files having an initial header, termed the reference header, followed by a set of bins. The reference header contains information that applies to the entire data file, such as the number of channels, their descriptions, the subject description, etc. A bin refers to a set of data that are generated and/or processed as a unit, and consists of a bin header along with the data set itself. For example, during averaging, a bin results from the processing of all trials assigned to be averaged together by the binlist file. A bin always consists of a set of averaged ERPs, one for each channel, that were formed by a common set of operations (e.g., the average of a set of raw EEG epochs or the sum of a set of bins from a previous data file). In the typical case, a bin contains the averaged data that reflect the response to one stimulus type in one condition. There is no built-in limit to the number of bins in a data file, but it is often convenient to group data into separate files according to analysis and plotting requirements. Bins are numbered consecutively, starting at 0 in each data file. Bin 0 is reserved for calibration pulses in the initial data files created during the averaging process.

While all bins in a data file must have the same number of channels, there are very few other constraints on the data parameters across bins in a particular data file. That is, each bin header specifies its own number of sampling points, sampling period, number of sums, bin description, etc. Bins are also subdivided into processing functions. Processing functions are ways in which the data have been processed; the "functions" half of the term is perhaps a red herring, and simply refers to the algorithm used to generate that subset of the data. There must always be at least one processing function, the average, but others can be present. Currently, noise estimates and standard errors can be calculated. All channels are present for each processing function.
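The organization described above (a reference header, numbered bins, and processing functions within each bin) can be sketched as a set of nested records. This models only the relationships stated in the text; it is not the on-disk binary layout, and every class, field, and function name is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BinHeader:
    description: str
    n_points: int          # number of sampling points in this bin
    sampling_period: int   # units are not specified in the text
    n_sums: int            # number of sums that went into the bin


@dataclass
class Bin:
    header: BinHeader
    # Processing function name -> one waveform per channel.
    functions: dict


@dataclass
class DataFile:
    # Reference header: information that applies to the whole file.
    n_channels: int
    channel_descriptions: list
    subject_description: str
    bins: list = field(default_factory=list)

    def add_bin(self, new_bin):
        """Append a bin after checking the stated constraints.

        Returns the bin number; bins are numbered from 0.
        """
        if "average" not in new_bin.functions:
            raise ValueError("every bin needs an 'average' processing function")
        for name, waveforms in new_bin.functions.items():
            if len(waveforms) != self.n_channels:
                raise ValueError("%s: all channels must be present" % name)
            if any(len(w) != new_bin.header.n_points for w in waveforms):
                raise ValueError("%s: wrong number of points" % name)
        self.bins.append(new_bin)
        return len(self.bins) - 1


data_file = DataFile(2, ["Fz", "Cz"], "example subject")
calibration = Bin(BinHeader("calibration pulses", 3, 4, 10),
                  {"average": [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]})
standard = Bin(BinHeader("standard tones", 2, 8, 120),
               {"average": [[0.5, 0.7], [0.4, 0.6]],
                "standard error": [[0.1, 0.1], [0.1, 0.1]]})
calibration_number = data_file.add_bin(calibration)
standard_number = data_file.add_bin(standard)
```

The two example bins deliberately differ in their number of points and sampling period: only the channel count is shared across the file, while the other parameters belong to each bin header.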

The ERPSS program datafile (E1) can be used to examine and (optionally) alter the information in the reference header and the various bin headers for an ERPSS data file. The various entries in these headers have mnemonic descriptions, termed header keywords, which are used by both datafile and other ERPSS programs which access these header variables. These keywords (along with a brief description of the function of the associated variable) can be obtained by running edkeyword (E1).

One should be careful when altering data file headers, as many ERPSS programs rely on this information, and a mangled data file may be impossible to recover.