The 1998 Intrusion Detection Off-line Evaluation Plan

MIT Lincoln Laboratory
Information Systems Technology Group
Last Modification: 25 March 1998

1.0 Introduction

The 1998 intrusion detection off-line evaluation is the first of an ongoing series of yearly evaluations conducted by MIT Lincoln Laboratory ("Lincoln") under DARPA ITO and Air Force Research Laboratory sponsorship. These evaluations will contribute significantly to the intrusion detection research field by providing direction for research efforts and calibration of current technical capabilities. They are intended to be of interest to all researchers working on the general problem of workstation and network intrusion detection. The evaluation is designed to be simple, to focus on core technology issues, and to encourage the widest possible participation by eliminating security and privacy concerns and by providing data types that are used by the majority of intrusion detection systems.

Data for this first evaluation will be made available in the spring and summer of 1998 (see attached schedule). The evaluation itself will occur towards the end of the summer. A follow-up meeting for evaluation participants and other interested parties will be held in the fall to discuss research findings.

Participation in the evaluation is solicited for all sites that find the task and the evaluation of interest. For more information, and to register a desire to participate in the evaluation, please send e-mail to INTRUSION@SST.LL.MIT.EDU or call Marc Zissman at (781) 981-7495.

2.0 Technical Objective

Evaluations measure the ability of intrusion detection systems to detect attacks on computer systems and networks. This year's task focuses on UNIX workstations, and the goal is to determine whether any of the following attack events occurred or were attempted during a given network session:

1. Denial of service
2. Unauthorized access from a remote machine
3. Unauthorized access to local superuser privileges by a local unprivileged user
4. Surveillance and probing
5. Anomalous user behavior

Network sessions used for scoring are complete TCP/IP connections which correspond to interactions using many services, including telnet, HTTP, SMTP, FTP, finger, rlogin, and others. This task is posed in the context of normal usage of computers and networks as one might observe on a military base.

The evaluation is designed to foster research progress, with the following four goals:

1. Exploring promising new ideas in intrusion detection.
2. Developing advanced technology incorporating these ideas.
3. Measuring the performance of this technology.
4. Comparing the performance of various newly developed and existing systems in a systematic, careful way.

Previous evaluations of intrusion detection systems have tended to focus exclusively on the probability of detection, without regard to the probability of false alarm. By embedding attack sessions within normal background traffic sessions, the current evaluation will allow us to measure both the detection and false alarm rates simultaneously.

3.0 The Evaluation

Intrusion detection performance will be evaluated by measuring the correctness of detection decisions for an ensemble of sessions which simulate both normal traffic and attacks. Normal sessions will be designed to reflect (statistically) traffic seen on military bases. Sessions with attacks will contain recent attacks and the types of behaviors observed during illegal computer use.
For each session, the intrusion detection system will be required to produce a score, indicating the relative likelihood that an attack occurred during the session. The scores may take on any floating point values (positive, negative, or zero), with the convention that the more positive the score, the more likely an attack occurred. For any given floating point threshold, T, it will be possible to compute the probability of detection (i.e. the number of attack sessions having score greater than T divided by the total number of attack sessions) and the probability of false alarm (i.e. the number of normal sessions having score greater than T divided by the total number of normal sessions). By varying T across the full range of scores output by a system, it will be possible to plot a receiver operating characteristic (ROC) curve, which plots the detection probability versus the false alarm probability. This ROC curve can be used to determine performance for any possible operating point.

ROC curves and statistics generated from these curves will be used to compare alternative approaches to intrusion detection. ROC curves will be generated for different types of attacks and anomalous behavior. ROC curves will also be generated for systems using only BSM data as input, for systems using only tcpdump data as input, and for systems using both types of input data.
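The threshold sweep described above is straightforward to implement. The sketch below is illustrative only; the function name and example score lists are invented here and are not part of the official scoring software. For each candidate threshold T it computes the detection and false alarm probabilities and collects the resulting ROC points.

    # Minimal ROC sketch: attack_scores and normal_scores are hypothetical
    # lists of per-session scores (more positive = more likely an attack).
    def roc_points(attack_scores, normal_scores):
        points = []
        thresholds = sorted(set(attack_scores + normal_scores), reverse=True)
        thresholds.append(float("-inf"))   # final point where every session is flagged
        for t in thresholds:
            p_detect = sum(s > t for s in attack_scores) / float(len(attack_scores))
            p_false_alarm = sum(s > t for s in normal_scores) / float(len(normal_scores))
            points.append((p_false_alarm, p_detect))
        return points

    # Example: six attack sessions and six normal sessions.
    print(roc_points([2.1, 0.4, 3.7, 1.0, 2.8, 0.9],
                     [0.1, -0.5, 0.8, 0.0, 1.2, -1.3]))

Plotting the returned (false alarm probability, detection probability) pairs yields the ROC curve described above.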
4.0 Training Data

Prior to the evaluation, a set of training data will be made available to the participating sites. This data will be used to configure intrusion detection systems and train free parameters. Generally, the types of training data provided will be those that are used by most of today's commercial and research intrusion detection systems.

These data will be generated on a simulation network. Both normal use and attack sessions will be present. Distributions of normal session types and normal session content will be similar to those on military bases. Attack sessions will contain recent attacks and the types of behaviors observed during illegal computer use.

Training data will contain the following elements:

* tcpdump data for roughly one month of network traffic, as collected by a tcpdump packet sniffer. This data contains the contents of every packet transmitted between computers inside and outside a simulated military base. Documentation on how tcpdump was invoked will also be provided.

* A "listfile" for the tcpdump data, indicating the following information for each important network session (an illustrative parsing sketch appears at the end of this section):

  Session ID: a positive integer
  Start Date: in MM/DD/YYYY format
  Start Time: in HH:MM:SS format
  Session Duration: in HH:MM:SS format
  Service identifier: a string, indicating the service type and whether the service is tcp, udp, icmp, or some other non-tcp protocol. The service will end in /u if this is a udp service, /i if this is icmp, and in other letters to represent other non-tcp protocols. Otherwise the service is assumed to be tcp (e.g. exec, finger, ftp, ftp-data, ...). A list of most of the well-known ports and associated services that will be used in our evaluation is available at: http://www.isi.edu/in-notes/iana/assignments/port-numbers
  Source Port: a positive integer, e.g. 1755, 1050
  Destination Port: a positive integer, e.g. 21, 25
  Source IP address: four non-negative integers separated by periods, e.g. 192.168.1.30
  Destination IP address: four non-negative integers separated by periods, e.g. 192.168.1.31
  Attack Score: 0 indicates no attack in this session, 1 indicates an attack in this session
  Attack Name: a string (e.g. "guess", "eject", "anomaly", etc.); "-" indicates no attack

  Listfiles are ASCII files. White space separates the fields. Newlines separate the records. The listfile will only contain information on a subset of the data in the tcpdump file. An example of a tcpdump listfile is shown below:

  11 01/23/1998 16:56:27 00:00:00 ftp-data 20 1770 192.168.0.20 192.168.1.30 0 -
  13 01/23/1998 16:56:36 00:00:03 finger 1772 79 192.168.1.30 192.168.0.20 0 -
  14 01/23/1998 16:56:42 00:00:03 smtp 1778 25 192.168.1.30 192.168.0.20 0 -
  15 01/23/1998 16:56:43 00:00:03 smtp 1783 25 192.168.1.30 192.168.0.20 0 -
  18 01/23/1998 16:56:45 00:00:00 http 1784 80 192.168.1.30 192.168.0.40 1 phf
  20 01/23/1998 16:56:49 00:00:14 ftp 43504 21 192.168.0.40 192.168.1.30 0 -

* Sun Basic Security Module (BSM) audit data from one UNIX Solaris host for some network sessions. This data contains audit information describing system calls made to the Solaris kernel. Raw BSM binary output files are provided along with BSM configuration files and shell scripts used to initialize BSM auditing to record events from processes that implement important TCP/IP services.

* A "listfile" for the BSM data, with the same format as the listfile for the tcpdump data. Again, only a subset of the network sessions captured by BSM will be called out in the listfile.

* A "ps-monitor" file, containing the output of the UNIX process status (ps) command once per minute on the same machine on which BSM auditing was performed.

* UNIX "dump" data, containing weekly epoch dumps and daily incremental dumps for each file system on the machine on which BSM auditing is performed.

* A PostScript block diagram of the simulation network, showing the logical organization of the machines and routers relative to each other.

Sessions will be numbered sequentially, starting with "1". Some sessions may be present only in the tcpdump data, some may be present only in the audit data, and some may be present in both sets of data. Session ID numbers are consistent between the tcpdump and audit data. Not all sessions in the tcpdump data will be included in listfiles, and listfiles should not be used as primary inputs for intrusion detection systems.

The training data will initially be posted on our web site and will then be distributed on multiple CD-ROMs. It is expected that tens of gigabytes of training data may be produced.
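The following sketch, referenced in the listfile description above, shows one way to read a training listfile. It is illustrative only; the field names, function name, and file name are chosen here for clarity and are not defined by the evaluation plan. Each whitespace-separated record is split into the eleven fields described above.

    # Illustrative listfile reader (hypothetical names, not distributed software).
    def read_listfile(path):
        fields = ("session_id", "start_date", "start_time", "duration",
                  "service", "src_port", "dst_port", "src_ip", "dst_ip",
                  "attack_score", "attack_name")
        records = []
        with open(path) as f:
            for line in f:
                parts = line.split()           # white space separates the fields
                if len(parts) != len(fields):  # skip blank or malformed lines
                    continue
                records.append(dict(zip(fields, parts)))
        return records

    # Example: list the sessions labeled as attacks in a training listfile.
    for r in read_listfile("tcpdump.list"):
        if r["attack_score"] == "1":
            print(r["session_id"], r["service"], r["attack_name"])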
5.0 Development Test Data

Development test data is used to evaluate the performance of alternative intrusion detection systems, trained using the training data, prior to the final official test. Sites can train systems using the training data and perform preliminary tests using pre-specified development test data to select system settings for the final test that provide good performance. Use of a common set of development test data (instead of having each site perform separate cross-validation splits of the training data) makes it possible to compare alternative approaches across sites.

In general, development test data will be generated in a manner similar to the training data. The formats of the various data elements will be identical to the training data, with the exception that the attack score and attack name fields of the listfiles will be empty. However, an answer key will be distributed along with the development test data that describes ground truth for that data set. The answer key will be a listfile with three columns: session ID, score (0 for normal, 1 for attack), and attack name.

For the 1998 evaluation, we will define a split of the training data into a training component and a development-test component. For example, if the training data contains seven weeks of data, the development-test data may be the final week of data. We would then recommend that sites train on the first six weeks of training data and evaluate performance using the final week of the training data. When this type of split is defined, no separate answer key will be provided for the development-test data, because this information is already in the listfile provided with the training data.

6.0 Evaluation Test Data

Evaluation test data, or simply test data, is the final set of data used to test the performance of each intrusion detection system being evaluated. Evaluation test data will be generated in a manner similar to the training and development test data. The formats of the various data elements will be identical to the development test data, except that the answer key will not be distributed until the evaluation is complete. There will be attack types in the evaluation test data that are not present in either the training data or the development test data.

7.0 Anomaly Detection

Some intrusion detection systems are designed specifically to detect anomalous user, system, and network behavior. We will insert such anomalous behavior in the test and training data to evaluate these systems. General consistency concerning user, system, and network behavior will be maintained among the training, development test, and evaluation test sets. The same users and network configuration will be used across the three data sets, with a few exceptions to mimic the normal addition and deletion of users and services. In addition, the data will be continuous in time, with the test data following the training data in time. A time-adaptive anomaly detection system can thus be trained on the training data and then correctly tested on the test data without introducing artifacts.

8.0 Evaluation Rules

Sites may submit up to three official results files: one corresponding to the sessions listed in the tcpdump listfile, one corresponding to the sessions listed in the BSM listfile, and one corresponding to the union of sessions listed in the two listfiles. Although we encourage submission of all three results files, we realize that some sites will be able to submit only a subset of the three.

It is permissible for a single site to evaluate multiple systems. For example, a site may submit three results files for system A and three results files for system B. In this case, however, the submitting site must identify one system as the "primary" system prior to performing the evaluation.

For any of the three possible results files (tcpdump, BSM, tcpdump+BSM) that a site chooses to submit, the site is required to submit a scaled attack likelihood for each network session in the corresponding listfile. If a participating site does not submit a complete set of results for that listfile, Lincoln will not report any results for that listfile. For example, if there are 2000 network sessions listed in the tcpdump listfile, and if a site chooses to submit a results file for the tcpdump listfile, then it must produce and submit scores for all 2000 network sessions in that listfile.
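Because an incomplete results file for a listfile will not be reported, a site may wish to verify coverage before submission. The sketch below is only a suggested check, with hypothetical file names and helper function: it confirms that a results file contains a score for every session ID called out in the corresponding listfile (in both file types, the session ID is the first whitespace-separated field of each record).

    # Hypothetical coverage check prior to submission.
    def first_fields(path):
        # Collect the first whitespace-separated field (the session ID) of each record.
        with open(path) as f:
            return set(line.split()[0] for line in f if line.strip())

    def check_coverage(listfile_path, results_path):
        required = first_fields(listfile_path)   # session IDs in the listfile
        scored = first_fields(results_path)      # session IDs that were scored
        missing = required - scored
        if missing:
            print("missing scores for %d sessions" % len(missing))
        return not missing

    check_coverage("tcpdump.list", "results_tcpdump.txt")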
The following evaluation rules and restrictions must be observed by all participants:

* Each decision is to be based only upon the specified network session and any network sessions that have already occurred. Use of information about test sessions occurring subsequent to the given session is not allowed. The intrusion detection systems must be causal.

* Knowledge of the training conditions (implied by the data set directory structure and other network information provided) is allowed.

* Examining the evaluation test data, or any other experimental interaction with this data, is not allowed before all test results have been submitted. This applies to all evaluation test data, whether part of an evaluated session or not.

9.0 Format for Submission of Results

Sites participating in the evaluation must report test results for all sessions. These results must be provided to Lincoln in results files using a standard ASCII record format, with one record for each decision. Each record will have three fields separated by white space. The first field is the session identifier assigned by Lincoln. The second field is the floating point score, indicating the scaled likelihood that a given session contained an attack. The third (optional) field is the name of the attack. Records are to be separated by newline characters. Results files will be deposited on a Lincoln external ftp site prior to the result submission deadline.

10.0 System Description

The name and a brief description of the system (the algorithms) used to produce the results must be submitted along with the results, for each system evaluated.

11.0 Execution Time

Sites must report the CPU execution time that was required to process the test data, as if the test were run on a single CPU. Sites must also describe the CPU and the amount of memory used.
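As an illustration of the results file format specified in Section 9.0, the sketch below writes one whitespace-separated record per decision: the Lincoln-assigned session ID, a floating point score, and an optional attack name. The session IDs, scores, and output file name are invented for the example and do not represent real system output.

    # Illustrative writer for the Section 9.0 results file format.
    # Each record: <session ID> <floating point score> [<attack name>]
    decisions = [
        (11, -1.4, None),   # low score: session judged likely normal
        (18, 3.2, "phf"),   # high score: likely attack, with an attack name
        (20, 0.1, None),
    ]

    with open("results_tcpdump.txt", "w") as out:
        for session_id, score, attack_name in decisions:
            record = "%d %f" % (session_id, score)
            if attack_name:                 # third field is optional
                record += " " + attack_name
            out.write(record + "\n")        # newline separates the records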