The 1998 Intrusion Detection Off-line Evaluation Plan

MIT Lincoln Laboratory
Information Systems Technology Group
Last Modification: 25 March 1998

1.0 Introduction

The 1998 intrusion detection off-line evaluation is the first of an ongoing series of yearly evaluations conducted by MIT Lincoln Laboratory ("Lincoln") under DARPA ITO and Air Force Research Laboratory sponsorship. These evaluations will contribute significantly to the intrusion detection research field by providing direction for research efforts and calibration of current technical capabilities. They are intended to be of interest to all researchers working on the general problem of workstation and network intrusion detection. The evaluation is designed to be simple, to focus on core technology issues, and to encourage the widest possible participation by eliminating security and privacy concerns and by providing data types that are used by the majority of intrusion detection systems.

Data for this first evaluation will be made available in the spring and summer of 1998 (see attached schedule). The evaluation itself will occur towards the end of the summer. A follow-up meeting for evaluation participants and other interested parties will be held in the fall to discuss research findings.

Participation in the evaluation is solicited for all sites that find the task and the evaluation of interest. For more information, and to register a desire to participate in the evaluation, please send e-mail to INTRUSION@SST.LL.MIT.EDU or call Marc Zissman at (781) 981-7495.

2.0 Technical Objective

Evaluations measure the ability of intrusion detection systems to detect attacks on computer systems and networks. This year's task focuses on UNIX workstations, and the goal is to determine whether any of the following attack events occurred or were attempted during a given network session:

1. Denial of service
2. Unauthorized access from a remote machine
3. Unauthorized access to local superuser privileges by a local unprivileged user
4. Surveillance and probing
5. Anomalous user behavior

Network sessions used for scoring are complete TCP/IP connections which correspond to interactions using many services, including telnet, HTTP, SMTP, FTP, finger, rlogin, and others. This task is posed in the context of normal usage of computers and networks as one might observe on a military base.

The evaluation is designed to foster research progress, with the following four goals:

1. Exploring promising new ideas in intrusion detection.
2. Developing advanced technology incorporating these ideas.
3. Measuring the performance of this technology.
4. Comparing the performance of various newly developed and existing systems in a systematic, careful way.

Previous evaluations of intrusion detection systems have tended to focus exclusively on the probability of detection, without regard to the probability of false alarm. By embedding attack sessions within normal background traffic sessions, the current evaluation will allow us to measure both the detection and false alarm rates simultaneously.

3.0 The Evaluation

Intrusion detection performance will be evaluated by measuring the correctness of detection decisions for an ensemble of sessions which simulate both normal traffic and attacks. Normal sessions will be designed to reflect (statistically) traffic seen on military bases. Sessions with attacks will contain recent attacks and the types of behaviors observed during illegal computer use.
For each session, the intrusion detection system will be required to produce a score, indicating the relative likelihood that an attack occurred during the session. The scores may take on any floating point values (positive, negative, or zero), with the convention that the more positive the score, the more likely an attack occurred. For any given floating point threshold, T, it will be possible to compute the probability of detection (i.e. the number of attack sessions having score greater than T divided by the total number of attack sessions) and the probability of false alarm (i.e. the number of normal sessions having score greater than T divided by the total number of normal sessions). By varying T across the full range of scores output by a system, it will be possible to plot a receiver operating characteristic (ROC) curve, which plots the detection probability versus the false alarm probability. This ROC curve can be used to determine performance for any possible operating point.

ROC curves and statistics generated from these curves will be used to compare alternative approaches to intrusion detection. ROC curves will be generated for different types of attacks and anomalous behavior. ROC curves will also be generated for systems using only BSM data as input, for systems using only tcpdump data as input, and for systems using both types of input data.
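The threshold sweep described above is straightforward to implement. The sketch below is illustrative only; the function name and example score lists are invented here and are not part of the official scoring software. For each candidate threshold T it computes the detection and false alarm probabilities and collects the resulting ROC points.

    # Minimal ROC sketch: attack_scores and normal_scores are hypothetical
    # lists of per-session scores (more positive = more likely an attack).
    def roc_points(attack_scores, normal_scores):
        points = []
        thresholds = sorted(set(attack_scores + normal_scores), reverse=True)
        thresholds.append(float("-inf"))   # final point where every session is flagged
        for t in thresholds:
            p_detect = sum(s > t for s in attack_scores) / float(len(attack_scores))
            p_false_alarm = sum(s > t for s in normal_scores) / float(len(normal_scores))
            points.append((p_false_alarm, p_detect))
        return points

    # Example: six attack sessions and six normal sessions.
    print(roc_points([2.1, 0.4, 3.7, 1.0, 2.8, 0.9],
                     [0.1, -0.5, 0.8, 0.0, 1.2, -1.3]))

Plotting the returned (false alarm probability, detection probability) pairs yields the ROC curve described above.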
4.0 Training Data

Prior to the evaluation, a set of training data will be made available to the participating sites. This data will be used to configure intrusion detection systems and train free parameters. Generally, the types of training data provided will be those that are used by most of today's commercial and research intrusion detection systems.

These data will be generated on a simulation network. Both normal use and attack sessions will be present. Distributions of normal session types and normal session content will be similar to those on military bases. Attack sessions will contain recent attacks and the types of behaviors observed during illegal computer use.

Training data will contain the following elements:

* tcpdump data for roughly one month of network traffic, as collected by a tcpdump packet sniffer. This data contains the contents of every packet transmitted between computers inside and outside a simulated military base. Documentation on how tcpdump was invoked will also be provided.

* A "listfile" for the tcpdump data, indicating the following information for each important network session (an illustrative parsing sketch appears at the end of this section):

  Session ID: a positive integer
  Start Date: in MM/DD/YYYY format
  Start Time: in HH:MM:SS format
  Session Duration: in HH:MM:SS format
  Service identifier: a string, indicating the service type and whether the service is tcp, udp, icmp, or some other non-tcp protocol. The service will end in /u if this is a udp service, /i if this is icmp, and in other letters to represent other non-tcp protocols. Otherwise the service is assumed to be tcp (e.g. exec, finger, ftp, ftp-data, ...). A list of most of the well-known ports and associated services that will be used in our evaluation is available at: http://www.isi.edu/in-notes/iana/assignments/port-numbers
  Source Port: a positive integer, e.g. 1755, 1050
  Destination Port: a positive integer, e.g. 21, 25
  Source IP address: four non-negative integers separated by periods, e.g. 192.168.1.30
  Destination IP address: four non-negative integers separated by periods, e.g. 192.168.1.31
  Attack Score: 0 indicates no attack in this session, 1 indicates an attack in this session
  Attack Name: a string (e.g. "guess", "eject", "anomaly", etc.); "-" indicates no attack

  Listfiles are ASCII files. White space separates the fields. Newlines separate the records. The listfile will only contain information on a subset of the data in the tcpdump file. An example of a tcpdump listfile is shown below:

  11 01/23/1998 16:56:27 00:00:00 ftp-data 20 1770 192.168.0.20 192.168.1.30 0 -
  13 01/23/1998 16:56:36 00:00:03 finger 1772 79 192.168.1.30 192.168.0.20 0 -
  14 01/23/1998 16:56:42 00:00:03 smtp 1778 25 192.168.1.30 192.168.0.20 0 -
  15 01/23/1998 16:56:43 00:00:03 smtp 1783 25 192.168.1.30 192.168.0.20 0 -
  18 01/23/1998 16:56:45 00:00:00 http 1784 80 192.168.1.30 192.168.0.40 1 phf
  20 01/23/1998 16:56:49 00:00:14 ftp 43504 21 192.168.0.40 192.168.1.30 0 -

* Sun Basic Security Module (BSM) audit data from one UNIX Solaris host for some network sessions. This data contains audit information describing system calls made to the Solaris kernel. Raw BSM binary output files are provided along with BSM configuration files and shell scripts used to initialize BSM auditing to record events from processes that implement important TCP/IP services.

* A "listfile" for the BSM data, with the same format as the listfile for the tcpdump data. Again, only a subset of the network sessions captured by BSM will be called out in the listfile.

* A "ps-monitor" file, containing the output of the UNIX process status (ps) command once per minute on the same machine on which BSM auditing was performed.

* UNIX "dump" data, containing weekly epoch dumps and daily incremental dumps for each file system on the machine on which BSM auditing is performed.

* A PostScript block diagram of the simulation network, showing the logical organization of the machines and routers relative to each other.

Sessions will be numbered sequentially, starting with "1". Some sessions may be present only in the tcpdump data, some may be present only in the audit data, and some may be present in both sets of data. Session ID numbers are consistent between the tcpdump and audit data. Not all sessions in the tcpdump data will be included in listfiles, and listfiles should not be used as primary inputs for intrusion detection systems.

The training data will initially be posted on our web site and will then be distributed on multiple CD-ROMs. It is expected that tens of gigabytes of training data may be produced.
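The following sketch, referenced in the listfile description above, shows one way to read a training listfile. It is illustrative only; the field names, function name, and file name are chosen here for clarity and are not defined by the evaluation plan. Each whitespace-separated record is split into the eleven fields described above.

    # Illustrative listfile reader (hypothetical names, not distributed software).
    def read_listfile(path):
        fields = ("session_id", "start_date", "start_time", "duration",
                  "service", "src_port", "dst_port", "src_ip", "dst_ip",
                  "attack_score", "attack_name")
        records = []
        with open(path) as f:
            for line in f:
                parts = line.split()           # white space separates the fields
                if len(parts) != len(fields):  # skip blank or malformed lines
                    continue
                records.append(dict(zip(fields, parts)))
        return records

    # Example: list the sessions labeled as attacks in a training listfile.
    for r in read_listfile("tcpdump.list"):
        if r["attack_score"] == "1":
            print(r["session_id"], r["service"], r["attack_name"])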
5.0 Development Test Data

Development test data is used to evaluate the performance of alternative intrusion detection systems, trained using the training data, prior to the final official test. Sites can train systems using the training data and perform preliminary tests using pre-specified development test data to select system settings for the final test that provide good performance. Use of a common set of development test data (instead of having each site perform separate cross-validation splits of the training data) makes it possible to compare alternative approaches across sites.

In general, development test data will be generated in a manner similar to the training data. The formats of the various data elements will be identical to the training data, with the exception that the attack score and attack name fields of the listfiles will be empty. However, an answer key will be distributed along with the development test data that describes ground truth for that data set. The answer key will be a listfile with three columns: session ID, score (0 for normal, 1 for attack), and attack name.

For the 1998 evaluation, we will define a split of the training data into a training component and a development-test component. For example, if the training data contains seven weeks of data, the development-test data may be the final week of data. We would then recommend that sites train on the first six weeks of training data and evaluate performance using the final week of the training data. When this type of split is defined, no separate answer key will be provided for the development-test data, because this information is already in the listfile provided with the training data.

6.0 Evaluation Test Data

Evaluation test data, or simply test data, is the final set of data used to test the performance of each intrusion detection system being evaluated. Evaluation test data will be generated in a manner similar to the training and development test data. The formats of the various data elements will be identical to the development test data, except that the answer key will not be distributed until the evaluation is complete. There will be attack types in the evaluation test data that are not present in either the training data or the development test data.

7.0 Anomaly Detection

Some intrusion detection systems are designed specifically to detect anomalous user, system, and network behavior. We will insert such anomalous behavior in the test and training data to evaluate these systems. General consistency concerning user, system, and network behavior will be maintained among the training, development test, and evaluation test sets. The same users and network configuration will be used across the three data sets, with a few exceptions to mimic the normal addition and deletion of users and services. In addition, the data will be continuous in time, with the test data following the training data in time. A time-adaptive anomaly detection system can thus be trained on the training data and then correctly tested on the test data without introducing artifacts.

8.0 Evaluation Rules

Sites may submit up to three official results files: one corresponding to the sessions listed in the tcpdump listfile, one corresponding to the sessions listed in the BSM listfile, and one corresponding to the union of sessions listed in the two listfiles. Although we encourage submission of all three results files, we realize that some sites will be able to submit only a subset of the three.

It is permissible for a single site to evaluate multiple systems. For example, a site may submit three results files for system A and three results files for system B. In this case, however, the submitting site must identify one system as the "primary" system prior to performing the evaluation.

For any of the three possible results files (tcpdump, BSM, tcpdump+BSM) that a site chooses to submit, the site is required to submit a scaled attack likelihood for each network session in the corresponding listfile. If a participating site does not submit a complete set of results for that listfile, Lincoln will not report any results for that listfile. For example, if there are 2000 network sessions listed in the tcpdump listfile, and if a site chooses to submit a results file for the tcpdump listfile, then it must produce and submit scores for all 2000 network sessions in that listfile.
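Because an incomplete results file for a listfile will not be reported, a site may wish to verify coverage before submission. The sketch below is only a suggested check, with hypothetical file names and helper function: it confirms that a results file contains a score for every session ID called out in the corresponding listfile (in both file types, the session ID is the first whitespace-separated field of each record).

    # Hypothetical coverage check prior to submission.
    def first_fields(path):
        # Collect the first whitespace-separated field (the session ID) of each record.
        with open(path) as f:
            return set(line.split()[0] for line in f if line.strip())

    def check_coverage(listfile_path, results_path):
        required = first_fields(listfile_path)   # session IDs in the listfile
        scored = first_fields(results_path)      # session IDs that were scored
        missing = required - scored
        if missing:
            print("missing scores for %d sessions" % len(missing))
        return not missing

    check_coverage("tcpdump.list", "results_tcpdump.txt")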
The following evaluation rules and restrictions must be observed by all participants:

* Each decision is to be based only upon the specified network session and any network sessions that have already occurred. Use of information about test sessions occurring subsequent to the given session is not allowed. The intrusion detection systems must be causal.

* Knowledge of the training conditions (implied by the data set directory structure and other network information provided) is allowed.

* Examining the evaluation test data, or any other experimental interaction with this data, is not allowed before all test results have been submitted. This applies to all evaluation test data, whether part of an evaluated session or not.

9.0 Format for Submission of Results

Sites participating in the evaluation must report test results for all sessions. These results must be provided to Lincoln in results files using a standard ASCII record format, with one record for each decision. Each record will have three fields separated by white space. The first field is the session identifier assigned by Lincoln. The second field is the floating point score, indicating the scaled likelihood that a given session contained an attack. The third (optional) field is the name of the attack. Records are to be separated by newline characters. Results files will be deposited on a Lincoln external ftp site prior to the result submission deadline.

10.0 System Description

The name and a brief description of the system (the algorithms) used to produce the results must be submitted along with the results, for each system evaluated.

11.0 Execution Time

Sites must report the CPU execution time that was required to process the test data, as if the test were run on a single CPU. Sites must also describe the CPU and the amount of memory used.
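As an illustration of the results file format specified in Section 9.0, the sketch below writes one whitespace-separated record per decision: the Lincoln-assigned session ID, a floating point score, and an optional attack name. The session IDs, scores, and output file name are invented for the example and do not represent real system output.

    # Illustrative writer for the Section 9.0 results file format.
    # Each record: <session ID> <floating point score> [<attack name>]
    decisions = [
        (11, -1.4, None),   # low score: session judged likely normal
        (18, 3.2, "phf"),   # high score: likely attack, with an attack name
        (20, 0.1, None),
    ]

    with open("results_tcpdump.txt", "w") as out:
        for session_id, score, attack_name in decisions:
            record = "%d %f" % (session_id, score)
            if attack_name:                 # third field is optional
                record += " " + attack_name
            out.write(record + "\n")        # newline separates the records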