1999 DARPA Intrusion Detection Evaluation Plan

1 Introduction
2 Technical Objective
3 The Evaluation
    3.1 Detection
    3.2 Identification
4 Training Data
5 Optional Pretest
6 Evaluation Test Data
7 Format for submission of Results and Scoring
    7.1 Attack Detection
    7.2 Attack Identification
8 Evaluation Rules
9 Security Policy of Eyrie Air Force Base Network
10 Summary of Changes from 1998 Evaluation

Introduction

The 1999 intrusion detection off-line evaluation is the second of an ongoing series of yearly evaluations conducted by MIT Lincoln Laboratory ("Lincoln") under DARPA ITO and Air Force Research Laboratory sponsorship. These evaluations are contributing significantly to the intrusion detection research field by providing direction for research efforts and calibration of current technical capabilities. They are of interest to all researchers working on the general problem of workstation, or host-based, and network intrusion detection. The evaluation is designed to be simple, to focus on core technology issues, and to encourage the widest possible participation by eliminating security and privacy concerns and by providing data types that are used by the majority of intrusion detection systems.

Data for this second evaluation will be made available in the spring of 1999 (see attached schedule). The evaluation itself will occur during the summer. A follow-up meeting for evaluation participants and other interested parties will be held in the fall to discuss research findings. Participation in the evaluation is solicited for all sites that find the task and the evaluation of interest. For more information, and to register for participation in the evaluation, please send e-mail to OFFICE@SST.LL.MIT.EDU or call Marc Zissman at (781) 981-7495.

Technical Objective

Evaluations measure the ability of intrusion detection systems to detect and identify attacks on computer systems and networks. This year's task focuses on UNIX and Windows NT workstations, and the goal is to determine whether any of the following attack events occurred or were attempted during the simulation run:

1. Denial of Service (dos) - Unauthorized attempt to disrupt the normal functioning of a victim host or network.
2. Remote to Local (r2l) - Unauthorized acquisition of user privileges on a local host by a remote user who does not have such privileges.
3. User to Root (u2r) - Unauthorized access to local superuser or administrator privileges by a local unprivileged user.
4. Surveillance or Probe (probe) - Unauthorized probing of a machine or network to look for vulnerabilities, explore configurations, or map the network's topology.
5. Data Compromise (data) - Unauthorized access to or modification of data on a local or remote host.
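
For reference in the scoring sections below, these five category codes can be collected into a small lookup table. This is only a convenience sketch based on the definitions above; neither the code nor this representation is part of the evaluation's required formats:

# Attack categories used throughout this plan (codes and definitions from the list above).
ATTACK_CATEGORIES = {
    "dos":   "Denial of Service: disrupt the normal functioning of a victim host or network",
    "r2l":   "Remote to Local: remote user illegally obtains local user privileges",
    "u2r":   "User to Root: local unprivileged user illegally obtains superuser privileges",
    "probe": "Surveillance/Probe: look for vulnerabilities, configurations, or network topology",
    "data":  "Data Compromise: unauthorized access to or modification of data",
}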


These attacks occur in the context of normal usage of computers and networks as one might observe on a military base. The evaluation is designed to foster research progress, with the following four goals:

1. Explore promising new ideas in intrusion detection.
2. Develop advanced technology incorporating these ideas.
3. Measure the performance of this technology.
4. Compare the performance of various newly developed and existing systems in a systematic, careful way. Previous evaluations of intrusion detection systems have tended to focus exclusively on the probability of detection, without regard to the probability of false alarm. By embedding attack sessions within normal background traffic sessions, these evaluations allow us to measure both the detection and false alarm rates simultaneously.


The Evaluation

Intrusion detection systems will be evaluated on two levels: (1) attack detection and (2) attack identification and analysis. Detection addresses the simple question of whether the presence of some piece of the attack was observed by the ID system. Identification addresses the question of how much important information the ID system was able to glean about the attack. Information is considered important if it can help a system administrator locate the effects of the attack, respond effectively to the attack (possibly preventing the attack from succeeding), or secure the system against such an attack in the future.

Detection

Intrusion detection performance will be evaluated by measuring the correctness of detection decisions for an ensemble of sessions which simulate both normal traffic and attacks. Normal sessions will be designed to reflect (statistically) traffic seen on military bases. Sessions with attacks will contain recent attacks and the types of behaviors observed during illegal computer use. For each attack that is detected, the intrusion detection system output will be required to include the time of the attack, the machine being attacked, and a score indicating the confidence level that this is indeed an attack. Scores may take on any floating point values (positive, negative or zero), with the convention that the more positive the score, the more likely an attack occurred. For any given floating point threshold, T, it will be possible to compute the probability of detection (i.e. the number of attacks having score greater than T divided by the total number of attacks) and the number of false alarms (i.e. the number of times a normal session is labeled with a score above T).

By varying T across the full range of scores output by a system, it will be possible to create a Detection/False-Alarm (DFA) plot, which graphs the detection percentage versus the number of false alarms. These plots are no longer referred to as ROC curves because they are not strictly ROCs: the X-axis cannot be interpreted as a probability when the maximum number of false alarms is not known.
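
As a concrete illustration of this thresholding procedure, the following sketch computes DFA points from a list of per-attack scores and a list of false-alarm scores. It is a minimal example of the computation described above, not the scoring software used in the evaluation, and all function and variable names are our own:

def dfa_points(attack_scores, false_alarm_scores):
    """Compute (false_alarm_count, detection_percentage) pairs for a DFA plot.

    attack_scores      -- one score per true attack (the highest matching score)
    false_alarm_scores -- one score per detection entry that matched no attack
    Higher scores mean "more likely an attack", per the convention above.
    """
    thresholds = sorted(set(attack_scores) | set(false_alarm_scores))
    thresholds = [thresholds[0] - 1.0] + thresholds   # include a threshold below every score
    points = []
    for t in thresholds:
        detected = sum(1 for s in attack_scores if s > t)
        false_alarms = sum(1 for s in false_alarm_scores if s > t)
        points.append((false_alarms, 100.0 * detected / len(attack_scores)))
    return points

# Example with five attacks and three false alarms:
print(dfa_points([0.9, 0.35, 1.0, 0.05, 0.6], [0.2, 0.7, 0.1]))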

This DFA plot can be used to determine performance for any possible operating point. DFA plots and statistics generated from these plots will be used to compare alternative approaches to intrusion detection for different types of attacks. Separate DFA plots will be generated for systems using different combinations of tcpdump, audit, and file system input. Separate DFAs will also be calculated for the five categories of attacks (see Technical Objective) and for UNIX-specific and NT-specific attacks.

Identification

We strongly encourage participating ID systems to provide an additional set of information about the attack to demonstrate the extent to which the attack was fully identified. This information includes the attack source, destination, category, name, services used, and duration. We will then use this information to calculate scores for different systems to evaluate their effectiveness in providing different types of important information about the attacks to the system administrator (see Attack Identification for details).

Training Data

Prior to the evaluation, a set of training data will be made available to participating sites. This data will be used to configure intrusion detection systems and train free parameters. Generally, the types of training data provided will be those that are used by most of today's commercial and research intrusion detection systems. These data will be generated on a simulation network. Both normal use and attack sessions will be present. We will also distribute one week of training data (the last week) without any attacks to facilitate the training of anomaly detection systems. Distributions of normal session types and normal session content will be similar to those observed on military bases. Attack sessions will contain recent attacks and the types of behaviors observed during illegal computer use.

A primary goal of the 1999 evaluation is to determine whether intrusion detection systems can detect new attacks as well as Windows NT attacks. This goal will be accomplished by inserting many recent and novel UNIX and NT attacks in the test data for the 1999 off-line evaluation, but not in the training data. Training data for the 1999 evaluation will be provided to illustrate normal background traffic patterns, to provide further example variants of older UNIX attacks that were included in the 1998 off-line evaluation, and to provide a limited number of examples of Windows NT attacks. No new UNIX attacks will be included in the 1999 training data, and only a few Windows NT attacks will be included. We encourage participants to begin using the 1998 off-line training and test data as training data for the 1999 off-line evaluation, before the 1999 training data is delivered. In particular, participants should focus on developing systems that find attacks in the 1998 test data, but without testing on those attacks.

Training data will consist of the following elements:

1. Outside tcpdump data for roughly one month of network traffic as collected by a tcpdump packet sniffer. This data contains the contents of every packet transmitted between computers inside and outside a simulated military base. Documentation on how tcpdump was invoked will also be provided.
2. Inside tcpdump data collected by a sniffer located inside the simulated military base.
3. Sun Basic Security Module (BSM) audit data from one UNIX Solaris host. This data contains audit information describing system calls made to the Solaris kernel. Raw BSM binary output files are provided along with BSM configuration files and shell scripts used to initialize BSM auditing to record events from processes that implement important TCP/IP services. This year we will not distribute praudit ASCII BSM data, because it simply duplicates the binary data.
4. Windows NT audit event logs as contained in the three files AppEvent.Evt, SysEvent.Evt, SecEvent.Evt. We will audit as many events as feasible. These event logs will be contained in the disk dumps described in the next section.
5. Data from the file systems of the target machines: Last year we distributed the full system dump of one machine (Solaris). We do not plan to distribute full system dumps this year. Rather, for each of the target machines (Solaris, SunOS, Linux, WinNT) we will distribute:

  • A long listing of every file on each target machine (as obtained by the "find . -ls" command run from the root directory).
  • A tar file containing a subset of files on the target systems that ID systems generally use.

  • We request that all sites interested in obtaining this data add to the list below (before March 15) any additional files they would like to receive. We are not planning to provide dumps of any machine, but we invite all participants to provide us with "sensor" or "agent" programs that search through specified directories and look for newly added files or other file system changes (a minimal sketch of such an agent appears after this numbered list). We could then include these new files, or the output of such an agent, in our tar file.

    List of Files to Be Distributed Daily:

    Solaris: /var/log/*
             /var/cron/*
             /var/adm/*
             /var/audit/*
             /etc/*
    SunOS:   /var/log/*
             /var/adm/*
             /etc/*
    Linux:   /var/log/*
             /etc/*
    WinNT:   C:\WINNT\system32\LogFiles\inYYMMDD.log
             C:\WINNT\system32\config\*
             (this includes AppEvent.Evt, SysEvent.Evt, SecEvent.Evt)
6. A PostScript block diagram of the simulation network, showing the logical organization of the machines and routers relative to each other.
7. A detection list file: This list will contain all the information about the attacks that we would require from a participant in order for that participant to receive full credit for detecting every attack. It will also be in the format in which we expect to receive the detection list.
8. An identification list file: This list will contain all the information about the attacks that we would require from a participant in order for that participant to receive full credit for identifying every attack. It will also be in the format in which we expect to receive the identification list.
9. An HTML table listing and describing all attacks in the training data.
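
As noted above, participants are invited to supply "sensor" or "agent" programs that watch for file system changes. The following is a minimal sketch of such an agent, assuming it is given a set of directories to watch (the Linux directories from the daily-distribution list are used purely as an example); it is illustrative only and not a required or official tool:

import os

def snapshot(directories):
    """Record (path -> (size, mtime)) for every file under the given directories,
    roughly the information a "find . -ls" long listing captures."""
    state = {}
    for top in directories:
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue          # file vanished or is unreadable
                state[path] = (st.st_size, int(st.st_mtime))
    return state

def changed_files(baseline, current):
    """Return files that are new or whose size/modification time differs from the baseline."""
    return sorted(path for path, meta in current.items() if baseline.get(path) != meta)

# Illustrative use on the Linux directories from the daily-distribution list above:
WATCHED = ["/var/log", "/etc"]
# baseline = snapshot(WATCHED)                         # taken at the start of the day
# print(changed_files(baseline, snapshot(WATCHED)))    # files added or modified since then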


The training data will initially be posted on our web site. It will be distributed on multiple CD-ROMs to participating sites upon request. It is expected that tens of gigabytes of training data will be produced.

Optional Pretest

We will conduct an optional pretest that will allow us to notify each participant of any problems in the format of their submitted results and to give participants a rough sense of their system's performance. The pretest is optional, but highly recommended, so that sites can iron out the correct format for submission of results; during the actual evaluation, no special consideration will be given to sites that submit their results in an incorrect format. Pretest results will be accepted for day 4 of week 2 of the training data, so that attacks will be present in the results.

Evaluation Test Data

Evaluation test data, or simply test data, is the final data set used to test the performance of each intrusion detection system being evaluated. Evaluation test data will be generated in a manner similar to the training data. The formats of the various data elements will be identical to those of the training data, except that the answer key list of attacks will not be distributed until the evaluation is complete. There will be attacks in the evaluation test data that are not present in the training data.

Format for Submission of Results and Scoring

For the 1998 evaluation, we distributed list files that contained lists of every TCP, UDP, and ICMP session for all the network traffic observable in the tcpdump data. Participants were then required to assign a score to each entry in the list file indicating the likelihood that it corresponded to an attack. Using these list files for distribution and scoring had several drawbacks. First, it was often difficult for participants to align the detection outputs of their ID systems to the list files. Second, it required a great deal of time on the part of Lincoln Laboratory to create list files and accurately label which of the sessions corresponded to which attacks. Third, such a list file is not an ideal representation of the data from host-based sensors: systems that analyze audit data or disk dumps cannot meaningfully assign a score to each network session present in the list file; rather, they look for suspicious activity in the audit trail or in a host's files and logs. Fourth, the list file method did not require ID systems to group together the connections that correspond to one individual attack. Instead, participants labeled each session individually as being an attack session or a normal session, so the method is not an ideal measure of the quality of information that an ID system can provide to the system administrator.

In order to address these issues and to generalize the applicability of the scoring system, the following method, described in the Attack Detection and Attack Identification sections, is being proposed for the 1999 evaluation. This approach evaluates intrusion detection systems with respect to both attack detection and attack identification.

Attack Detection

All participants will be required to submit a list of all attacks they detect in the test data. For each attack detected, the following information must be provided:

1. An attack identification number (it does not matter what the number is, but each different attack detected must be assigned a different integer).
2. Start time of the attack (month, day, year, hour, minute, second).
3. Destination machine of the attack (ip address or complete machine and domain name).
4. A score indicating the confidence level of the ID system that this is an attack.
5. Optionally, information regarding the identification of the attack (name, category, etc.) may be placed on the same line, after a "#" sign. These comments will not be graded.


The submitted list of attacks must have the format of the following example:

ID Date(MM/DD/YYYY) Start_Time Destination Score
1 06/15/1999 08:34:28 pascal.eyrie.af.mil .9 # ping of death
2 06/15/1999 15:24:19 172.16.114.50 .35
3 06/16/1999 11:12:04 zeno.eyrie.af.mil 1 # eject
4 06/16/1999 11:45:22 zeno.eyrie.af.mil .05
5 06/16/1999 22:43:18 172.16.118.1 .6


The different fields in each entry in the list must be separated by white space. All attacks detected for the two weeks of test data must be listed in one file in the above format, and the file should be named "detections.list". We will then determine whether each entry in the submitted detection list matches any attack that was run in the test data.
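
For illustration only, a helper of the following kind could emit entries in this whitespace-separated format; the function and its argument names are our own and carry no official meaning:

def detection_line(attack_id, date, start_time, destination, score, comment=None):
    """Format one detections.list entry: whitespace-separated fields plus an
    optional "#" comment that will not be graded."""
    line = f"{attack_id} {date} {start_time} {destination} {score}"
    if comment:
        line += f" # {comment}"
    return line

print(detection_line(1, "06/15/1999", "08:34:28", "pascal.eyrie.af.mil", 0.9, "ping of death"))
# -> 1 06/15/1999 08:34:28 pascal.eyrie.af.mil 0.9 # ping of death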

A match will be granted if the destination machine is correct and the hypothesized start time falls during any period of the attack. The score for the attack is the score provided in the final column of the detection list entry. If the attack encompasses several sessions that are disjoint in time, an entry whose start time falls during any of these attack sessions will be matched with the corresponding attack. In addition, we will give one minute of leeway on either end of every attack session to allow for minor differences in the time set on different machines; thus, if an entry in the detection list specifies a time within one minute of any attack session, that entry will be matched to the attack. If more than one entry in the detection list matches one attack, the highest score among the matching entries will be accepted as the score for the attack. If the attack goes to multiple destinations (an ipsweep or a multihop, for example), credit will be given for detection if an entry for any of the destination machines is provided in the detection list.

Every entry in the detection list that does not match any true attack session will be labeled as a false alarm. With this information, we can create Detection/False-Alarm plots that show the percentage of attacks detected as a function of the number of false alarms per day, as the threshold score above which an attack is detected is varied.
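
The sketch below is a simplified rendering of this matching rule. It assumes each reference attack is described by its set of destination machines and a list of (start, end) session times in seconds; it is our own illustration, not the official scoring code:

LEEWAY = 60   # one minute of leeway on either end of every attack session

def matches(entry_time, entry_dest, attack):
    """True if a detection-list entry matches a reference attack.

    attack["destinations"] -- set of victim machines (ipsweeps etc. have several)
    attack["sessions"]     -- list of (start, end) times, in seconds, for the attack's sessions
    """
    if entry_dest not in attack["destinations"]:
        return False
    return any(start - LEEWAY <= entry_time <= end + LEEWAY
               for start, end in attack["sessions"])

def score_attacks(entries, attacks):
    """Assign each attack the highest score among its matching entries;
    count every entry that matches no attack as a false alarm."""
    best_scores = [None] * len(attacks)
    false_alarms = 0
    for entry_time, entry_dest, score in entries:
        matched = False
        for i, attack in enumerate(attacks):
            if matches(entry_time, entry_dest, attack):
                matched = True
                if best_scores[i] is None or score > best_scores[i]:
                    best_scores[i] = score
        if not matched:
            false_alarms += 1
    return best_scores, false_alarms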

Attack Identification

For every attack detected and listed in the detection list, the participant has the option to provide an additional set of information. This information will be used to evaluate the ability of the intrusion detection system to provide the system administrator with different types of useful information. Although this part of the evaluation is optional, it is highly recommended that each site provide all the relevant information that its system obtains.

To this end, a separate identification list file will be returned by all participants. The file should be named "identification.list". For each entry in the detections.list file, there should be a corresponding entry in the identification.list file. Each entry will contain an ID# (which must match the ID# of the corresponding entry in the detection list) and any of the following items of information that can be provided:

1. Date of attack: As for the detection list, the format is MM/DD/YYYY. This is the date the attack begins.
2. Name of attack: We will provide a list of all attack names that appeared in the 1998 evaluation or in the 1999 training data (the HTML file of the attack database). For attacks that are new to the 1999 test set (i.e., that did not appear in the 1999 training data), the name provided in the identification list file will not be scored.
3. Category of attack: The five categories described in section 2 (dos, probe, u2r, r2l, data) will be used to categorize attacks, allowing some attacks to be in more than one category when necessary.
4. Start time (HH:MM:SS): The start time provided in the identification list will be used to determine the hypothesized time bounds of the attack, to calculate the percentage of sessions or actions in the attack that were found. This start time need not be identical to the start time provided in the detection list.
5. Duration (HH:MM:SS): This is the amount of time between the attack start and end times. If the attack has several stages (such as a setup, break-in, and actions), then the start time should be when the beginning of the setup occurs and the duration should encompass the full time period until all actions of the attacker are complete. The hour field can be larger than 24 if the attack extends over multiple days.
6. Source machine(s): Either ip addresses or full machine names including the domain will be accepted. Multiple machines must be comma separated. In addition, the shorthand notation x.y.z.(1-100) will be allowed to refer to the 100 machines x.y.z.1, x.y.z.2, ..., x.y.z.100.
7. Destination machine(s): Same format as for source machines.
8. Destination port(s) and Number of Connections to each:

Any well-known ports that are used during an attack should be listed. In addition, any port that is the destination of a TCP connection, UDP packet, or ICMP packet should be listed, unless it is an ftp-data connection. In the case of ftp-data, simply list the well-known port number (20) at the machine on which that port is used in an attack-related connection.

  • The list of ports should be classified according to whether they are ports on/at the attacking machine or the victim machine. For example, if there is a telnet from host "Attacker", port 12345, to host "Victim", port 23 (Attacker:12345 -> Victim:23), we would refer to port 23 on/at "Victim".
  • The shorthand notation 1-100 can be used to refer to ports 1, 2, 3, ..., 100.
  • UDP connections will be labeled as port#/u.
  • ICMP connections will be labeled as "i". (See examples below.)
  • The number of times a single port or a range of ports is connected to should be given as a repetition count in curly braces after the port or port range. For example, (1-100) {5} indicates that ports 1, 2, 3, ..., 100 were each used 5 times, for a total of 500 connections.
  • The list should be comma separated (a sketch that expands this notation appears after this numbered list).
9. Comments: This field is optional and will not be used to score the identification part of the evaluation.
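
The following sketch expands the port shorthand described in item 8 into explicit (port, count) pairs. The parsing details are our interpretation of the notation above and are illustrative only:

import re

def expand_ports(spec):
    """Expand a port field such as "21 {1} , (1-100) {5} , 53/u {2} , i {3}"
    into a list of (port_label, connection_count) pairs."""
    pairs = []
    for field in filter(None, (f.strip() for f in spec.split(","))):
        m = re.match(r"^\(?([^\s{)]+)\)?\s*(?:\{(\d+)\})?$", field)
        if not m:
            raise ValueError(f"unrecognized port field: {field!r}")
        ports, count = m.group(1), int(m.group(2) or 1)
        r = re.match(r"^(\d+)-(\d+)(/u)?$", ports)        # a range such as 1-100
        if r:
            lo, hi, suffix = int(r.group(1)), int(r.group(2)), r.group(3) or ""
            pairs += [(f"{p}{suffix}", count) for p in range(lo, hi + 1)]
        else:
            pairs.append((ports, count))                  # single port, "port#/u", or "i"
    return pairs

print(expand_ports("21 {1} , 20 {4} , i {2}"))
# -> [('21', 1), ('20', 4), ('i', 2)]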


The entries in the identification list should be in the format of the following example:

ID: 1
Date: 03/08/1999
Name: ntinfoscan
Category: probe
Start_Time: 08:01:01
Duration: 00:15:14
Attacker: 206.048.044.018
Victim: hume.eyrie.af.mil
Ports:
          At_Attacker:
          At_Victim: 21 {1} , 20 {4} , 23 {1} , 80 {10} , 139 {2}
Username: anonymous
Comments:

ID: 2
Date: 03/08/1999
Name: pod
Category: dos
Start_Time: 08:50:15
Duration: 00:01:24
Attacker: 152.169.215.104
Victim: zeno.eyrie.af.mil
Ports:
           At_Attacker:
           At_Victim: i {2}
Username: n/a
Comments:

ID: 3
Date: 03/08/1999
Name: back
Category: dos
Start_Time: 09:39:16
Duration: 00:00:59
Attacker: 199.174.194.016
Victim: marx.eyrie.af.mil
Ports:
          At_Attacker:
          At_Victim: 80 {40}
Username: n/a
Comments:

ID: 4
Date: 03/08/1999
Name: httptunnel
Category: r2l
Start_Time: 12:09:18
Duration: 00:00:59
Attacker: 196.37.75.158
Victim: pascal.eyrie.af.mil
Ports:
          At_Attacker: 21 {1} , 20 {1} , 8000 {9}
          At_Victim: 23 {1}
Username: mariaht
Comments:

ID: 5
Date: 03/08/1999
Name: land
Category: dos
Start_Time: 15:57:15
Duration: 00:00:01
Attacker: (spoofed ip = pascal)
Victim: pascal.eyrie.af.mil
Ports:
           At_Attacker:
           At_Victim: 23 {1}
Username: n/a
Comments:
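
For illustration, the sketch below reads one such entry into a Python dictionary. It is an assumption about how these "Key: value" blocks might be post-processed, not a required tool:

def parse_identification_entry(text):
    """Parse one identification-list entry (the "Key: value" block format shown
    above) into a dictionary.  Values are kept as raw strings; the port fields
    can then be expanded with the expand_ports() sketch given earlier."""
    entry = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        entry[key.strip()] = value.strip()
    return entry

example = """\
ID: 2
Date: 03/08/1999
Name: pod
Category: dos
Start_Time: 08:50:15
Duration: 00:01:24
Attacker: 152.169.215.104
Victim: zeno.eyrie.af.mil
Ports:
           At_Attacker:
           At_Victim: i {2}
Username: n/a
Comments:
"""
print(parse_identification_entry(example)["At_Victim"])   # -> i {2}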

We intend to use the identification attack lists to evaluate different aspects of the performance of ID systems. A score between 0 and 1 will be assigned to each participant for each of the following categories (a sketch of one fraction-based score of this kind appears after this list).

1. Percentage of the Attack Detected: We will determine the percentage of connections of the attack that fall within the duration provided. Credit will be taken away if the hypothesized duration overextends the true duration on either end. For attacks with setup, break-in, and/or follow-up actions, we will determine, from the submitted start time and duration, which stages were detected by the ID system.
2. Detection Delay: This is the amount of time that elapsed between when the attack began and when it was detected. Early detection can be important to warn the system administrator of an attack before the main damage has been done. We can evaluate this ability of the system from the start time provided in the identification list. If the hypothesized start time is more than 5 seconds before the attack begins, no credit will be given.
3. Source: What fraction of sources were identified (credit will be taken away for incorrect sources).
4. Destination: What fraction of destination machines were identified (credit will be taken away for incorrect destinations).
5. Ports/Services: The score given will be based on the percentage of the total number of destination ports, and the number of connections made to (from) each, that are correctly identified. (Credit will be taken away for incorrect ports, or for incorrectly reporting the number of connections made to (from) a particular port.)
6. Name: (for attacks that appeared in the 1999 training data or in the 1998 data).
7. Category: (dos, probe, u2r, r2l, data). NOTE: It is STRONGLY suggested that every participant provide a guess for the category of attack. We intend to show Detection/False-Alarm plots for attacks by category, and if each detection entry is matched up with an identification entry containing a category hypothesis, we will consider only the false alarms specific to each category in these plots. Any false alarm that does not have an associated category hypothesis will have to be counted as a false alarm for all categories of attacks. Only one category name should be provided, and if an attack could be placed in more than one category, no credit will be lost for providing only one category.
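
As an illustration of the "fraction identified, with credit taken away for errors" style of score used for sources, destinations, and ports, the sketch below computes one plausible score of this kind. The exact formulas used by the evaluation are not specified in this plan, so this is only an assumption:

def fraction_score(hypothesized, reference):
    """One plausible 0-to-1 score: the fraction of reference items correctly
    listed, minus a penalty for each spurious item, clipped to [0, 1].
    The official scoring formula is not specified in this plan; this is an
    illustrative assumption only."""
    hypothesized, reference = set(hypothesized), set(reference)
    if not reference:
        return 1.0 if not hypothesized else 0.0
    correct = len(hypothesized & reference) / len(reference)
    penalty = len(hypothesized - reference) / len(reference)
    return max(0.0, min(1.0, correct - penalty))

# Example: two of three true source machines found, plus one incorrect source.
print(fraction_score({"206.48.44.18", "196.37.75.158", "1.2.3.4"},
                     {"206.48.44.18", "196.37.75.158", "152.169.215.104"}))
# -> 0.333...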


Evaluation Rules

It is permissible for a single site to evaluate multiple systems. For example, a site may submit results files for system A and results files for system B. In this case, however, the submitting site must identify one system as the "primary" system prior to performing the evaluation. In addition, developers must return results from individual components of their systems as well as one overall score, to make the results more informative and easier to interpret. For example, if a site has a network-based signature system and a host-based statistical system, separate results should be returned for each system as well as overall results. Separate and composite results will make it possible to determine how much contribution is provided by each component as well as the performance of the fused output. Separate components could include network-based sniffer processors, BSM or NT audit processors, and statistical or signature-based processing using these types of data or file system information. The following evaluation rules and restrictions must be observed by all participants:

  • Knowledge of the training conditions (implied by data set directory structure and other network information provided) is allowed.
  • Examining the evaluation test data by hand, or any other experimental interaction with this data, is not allowed before all test results have been submitted. This applies to all evaluation test data, whether part of an evaluated session or not.
  • The name and a brief description of the system (the algorithms) used to produce the results must be submitted along with the results, for each system evaluated.
  • Sites must report the CPU execution time that was required to process the test data, as if the test were run on a single CPU. Sites must also describe the CPU and the amount of memory used.

Security Policy: Eyrie Air Force Base Network

The security policy adopted by the base eyrie.af.mil is a loose one. No services are blocked for connections between the air force base and the outside. Most users inside and outside the base have accounts on only one machine. Some users have accounts on several machines both inside and outside the base. Some of these users are usually situated on the outside and may connect to their inside accounts if they want access to the inside network. Others of these users are situated inside the base and sometimes connect to their outside accounts. A subset of these users are system administrators who know the root password. These system administrators often conduct their administrative work and monitoring remotely. They typically log in from .mil machines on the outside, but they may come from other machines on occasion. It is recommended policy that system administrators telnet with their user names and then run su from the telnet session. However, it is not a violation of security if an administrator telnets as root. System administrators have the right to add users or delete users, and the right to add machines and take machines away. Users may download publicly available material from anonymous ftp sites and browse any web sites they wish. One outside machine will be running SNMP on a regular basis to monitor the health of systems on Eyrie Air Force Base. The internal subnet with IP addresses 172.16.112.* is protected. The only valid IP addresses allowed on this subnet are those specified in the List of Simulation Network Hosts. It is illegal to add another host onto this subnet.

There is a secret directory set up on the machines under /home/secret which contains highly confidential information. Only users in the secret group (abramh, elmoc, quintond, orionc) are allowed access to the secret files under this directory. When reading or writing these files over the network, these users must use ssh rather than telnet in order to prevent the files from being sent as clear text. Any transferring of these files to another computer is strictly illegal.

Typical hacks are not allowed by any users. All of the following activities are illegal and fall under the category of an attack:

  • Disrupting system or network functioning
  • Probing for vulnerabilities
  • Obtaining root privileges through dubious means
  • Gaining access that the user is not privileged to have
  • Illegally modifying or accessing data not owned by the user
  • Installing or using previously installed back doors or trojan horses
  • Unauthorized uploading of data to anonymous ftp sites
  • Downloading of illegally uploaded data from anonymous ftp sites
  • Unauthorized use of SNMP
  • Transferring any information that has been illegally obtained
  • Transferring any data that is accessible only by the secret group, even by a member of the secret group
  • Preparations required to carry out any of the above illegal actions.
  • Actions made possible by any of the above illegal events.
  • Installing or running a sniffer to capture network traffic.

Summary of Changes

The following summarizes the major changes made from the 1998 evaluation.

1. Windows NT machines (Microsoft Windows NT Server 4.0 Service Pack 1) will be added as victims. Attacks against these machines are being introduced as well.
2. The format in which participants report attacks is being simplified, with the hope of making the reporting a more natural representation of the outputs of intrusion detection systems.
3. The evaluation of systems will be more detailed, in that both attack detection and attack identification will be evaluated.
4. No disk dumps will be provided. Rather, lists of all files on the system, and the subset of files that ID systems analyze, will be provided for all inside target machines.
5. Insider attacks, which will not be visible in the outside sniffing data, are allowed.
6. Inside sniffing data will be provided.
7. The category of data compromise is being added to the attack taxonomy, with several instances of this attack category to be implemented.
8. The last week of training data will contain no attacks.
9. ASCII praudit BSM data will not be provided.
10. The "psmonitor" program will not be run and its output will not be provided.
