The experiment manager is a script written in Python and invoked every minute by cron. It is responsible for starting and stopping experiments at the appropriate times. To have as many resources available as possible when starting experiments, every time it runs, it first looks for experiments to stop. It performs the following tasks:

A. Stop experiments
Watch for errors thrown during the following steps. Errors fall into two categories. The first category only aborts the attempt to stop the reservation in which the error occurred. The second category is severe enough that the reservation and testbed have probably been left in an inconsistent state, and starting new experiments should not be attempted. For the second category, create an error file lock, which prevents new experiments from starting in an environment where something is wrong. However, we do want to stop as many experiments as possible, so continue attempting to stop the remaining experiments.

Get list of experiments to stop from table experiment_reservation (stop_time <= now and status = "started"). For each reservation:
  1. Update table experiment_reservation, set the status of reservation to "ending". Log this.
  2. Get list of VLANs from vlan_status. For each VLAN:
    1. Tell switch to destroy VLAN.
    2. Update switch_interface table to free the switch interfaces that were assigned to the VLAN.
    3. Change status of VLAN to available by updating the vlan_status table.
  3. Get list of actual PCs from nic_rel_to_actual. For each PC:
    1. Update LDAP server and deny access to experimental PCs.
    2. Attempt cleanup of experimental PC if its status is "busy".
    3. Set status of PC to "free" (update table pc_status).
  4. If operations relating to specific PCs generated an error, set the state of the affected PCs to "error". These PCs are unavailable for further experiments until examined by an administrator. Log all errors.
  5. Update table experiment_reservation, set the status of experiment to "ended". Log this.
  6. Send email to researcher about end of experiment (get the email address from the LDAP database).
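
The two error categories in part A can be sketched as follows. This is a minimal illustration, not the manager's actual code: the exception classes ReservationError and SevereError, and the callables stop_one and set_error_lock, are hypothetical names introduced here. The point is only that a severe error sets the error lock without ending the loop.

```python
import logging

log = logging.getLogger("experiment_manager")


class ReservationError(Exception):
    """Hypothetical category 1: give up on this reservation only."""


class SevereError(Exception):
    """Hypothetical category 2: testbed state may be inconsistent."""


def stop_experiments(reservations, stop_one, set_error_lock):
    """Try to stop every due reservation; return those that were stopped.

    stop_one(reservation) stands in for steps 1-6 above, and
    set_error_lock() stands in for creating the error file lock.
    Both are placeholders supplied by the caller.
    """
    stopped = []
    for reservation in reservations:
        try:
            stop_one(reservation)
            stopped.append(reservation)
        except ReservationError as exc:
            # Category 1: log and move on to the next reservation.
            log.error("could not stop reservation %s: %s", reservation, exc)
        except SevereError as exc:
            # Category 2: block future starts, but keep stopping the rest.
            log.critical("severe error on reservation %s: %s", reservation, exc)
            set_error_lock()
    return stopped
```

Because the lock is set inside the loop rather than by exiting, part B can later see the lock and refuse to start anything, while the remaining reservations are still released.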

B. Start experiments
If any error is thrown during the following steps, catch the exception, create the error file lock, and exit.
  1. Check the error file lock; if it is set, do not start any experiments. Administrator intervention is required due to irrecoverable errors in a previous run of this script.
  2. Get list of experiments to start from table reservations (start_time <= now and status = "reserved"). For each experiment:
    1. Perform pre-flight checks.
      1. Verify that all needed PCs are in status "free"
      2. Verify that all the software images are available
      If any check fails, log why and set the status of the reservation to "ended".
    2. Update table reservations, set the status of experiment to "starting". Log this.
    3. Get list of PCs in topography (returned sorted by boot_priority, from the table pc_virtual). For each virtual PC:
      1. Get real PC that was assigned to the topography (virtual) PC during reservation.
      2. Update table pc_status, mark the PC as busy (keep track of which PCs have been set as busy, for reversal and recovery if an error happens).
      3. Grant access to the PC by updating the LDAP server (keep track of which PCs access has been granted to, for reversal and recovery if an error happens).
      4. Call remote script to set up the user home directory (keep track of which PCs have been set up, for reversal and recovery if an error happens).
      5. From {experiment_number, pc_virtual}, get software_images.
      6. Transfer images to the PC (keep track of images transferred, for reversal and recovery if an error happens).
    4. In the following, we attempt to create a VLAN on the switch for each virtual VLAN in the topography, and assign to that VLAN the switch interfaces that are connected to the appropriate experimental PCs. From {reservation_number} get {pc_relative, nic_relative, pc_actual, nic_actual} from table nic_rel_to_actual. For each set:
      1. From {pc_actual, nic_actual} get the switch interface from table pc_connections. This returns an integer from 1 to 48 (the switch has 48 interfaces); record it for later.
      2. From {experiment_number, pc_relative, nic_relative} get {vlan_virtual, ip_address} from table nic_virtual. (As we are not yet managing multiple images per PC, assume that the "relative" values are the same as pc_virtual and nic_virtual.)
      3. Check table vlan_status to see if {vlan_virtual, reservation_number} has already been assigned a real VLAN number.
      4. If not (no results), use function "get_free_vlan()" to find a free VLAN in table vlan_status.
        Assign the free VLAN to the virtual VLAN and record (update) this in table vlan_status using the function "use_vlan(vlan_actual, reservation_number, vlan_virtual)".
      5. Assign the interface to the real VLAN and update table switch_interface using the function "use_interfaces(interface_num, vlan_actual)".
  3. Update table experiment_reservation, set the status of experiment to "started". Log this.
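
The VLAN assignment in step 4 can be sketched as below. The function names get_free_vlan, use_vlan and use_interfaces come from the text, but everything else is an assumption made to keep the example self-contained: the tables are modelled as plain dictionaries passed in as an extra first argument, the helper assign_vlans is hypothetical, and picking the lowest free VLAN number is an arbitrary choice.

```python
def get_free_vlan(vlan_status):
    """Return a VLAN number that has no owner (here: the lowest one)."""
    for vlan_actual in sorted(vlan_status):
        if vlan_status[vlan_actual] is None:
            return vlan_actual
    raise RuntimeError("no free VLAN left")


def use_vlan(vlan_status, vlan_actual, reservation_number, vlan_virtual):
    """Record that vlan_actual now backs this reservation's virtual VLAN."""
    vlan_status[vlan_actual] = (reservation_number, vlan_virtual)


def use_interfaces(switch_interface, interface_num, vlan_actual):
    """Record that a switch interface belongs to vlan_actual."""
    switch_interface[interface_num] = vlan_actual


def assign_vlans(reservation_number, nics, vlan_status, switch_interface):
    """Map each (interface_num, vlan_virtual) pair onto a real VLAN.

    vlan_status maps vlan_actual -> (reservation_number, vlan_virtual),
    or None when the VLAN is available (a stand-in for the real table).
    """
    for interface_num, vlan_virtual in nics:
        owner = (reservation_number, vlan_virtual)
        # Sub-step 3: has this virtual VLAN already been given a real one?
        vlan_actual = next(
            (v for v, current in vlan_status.items() if current == owner), None
        )
        if vlan_actual is None:
            # Sub-step 4: allocate a free VLAN and record the assignment.
            vlan_actual = get_free_vlan(vlan_status)
            use_vlan(vlan_status, vlan_actual, reservation_number, vlan_virtual)
        # Sub-step 5: attach the interface to the real VLAN.
        use_interfaces(switch_interface, interface_num, vlan_actual)
    return switch_interface
```

Two NICs that share a virtual VLAN end up on the same real VLAN, because the second lookup finds the assignment made for the first.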

Only one copy of the manager can be running at any one time; this is guaranteed by a file lock that is checked/set in an atomic operation at the beginning of the script. If severe errors are encountered during execution, an error file lock is set that also stops future invocations by cron. Administrator intervention is then required.
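
One common way to check and set a lock file in a single atomic step is to create it with O_CREAT | O_EXCL, which fails if the file already exists. The sketch below assumes that approach; the text does not say how the manager actually implements its lock, and the function names and the choice to write the process ID into the file are illustrative only.

```python
import os


def acquire_lock(path):
    """Create the lock file atomically; return False if it already exists."""
    try:
        # O_EXCL makes the existence check and the creation one operation.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        # Storing the PID is optional; it helps an administrator investigate.
        handle.write(str(os.getpid()))
    return True


def release_lock(path):
    """Remove the run lock at the end of a normal run."""
    os.remove(path)


def error_lock_is_set(path):
    """The error lock only needs to exist; an administrator removes it."""
    return os.path.exists(path)
```

A run would call acquire_lock first and exit immediately if it returns False, and would release the run lock on the way out. The error lock is deliberately never released by the script itself.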


Helper Scripts
On each experimental machine, there are two user accounts whose login shell is a script. User "useradd" has the script "plus_user" as its login shell, which creates a home directory for users on experimental machines. User "userdel" has the script "minus_user" as its login shell, which cleans up experimental machines after an experiment. Each of these accounts has the ssh public key of the user running the manager script in its authorized_keys file.
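
From the manager's side, triggering a helper amounts to opening an ssh session as the helper account. The sketch below shows one way this might look. How "plus_user" and "minus_user" receive the researcher's user name is not described above; passing it as the remote command (which sshd hands to the login shell after "-c") is an assumption, as are the function names and the 60-second timeout.

```python
import shlex
import subprocess


def helper_command(helper_account, host, username):
    """Build the ssh argument list that triggers a helper's login script."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt; rely on the installed public key
        f"{helper_account}@{host}",
        shlex.quote(username),  # assumed way of passing the user name
    ]


def run_helper(helper_account, host, username, timeout=60):
    """Run the helper and raise if ssh or the remote script fails."""
    return subprocess.run(
        helper_command(helper_account, host, username),
        check=True,
        timeout=timeout,
    )
```

For example, run_helper("useradd", host, user) would correspond to the home directory setup during experiment start, and run_helper("userdel", host, user) to the cleanup during experiment stop.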