Responses Due By: 2026-03-24 23:59:59 US/Eastern Time


MYSTIC DEPOT: Vendor-Agnostic AI Evaluation Infrastructure


We look forward to your solution —
To submit, scroll to the form at the bottom of this page.

Project Description

Problem Statement

As artificial intelligence (AI) capabilities evolve at an extraordinary pace, the government requires evaluation infrastructure that can keep pace, continuously assessing new models against mission-specific benchmarks as those models are released.


Further, the success of AI systems in national security contexts will depend on human-machine teaming. Evaluation must assess not only whether AI systems can perform tasks in isolation, but whether human-AI teams achieve better mission outcomes than either humans or AI alone.  


Evaluation must also keep pace as AI systems evolve from passive models to active agents that use tools, access systems, and execute multi-step tasks. Beyond model outputs, assessment must account for agent behaviors, including whether agents complete complex missions correctly and safely, use tools appropriately, and maintain auditability.


Desired Solution Attributes

The Department of War (DoW), in partnership with the Office of the Director of National Intelligence (ODNI), seeks an evaluation harness and government-specific benchmarks that together enable rigorous, reproducible, vendor-agnostic assessment of any AI system against government-defined criteria. The Government intends to use this harness across multiple programs. Solutions should be designed for broad applicability rather than single-program optimization. This Area of Interest (AOI) comprises two Lines of Effort (LOE); vendors may respond to one or both. Vendors submitting solutions must specify if they are addressing LOE 1, LOE 2, or both on the title slide or title page in their submission. Submission file titles should likewise indicate “LOE1_”, “LOE2_”, or “LOE1&2_” as a prefix. 


The Government is interested in considering solutions from a wide selection of vendors. All submissions should clearly explain which of the desired solution attributes they do and do not address, with proven examples of prior deployment, if applicable. The Government will consider partial solutions. Vendors are welcome to apply individually or in partnership. The Government may also request teaming arrangements amongst solution providers. Vendors are expected to demonstrate their solution in an unclassified environment as part of the Commercial Solutions Opening.


LOE 1: Evaluation Harness

This AOI seeks an evaluation harness or harnesses to serve as integrated infrastructure: the execution environment, tooling, and methodology that connect models to benchmarks and produce structured evaluation data. Harness architecture should enable standardized, reproducible assessment of AI systems against defined criteria by providing the following (an illustrative sketch of how several of these components might fit together appears after the list):

  • Model Interface: Connects the harness to AI systems under evaluation. Provides a standardized, pluggable architecture for interfacing with diverse model types. 
  • Execution Engine: Orchestrates complex evaluation workflows across heterogeneous model and environment configurations. 
  • Measurement and Scoring System: Scores model outputs against benchmarks.
  • Human Evaluation Integration: Supports human-in-the-loop workflows with interfaces for subject-matter-expert review to measure and compare human workload, usability, and mission performance across human-only, AI-only, and human-AI team scenarios.
  • Output and Reporting Layer: Exports all data in open, non-proprietary formats, generates aggregate reports, and provides API access to evaluation data. 
  • Continuous Monitoring and Analytics: Automates model ingestion and evaluation and tracks performance trends over time.
  • Configuration and Benchmark Management: Defines, versions, validates, and manages benchmarks and evaluation configurations to ensure consistency across environments.
  • Degraded Conditions Simulator: Simulates operational stress and network degradation in a controlled, reproducible environment. Enables assessment of model and system resilience under variable conditions to validate performance in mission-critical denied, degraded, intermittent, or limited (DDIL) environments.
  • Agentic Evaluation: Evaluates agent actions, tool invocations, and multi-step task execution. Provides for safe testing and maintains comprehensive audit trails of agent decisions and actions.
  • Adversarial AI: Supports automated red-teaming, including the execution of adversarial prompts and attack patterns. Scores robustness across attack categories and exports results in open formats.
  • Multimodal Inputs: Supports video, audio, and cross-modal datasets and comparison frameworks.
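
For illustration only, the following is a minimal sketch of how the Model Interface, Measurement and Scoring System, and Output and Reporting Layer described above might fit together; the class and function names (ModelInterface, EchoModel, run_benchmark) are hypothetical assumptions, not a required design.

```python
# Minimal illustrative sketch (hypothetical names, not a prescribed design):
# a pluggable model interface, a scoring function, and an execution loop that
# emits structured evaluation records in an open, non-proprietary format (JSON).
import json
import time
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


class ModelInterface(ABC):
    """Adapter layer: each AI system under evaluation implements this."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...


class EchoModel(ModelInterface):
    """Stand-in model used only to make the sketch runnable."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


@dataclass
class EvalRecord:
    benchmark: str
    task_id: str
    prompt: str
    output: str
    score: float
    latency_s: float


def exact_match(output: str, reference: str) -> float:
    """Toy scorer; real benchmarks would plug in their own scoring logic."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def run_benchmark(model: ModelInterface, benchmark: str, tasks: List[Dict],
                  scorer: Callable[[str, str], float]) -> List[EvalRecord]:
    """Execution engine in miniature: run every task and produce structured records."""
    records = []
    for task in tasks:
        start = time.perf_counter()
        output = model.generate(task["prompt"])
        latency = time.perf_counter() - start
        records.append(EvalRecord(benchmark=benchmark, task_id=task["id"],
                                  prompt=task["prompt"], output=output,
                                  score=scorer(output, task["reference"]),
                                  latency_s=round(latency, 4)))
    return records


if __name__ == "__main__":
    tasks = [{"id": "t1", "prompt": "hello", "reference": "HELLO"}]
    results = run_benchmark(EchoModel(), "demo-benchmark", tasks, exact_match)
    # Output and reporting layer: export records in a non-proprietary format.
    print(json.dumps([asdict(r) for r in results], indent=2))
```

In a design of this kind, a new model type is evaluated by adding another ModelInterface adapter while the execution, scoring, and reporting pieces stay unchanged, which is the sort of modularity this LOE is seeking.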

Submissions in response to this LOE should also have the following attributes:


  • Modular architecture enabling independent upgrade and addition of components
  • Containerized deployment supporting standard government orchestration platforms
  • Deployability, as directed, across government environments (unclassified, classified cloud, and air-gapped) without fundamental architectural changes
  • Harness infrastructure and evaluation content (benchmarks, scoring, attacks) that are interoperable but can be decoupled (an illustrative sketch of this decoupling follows the list)
  • Appropriate access controls and ability to protect sensitive data according to classification requirements
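
For illustration only, the sketch below shows one way the decoupling attribute above could work: harness infrastructure loads evaluation content (benchmarks, scoring rules, attack sets) from versioned definition files maintained and shipped separately from the harness code. The directory layout and field names are hypothetical assumptions, not Government requirements.

```python
# Illustrative sketch only: the harness reads versioned benchmark content from a
# separate content package, so infrastructure and content can be upgraded independently.
import json
from pathlib import Path
from typing import Dict, List


def load_benchmark_definitions(content_dir: str) -> List[Dict]:
    """Read versioned benchmark definitions shipped separately from the harness."""
    definitions = []
    for path in sorted(Path(content_dir).glob("*.json")):
        definition = json.loads(path.read_text())
        # Basic validation keeps configurations consistent across environments.
        for field in ("name", "version", "tasks", "scoring"):
            if field not in definition:
                raise ValueError(f"{path.name} is missing required field '{field}'")
        definitions.append(definition)
    return definitions


if __name__ == "__main__":
    # Example: a content package directory mounted into the harness container.
    for bench in load_benchmark_definitions("./benchmark_content"):
        print(f"{bench['name']} v{bench['version']}: {len(bench['tasks'])} tasks")
```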

LOE 2: Benchmark Development and Methodology

This AOI seeks solutions from vendors that create benchmarks across unclassified, secret, and top secret workflows, and that provide their methodology for government review and adoption. These benchmarks would be executed using the evaluation harness in LOE 1. 


The benchmarking methodology should address the following (an illustrative sketch of two of the validation concepts appears after the list):

  • Requirements elicitation: Identifying what capabilities matter for a given mission context.
  • Task decomposition: Breaking complex capabilities into measurable evaluation tasks.
  • Input design: Constructing scenarios that ensure representativeness (coverage of realistic operational conditions) and operational realism (workflows that mirror how human-AI teams actually operate).
  • Scoring criteria development: Defining what "good" looks like, including rubric construction that prioritizes interpretability (results that are easily understood and can be acted upon by decision makers).
  • Baseline establishment: Evaluating open-source/open-weight models to set performance baselines while ensuring fairness (no systemic advantage to particular architectures or vendors).
  • Validation: Verifying that benchmarks meet standards for validity (measurement of the intended capability), reliability (consistent results across runs), and discriminability (clear differentiation across performance levels).
  • Gaming resistance: Designing benchmarks resistant to optimization without genuine capability improvement.
  • Maintenance: Updating benchmarks as requirements or model capabilities evolve so that they retain long-term utility.
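
For illustration only, the sketch below shows how two of the validation concepts above, reliability and discriminability, could be expressed as simple automated checks; the thresholds, data, and function names are illustrative assumptions rather than Government-defined criteria.

```python
# Illustrative sketch of two methodology checks (assumed thresholds):
# reliability       = per-task scores vary little across repeated runs
# discriminability  = a known-stronger baseline scores clearly higher than a weaker one
from statistics import mean, pstdev
from typing import List


def is_reliable(run_scores: List[List[float]], max_std: float = 0.05) -> bool:
    """run_scores is runs x tasks; small per-task spread across runs counts as reliable."""
    per_task_std = [pstdev(task) for task in zip(*run_scores)]
    return mean(per_task_std) <= max_std


def discriminates(weak_scores: List[float], strong_scores: List[float],
                  min_gap: float = 0.10) -> bool:
    """The benchmark should separate known performance levels by at least min_gap."""
    return (mean(strong_scores) - mean(weak_scores)) >= min_gap


if __name__ == "__main__":
    runs = [[0.80, 0.60, 0.90], [0.82, 0.58, 0.90], [0.79, 0.61, 0.88]]  # 3 runs x 3 tasks
    print("reliable:", is_reliable(runs))
    print("discriminates:", discriminates(weak_scores=[0.40, 0.50, 0.45],
                                          strong_scores=[0.80, 0.75, 0.90]))
```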

Training Materials

Materials should enable government personnel to develop and maintain benchmarks without ongoing vendor support, including but not limited to: a written methodology guide, worked examples, common pitfalls, a quality assurance checklist, and a training curriculum.


Vendor Qualifications

This AOI seeks solutions from vendors that are eligible to receive an Other Transaction award in accordance with 10 U.S.C. 4022 and have demonstrated expertise in AI evaluation, security testing, and benchmark creation. Submissions should provide specific, verifiable evidence of the following preferred qualifications as appropriate to the supported LOE:

  • Published research on, and/or demonstrable application of, evaluation methodology, benchmark design, measurement robustness, or gaming resistance
  • Active contribution to or development of industry-standard frameworks (e.g., HELM, AgentDojo, Inspect), red-teaming tools (e.g., Garak, PyRIT), or established benchmarks (e.g., MMLU, HumanEval)
  • Prior collaboration with frontier AI labs on evaluation testing or benchmark creation
  • Prior work with the National Institute of Standards and Technology Center for AI Standards and Innovation, United Kingdom AI Safety Institute, or equivalent government evaluation programs
  • Personnel holding active security clearances (Secret minimum; TS/SCI preferred) or documented clearability
  • Experience deploying evaluation infrastructure in secure government or enterprise environments, specifically within DoW or Intelligence Community (IC) environments (IL5/IL6, JWICS)
  • Familiarity with IC analytic tradecraft, DoW doctrine, or domain expertise in national security and military operations
  • Experience designing evaluation protocols involving human performance measurement


Awarding Instrument

This Area of Interest is being released in accordance with the Commercial Solutions Opening (CSO) process detailed within HQ0845-20-S-C001 (DIU CSO), posted to SAM.gov on 23 March 2020. This document can be found at: https://sam.gov/opp/c304359f88a0456bab1fa8837a3647f4/view 


Follow-on Production

Any prototype Other Transaction (OT) agreement awarded may result in follow-on production without further competitive procedures. The follow-on may be significantly larger than the prototype OT.

Anticipated follow-on activities include:

  • Deployment across additional classification levels and environments
  • Benchmark suite expansion for additional mission areas
  • Ongoing maintenance, security updates, and capability enhancements
  • Training and support for government evaluation personnel

Any prototype OT will include: "In accordance with 10 U.S.C. 4022(f), and upon a determination that the prototype project for this transaction has been successfully completed, this competitively awarded prototype OTA may result in the award of a follow-on production contract or transaction without the use of competitive procedures."

Awarding Process

DIU

Before You Submit

What we recommend you include when you submit a solution brief.

When you submit to a DIU solicitation, we'll ask you to include a solution brief. Here's some guidance about what that entails.

Potential Follow-On Production Contract for Prototype Other Transaction Agreements

Companies are advised that any Prototype Other Transaction (OT) agreement awarded in response to this solicitation may result in the direct award of a follow-on production contract or agreement without the use of further competitive procedures. Follow-on production activities will result from successful prototype completion.

The follow-on production contract or agreement will be available for use by one or more organizations within the Department of Defense. As a result, the magnitude of the follow-on production contract or agreement could be significantly larger than that of the Prototype OT agreement. All Prototype OT agreements will include the following statement relative to the potential for follow-on production: “In accordance with 10 U.S.C. § 4022(f), and upon a determination that the prototype project for this transaction has successfully been completed, this competitively awarded Prototype OT agreement may result in the award of a follow-on production contract or transaction without the use of competitive procedures.”

2023 Other Transaction Guide

Common issues with submissions

If you are having problems uploading your AOI submission to DIU, it may be due to one of several common issues; click here for solutions to common submission issues.

Have a question about this solicitation?

Need clarification? Having technical issues?
Reach out to our team.

Contact Us

Submission Form
Please fill out the following form in its entirety.

*Required

Company Information
Company Contact Information

Submitter Information

Is your company headquarters address different from your company address?


Is your company a partially or wholly owned subsidiary of another company?


Is your company currently operating in stealth mode?



Is this your company's first submission to a Defense Innovation Unit solicitation?
This applies to solution briefs submitted in response to project-specific solicitations.


Is your company registered in the System for Award Management (SAM.gov) and assigned a current Commercial and Government Entity (CAGE) code?


Solution Brief

Solution briefs must be saved as a PDF that is 10MB or smaller. Papers should be approximately 5 or fewer pages and slide decks should be approximately 15 or fewer slides.

Upload (1) One Solution Brief Document* (max size 10 MB)
I certify that this submission contains no data designated higher than "Controlled Unclassified Information" (CUI). Submissions with CUI and "FOUO" material may be accepted.


Any agreement awarded off of this solicitation will include language requiring your company to confirm compliance with Section 889 of the John S. McCain National Defense Authorization Act for Fiscal Year 2019 (Pub. L. 115-232). If you are not able to comply with the law, the Government may not be able to award the agreement.

We Work With You

If we think there’s a good match between your solution and our DoD partners, we’ll invite you to provide us with a full proposal — this is the beginning of negotiating all the terms and conditions of a proposed prototype contract.

After a successful prototype, the relationship can continue and even grow, as your company and any interested DoD entity can easily enter into follow-on contracts.

Our Process

  1. We solicit commercial solutions that address current needs of our DoD partners. (View all open solicitations and challenges.)

  2. You send us a short brief about your solution.

  3. We’ll get back to you within 30 days if we’re interested in learning more through a pitch. If we're not interested, we'll strive to let you know ASAP.