We offer multiple pathways for your Technology and Talent solutions to impact national security by matching threats with real-world capabilities.
Pathways through Commercial Solutions Openings (CSO)
If your company has a proven track record of commercial viability with commercial off-the-shelf products and tech, you’re in a great position to work with us. We actively work with companies both in the U.S. and internationally, across allied countries.
You can submit your technical solutions to posted solicitations under our Commercial Solutions Opening (CSO) process and Other Transaction (OT) authority, a fast, flexible process that lets us competitively solicit proposals for DoD projects, often awarding within 60-90 days.
Open Solicitations —
MYSTIC DEPOT: Vendor-Agnostic AI Evaluation Infrastructure
Responses Due By
2026-03-24 23:59:59 US/Eastern Time
Problem Statement
As artificial intelligence (AI) capabilities evolve at an extraordinary pace, the government requires evaluation infrastructure that can keep pace by continuously assessing new models against mission-specific benchmarks as they are released.
Further, the success of AI systems in national security contexts will depend on human-machine teaming. Evaluation must assess not only whether AI systems can perform tasks in isolation, but whether human-AI teams achieve better mission outcomes than either humans or AI alone.
Evaluation must also keep pace as AI systems evolve from passive models to active agents that use tools, access systems, and execute multi-step tasks. Beyond model outputs, assessment must account for agent behaviors, including whether agents complete complex missions correctly and safely, use tools appropriately, and maintain auditability.
Desired Solution Attributes
The Department of War (DoW), in partnership with the Office of the Director of National Intelligence (ODNI), seeks an evaluation harness and government-specific benchmarks that together enable rigorous, reproducible, vendor-agnostic assessment of any AI system against government-defined criteria. The Government intends to use this harness across multiple programs; solutions should be designed for broad applicability rather than single-program optimization. This Area of Interest (AOI) comprises two Lines of Effort (LOEs); vendors may respond to one or both. Vendors must specify whether they are addressing LOE 1, LOE 2, or both on the title slide or title page of their submission, and submission file titles should likewise carry a matching “LOE1_”, “LOE2_”, or “LOE1&2_” prefix.
The Government is interested in considering solutions from a wide selection of vendors. All submissions should clearly explain which of the desired solution attributes they do and do not address, with proven examples of prior deployment, if applicable. The Government will consider partial solutions. Vendors are welcome to apply individually or in partnership. The Government may also request teaming arrangements amongst solution providers. Vendors are expected to demonstrate their solution in an unclassified environment as part of the Commercial Solutions Opening.
LOE 1: Evaluation Harness
This AOI seeks an evaluation harness or harnesses to serve as the integrated infrastructure of an execution environment, tooling, and methodology that connects models to benchmarks and produces structured evaluation data. Harness architecture should enable standardized, reproducible assessment of AI systems against defined criteria by providing the following:
- Model Interface: Connects the harness to AI systems under evaluation. Provides a standardized, pluggable architecture for interfacing with diverse model types.
- Execution Engine: Orchestrates complex evaluation workflows across heterogeneous model and environment configurations.
- Measurement and Scoring System: Scores model outputs against benchmarks.
- Human Evaluation Integration: Supports human-in-the-loop workflows with interfaces for subject-matter-expert review to measure and compare human workload, usability, and mission performance across human-only, AI-only, and human-AI team scenarios.
- Output and Reporting Layer: Exports all data in open, non-proprietary formats, generates aggregate reports, and provides API access to evaluation data.
- Continuous Monitoring and Analytics: Automates model ingestion and evaluation and tracks performance trends over time.
- Configuration and Benchmark Management: Defines, versions, validates, and manages benchmarks and evaluation configurations to ensure consistency across environments.
- Degraded Conditions Simulator: Simulates operational stress and network degradation in a controlled, reproducible environment. Enables assessment of model and system resilience under variable conditions to validate performance in mission-critical denied, degraded, intermittent, or limited (DDIL) environments.
- Agentic Evaluation: Evaluates agent actions, tool invocations, and multi-step task execution. Provides for safe testing and maintains comprehensive audit trails of agent decisions and actions.
- Adversarial AI: Supports automated red-teaming, including the execution of adversarial prompts and attack patterns. Scores robustness across attack categories and exports results in open formats.
- Multimodal Inputs: Support for video, audio, and cross-modal datasets and comparison frameworks.
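To make the component list above concrete, the following is a minimal, hypothetical sketch (not a solicitation requirement) of how a pluggable model interface, execution engine, and scorer might fit together; all names (`ModelAdapter`, `EchoAdapter`, `run_benchmark`) are illustrative assumptions:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Structured evaluation record suitable for export in open formats."""
    benchmark_id: str
    model_id: str
    score: float


class ModelAdapter(ABC):
    """Model Interface: each AI system under test plugs in via this API."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...


class EchoAdapter(ModelAdapter):
    """Stand-in 'model' used here only to exercise the harness end to end."""

    def __init__(self, model_id: str):
        self.model_id = model_id

    def generate(self, prompt: str) -> str:
        return prompt.upper()


def exact_match_scorer(output: str, reference: str) -> float:
    """Measurement and Scoring: compare model output against a reference."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def run_benchmark(adapter: ModelAdapter, model_id: str,
                  benchmark_id: str, cases: list) -> EvalResult:
    """Execution Engine in miniature: run every case, aggregate a score."""
    scores = [exact_match_scorer(adapter.generate(prompt), reference)
              for prompt, reference in cases]
    return EvalResult(benchmark_id, model_id, sum(scores) / len(scores))
```

Because the model interface is an abstract base class, swapping in a different system under test only requires a new adapter; the engine, scorers, and reporting layer are untouched, which is the de-coupling the attributes below call for.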
Submissions in response to this LOE should also have the following attributes:
- Modular architecture enabling independent upgrade and addition of components
- Containerized deployment supporting standard government orchestration platforms
- Deployable as directed across government environments (unclassified, classified cloud, and air-gapped) without fundamental architectural changes
- Harness infrastructure and evaluation content (benchmarks, scoring, attacks) that interoperate but can be de-coupled
- Appropriate access controls and ability to protect sensitive data according to classification requirements
LOE 2: Benchmark Development and Methodology
This AOI seeks solutions from vendors that create benchmarks across unclassified, secret, and top secret workflows, and that provide their methodology for government review and adoption. These benchmarks would be executed using the evaluation harness in LOE 1.
The benchmarking methodology should address:
- Requirements elicitation: Identifying what capabilities matter for a given mission context.
- Task decomposition: Breaking complex capabilities into measurable evaluation tasks.
- Input design: Constructing scenarios that are representative (reflect realistic operational conditions) and operationally realistic (mirror actual workflows in human-AI teaming tasks).
- Scoring criteria development: Defining what "good" looks like, including rubric construction that prioritizes interpretability (results that are easily understood and can be acted upon by decision makers).
- Baseline establishment: Evaluating open-source/open-weight models to set performance baselines while ensuring fairness (no systemic advantage to particular architectures or vendors).
- Validation: Verifying that benchmarks meet standards for validity (measuring the intended capability), reliability (consistent results across runs), and discriminability (clear differentiation between performance levels).
- Gaming resistance: Designing benchmarks resistant to optimization without genuine capability improvement.
- Maintenance: Updating benchmarks as requirements or model capabilities evolve so they retain long-term utility.
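As a hypothetical illustration of the versioning and interpretable-rubric goals above (a sketch only; the names and thresholds are assumptions, not solicitation requirements), a versioned benchmark definition with a human-readable rubric might look like:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricLevel:
    """One interpretable performance band, e.g. 'proficient'."""
    label: str
    min_score: float  # inclusive lower bound on a [0, 1] score


@dataclass(frozen=True)
class BenchmarkSpec:
    """Versioned benchmark definition; immutable so each version is auditable."""
    benchmark_id: str
    version: str                 # bump on any task or rubric change
    tasks: tuple                 # measurable evaluation tasks
    rubric: tuple                # RubricLevel entries, descending by min_score

    def grade(self, score: float) -> str:
        """Map a numeric score to the highest rubric band it satisfies."""
        for level in self.rubric:
            if score >= level.min_score:
                return level.label
        return "fail"
```

A decision maker then reads "proficient" or "fail" rather than a raw number, which is the interpretability property the scoring-criteria item calls for, while the frozen, versioned spec supports the maintenance and validation items.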
Training Materials
Materials should enable government personnel to develop and maintain benchmarks without ongoing vendor support, including but not limited to: written methodology guide, worked examples, common pitfalls, quality assurance checklist, and training curriculum.
Vendor Qualifications
This AOI seeks solutions from vendors that are eligible to receive an Other Transaction award in accordance with 10 U.S.C. 4022 and have demonstrated expertise in AI evaluation, security testing, and benchmark creation. Submissions should provide specific, verifiable evidence of the following preferred qualifications as appropriate to the supported LOE:
- Published research on, and/or demonstrable application of, evaluation methodology, benchmark design, measurement robustness, or gaming resistance
- Active contribution to or development of industry-standard frameworks (e.g., HELM, AgentDojo, Inspect), red-teaming tools (e.g., Garak, PyRIT) or established benchmarks (e.g., MMLU, HumanEval)
- Prior collaboration with frontier AI labs on evaluation testing or benchmark creation
- Prior work with the National Institute of Standards and Technology Center for AI Standards and Innovation, United Kingdom AI Safety Institute, or equivalent government evaluation programs
- Personnel holding active security clearances (Secret minimum; TS/SCI preferred) or documented clearability
- Experience deploying evaluation infrastructure in secure government or enterprise environments, specifically within DoW or Intelligence Community (IC) environments (IL5/IL6, JWICS)
- Familiarity with IC analytic tradecraft, DoW doctrine, or domain expertise in national security and military operations
- Experience designing evaluation protocols involving human performance measurement
Awarding Instrument
This Area of Interest is being released in accordance with the Commercial Solutions Opening (CSO) process detailed within HQ0845-20-S-C001 (DIU CSO), posted to SAM.gov on 23 March 2020. This document can be found at: https://sam.gov/opp/c304359f88a0456bab1fa8837a3647f4/view
Follow-on Production
Any prototype Other Transaction (OT) agreement awarded may result in follow-on production without further competitive procedures. The follow-on may be significantly larger than the prototype OT.
Anticipated follow-on activities include:
- Deployment across additional classification levels and environments
- Benchmark suite expansion for additional mission areas
- Ongoing maintenance, security updates, and capability enhancements
- Training and support for government evaluation personnel
Any prototype OT will include: “In accordance with 10 U.S.C. 4022(f), and upon a determination that the prototype project for this transaction has been successfully completed, this competitively awarded prototype OTA may result in the award of a follow-on production contract or transaction without the use of competitive procedures.”
FAQs
Q1: For offerors who also develop AI models or AI runtimes, does DIU anticipate any Organizational Conflict of Interest (OCI) limitations that would restrict participation in future model development, deployment, or related competitions within the programs utilizing the MYSTIC DEPOT evaluation environment? Would DIU consider mitigation approaches sufficient to address potential OCI concerns?
A1: DIU will evaluate all Solution Briefs submitted in accordance with the stated CSO Phase 1 and AOI criteria. Together, DIU and the vendor(s) whose solutions advance to later phases of the evaluation process will address OCIs as necessary.
Q2. Does this work require native development inside IL5/IL6 environments? Do offerors need to have seats inside a SCIF in order to submit to this solicitation?
A2: Existing clearances are not required to respond to this AOI, and DIU will consider solutions from vendors who do not currently possess them. However, project performance will require the ability to obtain clearances, so DIU will consider each solution’s characterization of the status and level of personnel clearances, along with its experience deploying capabilities in secure government or enterprise environments.
Q3. Under this CSO, is a company allowed to submit one proposal as the prime contractor while also participating as a subcontractor on a separate proposal led by another organization?
A3. Yes. Per Section 3.3, Phase 1 Solution Brief of the DIU CSO, “All solution briefs correctly submitted in response to an AOI will be evaluated against the stated criteria.”
Pathways through Challenges or Commercial Acceleration Opportunities
We regularly seek proposals from both U.S.-based and internationally based ventures like yours. Apply through DIU’s Challenges or Commercial Acceleration Opportunities to showcase your potential and get tailored support.
Open Challenges and Commercial Acceleration Opportunities —
Sorry, there are currently no open challenges.
If you would like to be notified when new challenges are posted please fill out our interest form here.