
How Federal Agencies Score Grant Applications: NIH vs NSF vs ARPA-H vs DOD Review Criteria Compared

Last updated: March 25, 2026 | Author: Nalin Bhatt, Cada

NIH scores your SBIR application on a 1-9 scale where 1 means "Exceptional." NSF uses the same 1-9 scale -- but 9 means "Exceptional." ARPA-H doesn't use a panel at all: a single Program Manager reads your 6-page summary and decides in 60 seconds. And at DOD, your technology gets rejected if it doesn't match the exact solicitation topic.

These aren't minor procedural differences. They determine whether you frame your innovation as hypothesis-driven research (NIH), high-risk R&D (NSF), a 10x health improvement (ARPA-H), or an operational solution (DOD). Get the framing wrong and you're dead on arrival.

Most founders write one application and submit it to multiple agencies. That's a reliable way to lose everywhere. Each agency has a different review culture, different criteria, different scoring direction, and different deal-breakers. This guide breaks down exactly how NIH, NSF, ARPA-H, and DOD evaluate SBIR applications -- based on Cada's experience writing across these agencies over the past two years.


Who Actually Reviews Your SBIR Application at Each Agency?

Before worrying about what reviewers score, understand who is reading your application. The review structure determines everything about how you should write.

Agency | Who Reviews | Review Format | Decision Mechanism
NIH | Panel of 15-20 scientists | 3 assigned reviewers per application | Consensus score from study section
NSF | Program Director + technical expert | PD screens first, then merit review | PD decides based on screening + review
ARPA-H | Single Program Manager | PM reads and evaluates alone | PM decides: "Encourage" or "Discourage"
DOD (Standard SBIR) | Technical evaluators | Topic-based evaluation | Aligned to solicitation topic requirements
AFWERX | Panel evaluators | Pitch competition format | Presentation-based assessment
DARPA | Program managers | BAA-specific review | PM-driven; Proposers Day attendance matters

At NIH, you're writing for a committee of scientists who will debate your application's merits. At ARPA-H, you're writing for one person who needs to understand your concept in 60 seconds. At NSF, you need to pass a 5-question screening gate before your application even reaches technical review. These structural differences should change how you write every section.


How NIH Scores SBIR Applications: 5 Criteria on a 1-9 Scale

NIH uses the most formalized review process of any SBIR agency. A study section panel -- typically 15-20 domain scientists -- assigns 3 reviewers to each application: a primary reviewer (clinician-scientist), a secondary reviewer (methods expert), and a discussant (commercialization/environment expert).

The 5 Review Criteria

Each reviewer scores all 5 criteria independently on a 1-9 scale where 1 = Exceptional and 9 = Poor. Yes, the scale runs opposite to what most people expect.

Criterion | Core Question | What Kills Applications
Significance | Does this address an important health problem? | Generic health burden ("improves patient outcomes") instead of quantified burden with CDC/WHO data
Investigator(s) | Is the PI well-suited for this work? | No preliminary data, or data from a different system that doesn't support the proposed hypothesis
Innovation | Does this challenge existing approaches? | Claiming "novel" without explaining what specifically is new and why it matters scientifically
Approach | Is the methodology well-reasoned? | Sequential aim dependencies where Aim 2 fails if Aim 1 fails, or a missing potential problems section
Environment | Does the institution support success? | Lack of collaboration evidence or missing equipment descriptions

Overall Impact -- The Score That Actually Matters

Reviewers also produce an Overall Impact score reflecting the likelihood the project will have a sustained, powerful influence on the research field. This is NOT the average of the 5 criterion scores. A fatal flaw in one criterion -- say, aims that are sequential dependencies -- can drive Overall Impact to a 7 even if the other four criteria score 2-3.

Applications scoring Overall Impact 1-3 are typically "Fundable." Scores of 4-5 mean "Needs Revision." Scores of 6-9 are "Not Competitive." The funded percentile varies by Institute -- some ICs fund the top 20%, others the top 30% -- so check your target IC's payline before assuming a score of 3 is safe.
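
To make the bands concrete, here is a minimal Python sketch that maps an Overall Impact score to the rough bands above. The thresholds are this article's generalizations, not official NIH paylines, so treat the output as a planning heuristic only.

```python
def nih_overall_impact_band(score: int) -> str:
    """Map an NIH Overall Impact score (1 = best, 9 = worst) to the rough
    bands described above. Thresholds are illustrative, not official paylines."""
    if not 1 <= score <= 9:
        raise ValueError("NIH Overall Impact scores run 1-9")
    if score <= 3:
        return "Fundable (verify against your target IC's payline)"
    if score <= 5:
        return "Needs Revision"
    return "Not Competitive"

print(nih_overall_impact_band(3))  # Fundable (verify against your target IC's payline)
print(nih_overall_impact_band(7))  # Not Competitive
```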

Triage: Half of Applications Never Get Discussed

NIH triages the bottom half of applications before the study section meeting. "Not Discussed" means your application was triaged -- it receives the assigned reviewers' written critiques but no Overall Impact score and no panel discussion. Any of these weaknesses alone can trigger triage:

  • No preliminary data for any aim
  • Central hypothesis is vague or untestable
  • Aims are sequential dependencies (Aim 2 requires Aim 1 success)
  • Phase I scope is really Phase II scope (too ambitious for 6-12 months)
  • Missing potential problems and alternative strategies

That means half the applications submitted to NIH are eliminated before a single reviewer advocates for them in the room. If your application has any of these issues, fixing them before submission is the highest-ROI use of your time.

What Good Looks Like at NIH

Significance (target score: 1-3): Name the disease, quantify the burden with incidence/prevalence/mortality data, cite CDC or WHO sources. Frame within the target Institute's strategic priorities. Include 2-3 sentences on commercial potential framed as a healthcare delivery problem -- not revenue projections.

Approach (target score: 1-3): Each aim has rationale, preliminary data, experimental design, expected outcomes, potential problems with specific alternatives, and milestones. Success criteria are specific: "This aim will be considered successful if [metric] exceeds [threshold]." Include a rigor and reproducibility paragraph.

Innovation (target score: 1-3): Use a comparison table showing specific feature/metric differences vs. current scientific approaches (not commercial competitors by name). Explain what is new and why it matters -- "novel" alone is never sufficient.


How NSF Scores SBIR Pitches: Innovation Classification Is Everything

NSF SBIR review works differently from NIH in almost every way. The scoring direction is reversed (9 = Exceptional), the criteria are different, and there's a screening gate before your pitch reaches technical review.

The Program Director Screening Gate

Before any technical review, the Program Director applies 5 screening questions. Fail any one and your pitch is declined regardless of technical merit:

Screening Question | What They're Really Asking
Has this been attempted/done before? | Is there genuine R&D novelty, or are you rebuilding something that exists?
Are there technical hurdles that NSF R&D could overcome? | Is the risk technical (fundable) or business/market risk (not fundable)?
Could this disrupt the targeted market segment? | Is the impact nationally significant or niche?
Is there evidence of product-market fit? | Do you have real customer signals, not just a TAM slide?
Is there potential for broad societal impact? | Can you name a specific population and mechanism of benefit?

This screening gate is the single most important thing to understand about NSF SBIR. Your pitch can have world-class technology and still get declined at screening if the PD classifies your work as engineering optimization rather than R&D.
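
Because the gate is all-or-nothing, it reduces to five yes/no checks where a single "no" means a decline before technical review. Here is a minimal Python sketch of that logic; the field names are illustrative, not an official NSF data structure.

```python
from dataclasses import dataclass, fields

@dataclass
class NsfScreening:
    """The five screening questions above, expressed as booleans (illustrative names)."""
    genuine_rd_novelty: bool           # Has this been attempted/done before?
    technical_hurdles_for_rd: bool     # Is the risk technical, not business/market risk?
    market_disruption_potential: bool  # Could this disrupt the targeted segment?
    product_market_fit_evidence: bool  # Real customer signals, not just a TAM slide
    broad_societal_impact: bool        # A named population and mechanism of benefit

def passes_screening(pitch: NsfScreening) -> bool:
    # Fail any one question and the pitch is declined regardless of technical merit.
    return all(getattr(pitch, f.name) for f in fields(pitch))

pitch = NsfScreening(
    genuine_rd_novelty=True,
    technical_hurdles_for_rd=True,
    market_disruption_potential=True,
    product_market_fit_evidence=False,  # only a TAM slide
    broad_societal_impact=True,
)
print(passes_screening(pitch))  # False -- declined at screening
```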

Innovation Classification: The Single Most Important Factor

Before scoring your pitch, NSF reviewers assess whether the work represents genuine R&D or incremental engineering. Based on Cada's analysis of NSF review outcomes, we classify innovation into three tiers that predict scoring outcomes:

  • Tier A -- New scientific principle or method: Typically scores 7+ (out of 9). This is what NSF wants to fund.
  • Tier B -- Novel application of known science to a new domain: Typically scores 5+. Competitive but not a slam dunk.
  • Tier C -- Engineering optimization of existing approaches: Rarely scores above 4. This is effectively a decline.

If the reviewer can't clearly distinguish whether your work is Tier A/B or Tier C, that ambiguity is itself a red flag. NSF's primary gate is whether you're doing genuine high-risk/high-reward R&D versus product development dressed as research.

NSF Review Criteria

NSF uses 3 core criteria plus technical risk assessment:

  1. Intellectual Merit -- Potential to advance scientific or engineering knowledge
  2. Broader Impacts -- How the technology benefits society (for SBIR, this is NOT about education outreach or diversity programs -- it's about whether your technology itself has national significance)
  3. Commercial Impact -- Market need, scalability, and whether NSF funding meaningfully de-risks the technology

Broader Impacts for SBIR founders: This trips up applicants who've written academic NSF grants. For SBIR, Broader Impacts means naming a specific population that benefits, a mechanism of benefit, and a plausible scale. "This technology will benefit society and create jobs" fails. "If Phase I demonstrates 95% accuracy, the technology could reduce diagnostic time by 40% for the 15M patients annually in rural health systems" passes.

Common NSF Decline Patterns

The top 3 reasons NSF declines SBIR pitches:

  1. Incremental improvement, not R&D breakthrough -- "Better, faster, cheaper" without a technical leap gets classified as Tier C
  2. Niche market, not nationally significant -- NSF funds technologies with broad societal impact, not narrow vertical solutions
  3. Objectives describe product development, not R&D -- If a standard contractor could do the proposed work, it's not NSF-fundable

How ARPA-H Evaluates Applications: The 60-Second Test and PM Decision

ARPA-H is the newest health research agency, and its review process is radically different from NIH or NSF. There are no peer review panels. A single Program Manager reads your 6-page Solution Summary and decides whether to "Encourage" or "Discourage" you from submitting a full proposal.

The 60-Second Test

The PM should understand what your technology does and why it matters in under 60 seconds of reading your concept summary. If your opening section requires domain-specific knowledge to understand, the PM will assume your thinking is unclear. This is the single most important gate at ARPA-H.

A concept summary that fails the 60-second test:

"We are developing a platform to improve cancer treatment."

A concept summary that passes:

"We are developing a [specific technology] that [mechanism] to [quantified outcome], which would [health impact] for [specific population]."

The difference: the second version tells the PM exactly what, how, and for whom -- in one sentence.

5 Weighted Evaluation Criteria

Based on Cada's analysis of ARPA-H PM evaluation patterns, we model 5 weighted criteria that reflect what PMs reward:

Criterion | Weight | What the PM Looks For
Non-Incremental Innovation | 25% | Is this genuinely 10x better, not 10%? A new mechanism, not a better implementation?
Health Impact and Scale | 25% | Health burden quantified in patients/lives/QALYs -- NOT market size. Equity addressed.
Technical Feasibility and Milestones | 20% | Measurable milestones with real Go/No-Go decisions. Honest about risks.
Team and Execution Capability | 15% | Three-pillar coverage: technical + clinical + commercialization expertise.
Writing Quality and PM Communication | 15% | Passes 60-second test. Jargon-free. Direct, outcome-focused. Quantified throughout.

Note: ARPA-H does not publish a formal scoring rubric like NIH. These weights reflect Cada's model of what PM review consistently rewards, based on our experience with ARPA-H submissions.
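
Since the model is just a weighted sum, you can use it to pressure-test where a draft is weakest. The sketch below computes a composite from per-criterion scores using the weights above; the weights are Cada's model rather than a published ARPA-H rubric, and the 0-10 per-criterion scores are purely illustrative.

```python
# Cada's model of ARPA-H PM evaluation as a weighted sum (not an official rubric).
WEIGHTS = {
    "non_incremental_innovation": 0.25,
    "health_impact_and_scale": 0.25,
    "technical_feasibility_and_milestones": 0.20,
    "team_and_execution": 0.15,
    "writing_and_pm_communication": 0.15,
}

def weighted_composite(scores: dict[str, float]) -> float:
    """Weighted sum of per-criterion scores (illustrative 0-10 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

example = {
    "non_incremental_innovation": 9,
    "health_impact_and_scale": 8,
    "technical_feasibility_and_milestones": 6,  # weakest pillar drags the composite
    "team_and_execution": 7,
    "writing_and_pm_communication": 9,
}
print(round(weighted_composite(example), 2))  # 7.85
```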

The 10x Bar

ARPA-H explicitly requires non-incremental innovation. Your mandatory metrics comparison table must show at least one metric with >= 10x improvement over existing approaches. The table requires sourced baselines and year-by-year targets.

"Better, faster, cheaper" is not ARPA-H language. "10x reduction in diagnostic time enabled by [mechanism]" is.

Language Culture: NIH Vocabulary Is a Red Flag at ARPA-H

ARPA-H rejects NIH language as a cultural signal of the wrong kind of thinking. Using the wrong vocabulary tells the PM you haven't read ARPA-H's own guidance -- and that's a credibility hit before they even evaluate your technology.

NIH Language (Avoid at ARPA-H) | ARPA-H Language (Use Instead)
"Hypothesis-driven" | "Will demonstrate"
"Specific aims" | "Milestones with Go/No-Go"
"Preliminary data suggests" | "Preliminary data demonstrates"
"Grantee" | "Performer"
"Phase 1" | "Base period"
"Program officer" | "Program manager / PM"
"Pilot study" | "Proof-of-concept"
"Market opportunity ($XB TAM)" | "Health impact (X million patients)"

Three-Pillar Team Requirement

ARPA-H expects your team to cover three pillars. Missing any one is a significant gap:

  1. Technical expertise -- the science/engineering behind the innovation
  2. Clinical expertise -- understanding of the health problem, patient needs, clinical workflow
  3. Commercialization/adoption expertise -- regulatory pathway, manufacturing, reimbursement

If you don't have all three in-house, acknowledge the gap and show active recruiting plans. Pretending a missing pillar doesn't exist is worse than naming it.


How DOD Components Score SBIR Applications: Topic Alignment Is King

DOD SBIR is structurally different from civilian agency SBIR. You don't propose your own research question -- you respond to a specific solicitation topic published by a DOD component. Topic alignment is the primary scoring factor.

Key DOD Components and Their Formats

Component | Format | Key Differentiator
Standard DOD SBIR (Navy, Army, SOCOM, DEVCOM, DLA) | Topic-based proposals | Respond to explicit topic numbers with defined requirements
AFWERX | Pitch competition | Concise presentations, not traditional proposals
DARPA | BAA-specific | Respond to Broad Agency Announcements; Proposers Day attendance strongly recommended
DIU | Commercial solutions | Requires existing product at TRL 4+; NOT for early-stage R&D

DOD vs. Civilian Agency Differences

DOD SBIR awards are contracts, not grants. This changes the accountability structure -- you have deliverables and milestones defined by the solicitation, not self-defined research aims.

IP and patent protection matter more at DOD than at civilian agencies. Companies without filed patents or IP are at a measurable disadvantage -- DOD evaluators view IP ownership as evidence that you can deliver and protect the technology for government use.

DOD review evaluates your technology against a specific operational need. The question isn't "Is this scientifically innovative?" (NIH) or "Is this 10x better?" (ARPA-H) -- it's "Does this solve the problem we defined in the solicitation topic?"

What Good Looks Like at DOD

Topic alignment: Your proposal directly addresses every requirement listed in the solicitation topic. DOD topics are specific -- "develop a lightweight sensor for X environment" -- and reviewers evaluate how precisely you respond. A brilliant technology that doesn't match the topic gets rejected regardless of quality.

Operational context: You demonstrate understanding of the operational environment where your technology will be deployed. Using military/defense terminology correctly signals that you understand the end user.

Prior defense experience: Companies with prior DOD SBIR awards, CRADA agreements, or partnerships with defense research labs have a measurable edge. If you don't have prior experience, a strong letter of intent from a defense end-user helps close the credibility gap.

Common DOD Decline Patterns

  1. Topic misalignment -- the proposal addresses a related but different problem than the solicitation topic specifies
  2. No operational context -- the technology is described in commercial terms without connecting to the defense use case
  3. Missing IP strategy -- no plan for protecting intellectual property or unclear data rights position
  4. Overly academic framing -- proposal reads like an NIH grant instead of a defense contract response

SBIR Review Criteria by Agency: NIH vs NSF vs ARPA-H vs DOD Side-by-Side

Federal SBIR review criteria vary significantly across agencies. The same technology pitched to NIH, NSF, ARPA-H, and DOD needs four different narratives because each agency evaluates through a different lens. Here's the complete comparison:

Dimension | NIH | NSF | ARPA-H | DOD
Scoring scale | 1-9 (1 = best) | 1-9 (9 = best) | 1-9 (9 = best) | Varies by component
# of criteria | 5 | 3 + innovation classification | 5 (weighted) | Topic-dependent
Who reviews | Panel of 15-20 scientists | Program Director + expert | Single Program Manager | Technical evaluators
Top criterion | Approach | Innovation Classification | Non-Incremental Innovation (25%) | Topic alignment
What kills apps | Sequential aim dependencies | Tier C innovation classification | Failing 60-second test | Misaligned to topic
Innovation bar | Hypothesis-driven R&D | High-risk/high-reward R&D | 10x improvement required | Solves defined problem
Preliminary data | Required (higher than R21, lower than R01) | Less formal; customer signals valued | Proof-of-concept, not pilot study | Varies
Phase I award | Up to $314K (per NIH SBIR PA) | $305K (per NSF 23-515) | Varies by program, typically $1M-$5M | Varies by component
Review timeline | 4-5 months to summary statement | Varies | Rolling submissions | Solicitation-dependent
Decision language | Fundable / Not Competitive | Invite / Decline | Encourage / Discourage | Select / Not Select
Language culture | Scientific, hypothesis-driven | R&D-focused, national significance | Plain language, outcome-focused | Operational, mission-focused

The Key Insight

Each agency optimizes its review process for a different question:

  • NIH: "Will this advance scientific knowledge and improve health?"
  • NSF: "Is this genuine high-risk/high-reward R&D with national significance?"
  • ARPA-H: "Can this solve a health problem in a way that cannot be achieved through conventional approaches?"
  • DOD: "Does this solve the specific operational problem we defined?"

The same therapeutic technology might score well at NIH by emphasizing the underlying biological mechanism, get classified as Tier C at NSF because it's an application of known science, receive an "Encourage" at ARPA-H because it shows 10x improvement in patient outcomes, and get passed over at DOD because there's no matching solicitation topic. Understanding which lens each agency uses is the difference between a competitive application and a wasted 80 hours.


Writing the Same Technology for Different Agencies

If you're applying to multiple agencies (which we recommend -- a portfolio approach improves your odds), here's how to adapt your narrative:

For NIH: Lead with scientific significance. Frame your technology as hypothesis-driven research. Quantify the health burden using CDC/WHO data. Structure aims as independent, testable hypotheses -- not a product development roadmap.

For NSF: Lead with your innovation classification. Demonstrate that your R&D is genuinely novel (Tier A or B), not engineering optimization (Tier C). Frame Broader Impacts around specific populations and mechanisms of benefit, not revenue.

For ARPA-H: Lead with the 10x improvement. Write your concept summary so a non-specialist understands it in 60 seconds. Frame impact in patients and lives -- never in market size. Use ARPA-H vocabulary (performer, base period, Go/No-Go).

For DOD: Lead with topic alignment. Show that your technology directly addresses the defined operational need. Emphasize IP protection and prior defense sector experience.

Before You Submit: 5-Point Checklist

  • Have you verified which scoring direction the agency uses? (NIH: 1 = best; NSF/ARPA-H: 9 = best)
  • Does your application use the agency's vocabulary? (Not NIH language at ARPA-H)
  • Have you addressed the agency's top decline pattern?
  • Is your innovation framed at the right level for the agency?
  • Does your application match the agency's review structure? (Panel vs. PM vs. topic-based)

Frequently Asked Questions About SBIR Review Criteria

Do all agencies use the same scoring scale?

No. NIH uses 1-9 where 1 = Exceptional (best). NSF and ARPA-H use 1-9 where 9 = Exceptional (best). This is one of the most common sources of confusion for founders applying to multiple agencies. If you're used to NIH scoring and see a "2" at NSF, that's near the bottom -- not near the top.
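
If you track scores across agencies, normalizing them to a common direction avoids exactly this confusion. A minimal sketch, assuming the 1-9 scales described above (DOD is omitted because its scales vary by component):

```python
def score_quality(agency: str, raw: int) -> float:
    """Normalize a 1-9 score to 0.0 (worst) through 1.0 (best),
    using the scale directions described above."""
    if not 1 <= raw <= 9:
        raise ValueError("expected a score on a 1-9 scale")
    lower_is_better = {"NIH": True, "NSF": False, "ARPA-H": False}[agency]
    return (9 - raw) / 8 if lower_is_better else (raw - 1) / 8

print(score_quality("NIH", 2))  # 0.875 -- near the top
print(score_quality("NSF", 2))  # 0.125 -- near the bottom
```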

Can I submit the same application to multiple agencies?

Technically, yes -- there's no rule against it. But an application written for NIH reviewers will score poorly at ARPA-H because it uses the wrong language, wrong framing, and wrong structure. Each agency needs a tailored narrative. Budget 20-40 hours per agency-specific adaptation, not 5.

Which agency is easiest to get funded by?

It depends on your technology and stage. NIH Phase I success rates run 20-25% (source: NIH RePORTER data). NSF invitation rates after pitch are competitive. ARPA-H is newer and still establishing patterns. The "easiest" agency is the one where your technology best matches the review criteria -- not the one with the highest success rate.

How long does review take at each agency?

NIH: 4-5 months from submission to summary statement, 9-12 months to award. NSF: varies by program. ARPA-H: rolling submissions with faster turnaround (typically 4-8 weeks to initial response). DOD: tied to solicitation timelines, typically 3-6 months.

What's the biggest mistake founders make with SBIR applications?

Writing one application and submitting it to every agency. Each agency has a different review culture, different criteria, and different deal-breakers. An NIH-style application sent to ARPA-H signals that you don't understand how ARPA-H works -- and that's an immediate credibility hit with the PM reading your submission.


Get Agency-Calibrated Review Before You Submit

Cada's grant writing services include agency-calibrated review simulations that model how your application would score at NIH, NSF, ARPA-H, or DOD. Each simulation uses the actual criteria, scoring rubrics, and reviewer personas for the target agency -- not a generic checklist.

If you're not sure which agency your technology is most competitive for, that's the first question to answer before investing 40-80 hours in an application. We do a free 15-minute assessment call that gives you a straight answer on agency fit. No pitch, no obligation.

Book a free agency-fit assessment