Responsible AI in Special-Education Software
AI-assisted features where peer-reviewed research supports them; rule-based algorithms where evidence demands human judgment. The three architectural guardrails — PII-stripping, human-in-the-loop, constrained prompts — that keep AI claims honest in a population protected by COPPA, FERPA, and IDEA.
By Davit Janunts, M.Ed. Special Education (Lehigh University — Fulbright Foreign Student Program); co-author, Morin, Janunts, et al. (2024), Exceptional Children, 90(2), 126-147, doi:10.1177/00144029231165506.
Summary
The AI conversation in special-education software is currently bifurcated. On one side, every edtech vendor now claims “AI-powered” capabilities regardless of whether machine-learning components do the work. On the other, a policy and procurement counter-current — FTC AI-washing enforcement, the Center for Democracy & Technology’s 2025 report on AI in IEP drafting, the Department of Education’s October 2024 developer guidance — is pushing districts to audit vendor claims at architectural depth. The brief below is written for that audit.
Special education is the single highest-stakes edtech domain for AI: 7.5 million students under IDEA protection, disability-category data that is legally sensitive by statute (FERPA, HIPAA where applicable), a documented history of algorithmic bias against Black and Latino students in discipline referral (Skiba et al., 2011), and a procurement pipeline in which 57 percent of SPED teachers already used AI for IEP drafting in 2024-25 (CDT, 2025). The responsible answer is neither blanket AI claims nor blanket AI avoidance. It is structural: PII-stripping before every LLM call, human-in-the-loop on every consequential decision, constrained prompts with output validation, and — critically — the discipline to say which product surfaces use AI and which use rule-based algorithms because the research does not support LLMs there.
The cost shape — why naive AI claims fail in SPED
The 2019 evaluation of mental-health app store quality claims (Larsen et al., npj Digital Medicine) found that 72 percent of top-rated apps made clinical claims that their evidence base did not support. The pattern has not improved in the 2024-2026 AI cycle. Across edtech, the gap between marketed capability and implemented capability is now being priced by three parties:
- FTC enforcement. 2024-2025 consent orders against vendors for “AI-washing” — representing AI capability that did not exist in the product. Civil penalties escalate with protected-class exposure; children’s services and disability services both qualify.
- District procurement. State Student Data Privacy Consortium DPAs (NDPA v2.2, November 2025) now contain explicit AI-specific attestation clauses. Vendors who cannot document architectural guardrails are being declined at the DPA stage, before a contract is ever offered.
- Civil-rights exposure. USDOE OCR’s October 2024 guidance on avoiding discriminatory AI use flagged SPED eligibility determinations, discipline referral, and IEP goal drafting as specific risk surfaces. Skiba et al. (2011) documented 2.19x Black/White disproportionality in office-discipline referrals; AI that inherits this training distribution inherits the bias.
The financial outcome is straightforward. Vendors who over-claim are being priced out of procurement; vendors who silently deploy AI without guardrails are being exposed in incident reports. The middle lane — honest specificity about where AI is used, backed by structural guardrails that make the claims auditable — is the only lane procurement is currently funding.
Three architectural guardrails
1. PII-stripping gate before every LLM call
No student-identifying information leaves the device or enters a prompt. A deterministic sanitizer (stripPII()) runs before any text reaches an LLM API: names become [STUDENT], schools become [SCHOOL], dates of birth become [DATE], and diagnosis codes are redacted. After generation, the placeholder tokens are re-mapped to the originals device-locally. The model provider never sees identifiable content.
This pattern satisfies the FTC’s January 2025 COPPA amendments (effective April 22, 2026), which require a written information-security program and prohibit transmission of child personal data to third-party inference providers without explicit verifiable parental consent. It also satisfies FERPA 34 CFR Part 99, since the LLM provider never becomes a “school official” with a legitimate educational interest in identifiable records.
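The sanitize-then-restore round trip can be sketched as below. This is an illustrative Python sketch of the pattern, not the product's actual stripPII() implementation; the function names, placeholder formats, and date regex are assumptions.

```python
import re


def strip_pii(text, known_names, known_schools):
    """Deterministically redact identifiers before any text leaves the device.

    Returns (sanitized_text, mapping) so placeholders can be re-mapped to the
    originals device-locally after generation. A production sanitizer would
    also cover student IDs, addresses, and diagnosis codes.
    """
    mapping = {}

    def redact(value, placeholder):
        nonlocal text
        if value in text:
            mapping[placeholder] = value
            text = text.replace(value, placeholder)

    for i, name in enumerate(known_names):
        redact(name, f"[STUDENT_{i}]")
    for i, school in enumerate(known_schools):
        redact(school, f"[SCHOOL_{i}]")
    # Dates in MM/DD/YYYY form become [DATE]; this sketch does not keep a
    # mapping for dates, so they remain redacted after restore.
    text = re.sub(r"\b\d{1,2}/\d{1,2}/\d{4}\b", "[DATE]", text)
    return text, mapping


def restore_pii(text, mapping):
    """Re-map placeholders to originals after the model response returns."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text
```

The key property is that the mapping never leaves the device: only the sanitized string is sent to the inference provider, and restoration happens locally on the response.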
2. Human-in-the-loop on every consequential decision
AI outputs that affect a student’s legal rights, educational placement, or clinical services are drafts, not decisions. Specifically:
- IEP drafts require professional review and signature before inclusion in the legal document; the UI explicitly labels AI output as draft, never final, and requires licensed professional approval.
- Jargon translation (parent-facing) preserves a “View Original” toggle on every translation so the legal meaning of IEP language is never silently altered.
- Scenario evaluation for teacher PD pairs AI feedback with CEC-aligned human rubrics. The AI never grades alone.
- Decodable-text generation is validated against a deterministic ≥95 percent decodability threshold (Mesmer 2001; Cheatham & Allor 2012) before a passage reaches a student.
USDOE-OCR’s October 2024 civil-rights guidance explicitly names eligibility determinations and discipline referral as surfaces that must remain human-decisioned. Our architecture enforces this at the product layer, not the policy layer.
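Enforcing the gate "at the product layer, not the policy layer" means the code path itself refuses to move an unapproved draft forward. A minimal sketch of such a review gate follows; the class, state, and method names are hypothetical, not the product's actual API.

```python
from enum import Enum, auto


class DraftState(Enum):
    AI_DRAFT = auto()      # generated; UI labels it "draft, never final"
    UNDER_REVIEW = auto()  # a licensed professional is actively reviewing
    APPROVED = auto()      # signed off; only now eligible for the legal IEP


class ReviewGateError(Exception):
    """Raised when code tries to bypass the professional review step."""


class IepSectionDraft:
    """An AI-generated IEP section that cannot reach the legal document
    without an explicit, code-enforced professional approval step."""

    def __init__(self, text):
        self.text = text
        self.state = DraftState.AI_DRAFT
        self.approved_by = None

    def begin_review(self):
        self.state = DraftState.UNDER_REVIEW

    def approve(self, reviewer_license_id):
        if self.state is not DraftState.UNDER_REVIEW:
            raise ReviewGateError("approval requires an active review")
        self.approved_by = reviewer_license_id
        self.state = DraftState.APPROVED

    def export_to_iep(self):
        if self.state is not DraftState.APPROVED:
            raise ReviewGateError("unapproved AI draft cannot enter the IEP")
        return self.text
```

The design choice is that skipping review is an exception, not a warning: there is no code path from AI_DRAFT to a signed document.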
3. Constrained prompts and output validation
Free-form prompting is not used in student-facing surfaces. Every LLM call that affects student content is bounded by (a) a constrained system prompt with allow-listed vocabulary, (b) a post-generation validator that rejects output failing a domain-specific evidence-based threshold, and (c) a forbidden-token regex that blocks diagnostic terminology (“dyslexia,” “disorder,” “impairment”) from surfacing in student-facing strings per the Shifrer (2013) labeling-effect finding.
Decodability thresholds, Flesch-Kincaid reading-level targets, and MLU (mean length of utterance) bounds are checked server-side before output ever reaches the student. A passage the LLM returned at 87 percent decodability is rejected and regenerated until it meets ≥95 percent. This converts the LLM from a risky text generator into a constrained tool operating inside known research bounds.
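The validate-and-regenerate loop can be sketched as follows. The allow-list decodability scorer is a simplifying assumption for illustration; a production validator would score grapheme-phoneme correspondences against the taught phonics scope, and the forbidden-token pattern shown is only a partial stand-in for the full blocklist.

```python
import re

# Blocks diagnostic terminology from student-facing strings (partial list).
FORBIDDEN = re.compile(r"\b(dyslexi\w*|disorder\w*|impair\w*)\b", re.IGNORECASE)


def decodability(passage, decodable_words):
    """Share of words decodable under the current taught phonics scope.
    An allow-list stands in here for a real grapheme-phoneme scorer."""
    words = re.findall(r"[a-z']+", passage.lower())
    if not words:
        return 0.0
    return sum(w in decodable_words for w in words) / len(words)


def validate_passage(passage, decodable_words, threshold=0.95):
    """Reject output that surfaces a diagnostic label or misses the floor."""
    if FORBIDDEN.search(passage):
        return False
    return decodability(passage, decodable_words) >= threshold


def generate_validated(generate, decodable_words, max_attempts=3):
    """Regenerate until the validator passes; fail closed if it never does."""
    for _ in range(max_attempts):
        passage = generate()
        if validate_passage(passage, decodable_words):
            return passage
    raise ValueError("no passage met the evidence-based threshold")
```

Note the fail-closed behavior: if no attempt passes, nothing reaches the student, which is the property that converts the LLM into a bounded tool.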
Where AI is and isn’t used
The discipline that separates honest AI architecture from AI-washing is the willingness to say which product surfaces do not use AI because the research does not support it. Across the IncluShift ecosystem:
| Product | AI used? | Why |
|---|---|---|
| IncluVoice | Yes — AAC prediction | n-gram + transformer reranking reduces motor actions (Vertanen et al. 2022) |
| IncluBridge | Yes — edge-case jargon translation | Curated dictionary handles 90%+ of terms; LLM handles ambiguous context only |
| IncluLiteracy | Yes — decodable generation | Output gated by ≥95% decodability validator (Mesmer 2001) |
| IncluTrain | Yes — scenario response evaluation | Paired with CEC-aligned human rubric |
| IncluShift OS | Yes — IEP Drafter (draft only) | Professional review required; mandatory disclaimer; PII-stripped before prompt |
| IncluMath | No — statistical only | Performance Factor Analysis (Pavlik et al. 2009); the concrete-representational-abstract (CRA) research base does not support an LLM |
| IncluRegulate | No — rule-based | Distress-router is deterministic; mental-health app liability (Armontrout 2016) disqualifies LLM here |
| IncluManage | No — statistical only | ABC logging + celeration-line trend analysis; FBA hypothesis-generation pairs with function-based PBS literature (Carr 2002) |
| IncluSteps | No | ASQ-3 aligned milestone database (Squires & Bricker 2009); deterministic |
| IncluPathway | No | Holland RIASEC + Wehmeyer Self-Determination instruments; student-driven selection |
Five of ten products use AI; five do not. The five that do use it where the peer-reviewed research supports it: communication prediction at the AAC surface (Vertanen et al. 2022), AI tutoring gains demonstrated in a Harvard RCT (Kestin et al. 2025, Scientific Reports, N=194, d≈0.67), and meta-analytic evidence for intelligent tutoring systems (Kutyniok et al. 2025, npj Science of Learning, 28 studies). The five that don't avoid it where the research doesn't: math adaptive practice, where Performance Factor Analysis (Pavlik et al. 2009) is the established optimal-difficulty engine, and regulation, where mental-health app liability (Armontrout et al. 2016) disqualifies LLMs on both safety and legal grounds.
The COPPA 2025 × FERPA × IDEA intersection
Three federal regimes converge on AI-in-SPED, and together they leave less freedom of architectural choice than many vendors are acting on.
COPPA amendments (January 2025, effective April 22, 2026). The expanded definition of personal information now explicitly covers biometric identifiers, precise geolocation, and persistent identifiers tied to a device. The mandatory written data retention policy and written information-security program effectively require vendors to document their LLM-call boundaries at schema depth. A generic “we use AI responsibly” attestation no longer satisfies the rule.
FERPA (34 CFR Part 99). Transmission of identifiable education records to a third-party LLM provider is a disclosure under §99.30 requiring parental consent — unless the provider is a “school official” under §99.31(a)(1) with a legitimate educational interest, under direct control, and contractually restricted from further use. Most commercial LLM terms do not satisfy the direct-control prong. PII-stripping at the client boundary avoids the §99.30 disclosure entirely.
IDEA (34 CFR Part 300). IEP decisions are legal determinations under §300.324 requiring a team process with parental participation. AI-drafted IEP content that proceeds to a signed IEP without substantive professional review is a procedural-compliance failure, and a procedural failure that denies FAPE is independently actionable under §300.513(a)(2). Human-in-the-loop is not a best practice here; it is the legal floor.
Equity guard
SPED populations sit at the intersection of two documented algorithmic bias surfaces. First, Skiba et al. (2011) established 2.19x Black-to-White risk ratio in office-discipline referrals, with 45 percent of that gap driven specifically by subjective-referral categories (defiance, disrespect). Training an LLM on institutional discipline data inherits this bias by construction. Second, disability-category labeling carries its own labeling-effect cost: Shifrer (2013, American Sociological Review) documented that labels reduce teacher expectations and student achievement independent of actual student ability.
Architectural guardrails against these failures: (i) AI is not used in discipline-referral or eligibility-determination surfaces at all; (ii) no student-facing surface permits the user to self-identify a disability category — functional profiles are set by the educator and the student sees functional labels (“focus support,” “extended time”) only; (iii) every AI output is run through a forbidden-token regex that blocks diagnostic terminology before it reaches any student-facing string. The equity guard is structural. It is not a policy note added to a marketing page.
Operational fix — five yes/no questions a SPED director can ask any vendor
- Does a PII-stripping pipeline run before every LLM call, and can the vendor produce the code path? “Yes, we care about privacy” is not an answer. The deterministic sanitizer should be nameable, auditable, and version-controlled.
- Which specific product surfaces use AI, and which do not? A vendor who answers “everything” is claiming capability that probably doesn’t exist. A vendor who answers with a specific list (this product yes, this product no) is disciplined enough to be procurement-worthy.
- Is there a forbidden-token regex that blocks diagnostic terminology from student-facing output? Required by the Shifrer (2013) labeling-effect finding and by IDEA strength-first language norms. Vendors who haven’t implemented this are passively exposing their users to labeling harm.
- Do AI-drafted IEP sections require an explicit professional review gate in the product UI, not just in the privacy policy? §300.324 requires a team process; a vendor that lets an IEP section flow from LLM to signed document without a UI-enforced review step has built a procedural-compliance failure into the product.
- What research citation supports each AI-used surface, and what research citation supports the rule-based surfaces not using AI? The honest answer cites both directions: where the evidence backs the LLM, and where the evidence backs the statistical or deterministic alternative. A vendor who can only answer in one direction is optimizing for a category narrative, not for their users.
These five questions track the structural-compliance layer that procurement security audits (SDPC NDPA v2.2, November 2025) and USDOE-OCR civil-rights guidance (October 2024) already imply. A vendor unable to answer all five in architectural detail is not yet safe to procure for a population protected by COPPA, FERPA, and IDEA.
This brief is a mechanism analysis of AI architecture requirements under current federal regulation and peer-reviewed research. It is not legal advice, compliance certification, or a substitute for counsel. Districts selecting AI-inclusive SPED technology should verify specific product architectures against the five questions above with their own counsel and IT security team. IncluShift products are adaptive practice and administrative tools, not medical devices, therapeutic interventions, or substitutes for professional educational assessment.
References
- Children's Online Privacy Protection Rule, 16 CFR Part 312 (FTC final amendments, January 2025; effective April 22, 2026).
- Family Educational Rights and Privacy Act, 20 U.S.C. §1232g; 34 CFR Part 99.
- Individuals with Disabilities Education Act, 20 U.S.C. §1400 et seq. (2004), implementing regulations 34 CFR Part 300.
- U.S. Department of Education (October 2024). Designing for Education with Artificial Intelligence: An Essential Guide for Developers. Office of Educational Technology.
- U.S. Department of Education, Office for Civil Rights (October 2024). Avoiding the Discriminatory Use of Artificial Intelligence.
- Larsen, M.E., Huckvale, K., Nicholas, J., Torous, J., Birrell, L., Li, E., & Reda, B. (2019). Using science to sell apps: Evaluation of mental health app store quality claims. npj Digital Medicine, 2, 18.
- Akgun, S., & Greenhow, C. (2022). Artificial intelligence in education: Addressing ethical challenges in K-12 settings. AI and Ethics, 2(4), 431-440.
- Marino, M.T., Vasquez, E., Dieker, L., Basham, J., & Blackorby, J. (2023). The future of artificial intelligence in special education. Journal of Special Education Technology, 38(3), 404-416.
- Kestin, G., Miller, K., Klales, A., Milbourne, T., & Ponti, G. (2025). AI tutoring outperforms active learning. Scientific Reports, 15(1). Harvard RCT, N=194.
- Kutyniok, G., Alaniz, S., Dutta, S., Mahmud, T., Stubbemann, M., Yadav, P., & Yoo, S. (2025). Intelligent tutoring systems in K-12 education: A systematic review. npj Science of Learning, 10. 28 studies, 4,597 students.
- Unal, A., Kaya, F., Uzun, G.C., & Erdem, C. (2026). The impact of artificial intelligence on student learning outcomes: A second-order meta-analysis. Journal of Educational Computing Research. N=58,702 students, ES=0.67.
- Zeide, E. (2019). The structural consequences of big data-driven education. Big Data & Society, 6(2), 1-15.
- Southgate, E., Blackmore, K., Pieschl, S., Grimes, S., McGuire, J., & Smithers, K. (2019). Artificial intelligence and emerging technologies in schools. Australian Department of Education.
- Armontrout, J., Torous, J., Cohen, M., McDougall, E.R., & Appelbaum, P. (2016). Current regulation of mobile mental health applications. Current Psychiatry Reports, 18(10), 91.
- Center for Democracy & Technology (2025). From Personalized to Programmed: Artificial Intelligence in Special Education. Survey: 57% of SPED teachers used AI for IEP/504 development in 2024-25.
- Shifrer, D. (2013). Stigma of a label: Educational expectations for high school students labeled with learning disabilities. American Sociological Review, 78(3), 407-428.
- Skiba, R.J., Horner, R.H., Chung, C.G., Rausch, M.K., May, S.L., & Tobin, T. (2011). Race is not neutral: A national investigation of African American and Latino disproportionality in school discipline. School Psychology Review, 40(1), 85-107.
Brief #7 in the IncluShift Research Briefs series. See the full brief catalog at /hub/briefs.