AI Ethics and Safety
Why This Matters for Engineers
AI ethics is not just a policy concern — engineers make decisions that embed values into systems. Choosing training data, designing reward functions, setting content filters, deciding who gets access — these are ethical choices, not purely technical ones.
Core Concerns
AI Ethics Landscape
┌─────────────────┐ ┌─────────────────┐ ┌──────────────────┐
│ FAIRNESS │ │ SAFETY │ │ PRIVACY │
│ │ │ │ │ │
│ Bias in outputs │ │ Harmful content │ │ Training data │
│ Unequal perf │ │ Misuse / abuse │ │ PII leakage │
│ Discrimination │ │ Autonomy risks │ │ Memorization │
└─────────────────┘ └─────────────────┘ └──────────────────┘
┌─────────────────┐ ┌─────────────────┐ ┌──────────────────┐
│ TRANSPARENCY │ │ ACCOUNTABILITY │ │ AUTONOMY │
│ │ │ │ │ │
│ Explainability │ │ Who is liable? │ │ AI replacing jobs│
│ Audit trails │ │ Redress for │ │ Manipulation │
│ Model cards │ │ AI-caused harm │ │ Consent │
└─────────────────┘ └─────────────────┘ └──────────────────┘Bias in AI Systems
Types of Bias
Data Bias
Training data over/under-represents certain groups
e.g. facial recognition trained mostly on lighter-skinned faces
→ higher error rate on darker skin tones
Label Bias
Human annotators carry cultural assumptions into labeled data
e.g. sentiment labels reflect annotator's cultural norms
Historical Bias
Training data encodes past discrimination
e.g. resume screening trained on historical hires
→ perpetuates past hiring biases
Representation Bias
Certain languages, dialects, or regions appear rarely in training data
→ LLMs perform worse in low-resource languages
Aggregation Bias
Model trained on aggregate data performs poorly on subgroups
e.g. medical model trained on average population
→ inaccurate predictions for specific demographic groupsMitigating Bias
Data level:
→ Audit and balance training data
→ Use inclusive data collection
→ Remove personally identifiable proxies
Model level:
→ Fairness constraints in training objective
→ Differential performance testing across groups
→ Red-teaming for biased outputs
Deployment level:
→ Human review for high-stakes decisions
→ Regular bias audits on production outputs
→ Disaggregated evaluation metrics (not just aggregate accuracy)AI Safety Techniques
RLHF and Alignment
RLHF (Reinforcement Learning from Human Feedback) is the primary method current models use to align behavior with human values:
1. Collect preference data: human ranks model outputs A vs B
2. Train reward model to predict human preference
3. Fine-tune LLM via PPO to maximize reward model score
4. Result: model that is more helpful, less harmful, less deceptiveLimitation: reward model learns human preferences of annotators — which may not represent all users, cultures, or stakeholders.
Constitutional AI (Anthropic)
Instead of pure human feedback:
1. Define a "constitution" — a set of ethical principles
2. Model critiques its own outputs against the constitution
3. Model revises outputs to conform to principles
4. Fewer human labels needed; principles are explicit and auditableRLAIF (RL from AI Feedback)
AI model provides preference labels instead of humans. Cheaper to scale, but risk: AI preferences may diverge from human values if the judge model is itself flawed.
Harmful Content Categories
Content Safety Taxonomy:
Direct harm:
├── CSAM (child sexual abuse material)
├── Instructions for weapons (biological, chemical, nuclear)
├── Detailed suicide/self-harm methods
└── Violence against specific real people
Indirect / contextual harm:
├── Hate speech and discrimination
├── Misinformation and disinformation
├── Manipulation and social engineering scripts
├── Surveillance and stalking tools
└── Copyright and intellectual property violations
Misuse risks:
├── Phishing/spam generation at scale
├── Academic dishonesty
├── Deepfake generation
└── Automated influence operationsContent Safety Implementation
Multi-layer approach:
Input filters
→ Prompt injection detection
→ Classifier to detect harmful intent before LLM call
→ Rate limiting to prevent automated abuse
Model-level
→ Safety fine-tuning (don't answer certain queries)
→ System prompt constraints
→ Refusal training
Output filters
→ Toxicity classifier on generated text
→ PII detection and redaction
→ Domain-specific validators (medical, legal, financial disclaimers)
Human review
→ Sample auditing of flagged outputs
→ Feedback loop back to training dataPrivacy Concerns in LLMs
Training Data Privacy
Problem: LLMs memorize training data
→ Can reproduce verbatim text from training data on targeted prompts
→ Includes: email addresses, phone numbers, code, personal writing
Real incident: GPT-2 could be prompted to reproduce specific text
from training data including PII
Mitigations:
→ Data deduplication reduces memorization
→ Differential privacy in training (adds noise to gradients)
→ Post-training PII filtering
→ System prompts: "Do not reproduce personal information verbatim"Inference-Time Privacy
Risks:
→ Users send PII to model APIs (names, addresses, medical info)
→ Enterprise data sent to cloud APIs may be used for training
→ Conversation history may be retained
Mitigations:
→ Use self-hosted / on-premise models for sensitive data
→ PII detection + redaction before API calls
→ Review provider's data retention and training policies
→ Encrypt sensitive fields; only send anonymized versionsTransparency and Explainability
Model Cards
A model card is documentation that ships with an ML model:
Model Card Contents (ML Model documentation standard):
├── Intended use cases
├── Out-of-scope uses
├── Training data description
├── Evaluation results (overall + per demographic group)
├── Known biases and limitations
└── Ethical considerations and recommendationsExplainability vs. Interpretability
Interpretability: Understanding HOW the model works
→ What features does it rely on?
→ Why did this input produce this output?
→ Techniques: attention visualization, probing classifiers, activation patching
Explainability: Providing post-hoc explanations users can understand
→ "This loan was denied because of X, Y, Z factors"
→ Techniques: LIME, SHAP, counterfactual explanations
→ LLMs: Chain-of-Thought as explanation of reasoningLLMs are largely black boxes — CoT provides reasoning transparency, but models can hallucinate reasoning too (reasoning steps don't always reflect internal computation).
Regulatory Landscape
EU AI Act (2024 — phased enforcement)
Risk-based classification:
├── Unacceptable risk → BANNED
│ e.g. social scoring, real-time biometric surveillance in public
├── High risk → strict requirements
│ e.g. CV screening, credit scoring, medical devices, law enforcement
│ Requires: risk management, data governance, transparency, human oversight
├── Limited risk → transparency obligations
│ e.g. chatbots must disclose they're AI
└── Minimal risk → no obligations
e.g. spam filters, AI in video games
US: No federal AI law yet
→ Executive orders and sector-specific guidance (FDA, FTC, NIST AI RMF)
→ NIST AI Risk Management Framework (voluntary)
China: Generative AI regulations (2023)
→ Must register GenAI products with government
→ Training data must be licensed
→ Must label AI-generated contentResponsible AI Principles (Summary)
Principle What it means in practice
──────────────────────────────────────────────────────────────
Fairness Test disaggregated performance; audit for bias
Reliability Evaluate, monitor, maintain — not set and forget
Safety Red-team, filter, rate-limit; human review for risk
Privacy Minimize data collection; anonymize; respect consent
Inclusiveness Test across languages, cultures, ability levels
Transparency Document data, methods, limitations (model cards)
Accountability Clear ownership; human in the loop for high-stakesAgentic AI Safety Considerations
As AI systems gain autonomy (agents, multi-step tasks, tool use), new safety concerns emerge:
Prompt injection
Malicious content in environment (web pages, files) hijacks agent behavior
Mitigation: sandbox tool execution, validate tool outputs before acting
Unintended side effects
Agent takes irreversible actions (delete files, send emails, API calls)
Mitigation: minimal permissions, confirm-before-act for destructive operations
Goal misgeneralization
Agent pursues proxy metric, not true objective
e.g. "maximize clicks" → generates sensationalist content
Cascading failures
Multi-agent systems: one hallucination propagates through all downstream agents
Mitigation: cross-agent validation checkpoints
Resource acquisition
Agent acquires more resources/permissions than needed for the task
Mitigation: principle of least privilege; capability limits per task