Building Secure and Reliable Systems

Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield

Security

14 skills extracted

Without Skills

Security and reliability treated as separate concerns, bolted on after design. Incident response is ad hoc. Access controls are either too permissive or too restrictive. No systematic approach to resilience, recovery, or secure deployment.

→

With Skills

Integrated security-reliability design reviews with scored risk assessments. Systematic least-privilege analysis with access classification frameworks. Structured incident response playbooks with role assignments and communication plans. Recovery plans with prioritized checklists and credential rotation procedures.

Quality: +0% vs baseline with skill

Problems These Skills Solve

Designing systems where security and reliability reinforce each other instead of conflicting

General skill set

Classifying access and applying least privilege without destroying developer productivity

General skill set

Building resilience through defense in depth, controlled degradation, and blast radius limitation

General skill set

Creating structured incident response and recovery plans before disasters strike

General skill set

Working Environment

*.{py,go,java,ts}

*.{yml,yaml}

*.{tf,yml}

design-doc.md

threat-model.md

*.{json,yml}

access-control.md

*.md

Skills operate on source code, infrastructure configs, deployment pipelines, access control policies, incident documentation, and system architecture descriptions. The book spans the full system lifecycle — design docs through production operations.

You provide

·your codebase or system description

·design documents

·deployment configs

·existing access control policies

·incident reports or postmortems

Install

Recommended

Full

14 skills

All 14 skills including capstone incident command and recovery

$ /plugin install building-secure-and-reliable-systems@bookforge-skills

Core

12 skills

Foundation + recovery, defense, testing, deployment, and incident response setup

$ /plugin install building-secure-and-reliable-systems@bookforge-skills --profile core

Minimal

7 skills

Foundation skills only — threat modeling, design reviews, least privilege, blast radius, disaster risk, code review, change rollout

$ /plugin install building-secure-and-reliable-systems@bookforge-skills --profile minimal

Extracted Skills

Security · hybrid

Adversary Profiling And Threat Modeling

Profile the likely adversaries targeting a system and produce a structured threat model with prioritized threat scenarios. Use when: designing a new system or service and need to identify who might attack it and how; evaluating whether existing security controls address the right threats; preparing a threat model document for a security review, compliance audit, or architecture decision record; assessing insider risk for a system that handles sensitive data or privileged operations; or mapping attack lifecycle stages to defensive controls. Applies the three adversary frameworks — attacker motivations, attacker profiles, and attack lifecycle stages — alongside a four-dimension actor-motive-action-target threat scenario matrix to produce ranked threat scenarios. Distinct from vulnerability assessment (which audits specific technical flaws) and penetration testing (which actively exploits weaknesses). Produces: adversary profile summary, insider risk matrix, threat scenario list ranked by likelihood and impact, and per-stage defensive control recommendations.

Security · hybrid

Disaster Risk Assessment

Use when you need to assess disaster risk for a system or organization, perform structured risk analysis before disaster planning, identify which disasters to plan for, build a prioritized risk register, quantify probability and impact of failure scenarios, or answer "what disasters should we prepare for and in what order."
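A prioritized risk register can be sketched roughly as follows. This is a minimal illustration, assuming a simple probability-times-impact score on 1 to 5 scales; the `RiskEntry` type, the scales, and the scenario names are hypothetical and are not the skill's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskEntry:
    scenario: str
    probability: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int       # assumed scale: 1 (minor) to 5 (catastrophic)

    @property
    def score(self) -> int:
        # Simple multiplicative score; real assessments may weight differently.
        return self.probability * self.impact


def prioritize(entries):
    """Return entries sorted by descending score, ties broken by impact."""
    return sorted(entries, key=lambda e: (e.score, e.impact), reverse=True)


# Hypothetical scenarios, for illustration only.
register = prioritize([
    RiskEntry("Regional datacenter outage", probability=2, impact=5),
    RiskEntry("Credential leak in CI logs", probability=4, impact=4),
    RiskEntry("Expired TLS certificate", probability=3, impact=2),
])

for entry in register:
    print(f"{entry.score:>2}  {entry.scenario}")
```

The sorted register answers the "in what order" question directly: the highest-scoring scenarios are planned for first.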

Security · full

DoS Defense And Mitigation

Design DoS-resistant systems or respond to an active denial-of-service attack. Use when: designing a new service and evaluating its attack surface to build in layered defenses; assessing whether a production system's architecture is DoS-hardened; investigating a traffic spike to determine whether it is an attack or a self-inflicted surge; fixing a client retry storm with backoff and jitter; building or reviewing a DoS mitigation system (detection plus response pipeline); or deciding how to respond strategically to an ongoing attack without leaking information about your defenses to the adversary.
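The backoff-and-jitter fix for client retry storms can be sketched as below. This is a minimal sketch, assuming the "full jitter" variant (a uniformly random delay between zero and a capped exponential ceiling), which is one common choice and not something the skill description specifies. The function name and the default parameters are hypothetical.

```python
import random


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=random.random):
    """Yield one retry delay (in seconds) per attempt.

    Each delay is drawn uniformly from [0, ceiling), where the ceiling
    doubles on every attempt and never exceeds `cap`. Randomizing the
    delay spreads retries from many clients over time instead of having
    them all return at the same instant.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


if __name__ == "__main__":
    for i, delay in enumerate(backoff_delays()):
        print(f"retry {i}: wait {delay:.2f}s")
```

Passing `rng` as a parameter keeps the function deterministic under test while still using `random.random` in normal use.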

Security · plan-only

Incident Response Team Setup

Use when you need to set up an incident response team from scratch, design an IR team charter, define severity and priority models for incidents, create IR playbooks, build a structured testing program, design tabletop exercises, or answer "how do we build and validate our incident response capability."

Security · full

Least Privilege Access Design

Analyze a system's access patterns and design least-privilege controls: classify data and APIs by risk, select the narrowest API surface for each operation, define authorization policies with multi-party approval for sensitive actions, establish emergency access override procedures, and optionally introduce a controlled-access production proxy. Use when reviewing access controls for an existing system, designing authorization for a new service, auditing whether engineers have more permissions than their roles require, deciding whether to use a bastion or proxy for privileged operations, or hardening administrative API surfaces against insider mistakes and external compromise. Produces an access classification report, API surface recommendations, authorization policy decisions, and emergency override guidelines.
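Multi-party approval for sensitive actions can be illustrated with a small sketch. It assumes a single rule: the requester may not approve their own action, and a configurable number of distinct approvers is needed. The `SensitiveAction` class and its field names are hypothetical, and a real system would also authenticate approvers and log every decision.

```python
from dataclasses import dataclass, field


@dataclass
class SensitiveAction:
    name: str
    requester: str
    required_approvals: int = 1
    approvers: set = field(default_factory=set)

    def approve(self, approver: str) -> None:
        # Self-approval would defeat the purpose of a second party.
        if approver == self.requester:
            raise PermissionError("requester cannot approve their own action")
        # A set ignores repeated approvals from the same person.
        self.approvers.add(approver)

    def is_authorized(self) -> bool:
        return len(self.approvers) >= self.required_approvals


action = SensitiveAction("drop-production-table", requester="alice")
action.approve("bob")
print(action.is_authorized())
```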

Security · full

Recovery Mechanism Design

Design recovery mechanisms for a software system or component: select the right rollback control strategy from a three-mechanism decision framework (key rotation, deny list, and minimum acceptable security version number / downgrade prevention), set rate-of-change policy that decouples rollout velocity from rollout content, eliminate wall-clock time dependencies from recovery paths, design an explicit revocation mechanism with safe failure behavior (distributing cached revocation lists rather than failing open), and provision emergency access for use when normal access paths are completely unavailable. Use when designing a new system's update or rollback architecture, reviewing an existing release pipeline for security-reliability tradeoffs, defining rollback policy for self-updating firmware or system software, designing a revocation mechanism for credentials or certificates, or planning emergency access infrastructure before an incident occurs. Output: a recovery mechanism design document with rollback control strategy per component, rate-of-change policy, revocation mechanism design, and emergency access plan.
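Two of the three rollback controls named above, the deny list and the minimum acceptable security version number, can be sketched as a single admission check. This is an illustration under assumptions: the `RollbackPolicy` class, the release identifiers, and the integer security versions are hypothetical, and key rotation (the third mechanism) is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackPolicy:
    # Lowest security version the system will accept; raised only after a
    # security fix has proven stable, so ordinary rollbacks stay possible.
    min_security_version: int = 0
    # Specific releases that must never run again, regardless of version.
    deny_list: frozenset = frozenset()

    def allows(self, release_id: str, security_version: int) -> bool:
        if release_id in self.deny_list:
            return False
        return security_version >= self.min_security_version


policy = RollbackPolicy(min_security_version=3, deny_list=frozenset({"build-41"}))

print(policy.allows("build-40", 3))  # older build, same security version
print(policy.allows("build-41", 4))  # explicitly denied release
print(policy.allows("build-39", 2))  # security downgrade
```

In this sketch the security version is tracked separately from the release identifier, so rolling back to an older build is still permitted as long as it does not reintroduce a known vulnerability.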

Security · full

Resilience And Blast Radius Design

Design or audit a system's resilience posture using a layered framework — defense in depth, controlled degradation (load shedding vs. throttling), blast radius compartmentalization (role/location/time), failure domains, 3-tier component reliability hierarchy, and continuous validation. Use this skill when designing a new system for failure, reviewing an existing architecture for single points of failure, limiting blast radius of a potential attack or outage, deciding fail-open vs. fail-closed behavior for a component, or building an incident-response-ready compartmentalization strategy.

Security · full

Secure Code Review

Review code for security vulnerabilities and reliability anti-patterns: scan for SQL injection risks (raw string concatenation into queries), XSS exposure (untyped HTML construction), authorization bypass from multilevel nesting, primitive type obsession, YAGNI-inflated attack surface, and missing framework enforcement for authentication/authorization/rate-limiting. Use when conducting a security code review, auditing a codebase for injection vulnerabilities, checking whether auth logic could be bypassed by nesting errors, evaluating whether RPC backends use hardened interceptor frameworks, or assessing whether type-safety patterns (TrustedSqlString, SafeHtml, SafeUrl) are applied to user-controlled inputs. Produces a categorized security findings report with severity, anti-pattern class, affected locations, and fix recommendations grounded in hardened-by-construction design.
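The SQL injection anti-pattern (raw string concatenation into queries) can be contrasted with a safer form in a short sketch. Note the substitution: this example uses parameter binding from Python's standard `sqlite3` module rather than a `TrustedSqlString`-style type, which is a related but different technique. The table, function names, and payload are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "viewer")],
)


def find_user_unsafe(name):
    # Anti-pattern: user input is concatenated into the query text, so
    # quote characters in `name` can change the structure of the query.
    query = "SELECT name, role FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(name):
    # The query text is a fixed literal; the input travels separately as a
    # bound parameter and is only ever treated as a value.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "nobody' OR '1'='1"
print(len(find_user_unsafe(payload)))  # injected condition matches every row
print(len(find_user_safe(payload)))    # no user has this literal name
```

A review finding for the unsafe function would typically recommend making the safe form the only one the codebase's query API accepts, which is the idea behind hardened-by-construction types.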

Security · full

Secure Deployment Pipeline

Secure a software deployment pipeline against supply chain attacks from benign insiders (mistakes), malicious insiders, and external attackers: map pipeline threats to mitigations using the three-adversary threat model, generate binary provenance requirements for each build stage, define provenance-based deployment policies with choke-point enforcement, design verifiable build architecture (trusted build service, rebuild service, or hybrid), and produce a staged hardening roadmap with breakglass controls. Use when assessing supply chain security for a CI/CD pipeline, implementing binary provenance to trace artifact origins, designing deployment policies that verify what is deployed rather than who initiated deployment, hardening build infrastructure against insider threats, or establishing breakglass procedures that remain auditable. Requires secure-code-review as a prerequisite control (code review is the first mitigation layer against malicious or accidental code changes before they enter the pipeline). Produces a deployment pipeline security assessment with threat-mitigation mapping, provenance schema, policy rules, and a phased hardening plan.
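A provenance-based deployment policy, which checks what is being deployed rather than who initiated the deployment, might look roughly like the sketch below. Everything here is an assumption for illustration: the `Provenance` fields, the builder and repository names, and the `admit` function are hypothetical, and signature verification of the provenance record itself is omitted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Provenance:
    artifact_digest: str  # digest of the binary this record describes
    builder: str          # build service that produced the binary
    source_repo: str      # repository the source was fetched from
    code_reviewed: bool   # whether the source change was peer reviewed


# Hypothetical policy inputs for one deployment environment.
TRUSTED_BUILDERS = {"trusted-build-service"}
ALLOWED_REPOS = {"example.internal/payments"}


def admit(artifact_digest: str, provenance: Optional[Provenance]) -> Tuple[bool, str]:
    """Decide at the deployment choke point whether an artifact may run."""
    if provenance is None:
        return False, "missing provenance"
    if provenance.artifact_digest != artifact_digest:
        return False, "provenance does not match artifact"
    if provenance.builder not in TRUSTED_BUILDERS:
        return False, "untrusted builder"
    if provenance.source_repo not in ALLOWED_REPOS:
        return False, "source repository not allowed"
    if not provenance.code_reviewed:
        return False, "source was not code reviewed"
    return True, "ok"


good = Provenance("sha256:abc", "trusted-build-service",
                  "example.internal/payments", code_reviewed=True)
print(admit("sha256:abc", good))
print(admit("sha256:abc", None))
```

The decision depends only on the artifact and its provenance, so the same check applies whether the deployment was started by an engineer, an automated job, or an attacker with stolen credentials.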

Security · hybrid

Security Change Rollout Planning

Plan and execute a security change rollout across a service or fleet: classify the change into a time horizon (short / medium / long-term), triage affected systems by risk tier, select the appropriate rollout strategy with canarying and staged deployment, define communication strategy (internal and external), set rollback and success criteria, and produce a written rollout plan. Use when you need to respond to a zero-day vulnerability, roll out a security posture improvement, or drive an ecosystem or regulatory compliance change. Handles timeline disruption scenarios: accelerate when an exploit goes public, slow down when patch instability is detected, delay when embargo, external dependency, or limited blast radius dictates caution. Produces a rollout plan with timeline, per-tier risk triage, communication strategy, and explicit rollback criteria. Examples covered: Shellshock emergency patch, hardware security key (FIDO/WebAuthn) company-wide deployment, and Chrome HTTPS migration.

Security · hybrid

Security Incident Command

Command and manage an active security incident from declaration through remediation handoff using the incident management framework (Google's IMAG, derived from ICS). Use when: you have a confirmed or suspected security incident and need to take command; someone says "we have a security incident" or "we may have been compromised"; you need to stand up an incident command structure with staffing roles; you are running forensic investigation and need to coordinate parallel tracks; an incident has grown large enough to require shift rotation and formal handovers; or you need to decide when investigation is complete enough to move to ejection and remediation. Distinct from incident response team setup (which designs the team and IR capability before incidents) — this skill executes the live response. Applies the seven-step incident command process: declare, staff, establish operational security, run forensic investigation loop, scale with rotation, apply the lead-rate decline signal to decide ejection timing, and hand off with a structured brief. Produces: incident state document, forensic timeline, communication plan, and remediation handoff package.

Security · hybrid

Security Incident Recovery

Use when you need to recover from a security incident, build an incident recovery plan, execute post-breach remediation, rotate credentials after a breach, scope attacker impact across systems, build a recovery checklist, decide when to eject an attacker vs. continue observing, or run a post-incident postmortem with short-term and long-term action items.

Security · full

Security Reliability Design Review

Review a system design for security and reliability tradeoffs before implementation begins. Use when: evaluating an architecture proposal or design document to identify where security and reliability requirements conflict with feature or cost requirements; auditing a proposed design to determine whether security and reliability are designed in from the start or likely to require expensive retrofitting later; deciding whether to build payment processing or sensitive data handling in-house versus delegating to a third-party provider; assessing whether a microservices framework or platform incorporates security and reliability by construction rather than by convention; or producing a design review report for a security review, production readiness review, or architecture decision record. Applies the emergent property test (security and reliability cannot be bolted on; they must arise from the whole design), the initial-versus-sustained-velocity model (neglecting security and reliability early raises initial velocity but slows the project later), and the Google design document evaluation checklist covering scalability, redundancy, dependency, data integrity, SLA, and security/privacy considerations. Produces: a design review report with identified tensions between feature, security, and reliability requirements; tradeoff recommendations; and prioritized security/reliability improvements. Distinct from threat modeling (which focuses on adversary scenarios) and code review (which audits existing implementation). Applicable at any stage where design decisions are still open.

Security · full

Security Testing Strategy

Select and implement a layered security testing strategy for a codebase: design unit tests for security properties (boundary conditions, negative inputs, access control invariants), set up integration testing with non-production seed data (avoiding the production data copy anti-pattern), choose and configure dynamic analysis sanitizers (AddressSanitizer for memory corruption, ThreadSanitizer for race conditions, MemorySanitizer for uninitialized reads — with their performance cost tradeoff accounted for in CI/CD scheduling), plan fuzz testing (write effective fuzz drivers using libFuzzer/AFL, apply dictionary inputs for structured formats like HTTP/SQL/JSON, build a seed corpus, integrate continuous fuzzing via ClusterFuzz or OSS-Fuzz), and integrate static analysis at the right depth for each development stage (linters in the IDE commit loop, abstract interpretation nightly, formal methods for safety-critical paths). Use when creating a security testing plan for a new service, setting up fuzz testing for a parser or protocol implementation, integrating static analysis into a CI/CD pipeline, adding sanitizer-enhanced nightly builds, or auditing coverage gaps found during secure-code-review. Produces a security testing strategy document with tool selection rationale, CI/CD integration plan, and coverage priorities derived from code review findings.