The ALB backend protocol (HTTP, HTTPS, HTTP/2, H2C) is independent of the client-facing frontend protocol.
Global and regional ALBs use load balancing scheme EXTERNAL_MANAGED; classic ALBs use EXTERNAL. EXTERNAL_MANAGED backend services can attach to EXTERNAL forwarding rules, but EXTERNAL backend services cannot attach to EXTERNAL_MANAGED forwarding rules.
Each ALB forwarding rule supports only one port (1–65535); use multiple forwarding rules for multiple ports.
H2C (HTTP/2 cleartext) is not supported on classic Application Load Balancers.
GFE health check probes originate from IP ranges 130.211.0.0/22 and 35.191.0.0/16.
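A minimal sketch of how the two probe ranges above might be checked in code, for example when validating a firewall allowlist. The function name `is_probe_source` is hypothetical; only the two CIDR ranges come from the note.

```python
import ipaddress

# Probe source ranges quoted in the note above.
PROBE_RANGES = [
    ipaddress.ip_network("130.211.0.0/22"),
    ipaddress.ip_network("35.191.0.0/16"),
]


def is_probe_source(ip: str) -> bool:
    """Return True if ip falls inside one of the probe ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PROBE_RANGES)
```

A /22 covers four consecutive /24 blocks, so 130.211.3.255 is inside the first range while 130.211.4.0 is not.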
A proxy-only subnet (purpose REGIONAL_MANAGED_PROXY) is required only for regional external ALBs, not for global or classic ALBs.
Regional external ALB is the appropriate choice for compliance/data residency requirements (single geolocation).
The external Application Load Balancer operates in three modes: global external ALB, classic ALB, and regional external ALB; mode is immutable after creation.
Compute Engine VM access scopes can further restrict Artifact Registry access beyond IAM roles — the default read-only scope blocks writes even if the SA has Writer role; cloud-platform scope is needed for push.
The gcloud auth configure-docker HOSTNAME command configures the Docker credential helper for a specific Artifact Registry hostname (e.g., us-west1-docker.pkg.dev).
Artifact Registry cleanup policy changes take effect within approximately one day via a background job.
Artifact Registry cleanup deletions count against the per-project delete request quota.
Artifact Registry cleanup dry run results appear in Data Access audit logs, which must be explicitly enabled with "data write" type to see results.
Artifact Registry cleanup keep policies always override delete policies when an artifact matches both.
Artifact formats that don't support tags are treated as untagged for cleanup policy evaluation.
Artifact Registry cleanup policies require roles/artifactregistry.admin (needs artifactregistry.repositories.update and artifactregistry.versions.delete permissions).
Artifact Registry cleanup policies can only be applied to standard repositories, not virtual repositories or at the project level.
In Artifact Registry cleanup policies, tagState must be TAGGED to use tagPrefixes in conditions.
Cloud Build's default service account has read/write access to Artifact Registry.
Cloud Run cross-project Artifact Registry access requires granting roles to the Cloud Run Service Agent (service-PROJECT-NUMBER@serverless-robot-prod.iam.gserviceaccount.com), not just the runtime service account.
Artifact Registry supports CMEK encryption via Cloud KMS (Google-managed encryption by default), and organization policy can enforce CMEK.
The createOnPush Artifact Registry roles (roles/artifactregistry.createOnPushWriter and createOnPushRepoAdmin) are specifically for Container Registry to Artifact Registry migration, allowing gcr.io repo creation on push.
Artifact Registry cross-project access is not automatic — roles must be explicitly granted in the Artifact Registry project.
Compute Engine, GKE, and Cloud Run default service accounts get read-only access to Artifact Registry by default.
Artifact Registry Docker image paths follow the format LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG.
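The path format above can be assembled mechanically. This is an illustrative sketch; the helper names and the sample project, repository, and image values are hypothetical.

```python
def registry_host(location: str) -> str:
    """Hostname used for Docker repositories in a given location."""
    return f"{location}-docker.pkg.dev"


def image_path(location: str, project: str, repository: str,
               image: str, tag: str) -> str:
    """Build LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG."""
    return f"{registry_host(location)}/{project}/{repository}/{image}:{tag}"


# Hypothetical values, shown only to illustrate the shape of the path.
print(image_path("us-west1", "my-project", "my-repo", "app", "v1"))
# prints us-west1-docker.pkg.dev/my-project/my-repo/app:v1
```

The hostname portion is the same value passed to gcloud auth configure-docker.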
Artifact Registry IAM roles can be granted at two levels: project-wide (applies to all repositories) or repository-specific.
Artifact Registry image streaming only works when images are in the same region (or corresponding multi-region) as workloads.
Artifact Registry does not charge for egress to Google Cloud services in the same region.
Organizations created after May 3, 2024 enforce iam.automaticIamGrantsForDefaultServiceAccounts by default, preventing automatic Editor role grants to default service accounts.
Making an Artifact Registry repository public requires granting roles/artifactregistry.reader to allUsers and capping per-user quotas to prevent abuse.
Artifact Registry remote repositories are read-only caching proxies for upstream public sources (Docker Hub, Maven Central, PyPI), enabling vulnerability scanning and provenance tracking on third-party dependencies.
Artifact Registry is Google Cloud's recommended registry, replacing Container Registry, with support for both container images and language packages (Go, Java, Node.js, Python, Ruby, Helm).
Artifact Registry repository format (--repository-format=docker) is specified at creation and cannot be changed afterward.
Artifact Registry repository mode (standard, remote, or virtual) cannot be changed after creation.
Artifact Registry repository format and mode are both immutable after creation — a repository's package type (Docker, Maven, npm) and operational mode (standard, remote, virtual) are permanent architectural decisions that cannot be corrected without recreating the repository and migrating all artifacts.
Resource Manager tags attach to Artifact Registry repositories only, not individual artifacts, for conditional IAM access control.
Virtual repositories require explicit grants for the Artifact Registry service account to access upstream repositories.
Virtual repository IAM roles apply to all upstream repositories regardless of individual upstream repo permissions.
Artifact Registry virtual repositories aggregate multiple repos behind a single endpoint with priority-based resolution order, mitigating dependency confusion attacks by prioritizing private over public upstream repos.
Auto mode VPC networks automatically create one /20 subnet per region from the 10.128.0.0/9 block.
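The arithmetic behind the /9 block and its /20 subnets can be worked out with the standard library. This sketch only shows the address math; which region receives which /20 is not covered by the note and is not modeled here.

```python
import ipaddress

AUTO_MODE_BLOCK = ipaddress.ip_network("10.128.0.0/9")

# Number of /20 subnets that fit in a /9: 2 ** (20 - 9).
subnet_count = 2 ** (20 - AUTO_MODE_BLOCK.prefixlen)

# First /20 carved from the block, and how many addresses it holds.
first_subnet = next(AUTO_MODE_BLOCK.subnets(new_prefix=20))

print(subnet_count)                # prints 2048
print(first_subnet)                # prints 10.128.0.0/20
print(first_subnet.num_addresses)  # prints 4096
```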
A static route cannot be created with a destination that matches or fits within an existing subnet route range, and vice versa (unless using hybrid subnets).
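A rough way to pre-check the "matches or fits within" rule before creating a static route. The function name is hypothetical, and the sketch covers only the forward direction (route destination versus existing subnet ranges), not the hybrid subnet exception.

```python
import ipaddress


def conflicts_with_subnet(destination: str, subnet_ranges: list[str]) -> bool:
    """True if destination equals or fits inside any subnet range."""
    dest = ipaddress.ip_network(destination)
    for cidr in subnet_ranges:
        subnet = ipaddress.ip_network(cidr)
        # subnet_of() raises TypeError across IP versions, so compare first.
        if dest.version == subnet.version and dest.subnet_of(subnet):
            return True
    return False
```

A broader destination such as 10.0.0.0/8 does not fit within 10.128.0.0/20, so this check does not flag it.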
Classic VPN does not support IPv6 — only HA VPN supports dual-stack and IPv6-only configurations.
Adaptive Protection requires a Cloud Armor Enterprise subscription and is enabled per-security policy.
DDoS protection is automatic (no configuration needed) for global external Application Load Balancers, classic Application Load Balancers, and external proxy Network Load Balancers.
Cloud Armor provides edge-first layered defense with four independent mechanisms: automatic DDoS protection at the Google Cloud edge for global external ALBs, prioritized rule evaluation ensuring highest-priority matches win, OWASP CRS 3.3.2-based WAF rules for application-layer filtering, and Enterprise-tier protection covering HTTP/HTTPS/HTTP2/QUIC protocols.
Cloud Armor Enterprise DDoS protection supports HTTP, HTTPS, HTTP/2, and QUIC protocols.
Cloud Armor operates at the Google Cloud edge, filtering traffic before it reaches backend resources or enters VPC networks.
Cloud Armor security policies use prioritized rules — the highest-priority matching rule wins.
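A simplified model of first-match rule evaluation, not Cloud Armor's actual implementation. In Cloud Armor the lowest numeric priority value is the highest priority, and the default rule sits at 2147483647. The `Rule` type, the prefix-based matchers, and the action strings below are assumptions made for illustration.

```python
from typing import Callable, NamedTuple, Optional


class Rule(NamedTuple):
    priority: int                   # lower number = higher priority
    matches: Callable[[str], bool]  # hypothetical matcher on client IP
    action: str


def evaluate(rules: list[Rule], client_ip: str) -> Optional[str]:
    """Return the action of the highest-priority rule that matches."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(client_ip):
            return rule.action
    return None


policy = [
    Rule(2147483647, lambda ip: True, "allow"),  # default rule
    Rule(1000, lambda ip: ip.startswith("203.0.113."), "deny(403)"),
]
```

With this policy, a client in 203.0.113.0/24 is denied by the priority-1000 rule even though the default rule would also match.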
Cloud Armor supports hybrid and multi-cloud deployments — it is not limited to GCP-hosted backends.
Cloud Armor preconfigured WAF rules are based on OWASP Core Rule Set 3.3.2 (CRS) and do not support XML body parsing.
roles/dns.admin can manage DNS records but cannot set IAM policies on zones (lacks setIamPolicy permission).
ALIAS records are skipped when exporting Cloud DNS zones to BIND format.
The @ symbol in Cloud DNS Console is treated literally, not as an apex alias — leave the DNS name field blank for apex records.
BIND zone file imports require trailing dots on fully qualified domain names to avoid relative-name interpretation.
If a CNAME record exists at a DNS name, no other record type can coexist at that name.
Cloud DNS configuration changes can have side effects beyond the modified resource. Enabling an outbound server policy silently disables resolution of all private zones, forwarding zones, and peering zones. Disabling DNSSEC must follow a registrar-first sequence, or resolution fails for the entire zone. CNAME exclusivity silently prevents any other record type from coexisting at the same name. Together, these make DNS changes among the highest-blast-radius operations in GCP networking.
Cross-project binding zones work only within the same organization.
Before disabling DNSSEC on a Cloud DNS managed zone, DNSSEC must be deactivated at the domain registrar first to avoid resolution failures.
DNS64 synthesizes IPv6 addresses using the Well-Known Prefix 64:ff9b::/96 per RFC 6052 and requires NAT64 via Cloud NAT.
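The synthesis with the Well-Known Prefix amounts to placing the 32-bit IPv4 address in the low 32 bits of the /96 prefix. A minimal sketch, with a hypothetical function name and a documentation-range IPv4 address as the example:

```python
import ipaddress

WELL_KNOWN_PREFIX = ipaddress.ip_network("64:ff9b::/96")


def synthesize_aaaa(ipv4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in the low 32 bits of 64:ff9b::/96."""
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(WELL_KNOWN_PREFIX.network_address) | int(v4))


print(synthesize_aaaa("192.0.2.33"))  # prints 64:ff9b::c000:221
```

The four octets 192, 0, 2, 33 become the hex groups c000 and 0221, which is why the result ends in c000:221.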
DNSSEC provides authentication against spoofing and cache poisoning, not encryption.
All records except SOA and NS must be removed before a Cloud DNS zone can be deleted via gcloud.
Cloud DNS does not support DNS forwarding for public zones — public zones must be authoritative.
NS and SOA records at the zone apex are auto-created and cannot be deleted via the API; they are removed only when the zone is deleted (per RFC 1034).
Using a Cloud DNS outbound server policy disables resolution of all Cloud DNS private zones, forwarding zones, peering zones, and Compute Engine internal DNS zones.
Managed reverse lookup zones are needed for non-RFC 1918 PTR records on Compute Engine VMs.
Cloud DNS security is limited in scope (DNSSEC provides authentication against spoofing and cache poisoning only, not encryption) and operationally fragile (configuration changes have cascading side effects where enabling outbound server policies silently disables private zones, CNAME exclusivity creates implicit constraints) — security-relevant DNS changes can silently break zone resolution.
Service Directory zones cannot have records added directly — data comes from the Service Directory namespace.
Shared VPC private/forwarding/peering zones must be created in the host project (or use cross-project binding in service projects).
Cloud DNS transactions group multiple record changes into an atomic unit — the entire transaction succeeds or fails.
Cloud DNS uses anycast to serve zones from multiple global locations for high availability and low latency.
Cloud HSM keys are certified to FIPS 140-2 Level 3.
HSM QPM quota is consumed in the project containing the HSM keys, not necessarily the project making the cryptographic request.
Single-tenant HSM instances must be in the same location as the key ring that contains the keys.
Cloud HSM offers two protection levels: hsm (multi-tenant shared infrastructure) and hsm-single-tenant (dedicated HSM instance).
Cloud HSM uses Cloud KMS as its frontend — all HSM operations go through the KMS API, not a separate API.
A single Cloud NAT gateway can be either Public NAT or Private NAT, never both; two separate gateways can serve the same subnet.
Traffic to Google APIs uses Private Google Access, not Public NAT, even when Public NAT is configured.
Cloud NAT is not based on proxy VMs; it is software-defined via the Andromeda networking stack and does not reduce network bandwidth per VM.
Cloud NAT does not allow unsolicited inbound connections; it only handles outbound NAT and established inbound response packets.
Private NAT addresses the overlapping IP range problem between VPC networks, using NCC spokes for connectivity.
Cloud NAT gateway is a regional resource configured on a Cloud Router.
Serverless resources (Cloud Run, Cloud Run functions, App Engine) require Direct VPC egress or Serverless VPC Access to use Cloud NAT.
Cloud NAT is a software-defined regional gateway on Cloud Router (not proxy VMs), routing internet egress while directing Google API traffic through Private Google Access instead, and requiring VPC egress configuration for serverless resources.
Cloud Source Repositories is unavailable to new customers after June 17, 2024.
Cloud VPN cannot be used to route traffic to the public internet — it is designed exclusively for private network communication.
Cloud VPN cipher configuration options are set at tunnel creation time and cannot be modified afterward — the tunnel must be deleted and recreated.
IKEv2 is required for IPv6 traffic over Cloud VPN; IKEv1 does not support IPv6.
Cloud VPN is a regional resource — you select a region, not a zone, when creating a VPN gateway.
Cloud VPN is site-to-site only — it does not support client-to-gateway (dial-in) VPN connections.
Each Cloud VPN tunnel supports up to 250,000 packets per second (combined ingress/egress), equivalent to 1–3 Gbps depending on packet size.
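The 1 to 3 Gbps range follows from multiplying the packet rate by the packet size. The packet sizes used below (500 and 1500 bytes) are illustrative assumptions, not values from the note.

```python
PACKETS_PER_SECOND = 250_000


def throughput_gbps(packet_bytes: int, pps: int = PACKETS_PER_SECOND) -> float:
    """Throughput in Gbps for a given packet size at a fixed packet rate."""
    return pps * packet_bytes * 8 / 1e9


print(throughput_gbps(500))   # prints 1.0
print(throughput_gbps(1500))  # prints 3.0
```

Small packets hit the packet-rate ceiling long before the bandwidth ceiling, which is why throughput depends on packet size.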
Cloud Build allowExitCodes takes precedence over allowFailure; if all steps have allowFailure: true and all fail, build status is still Successful.
Cloud Build integrates with Binary Authorization to check build attestations and block unauthorized deployments to Cloud Run/GKE.
Both Cloud Build default and private pools are fully managed, pay-per-build-minute, and auto-scale to zero.
Cloud Build CMEK compliance is automatic with no user configuration; an ephemeral encryption key is generated per build and destroyed after completion.
Cloud Build default machine type is e2-standard-2 (2 CPUs); max disk size is 4000 GB.
Cloud Build default pool max concurrency is 30; private pool supports 100+.
Cloud Build default build timeout is 60 minutes; maximum is 24 hours. The timeout is written as a duration in seconds with an s suffix (for example, 3600s for 60 minutes).
Cloud Build currently uses Docker engine version 20.10.24.
Cloud Build provisions a fresh VM per build and destroys it after completion — no residual state persists.
Cloud Build global region uses default pools; specifying a specific region requires a private pool in that region.
In Cloud Build triggers, if a file matches both included and ignored file filters, the build is not invoked (ignored wins).
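One way to picture the "ignored wins" rule is as a per-file filter. This is a simplified model that assumes both filter lists are non-empty and uses Python's fnmatch, whose pattern semantics may differ from the glob syntax Cloud Build triggers accept. The function name and sample paths are hypothetical.

```python
from fnmatch import fnmatch


def should_trigger(changed_files: list[str], included: list[str],
                   ignored: list[str]) -> bool:
    """True if at least one changed file is included and not ignored."""
    def counts(path: str) -> bool:
        is_included = any(fnmatch(path, pattern) for pattern in included)
        is_ignored = any(fnmatch(path, pattern) for pattern in ignored)
        return is_included and not is_ignored

    return any(counts(path) for path in changed_files)
```

In this model a commit touching only src/notes.md, with included pattern src/* and ignored pattern *.md, does not invoke a build.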
Cloud Build logs go to both Cloud Logging and Cloud Storage by default; logging: GCS_ONLY stores only in GCS.
Cloud Build config supports maximum 300 build steps per config file and up to 100 arguments per step.
Cloud Build private pools support 64 machine types compared to 5 for default pools.
Private pools can disable public IPs and provide static internal IP ranges; default pools cannot.
Private pool builds run in the region where the pool is created, not where the build is submitted.
Cloud Build private pools connect to customer VPC networks via VPC peering (private services access) to reach private resources.
Cloud Build queueTtl defaults to 3600s (1 hour) and ticks from createTime, while timeout ticks from startTime.
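Because the two clocks start at different moments, a build has two separate deadlines. The sketch below computes both; the function name is hypothetical, and the 3600-second defaults mirror the queueTtl and timeout defaults stated in these notes.

```python
from datetime import datetime, timedelta, timezone


def deadlines(create_time: datetime, start_time: datetime,
              queue_ttl_s: int = 3600, timeout_s: int = 3600):
    """Return (queue_expiry, build_expiry) for a build."""
    queue_expiry = create_time + timedelta(seconds=queue_ttl_s)  # from createTime
    build_expiry = start_time + timedelta(seconds=timeout_s)     # from startTime
    return queue_expiry, build_expiry


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
started = created + timedelta(minutes=5)  # build waited 5 minutes in the queue
queue_expiry, build_expiry = deadlines(created, started)
```

Here the build would have expired in the queue at 13:00 UTC had it not started, and once started it may run until 13:05 UTC.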
Cloud Build step script field is mutually exclusive with args and entrypoint.
Cloud Build performs a shallow clone (single commit) by default; run git fetch --unshallow in a build step to retrieve the full history.
Including [skip ci] or [ci skip] in a commit message prevents Cloud Build trigger invocation.
Cloud Build meets SLSA level 3 requirements for container images with verifiable build provenance.
Cloud Build steps run in Docker containers connected via the cloudbuild local Docker network, enabling inter-step communication.
Cloud Build steps run serially by default; use id and waitFor fields for parallel execution (waitFor: ['-'] starts immediately).
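The waitFor behavior can be modeled as a dependency map. This is a sketch of the scheduling rule only, not Cloud Build code: steps are represented as plain dicts with hypothetical ids, a missing waitFor is treated as "wait for every earlier step" (the serial default), and ['-'] means no dependencies.

```python
def dependencies(steps: list[dict]) -> dict[str, list[str]]:
    """Map each step id to the ids it must wait for."""
    deps: dict[str, list[str]] = {}
    earlier: list[str] = []
    for step in steps:
        wait_for = step.get("waitFor")
        if wait_for is None:
            deps[step["id"]] = list(earlier)   # serial default
        elif wait_for == ["-"]:
            deps[step["id"]] = []              # start immediately
        else:
            deps[step["id"]] = list(wait_for)  # explicit dependencies
        earlier.append(step["id"])
    return deps


steps = [
    {"id": "fetch"},
    {"id": "lint", "waitFor": ["-"]},
    {"id": "build", "waitFor": ["fetch"]},
    {"id": "deploy"},
]
plan = dependencies(steps)
```

In this example fetch and lint can start together, build waits only for fetch, and deploy waits for everything before it.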
Cloud Build achieves supply chain security through three mechanisms: ephemeral build VMs with no residual state, SLSA level 3 attestation providing verifiable container provenance, and trigger service account precedence preventing config-level privilege escalation.
Cloud Build trigger branch and tag patterns use RE2 regex syntax; forward slashes cannot be used in tags.
Cloud Build trigger service account overrides any service account specified in the build config file.
Cloud Build user-defined substitution keys must start with an underscore (_); dynamicSubstitutions must be true for bash parameter expansions with trigger-invoked builds.
The /workspace volume is automatically mounted in all Cloud Build steps; additional volumes must be explicitly declared.
Cloud Run authentication model applies to services only, not jobs.
Cloud Run autoscaler targets 60% CPU utilization and 60% of maximum concurrency, both measured over a one-minute window.
Cloud Run usage is billed rounded up to the nearest 100 milliseconds; jobs have a minimum billing of 1 minute.
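The rounding rule can be expressed with integer arithmetic. A minimal sketch, assuming durations are measured in whole milliseconds; the function name and the `is_job` flag are hypothetical, and actual pricing involves more than billable time.

```python
def billable_ms(duration_ms: int, is_job: bool = False) -> int:
    """Round up to the next 100 ms; jobs are billed at least 1 minute."""
    rounded = -(-duration_ms // 100) * 100  # ceiling division by 100
    if is_job:
        return max(rounded, 60_000)
    return rounded


print(billable_ms(250))               # prints 300
print(billable_ms(5_000, is_job=True))  # prints 60000
```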
Cloud Run imports and caches the container image at deploy time; images are not pulled from the registry when new instances start.
Compute Flexible CUDs apply across Cloud Run, GKE, and Compute Engine; 3-year flexible CUD offers ~46% savings on CPU.
Setting Cloud Run concurrency to 1 negatively impacts scaling performance during traffic spikes due to cold start overhead per request.
The Cloud Run concurrency setting applies only to services (request-driven), not to jobs.
Cloud Run default concurrency is 80× the number of vCPUs when deployed via gcloud/Terraform (new services only); 80 when deployed via Console.
The Cloud Run concurrency setting is a maximum limit (ceiling), not a guarantee — Cloud Run may send fewer requests if CPU is already highly utilized.
Cloud Run maximum configurable concurrent requests per instance is 1000.
Cloud Run achieves deterministic container execution by resolving image tags to digests and caching images at deploy time — revisions always serve the exact image deployed regardless of subsequent tag mutations in the registry or registry outages after deployment.
Cross-project secret references in Cloud Run use the projects/PROJECT_NUMBER/secrets/SECRET_NAME format (project number, not ID, in YAML/gcloud; Terraform uses project ID).
Only four default roles can invoke a Cloud Run service: Project Owner, Project Editor, Cloud Run Admin, and Cloud Run Invoker (roles/run.invoker).
Deploying to Cloud Run requires roles/run.developer, roles/iam.serviceAccountUser, and roles/artifactregistry.reader; cross-project additionally needs roles/iam.serviceAccountTokenCreator.
Direct VPC egress uses more IP addresses than Serverless VPC Access connectors in most cases.
Direct VPC egress supports network tags per service/job revision for fine-grained firewall rules; connectors share tags across all services using the same connector.
Direct VPC egress is the recommended method for Cloud Run outbound traffic to a VPC: it requires no connector VMs, offers lower latency, and its cost scales to zero.
Docker Hub images used by Cloud Run are cached for up to 1 hour.
Cloud Run container instances have an ephemeral in-memory writable file system overlay that does not persist across shutdowns.
The Cloud Run free tier is per billing account (not per project), aggregated across projects, and resets monthly.
Cloud Run GPU workloads do not qualify for committed use discount (CUD) pricing.
Requests denied by IAM policy are not billed in Cloud Run.
Cloud Run idle instances may remain up to 15 minutes (10 minutes for GPU-enabled instances) after handling requests to absorb traffic spikes.
Cloud Run has a 9.9 GB max container image layer size when deploying from Docker Hub or Artifact Registry remote repository with an external registry.
Cloud Run internet egress for VPC-connected workloads requires chaining two regional constructs: Direct VPC egress (preferred over connector VMs) for outbound-only VPC access, then Cloud NAT on Cloud Router for internet-bound traffic — neither alone is sufficient for serverless-to-internet connectivity.
Cloud Run jobs automatically inject CLOUD_RUN_TASK_INDEX (0-based) and CLOUD_RUN_TASK_COUNT environment variables per task.
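A common use of the task index and task count is to split a work list across parallel tasks. The sketch below reads the variables under their underscore-separated names, CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT; the function name and the striding strategy are illustrative choices, and the defaults let the code run outside Cloud Run as a single task.

```python
import os
from collections.abc import Mapping


def my_shard(items: list, env: Mapping[str, str] = os.environ) -> list:
    """Return the slice of items this task should process."""
    index = int(env.get("CLOUD_RUN_TASK_INDEX", "0"))  # 0-based
    count = int(env.get("CLOUD_RUN_TASK_COUNT", "1"))
    return items[index::count]  # every count-th item, starting at index


# Simulate task 1 of 3 by passing the variables explicitly.
fake_env = {"CLOUD_RUN_TASK_INDEX": "1", "CLOUD_RUN_TASK_COUNT": "3"}
print(my_shard(list(range(10)), fake_env))  # prints [1, 4, 7]
```

Striding keeps shards balanced without any coordination between tasks, since each task derives its share from its own index.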
A Cloud Run job supports 1–10,000 independent tasks that can run in parallel; default is 1 task.
Cloud Run job names must be 49 characters or less, unique per region and project, and cannot be changed after creation.
Cloud Run jobs default to 3 retries per failed task, configurable from 0 to 10.
Cloud Run jobs always use the second generation execution environment.
Cloud Run job default task timeout is 10 minutes; maximum is 168 hours (7 days); GPU tasks have a maximum timeout of 1 hour.
Cloud Run jobs support up to 10 containers per instance (including main container) as sidecars, sharing network namespace.
Cloud Run max instances is a per-revision limit, not per service; traffic splits and deployments can cause total instances to exceed the per-revision limit.
When Cloud Run max instances is reached, requests queue for up to 3.5× average container startup time or 10 seconds (whichever is greater); excess requests return HTTP 429.
Cloud Run pricing depends on region tier: Tier 1 (cheaper, includes us-central1, europe-west1) and Tier 2 (more expensive, includes Sydney, London, Frankfurt, São Paulo).
Cloud Run is regional — infrastructure runs in a specific region, redundantly available across all zones within that region.
Request-based billing is the default for Cloud Run services; instance-based billing must be opted into.
Cloud Run revisions are immutable; image tags are resolved to a digest at deploy time, and the revision always serves that specific digest.
Cloud Run scaling from zero can only be triggered by a request, not by CPU; CPU-only workloads with instance-based billing cannot self-wake without min instances > 0 or a wake-up request.
If a Cloud Run environment variable secret fails to load, the instance will not start; volume mount failures only surface at read time.
Cloud Run environment variable secrets are resolved once at instance startup time; Google recommends pinning to a specific version rather than latest.
Cloud Run's Secret Manager integration creates fail-fast startup semantics: the recommended production pattern (API access with pinned versions) defers resolution to runtime, but env-var-bound secrets that fail to load prevent instance startup entirely — forcing a choice between startup reliability and the best-practice access pattern.
Cloud Run services reference Secret Manager secrets at startup, failing fast if secrets are unavailable. When secrets are rotated, running instances retain the old version until new instances are created, creating a window where old and new secret versions coexist across the fleet.
The Cloud Run service account needs roles/secretmanager.secretAccessor on each referenced secret, verified at deployment time.
Cloud Run volume-mounted secrets fetch the secret value from Secret Manager on every read, making them compatible with secret rotation.
In Cloud Run first-gen single-container, the container identity owns the secret volume; in second-gen and first-gen multi-container, root owns the volume.
Cloud Run service names must be 49 characters or less, unique per region and project, and cannot be changed after creation.
Cloud Run services are deployed as private by default, requiring authentication credentials in requests.
gcloud run services replace overwrites the entire service configuration — changes made via Console or gcloud CLI can be lost.
Cloud Run services provide a unique HTTPS endpoint on *.run.app with managed TLS, WebSockets, HTTP/2, and gRPC support.
For Cloud Run instances with sidecars (multiple containers), instance CPU allocation is the sum of all container CPU limits.
Cloud Run source-based deployment uses Google Cloud buildpacks (open source) and supports Go, Node.js, Python, Java, .NET, and Ruby.
Cloud Run has three resource types: services (HTTP request handling), jobs (batch tasks to completion), and worker pools (pull-based background processing, in Preview).
Cloud Run services have two billing modes: request-based (charged only during request processing + per-request fee) and instance-based (charged for full instance lifetime, no per-request fee).
--update-secrets adds/updates secrets; --set-secrets replaces all secrets; --clear-secrets removes all; --remove-secrets removes specific ones.
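The difference between the four flags is easiest to see as operations on a mapping. This is a conceptual model using plain dicts, not gcloud code; the function names and the secret references are hypothetical.

```python
def update_secrets(current: dict, changes: dict) -> dict:
    """--update-secrets: add or overwrite, keep everything else."""
    return {**current, **changes}


def set_secrets(current: dict, new: dict) -> dict:
    """--set-secrets: replace the whole set."""
    return dict(new)


def remove_secrets(current: dict, names: list[str]) -> dict:
    """--remove-secrets: drop only the named entries."""
    return {k: v for k, v in current.items() if k not in names}


def clear_secrets(current: dict) -> dict:
    """--clear-secrets: drop everything."""
    return {}


current = {"DB_PASS": "db-pass:3", "API_KEY": "api-key:1"}
```

The practical risk is using the set operation where update was intended, which silently drops every secret not named in the command.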
Single-threaded apps on multi-vCPU Cloud Run instances cause vCPU hotspots where average CPU looks low but one core is maxed, misleading the CPU autoscaler.
Serverless VPC Access connectors require provisioned Compute Engine VM instances that add cost and maintenance overhead.
Cloud Run VPC egress should use Direct VPC egress over Serverless VPC Access connectors: direct egress requires no connector VMs (avoiding Compute Engine cost and maintenance), has lower latency, and both methods handle only outbound traffic.
Both Direct VPC egress and Serverless VPC Access connectors handle only outbound traffic from Cloud Run; inbound from VPC routes through a load balancer.
Cloud Run worker pools do not autoscale automatically (require manual scaling or custom autoscaler) and have no public HTTP endpoint or URL.
Cloud Run worker pools manage rollouts by splitting instances between revisions, not by splitting traffic.
Cloud SQL Auth Proxy is the recommended secure connection method for applications connecting to Cloud SQL instances.
Cloud SQL backups are encrypted by default using Google-managed or Customer-Managed Encryption Keys (CMEK).
All Cloud SQL backups after the first are incremental; when the oldest backup is deleted, the next oldest expands into a full backup.
Cascading replicas support up to 4 levels in the hierarchy (including primary) and up to 8 siblings per parent.
Cloud SQL retains deleted instance backups for 4 days; Customer Care can restore with original IP address.
Enhanced backups use Backup and DR Service for centralized management; standard backups are stored in the same project as the instance.
Cloud SQL Enterprise Plus edition supports sub-second downtime during maintenance (for MySQL and PostgreSQL).
Cloud SQL final backup default retention is 30 days; customizable 1–365 days (standard) or 1 day–10 years (enhanced).
Cloud SQL HA instances cost 2x a standalone instance (CPU, RAM, and storage).
Cloud SQL HA doubles infrastructure cost for automatic failover but the standby instance is idle — it cannot serve read queries, making read replicas in a third zone necessary for both read scaling and maximum resilience.
Cloud SQL HA failover takes approximately 60 seconds of downtime.
After Cloud SQL HA failover, the original primary is destroyed and recreated as the new standby.
Best practice is to place Cloud SQL read replicas in a third zone, different from both the primary and standby zones.
Cloud SQL HA and read replicas are complementary, not interchangeable: HA doubles cost with an idle standby for automatic failover, while replicas are strictly read-only without failover capability or independent backups — achieving both availability and read scaling requires deploying both patterns, ideally with replicas in a third zone.
Cloud SQL HA uses a shared static IP; applications reconnect using the same IP/connection string after failover.
Cloud SQL HA uses a standby VM in a different zone within the same region (not a different region).
The Cloud SQL HA standby instance cannot serve read queries.
Cloud SQL HA uses synchronous replication to regional persistent disk, not asynchronous replication.
Each Cloud SQL instance runs on a dedicated VM with an attached persistent disk and a static IP address.
Cloud SQL legacy HA (failover replica-based) was deprecated January 13, 2025; automatic migration to regional persistent disk HA began May 1, 2025.
On-demand Cloud SQL backups (standard option) are retained indefinitely until explicitly deleted.
Private IP cannot be removed from a Cloud SQL instance once configured.
Configuring private IP on an existing Cloud SQL instance causes a restart with a few minutes of downtime.
Cloud SQL private IP connectivity inherits all VPC peering limitations: non-transitivity restricts multi-network reach, non-RFC1918 ranges need manual authorization, and the required private services access creates an implicit peering dependency.
Non-RFC 1918 address ranges are not automatically authorized for Cloud SQL private IP connections and are not learned via peering by default.
A Cloud SQL replica inherits its private IP status and VPC peering configuration from the primary instance.
Cloud SQL private IP setup requires roles/compute.networkAdmin for managing private services access and the Service Networking API must be enabled.
Cloud SQL private IP connectivity requires a VPC network; public IP is internet-accessible.
Each Google Cloud service creates a /24 subnet from the customer's allocated IP range; a /24 supports approximately 50 Cloud SQL instances.
Cloud SQL private IP uses private services access to create private connections between a customer's VPC and Google's service producer VPC.
VPC Network Peering used by Cloud SQL private IP is not transitive — only directly peered networks can communicate.
Cloud SQL private IP connectivity inherits VPC peering constraints (non-transitivity, 25 peering limit, non-RFC1918 allocation restrictions) from private services access, adding networking complexity beyond what the database service itself requires.
Production Cloud SQL architecture requires concurrent investment across three dimensions: HA doubles cost with an idle standby that cannot serve reads, read replicas add scaling but block backup restores and cannot failover, and private networking inherits VPC peering constraints (non-transitivity, no overlapping subnets) — making the path from standalone instance to production-ready database a multiplicative increase in cost, complexity, and networking constraints.
Cloud SQL production architecture has simultaneously high cost floor and low scaling ceiling: the triple investment baseline (2x HA cost with idle standby, read replicas for scaling, private networking with peering constraints) creates a significant minimum infrastructure cost, while read scaling is capped at 10 direct replicas that cannot be independently backed up and provide no failover — bounding both the entry cost and maximum throughput of a production deployment.
A Cloud SQL primary instance cannot be deleted until all of its replicas have been promoted or deleted.
A Cloud SQL primary instance cannot be restored from backup while it has replicas.
Cloud SQL read replicas provide constrained read scaling: maximum 10 direct replicas per primary, replicas cannot be backed up independently, replicas are strictly read-only with no failover capability, and creating replicas requires backups and binary logging enabled on the primary.
Cloud SQL for MySQL uses GTID-based replication for all replication; it cannot be disabled.
Maximum of 10 direct read replicas per Cloud SQL primary instance; cascading replicas allow scaling beyond this limit.
Cloud SQL replicas do not support backups; backups are maintained on the primary instance only. Promoted replicas need their own backup configuration.
Backups cannot be configured on Cloud SQL read replicas.
Cloud SQL read replicas are read-only and do not provide failover — HA configuration is required for failover.
Creating a read replica requires automated backups enabled, binary logging (PITR) enabled, and at least one backup after binary logging was enabled.
Cloud SQL restores can only target instances with the same database version as when the backup was taken.
Cloud SQL supports exactly three database engines: MySQL, PostgreSQL, and SQL Server.
Cloud SQL has three system update types: hardware updates (no downtime, live migration), online updates (no downtime), and maintenance updates (require restart, cause downtime).
Cloud KMS Autokey automatically creates Cloud HSM keys by default; Autokey keys are functionally identical to manually created Cloud HSM keys.
Destroying a CMEK makes all data encrypted by it permanently inaccessible, enabling crypto-shredding for off-boarding or security remediation.
CMEK uses server-side, symmetric, envelope encryption with customer-controlled 256-bit AES-GCM keys; key material never leaves the Cloud KMS system boundary.
CMEK key ring location should be geographically near protected resources; using a separate project for keys supports separation of duties.
Revoking the service agent's CryptoKey Encrypter/Decrypter role, or disabling/destroying the CMEK key, makes data inaccessible — some services can experience permanent data loss if the key remains inaccessible too long.
Each project's service agent performs CMEK encrypt/decrypt operations; end users do not need the CryptoKey Encrypter/Decrypter role to access CMEK-protected resources.
Dedicated Interconnect 99.9% SLA requires at least 2 connections in the same metro but in different edge availability domains.
Dedicated Interconnect 99.99% SLA requires at least 4 connections: 2 in each of 2 different metros, each pair in different edge availability domains.
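The placement rules for the two SLA tiers can be checked against a list of connections. This sketch tests only the metro and edge availability domain placement described in the two notes above; real SLA eligibility may involve further requirements that are not modeled. The function name and the metro and domain labels are hypothetical.

```python
from collections import defaultdict
from typing import Optional


def sla_tier(connections: list[tuple[str, str]]) -> Optional[str]:
    """connections: (metro, edge_availability_domain) per connection."""
    domains_by_metro: dict[str, set[str]] = defaultdict(set)
    for metro, domain in connections:
        domains_by_metro[metro].add(domain)

    # A metro is redundant when it has connections in 2+ distinct domains.
    redundant_metros = sum(
        1 for domains in domains_by_metro.values() if len(domains) >= 2
    )
    if redundant_metros >= 2:
        return "99.99%"
    if redundant_metros == 1:
        return "99.9%"
    return None
```

Two connections in the same metro and the same edge availability domain return None, because they share a maintenance domain and meet neither tier.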
Dedicated Interconnect supports 10-Gbps (10GBASE-LR), 100-Gbps (100GBASE-LR4), and 400-Gbps (400GBASE-LR4) circuits, all single-mode fiber with max 10 km fiber length.
Google API client library traffic always uses 1440 MTU regardless of the VLAN attachment MTU setting on Dedicated Interconnect.
Jumbo frames (8896 MTU) on Dedicated Interconnect are only supported on unencrypted VLAN attachments.
LACP is required for Dedicated Interconnect even for a single circuit.
Dedicated Interconnect link type (e.g., 10G, 100G, 400G) cannot be changed after connection creation — a new connection must be created.
Dedicated Interconnect supports 1–8 circuits per connection, yielding max 80 Gbps (10G), 800 Gbps (100G), or 3200 Gbps (400G) per connection.
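The per-connection maximums follow directly from circuits multiplied by link speed. A minimal sketch of that arithmetic, with a hypothetical helper name and link-type labels:

```python
# Link speed in Gbps for each Dedicated Interconnect link type.
LINK_GBPS = {"10G": 10, "100G": 100, "400G": 400}

def connection_capacity_gbps(link_type, circuits):
    """Return the aggregate capacity of one connection in Gbps."""
    if link_type not in LINK_GBPS:
        raise ValueError(f"unknown link type: {link_type!r}")
    if not 1 <= circuits <= 8:
        raise ValueError("a connection supports 1 to 8 circuits")
    return LINK_GBPS[link_type] * circuits
```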
Deleting the system-generated default route (0.0.0.0/0) does not fully isolate a VPC from the internet because external passthrough NLB paths are independent of it.
Dynamic routing mode (regional vs global) determines whether Cloud Router dynamic routes apply in one region or all regions.
Coordinated (Cloud KMS-managed) EKM keys require a VPC connection; manual mode is required for internet connections.
Cloud EKM symmetric encryption uses both internal key material (Cloud KMS) and external key material (EKM) — losing either makes data permanently unrecoverable.
Cloud EKM external key material never leaves the customer's EKM and is never sent to Google.
Key Access Justifications optionally integrates with Cloud EKM so each request includes a reason code, and the EKM can allow or deny based on justification.
Automatic rotation is not supported for manually managed external keys; only coordinated EKM keys support automatic rotation.
Cloud EKM has two protection levels: EXTERNAL (internet) and EXTERNAL_VPC (VPC); VPC mode is available in regional locations only, not multi-regional.
External passthrough NLB backends must be in the same region and project but can be in different VPC networks.
External passthrough Network Load Balancers use direct server return (DSR) — responses bypass the load balancer and go directly from backend VMs to clients.
External passthrough NLB IPv6 support requires Premium Tier and uses /96 ranges from external dual-stack subnets.
Only one L3_DEFAULT forwarding rule is allowed per IP address; it acts as a fallback when more specific forwarding rules exist.
External passthrough NLB instance groups always use nic0; zonal NEGs (GCE_VM_IP) can target any network interface.
External passthrough NLB traffic steering uses sourceIPRanges on forwarding rules (up to 64 ranges) with a mandatory parent forwarding rule.
External IPv6 addresses on VPC subnets require Premium Tier networking.
Arm-based instances (N4A, C4A, T2A) have no SMT — each vCPU equals one physical core.
MIG autohealing health checks should be more conservative than load balancing health checks; autohealing triggers VM recreation while LB health checks only route traffic away.
Bare metal instances are identified by machine type names ending in -metal and run without a hypervisor.
Community-supported OS images (Fedora, FreeBSD, openSUSE) have no licensing charges but no Google support.
Container deployment on MIGs uses Container-Optimized OS with Docker, provisioned automatically when a container image is specified in the instance template.
Custom OS images incur image storage charges in the project that owns them.
Custom OS images incur only storage charges; premium OS images carry an additional cost.
Custom machine types are only available for N and E series, with a 5% price premium over predefined types.
Block storage is billed for provisioned capacity from creation to deletion, even if the disk is unattached or the VM is stopped.
Durable disks can be attached to and detached from running instances without downtime.
Durable block storage is encrypted at rest and in transit by default, with support for customer-managed encryption keys.
Durable block storage (Hyperdisk and Persistent Disk) are network-attached devices transmitting data over Google's networks, not physically attached like Local SSD.
E2 is the lowest-cost machine series; the processor (Intel or AMD) is auto-selected by Google.
GCE provides a fully ephemeral high-performance workload tier by combining Spot/preemptible VMs (possible preemption, no live migration, configurable stop-or-delete) with Local SSD (data lost on any VM lifecycle event, cannot be boot volume), creating a distinct operational model suitable only for stateless batch processing, cache warming, or ML training with external checkpointing.
Compute Engine has five machine families: General-purpose, Compute-optimized, Storage-optimized, Memory-optimized, and Accelerator-optimized.
Google recommends Hyperdisk over Persistent Disk for new workloads; Persistent Disk is not available with the latest machine series.
GPU mapping: A4/A4X → B200/B300, A3 → H100/H200, A2 → A100, G2 → L4, G4 → RTX PRO 6000.
HPC-optimized images are located in the cloud-hpc-image-public project.
Hyperdisk performance (IOPS, throughput) can be configured independently of disk size, unlike Persistent Disk where performance is tied to provisioned capacity.
Hyperdisk supports dynamic resize and configurable performance, and it offers higher performance than Persistent Disk.
Image families always point to the latest non-deprecated image version; rollback works by deprecating the current latest image.
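The family-resolution and rollback behavior can be modeled in a few lines. This is an illustrative sketch only: the Image class and its integer creation stamp are hypothetical stand-ins, not Compute Engine API objects.

```python
from dataclasses import dataclass

@dataclass
class Image:
    name: str
    created: int            # hypothetical creation stamp; larger means newer
    deprecated: bool = False

def resolve_family(images):
    """Return the newest non-deprecated image in a family, or None."""
    candidates = [image for image in images if not image.deprecated]
    if not candidates:
        return None
    return max(candidates, key=lambda image: image.created)

# Rollback: deprecate the current latest image so the family resolves
# to the previous version.
family = [Image("app-v1", 1), Image("app-v2", 2), Image("app-v3", 3)]
latest = resolve_family(family)
latest.deprecated = True
rolled_back = resolve_family(family)
```

After the deprecation step, the family resolves to app-v2 without deleting app-v3.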
All Compute Engine instances default to UTC time zone regardless of region.
Each instance network interface must be in a subnet of a unique VPC network — two NICs cannot be in the same VPC.
Compute Engine instances are zonal resources — a zone must be specified at creation time.
Local SSD cannot be used as a boot volume; boot volumes require durable block storage (Hyperdisk or Persistent Disk).
Local SSD data is lost on VM stop, suspend, restart, crash, or host failure — it is ephemeral storage.
Local SSDs in GCE are ephemeral storage that cannot serve as boot volumes — data is lost on instance stop, providing a storage tier with strict lifecycle coupling to the instance.
Machine type naming follows {series}-{ratio}-{vcpus} pattern, with optional -lssd (local SSD) or -metal (bare metal) suffixes.
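A name in this pattern can be split with a regular expression. The sketch below covers only the basic {series}-{ratio}-{vcpus} form with the two optional suffixes; names that deviate from it (for example shared-core types) are rejected. The function name and the returned dictionary keys are hypothetical.

```python
import re

# {series}-{ratio}-{vcpus} with an optional -lssd or -metal suffix.
_MACHINE_TYPE = re.compile(
    r"^(?P<series>[a-z0-9]+)-(?P<ratio>[a-z]+)-(?P<vcpus>\d+)"
    r"(?:-(?P<suffix>lssd|metal))?$"
)

def parse_machine_type(name):
    """Split a machine type name into its components."""
    match = _MACHINE_TYPE.match(name)
    if match is None:
        raise ValueError(f"unrecognized machine type name: {name!r}")
    parts = match.groupdict()
    return {
        "series": parts["series"],
        "ratio": parts["ratio"],
        "vcpus": int(parts["vcpus"]),
        "local_ssd": parts["suffix"] == "lssd",
        "bare_metal": parts["suffix"] == "metal",
    }
```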
The HTTPS MDS client identity certificate is valid for 7 days and must be refreshed.
GCE custom metadata keys are case-sensitive.
The GCE metadata server is always accessible regardless of firewall rules, never routes requests off the physical host, but HTTPS access requires client certificate rotation every 7 days — security depends on timely credential refresh, not network controls.
HTTPS metadata server endpoints are only available on Shielded VMs.
On Linux, GCE HTTPS MDS certificates are stored at /run/google-mds-mtls/ (root.crt and client.key).
Requests to the GCE metadata server never leave the physical host running the VM.
The HTTPS MDS self-signed root certificate (valid 50 years) is rotated on every instance stop/start cycle.
The GCE metadata server is accessible at the link-local IPv4 address 169.254.169.254.
When the same custom metadata key exists at both project and zonal levels, the zonal value takes precedence for VMs in that zone.
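The precedence rule, together with case-sensitive keys, behaves like a dictionary merge in which zonal entries overwrite project entries. A minimal sketch with hypothetical keys and values; it models only the project and zonal levels named above.

```python
def effective_metadata(project_metadata, zonal_metadata):
    """Merge custom metadata so that zonal values win on key conflicts."""
    merged = dict(project_metadata)
    merged.update(zonal_metadata)  # zonal value takes precedence
    return merged

# Keys are case-sensitive, so "env" and "Env" are distinct entries.
project_level = {"env": "prod", "Env": "legacy"}
zonal_level = {"env": "staging"}
result = effective_metadata(project_level, zonal_level)
```

Here the zonal "env" replaces the project-level "env", while "Env" is untouched because it is a different key.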
Default VM limits are 1,000 for zonal MIGs and 2,000 for regional MIGs.
Compute Engine has a minimum uptime SLO of 99.9%.
There is no additional charge for instance groups themselves; billing is only for the underlying resources.
Oracle Linux on GCE is partner-supported (by Oracle, not Google) with no licensing charges.
OS Login integrates SSH key management with IAM roles, providing admin vs non-admin access control.
Predefined machine type memory ratios: highcpu ~2GB/vCPU, standard ~4GB/vCPU, highmem ~8GB/vCPU, megamem ~14GB/vCPU, ultramem 24-31GB/vCPU, hypermem ~16GB/vCPU.
Preemptible VMs receive an ACPI G2 Soft Off signal giving up to 30 seconds (best effort) for shutdown scripts before forced termination.
Standard GPU instances get 1-hour advance notice before maintenance; preemptible GPU instances do not.
Compute Engine always terminates preemptible VMs after 24 hours of continuous running.
Preemptible VMs are not billed for instance usage if preempted within 1 minute of creation (premium OS charges still apply).
Preemptible VMs cannot live migrate and cannot be set to automatically restart on maintenance events.
Premium OS image costs are not discounted on preemptible VMs.
Stopping and starting a preemptible VM resets the 24-hour counter; rebooting or resetting does not.
Public OS images are available to all GCP projects by default with no special configuration needed.
Queue-based autoscaling using Pub/Sub is only available for zonal MIGs, not regional MIGs.
Regional MIGs deploy instances across multiple zones in a region, protecting against zonal failure; zonal MIGs do not.
Sole-tenant nodes are billed for node vCPU/memory with no extra charge for VMs on the node; GPUs and Local SSD are billed separately.
CPU overcommit on sole-tenant nodes is supported only on N1, N2, N2D, and N4 series.
GCE sole-tenant nodes provide physical server isolation with significant operational constraints: one-to-one physical server mapping ensures no co-tenancy, but nodes are billed at node level regardless of VM utilization, preemptible VMs are not supported, and maintenance windows are immutable 4-hour blocks fixed at creation.
Migrate-within-node-group policy requires 1 holdback node per 20 nodes, with a minimum of 2 nodes in the group.
Sole-tenant maintenance windows are 4-hour blocks set at node group creation and cannot be changed afterward.
Preemptible VMs are not supported on sole-tenant nodes.
Each sole-tenant node maps one-to-one to a physical server dedicated to a single project's VMs.
Sole-tenant nodes support three host maintenance policies: default (migrate to any node), restart in place (same physical server, ~1hr downtime), and migrate within node group.
Preemptible VMs have strict operational constraints: 24-hour maximum lifetime, no live migration, and no automatic restart on maintenance. Spot VMs offer similar cost savings (up to 91% discount) with flexible termination actions (stop or delete) but no mandatory maximum runtime.
Once dedicated preemptible quota is granted in a region, Spot VMs in that region must use it and cannot fall back to standard quota.
Unlike preemptible VMs (24-hour limit), Spot VMs have no mandatory maximum runtime.
Spot VM prices can change up to once per day.
Spot VM termination action is configurable: STOP (default, VM goes to TERMINATED) or DELETE (VM is removed).
Spot VMs are not supported on A4X machine series or bare metal instances.
Spot VMs offer up to 91% discount compared to standard VM pricing.
Stateful MIGs preserve instance name, attached persistent disks, and metadata across restart, recreation, and update events.
Unmanaged instance groups do not support autoscaling, autohealing, rolling updates, multi-zone deployment, or instance templates — they only serve as load balancing targets.
Compute Engine VMs run on a KVM hypervisor.
Memory-optimized X4 machine type scales up to 32 TB of memory and 1,920 vCPUs (bare metal).
Z3 storage-optimized series provides up to 72 TiB Titanium SSD on bare metal.
For new external passthrough NLB deployments, backend services are recommended over target pools.
Cloud Armor advanced network DDoS protection is only available for external passthrough Network Load Balancers.
Google Cloud Load Balancing requires no pre-warming — scales from zero to full traffic in seconds and handles instantaneous traffic spikes.
Passthrough load balancers preserve client source IP; proxy load balancers do not (backends see the proxy's IP).
Google Cloud LBs use: GFE for classic ALB/proxy NLB, Envoy-based GFE for global external ALB/proxy NLB, Envoy for regional/cross-region LBs, Maglev for external passthrough NLB, and Andromeda for internal passthrough NLB.
Assured Workloads is the key GCP product for regulatory and compliance needs.
GCP data protection is a per-service engineering effort, not a platform abstraction: Cloud SQL requires triple investment (HA + replicas + private networking) with each dimension independently constrained, while GCS requires defense in depth across three orthogonal dimensions (object immutability with versioned recovery, namespace security with organizational controls, four-tier encryption) — neither service's protection model transfers to the other.
GCP immutability is comprehensive across both networking infrastructure (HA VPN stack type, Cloud VPN ciphers, Dedicated Interconnect link type, IPv6 access type) and service infrastructure (Artifact Registry format/mode, KMS key rings, Workload Identity pool format, Cloud SQL private IP), requiring correctness-at-creation across all resource layers since no corrective post-creation path exists for any of these configurations.
Multiple GCP services enforce immutable-after-creation configuration that cannot be corrected without resource recreation: Artifact Registry locks format and mode, KMS cannot delete key rings or change key type/purpose/protection, GKE Workload Identity pool is not deletable, and Cloud SQL private IP cannot be removed — establishing a cross-service pattern where initial design mistakes are permanently embedded in the resource hierarchy.
GCP networking configuration is broadly immutable after resource creation: HA VPN gateway stack type (IPv4/IPv6), Cloud VPN cipher options, Dedicated Interconnect link type (10G/100G/400G), and subnet IPv6 access type (internal/external) all require resource deletion and recreation to change, making network architecture decisions permanent commitments that cannot be corrected in place.
GCP observability has systematic blind spots at security boundaries despite robust defaults: VPC Flow Logs miss ingress-denied packets while capturing egress-denied ones (creating a firewall visibility gap exactly at the attack surface), and Cloud Logging export has temporal gaps (sinks not retroactive, Cloud Storage sink latency in hours). Combined, these mean that network-layer security incidents at the ingress boundary may have neither real-time flow data nor timely log exports for forensic reconstruction.
GCP provides robust default observability without explicit instrumentation: Cloud Monitoring auto-collects metrics for most services without agents, Cloud Logging routes entries through per-resource Log Routers with immutable _Required sinks for compliance, and VPC Flow Logs introduce no performance impact — baseline visibility is available out-of-box, and the compliance audit trail cannot be disabled.
Autoscaling serves dual purposes: performance predictability (prevents overload) and cost reduction (removes idle resources).
The performance optimization cycle has four stages: define requirements, design and deploy, monitor and analyze, optimize — and it is continuous.
Performance requirements should be defined per layer of the application stack, not just at the application level.
GCP private data services like Cloud SQL and Memorystore require private IP connectivity that inherits VPC peering constraints (non-transitivity, peering limits), creating a connectivity overhead for each private data service adopted.
The reliability pillar has four focus areas: Scoping, Observation, Response, and Learning.
The reliability pillar has nine core principles including: define reliability based on UX goals, set realistic targets, resource redundancy, horizontal scalability, observability, graceful degradation, test recovery from failures, test recovery from data loss, and conduct postmortems.
Reliability is an organizational concern shared by dev, product, ops, platform eng, SRE, and business teams — not just engineering.
Reliability is consistent performance of intended functions plus uninterrupted service; resilience is the ability to withstand and recover from failures (a subset of reliability).
KMS key rotation and Secret Manager version pinning together enable zero-downtime secret and key rotation: KMS rotation creates new versions without re-encrypting existing data (ciphertext self-identifies its key version), and the recommended Secret Manager access pattern (API-direct with pinned version numbers) allows applications to transition between secret versions without restart — achieving non-disruptive credential rotation across the platform.
The security pillar has eight focus areas: infrastructure security, IAM, data security, AI/ML security, SecOps, application security, governance/risk/compliance, and logging/auditing/monitoring.
The security pillar has seven core principles: security by design, zero trust, shift-left security, preemptive cyber defense, use AI securely, use AI for security, and regulatory/compliance/privacy.
Cloud security uses a shared responsibility model — Google secures infrastructure; the customer secures workloads, data, and access.
Shift-left security tools in GCP include Cloud Build, Binary Authorization, and Artifact Registry.
The Well-Architected Framework applies to cloud-native, migration, hybrid, and multi-cloud scenarios — not just greenfield cloud builds.
Decoupled architectures enable independent upgrades, security controls, reliability goals, health monitoring, and granular cost/performance control.
Well-Architected Framework Perspectives are cross-pillar views for specific technologies, domains, or sectors — they are not additional pillars.
The Google Cloud Well-Architected Framework has six pillars: Operational Excellence, Security/Privacy/Compliance, Reliability, Cost Optimization, Performance Optimization, and Sustainability.
Stateless architecture enables reliability and scalability by allowing applications to scale quickly with minimum boot dependencies and withstand hard restarts.
Zero trust is implemented via Chrome Enterprise Premium and Identity-Aware Proxy (IAP).
All Cloud Storage classes share 99.999999999% (eleven 9s) annual durability.
Archive storage class provides millisecond-latency access, unlike competing cloud providers' cold tiers which may take hours/days.
Bucket names containing dots require domain ownership verification.
Bucket name and location are set at creation and cannot be changed afterward.
GCS bucket names are in a single global namespace — every name must be globally unique and is publicly visible.
Bucket names must be 3–63 characters, lowercase letters/numbers/dashes/underscores/dots only, cannot begin with goog, cannot contain google, and cannot be an IP address in dotted-decimal notation.
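The naming rules listed above translate into a straightforward validator. This sketch checks only those rules and nothing beyond them; the function name and the problem messages are hypothetical.

```python
import ipaddress
import re

# Lowercase letters, numbers, dashes, underscores, and dots only.
_ALLOWED_CHARS = re.compile(r"^[a-z0-9._-]+$")

def bucket_name_problems(name):
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if not 3 <= len(name) <= 63:
        problems.append("length must be 3-63 characters")
    if not _ALLOWED_CHARS.match(name):
        problems.append("invalid characters")
    if name.startswith("goog"):
        problems.append("must not begin with 'goog'")
    if "google" in name:
        problems.append("must not contain 'google'")
    try:
        ipaddress.IPv4Address(name)
    except ValueError:
        pass  # not a dotted-decimal IP address, which is what we want
    else:
        problems.append("must not be an IP address in dotted-decimal notation")
    return problems
```

Returning every violation, rather than stopping at the first, makes it easier to correct a rejected name in one pass.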
GCS buckets cannot be nested inside other buckets.
Two ways to change an object's storage class: rewrite the object or use Object Lifecycle Management.
Changing a bucket's default storage class does not affect existing objects — they retain their original class.
Client-side encrypted data is also encrypted server-side, resulting in layered encryption.
CMEK supports three key storage backends: software, HSM, and external (EKM).
CMEK keys are stored in Cloud KMS (Google stores them, customer manages them); CSEK keys are provided per-request and never stored by Google.
Credential Access Boundaries downscope OAuth 2.0 tokens to limit access to specific buckets and permission sets.
GCS data protection requires defense in depth across three independent dimensions: object-level protection (immutable objects with versioned recovery but noncurrent versions readable and bucket deletion unprotected), namespace-level security (globally unique bucket names enabling enumeration with parallel IAM/ACL surfaces), and encryption tiering (four levels from Google-managed to client-side with increasing control and decreasing unconditional durability) — each dimension addresses a different threat model and no single mechanism provides complete protection.
All GCS data is encrypted at rest by default using Google-managed keys at no additional charge.
The default storage class for a bucket is Standard if not specified at creation.
After bucket deletion, anyone can reuse the name (usually within seconds); deleting the project may hold the name for weeks or longer.
Disabling Object Versioning stops new noncurrent versions from accumulating but does not delete existing noncurrent versions.
The four GCS encryption options in order of increasing customer control: Default → CMEK → CSEK → Client-side.
GCS provides a four-tier encryption model (default, CMEK, CSEK, client-side) with increasing customer control, all built on always-on default encryption — but customer-managed keys (CSEK/client-side) shift the key loss risk entirely to the customer.
Flat namespace GCS buckets have no real folders — folders are simulated via / delimiters in object names.
Cloud Storage FUSE enables mounting buckets as local file systems for standard file I/O.
Generation number #0 in a Cloud Storage resource name refers to the most recent version of an object.
Hierarchical namespace (HNS) enabled buckets provide up to 8x higher initial QPS for reads/writes compared to flat-namespace buckets.
IAM and ACLs operate in parallel on GCS — if either system grants access, the user gets access.
AbortIncompleteMultipartUpload lifecycle action can only use age, matchesPrefix, and matchesSuffix conditions.
Lifecycle condition age: 0 is satisfied at midnight UTC after object creation, not immediately.
Lifecycle configuration changes can take up to 24 hours to take effect.
When both Delete and SetStorageClass lifecycle rules match simultaneously, Delete takes precedence.
GCS lifecycle management creates an economics-complexity tension: storage class pricing strongly incentivizes automated transitions (no retrieval fees for Standard/Rapid, no rewrite on SetStorageClass, minimum durations creating cost penalties for wrong placement), but lifecycle rule evaluation has complex precedence (Delete beats SetStorageClass), timing (age 0 means midnight UTC, not creation time), and hold semantics (holds block deletion even when conditions match) that can produce unexpected results.
Objects under holds or retention policies are not deleted by lifecycle rules even if conditions are met.
GCS lifecycle rules have complex evaluation semantics: Delete takes precedence over SetStorageClass when both match, holds and retention policies block lifecycle deletions, age-0 triggers at midnight UTC (not immediately), and each rule uses multi-condition AND logic.
Each lifecycle rule has exactly one action and one or more conditions; an object must match all conditions in a rule for the action to fire.
SetStorageClass lifecycle action does not rewrite the object — no retrieval fees, early deletion fees, or inter-region replication charges.
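The evaluation semantics described in the lifecycle entries above (AND logic within a rule, Delete taking precedence over SetStorageClass, holds blocking deletion) can be sketched as follows. The rule dictionaries mirror the shape of a lifecycle configuration, but the evaluator itself is a simplified illustration, not the service's implementation. Two behaviors are assumptions made for the sketch: no action is taken when a hold blocks a matching Delete, and the first matching SetStorageClass rule is used when several match.

```python
from dataclasses import dataclass

@dataclass
class ObjectState:
    name: str
    age_days: int
    on_hold: bool = False  # a hold or an unexpired retention policy

def rule_matches(condition, obj):
    """All conditions in a rule must match (AND logic)."""
    if "age" in condition and obj.age_days < condition["age"]:
        return False
    if "matchesPrefix" in condition and not obj.name.startswith(
        tuple(condition["matchesPrefix"])
    ):
        return False
    if "matchesSuffix" in condition and not obj.name.endswith(
        tuple(condition["matchesSuffix"])
    ):
        return False
    return True

def pick_action(rules, obj):
    """Return the action that would apply to obj, or None."""
    matched = [r["action"] for r in rules if rule_matches(r["condition"], obj)]
    deletes = [a for a in matched if a["type"] == "Delete"]
    if deletes:
        # Delete beats SetStorageClass; holds block the deletion.
        # Assumption: nothing else is applied when the delete is blocked.
        return None if obj.on_hold else deletes[0]
    class_changes = [a for a in matched if a["type"] == "SetStorageClass"]
    # Assumption: take the first matching SetStorageClass rule.
    return class_changes[0] if class_changes else None

rules = [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30}},
    {"action": {"type": "Delete"},
     "condition": {"age": 365, "matchesPrefix": ["logs/"]}},
]
```

With these rules, a 400-day-old object under logs/ matches both rules and is deleted, while an object of the same age under another prefix only changes class.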
Losing a CSEK or client-side encryption key results in unrecoverable data; objects remain stored and billed until explicitly deleted.
Managed folders are resource overlays used solely for granting/revoking IAM permissions, not true directory structures.
GCS bucket namespace has a multi-layered security surface beyond IAM: bucket names are globally unique and publicly visible (enabling enumeration), deleted bucket names can be immediately reused by anyone (enabling squatting), and IAM and ACLs operate in parallel where either granting access is sufficient — requiring organizational controls (naming conventions, deletion policies, uniform bucket-level access) to prevent namespace-level exposure.
Rapid and Standard storage classes have no retrieval fees; Nearline, Coldline, and Archive have retrieval fees.
Noncurrent versioned objects are readable; soft-deleted objects are not — this is a key differentiator between the features.
An object is uniquely identified within a bucket by its name combined with a generation number.
GCS objects are immutable with atomic replacement; versioning and soft delete (default 7 days) provide recovery, but noncurrent versions are readable while soft-deleted objects are not — and both features require careful timing (30-second wait after config changes).
Object name maximum size is 1024 bytes in flat namespace, or 512 bytes each for folder name and base name in hierarchical namespace (HNS).
GCS enforces a once-per-second rate limit for replacing the same object; exceeding this yields 429 Too Many Requests.
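A common way to cope with this limit is to retry with exponential backoff when a 429 is returned. The sketch below is generic: TooManyRequests is a hypothetical stand-in for however a client surfaces HTTP 429, and the operation is any callable that performs the replacement.

```python
import random
import time

class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP 429 Too Many Requests response."""

def with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation, retrying on TooManyRequests with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus small jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

The base delay of one second matches the once-per-second limit, and the sleep function is injectable so the helper can be exercised without real waiting.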
Cloud Storage objects are immutable pieces of data stored in buckets.
GCS objects are immutable — they cannot be modified in place; replacement is atomic with the old version served until the new upload completes.
GCS performance engineering requires simultaneous awareness of naming patterns and storage class constraints: sequential object names cause backend hotspotting, hierarchical namespace buckets provide up to 8x higher QPS, individual objects enforce a once-per-second update limit, and Rapid class is restricted to zonal-only buckets with storage size quotas.
Public access prevention blocks access via allUsers and allAuthenticatedUsers principals.
Rapid storage class is zone-only and requires a Rapid Bucket; unlike other classes, it has storage size quotas.
Cloud Storage resource names use the format projects/_/buckets/BUCKET_NAME/objects/OBJECT_NAME.
Sequential object names during large-scale uploads cause hot-spotting on backend servers.
Server-side encryption is enabled by default on all Cloud Storage objects; CMEK and CSEK are supplemental options.
Signed URLs provide time-limited read/write access to GCS objects without requiring the recipient to have a Google account.
Soft delete is enabled by default on all Cloud Storage buckets with a 7-day retention period.
Google recommends soft delete over Object Versioning for protection against accidental/malicious deletions.
GCS storage class economics create a natural incentive for lifecycle automation: Standard and Rapid classes have no retrieval fees while Nearline/Coldline/Archive impose both retrieval fees and minimum storage durations (30/90/365 days), and the SetStorageClass lifecycle action avoids rewrite costs (no retrieval fees, no early deletion fees, no inter-region replication) — making automated downward class transitions via lifecycle rules strictly cheaper than manual object rewrites or delayed transitions.
Minimum storage durations: none for Rapid and Standard, 30 days for Nearline, 90 days for Coldline, 365 days for Archive.
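The cost effect of these minimums can be estimated with simple arithmetic. The sketch assumes that an object removed early is charged as if it had been stored for the full minimum duration; the helper names are hypothetical, and actual pricing should be checked against current documentation.

```python
# Minimum storage duration in days per storage class.
MIN_STORAGE_DAYS = {
    "RAPID": 0,
    "STANDARD": 0,
    "NEARLINE": 30,
    "COLDLINE": 90,
    "ARCHIVE": 365,
}

def early_deletion_days(storage_class, days_stored):
    """Days of storage charged beyond actual storage time (assumed model)."""
    return max(0, MIN_STORAGE_DAYS[storage_class] - days_stored)

def billable_storage_days(storage_class, days_stored):
    """Total days billed: actual days, or the class minimum if larger."""
    return max(days_stored, MIN_STORAGE_DAYS[storage_class])
```

Under this model, an object deleted from Coldline after 60 days is billed for 90 days, 30 of which are the early-deletion penalty.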
GCS operational management requires simultaneous engineering across three largely independent dimensions: storage class economics with lifecycle automation (retrieval fee incentives vs minimum duration penalties), defense-in-depth data protection (object immutability, namespace security, four-tier encryption), and lifecycle rule evaluation complexity (Delete-over-SetStorageClass precedence, hold interactions, 24-hour propagation delay) — optimizing one dimension can conflict with another.
GCS provides unconditional eleven-nines durability across all storage classes with millisecond access latency — including Archive class, which unlike competing cloud cold tiers requires no retrieval delay.
After enabling uniform bucket-level access, there is a 90-day window to revert to fine-grained access; after that, uniform is permanent.
Uniform bucket-level access (IAM-only) is the recommended access model for GCS buckets; it disables ACLs.
Object Versioning is enabled per bucket, not per object; it affects all objects in the bucket.
Object Versioning does not protect against bucket deletion; only soft delete provides that protection.
Wait at least 30 seconds after changing versioning configuration before performing delete/replace operations.
GKE clusters with more than 3,000 Kubernetes service accounts may cause metadata server pod OOM kills.
GKE alpha clusters for testing unstable Kubernetes features are available in Standard mode only.
AppArmor default Docker profile is applied by the container runtime on all GKE containers; SELinux is not supported on GKE.
GKE cluster creation and upgrades pull container images from Artifact Registry (pkg.dev or gcr.io); an outage there can block new cluster creation and upgrades.
GKE Autopilot always uses Container-Optimized OS and always has Workload Identity Federation enabled.
GKE Autopilot nodes always use cos_containerd OS; Standard mode offers multiple OS choices.
GKE Autopilot clusters are always regional and always enrolled in a release channel (default: Regular).
GKE Autopilot bills per Pod resource request; Standard mode bills per node regardless of Pod utilization.
GKE Autopilot provides built-in ComputeClasses: Balanced, Scale-Out, and autopilot-spot.
GKE Dataplane V2 is enabled by default in Autopilot clusters, enabling network policy enforcement and observability.
Extended-duration Pods in Autopilot are protected from eviction for up to 7 days using the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "false".
GKE Autopilot provides a fully managed, opinionated operational model: always regional (never zonal), always cos_containerd (no OS choice), Google-managed control plane, with scale-to-zero capability — reducing operational surface area but removing infrastructure customization options available in Standard mode.
In GKE Autopilot mode, underlying Compute Engine VMs are not visible in gcloud CLI or Cloud console and cannot be accessed via SSH.
GKE Autopilot is safe for multi-team production clusters with fully managed infrastructure, pod-level billing granularity, and opinionated security baselines.
GKE Autopilot SLA covers both the control plane and Pod compute capacity.
GKE Autopilot has two billing models: pod-based billing for general-purpose workloads and node-based billing for workloads selecting specific hardware.
GKE Autopilot clusters can scale to zero nodes when no workloads are running.
GKE beta Kubernetes APIs are available in version 1.24 and later.
Binary Authorization continuous validation (CV) continuously monitors running container images against policies.
All nodes in a GKE node pool are identical; individual nodes cannot be configured independently within a pool.
GKE cluster deletion does not drain nodes gracefully; kubectl drain must be used manually before deleting.
GKE cluster state can be stored in etcd or Spanner, but the etcd API is always exposed regardless of backend.
ClusterIP Service type provides a stable virtual IP and DNS name for internal Pod-to-Pod communication within a GKE cluster.
The GKE control plane is always managed by Google — customers cannot SSH into or directly access it.
The GKE control plane is always fully managed by Google Cloud in both Autopilot and Standard modes.
GKE credential rotation rotates SSL certificates, cluster CA, and control plane IP address.
Critical kube-system Pods (kube-dns, konnectivity-agent) must always have a schedulable node pool; removing all untainted pools causes DNS and control plane failures.
GKE Dataplane V2 is eBPF-based and provides both high performance networking and built-in network policy enforcement.
The default GKE node pool creates 3 nodes per compute zone using cos_containerd image and a general-purpose machine type.
Workload Identity Federation federated access tokens have a default lifetime of 1 hour; client libraries refresh within 3 min 45 sec of expiry.
Gateway API is GKE's recommended approach for advanced L7 traffic management, replacing Ingress for new deployments.
Pods with hostNetwork: true bypass Workload Identity Federation and use the Compute Engine metadata server directly.
Hubble (via Dataplane V2 observability) is the tool for diagnosing dropped packets and misconfigured network policies; it exposes the hubble_drop_total metric.
The same namespace and service account name used across GKE clusters in the same project map to the same IAM identity; mitigate with separate projects or distinct namespace names.
Legacy ABAC should be disabled on GKE clusters; RBAC and IAM should be the only authorization mechanisms.
LoadBalancer Service type provisions a regional external passthrough Network Load Balancer with a public IP.
Multi-cluster Services (MCS) enables cross-cluster and cross-region failover with a single global DNS name and health-aware routing.
The GKE metadata server has a limit of 500 concurrent connections per node.
In multi-zonal GKE clusters, node pools are automatically replicated across all zones, which has quota implications.
GKE network policies default to allow-all; once any policy selects a Pod, traffic for that Pod in the policy's direction (ingress or egress) that no policy allows is denied.
Every node in a GKE node pool is labeled with cloud.google.com/gke-nodepool set to the pool's name.
The node pool's IAM service account is still used for container image pulls even with Workload Identity Federation enabled.
NodeLocal DNSCache caches DNS queries on each node to reduce DNS latency at scale in GKE clusters.
GKE nodes are Compute Engine virtual machine instances.
GKE PodDisruptionBudget respect during node pool deletion must be explicitly configured and is capped at one hour.
principal:// references a single IAM resource; principalSet:// references a set of resources (namespace, cluster) in Workload Identity Federation.
GKE release channels govern automatic Kubernetes version upgrades; without a release channel (Standard only), no automatic upgrades occur.
GKE Sandbox (gVisor) is incompatible with node service account authentication but compatible with Workload Identity Federation.
Shared VPC is the multi-tenant networking pattern where the host project owns networking resources and service projects run GKE clusters.
The workload identity pool format is PROJECT_ID.svc.id.goog and is not deletable even after all clusters are removed.
Workload Identity Federation for GKE is the recommended method for Pods to access Google Cloud APIs, over node service accounts or JSON keys.
GKE Workload Identity Federation is the recommended API access pattern but requires namespace and service account naming discipline: same namespace + SA name across clusters creates identity collisions, and the pool format is permanent (not deletable).
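
The identity-collision point above can be illustrated with a minimal sketch that builds the IAM principal identifiers as plain strings. The identifier layout follows the commonly documented form for GKE and should be checked against current documentation; the project number, project ID, namespace, and service account name are hypothetical values.

```python
def pool_name(project_id: str) -> str:
    """Workload identity pool name; fixed per project."""
    return f"{project_id}.svc.id.goog"


def ksa_principal(project_number: str, project_id: str,
                  namespace: str, ksa: str) -> str:
    """principal:// identifier for one Kubernetes ServiceAccount."""
    return (
        f"principal://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_name(project_id)}"
        f"/subject/ns/{namespace}/sa/{ksa}"
    )


def namespace_principal_set(project_number: str, project_id: str,
                            namespace: str) -> str:
    """principalSet:// identifier covering every ServiceAccount in a namespace."""
    return (
        f"principalSet://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_name(project_id)}"
        f"/namespace/{namespace}"
    )


# The cluster name is not an input, so the same namespace and ServiceAccount
# name in two clusters of one project resolve to the same principal.
cluster_a = ksa_principal("123456789", "my-project", "payments", "api")
cluster_b = ksa_principal("123456789", "my-project", "payments", "api")
print(cluster_a == cluster_b)  # True
```

Because nothing cluster-specific appears in the identifier, separating tenants by project or by namespace name is what keeps their identities distinct.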
HA VPN 99.99% SLA requires tunnels from both HA VPN interfaces (interface 0 and interface 1) to peer gateway interfaces — a single tunnel on one interface is insufficient.
Active-passive routing is recommended for a single HA VPN gateway (consistent bandwidth); active-active is recommended for multiple gateways to avoid bandwidth loss from unused passive tunnels.
When an HA VPN tunnel goes down, expect roughly 40–60 seconds of packet loss while Cloud Router withdraws the learned routes and traffic fails over.
Full mesh is not required on the GCP side for 99.99% SLA — only one tunnel per HA VPN interface to the corresponding peer interface; full mesh may only be required by the peer vendor.
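
As a rough sketch of the interface rule above, the check below takes a list of tunnels and reports whether both HA VPN gateway interfaces carry at least one tunnel. The tuple representation is an assumption made for illustration; it only models the Google-side condition, not peer vendor requirements.

```python
from typing import Iterable, Tuple


def covers_both_interfaces(tunnels: Iterable[Tuple[int, int]]) -> bool:
    """Return True when HA VPN interfaces 0 and 1 each carry a tunnel.

    Each tunnel is a (gateway_interface, peer_interface) pair.
    """
    gateway_interfaces = {gateway for gateway, _peer in tunnels}
    return {0, 1}.issubset(gateway_interfaces)


print(covers_both_interfaces([(0, 0)]))          # False: one interface only
print(covers_both_interfaces([(0, 0), (1, 1)]))  # True: no full mesh needed
```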
HA VPN over Cloud Interconnect allows using regional internal IP addresses for both HA VPN and peer gateways.
HA VPN tunnels must use BGP (dynamic routing) — static routing is not supported for HA VPN.
HA VPN gateway stack type (IPV4_ONLY, IPV4_IPV6, IPV6_ONLY) cannot be changed after creation; the gateway must be deleted and recreated.
VPC-to-VPC HA VPN gateways in the same region achieve 99.99% SLA; gateways in different regions achieve only 99.9% SLA.
The IAM API is eventually consistent — writes may not be immediately visible to reads or access checks.
Basic roles (Owner, Editor, Viewer) should not be used in production; they include thousands of permissions across all GCP services.
Access scopes on Compute Engine are a legacy mechanism — both scopes and IAM roles must allow access for an API call to succeed.
A conditional role binding does NOT override an unconditional binding for the same role — the unconditional binding still grants access.
IAM conditions cannot be used with basic roles (Owner/Editor/Viewer) or with allUsers/allAuthenticatedUsers.
IAM Conditions use Common Expression Language (CEL) for attribute-based access control expressions.
GCP IAM security requires defense in depth across two independent dimensions: correct policy evaluation understanding (deny-first with fail-closed conditions, inheritance unions, conditional bindings that don't override unconditional) for authoring policies, and active service account hardening (revoking default editor role, controlling impersonation, managing dual principal/resource nature) for closing privilege escalation — neither alone is sufficient.
If a deny policy condition evaluates to true OR cannot be evaluated, the deny rule applies (fail-closed behavior).
Deny policy conditions only support resource tag functions, not the full set of condition attributes available in allow policies.
Deny policies override allow policies — a principal can be denied access even if allowed by an allow policy.
The effective allow policy for a resource is the union of its own policy plus all ancestor policies in the resource hierarchy.
Cloud Storage buckets require uniform bucket-level access enabled to use IAM conditions.
Best practice is to grant IAM roles to Google Groups rather than individual users for easier management.
Service account key rotation follows a strict order: create new key → switch apps → disable old key → delete old key.
Best practice is no more than 100 conditional role bindings per allow policy to avoid size limits.
IAM permissions follow the format service.resource.verb (e.g., resourcemanager.projects.list) and cannot be granted directly — only through roles.
IAM policy evaluation is a layered system with fail-closed deny semantics: deny policies trigger on unevaluable conditions, conditional allow bindings never override unconditional bindings for the same role, and all policies inherit top-down through the org/folder/project/resource hierarchy.
IAM allow policies on parent resources (org → folder → project → resource) are automatically inherited by all child resources.
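
The evaluation rules listed above (deny first, fail-closed deny conditions, union of inherited allow bindings, conditional bindings not overriding unconditional ones) can be sketched as a toy model. This is not the IAM implementation: bindings here carry permission sets directly instead of roles, conditions are plain Python callables rather than CEL, and the handling of an unevaluable allow condition is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Sequence

Condition = Callable[[dict], bool]


@dataclass
class DenyRule:
    principal: str
    permission: str
    condition: Optional[Condition] = None


@dataclass
class AllowBinding:
    principal: str
    permissions: FrozenSet[str]
    condition: Optional[Condition] = None


def deny_applies(rule: DenyRule, principal: str, permission: str, ctx: dict) -> bool:
    if rule.principal != principal or rule.permission != permission:
        return False
    if rule.condition is None:
        return True
    try:
        return bool(rule.condition(ctx))
    except Exception:
        return True  # fail closed: an unevaluable condition still denies


def allow_applies(binding: AllowBinding, principal: str, permission: str, ctx: dict) -> bool:
    if binding.principal != principal or permission not in binding.permissions:
        return False
    if binding.condition is None:
        return True
    try:
        return bool(binding.condition(ctx))
    except Exception:
        return False  # assumption: an unevaluable allow condition grants nothing


def is_allowed(principal: str, permission: str, ctx: dict,
               deny_rules: Sequence[DenyRule],
               hierarchy: Sequence[Sequence[AllowBinding]]) -> bool:
    """hierarchy lists allow bindings per level: org, folder, project, resource."""
    if any(deny_applies(r, principal, permission, ctx) for r in deny_rules):
        return False  # deny policies are checked before allow policies
    effective = [b for level in hierarchy for b in level]  # union of all levels
    return any(allow_applies(b, principal, permission, ctx) for b in effective)


perm = "storage.objects.get"
org_level = [AllowBinding("user:ana", frozenset({perm}))]
project_level = [AllowBinding("user:ana", frozenset({perm}), condition=lambda c: False)]

# The failing conditional binding does not cancel the inherited unconditional one.
print(is_allowed("user:ana", perm, {}, [], [org_level, project_level]))  # True

# The deny condition raises KeyError on an empty context, so the deny applies.
deny = [DenyRule("user:ana", perm, condition=lambda c: c["env"] == "prod")]
print(is_allowed("user:ana", perm, {}, deny, [org_level]))  # False
```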
Service account keys are a security risk; prefer Workload Identity, attached service accounts, or federated identity over key-based authentication.
IAM role names follow the convention roles/<service>.<roleName> (e.g., roles/storage.admin).
The default quota is 100 service accounts per project; an increase can be requested.
Default service accounts (Compute Engine, App Engine) automatically receive roles/editor — a security anti-pattern that should be disabled via org policy.
Service accounts are both principals (can be granted roles on resources) and resources (other principals can be granted roles on them).
Service account impersonation requires roles/iam.serviceAccountTokenCreator on the target service account.
Service accounts do not belong to Google Workspace domains — Workspace domain-wide shares do not include them.
Deleting and recreating a service account with the same email does not restore old role bindings because bindings use internal unique IDs, not email addresses.
Service accounts require active security hardening: they have dual nature (principal and resource), default accounts receive overly broad editor role, impersonation needs explicit token creator role, and they are invisible to Workspace domain-wide shares.
The Service Account User role (roles/iam.serviceAccountUser) is a privilege escalation vector — anyone with this role inherits the service account's full access.
Cloud Interconnect protects up to 50% of aggregate capacity; provision 2x required bandwidth across redundant connections for full protection.
Active-active is the recommended VLAN attachment configuration for Cloud Interconnect; active/passive risks undetected misconfigurations.
Dedicated Interconnect supports up to 100 Gbps connections at colocation facilities.
ECMP across active-active Cloud Interconnect VLAN attachments requires flow-based hashing on on-premises devices; per-packet load balancing is unsupported and causes TCP connection closures.
Cloud Interconnect does not encrypt traffic by default; achieving confidentiality requires an overlay protocol — either MACsec for link-layer encryption between router and Google edge, or HA VPN tunnels over Interconnect using regional internal IP addresses for end-to-end IPsec encryption.
Cloud Interconnect offers four connection types: Dedicated, Partner, Cross-Cloud, and Cross-Site Interconnect.
HA VPN over Cloud Interconnect uses a reduced gateway MTU of 1440 bytes.
IP address spaces between on-premises and VPC networks must not overlap when using Cloud Interconnect.
Cloud Interconnect does not encrypt traffic by default; encryption requires MACsec (router-to-edge) or HA VPN over Cloud Interconnect (IPsec on VLAN attachments).
Physical Cloud Interconnect connections cannot be moved between projects or renamed after creation.
Production-grade Cloud Interconnect requires two independent engineering efforts: 2x bandwidth overprovisioning with global dynamic BGP routing for redundancy, AND an encryption overlay (MACsec or HA VPN over internal IPs) for confidentiality — neither alone is sufficient for production readiness.
Cloud Interconnect can reduce egress costs, unlike Cloud VPN alone which does not reduce egress costs.
Cloud Interconnect redundancy requires provisioning 2x bandwidth (50% capacity rule), dynamic BGP routing via Cloud Router, and global dynamic routing mode for cross-region failover — creating a three-layer resilience model.
Cloud Interconnect traffic prefers the lowest-metric local region; cross-region failover only occurs when all local VLAN attachments are down and requires global dynamic routing.
Cloud Interconnect offers three SLA tiers: 99.99% (critical production), 99.9% (non-critical production), and no SLA.
Cloud Interconnect uses Cloud Router for dynamic (BGP) routing between on-premises and VPC networks.
Custom IP ranges on VLAN attachments require /29 or /30 for IPv4 and /125 or /126 for IPv6; private IPv4 addresses cannot be used with custom IP range flags.
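
A minimal validation sketch for the prefix-length rule above, using only the standard library. Python's is_private property is used as an approximation of "private IPv4 addresses" and covers more special-purpose ranges than RFC 1918, so treat that part as an assumption; the sample ranges are arbitrary.

```python
import ipaddress

# Allowed prefix lengths per IP version, taken from the rule above.
ALLOWED_PREFIXES = {4: {29, 30}, 6: {125, 126}}


def valid_custom_range(cidr: str) -> bool:
    """Check a candidate custom IP range for a VLAN attachment."""
    network = ipaddress.ip_network(cidr, strict=False)
    if network.prefixlen not in ALLOWED_PREFIXES[network.version]:
        return False
    # Approximation: reject IPv4 ranges that Python classifies as private.
    if network.version == 4 and network.is_private:
        return False
    return True


for candidate in ("8.8.8.0/29", "8.8.8.0/28", "10.0.0.0/30", "2001:4860::/126"):
    print(candidate, valid_custom_range(candidate))
```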
Cloud Interconnect VLAN attachment MTU should match VPC network MTU; mismatches drop non-TCP packets regardless of DF bit setting.
Internal passthrough NLB backend VMs must bind to the forwarding rule IP or to the wildcard address (0.0.0.0 for IPv4, :: for IPv6) because the destination IP on received packets is the forwarding rule's IP, not the backend's own IP.
Internal passthrough NLB cannot mix instance group and zonal NEG (GCE_VM_IP) backends on the same backend service.
Internal passthrough NLB forwarding rules are immutable after creation; to change ports or the IP address, delete and recreate the rule.
Internal passthrough NLB global access is disabled by default; without it, only same-region clients can connect. On-prem clients via VPN/Interconnect must match the LB region unless global access is enabled.
Internal passthrough NLB L3_DEFAULT forwarding rule protocol enables non-TCP/UDP protocols (ICMP, SCTP, ESP, AH, GRE) and requires --ports=ALL.
Internal passthrough NLB zonal NEGs (GCE_VM_IP) only support IPv4; instance groups support both IPv4 and IPv6.
The internal passthrough Network Load Balancer is regional, Layer 4, built on Andromeda, preserves client source IP, and does not terminate SSL/TLS.
Internal passthrough NLB uses load balancing scheme INTERNAL, distinguishing it from external passthrough NLB which uses EXTERNAL.
A subnet's IPv6 access type (internal or external) cannot be changed after subnet creation.
IPv6 subnets are only supported on custom mode VPC networks, not auto mode or legacy networks.
Keyless identity patterns (WIF for external workloads + GKE Workload Identity for pods) fully eliminate credential management overhead by replacing service account keys with federated token exchange across all workload types.
The roles/cloudkms.admin role does NOT grant encrypt or decrypt permissions; separate granular roles exist for encrypt-only, decrypt-only, and combined encrypter-decrypter.
All Cloud KMS key version states except Destroyed incur costs, including Disabled and Scheduled for destruction.
Cloud KMS automatic key rotation is supported for symmetric keys only; asymmetric keys must be rotated manually.
Cloud KMS Autokey automates provisioning of key rings, keys, and service accounts on demand during resource creation, requiring no pre-provisioning.
Over 40 GCP services support Customer-Managed Encryption Keys (CMEK) with software and HSM keys; over 30 services support EKM keys.
CRC32C integrity verification is recommended (not mandatory) for Cloud KMS encrypt and decrypt operations to detect in-transit data corruption.
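
To make the CRC32C recommendation concrete, here is a small bitwise implementation of the Castagnoli checksum. The standard library's zlib.crc32 uses a different polynomial, so it is not a substitute; in practice a dedicated library would normally be used instead of this loop. How the checksum is attached to a request (for example a plaintext checksum field compared against a server-side verification flag) is not shown and should be taken from the Cloud KMS API reference.

```python
def crc32c(data: bytes) -> int:
    """CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial if the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def matches(data: bytes, expected_crc32c: int) -> bool:
    """Compare locally computed checksum with the one received."""
    return crc32c(data) == expected_crc32c


# Standard check value for the ASCII string "123456789".
print(hex(crc32c(b"123456789")))  # 0xe3069283
```

A caller would compute the checksum before sending plaintext and again after receiving ciphertext or plaintext, rejecting the result when the values differ.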
Crypto-shredding in GCP works by destroying the CMEK that protects data, rendering the data selectively unreadable.
Customer-Supplied Encryption Keys (CSEK) are not a Cloud KMS feature — they are supported only by Cloud Storage and Compute Engine, where the customer provides key material at use time.
The destroy_scheduled_duration for a Cloud KMS key is only configurable at key creation time and cannot be changed afterward.
Default scheduled-for-destruction duration for Cloud KMS key versions is 30 days; minimum is 24 hours (0 for import-only keys); maximum is 120 days.
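
The bounds above can be expressed as a small validation helper. This is a sketch of the arithmetic only; the function name and its parameters are hypothetical and do not correspond to a Cloud KMS client call.

```python
from datetime import timedelta
from typing import Optional

DEFAULT_DURATION = timedelta(days=30)
MIN_DURATION = timedelta(hours=24)
MAX_DURATION = timedelta(days=120)


def resolve_destroy_duration(requested: Optional[timedelta] = None,
                             import_only: bool = False) -> timedelta:
    """Return the scheduled-for-destruction duration a new key would get."""
    if requested is None:
        return DEFAULT_DURATION
    # Import-only keys may use a zero duration; other keys need at least 24 hours.
    minimum = timedelta(0) if import_only else MIN_DURATION
    if not (minimum <= requested <= MAX_DURATION):
        raise ValueError(
            f"duration must be between {minimum} and {MAX_DURATION}, got {requested}"
        )
    return requested


print(resolve_destroy_duration())                   # 30 days, 0:00:00
print(resolve_destroy_duration(timedelta(days=7)))  # 7 days, 0:00:00
```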
Destroying a Cloud KMS key version that is still in use causes permanent, irrecoverable data loss.
Cloud EKM keys reside in an external key manager and are never sent to Google.
Encrypt requests target the crypto key (not a specific version) — Cloud KMS uses the primary version automatically; decrypt also targets the key and determines the version from ciphertext metadata.
Cloud KMS governance provides complementary safety guarantees: strict separation of duties ensures administrators cannot perform cryptographic operations (and vice versa) while key rotation is operationally safe because it creates new versions without re-encrypting existing data and the version is embedded in ciphertext for transparent decryption — together enabling key governance where operational mistakes in rotation cannot cause data loss and administrative access cannot enable data exfiltration.
IAM access control operates at the key level, not the key version level — you cannot control access to individual key versions.
After a Cloud KMS key version is destroyed, it takes 45 days for key material to be purged from all Google active and backup systems.
Cloud KMS key rings and EKM connections cannot be deleted, and deleted key names cannot be reused.
Cloud KMS key rings are bound to a specific location and cannot be moved after creation.
Cloud KMS key type, purpose, and protection level are all immutable after creation.
Manual key rotation does not pause or modify an existing automatic rotation schedule.
For symmetric encryption keys, only the primary version can encrypt data; any enabled version can decrypt, but non-primary enabled versions cannot encrypt.
Only symmetric encryption keys have a primary version; asymmetric keys require specifying the version explicitly.
Cloud KMS uses probabilistic encryption — identical plaintext encrypted with the same key version produces different ciphertext each time.
Cloud KMS raw symmetric key material can never be viewed or exported; encrypt/decrypt operations happen within the service.
Cloud KMS raw key material is never viewable or exportable by any Google Cloud principal.
Cloud KMS REST API requires plaintext to be base64-encoded before calling encrypt, and decrypt responses are also base64-encoded.
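
A minimal sketch of the base64 handling described above, building an encrypt request body and reading a decrypt response without making any network call. The "plaintext" field name matches the REST API as commonly documented; the helper names are hypothetical.

```python
import base64
import json


def build_encrypt_body(plaintext: bytes) -> str:
    """JSON body for an encrypt call: plaintext must be base64-encoded."""
    encoded = base64.b64encode(plaintext).decode("ascii")
    return json.dumps({"plaintext": encoded})


def read_decrypt_response(response_json: str) -> bytes:
    """Decode the base64 plaintext returned by a decrypt call."""
    return base64.b64decode(json.loads(response_json)["plaintext"])


body = build_encrypt_body(b"hello")
print(body)  # {"plaintext": "aGVsbG8="}

# Simulated decrypt response carrying the same payload.
print(read_decrypt_response('{"plaintext": "aGVsbG8="}'))  # b'hello'
```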
KMS key rotation creates new versions without re-encrypting existing data; the version is embedded in ciphertext for symmetric keys, automatic rotation only works for symmetric keys, and manual rotation preserves auto schedules — making re-encryption a separate operational responsibility.
Rotating a Cloud KMS key creates a new key version but does not re-encrypt previously encrypted data — re-encryption must be done explicitly.
Cloud KMS enforces strict separation of duties between administration and cryptographic operations: the admin role cannot encrypt or decrypt, IAM access control operates at the key level (not individual versions), and raw key material is never viewable or exportable — no single role or access path can both manage keys and use them, and the key material itself is inaccessible regardless of permissions.
Cloud KMS software protection level keys are FIPS 140-2 Level 1 certified.
For symmetric Cloud KMS keys, the version used for encryption is embedded in the ciphertext, so decryption automatically selects the correct version.
Log-based alerting policies are for rare important events (pattern matching), while log-based metrics combined with alerting policies are for monitoring trends and thresholds.
Cloud Logging automatically collects logs from Google Cloud resources without requiring manual setup for default collection.
New Cloud Storage log sinks can take several hours to begin routing because log entries to Cloud Storage are processed hourly.
Cloud Logging's two default sinks implement a deliberate two-tier routing architecture: _Required is completely immutable and always routes compliance-critical audit logs (Admin Activity, System Event), while _Default can be modified, disabled, or redirected. This guarantees compliance log capture at the infrastructure level while giving operators full control over operational log routing and cost.
The _Default sink is system-created but can be modified, disabled, or have its destination changed and exclusion filters added.
Error Reporting only analyzes log entries stored in log buckets within the same project where Error Reporting runs.
Exclusion filters do not reduce entries.write API quota consumption because they are applied after the Logging API has already received the entries.
Cloud Logging export has temporal reliability gaps despite the robust two-tier default architecture: sinks are not retroactive (entries received before sink creation or during misconfiguration are permanently lost), Cloud Storage sink destinations can delay initial routing by several hours, and only the immutable _Required sink guarantees uninterrupted routing — making log completeness in custom destinations a best-effort property, not a guarantee.
Log sink supported destinations are: log bucket, BigQuery dataset, Cloud Storage bucket, Pub/Sub topic, and Google Cloud project.
Intercepting aggregated sinks prevent child resource sinks from processing matched entries, except the child's _Required sink which always processes originating entries.
Linked BigQuery datasets created from log buckets are read-only and must never be set as a sink destination.
Each Google Cloud project, billing account, folder, and organization has its own Log Router that temporarily buffers and routes log entries through sinks.
The Ops Agent is the recommended agent for collecting application and third-party logs on Compute Engine VMs.
The _Required sink is system-created and immutable (it cannot be modified or deleted) and routes Admin Activity, System Event, and Access Transparency audit logs to the _Required log bucket.
When a sink routes log entries to a Google Cloud project as destination, sinks in that destination project cannot reroute the entries to another project (one-hop limit).
Log sinks cannot retroactively route entries received before creation or during misconfiguration; the "Copy logs" feature can retroactively copy from a log bucket to Cloud Storage.
Log entries with timestamps older than the retention period or more than 24 hours in the future are discarded by the Log Router.
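
The timestamp rule above amounts to a window check, sketched here with the standard library. The retention period is passed in as a parameter because it depends on the destination bucket; the 30-day value in the example is an assumption.

```python
from datetime import datetime, timedelta, timezone

FUTURE_TOLERANCE = timedelta(hours=24)


def is_discarded(entry_time: datetime, now: datetime, retention: timedelta) -> bool:
    """True when an entry falls outside the accepted timestamp window."""
    too_old = entry_time < now - retention
    too_far_ahead = entry_time > now + FUTURE_TOLERANCE
    return too_old or too_far_ahead


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
retention = timedelta(days=30)  # assumed retention for the example

print(is_discarded(now - timedelta(days=1), now, retention))    # False
print(is_discarded(now - timedelta(days=47), now, retention))   # True: too old
print(is_discarded(now + timedelta(hours=25), now, retention))  # True: too far ahead
```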
Cloud Logging provides two query interfaces: Logs Explorer for troubleshooting individual entries, and Log Analytics for SQL-based trend analysis.
Memorystore for Redis Cluster has a hierarchical resource structure: Instance → Shards → Nodes (primary + replicas), where "instance" and "cluster" are interchangeable terms.
Memorystore for Redis Cluster is based on open-source Redis 7.x and supports a subset of the total Redis command library (not all commands available).
Each Memorystore for Redis Cluster shard has exactly 1 primary node and up to 5 replica nodes, with replicas automatically distributed across zones for HA.
VPC networking must be configured before creating a Memorystore for Redis Cluster instance.
Memorystore for Redis uses asynchronous replication — acknowledged writes may be lost during failover because replicas can lag behind the primary.
Memorystore for Redis Basic Tier has no replication, no automatic failover, and is a standalone cache instance.
Memorystore for Redis operates under a constrained model compared to self-managed Redis: private IP only (requiring VPC connectors for serverless access), RDB-only persistence (no AOF, creating a data loss window between snapshots), 300 GB maximum instance size, and Basic Tier entirely lacking replication — workloads requiring AOF durability, public access, or instances larger than 300 GB cannot use the managed service.
During automatic failover, Memorystore for Redis promotes the replica with the least replication lag.
The connection string and IP address remain the same after a Memorystore for Redis failover — no application changes are needed.
Memorystore for Redis Standard Tier (not Basic) is required for high availability and data replication.
Memorystore for Redis I/O threads require M3 or higher capacity tier and Redis 6.x or later.
Manual failover via API is not supported for Memorystore for Redis instances with read replicas enabled.
Memorystore for Redis maximum instance size is 300 GB for both Basic and Standard tiers.
Memorystore for Redis supports up to 5 read replicas on Standard Tier only, each adding 16 Gbps read throughput.
Memorystore for Redis does not support AOF persistence; only RDB snapshots are available for data persistence.
Memorystore for Redis instances use private IPs only and are not exposed to the public internet; clients must connect via the same VPC network.
Memorystore for Redis supports 1 to 5 read replicas when readReplicaMode is enabled, providing both read distribution and HA failover targets.
Serverless environments (App Engine, Cloud Run, Cloud Run functions) require a Serverless VPC Access connector to reach Memorystore for Redis.
Memorystore for Redis Standard Tier provides a 99.9% availability SLA with cross-zone replication and automatic failover.
Cloud Monitoring automatically collects metrics for most GCP services without requiring an agent.
Cloud Endpoints metrics use the serviceruntime metric type and write against the api monitored-resource type.
Cloud Monitoring forecasted metric-value condition has a forecast window ranging from 1 hour to 7 days.
Legacy GKE metrics use the container.googleapis.com prefix; newer GKE metrics use the kubernetes.io prefix.
Log-based alerts monitor individual log entries; SQL-based alerts monitor aggregations across multiple log entries.
Cloud Monitoring metric-absence condition has a maximum retest window of 23.5 hours.
Cloud Monitoring automatically closes incidents and sends closure notifications when metric-based policy conditions are no longer met.
Cloud Monitoring metric type names are prefixed by service name (e.g., compute.googleapis.com/instance/cpu/utilization).
A Cloud Monitoring metrics scope allows monitoring multiple GCP projects and AWS accounts from a single scoping project.
Cloud Monitoring writes one time series per unique combination of resource labels and metric labels.
The Ops Agent must be installed on Compute Engine VMs to collect application and third-party metrics (Apache, Nginx, MongoDB, PostgreSQL).
PromQL can be used in Cloud Monitoring alerting policy conditions for complex expressions and dynamic thresholds.
Cloud Monitoring alerting policies monitor continuously; a snooze temporarily disables a policy for a defined period.
Cloud Monitoring supports three alerting policy types: metric-based, log-based, and SQL-based (SQL-based is Public Preview).
Cloud Monitoring has three metric kinds: GAUGE (value at a point in time), CUMULATIVE (accumulated over time), and DELTA (change over a period).
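
The relationship between CUMULATIVE and DELTA can be shown by deriving per-interval changes from accumulated samples. This sketch uses plain tuples rather than any Monitoring client type and does not handle counter resets.

```python
from typing import List, Tuple

Point = Tuple[int, float]                   # (timestamp_seconds, cumulative_value)
Delta = Tuple[Tuple[int, int], float]       # ((start, end), change_in_interval)


def cumulative_to_delta(points: List[Point]) -> List[Delta]:
    """Turn time-ordered CUMULATIVE samples into DELTA-style intervals."""
    deltas = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        deltas.append(((t0, t1), v1 - v0))
    return deltas


samples = [(0, 10.0), (60, 25.0), (120, 25.0)]
print(cumulative_to_delta(samples))
# [((0, 60), 15.0), ((60, 120), 0.0)]
```

A GAUGE series needs no such conversion because each point already stands for the value at that instant.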
VM Manager metrics use the osconfig.googleapis.com prefix, not compute.googleapis.com.
GCP multi-network connectivity is fundamentally constrained regardless of model: Shared VPC requires same-organization hierarchy with single-host attachment and doesn't migrate existing resources, while VPC peering is non-transitive, prohibits subnet overlap, and never exchanges IAM or dynamic routes — forcing an architecture-level choice between organizational coupling (Shared VPC) and connectivity limitations (peering) with no unconstrained option.
Partner Interconnect supports VLAN attachment capacities from 50 Mbps to 50 Gbps.
Partner Interconnect 99.99% SLA requires 4 VLAN attachments across 2 metros, one per edge availability domain, with 2 Cloud Routers (one per region) and global routing enabled on the VPC.
Partner Interconnect Layer 2 requires customer-configured BGP; Layer 3 has fully automated BGP handled by the service provider and supports pre-activation.
MED values cannot be sent or learned over Layer 3 Partner Interconnect connections because MED values cannot pass through autonomous systems.
Google's SLA for Partner Interconnect covers only connectivity between VPC and the service provider's network, not the service-provider-to-customer leg.
Partner Interconnect uses a third-party service provider as an intermediary, suited for when a data center cannot reach a Dedicated Interconnect colocation facility or bandwidth needs are below 10 Gbps.
Peering dynamic routes apply based on the dynamic routing mode of the exporting VPC network, not the importing one.
When PGA is disabled on a subnet, internal-only VMs on that subnet cannot reach Google APIs or services.
Private Google Access (PGA) is enabled per subnet, not per VM or per VPC.
Private Google Access only affects VMs without external IP addresses; VMs with external IPs can already reach Google APIs normally and are unaffected by PGA.
Private Google Access requires correct DNS, routing, and firewall configuration in the VPC network to function.
PGA traffic can use the VM's primary internal IP or an alias IP range as the source IP.
Private Google Access reaches Google APIs, Private Services Access connects to services via Service Networking API/peering (e.g., Cloud SQL private IP), and Private Service Connect provides endpoint-based managed service access.
Policy-based routes are never exchanged via VPC Network Peering.
Prohibited VPC subnet ranges include: Google public IPs, 0.0.0.0/8, 127.0.0.0/8, 169.254.0.0/16, 224.0.0.0/4, 255.255.255.255/32, and Private Google Access VIPs (199.36.153.4/30, 199.36.153.8/30).
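
A candidate subnet range can be checked against the fixed prohibited ranges above with the ipaddress module. This sketch covers only the literal CIDRs listed; Google public IP ranges are not enumerated here, so passing this check does not guarantee the range is accepted.

```python
import ipaddress
from typing import List

# Fixed prohibited ranges from the list above (Google public IPs excluded).
PROHIBITED = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "0.0.0.0/8",
        "127.0.0.0/8",
        "169.254.0.0/16",
        "224.0.0.0/4",
        "255.255.255.255/32",
        "199.36.153.4/30",
        "199.36.153.8/30",
    )
]


def conflicts(cidr: str) -> List[str]:
    """Return the prohibited ranges that overlap the candidate subnet."""
    candidate = ipaddress.ip_network(cidr)
    return [str(blocked) for blocked in PROHIBITED if candidate.overlaps(blocked)]


print(conflicts("10.0.0.0/24"))      # []
print(conflicts("169.254.10.0/24"))  # ['169.254.0.0/16']
print(conflicts("199.36.153.0/24"))  # ['199.36.153.4/30', '199.36.153.8/30']
```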
Pub/Sub guarantees at-least-once delivery — duplicates are possible, and publish-side duplicates can have different publishTime values with the same messageId.
Pub/Sub message attribute keys must not start with goog (reserved prefix).
BigQuery export subscriptions can leverage the topic's schema, which is not available with the basic Dataflow template.
Delivery attempt counting for dead-letter topics is best-effort and approximate — messages may be forwarded after fewer or more attempts than configured.
Dead-letter topics can reside in a different project from the subscription using --dead-letter-topic-project.
The Pub/Sub service account needs roles/pubsub.publisher on the dead-letter topic and roles/pubsub.subscriber on the source subscription, granted after creating the dead-letter topic.
Dead lettering is configured as a subscription property, not on the source topic.
The max-delivery-attempts parameter ranges from 5 (default) to 100.
The dead-letter topic must have its own subscription to consume forwarded messages; otherwise forwarded messages are lost.
For pull subscriptions with inactive subscribers, the tracked delivery attempt count may reset to zero, causing more deliveries than the configured maximum.
The default ack deadline for exactly-once subscriptions is 60 seconds (vs. the 10-second default for regular subscriptions).
Expired ack IDs return INVALID_ARGUMENT for exactly-once subscriptions (unlike regular subscriptions which return OK).
When combining ordering keys with exactly-once delivery, acks must be in order and throughput drops to the order of thousands of messages per second.
Publish-side duplicates can still occur even with exactly-once delivery enabled (publisher retries can cause duplicate messages).
Exactly-once delivery is only supported on pull subscriptions (including StreamingPull), not push or export subscriptions.
The exactly-once delivery guarantee only applies when subscribers connect to the service in the same region; multi-region subscribers may still receive duplicates.
Export subscriptions replace Dataflow pipelines when no transformation is needed before writing to BigQuery or Cloud Storage.
A Pub/Sub message supports a maximum of 100 attributes, with keys ≤256 bytes and values ≤1024 bytes.
A Pub/Sub message must contain at least one of: non-empty data, an ordering key, or attributes; it does not need all three.
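
The message constraints noted in this list (reserved goog attribute prefix, attribute count and size limits, and the at-least-one-field rule) can be gathered into a client-side pre-check. This is a sketch with a hypothetical function name; it does not call Pub/Sub and the service remains the authority on what is accepted.

```python
from typing import Dict, List, Optional

MAX_ATTRIBUTES = 100
MAX_KEY_BYTES = 256
MAX_VALUE_BYTES = 1024


def validate_message(data: bytes = b"", ordering_key: str = "",
                     attributes: Optional[Dict[str, str]] = None) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    attributes = attributes or {}
    errors = []
    if not data and not ordering_key and not attributes:
        errors.append("message needs non-empty data, an ordering key, or attributes")
    if len(attributes) > MAX_ATTRIBUTES:
        errors.append(f"more than {MAX_ATTRIBUTES} attributes")
    for key, value in attributes.items():
        if key.startswith("goog"):
            errors.append(f"attribute key {key!r} uses the reserved goog prefix")
        if len(key.encode("utf-8")) > MAX_KEY_BYTES:
            errors.append(f"attribute key exceeds {MAX_KEY_BYTES} bytes")
        if len(value.encode("utf-8")) > MAX_VALUE_BYTES:
            errors.append(f"attribute value for {key[:20]!r} exceeds {MAX_VALUE_BYTES} bytes")
    return errors


print(validate_message(data=b"payload"))                     # []
print(validate_message())                                    # one error: empty message
print(validate_message(attributes={"googtrace": "abc"}))     # one error: reserved prefix
```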
Pub/Sub message ordering imposes cascading operational constraints: it is immutable after subscription creation, caps throughput at 1 MBps per key, and redelivery of one message forces redelivery of all subsequent messages for that key.
Message ordering is set at subscription creation and cannot be changed afterward.
If a message with an ordering key fails to publish, subsequent messages with that same key fail until resumePublish() is called in the client library.
Publish throughput per ordering key is limited to 1 MBps.
Ordering keys are not equivalent to partitions; they are expected to have much higher cardinality than partitions in partition-based systems.
Message ordering should not be enabled for Dataflow subscriptions — Dataflow has its own ordering mechanism via windowing and ordering keys can reduce pipeline performance.
Push subscriptions allow only one outstanding message per ordering key, resulting in the worst throughput for ordered workloads.
Redelivery of one message causes all subsequent messages for that ordering key to be redelivered, even previously acknowledged ones.
Pub/Sub uses per-message parallelism (leasing individual messages), not partition-based parallelism like Kafka/Pulsar.
roles/pubsub.publisher is the minimum IAM role needed to publish messages to a topic.
Multiple pull subscribers on the same subscription each receive a subset of messages (load balancing behavior).
Pull subscriptions support dynamic acknowledgment deadline extension, allowing arbitrarily long processing times.
Pub/Sub push delivery sends messages as HTTP POST requests to webhooks.
Push subscriptions deliver one message per request and cap outstanding messages; flow control is server-managed.
Push endpoints must have DNS-resolvable names and valid SSL certificates (non-self-signed).
Pub/Sub REST API message data must be base64-encoded.
Pub/Sub is designed for service-to-service communication; use Firebase for client-to-server and Cloud Tasks for async service calls.
Pub/Sub offers three subscription types: Pull, Push, and Export (BigQuery and Cloud Storage).
Pub/Sub typical latency is approximately 100 milliseconds.
The roles/secretmanager.secretAccessor role grants only secretmanager.versions.access — it cannot list secrets or view metadata.
Best practice is to avoid passing secrets via environment variables or filesystem and instead use the Secret Manager API directly via client libraries.
Billing applies to Enabled and Disabled secret versions; Destroyed versions are free.
Workloads on Compute Engine or GKE require the cloud-platform OAuth scope to use Secret Manager.
Creating secrets requires the roles/secretmanager.admin (Secret Manager Admin) IAM role.
Creating a secret via CLI/API does not automatically create a version; the Console creates a first version only if a value is provided during creation.
Secret Manager encrypts all secrets with AES-256 at rest and TLS in transit by default with no configuration required; CMEK is available for customer-controlled keys.
The Destroyed state for a secret version is irreversible — contents are permanently discarded and the version cannot transition to another state.
Best practice is to disable secret versions before destroying them; disabling is reversible, destroying is not.
The basic roles/editor role does not include secretmanager.versions.access; only roles/owner among basic roles grants secret access.
Secret Manager has five predefined IAM roles: Admin, Secret Accessor, Secret Version Adder, Secret Version Manager, and Viewer.
Secret Manager supports IAM Conditions for date/time-based expirable access and resource-attribute filtering (e.g., secret name prefix, specific version).
Secret Manager IAM has a granularity mismatch between role scope and resource structure: roles are granted at the secret level (not per-version), but secrets have three version states (Enabled/Disabled/Destroyed) with distinct access semantics — and the accessor role grants only versions.access (no list or metadata), while the viewer role sees metadata but not payloads, forcing dual-role grants for full operational visibility without violating least privilege.
IAM roles cannot be granted on a secret version — only on the secret itself.
The lowest-level resource for granting Secret Manager IAM roles is the individual secret.
Expiration on production secrets should be avoided because it causes irreversible deletion; use time-based IAM conditions instead.
Best practice is to reference secrets by specific version number in production, not the latest alias, to enable validation and rollback.
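One way to enforce version pinning is to build the version resource name in a single place and reject aliases there. This is only a sketch: `pinned_version_name` is a hypothetical helper, and the `projects/.../secrets/.../versions/...` name layout is assumed to match what the Secret Manager API expects.

```python
def pinned_version_name(project: str, secret_id: str, version: str) -> str:
    """Build a secret version resource name, rejecting aliases such as 'latest'."""
    # Only plain ASCII digits count as a pinned version number.
    if not (version.isascii() and version.isdigit()):
        raise ValueError(f"expected a numeric version, got {version!r}")
    return f"projects/{project}/secrets/{secret_id}/versions/{version}"
```

Rolling back then means changing one pinned number in configuration rather than relying on whatever `latest` currently resolves to.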
Production secret access should use the Secret Manager API directly (not env vars/files), pin to specific version numbers (not latest), and account for the fact that Cloud Run env var secrets resolve only at startup — creating a specific operational model.
Secret Manager production integration requires application-level awareness across three dimensions: the access pattern (API-direct with pinned versions, avoiding env var indirection), the rotation model (notification-only, requiring application code to handle credential refresh), and the IAM granularity (per-secret not per-version, meaning version-level access control must be enforced by application logic) — the service stores secrets but delegates lifecycle management to the consuming application.
Regional secrets are not automatically replicated across locations, unlike global secrets.
Regional secrets require using the regional service endpoint, not the default global service.
Regional secrets ensure data stays within the chosen location at rest, in use, and in transit.
Access to regional secrets is limited to applications and services running within the same location.
The --remove-rotation-schedule flag removes both next_rotation_time and rotation_period, while --remove-next-rotation-time removes only the timestamp.
Every secret requires a replication policy: automatic (Google chooses regions, charged for one location) or user-managed (customer selects regions, charged per location).
The Secret Manager REST API requires base64-encoded payload data when adding a secret version.
The next_rotation_time for a Secret Manager rotation schedule cannot be less than 5 minutes in the future.
Secret Manager rotation is notification-only: it sends a Pub/Sub message rather than rotating the value, has a 5-minute minimum scheduling window, and removing the schedule clears both period and next-time — actual rotation logic must be implemented externally via subscriber code.
The rotation_period for a Secret Manager rotation schedule cannot be less than 1 hour.
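The two rotation schedule limits (at least 5 minutes of lead time, at least a 1-hour period) can be checked client-side before calling the API. The function below is a hypothetical pre-flight check, not part of any Google library, and it assumes timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

MIN_LEAD_TIME = timedelta(minutes=5)
MIN_ROTATION_PERIOD = timedelta(hours=1)


def rotation_schedule_problems(
    next_rotation_time: datetime,
    rotation_period: Optional[timedelta] = None,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return a list of limit violations; an empty list means the schedule looks valid."""
    now = now or datetime.now(timezone.utc)
    problems = []
    if next_rotation_time - now < MIN_LEAD_TIME:
        problems.append("next_rotation_time must be at least 5 minutes in the future")
    if rotation_period is not None and rotation_period < MIN_ROTATION_PERIOD:
        problems.append("rotation_period must be at least 1 hour")
    return problems
```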
Secret Manager retries failed rotation message delivery for up to 7 days, after which the rotation is canceled.
Secret Manager rotation does not rotate the secret value itself — it only sends a SECRET_ROTATE notification to configured Pub/Sub topics; the actual rotation logic must be implemented by a subscriber.
Secret Manager skips scheduled rotations if an in-flight rotation already exists; concurrent rotations are prevented.
Secret names allow uppercase/lowercase letters, numerals, hyphens, and underscores with a maximum length of 255 characters.
Secret version values must not exceed 64 KiB in size.
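The naming and size limits lend themselves to a small local check. This sketch assumes the limits exactly as stated in these notes (255 characters from letters, numerals, hyphens, and underscores; 64 KiB payload); `validate_secret` is a hypothetical helper.

```python
import re
from typing import List

SECRET_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,255}")
MAX_PAYLOAD_BYTES = 64 * 1024  # 64 KiB


def validate_secret(secret_id: str, payload: bytes) -> List[str]:
    """Return a list of limit violations for a secret ID and a version payload."""
    problems = []
    if not SECRET_ID_PATTERN.fullmatch(secret_id):
        problems.append("secret ID must be 1-255 characters from [A-Za-z0-9_-]")
    if len(payload) > MAX_PAYLOAD_BYTES:
        problems.append(f"payload is {len(payload)} bytes; limit is {MAX_PAYLOAD_BYTES}")
    return problems
```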
Secrets are global resources by default; regional secrets exist for data residency compliance.
Secret Manager stores actual secret values (viewable with permissions); Cloud KMS stores cryptographic keys (never viewable/extractable).
Secret Manager secret data is immutable — modifications require adding a new version rather than changing existing ones.
Secret versions have three states: Enabled (accessible), Disabled (exists but not accessible, re-enableable), and Destroyed (permanently discarded, irreversible).
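The three version states can be modeled as a small state machine. This is an illustrative sketch, not the service's implementation; it assumes an Enabled version can be destroyed directly, since the notes only recommend disabling first rather than requiring it.

```python
from enum import Enum


class VersionState(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    DESTROYED = "destroyed"


# Destroyed is terminal: it has no outgoing transitions.
ALLOWED_TRANSITIONS = {
    VersionState.ENABLED: {VersionState.DISABLED, VersionState.DESTROYED},
    VersionState.DISABLED: {VersionState.ENABLED, VersionState.DESTROYED},
    VersionState.DESTROYED: set(),
}


def can_transition(current: VersionState, target: VersionState) -> bool:
    """Return True if a version may move from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


def is_accessible(state: VersionState) -> bool:
    """Only Enabled versions can have their payload accessed."""
    return state is VersionState.ENABLED
```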
The roles/secretmanager.viewer role can read secret and version metadata but cannot access secret payloads.
VPC Service Controls can add network-level perimeter protection to Secret Manager API access in addition to IAM.
Accessing Cloud SQL privately from serverless compute (Cloud Run) requires navigating three networking layers: (1) explicit VPC bridging for serverless egress, (2) private services access VPC peering for Cloud SQL, and (3) peering's inherent constraints (non-transitivity, no overlapping subnets, no IAM exchange) — each layer adds configuration surface and distinct failure modes.
Serverless access to Memorystore for Redis is doubly constrained: Memorystore enforces private-IP-only connectivity (no public exposure, 300GB cap, no AOF persistence, basic tier lacks replication) AND serverless workloads require explicit VPC bridging (Direct VPC egress or connector VMs) to reach any VPC-private resource, creating a two-layer networking dependency chain where either layer's failure blocks connectivity entirely.
Serverless workloads in multi-tenant GCP environments face compounding network constraints from two independent layers: multi-network connectivity is fundamentally limited regardless of model (Shared VPC requires same-org hierarchy, peering is non-transitive with no IAM exchange), AND serverless resources require explicit VPC bridging just to reach any VPC network — meaning serverless in a shared network must solve both the bridging gap and the connectivity model constraints simultaneously.
Multi-tenant serverless architectures requiring private data access must configure the full VPC connectivity stack (VPC, NAT, connectors/direct VPC egress, private services access) for each tenant's isolation boundary, multiplying the networking complexity per tenant.
Serverless workloads accessing private data must navigate five dependency layers: (1) VPC bridging for network egress, (2) peering non-transitivity for database reach, (3) private services access for Cloud SQL IP allocation, (4) Secret Manager API integration with fail-fast startup semantics, and (5) rotation notification gaps requiring application-level credential refresh — each layer adds a failure mode invisible from the serverless abstraction.
Production serverless workloads requiring private data access need VPC configuration, NAT gateways, and private connectivity — networking infrastructure that adds significant complexity to otherwise simple serverless deployments.
Serverless resources (Cloud Run, Cloud Run functions, App Engine) cannot natively reach VPC networks — they require explicit VPC bridging via Direct VPC egress or Serverless VPC Access connectors, connectivity is egress-only from serverless to VPC, and services like Memorystore mandate this bridging for access.
Shared VPC resource costs are billed to the service project, except VPN gateway egress (billed to host project) and VLAN attachment traffic (billed to attachment owner).
A project cannot be both a Shared VPC host project and a service project simultaneously.
Existing resources in a newly attached service project do not automatically use the shared network; new resources must be created to use shared subnets.
Internal DNS names for VMs in Shared VPC use the service project's project ID, not the host project ID.
Legacy networks are not supported for Shared VPC.
The compute.networkUser role grants subnet access — at project level (all subnets including future) or at subnet level (specific subnets only).
Host and service projects must be in the same Google Cloud organization for Shared VPC.
A service project can attach to only one host project; multiple host projects are allowed per organization.
Shared VPC enforces a strict organizational hierarchy for multi-tenant networking: host and service projects must share an organization, each service project attaches to exactly one host project, DNS uses service project IDs, and GKE follows the same host-service pattern for cluster networking.
Spanner automatically deletes a database if its CMEK key is unavailable for more than 30 consecutive days.
Google health check probe IP ranges include 35.191.0.0/16 and 130.211.0.0/22; IAP uses 35.235.240.0/20; Cloud DNS uses 35.199.192.0/19; Serverless VPC Access uses 35.199.224.0/19.
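When reading logs or writing firewall rules it can help to label a source address against these ranges. A minimal sketch using the standard `ipaddress` module; the labels and the `classify_source` helper are hypothetical, and the ranges are copied from the line above.

```python
import ipaddress
from typing import Optional

SPECIAL_RANGES = {
    "health-check": ["35.191.0.0/16", "130.211.0.0/22"],
    "iap": ["35.235.240.0/20"],
    "cloud-dns": ["35.199.192.0/19"],
    "serverless-vpc-access": ["35.199.224.0/19"],
}


def classify_source(ip: str) -> Optional[str]:
    """Return the label of the Google range containing ip, or None if there is none."""
    addr = ipaddress.ip_address(ip)
    for label, cidrs in SPECIAL_RANGES.items():
        if any(addr in ipaddress.ip_network(cidr) for cidr in cidrs):
            return label
    return None
```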
Special routing paths (health checks, IAP, DNS, etc.) are non-removable and do not appear in the route table.
Each primary IPv4 subnet range has 4 unusable addresses: network (first), default gateway (second), second-to-last (reserved), and broadcast (last). Secondary ranges have no reserved addresses.
Minimum VPC subnet size is /29 (8 addresses). Recommended maximum is /8.
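The reserved-address rule for primary ranges can be worked through with the standard `ipaddress` module. The helper names are hypothetical; the sketch assumes a primary IPv4 range and the /29 minimum stated above.

```python
import ipaddress
from typing import List


def reserved_addresses(cidr: str) -> List[ipaddress.IPv4Address]:
    """Return the four unusable addresses of a primary IPv4 subnet range."""
    net = ipaddress.ip_network(cidr)
    if net.prefixlen > 29:
        raise ValueError("the smallest allowed subnet is /29")
    # network, default gateway, second-to-last, broadcast
    return [net[0], net[1], net[-2], net[-1]]


def usable_count(cidr: str) -> int:
    """Number of addresses left for resources in a primary range."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - len(reserved_addresses(cidr))
```

For a /29 this leaves 4 usable addresses out of 8, which is why the minimum size is rarely a practical choice.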
Primary IPv4 subnet ranges can be expanded but never shrunk after creation.
Subnet routes always take precedence over static and dynamic routes for overlapping destinations, unless using hybrid subnets.
VPC subnets are regional resources while VPC networks are global resources.
VPC firewalls implement an asymmetric, stateful security posture: default rules deny all ingress but allow all egress (asymmetric baseline), connection tracking expires after 10 minutes of idle (stateful with silent timeout), and source port filtering is unsupported (coarse-grained matching) — the net effect is that outbound-initiated connections are permissive by default but their return path depends on connection tracking state that silently expires.
VPC firewall connection tracking expires after 10 minutes of inactivity.
The default priority for VPC firewall rules is 1000; priority range is 0–65535 (lower = higher priority).
At equal priority, deny rules override allow rules in VPC firewall evaluation.
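The priority and tie-break rules can be sketched as a sort followed by first-match. This is a simplified model, not the real evaluation engine: it matches only on direction and destination port, the `Rule` type is hypothetical, and the fallback applies the implied deny-ingress and allow-egress rules.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    priority: int            # 0-65535, lower number wins
    action: str              # "allow" or "deny"
    direction: str           # "ingress" or "egress"
    ports: FrozenSet[int] = frozenset()  # destination ports; empty means all


def evaluate(rules: List[Rule], direction: str, dest_port: int) -> str:
    """Return "allow" or "deny" for traffic in the given direction and port."""
    matching = [
        r for r in rules
        if r.direction == direction and (not r.ports or dest_port in r.ports)
    ]
    # Sort by priority; at equal priority a deny rule sorts before an allow rule.
    matching.sort(key=lambda r: (r.priority, r.action != "deny"))
    if matching:
        return matching[0].action
    # No explicit rule matched: fall back to the implied rules.
    return "deny" if direction == "ingress" else "allow"
```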
VPC firewall rules support destination port specification only; source port filtering is not supported.
VPC firewall rules are scoped to a single VPC network and cannot be shared across networks, including peered networks.
VPC firewall rules are stateful — return traffic for allowed connections is automatically permitted.
VPC Flow Logs provide asymmetric visibility into firewall-blocked traffic: egress denied packets are captured (sampled before egress firewall evaluation) but ingress denied packets are not captured — creating a systematic blind spot for inbound attack detection that must be supplemented with firewall rule logging or other network security tooling.
The default VPC Flow Logs aggregation interval is 5 seconds.
Egress packets blocked by firewall rules are captured by VPC Flow Logs (sampled before egress firewall evaluation).
GKE intra-node pod-to-pod traffic requires intranode visibility to be enabled for VPC Flow Logs capture.
Ingress packets blocked by firewall rules are not captured by VPC Flow Logs.
Enabling VPC Flow Logs introduces no delay or performance impact.
Default secondary sampling rate for VPC Flow Logs is 50% via Compute Engine API and 100% via Network Management API.
VPC Flow Logs for Shared VPC are written to the host project, not service projects.
Every VPC has two implied firewall rules: deny all ingress and allow all egress.
The metadata server at 169.254.169.254 is always accessible regardless of firewall rules and cannot be blocked.
Cloud Router does not auto-advertise received peering subnet routes; custom advertisement mode with explicit ranges is required.
Static and dynamic routes are not exchanged by default in VPC Peering; must enable --export-custom-routes and --import-custom-routes.
Internal DNS names do not resolve across peered VPC networks; Cloud DNS peering zones are required.
The dynamic routing mode of the exporting network determines regional vs global availability of peering dynamic routes; the importing network's mode is irrelevant.
Firewall rules are not exchanged via VPC Peering; each VPC must independently create ingress allow rules.
VPC peering provides constrained connectivity: it is non-transitive, never exchanges IAM policies, prohibits overlapping subnets, and does not auto-advertise peered routes via Cloud Router.
VPC Network Peering exchanges subnet routes automatically but never exchanges IAM policies.
Subnet IP ranges cannot overlap across peered VPC networks; peering will fail if they do.
VPC Peering is non-transitive — if A peers with B and B peers with C, A cannot reach C through B.
Two auto mode VPC networks cannot be peered because both use ranges within 10.128.0.0/9.
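An overlap check before creating a peering can be sketched with the standard `ipaddress` module. The helper names are hypothetical, and the example ranges in the usage note are illustrative values rather than ranges taken from a real network.

```python
import ipaddress
from itertools import product
from typing import Iterable, List, Tuple

AUTO_MODE_RANGE = ipaddress.ip_network("10.128.0.0/9")


def overlapping_pairs(ranges_a: Iterable[str], ranges_b: Iterable[str]) -> List[Tuple[str, str]]:
    """Return every (a, b) pair of subnet ranges that overlap across two networks."""
    nets_a = [ipaddress.ip_network(r) for r in ranges_a]
    nets_b = [ipaddress.ip_network(r) for r in ranges_b]
    return [(str(a), str(b)) for a, b in product(nets_a, nets_b) if a.overlaps(b)]


def in_auto_mode_range(cidr: str) -> bool:
    """True if the range falls inside the block used by auto mode networks."""
    return ipaddress.ip_network(cidr).subnet_of(AUTO_MODE_RANGE)
```

An empty result from `overlapping_pairs(["10.0.0.0/24"], ["10.0.1.0/24"])` suggests the two networks could be peered, while any two ranges drawn from the same auto mode block would be reported as a conflict.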
VPC route evaluation order: Special routing paths → Policy-based routes → Subnet routes → Custom routes (static/dynamic) → System-generated default routes.
VPC security has dual asymmetry in enforcement and observability that creates a blind spot at the ingress boundary: firewall rules default to deny-all-ingress/allow-all-egress (asymmetric enforcement), while flow logs capture denied-egress but miss denied-ingress (asymmetric visibility) — the traffic most aggressively blocked by default is precisely the traffic invisible to forensic analysis.
Shared VPC uses a host project model and requires all projects to be in the same organization.
Port 25 (SMTP) egress to external IPs is blocked by default by Google Cloud (not via a firewall rule).
A VPC network's /48 ULA IPv6 range (from fd20::/20) is globally unique within Google Cloud and cannot be removed or changed once assigned.
VMs can have multiple network interfaces attached to different VPC networks.
Direct resource access (granting IAM roles directly to federated identities) is preferred over service account impersonation; impersonation requires roles/iam.workloadIdentityUser.
Workload Identity Federation is the recommended alternative to service account keys for external workloads accessing Google Cloud resources.
The google.subject attribute mapping is required for all workload identity pool providers and has a maximum length of 127 characters.
Best practice is one workload identity pool per non-Google Cloud environment (dev, staging, prod).
Workload Identity Federation principal identifiers use project number (not project ID) in fully qualified resource names.
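A small builder makes the project-number requirement hard to get wrong. This is a sketch under assumptions: the `principal://` layout below is the commonly documented form for a single federated subject, the `principal` helper is hypothetical, and the 127-character limit is the google.subject limit from these notes.

```python
MAX_SUBJECT_LENGTH = 127


def principal(project_number: str, pool_id: str, subject: str) -> str:
    """Build a principal identifier for one federated identity in a pool."""
    # Project IDs such as "my-project" are rejected; only the numeric form is valid here.
    if not (project_number.isascii() and project_number.isdigit()):
        raise ValueError("use the numeric project number, not the project ID")
    if len(subject) > MAX_SUBJECT_LENGTH:
        raise ValueError("google.subject values are limited to 127 characters")
    return (
        f"principal://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}/subject/{subject}"
    )
```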
Workload Identity Federation supports identity providers using OIDC, SAML V2.0, X.509 certificates, AWS, Azure, Active Directory, GitHub, and GitLab.
Workload Identity Federation token exchange follows the OAuth 2.0 token exchange spec (RFC 8693) via the Security Token Service (STS).
Workload Identity Federation provides keyless authentication for GKE workloads and external identities, serving as the recommended alternative to service account keys and reducing credential management overhead.
Workload identity pool providers support up to 50 custom attributes for principalSet:// bindings.
Binary Authorization forms a continuous supply chain security chain: Cloud Build creates attestations at build time and can block unauthorized deployments, while GKE continuous validation (CV) monitors running containers against policies — covering both deployment-time gating and runtime drift detection.
Cloud Armor's edge-first defense compensates for VPC-level ingress visibility gaps by filtering and logging malicious traffic at the Google Cloud edge before it reaches the VPC boundary where flow logs have systematic blind spots for denied ingress packets.
Cloud Run autoscaling targets 60% CPU utilization and concurrency per instance; this absorbs most traffic patterns, but sudden spikes can still run into cold-start latency or the configured maximum instance count.
Cloud Run billing is fully optimizable through request-based pay-per-use default, CUD discounts shared across Cloud Run, GKE, and Compute Engine, and zero-cost for IAM-denied requests.
Cloud Run concurrency scales naturally with vCPU allocation (default 80x vCPUs, max 1000 per instance), providing predictable request distribution across instances.
Cloud Run's three resource types (services, jobs, worker pools) have fundamentally divergent operational characteristics rather than being variations of a single model: services get authentication and autoscaling, jobs always use the second-generation execution environment, and worker pools have neither autoscaling nor public URLs — requiring type-specific operational expertise rather than a single Cloud Run mental model.
Cloud Run achieves zero-ops deployment for simple stateless workloads with deterministic container execution (tag-to-digest resolution, image caching at deploy) and managed HTTPS endpoints with TLS, WebSockets, and gRPC support.
Cloud SQL backups provide seamless disaster recovery with indefinite retention for on-demand backups and consistent connection strings after failover.
Cloud SQL provides complete disaster recovery coverage with incremental backups, indefinite on-demand backup retention, 4-day deleted instance recovery by Customer Care, and same-IP failover for HA instances.
Cloud SQL private IP access gives serverless workloads stable connection strings that survive failover and avoids public internet exposure, but it still requires VPC bridging (Direct VPC egress or a Serverless VPC Access connector) and private services access.
CMEK governance provides sufficient control over data lifecycle across GCP services through duty-separated KMS administration, non-disruptive rotation, and unified key lifecycle management.
CMEK key availability directly controls data lifecycle across GCP services: revoking access, disabling, or destroying a key makes all encrypted data permanently inaccessible, enabling crypto-shredding as a data governance tool.
CMEK key lifecycle has asymmetric risk: rotation is non-disruptive (creates new version without re-encrypting, ciphertext self-identifies its key version), but key destruction or access revocation permanently destroys all encrypted data across 40+ services — making rotation a safe operational practice but destruction an irreversible data lifecycle event.
CMEK key lifecycle serves as the single control plane for data governance across GCP: rotation is operationally safe (ciphertext self-identifies its key version), but destruction permanently shreds all encrypted data across every service — and GCS's tiered encryption model means choosing CMEK explicitly opts into this asymmetric risk, trading storage durability for cryptographic access control at the KMS layer.
Container security in GCP spans the full lifecycle from build provenance to runtime identity: Cloud Build attestations with Binary Authorization enforce supply chain integrity through deployment, while Workload Identity Federation provides keyless runtime credentials — but the end-to-end chain depends on Kubernetes namespace naming conventions for identity isolation, making organizational discipline the binding constraint on what is otherwise a technical guarantee.
Dedicated Interconnect SLA tiers are strictly proportional to geographic redundancy: no SLA for single connections, 99.9% requires 2 connections in the same metro (different edge domains), 99.99% requires 4 connections across 2 metros — each tier doubles the connection count and expands geographic distribution.
Secret Manager's Pub/Sub-based rotation notification creates a fragile end-to-end chain: rotation triggers are notification-only (no actual value change), delivery to Cloud Run depends on push subscriptions (one message per request, SSL required), and Pub/Sub's exactly-once guarantee is pull-only — meaning rotation events to serverless consumers may duplicate or be lost, undermining the rotation lifecycle that Secret Manager delegates entirely to application code.
Managed Instance Groups provide complete VM fleet management with autoscaling, autohealing (using conservative health checks separate from load balancing), and regional distribution protecting against zonal failure.
Preemptible VMs deliver full cost savings compared to on-demand instances across all billing dimensions.
GCP's abstraction inversion drives multiplicative expertise cost: managed services that appear operationally simple demand deeper technical expertise than self-managed alternatives (application-level semantics for Pub/Sub delivery guarantees, Secret Manager rotation, plus comprehensive immutability requiring upfront design), while production costs compound multiplicatively across per-service dimensions — requiring teams to maintain simultaneous deep expertise across every service rather than broad shallow knowledge of any single layer.
GCP managed services create an abstraction inversion where operational simplicity demands deeper technical expertise than self-managed alternatives: services require application-level awareness of delivery semantics, rotation patterns, and IAM granularity, AND infrastructure decisions are comprehensively immutable across networking and service configuration — so mistakes require both deep domain knowledge to avoid and full resource recreation to correct.
Production-grade private data access in GCP requires navigating a cost matrix across two independent dimensions: a universal connectivity tax from VPC peering constraints (non-transitivity, overlapping IP prohibition, DNS non-resolution) affects every data service identically, while production protection investment (HA, replicas, backups, encryption) compounds independently within each service — making total data infrastructure cost the product of service count times the sum of connectivity overhead and per-service protection engineering.
GCP data services demand the deepest expertise behind the simplest interfaces: data management is the highest-complexity operational domain (four-dimensional GCS engineering, per-service protection investment, CMEK cross-service blast radius) while the abstraction inversion ensures this complexity is hidden behind managed interfaces that make cost and risk appear simpler than they are.
GCP data durability requires governance at two independent scales: per-service protection engineering (Cloud SQL triple investment in HA/replicas/private networking, GCS defense-in-depth across immutability/namespace/encryption tiers) AND cross-service CMEK blast radius management (a single key destruction cascades data loss across every service using that key, voiding per-service durability guarantees regardless of investment).
GCP data governance demands simultaneous mastery of two orthogonal time horizons: upfront architectural commitment (immutable infrastructure decisions, dual IAM/CMEK control planes compounding with cross-layer irrecoverability) AND ongoing per-service protection engineering (Cloud SQL triple investment, GCS defense-in-depth, CMEK blast radius management) — neither dimension compensates for deficiency in the other.
GCP data management is the highest-complexity operational domain: GCS alone requires four-dimensional engineering (storage class economics, defense-in-depth protection, namespace security, CMEK governance with cross-service blast radius), and this per-service engineering pattern repeats independently across Cloud SQL (triple investment: HA + replicas + private networking), Memorystore (constrained operational model), and Secret Manager (application-level awareness) — with no shared platform abstraction to amortize the complexity.
GCP's dual security governance (IAM access control + CMEK data control) combined with KMS operational safety (duty-separated, non-disruptive rotation) achieves effective layered defense where compromise of one governance surface does not compromise the other and routine operations cannot accidentally breach either.
GCP creates an expertise paradox: managed service adoption intended to reduce operational burden demands deeper expertise than self-managed alternatives across networking, identity, and data governance — while costs compound multiplicatively, meaning organizations pay more for services that require more skill to operate correctly than the infrastructure they replaced.
GCP failure cost peaks at invisible irrecoverable boundaries: multiplicative cost compounding across infrastructure dimensions (HA, networking, identity) intersects governance detection gaps at security boundaries (missing ingress flow logs, temporal logging gaps), meaning the most expensive operational failures occur precisely where observability is weakest and mistakes cannot be reversed.
GCP governance has a fundamental detection gap at irrecoverable boundaries: the most dangerous failure mode is irrecoverable mistakes where observability is weakest (immutability at security boundaries with blind flow logs and temporal logging gaps), while the governance model demands simultaneous mastery of upfront architectural commitment and ongoing operational engineering — creating a window where irrecoverable decisions are made with the least available information and detected only after correction is impossible.
GCP's comprehensive infrastructure immutability combined with managed services' requirement for deep application-level understanding prevents configuration drift in production: resources remain as designed throughout their lifecycle and operators who understand delivery semantics, rotation patterns, and IAM granularity configure them correctly from the start.
GCP's most dangerous operational failure mode is irrecoverable mistakes at boundaries where observability is weakest: infrastructure immutability permanently locks configuration errors (KMS key rings, AR repo format, VPC stack type), while VPC Flow Logs miss ingress-denied traffic and log export has temporal gaps — creating a systematic pattern where the hardest-to-detect errors are also the hardest to fix.
GCP managed service risk concentrates in naming and identity dimensions: GKE naming collisions cascade through the managed service chain granting unintended IAM identity across Cloud Run, Pub/Sub, and Secret Manager integrations, while every managed service independently demands mastery of both application semantics and identity lifecycle — making naming conventions the single highest-leverage security control across the platform.
GCP managed services (Pub/Sub, Secret Manager) require application-level production awareness that cannot be abstracted away by platform configuration: Pub/Sub demands explicit throughput-correctness decisions (ordering caps at 1 MBps/key, exactly-once is pull-only and regional, dead-letter counting approximate) while Secret Manager demands version pinning, API-direct access patterns, and rotation event handling — contradicting the "fully managed = no application changes" assumption.
GCP managed services require dual mastery spanning application semantics and identity lifecycle: Pub/Sub and Secret Manager demand application-level awareness of delivery guarantees, rotation semantics, and IAM granularity mismatches, while the container security chain from build provenance through runtime identity adds a parallel lifecycle requiring coordinated namespace/SA naming discipline — application-level and identity-level expertise cannot substitute for each other.
GCP managed services (Cloud Run, GKE Autopilot) shift operational complexity rather than eliminating it: serverless networking requires VPC bridging, NAT chains, and peering navigation, while container security demands build attestation, binary authorization, and namespace-disciplined identity — both require the infrastructure expertise the managed abstractions were designed to replace.
Multi-tenant serverless on GCP represents worst-case platform risk across all dimensions simultaneously: the serverless value proposition is negated by networking and identity engineering investment while managed service risk concentrates in naming conventions — the least formalized, least tested, and least visible part of the development lifecycle.
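GCP default observability infrastructure (auto-collected metrics, per-resource log routing, zero-impact flow logs) covers most traffic patterns but is not sufficient on its own for security incident forensics: VPC Flow Logs do not capture ingress packets denied by firewall rules, so firewall rule logging is needed to close that gap.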
GCP platform adoption demands engineering investment that is shifted rather than reduced: managed services redirect operational complexity to networking and identity design while security governance requires upfront immutable architectural commitments across IAM, CMEK, and infrastructure layers — the total engineering surface area is comparable to self-managed infrastructure, just differently distributed.
The universal VPC peering connectivity tax amplifies GCP's highest-complexity operational domain: data management already requires four-dimensional engineering (storage class economics, defense-in-depth protection, lifecycle automation, key governance), and every private data service adds a fifth dimension of peering non-transitivity, IP range exhaustion, and serverless bridging requirements.
GCP production engineering complexity exceeds naive estimates through two independent mechanisms: costs compound multiplicatively across infrastructure dimensions (HA, replicas, private networking, geographic redundancy) AND the investment target shifts from traditional operations to networking/identity/governance domains — teams that budget for only one mechanism systematically underestimate total effort.
GCP production costs are multiplicative rather than additive across two dimensions: infrastructure investment compounds within each service (Cloud SQL 2x for HA plus read replicas plus private networking; Interconnect 2x bandwidth plus encryption overlay plus geographic redundancy) AND each service requires independent protection engineering with no shared abstractions, so total cost = Σ(per-service compounding cost) with no economies of scale.
Production-grade GCP services require infrastructure investment that compounds multiplicatively: Cloud SQL demands concurrent investment in HA (2x cost for idle standby), read replicas (capped at 10, no independent backups), and private networking (peering-constrained), while Cloud Interconnect requires geometric scaling (2x bandwidth + 2x metros for 99.99% SLA) — each service's production readiness is independently expensive with no cross-service economy of scale.
GCP production irrecoverability compounds across layers: infrastructure-level immutability locks resource configuration at creation (AR format, KMS key rings, WIF pools, Cloud SQL private IP), while CMEK key destruction permanently eliminates all encrypted data — mistakes at either layer are unrecoverable, and the layers are independent (a correctly configured key ring can still encrypt data you can never access, and a perfectly mutable resource can have its data crypto-shredded).
GCP has two independent key/secret rotation challenges with complementary risk profiles: KMS rotation is operationally safe (duty-separated, non-disruptive) but destruction is catastrophic, while Secret Manager rotation is notification-only (Pub/Sub message, no actual value change) creating startup/rotation tension in Cloud Run — both channels must be mastered independently and both surface as availability events in production.
GCP secret and key rotation is end-to-end fragile across two independent mechanisms: the event-driven Secret Manager rotation chain depends on Pub/Sub notifications that are subject to delivery guarantee trade-offs (approximate dead-letter counting, ordering throughput limits), while the dual rotation challenge compounds KMS rotation safety (non-disruptive but requires manual re-encryption) with Cloud Run startup tension (fail-fast semantics on secret version changes) — creating a system where no single rotation event can be assumed to propagate reliably to all consumers.
GCP security governance operates through two independent, non-overlapping control planes: IAM controls who can access resources via layered deny-first evaluation with service account hardening, while CMEK controls whether data remains readable at all via key lifecycle — compromising one plane does not compromise the other, but production security requires operating both simultaneously.
GCP's dual security governance (IAM access control + CMEK data control) compounds with cross-layer infrastructure immutability: access policies are mutable post-deployment but encryption keys, repository formats, VPN configurations, and network identity pools are locked at creation, making security architecture an upfront design constraint that cannot be iteratively evolved.
GCP security governance spans three independent failure domains requiring simultaneous mastery: access control (IAM deny-first evaluation with upfront architectural commitment), data control (CMEK lifecycle where key destruction causes irrecoverable data loss), and credential rotation (fragile Pub/Sub notification chains for Secret Manager, KMS version management) — failure in any one domain compromises the security posture that depends on all three.
GCS customer-managed encryption (CMEK/CSEK tiers in the four-tier model) trades data durability for access control: CMEK key destruction permanently shreds encrypted data across services, and CSEK key loss causes unrecoverable data loss — the stronger the customer control over encryption, the higher the operational risk to data availability.
GCS's eleven-nines durability guarantee is conditional on encryption key availability: with default Google-managed encryption, durability is unconditional; with CMEK, key destruction permanently renders data inaccessible (crypto-shredding); with CSEK, key loss means permanent data loss — higher customer encryption control inversely correlates with unconditional durability.
CMEK key lifecycle is the single governance surface for data persistence across GCP: a key that protects GCS objects (conditionally voiding eleven-nines durability on destruction), Spanner databases (auto-deleted after 30 days of key unavailability), and other CMEK-integrated services creates cross-service blast radius from a single key management decision — rotation is safe but destruction or revocation cascades across all dependent services simultaneously.
GCS operational management spans four largely independent dimensions: storage class economics with lifecycle automation, data protection via defense-in-depth across object/namespace/encryption layers, namespace security requiring organizational controls, and cross-service CMEK key governance where a single key destruction voids eleven-nines durability across GCS and every other CMEK-protected service — the fourth dimension is invisible in GCS-specific tooling but dominates the durability guarantee.
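GCS versioning with soft delete covers most, but not all, accidental modification and deletion scenarios: objects are immutable with atomic replacement, noncurrent versions are recoverable, and soft delete retains deleted objects for 7 days by default. Protection lapses once the retention window expires or lifecycle rules remove noncurrent versions, and neither feature helps if a CMEK key is destroyed.
GKE Autopilot provides a managed Kubernetes experience (always regional, pod-level billing, zero-node scaling, Google-managed nodes) that suits many production workloads, but not all: workloads that need privileged containers or node-level customization may require Standard mode.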
GKE Autopilot eliminates all infrastructure operations (always regional, Google-managed nodes, pod-level billing) but the identity design it shifts to is itself fragile: Workload Identity isolation depends on namespace + service account naming conventions across clusters (same name = same IAM identity), and service accounts require active hardening against dual-nature privilege escalation — concentrating all Autopilot operational risk into Kubernetes naming discipline and IAM policy hygiene.
GKE Autopilot eliminates infrastructure operations (always regional, Google-managed nodes, pod-level billing) but shifts the operational burden to identity design: Workload Identity Federation demands namespace and service account naming discipline where same-namespace same-name collisions create identity aliasing, and mistakes in identity configuration are harder to detect than infrastructure misconfiguration because they fail silently at authorization time rather than visibly at provisioning time.
GKE Workload Identity addresses service account key risks (dual nature, default editor role, impersonation surface) but shifts security requirements to namespace and naming discipline — same namespace + same KSA name = same GCP identity regardless of intent, requiring organizational practices rather than just enabling the feature.
GKE Workload Identity eliminates service account keys via WIF's unified keyless pattern but makes identity isolation depend on naming conventions: same namespace + service account name across clusters in the same project maps to the same IAM identity, shifting the security boundary from cryptographic keys to organizational naming discipline.
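The naming dependence described above can be illustrated with a minimal sketch. It assumes the `PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]` member format used in Workload Identity IAM bindings; the project, cluster, namespace, and service account names are hypothetical, and the collision check is an illustration rather than a Google-provided tool.

```python
from collections import defaultdict


def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """Build the IAM member string for a Kubernetes service account.

    The cluster name is not part of the string, which is why two clusters
    in the same project can map to the same IAM identity.
    """
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def find_identity_collisions(workloads):
    """Group workloads by IAM member and return members shared across clusters."""
    clusters_by_member = defaultdict(list)
    for w in workloads:
        member = workload_identity_member(w["project"], w["namespace"], w["ksa"])
        clusters_by_member[member].append(w["cluster"])
    return {m: c for m, c in clusters_by_member.items() if len(c) > 1}


# Hypothetical inventory: two clusters in one project reuse namespace and KSA name.
workloads = [
    {"cluster": "cluster-a", "project": "example-project", "namespace": "payments", "ksa": "api"},
    {"cluster": "cluster-b", "project": "example-project", "namespace": "payments", "ksa": "api"},
    {"cluster": "cluster-b", "project": "example-project", "namespace": "batch", "ksa": "api"},
]

collisions = find_identity_collisions(workloads)
for member, clusters in collisions.items():
    print(f"{member} is shared by clusters: {', '.join(clusters)}")
```

Running a check like this over an inventory of namespaces and service accounts is one way to surface aliasing before an IAM binding grants access to more workloads than intended.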
GKE naming-dependent identity risk cascades through the managed service chain: a namespace/service-account naming collision grants unintended IAM identity, which inherits access to Secret Manager secrets, Pub/Sub topics, and Cloud SQL instances — each of which shifts its own complexity to application-level awareness, amplifying the blast radius of a single naming error across the entire service mesh.
GKE workload security requires mastery of two orthogonal dimensions: naming-dependent identity isolation where namespace/service-account conventions determine IAM identity across clusters (creating cross-cluster identity collisions from naming mistakes), AND the platform's dual IAM/CMEK control planes where access governance and data governance operate independently — making GKE security simultaneously a function of team naming discipline and architectural control-plane design.
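GKE Workload Identity Federation scopes Google Cloud API access by Kubernetes service account, not by pod: every pod that runs under the same namespace and service account name shares one IAM identity, so isolation is only as fine-grained as the service account layout.
Cloud Interconnect can provide production-grade availability through redundancy overprovisioning, dynamic BGP routing, and active-active VLAN configuration across edge availability domains, but it does not encrypt traffic by default, so confidentiality must be layered on separately with HA VPN or MACsec.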
Production Cloud Interconnect demands compounding infrastructure investment: the 99.99% SLA tier requires 4 connections across 2 metros (geographic redundancy with 2x bandwidth overprovisioning), and achieving confidentiality requires independently layering HA VPN or MACsec encryption on top — meaning the highest availability and security posture requires both geographic and protocol-level engineering that scales cost super-linearly.
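The capacity arithmetic behind the 99.99% topology can be sketched as follows. The sizing rule used here (each metro alone must carry the full required bandwidth) is an assumption that matches the "2x bandwidth overprovisioning" described above; the class and function names are hypothetical, and real connections come in fixed sizes that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class InterconnectPlan:
    metros: int
    connections_per_metro: int
    gbps_per_connection: float

    @property
    def total_connections(self) -> int:
        return self.metros * self.connections_per_metro

    @property
    def provisioned_gbps(self) -> float:
        return self.total_connections * self.gbps_per_connection

    def surviving_gbps_after_metro_loss(self) -> float:
        """Capacity left if one entire metro becomes unavailable."""
        return (self.metros - 1) * self.connections_per_metro * self.gbps_per_connection


def plan_for_four_nines(required_gbps: float) -> InterconnectPlan:
    """Size a 2-metro, 4-connection layout so one metro can carry the full load.

    Assumption: 2 connections per metro, load split evenly between them.
    """
    metros = 2
    connections_per_metro = 2
    return InterconnectPlan(metros, connections_per_metro, required_gbps / connections_per_metro)


plan = plan_for_four_nines(20.0)
print(plan.total_connections, plan.provisioned_gbps, plan.surviving_gbps_after_metro_loss())
```

Under these assumptions, a 20 Gbps requirement needs 40 Gbps of provisioned capacity across four connections, before any encryption layer is added.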
KMS governance provides complementary safety for routine operations (duty separation prevents admin crypto access, rotation creates new versions without re-encrypting) but cannot mitigate the catastrophic risk of key destruction — the 30-day scheduled destruction window is the sole safeguard, and once expired, data loss is permanent and cross-service, regardless of governance quality.
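Memorystore for Redis Standard tier provides automatic failover that preserves the connection endpoint, but replication to the replica is asynchronous, so writes not yet replicated at failover time can be lost; Basic tier has no replica to fail over to.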
Production serverless on GCP requires engineering investment comparable to traditional infrastructure: the five-layer networking stack (VPC bridging, peering, private services access, NAT, secret rotation) plus build-to-runtime identity chain demands the same depth of infrastructure and security mastery that serverless was designed to eliminate, compounded by immutable dual control planes (IAM + CMEK) that cannot be retrofitted.
Production serverless on GCP requires parallel engineering investment across networking (five-layer private data stack: VPC bridging, peering non-transitivity, NAT chaining, private services access, Cloud DNS), identity (build provenance via SLSA attestation, Binary Authorization gate, Workload Identity naming discipline), and secrets (startup fail-fast vs rotation notification-only tension) — each dimension independently mandatory, collectively contradicting the managed-simplicity premise.
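Pub/Sub dead-letter topics provide configurable failure handling: a maximum delivery attempt threshold (5 to 100), IAM-gated routing that requires the Pub/Sub service account to hold publisher rights on the dead-letter topic, and cross-project dead-letter topic support. The delivery attempt count is approximate, so a message may be forwarded slightly before or after the configured threshold.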
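The retry-threshold behavior of a dead-letter topic can be illustrated with a local simulation. This is a sketch that uses no Pub/Sub client library: `deliver_with_dead_letter` and the handler are hypothetical names, and the real service counts delivery attempts approximately rather than exactly as this loop does.

```python
def deliver_with_dead_letter(messages, handler, max_delivery_attempts=5):
    """Simulate delivery with a dead-letter threshold.

    A message whose handler fails on every attempt up to the threshold is
    routed to the dead-letter list instead of being retried forever.
    """
    if not 5 <= max_delivery_attempts <= 100:
        raise ValueError("max_delivery_attempts must be between 5 and 100")

    delivered, dead_lettered = [], []
    for msg in messages:
        for _attempt in range(max_delivery_attempts):
            try:
                handler(msg)
            except Exception:
                continue  # simulated nack: the message is redelivered
            delivered.append(msg)  # simulated ack
            break
        else:
            # Threshold reached without a successful ack.
            dead_lettered.append(msg)
    return delivered, dead_lettered


def handler(msg):
    # Hypothetical handler that can never process a malformed payload.
    if msg == "malformed":
        raise ValueError("cannot parse payload")


delivered, dead_lettered = deliver_with_dead_letter(["order-1", "malformed", "order-2"], handler)
print(delivered, dead_lettered)
```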
Pub/Sub delivery guarantees form a throughput-correctness trade-off spectrum: default at-least-once delivery is throughput-unconstrained but allows duplicates, adding ordering caps throughput at 1 MBps per key with cascading redelivery risk, and exactly-once narrows further to pull-only subscriptions — each stronger guarantee progressively shrinks the operational envelope, and composing ordering with exactly-once produces the most constrained mode (sequential in-order acks, thousands-per-second throughput).
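Because default at-least-once delivery allows duplicates, subscribers commonly make processing idempotent. The sketch below assumes the publisher attaches an application-level idempotency key to each message; the class name and key are hypothetical, and the in-memory set stands in for what would need to be a durable store in production.

```python
class IdempotentHandler:
    """Wrap a processing function so each idempotency key is handled once."""

    def __init__(self, process):
        self._process = process
        self._seen = set()  # assumption: in-memory; use a durable store in practice

    def handle(self, dedupe_key: str, payload: str) -> bool:
        """Process the payload unless its key was already handled.

        Returns True when the payload was processed and False for a duplicate.
        """
        if dedupe_key in self._seen:
            return False
        self._process(payload)
        # Record the key only after processing succeeds, so a failure is retried.
        self._seen.add(dedupe_key)
        return True


processed = []
handler = IdempotentHandler(processed.append)

# The second delivery of order-42 simulates an at-least-once redelivery.
results = [
    handler.handle("order-42", "charge card"),
    handler.handle("order-42", "charge card"),
    handler.handle("order-43", "charge card"),
]
print(results, processed)
```

Keying on an application-supplied value rather than the Pub/Sub message ID also covers publisher retries, which produce distinct message IDs for the same logical event.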
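Pub/Sub exactly-once delivery is a subscription-level guarantee, not an end-to-end one from publisher to subscriber: it applies to pull subscriptions, prevents redelivery of acknowledged messages, and does not deduplicate messages that a publisher sends more than once.
Combining Pub/Sub exactly-once delivery with ordering keys yields sequential, in-order exactly-once processing within a subscription, but it is also the most constrained Pub/Sub mode: acknowledgments must proceed in order, and throughput is limited accordingly.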
Secret Manager replication creates a residency-accessibility trade-off: automatic replication provides global access with Google-chosen regions, while user-managed regional replication ensures data residency at rest, in use, and in transit but restricts access to applications in the same region — requiring upfront architectural alignment between compliance requirements and application deployment topology.
Secret Manager requires hard-to-reverse architectural decisions at two independent levels: infrastructure topology (replication trades data residency for global accessibility, regional secrets require regional service endpoints) and application integration (API-direct access with version pinning, Pub/Sub-based rotation notifications, IAM granularity mismatch at secret level). Neither level can be deferred or changed cheaply after deployment.
Serverless in multi-tenant GCP production environments negates the serverless value proposition: engineering investment matches traditional infrastructure (five-layer networking, identity lifecycle, security commitment) while the multi-tenant networking stack and per-message delivery semantics add complexity layers that traditional infrastructure does not impose — the abstraction saves nothing and costs more to reason about.
Serverless workloads in multi-tenant GCP environments represent worst-case total cost of ownership: the five-layer networking stack compounds with per-service protection engineering costs, and each additional private data service (Cloud SQL, Memorystore, Secret Manager) multiplies both the networking and security governance dimensions independently — total cost grows superlinearly with service count.
Container supply chain integrity spans the full build-to-deploy lifecycle: Cloud Build creates attestations, Binary Authorization blocks unauthorized images and continuously validates running containers, while Cloud Run resolves tags to digests and caches images at deploy time — ensuring revisions always serve the exact verified artifact.
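VPC Flow Logs support network forensics but are not a complete record for firewall-related security investigations: flows are sampled and aggregated over a configurable interval, and per-rule allow/deny decisions come from Firewall Rules Logging, which is enabled separately. Enabling flow logs adds no delay or performance penalty to the monitored traffic.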