Infrastructure as Code Notes
Notes taken while studying: reference
infrastructure as code
an important shift in mindset: you can manage almost everything in code, including servers, databases, networks, log files, application configuration, documentation, automated tests, deployment processes, and so on.
categories of IAC tools:
Ad hoc scripts
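An ad hoc script is a script written for one specific job that carries out a series of actions, e.g. a Bash/Python script that installs and starts some software.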
Configuration management tools
Typical example: Ansible. It manages many machines through configuration and makes idempotent what ad hoc scripts do imperatively.
Server templating tools
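A different approach from the configuration management tools above: package the server as an image and install that image directly on the host, so no further Ansible configuration is needed.
There are two broad categories:
Virtual machines: emulate a complete OS with its own CPU/memory/network and so on. They are heavyweight because all of the hardware has to be virtualized for the guest OS. Images are built with tools such as Packer / Vagrant.
Containers: a special kind of isolated process. Lightweight. Example: Docker.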
Orchestration tools
Once the images exist, they still have to be orchestrated:
- deployment
- monitoring / auto healing / auto scaling
- load balancing
- service discovery
- …
Tools that handle these: Kubernetes / Amazon ECS / Nomad / Docker Swarm
Provisioning tools
The configuration management, server templating, and orchestration tools above only define what runs on a server.
Creating the servers and the rest of the infra is the job of provisioning tools such as Terraform / CloudFormation.
Terraform
Open-sourced by HashiCorp. Under the hood it provisions infrastructure by making API calls to different cloud providers (AWS, Azure, GCP).
The example for this section uses Terraform to create an AWS instance and then configures its DNS record in Google Cloud, so a single configuration spans multiple cloud providers.
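A minimal sketch of such a cross-provider configuration. The region, project, AMI ID, zone name, and domain are all placeholder assumptions, not values from the original notes:

```hcl
# Assumed provider settings; replace with your own region and project.
provider "aws" {
  region = "us-east-2"
}

provider "google" {
  project = "example-project" # hypothetical GCP project ID
}

# A single server in AWS.
resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0" # placeholder AMI ID
  instance_type = "t2.micro"
}

# A DNS record in Google Cloud DNS that points at the AWS instance.
resource "google_dns_record_set" "a" {
  name         = "demo.example.com." # hypothetical domain, trailing dot required
  managed_zone = "example-zone"      # hypothetical, must already exist
  type         = "A"
  ttl          = 300
  rrdatas      = [aws_instance.example.public_ip]
}
```

Because the record references `aws_instance.example.public_ip`, Terraform creates the instance first and the DNS record second.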
tips
EOF: heredoc syntax; lets you write multi-line strings without inserting \n
interpolation: ${...} lets you insert a variable into a string, e.g. ${var.server_port}
terraform output [output_name]: view outputs without applying any changes
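The heredoc, interpolation, and output tips can be sketched together as below. The script content and the name `user_data` are illustrative assumptions; only `var.server_port` comes from the notes:

```hcl
variable "server_port" {
  description = "Port the demo web server listens on"
  type        = number
  default     = 8080
}

locals {
  # <<-EOF starts an indented heredoc: a multi-line string with no \n needed.
  # ${var.server_port} is interpolated into the string.
  user_data = <<-EOF
    #!/bin/bash
    echo "Hello, World" > index.html
    nohup busybox httpd -f -p ${var.server_port} &
  EOF
}

output "user_data" {
  description = "Rendered script, readable with terraform output"
  value       = local.user_data
}
```

After one `terraform apply`, `terraform output user_data` prints the rendered script from state without making further changes.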
data sources: a piece of read-only information that is fetched from the provider. The difference from a resource: resources cause Terraform to create, update, and delete infrastructure objects, whereas data sources cause Terraform only to read objects.
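As a sketch of the read-only nature of data sources, the block below looks up the default VPC and its subnets without creating anything. It assumes an AWS provider version that offers `aws_subnets` (4.0 or later) and an account that still has a default VPC:

```hcl
# Read the default VPC; nothing is created or modified.
data "aws_vpc" "default" {
  default = true
}

# Read the subnets that belong to that VPC.
data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }
}

output "default_subnet_ids" {
  value = data.aws_subnets.default.ids
}
```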
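manage terraform state: Terraform provides the concept of workspaces; the initial one is named default. Workspaces make it possible to keep a separate state per environment.
- The point here is that workspaces are not a suitable way to separate our environments. The official documentation says: A common use for multiple workspaces is to create a parallel, distinct copy of a set of infrastructure in order to test a set of changes before modifying the main production infrastructure.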
- Non-default workspaces are often related to feature branches in version control. The default workspace might correspond to the “master” or “trunk” branch, which describes the intended state of production infrastructure.
- Instead, use one or more re-usable modules to represent the common elements, and then represent each instance as a separate configuration that instantiates those common elements in the context of a different backend
- To isolate environments we rely on file layout instead: one environment, one directory.
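A sketch of what one environment directory might contain under this file layout. The directory names, bucket, module path, and module inputs are all hypothetical; the idea is only that each environment is its own configuration with its own backend, instantiating a shared module:

```hcl
# Hypothetical layout:
#   modules/services/webserver-cluster/   <- reusable module
#   stage/services/webserver-cluster/     <- this file
#   prod/services/webserver-cluster/      <- same shape, different values

terraform {
  backend "s3" {
    bucket = "example-terraform-state" # hypothetical bucket
    key    = "stage/services/webserver-cluster/terraform.tfstate"
    region = "us-east-2"
  }
}

module "webserver_cluster" {
  source = "../../../modules/services/webserver-cluster"

  # Hypothetical module inputs; prod would pass larger values.
  cluster_name  = "webservers-stage"
  instance_type = "t2.micro"
}
```

Since stage and prod use different state keys, a mistake in one directory cannot touch the other environment's state.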
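dynamic: the Terraform keyword for loops inside a resource; it uses for_each to generate tag blocks dynamically, as follows:

    # Generates one inline "tag" block per entry in var.custom_tags.
    dynamic "tag" {
      for_each = var.custom_tags

      content {
        key                 = tag.key
        value               = tag.value
        propagate_at_launch = true
      }
    }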
count limitations: Terraform requires that it can compute count and for_each during the plan phase, before any resources are created or modified. This means that count and for_each can reference hardcoded values, variables, data sources, and even lists of resources (so long as the length of the list can be determined during plan), but not computed resource outputs.
terraform plan: plan compares against the state file, so if someone has changed the infra manually, apply can run into problems. Either make every infra change through Terraform, or bring the out-of-band resources under management afterwards with the terraform import command (Terraforming is a ready-made tool for this).
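The count and for_each constraint can be illustrated with a sketch like this one. The user names are made up, and the disallowed case is left commented out so the block stays valid:

```hcl
variable "user_names" {
  description = "IAM user names to create (illustrative values)"
  type        = list(string)
  default     = ["neo", "trinity", "morpheus"]
}

# Fine: the length of a variable is known during plan.
resource "aws_iam_user" "example" {
  count = length(var.user_names)
  name  = var.user_names[count.index]
}

# Not allowed: random_integer's result is a computed output that only
# exists after apply, so Terraform cannot resolve count during plan.
#
# resource "random_integer" "num_instances" {
#   min = 1
#   max = 3
# }
#
# resource "aws_instance" "example" {
#   count         = random_integer.num_instances.result
#   ami           = "ami-0c55b159cbfafe1f0" # placeholder AMI ID
#   instance_type = "t2.micro"
# }
```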
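Refactoring Can Be Tricky: refactoring infra is very different from refactoring application code. For example, by default Terraform treats a name change as deleting the old resource and then creating a new one, which means downtime in between.
- plan: carefully scan the output manually.
- create_before_destroy: add create_before_destroy = true to the lifecycle block.
- terraform state: applies only to resource renames. Run terraform state mv aws_security_group.instance aws_security_group.cluster_instance manually.
- immutable parameters: some resource parameters are immutable, so changing them means destroy + create. Read the docs carefully.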
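A sketch of the create_before_destroy and rename tips. The resource reuses the names from the `terraform state mv` command above; `name_prefix` is an assumption made so that the replacement can coexist with the old security group during the swap. The `moved` block is not in the original notes: it is a declarative alternative to `terraform state mv` available in Terraform 1.1 and later.

```hcl
resource "aws_security_group" "cluster_instance" {
  # A prefix instead of a fixed name, so the new group can be created
  # while the old one still exists.
  name_prefix = "cluster-instance-"

  lifecycle {
    # Create the replacement first, then destroy the old resource.
    create_before_destroy = true
  }
}

# Records the rename in code, so plan shows a move instead of
# a destroy followed by a create.
moved {
  from = aws_security_group.instance
  to   = aws_security_group.cluster_instance
}
```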
production-grade infra
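This covers a lot: servers, data stores, load balancers, security functionality, monitoring and alerting tools, building pipelines, and all the other pieces of your technology that are necessary to run a business.
Yet most effort estimates turn out to be wrong, especially for DevOps work. Why?
- DevOps as an industry is still young, with many pitfalls left to discover. Terraform, for instance, was first released only in 2014.
- DevOps work is very prone to yak shaving: pull one thread and everything moves. For example, you need a deployment service, but it depends on configuration/SFTP/TLS/DNS/login and so on; or deploying an app triggers a bug that sets off a chain reaction of TLS issues, timeouts, and so on.
- accidental complexity: DevOps touches everything from build to deployment to security and so on, so the depth of any problem you might run into is unknown, e.g. pipeline agents/network, production timeouts/OOM, and so on.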
Production-Grade Infrastructure Checklist, excerpted from Terraform: Up & Running
Task | Description | Example tools |
---|---|---|
Install | Install the software binaries and all dependencies. | Bash, Chef, Ansible, Puppet |
Configure | Configure the software at runtime. Includes port settings, TLS certs, service discovery, leaders, followers, replication, etc. | Bash, Chef, Ansible, Puppet |
Provision | Provision the infrastructure. Includes servers, load balancers, network configuration, firewall settings, IAM permissions, etc. | Terraform, CloudFormation |
Deploy | Deploy the service on top of the infrastructure. Roll out updates with no downtime. Includes blue-green, rolling, and canary deployments. | Terraform, CloudFormation, Kubernetes, ECS |
High availability | Withstand outages of individual processes, servers, services, data centers, and regions. | Multidatacenter, multiregion, replication, auto scaling, load balancing |
Scalability | Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers). | Auto scaling, replication, sharding, caching, divide and conquer |
Performance | Optimize CPU, memory, disk, network, and GPU usage. Includes query tuning, benchmarking, load testing, and profiling. | Dynatrace, valgrind, VisualVM, ab, Jmeter |
Networking | Configure static and dynamic IPs, ports, service discovery, firewalls, DNS, SSH access, and VPN access. | VPCs, firewalls, routers, DNS registrars, OpenVPN |
Security | Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. | ACM, Let’s Encrypt, KMS, Cognito, Vault, CIS |
Metrics | Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, and alerting. | CloudWatch, DataDog, New Relic, Honeycomb |
Logs | Rotate logs on disk. Aggregate log data to a central location. | CloudWatch Logs, ELK, Sumo Logic, Papertrail |
Backup and Restore | Make backups of DBs, caches, and other data on a scheduled basis. Replicate to separate region/account. | RDS, ElastiCache, replication |
Cost optimization | Pick proper Instance types, use spot and reserved Instances, use auto scaling, and nuke unused resources. | Auto scaling, spot Instances, reserved Instances |
Documentation | Document your code, architecture, and practices. Create playbooks to respond to incidents. | READMEs, wikis, Slack |
Tests | Write automated tests for your infrastructure code. Run tests after every commit and nightly. | Terratest, inspec, serverspec, kitchen-terraform |
module tips
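small modules: beginners tend to put every environment into a single module or file. The downsides are as obvious as in application code, and even more so for infra; we need to keep units small and independent.
composable modules: Unix philosophy. Function composition. Minimize side effects.
releasable modules: use Git tags with semantic versioning. Modules can be released to https://registry.terraform.io/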
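A sketch of consuming a released module by Git tag. The repository URL, module path, tag, and inputs are hypothetical; the `//` separator and the `?ref=` parameter are standard Terraform module source syntax:

```hcl
module "webserver_cluster" {
  # Pin to a semantic-version Git tag so upgrades are explicit.
  source = "git::https://github.com/example-org/terraform-modules.git//services/webserver-cluster?ref=v0.1.0"

  # Hypothetical module inputs.
  cluster_name  = "webservers-prod"
  instance_type = "t2.micro"
}
```

Moving an environment to a newer release is then a one-line change to `ref`, followed by `terraform init -upgrade` and a plan.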
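beyond terraform modules: the shift in mindset matters here. Modules are usually discussed as Terraform code, but a module folder can also hold other infra code; see the run-vault Bash script. In other words, some non-Terraform code is unavoidable to cover what the declarative model cannot express. That said, there are workarounds: null_resource, provisioners, and the external data source.
provisioners: execute scripts on the local or a remote machine. A provisioner can be combined with null_resource to run scripts during the Terraform lifecycle.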
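A minimal sketch of a provisioner attached to a null_resource, assuming the hashicorp/null provider is available and the machine running Terraform has `uname`. The `triggers` value is chosen so the script reruns on every apply:

```hcl
resource "null_resource" "example" {
  # uuid() yields a new value on every run, which forces this
  # resource to be replaced and the provisioner to run again.
  triggers = {
    uuid = uuid()
  }

  # Runs on the machine executing Terraform, not on a server.
  provisioner "local-exec" {
    command = "echo \"Hello, World from $(uname -smp)\""
  }
}
```

Dropping the `triggers` map would make the script run only once, when the null_resource is first created.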
external data source: passes data from Terraform to an external program, and the external program passes data back to Terraform as JSON.
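A sketch of the external data source, assuming the hashicorp/external provider and a machine with Bash. The program here simply echoes its input, so the `query` map comes straight back as `result`; the key `foo` is illustrative:

```hcl
data "external" "echo" {
  # Terraform writes the query to the program's stdin as JSON and
  # expects a JSON object of string values on stdout.
  program = ["bash", "-c", "cat /dev/stdin"]

  query = {
    foo = "bar"
  }
}

output "echo" {
  value = data.external.echo.result
}

output "echo_foo" {
  value = data.external.echo.result.foo
}
```

Both input and output are limited to flat maps of strings, so anything more structured has to be encoded by the external program.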
testing
The DevOps world is full of fear: fear of downtime; fear of data loss; fear of security breaches
Infra changes need tests to build confidence. So how should infra code be tested?
- manual testing:
  - Test by hand. Note that resources in a private subnet need a jump host before they can be tested manually.
  - cleaning up: cloud-nuke and aws-nuke are both tools that can quickly delete everything in an AWS account.
- automated testing:
  - Terratest deploys real infrastructure into a real environment, then validates that infrastructure via API/SSH/…
  - Other tools: pre-commit-terraform / goss
benefits
Persuading people to adopt IaC is hard, especially non-developers, because IaC brings many extra costs. A few talking points to start from:
I have an idea for how to reduce our outages in half.
deployment process is fully automated, reliable, and repeatable