Data Sources

Query existing resources and external data

In the previous tutorial, we learned how Terraform tracks infrastructure with state. Now let's talk about looking stuff up.

Resources create things. Data sources read things. Need the latest AMI? Query it. Need info about an existing VPC someone else created? Look it up. Data sources let you reference infrastructure you didn't create with Terraform — like reading someone else's notes without being able to edit them.

Data Source Basics

"How do I use something that already exists in AWS?"

Like this:

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]  # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id  # Use the data source
  instance_type = "t2.micro"
}

Pattern: data.type.name.attribute

See the difference? resource creates, data reads. Same block structure, different superpower.
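The same pattern reaches every attribute a data source exports, not just id. Here is a small sketch that assumes the aws_ami lookup above; the output name ubuntu_ami_details is made up for illustration:

```hcl
# Hypothetical output that surfaces a few attributes of the AMI found above
output "ubuntu_ami_details" {
  value = {
    id            = data.aws_ami.ubuntu.id
    name          = data.aws_ami.ubuntu.name
    architecture  = data.aws_ami.ubuntu.architecture
    creation_date = data.aws_ami.ubuntu.creation_date
  }
}
```

Running terraform plan with this in place is a quick way to check which image the filter actually matched before you launch anything.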

Common Data Sources

Let's go through the ones you'll use all the time.

aws_ami — Find AMIs

"Do I really have to hardcode AMI IDs?"

Nope. Let Terraform find the latest one:

# Latest Amazon Linux 2
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

# Latest Ubuntu 22.04
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

Use it:

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"
}

aws_vpc — Reference Existing VPC

"My team already set up a VPC manually. How do I use it?"

Look it up by tag, ID, or just grab the default:

# Get default VPC
data "aws_vpc" "default" {
  default = true
}

# Get VPC by tag
data "aws_vpc" "production" {
  tags = {
    Environment = "production"
  }
}

# Get VPC by ID
data "aws_vpc" "specific" {
  id = "vpc-0abc123def456789"
}

Use it:

resource "aws_subnet" "new_subnet" {
  vpc_id     = data.aws_vpc.production.id
  cidr_block = "10.0.100.0/24"
}

aws_subnets — Get Multiple Subnets

# All subnets in a VPC
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.production.id]
  }

  tags = {
    Tier = "private"
  }
}

Returns a list of IDs:

resource "aws_instance" "web" {
  count     = length(data.aws_subnets.private.ids)
  subnet_id = data.aws_subnets.private.ids[count.index]
  # ...
}
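If you would rather key instances by subnet ID than by position in the list, for_each is a possible alternative to count. This is only a sketch: the resource name web_per_subnet is hypothetical, it reuses the Ubuntu AMI lookup from earlier, and it assumes the subnet IDs can be resolved at plan time.

```hcl
# Hypothetical variant: one instance per private subnet, keyed by subnet ID
resource "aws_instance" "web_per_subnet" {
  for_each = toset(data.aws_subnets.private.ids)

  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"
  subnet_id     = each.value # each.key and each.value are both the subnet ID
}
```

With count, removing a subnet from the middle of the list shifts every later index. With for_each, only the instance tied to that subnet ID is affected.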

aws_availability_zones — Get AZs

data "aws_availability_zones" "available" {
  state = "available"
}

output "azs" {
  value = data.aws_availability_zones.available.names
  # ["us-east-1a", "us-east-1b", "us-east-1c", ...]
}
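The AZ list becomes more useful once you feed it into other resources. The sketch below creates one subnet per zone in the production VPC looked up earlier. It assumes that VPC has a /16 CIDR block and that the carved-out ranges are still free; the resource name per_az and the offset of 10 are illustrative.

```hcl
# Hypothetical: one subnet in every available AZ
resource "aws_subnet" "per_az" {
  count = length(data.aws_availability_zones.available.names)

  vpc_id            = data.aws_vpc.production.id
  availability_zone = data.aws_availability_zones.available.names[count.index]

  # Carve a /24 out of the VPC's (assumed) /16; the +10 offset is arbitrary
  cidr_block = cidrsubnet(data.aws_vpc.production.cidr_block, 8, count.index + 10)
}
```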

aws_caller_identity — Who Am I?

"Wait, which AWS account am I even using right now?"

Good question. Find out:

data "aws_caller_identity" "current" {}

output "account_id" {
  value = data.aws_caller_identity.current.account_id
}

output "arn" {
  value = data.aws_caller_identity.current.arn
}

Useful for dynamic policies:

data "aws_caller_identity" "current" {}

resource "aws_s3_bucket_policy" "bucket_policy" {
  bucket = aws_s3_bucket.my_bucket.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
        Action    = "s3:*"
        Resource  = [
          aws_s3_bucket.my_bucket.arn,
          "${aws_s3_bucket.my_bucket.arn}/*"
        ]
      }
    ]
  })
}

aws_region — Current Region

data "aws_region" "current" {}

output "region" {
  value = data.aws_region.current.name  # "us-east-1"
}
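Region and account ID are the two pieces you usually need to assemble an ARN by hand. A minimal sketch, assuming the two data sources above; the log group path /app/web is a placeholder:

```hcl
locals {
  # Hypothetical ARN built from the current region and account.
  # Depending on your AWS provider version, the region attribute may be
  # exposed under a different name; check the provider docs if "name" warns.
  app_log_group_arn = "arn:aws:logs:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:log-group:/app/web"
}
```

This keeps the same configuration portable across accounts and regions without hardcoding either value.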

aws_iam_policy_document — Build IAM Policies

"Writing IAM policies in raw JSON is painful."

Agreed. Use this instead — it's a game changer:

data "aws_iam_policy_document" "s3_read" {
  statement {
    effect = "Allow"
    actions = [
      "s3:GetObject",
      "s3:ListBucket"
    ]
    resources = [
      "arn:aws:s3:::my-bucket",
      "arn:aws:s3:::my-bucket/*"
    ]
  }
}

resource "aws_iam_policy" "s3_read" {
  name   = "s3-read-policy"
  policy = data.aws_iam_policy_document.s3_read.json
}

Benefits over inline JSON:

  • Terraform validates syntax (no more missing commas)
  • You can reference other resources
  • You can merge multiple documents

How cool is that?
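To make the "reference other resources" point concrete, here is a sketch where the policy document pulls the ARN straight from a bucket resource instead of a hardcoded string. The bucket name and the logs/logs_read labels are placeholders, not part of the original example.

```hcl
# Hypothetical bucket; S3 bucket names must be globally unique
resource "aws_s3_bucket" "logs" {
  bucket = "example-logs-bucket"
}

data "aws_iam_policy_document" "logs_read" {
  statement {
    effect  = "Allow"
    actions = ["s3:GetObject", "s3:ListBucket"]

    # The ARNs follow the bucket resource, so renaming it updates the policy too
    resources = [
      aws_s3_bucket.logs.arn,
      "${aws_s3_bucket.logs.arn}/*",
    ]
  }
}
```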

Combining Policy Documents

data "aws_iam_policy_document" "s3_access" {
  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::bucket1/*"]
  }
}

data "aws_iam_policy_document" "dynamodb_access" {
  statement {
    effect    = "Allow"
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem"]
    resources = ["arn:aws:dynamodb:*:*:table/my-table"]
  }
}

data "aws_iam_policy_document" "combined" {
  source_policy_documents = [
    data.aws_iam_policy_document.s3_access.json,
    data.aws_iam_policy_document.dynamodb_access.json
  ]
}
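The combined document is used like any other policy document: through its json attribute. A short sketch, with a made-up policy name:

```hcl
# Hypothetical managed policy built from the merged document above
resource "aws_iam_policy" "combined" {
  name   = "s3-and-dynamodb-access"
  policy = data.aws_iam_policy_document.combined.json
}
```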

External Data

"Can I pull secrets from AWS Secrets Manager?"

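Absolutely. This is where data sources really shine. Just remember that anything a data source reads, secrets included, ends up in your state file, so lock that state down.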

aws_secretsmanager_secret_version — Get Secrets

data "aws_secretsmanager_secret_version" "db_creds" {
  secret_id = "prod/database/credentials"
}

locals {
  db_creds = jsondecode(data.aws_secretsmanager_secret_version.db_creds.secret_string)
}

resource "aws_db_instance" "main" {
  username = local.db_creds.username
  password = local.db_creds.password
  # ...
}
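Values decoded from a secret stay marked as sensitive, so if you ever expose one through an output, Terraform expects you to say so explicitly. A minimal sketch, assuming the local.db_creds value above; the output name is illustrative:

```hcl
# Hypothetical output derived from the secret; omitting sensitive = true
# here should cause Terraform to reject the configuration
output "db_username" {
  value     = local.db_creds.username
  sensitive = true
}
```

The flag hides the value in plan and apply output, not in the state file itself.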

aws_ssm_parameter — Get SSM Parameters

data "aws_ssm_parameter" "ami_id" {
  name = "/infrastructure/ami/web-server"
}

resource "aws_instance" "web" {
  ami = data.aws_ssm_parameter.ami_id.value
  # ...
}

Data Source vs Resource

Aspect | Resource | Data Source
Purpose | Create/manage | Read/query
Keyword | resource | data
State | Tracked and managed | Read results cached in state, not managed
Lifecycle | Create, update, delete | Refresh on each plan

Filtering

Many AWS data sources accept filter blocks, and when you give more than one, a result has to match all of them:

data "aws_instances" "web_servers" {
  filter {
    name   = "tag:Role"
    values = ["web"]
  }

  filter {
    name   = "instance-state-name"
    values = ["running"]
  }
}

output "web_server_ips" {
  value = data.aws_instances.web_servers.public_ips
}
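Within a single filter block, the values list works the other way around: an instance matches if it has any one of the listed values. A sketch under that assumption, where the api role is a hypothetical second tag value:

```hcl
# Hypothetical: running instances tagged Role=web OR Role=api
data "aws_instances" "web_or_api" {
  filter {
    name   = "tag:Role"
    values = ["web", "api"] # any one of these values is enough
  }

  filter {
    name   = "instance-state-name"
    values = ["running"]
  }
}

output "web_or_api_private_ips" {
  value = data.aws_instances.web_or_api.private_ips
}
```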

Filter vs Tags

# Using filter
filter {
  name   = "tag:Environment"
  values = ["production"]
}

# Using tags argument (simpler, when supported)
tags = {
  Environment = "production"
}

Both work. tags is cleaner when available. Use whichever doesn't make you squint.

Handling Missing Data

"What happens if my data source finds nothing?"

It explodes. Well, it errors out. Let's handle that gracefully:

# This fails if no matching AMI found
data "aws_ami" "web" {
  # ...
}

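One way to soften this is to make the lookup optional with count, so it only runs when no AMI ID is supplied:

# Optional override; an empty string means "look up the AMI instead"
variable "ami_id" {
  type    = string
  default = ""
}

locals {
  use_custom_ami = var.ami_id != ""
}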

data "aws_ami" "web" {
  count       = local.use_custom_ami ? 0 : 1
  most_recent = true
  # ...
}

resource "aws_instance" "web" {
  ami = local.use_custom_ami ? var.ami_id : data.aws_ami.web[0].id
}
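Plural data sources such as aws_subnets tend to hand back an empty list rather than an error, which can fail much later and less clearly. One option, assuming Terraform 1.2 or newer, is a postcondition on the data block. The label checked_private and the error text are illustrative:

```hcl
# Hypothetical lookup that fails early with a readable message
data "aws_subnets" "checked_private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.production.id]
  }

  tags = {
    Tier = "private"
  }

  lifecycle {
    postcondition {
      condition     = length(self.ids) > 0
      error_message = "No subnets tagged Tier=private were found in the production VPC."
    }
  }
}
```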

Remote State Data Source

"Can I read outputs from another Terraform project?"

Yes! This is how large teams share infrastructure info across projects:

# In your VPC project, you output:
output "vpc_id" {
  value = aws_vpc.main.id
}

# In another project, read it:
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "web" {
  vpc_id = data.terraform_remote_state.vpc.outputs.vpc_id  # The output exported above
  # ...
}

Great for large projects split across multiple state files. Your VPC team manages VPCs, your app team references them.

HTTP Data Source

"Can I fetch data from a random URL?"

Sure can. Fetch data from any API:

data "http" "myip" {
  url = "https://api.ipify.org?format=text"
}

resource "aws_security_group_rule" "ssh_from_my_ip" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["${chomp(data.http.myip.response_body)}/32"]
  security_group_id = aws_security_group.web.id
}
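The http data source lives in its own provider, so the configuration needs to declare it. A minimal sketch; the version constraint is an assumption based on response_body being available in the 3.x releases of the provider:

```hcl
terraform {
  required_providers {
    # Provider that supplies the "http" data source used above
    http = {
      source  = "hashicorp/http"
      version = ">= 3.0" # assumed constraint; pin to whatever you have tested
    }
  }
}
```

Keep in mind that the lookup runs on every plan, so the rule follows whatever IP the machine running Terraform has at that moment.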

Complete Example

Dynamic infrastructure using multiple data sources:

# Current region and account
data "aws_region" "current" {}
data "aws_caller_identity" "current" {}

# Latest Ubuntu AMI
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

# Existing VPC
data "aws_vpc" "main" {
  tags = {
    Name = "main-vpc"
  }
}

# Private subnets in that VPC
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  tags = {
    Tier = "private"
  }
}

# Available AZs
data "aws_availability_zones" "available" {
  state = "available"
}

# IAM policy for EC2 role
data "aws_iam_policy_document" "ec2_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

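# Create resources using data
resource "aws_iam_role" "web" {
  name               = "web-server-role"
  assume_role_policy = data.aws_iam_policy_document.ec2_assume_role.json
}

# Instance profile that the instances below reference (the name is illustrative)
resource "aws_iam_instance_profile" "web" {
  name = "web-server-profile"
  role = aws_iam_role.web.name
}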

resource "aws_instance" "web" {
  count = min(length(data.aws_subnets.private.ids), 3)

  ami                  = data.aws_ami.ubuntu.id
  instance_type        = "t2.micro"
  subnet_id            = data.aws_subnets.private.ids[count.index]
  iam_instance_profile = aws_iam_instance_profile.web.name

  tags = {
    Name   = "web-${count.index}"
    Region = data.aws_region.current.name
  }
}

output "infrastructure_info" {
  value = {
    account_id = data.aws_caller_identity.current.account_id
    region     = data.aws_region.current.name
    vpc_id     = data.aws_vpc.main.id
    ami_used   = data.aws_ami.ubuntu.id
  }
}

What's Next?

Data sources are incredibly powerful — they connect your Terraform to the real world beyond just what you've created. You now know:

  • How to look up AMIs, VPCs, subnets, and AZs
  • How to build IAM policies the right way
  • How to read secrets and remote state
  • How to filter results and handle missing data

Terraform has some serious built-in logic powers too. Let's explore expressions and functions — conditionals, loops, string manipulation, and more. Let's go!