Intelligent AWS Network Troubleshooting with Strands AI Agents
From Manual Network Debugging to AI-Powered Troubleshooting: A First-Timer’s Journey with Strands
At 2 AM, your ECS backend tasks can’t connect to your Aurora database. Security groups look correct, route tables seem fine, but something’s blocking the connection. You’re about to start the familiar dance of checking 15 different AWS services, cross-referencing documentation, and manually piecing together the network path.
What if an AI agent could systematically analyze your entire network configuration, trace the connection path, and identify the exact blocking component with surgical precision?
This was my first time ever working with AI agents or the Strands framework. In just 3-4 hours, I went from complete beginner to having a sophisticated AWS network troubleshooting agent with 30+ specialized tools.
A Complete Beginner’s First AI Agent
Before this project, I had zero experience with AI agent frameworks. I’d only recently heard about Strands and had never touched it. I assumed that building something that could reason about complex AWS network configurations would be difficult and time-consuming.
Four hours later, I had an agent that could systematically analyze VPC configurations, trace network paths, correlate security group rules with route tables, and provide actionable remediation steps. The speed at which you can go from concept to working solution with Strands is genuinely impressive.
The Traditional Network Troubleshooting Problem
Network connectivity issues in AWS aren’t just technical problems; they’re detective stories. When your Lambda function can’t reach an RDS instance, or your ECS tasks can’t communicate with external APIs, you’re not just debugging code; you’re investigating a complex multi-layered system.
The traditional approach looks like this:
- Check security groups (source and destination)
- Verify route tables for all involved subnets
- Examine Network ACLs (stateless rules)
- Confirm VPC configuration and CIDR blocks
- Validate NAT Gateway and Internet Gateway settings
- Review VPC endpoint configurations
- Cross-reference AWS documentation for service-specific requirements
Each step requires switching between different AWS console pages, remembering service-specific networking behaviors, and manually correlating findings. By the time you’ve gathered all the data, you’ve lost the big picture.
Enter the Strands Framework
The Strands framework enables you to build AI agents that can systematically analyze complex infrastructure configurations. Unlike traditional scripts that follow predetermined logic, Strands agents can reason about your network configuration and adapt their investigation based on what they discover.
Here’s what makes this approach powerful: instead of encoding network troubleshooting logic into code, we give the AI agent comprehensive tools to gather raw AWS configuration data and let it reason about the relationships and dependencies.
Building the Network Troubleshooting Agent
Core Architecture: Data-Driven Investigation
My agent follows a fundamental principle: tools gather data, AI provides reasoning. Rather than building smart tools that make decisions, I built simple tools that return comprehensive AWS configuration data.
import boto3
from strands import tool


@tool
def get_security_group(sg_id: str, region: str = "eu-central-1") -> str:
    """Get security group configuration - raw data only."""
    try:
        ec2 = boto3.client("ec2", region_name=region)
        response = ec2.describe_security_groups(GroupIds=[sg_id])
        sg = response["SecurityGroups"][0]
        # Return the network-relevant fields as-is, without judging them
        return str({
            "GroupId": sg["GroupId"],
            "GroupName": sg["GroupName"],
            "Description": sg["Description"],
            "VpcId": sg.get("VpcId"),
            "IpPermissions": sg["IpPermissions"],
            "IpPermissionsEgress": sg["IpPermissionsEgress"],
            "Tags": sg.get("Tags", []),
        })
    except Exception as e:
        # Surface the error text to the agent instead of raising
        return f"Error: {str(e)}"
This tool doesn’t analyze whether the security group rules are “correct”—it simply returns the configuration in a structured format. The AI agent uses this data along with context from other tools to understand the network topology and identify issues.
Comprehensive Tool Coverage
My agent includes 30+ specialized tools covering most aspects of AWS networking:
Service Discovery Tools:
- EC2 instances with security group associations
- ECS clusters, services, and tasks with network configurations
- Lambda functions with VPC settings
- RDS instances and Aurora clusters with subnet groups
Network Infrastructure Tools:
- VPC and subnet configurations
- Route tables and their associations
- Network ACLs (the stateless layer everyone forgets)
- NAT Gateways and Internet Gateways
- Load balancers and target groups
Advanced Connectivity Tools:
- VPC endpoints for AWS service access
- S3 bucket policies affecting VPC endpoint traffic
- DynamoDB table configurations
- DNS resolution settings
Integration with AWS Documentation
One of the most powerful aspects of my implementation is the integration with AWS’s official documentation through their MCP (Model Context Protocol) server:
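from mcp import StdioServerParameters, stdio_client
from strands.tools.mcp import MCPClient

# Setup AWS Documentation MCP client
aws_docs_client = MCPClient(
    lambda: stdio_client(
        StdioServerParameters(
            command="uvx",
            args=["awslabs.aws-documentation-mcp-server@latest"]
        )
    )
)

# Combine custom tools with AWS documentation
# (custom_tools is the list of @tool functions defined elsewhere)
with aws_docs_client:
    aws_doc_tools = aws_docs_client.list_tools_sync()
    all_tools = custom_tools + aws_doc_tools
    # Create and run the agent inside this block so the MCP session stays open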
This means my agent can reference current AWS best practices, security recommendations, and service-specific networking requirements while analyzing configurations. It’s like having the AWS documentation team participate in your troubleshooting session.
The System Prompt: Structured Investigation Methodology
Rather than hoping the AI stumbles toward a solution, I guide it through a systematic analysis framework:
system_message = """
EXPERT AWS NETWORK CONNECTIVITY ANALYSIS REQUEST
REQUESTED ANALYSIS SCOPE:
1. RESOURCE DISCOVERY & CONFIGURATION AUDIT
- Identify and analyze network configuration of both source and destination services
- Document VPC, subnet, availability zone, and security group assignments
- Review service-specific networking settings
2. END-TO-END NETWORK PATH ANALYSIS
- Trace complete network path from source to destination
- Analyze route tables for all involved subnets
- Examine Network ACL rules (stateless) for traffic flow
- Review security group rules (stateful) for proper ingress/egress permissions
3. SECURITY & COMPLIANCE VALIDATION
- Verify configurations against AWS security best practices
- Reference AWS Well-Architected Framework security pillar recommendations
- Ensure least-privilege access principles are maintained
4. ROOT CAUSE IDENTIFICATION
- Pinpoint exact blocking component(s) with specific evidence
- Explain the technical reason for connectivity failure
5. ACTIONABLE REMEDIATION PLAN
- Provide step-by-step resolution instructions
- Include specific AWS CLI commands ready for execution
- Suggest verification procedures to confirm fixes
"""
This structured approach ensures the agent follows best practices and provides actionable recommendations rather than vague suggestions.
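The prompt and the tool list still have to be handed to an agent. Here is a minimal sketch of that wiring, assuming the `Agent(tools=..., system_prompt=...)` constructor from the Strands SDK. `build_question` and `run_investigation` are hypothetical helper names, not part of my original code, and the call needs model credentials configured for Strands.

```python
def build_question(source: str, destination: str, port: int) -> str:
    """Format the request the agent investigates (hypothetical helper)."""
    return (
        f"Why can't {source} connect to {destination} on TCP port {port}? "
        "Follow the analysis scope in your instructions."
    )


def run_investigation(question: str, tools: list, system_prompt: str):
    """Create a Strands agent with the given tools and run one investigation."""
    # Imported lazily so this sketch loads even where strands-agents
    # is not installed.
    from strands import Agent

    agent = Agent(tools=tools, system_prompt=system_prompt)
    # Calling the agent runs the tool-use loop and returns its final answer.
    return agent(question)
```

In my setup this would be called inside the `with aws_docs_client:` block, for example `run_investigation(build_question("ecs-service/backend", "aurora-cluster/prod", 3306), all_tools, system_message)`, where the resource names are placeholders.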
First Implementation Insights: What I Learned in 4 Hours
Token Inefficiency: The Hidden Cost
While building 30+ tools (together with my buddy Claude Code) in a few hours felt productive, I quickly realized I was being inefficient with tokens. Even though I was filtering AWS responses to relevant fields, there were still optimization opportunities:
@tool
def get_security_group(sg_id: str, region: str = "eu-central-1") -> str:
    # My implementation extracted relevant fields...
    return str({
        "GroupId": sg["GroupId"],
        "GroupName": sg["GroupName"],
        "VpcId": sg.get("VpcId"),
        "IpPermissions": sg["IpPermissions"],  # This can be HUGE
        "IpPermissionsEgress": sg["IpPermissionsEgress"],  # This too
        "Tags": sg.get("Tags", []),  # Often unnecessary for network analysis
    })
The token waste comes from:
- Verbose security group rules: IpPermissions arrays can be massive, with complex nested structures for CIDR blocks, port ranges, and security group references
- Converting everything to strings: Large nested dictionaries become verbose when stringified
- Full tag arrays: Returning all tags when maybe only network-relevant ones matter
- Unnecessary depth: Some AWS responses have deeply nested structures that could be flattened
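As a rough illustration of how much of this is avoidable, the sketch below prunes empty fields, keeps only selected tags, and emits compact JSON instead of `str()` on a dict. The helper names and the `keep_tag_keys` default are assumptions for this example, not part of my original tools.

```python
import json


def _prune(value):
    """Recursively drop None values, empty lists, and empty dicts."""
    if isinstance(value, dict):
        pruned = {k: _prune(v) for k, v in value.items()}
        return {k: v for k, v in pruned.items() if v not in (None, [], {})}
    if isinstance(value, list):
        return [_prune(v) for v in value]
    return value


def compact_security_group(sg: dict, keep_tag_keys: tuple = ("Name",)) -> str:
    """Serialize a describe_security_groups entry with less token overhead."""
    slim = {
        "GroupId": sg["GroupId"],
        "VpcId": sg.get("VpcId"),
        "IpPermissions": sg.get("IpPermissions", []),
        "IpPermissionsEgress": sg.get("IpPermissionsEgress", []),
        # Keep only the tags that help identify the resource.
        "Tags": [t for t in sg.get("Tags", []) if t.get("Key") in keep_tag_keys],
    }
    # Compact separators remove the whitespace that str() and default
    # json.dumps() would add.
    return json.dumps(_prune(slim), separators=(",", ":"))
```

Boto3 fills every rule with empty `Ipv6Ranges`, `PrefixListIds`, and `UserIdGroupPairs` lists, so pruning alone tends to shrink rule-heavy groups noticeably. The actual saving depends on the group and on the model’s tokenizer.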
Strands Built-in Tools: The Path Not Taken
Halfway through building my custom AWS tools, my colleague hinted that Strands has a built-in use_aws tool. This was a humbling moment. I’d spent a few hours (with Claude Code) building something that might already exist in a more optimized form.
# My custom approach - 30+ individual tools
from tools import (
    find_instance, get_security_group, list_instances,
    list_ecs_clusters, get_ecs_service_details,
    # ... 25+ more tools
)

# Strands built-in approach - potentially more efficient
# (use_aws ships in the separate strands-agents-tools package)
from strands_tools import use_aws
The built-in tool likely handles token optimization, response filtering, and error handling more efficiently than my first-time implementation. For future iterations, I plan to investigate customizing the built-in tool rather than maintaining 30+ custom implementations.
What Worked Incredibly Well
Despite the inefficiencies, several design decisions proved exactly right:
Data-driven tool design: Letting the AI reason about raw configuration data rather than embedding logic in tools was the correct approach, even with token overhead.
Comprehensive coverage: Having tools for every AWS networking component gave the agent complete visibility. The agent could correlate findings across services in ways I hadn’t anticipated.
MCP integration: Connecting with AWS documentation through MCP was surprisingly straightforward and immediately valuable. The agent could reference current best practices while analyzing configurations. It also provided remediations with up-to-date AWS CLI commands, including correct resource ARNs and IDs.
Structured system prompts: The systematic investigation framework prevented the agent from wandering aimlessly through AWS configurations.
Technical Implementation Insights
Raw Data vs. Processed Intelligence
Every tool returns AWS API responses formatted as structured data. This design choice prevents tools from making assumptions about what constitutes “correct” configuration:
# Good: Returns raw security group rules
"IpPermissions": [
    {
        "IpProtocol": "tcp",
        "FromPort": 3306,
        "ToPort": 3306,
        "IpRanges": [{"CidrIp": "10.0.0.0/8"}]
    }
]

# Bad: Tools making judgments
"SecurityGroupStatus": "Allows MySQL access from private networks - OK"
The AI agent can reason about these rules in context with other network components, while tools remain objective data gatherers.
MCP Integration Patterns
The Model Context Protocol integration gives my agent access to current AWS documentation:
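# The agent can reference current best practices. It normally calls the
# documentation tools itself; a direct query, run inside the
# `with aws_docs_client:` block and assuming the server's
# search_documentation tool, would look roughly like this
query_result = aws_docs_client.call_tool_sync(
    tool_use_id="docs-query-1",
    name="search_documentation",
    arguments={"search_phrase": "security group best practices for RDS access"},
)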
This ensures recommendations align with current AWS guidance rather than outdated practices embedded in training data.
Optimization Roadmap
Immediate Wins
Token-optimized responses: Filter out verbose nested structures and return only network-relevant data:
# Instead of full IpPermissions, extract key info
"NetworkAccess": {
    "AllowedPorts": [3306, 443],
    "SourceCidrs": ["10.0.0.0/8", "172.16.0.0/12"],
    "SourceSecurityGroups": ["sg-12345", "sg-67890"]
}
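One possible way to derive that summary from boto3-style `IpPermissions` is sketched below. The function name `summarize_network_access` is hypothetical, and the handling of port ranges and the "all traffic" protocol is my own assumption about what a useful summary looks like.

```python
def summarize_network_access(ip_permissions: list) -> dict:
    """Flatten boto3-style IpPermissions into a compact NetworkAccess summary."""
    ports, cidrs, groups = set(), set(), set()
    for rule in ip_permissions:
        if rule.get("IpProtocol") == "-1":
            # Protocol "-1" means all protocols and ports.
            ports.add("all")
        elif "FromPort" in rule:
            low, high = rule["FromPort"], rule["ToPort"]
            ports.add(low if low == high else f"{low}-{high}")
        cidrs.update(r["CidrIp"] for r in rule.get("IpRanges", []))
        cidrs.update(r["CidrIpv6"] for r in rule.get("Ipv6Ranges", []))
        groups.update(p["GroupId"] for p in rule.get("UserIdGroupPairs", []))
    return {
        # key=str lets single ports (int) and ranges (str) sort together.
        "AllowedPorts": sorted(ports, key=str),
        "SourceCidrs": sorted(cidrs),
        "SourceSecurityGroups": sorted(groups),
    }
```

One caveat with this flat shape: it no longer records which source is allowed on which port, so a group that allows 3306 from one CIDR and 443 from another looks the same as one that allows both ports from both. If that pairing matters for the diagnosis, a per-rule summary is the safer trade-off.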
Smart tool selection: Rather than giving the agent 30+ tools immediately, use a tiered approach where basic tools can suggest more specific tools to call.
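I have not built the tiered approach yet, so the following is only a sketch of one way it could work. The tier names, the follow-up mapping, and several tool names (`get_network_acl`, `get_route_table`, `get_nat_gateway`) are illustrative assumptions; the agent would start with the discovery tier and be re-created with more tools once a tier is done.

```python
# Hypothetical tiers; tool names are illustrative, not my full tool list.
TOOL_TIERS = {
    "discovery": ["list_instances", "list_ecs_clusters", "get_ecs_service_details"],
    "security": ["get_security_group", "get_network_acl"],
    "routing": ["get_route_table", "get_nat_gateway"],
}

# Which tiers become available after a tier has been worked through.
FOLLOW_UPS = {
    "discovery": ["security", "routing"],
    "security": ["routing"],
    "routing": [],
}


def select_tools(registry: dict, unlocked: set) -> list:
    """Return the tool callables for every unlocked tier, in tier order."""
    return [
        registry[name]
        for tier, names in TOOL_TIERS.items()
        if tier in unlocked
        for name in names
        if name in registry
    ]


def unlock(unlocked: set, finished_tier: str) -> set:
    """Add the follow-up tiers of a finished tier to the unlocked set."""
    return unlocked | set(FOLLOW_UPS.get(finished_tier, []))
```

Here `registry` maps tool names to the decorated tool functions. Starting with `{"discovery"}` exposes three tools instead of 30+, which keeps the tool descriptions sent to the model short until the investigation actually needs more.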
Built-in tool evaluation: Test Strands’ use_aws tool against my custom implementations for both functionality and token efficiency.
Future Enhancements
VPC Reachability Analyzer integration: AWS’s definitive network path analysis tool could provide authoritative connectivity answers:
@tool
def analyze_network_path(source_id: str, destination_id: str, destination_port: int):
    """Use VPC Reachability Analyzer to trace network path."""
    # Implementation would create reachability analysis
    # and return definitive path information
    pass
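The stub above could be filled in roughly as follows, using the boto3 EC2 calls `create_network_insights_path`, `start_network_insights_analysis`, and `describe_network_insights_analyses`. This is an untested sketch rather than part of my agent: the `ec2`, `poll_seconds`, and `max_polls` parameters are additions for illustration, TCP is assumed, and the `@tool` decorator is left off so the example has no Strands dependency.

```python
import time


def analyze_network_path(source_id: str, destination_id: str, destination_port: int,
                         region: str = "eu-central-1", ec2=None,
                         poll_seconds: float = 5.0, max_polls: int = 60) -> str:
    """Run a VPC Reachability Analyzer analysis and return the raw result."""
    if ec2 is None:
        import boto3  # imported lazily so a stub client can be injected
        ec2 = boto3.client("ec2", region_name=region)

    # 1. Define the path to test (source and destination are resource IDs
    #    such as network interfaces or instances).
    path_id = ec2.create_network_insights_path(
        Source=source_id,
        Destination=destination_id,
        Protocol="tcp",
        DestinationPort=destination_port,
    )["NetworkInsightsPath"]["NetworkInsightsPathId"]

    # 2. Start the analysis for that path.
    analysis_id = ec2.start_network_insights_analysis(
        NetworkInsightsPathId=path_id
    )["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]

    # 3. Poll until the analysis leaves the "running" state.
    for _ in range(max_polls):
        analysis = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id]
        )["NetworkInsightsAnalyses"][0]
        if analysis["Status"] != "running":
            break
        time.sleep(poll_seconds)

    # Raw data only: the agent interprets the explanations.
    return str({
        "Status": analysis["Status"],
        "NetworkPathFound": analysis.get("NetworkPathFound"),
        "Explanations": analysis.get("Explanations", []),
    })
```

Two things to keep in mind: analyses are billed per run, and this sketch does not delete the path or analysis afterwards. For ECS tasks or Aurora, the source and destination would most likely have to be the underlying network interface IDs.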
Lessons Learned
Framework vs. Custom Development
Start with framework capabilities: Before building custom tools, thoroughly explore what the framework already provides. Strands likely has optimized implementations of common patterns.
Custom tools for domain-specific needs: Build custom tools for specialized requirements that frameworks don’t address, but leverage built-in capabilities for standard operations.
Iterative optimization: First implementations should prioritize functionality over efficiency. Optimize based on actual usage patterns rather than theoretical concerns.
Tool Design Philosophy
Keep tools simple and objective: Complex logic belongs in the AI reasoning layer, not in data gathering tools. This separation makes the system more maintainable.
Optimize for tokens, not just data completeness: Return filtered, relevant data rather than complete API responses. The AI can always request additional details through follow-up tool calls.
AI Agent Architecture
Structured prompts prevent wandering: Without systematic investigation frameworks, agents can get lost in the complexity of AWS configurations.
Data correlation is where AI shines: The ability to correlate findings across multiple AWS services simultaneously is where AI agents provide value beyond traditional scripts.
Documentation integration is crucial: Access to current AWS best practices and service-specific requirements elevates the agent from configuration reader to expert advisor.
The Bottom Line
Building intelligent AWS network troubleshooting with Strands represents a fundamental shift from reactive debugging to proactive analysis. The fact that a complete beginner can build sophisticated infrastructure analysis tools in a few hours speaks to both the power of the Strands framework and the maturity of AI agent development.
The key insight isn’t just about automating existing processes; it’s about enabling a level of comprehensive analysis that wasn’t practical with manual approaches. When your agent can simultaneously analyze security groups, route tables, Network ACLs, and VPC endpoints, and cross-reference everything against current AWS best practices, you’re not just debugging faster; you’re debugging better.
Try It Yourself
The Strands framework provides the foundation for building your own AWS automation agents. Whether you’re tackling network troubleshooting, cost optimization, or security compliance, the patterns I’ve explored here can adapt to your specific infrastructure challenges.
Start with comprehensive data gathering tools, add systematic reasoning through structured prompts, and let AI agents transform your AWS operations from reactive firefighting to proactive optimization.
The framework is accessible and the patterns are straightforward. Network troubleshooting that might have required a specialist and a lot of time now takes minutes, with solid results and actionable remediation plans.
Your AWS infrastructure is complex enough. Your troubleshooting process doesn’t have to be.