Job Description

About the Role

We're seeking an exceptional Principal Site Reliability Engineer to architect, design, and build our SRE foundation from the ground up at InfiniteChoice. This is a rare greenfield opportunity to establish SRE practices, develop custom tooling, and create the reliability culture that will support our platform serving millions of users and billions in transaction volume.

As our Principal SRE, you'll combine deep technical expertise with strategic vision to build world-class monitoring, observability, and automation systems. You'll have the autonomy to define our SRE processes, select technologies, and create the framework that ensures our systems are reliable, scalable, and performant.

Location: Remote (US-based)

What You Will Do

SRE Foundation & Process Development

Build SRE practices from scratch - define SLIs, SLOs, error budgets, and reliability metrics
Establish incident response procedures, on-call rotations, and post-mortem processes
Create reliability engineering standards and best practices across all engineering teams
Develop disaster recovery and business continuity strategies
Design and implement capacity planning and performance optimization frameworks

Architecture & Tool Development

Drive architecture decisions for comprehensive application and infrastructure monitoring solutions
Design and develop custom SRE tools for automated monitoring, alerting, and remediation
Build observability platforms that provide deep insights into system performance and user experience
Create automation frameworks for deployment, scaling, and incident response
Architect logging, metrics, and tracing systems for distributed microservices environments

Google Cloud Infrastructure Excellence

Leverage Google Cloud Platform services to build resilient, scalable infrastructure
Implement cloud-native monitoring using Cloud Monitoring and Cloud Logging (formerly Stackdriver)
Design auto-scaling and self-healing systems using GKE, Cloud Functions, and managed services
Optimize cloud costs while maintaining high availability and performance standards
Establish security and compliance frameworks within GCP environments

Innovation & Continuous Improvement

Research and implement cutting-edge SRE tools and methodologies
Leverage AI and machine learning for predictive analytics, anomaly detection, and automated remediation
Create dashboards and reporting systems that provide actionable insights to engineering and business teams
Establish feedback loops for continuous improvement of reliability and performance
Stay current with industry best practices and emerging technologies in the SRE space

What You Must Have

SRE & Infrastructure Expertise

12+ years of experience in Site Reliability Engineering or Infrastructure Engineering
5+ years in lead SRE roles building and scaling SRE teams and processes
Proven track record designing and implementing monitoring and observability solutions at scale
Deep understanding of distributed systems, microservices architectures, and cloud-native patterns
Experience with infrastructure as code, configuration management, and deployment automation

Google Cloud Platform Proficiency

Hands-on experience with Google Cloud Platform is required
Expertise with GCP monitoring and observability stack (Cloud Monitoring, Cloud Logging, Cloud Trace)
Experience with GKE, Compute Engine, Cloud Functions, and other core GCP services
Knowledge of GCP networking, security, and compliance capabilities
Understanding of GCP cost optimization and resource management

Technical Skills

Strong programming skills in Python, Go, Java, or similar languages
Experience with monitoring tools (Prometheus, Grafana, Datadog, New Relic, or similar)
Proficiency with containerization (Docker, Kubernetes) and orchestration platforms
Knowledge of CI/CD pipelines, automated testing, and deployment strategies
Understanding of database performance tuning and optimization (SQL and NoSQL)

AI & Automation

Familiarity with AI-driven development tools and methodologies is a huge plus
Experience with machine learning for operations (AIOps), anomaly detection, or predictive analytics
Knowledge of automated incident response and self-healing systems
Understanding of AI/ML tools for log analysis, pattern recognition, and intelligent alerting

Problem-Solving & Mindset

Strong analytical and troubleshooting skills for complex distributed systems
Experience with high-pressure incident response and crisis management
Detail-oriented with commitment to operational excellence and continuous improvement
Comfortable with ambiguity and building processes in a fast-growing environment
Passion for reliability, automation, and engineering best practices
Demonstrated experience building SRE programs and processes from the ground up is a HUGE plus

Education

Bachelor's degree in Computer Science, Engineering, or equivalent professional experience
Industry certifications preferred (Google Cloud Professional, SRE, or related)

What We Offer

Ground-floor opportunity to build SRE practices and culture from scratch
Full autonomy to define processes, select technologies, and establish best practices
Direct impact on platform reliability serving millions of users
Opportunity to create lasting engineering culture and operational excellence
Remote-first culture with in-person meetings in Dallas, TX as needed
Collaborative environment with smart, passionate engineers and cross-functional teams
Access to cutting-edge technologies and AI-driven development tools
Competitive compensation, equity participation, and comprehensive benefits

Ready to Build World-Class Reliability?

Join us in creating the SRE foundation that will power InfiniteChoice's next phase of growth. If you're passionate about reliability engineering, love building systems from scratch, and want to establish the operational excellence that scales with our business, we'd love to hear from you.

About InfiniteChoice

InfiniteChoice was founded to help people find the experiences they want simply and effortlessly. We leverage a new type of business model and platform that uniquely applies automation and technology to solve the challenges of scale and complexity in experience discovery.

Existing business and marketing technologies can no longer handle the demands of connecting millions of consumers with vast inventories of experiences across a fragmented, global marketplace of people, partners, and providers.

Our mission is to disrupt this status quo by creating seamless connections between consumers and experiences. We're just at the beginning of this journey, but our approach is working: we've helped over 275 million visitors connect to millions of experiences, generating over $2 billion in revenue for our brands and partners.

Job Tags

Remote work

Principal Site Reliability Engineer (SRE) Job at INFINITE CHOICE LLC, Dallas, TX

Dallas, TX

Full Time

2026-05-25

2026-06-24