As a Senior Site Reliability Developer (SRD, IC4), you will play a key role in ensuring the availability, scalability, and operational excellence of OCI's Japan Sovereign Cloud services. You will design and implement automation, drive service reliability improvements, lead complex incident investigations, and partner with development teams to improve operational readiness. You will own and prioritize an SRD operational improvement backlog based on shift feedback, incident reviews, alert quality reviews, and business reliability requirements.
The role combines software engineering expertise with large-scale cloud operations and requires participation in a 24x7 shift rotation supporting critical cloud infrastructure. You will translate operational and business requirements into reliability plans, then execute improvements through tooling, automation, runbook updates, process changes, and cross-team coordination. You will also serve as a technical mentor for less experienced engineers and contribute to continuous improvement initiatives across the organization. You will collaborate with JP Sovereign Cloud and EU Sovereign Cloud teams to share operational practices and align reliability improvements where appropriate.
Qualifications
- Native-level Japanese language proficiency and business-level English communication skills
- 5+ years of experience in Site Reliability Engineering, Software Engineering, Cloud Infrastructure, DevOps, or related technical disciplines
- Proficiency in one or more programming languages such as Java, Python, Go, C++, or similar
- Experience with cloud platforms, infrastructure automation, observability, monitoring, and incident response practices
- Strong understanding of Linux systems administration, networking, storage, and performance optimization
- Demonstrated ability to troubleshoot complex cross-functional production issues and drive root cause analysis
- Ability to participate in a 24x7 shift rotation and provide technical leadership during critical service events
- Demonstrated ability to intake, triage, and prioritize operational issues raised by shift teams and convert them into executable improvement plans
- Experience improving alert quality, reducing alert noise, increasing actionability, and ensuring operational documentation supports timely incident response
- Ability to balance business requirements, technical feasibility, and operational risk when planning reliability improvements
Career Level - IC4
Only Oracle brings together the data, infrastructure, applications, and expertise to power everything from industry innovations to life-saving care. And with AI embedded across our products and services, we help customers turn that promise into a better future for all. Discover your potential at a company leading the way in AI and cloud solutions that impact billions of lives.
True innovation starts when everyone is empowered to contribute. That’s why we’re committed to growing a workforce that promotes opportunities for all. We offer competitive benefits that support our people with flexible medical, life insurance, and retirement options. We also encourage employees to give back to their communities through our volunteer programs.
We’re committed to including people with disabilities at all stages of the employment process. If you require accessibility assistance or accommodation for a disability at any point, let us know by emailing [email protected] or by calling 1-888-404-2494 in the United States.
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans’ status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.