Tuesday, May 29, 2012

How to Avoid Automatic Restarts of OC4J Servers

If you notice that your production OC4J servers that house SOA/BPEL are automatically restarting every now and then, don't panic. Here is why it happens and how to avoid it.

First, let's understand what restarts the OC4J servers.

When OC4J servers restart automatically, it is usually OPMN that does it. OPMN stands for Oracle Process Manager and Notification Server. An opmn daemon process runs on every cluster node and pings the JVM and HTTPD processes for health checking and death detection. If the response from an OC4J or HTTP server exceeds a configurable timeout (the ping timeout), OPMN may consider the server dead and decide to restart it. More explanation of the OPMN Automatic Restart can be found here.

The OPMN ping operation is configured through three settings: timeout, retry and interval. The default ping timeout is 20 seconds.
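To make the interplay of timeout and retry concrete, here is a small Python sketch of a ping-based death check. It is only an illustrative model, not OPMN's actual implementation: the function name should_restart and the ping callback are made up for this example, and it assumes that retry means the number of extra attempts allowed after the first failed ping. Check the OPMN documentation for the exact semantics in your release.

```python
from typing import Callable


def should_restart(ping: Callable[[float], bool], timeout: float, retry: int) -> bool:
    """Illustrative model only (not OPMN code).

    `ping(timeout)` is a hypothetical callback that returns True if the
    server answered within `timeout` seconds, False otherwise.
    Assumption: `retry` is the number of extra attempts after the first
    failed ping. The server is declared dead only if every attempt fails.
    """
    attempts = 1 + retry
    for _ in range(attempts):
        if ping(timeout):
            return False  # server answered in time, so it is considered alive
    return True  # every attempt timed out, so the monitor would restart it


if __name__ == "__main__":
    # A server that is so busy it never answers within the timeout.
    never_in_time = lambda timeout: False
    print(should_restart(never_in_time, timeout=20.0, retry=2))
```

In this model, a busy but healthy server that misses a single ping survives as long as one of the retries succeeds, which is why a longer timeout or more retries reduces false restarts.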

If the OPMN is checking the health of an OC4J JVM, then there is a reverse ping operation as well.

Second, let's see how to avoid it.

Sometimes an OC4J server's response to the OPMN ping exceeds the ping timeout not because the server is dead, but because it is under heavy load or there is a problem with the network.

There are therefore two ways to keep OPMN from restarting the OC4J server in such circumstances.

1. Find out whether your server is overloaded. If it is, tune the performance of the server accordingly. There are plenty of posts and documentation about performance monitoring and tuning of SOA servers running in OC4J.

2. Configure the ping and reverse ping to change how OPMN detects server death. The configuration is in the opmn.xml on each node. My friend Chintan Shah has beautifully summarized the details in his blog: http://chintanblog.blogspot.com/2010/12/opmn-ping-timeouts.html
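Before changing anything, it helps to see what ping settings each node currently has. The Python sketch below reads an opmn.xml-style document and lists the ping attributes per process type. Treat it as a rough sketch under assumptions: the element names process-type and ping, the id oc4j_soa, and the values 60/30/2 in the sample are illustrative only, so compare them against your own opmn.xml and the blog linked above before relying on them.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: the id and the numeric values are made up,
# and a real opmn.xml contains much more (and usually an XML namespace).
SAMPLE = """<process-type id="oc4j_soa" module-id="OC4J">
  <ping timeout="60" interval="30" retry="2"/>
</process-type>"""


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{uri}ping' down to 'ping'."""
    return tag.rsplit("}", 1)[-1]


def ping_settings(xml_text: str) -> dict:
    """Map each process-type id to the attributes of its direct <ping> child."""
    root = ET.fromstring(xml_text)
    result = {}
    for elem in root.iter():  # iter() also visits the root element itself
        if local_name(elem.tag) != "process-type":
            continue
        for child in elem:
            if local_name(child.tag) == "ping":
                result[elem.get("id")] = dict(child.attrib)
    return result


if __name__ == "__main__":
    for process_id, settings in ping_settings(SAMPLE).items():
        print(process_id, settings)
```

To check a real node, read the file's contents into a string and pass it to ping_settings. The script only reads the settings; edit opmn.xml itself by hand, following the Oracle documentation.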