@ZEAL Blog·厉
We stand alone,
TOGETHER.

Troubleshoot High CPU Issues

Posted by zeal on 2005-08-11 11:01 , 17094 characters , Via |  + 1 - 0   English
转载请保留本行原始出处声明信息 : http://www.zeali.net/entry/139 MaDe1nZEAL
标签 ( JAVA/C++ ):  ,  , 
Got high CPU Issues? Explore operating system commands specific to JVMs to detect the cause of the problem. by Steve Pozarycki, July 14, 2005. 以后有时间再翻译一下。

High CPU issues can occur for many reasons: your application server (or stand-alone Java application), user-created threads, bad coding practices, or third-party software. Unfortunately, determining where this happens can sometimes be difficult. If your Java applications ever encounter high CPU issues, you will find the tips for debugging presented here to be useful.

You can attempt to help troubleshoot this problem by using operating system-specific commands and gathering data. All of the information for these operating systems is based on the Sun JVM. However, the same information is provided from other JVMs (such as the BEA JRockit JVM) so the same concepts apply.

Here we'll look at examples of troubleshooting this problem on Solaris and Microsoft Windows systems. Information for Hewlett-Packard, AIX, and Linux can be found on the BEA support Web site (see Resources).

Note that because HP does not have a prstat-like command available there is a binary executable that I developed that is available on the BEA Support Web site that you can download to gather the necessary information on HP. The information presented here was run on the WebLogic Platform, but the troubleshooting methodology will apply to any Java application.

Troubleshooting Solaris
Run the prstat command on the Java process, and do this for several iterations so you can see a pattern. For example:

prstat -L -p <PID> 1 1

If you do not have prstat available on your Solaris installation, do this instead:

ps -Le -o pid,user,s,lwp,pcpu,
	args | awk '$3 != "S" { print 
	}'

where the output will show the Process id (PID), user, state, and the lightweight thread number. An example is something like this:

0 root T 1 0.0 sched
2769 root O 1 0.5 ps -Le -o 
	pid,user,s,lwp,args
3383 spoz R 27 0.1 
	/wwsl/sharedInstalls/solaris/j2
	sdk1.4.2_04/bin/java -client 
	-Xms32m -Xmx200m X
3383 spoz R 31 0.1 
	/wwsl/sharedInstalls/solaris/j2
	sdk1.4.2_04/bin/java -client 
	-Xms32m -Xmx200m X
2770 spoz R 1 0.3 awk $3 != "S" 
	{ print }

You can then use the information from the fourth column of the output for the Light Weight Process (LWP) number in these steps. Run the pstack command on the Java process to get a mapping of the LWP to the PID. For example:

pstack 9499

Redirect the output to a file. If you use regular thread libraries (that is, you do not have /usr/lib/lwp in the LD_LIBRARY_PATH) in Solaris, a LWP is not mapped to the OS thread directly, so you have to take pstack from the process (check to see if you are using the alternate thread library).

Next, get several thread dumps of the java application over time to make sure you get the right threads executing. You can do this by doing a kill -3 <PID> on the Java process. Then map the LWP id to the Java Thread id. For example, if the LWP in question is 8, it could map to the Java thread 76. We would then calculate the hex value of 76 to be 0x4c.

Look in the thread dump to find the thread that matched nid=0x4c, and you will have the thread that is consuming the CPU. You will then have to figure out why this is happening in your code or if the top of the stacktrace is from WebLogic. Then you can contact support.

Let's look at an example of this process. Run the pstack command (pstack 9499) and redirect the output to a file (see Listing 1). If you use regular thread libraries (you do not have /usr/lib/lwp in the LD_LIBRARY_PATH) in Solaris, a LWP isn't mapped to the OS thread directly, so you have to take pstack from process (check to see if you are using the alternate thread library). Listing 1 shows the java/8 process at the top of prstat. Let's look at the pstack output for lwp# 8. It maps to thread# 76 from the pstack output (see Listing 2). Next get a thread dump of the java application by doing a kill -3 <PID> on the Java process. Since lwp# 8 is mapped to thread# 76, we calculate the hex value of 76 to be 4c, which maps to nid=0x4c in the JVM thread dump:

"Thread-6" prio=5 tid=0x1e2fd8 
	nid=0x4c waiting on monitor 
	[0xe2e81000..0xe2e819d8]
	at java.lang.Thread.sleep(
	Native Method)
	at weblogic.management.deploy.
	GenericAppPoller.run(
	GenericAppPoller.java:139) 

In this example the thread using the most CPU is sleeping. This thread is specific to WebLogic, and it is an application poller that runs when WebLogic is started in Development mode. It runs every 30 seconds, so it is apparent that the thread dump was not captured in time to see activity in this thread. Ideally all three steps should be done quickly and in a row to capture the data as close together in time as possible. You can do so through a simple shell script like this:

#
# Takes an argument (PID of the 
	Java process) and loops three 
	times. This will append the 
	prstat information to a file 
	called dump_high_cpu.txt. The 
	thread dump information will 
	either be in file where stdout 
	was redirected or printed on 
	the screen.
#

for loopnum in 1 2 3
do
	prstat -L -p $1 1 1 >> 
		dump_high_cpu.txt
	pstack $1 >> dump_high_cpu.txt
		kill -3 $1
	echo "prstat, pstack, and 
		thread dump done. #" $loopnum
	sleep 1

	echo "Done sleeping."

done 

Troubleshooting Windows
You can use pslist and get thread details of the java process on Microsoft Windows. The pslist command is freely available from the Sysinternals freeware site (see Resources). Run pslist -d <Java PID> and redirect the output to a file, and do this for several iterations so you can see a pattern. You should see the user time and kernel time increasing. Take thread dumps of the application over several iterations by pressing Ctrl-Break when the application is running in a command window. If the application is WebLogic and you have it running as a service, then you can execute a separate command to get a thread dump on Windows (see Resources).

Take the Thread id # that seems to be increasing, change the decimal value to a hex value (you can use the Calc function in Windows), and look in the thread dump for the nid=0x<Your Hex Value> until you find the offending thread. You will then have to figure out why this is happening in your code or if the top of the stacktrace is from WebLogic (for example), and then you can contact BEA support.

Or, you can also use the Process Explorer from Sysinternals (see Resources). This tool shows the CPU usage dynamically. Unfortunately, there is no logging or history in Process Explorer, so you will have to monitor it and write down the Thread id from the Java process that is taking up the most CPU. To do this with Process Explorer:

  1. Find the java process, then right click and select properties.
  2. Click the Threads tab to bring up all the threads associated with this java process. You can click on one of the threads that are listed as MSVCRT.dll+<Some hex offset>. You can see the Thread id listed in the pane below it.
  3. Observe which thread is using up the most CPU.
  4. Take thread dumps of the application (as described previously for Windows).
  5. Take the Thread id # from step 2, and change the decimal value to a hex value (you can use the Calc function in Windows).
  6. Look in the thread dump for the nid=0x<Your Hex Value> until you find the offending thread.

You will have to figure out why this is happening in your code or if the top of the stacktrace is from WebLogic (for example), and then you can contact BEA support.

Here's an example of this procedure using using pslist and thread dumps. Run pslist -d 1720 (the output shows only a couple of threads):

java 1720:

Tid Pri Cswtch State User Time 
	Kernel Time Elapsed 
2000 8 2 Wait:UserReq 
	0:00:00.000 0:00:00.000 
	0:04:41.965
588 9 4744 Wait:UserReq 
	0:00:02.814 0:00:00.110 
	0:04:41.965
1784 9 132 Wait:UserReq 
	0:00:00.080 0:00:00.000 
	0:04:41.955
1756 9 196 Wait:UserReq 
	0:00:00.931 0:00:00.000 
	0:04:41.955
1716 8 6 Wait:Queue 
	0:00:00.000 0:00:00.000 
	0:04:41.945

We take the same output again some time later to get another snapshot of the threads to see which have increased significantly, and doing so should point us at the Thread id (TID) to look at. We can then run pslist again (pslist -d 1720):

java 1720:

Tid Pri Cswtch State User Time 
	Kernel Time Elapsed Time
2000 8 657 Wait:UserReq 
	0:00:00.090 0:00:00.010 
	0:08:01.211
588 10 59123 Wait:UserReq 
	0:00:48.830 0:00:02.633 
	0:08:01.211
1784 8 150 Wait:UserReq 
	0:00:00.090 0:00:00.000 
	0:08:01.201
1756 8 251 Wait:UserReq 
	0:00:00.941 0:00:00.000 
	0:08:01.201
1716 8 6 Wait:Queue 
	0:00:00.000 0:00:00.000 
	0:08:01.191

We now see that Thread id 588 is using up most of the user/kernel time and hence the CPU. Obviously this thread has a problem. We can then take the Thread id number of 588 and convert that to hexadecimal (0x24). We can then look at the thread dumps we took while the problem was occurring and look for nid=0x24. We see that this corresponds to ExecuteThread 10 from the thread dumps (see Listing 3).

We can see that it is on a socketWrite() native method, but it looks like the HelloWorld2.service() is doing something wrong. We can then look at that line number (94 of HelloWorld2.java) to see what is going on. From this code snippet of HelloWorld2.java from the service() method:

89 out.println(
	ExampleUtils.returnHtmlHeader(
	"Hello World 2"));
90 out.println("<h4>");
91 for (int i=0;i<100000000;i++) 
	{
92 int j = 0;
93 j = j +i;
94 out.println(defaultGreeting + 
	" " + defaultName + "!");
95 }
96
97 out.println("</h4>");
98 out.println(
	ExampleUtils.returnHtmlFooter()
	);

we see that for some reason the output stream is being written in a very long for loop, which is the error and cause of the high CPU in this case. If this code is corrected, then the CPU will not be maxed out.

Once you have the list of steps needed on your operating system to figure out high CPU issues, you should be able to more easily determine the cause of the issue. You can even put those simple steps into a script or cron job that can be run to monitor your application in a quick and easy manner.

About the Author
Steven Pozarycki is a senior backline developer relations engineer with BEA Systems, Customer Support Group. He specializes in troubleshooting and solving complex customer issues with their mission-critical applications on BEA products. Contact Steve at spoz@bea.com.  

 

Resources

Output Redirect

Listing 1. In the pstack command’s output file redirect the java/8 process is at the top of prstat. The pstack output for lwp# 8 maps to thread# 76 from the pstack output (see Listing 2).

$ prstat -L -p 9499 1 1

PID USERNAME SIZE RSS STATE PRI NICE TIME CPU 

  PROCESS/LWPID

9499 spoz 153M 100M sleep 58 0 0:00.22 0.6% java/8

9499 spoz 153M 100M sleep 58 0 0:00.10 0.2% java/10

9499 spoz 153M 100M sleep 58 0 0:00.11 0.1% java/9

9499 spoz 153M 100M sleep 58 0 0:00.03 0.0% java/5

9499 spoz 153M 100M sleep 58 0 0:01.01 0.0% java/1

9499 spoz 153M 100M sleep 58 0 0:00.00 0.0% java/12

9499 spoz 153M 100M sleep 58 0 0:00.00 0.0% java/11

9499 spoz 153M 100M sleep 58 0 0:00.00 0.0% java/14

9499 spoz 153M 100M sleep 58 0 0:00.00 0.0% java/13

9499 spoz 153M 100M sleep 59 0 0:00.07 0.0% java/7

9499 spoz 153M 100M sleep 59 0 0:00.00 0.0% java/6

9499 spoz 153M 100M sleep 59 0 0:00.00 0.0% java/4

9499 spoz 153M 100M sleep 58 0 0:00.11 0.0% java/3

9499 spoz 153M 100M sleep 58 0 0:00.00 0.0% java/2

Thread Dump

Listing 2. Do a kill -3 <PID> on the Java process to get a thread dump of the java application. Since lwp# 8 is mapped to thread #76, we calculate the hex value of 76 to be 4c, which maps to nid=0x4c in the JVM thread dump.

----------------- lwp# 8 / thread# 76 -----------------

ff29d190 poll (e2e81548, 0, bb8)

ff24d154 select (0, 0, 0, e2e81548, ff2bf1b4, 

  e2e81548) + 348

ff36b134 select (0, bb8, 7fffffff, fe4c8000, 0, bb8) + 

  34

fe0f62e4 __1cCosFsleep6FpnGThread_xl_i_ (0, bb8, 

  fe4c8000, 1, 0, 1e2fd8) + 234

fe23f050 JVM_Sleep (2, 0, bb8, fe4de978, fe4c8000, 

  1e2fd8) + 22c 

0008f7ac ???????? (e2e818d4, bb8, 1e2fd8, 984a4, 0, 

  109a0)

0008c914 ???????? (e2e8194c, 1, fe4d6a80, 98564, 8, 

  e2e81868)

fe5324e8 __1cMStubRoutinesG_code1_ (e2e819d8, e2e81c10, 

  a, f6cb5000, 4, e2e818f0) + 3ec

fe0cbe94 __1cJJavaCallsLcall_helper6FpnJJavaValue

  _pnMmethodHandle_pnRJavaCallArguments_pnGThread__v

  _ (e2e81c08,fe4c8000, e2e81b54, 1e2fd8, 8e764, 

  e2e81c10) +308

fe1f6dbc __1cJJavaCallsMcall_virtual6FpnJJavaValue

  _nLKlassHandle_nMsymbolHandlee81c08, e2e81b54) + 

  150pnGThread__v_(f6cb64b8, e2e81b40, e2e81b44, 

  fe4c8000, e2d8) + 60e_5pnGThread__v_ (e2e81c08, 

  e2e81c04, e2e81c00,e2e81bf4, e2e81bec, 1e2f8000, 

  e2e81d10, 1e, e) + 120FpnKJavaThread_pnGThread__v_ 

  (f6817ff8, 1e2fd8, fe4c7fd70) + 

  3d8cKJavaThreadDrun6M_v_ (e2e02000, fe4d3e34, 

  fe4c8000, 7fd70, 1e2fd8,fe213ec8 _start (fe4c8000, 

  fe625d10, 0, 5, 1, fe401000) + 20

ff36b728 _thread_start (1e2fd8, 0, 0, 0, 0, 0) + 40

Take a Number

Listing 3. If you take the Thread id number 588 and convert it to hexadecimal (0x24), you can then look at the thread dumps you took while the problem was occurring and look for nid=0x24 to see that it corresponds to ExecuteThread 10 from the thread dumps.

"ExecuteThread: '10' for queue: 'default'" daemon prio=

  5 tid=0x20d75808 nid=0x24c runnable 

  [219ff000..219ffd90]

  at java.net.SocketOutputStream.socketWrite0(

    Native Method)

  at java.net.SocketOutputStream.socketWrite(

    SocketOutputStream.java:92)

  at java.net.SocketOutputStream.write(

    SocketOutputStream.java:136)

  at weblogic.servlet.internal.ChunkUtils.

    writeChunkTransfer(ChunkUtils.java:222)

  at weblogic.servlet.internal.ChunkUtils.writeChunks(

    ChunkUtils.java:198)

  at weblogic.servlet.internal.ChunkOutput.flush(

    ChunkOutput.java:284)

  at weblogic.servlet.internal.ChunkOutput.

    checkForFlush(ChunkOutput.java:344)

  at weblogic.servlet.internal.ChunkOutput.write(

    ChunkOutput.java:221)

  at weblogic.servlet.internal.ChunkOutput.write(

    ChunkOutput.java:236)

  at weblogic.servlet.internal.ChunkOutputWrapper.

    write(ChunkOutputWrapper.java:95)

  at weblogic.servlet.internal.ChunkWriter.write(

    ChunkWriter.java:37)

  at java.io.Writer.write(Writer.java:150) - locked

    <0x11d0d1c0> (a weblogic.servlet.internal.

    ChunkWriter)

  at java.io.PrintWriter.write(PrintWriter.java:230)

    - locked <0x11d0d1c0> (a weblogic.servlet.internal.

    ChunkWriter)

  at java.io.PrintWriter.write(PrintWriter.java:247)

  at java.io.PrintWriter.print(PrintWriter.java:378)

  at java.io.PrintWriter.println(PrintWriter.java:515)

    - locked <0x11d0d1c0> (a weblogic.servlet.internal.

    ChunkWriter)

  at examples.servlets.HelloWorld2.service(

    HelloWorld2.java:94)

  at javax.servlet.http.HttpServlet.service(

    HttpServlet.java:853)

  at weblogic.servlet.internal.

    ServletStubImpl$ServletInvocationAction.run(

    ServletStubImpl.java:1058)

  at weblogic.servlet.internal.ServletStubImpl.

    invokeServlet(ServletStubImpl.java:401)

  at weblogic.servlet.internal.ServletStubImpl.

    invokeServlet(ServletStubImpl.java:306)

  at weblogic.servlet.internal.

    WebAppServletContext$ServletInvocationAction.

    run(WebAppServletContext.java:5412)

  at weblogic.security.service.SecurityServiceManager.

    runAs(SecurityServiceManager.java:744)

  at weblogic.servlet.internal.WebAppServletContext.

    invokeServlet(WebAppServletContext.java:3086)

  at weblogic.servlet.internal.ServletRequestImpl.

    execute(ServletRequestImpl.java:2544)

  at weblogic.kernel.ExecuteThread.execute(

    ExecuteThread.java:153)

  at weblogic.kernel.ExecuteThread.run(

    ExecuteThread.java:134)

Last Modified on 2007-05-25 22:00
没有评论.
Since 2005.04.27  梦想 就像鸡蛋 要么孵化 要么臭掉RSS Feed (Entries) | Back To Home | @ZEAL | 沪ICP备05024379号