这个是发生在上周周末的真实案例,因为cxf client 端线程安全导致的错误,总结出来希望其他使用cxf的兄弟注意。
首先描述一下背景,简单的说就是使用cxf作为web service的客户端,运行在weblogic上,连接外部的服务器。为了测试需要,开发了一个简单的模拟器模拟服务器端,准备在release之前跑稳定性测试。
结果出问题了,在排除掉一些干扰和诸如网络环境,设置等之后问题依旧,由于系统负责,包括ws的模拟器也是出了一个之前没有试过的方法,因此费了不少时间来查找问题。过程很枯燥,应该很多人经历过,在一个大的系统中找到一个小错误的出处,可以说是一门学问,技术耐心和运气都是需要的.....跳出这个过程,由于问题表现在web service的网络连接在这个异常上,在服务器端模拟器的日志中有大量的这种异常信息:
2009-07-24 19:23:22,898 DEBUG ( : ) (tomcat-exec-56) [Http11NioProcessor] - Error parsing HTTP request header
java.io.EOFException: Unexpected EOF read on the socket
at org.apache.coyote.http11.InternalNioInputBuffer.readSocket(InternalNioInputBuffer.java:589)
at org.apache.coyote.http11.InternalNioInputBuffer.parseRequestLine(InternalNioInputBuffer.java:425)
at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:825)
at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:719)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2080)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
at java.lang.Thread.run(Thread.java:619)
2009-07-24 19:23:22,898 DEBUG ( : ) (tomcat-exec-56) [Http11NioProcessor] - Error parsing HTTP request header
java.io.EOFException: Unexpected EOF read on the socket
at org.apache.coyote.http11.InternalNioInputBuffer.readSocket(InternalNioInputBuffer.java:589)
at org.apache.coyote.http11.InternalNioInputBuffer.parseRequestLine(InternalNioInputBuffer.java:425)
at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:825)
at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:719)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2080)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
at java.lang.Thread.run(Thread.java:619)
而服务器端模拟器这次是我们第一次使用tomcat和coyote,因此怀疑是tomcat的问题,在再三追查代码无果的情况下,决定换一个服务器端模拟器来确认问题所在:到底是cxf的客户端的问题,还是服务器端模拟器。一个简单的模拟器写出来了,一个跳过所有业务逻辑直接调用cxf客户端实现代码的测试小程序写出来了,测试之后发现,问题依旧。于是将目光集中到cxf的客户端上。
在测试中发现这样一个规律,在上述服务器端的异常发生前,在客户端中总是会有规律的出现下面这个异常:
Jul 24, 2009 10:36:18 PM org.apache.cxf.phase.PhaseInterceptorChain doIntercept
INFO: Interceptor has thrown exception, unwinding now null
tps = 25
Exception in thread "Thread-41" 2009-07-24 22:36:19,925 149585 [Thread-41] (********Impl.java:459) ERROR junit.framework.Test - Got an exception when invoking **** service:javax.xml.ws.WebServiceException: java.lang.NullPointerException
(这里的信息是和业务相关的,不方便打出,总之和我们讨论的问题无关)
at test.TestMci.execute(TestMci.java:84)
at test.TestMci.access$1(TestMci.java:81)
at test.TestMci$TestThread.run(TestMci.java:90)
at java.lang.Thread.run(Thread.java:595)
Caused by: javax.xml.ws.WebServiceException: java.lang.NullPointerException
at org.apache.cxf.jaxws.JaxWsClientProxy.invoke(JaxWsClientProxy.java:142)
at $Proxy40.authorizeAndPurchase(Unknown Source)
at *********************
6 more
Caused by: java.lang.NullPointerException
at org.apache.cxf.transport.http.HTTPConduit.prepare(HTTPConduit.java:483)
at org.apache.cxf.interceptor.MessageSenderInterceptor.handleMessage(MessageSenderInterceptor.java:46)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:226)
at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:469)
at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:299)
at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:251)
at org.apache.cxf.frontend.ClientProxy.invokeSync(ClientProxy.java:73)
at org.apache.cxf.jaxws.JaxWsClientProxy.invoke(JaxWsClientProxy.java:124)
8 more
看来问题是出现在这里了。
进一步的测试发现,低压力下比如2-3个工作线程,基本不会有任何问题,因此可以解释为什么功能测试时不出现问题。工作线程加到10个,基本上还算问题,只有极其偶然的会出现一次两次这个异常。进一步加大工作线程,由于受到测试机器的性能限制,10个工作线程和100个工作线程的tps基本相同,都大体在300TPS左右(客户端在笔记本上跑的,cpu已经90%+了)。测试的结果是上面的异常开始变的有规律,恩,非常搞笑的规律,大概每10000次左右的请求就发生一次上述异常,非常的稳定而执着的重现。服了,这么有规律而重现性极好的错误,还真是难得一见........
分析一下问题,在tps基本保持不变的情况下,客户端线程从10增加到100问题就变得明显。因此问题的焦点直指线程安全这个老大难问题,重新审视我们使用cxf的代码,发现有个地方
/**
* the server service
*/
MessagingChannelServiceImplService mcsis = null;
/**
* the sever service port
*/
MessagingChannelService mcs = null;
mcsis = new MessagingChannelServiceImplService(serviceWsdlUrl);
mcs = mcsis.getMessagingChannelServiceImplPort();
MessagingChannelService是cxf自动生成的,这个是@WebService,其他的业务代码都是调用它上面的业务方法来实现。由于serviceWsdlUrl不变,因此我们重用了一些东西,避免每次都初始化一次。看来问题出现在这里,试着将代码修改为threadlocal,让每个线程都初始化一次然后保存给自己使用。
修改后的客户端代码在之后的测试中,非常稳定,没有再出现上面的异常,问题算是解决了。
我对cxf不是很熟悉,找了一下也没有找到到底是那里造成的线程不安全,google了一下找到几个地方,但是似乎还不能完全说明问题。先列出来慢慢研究:
1) Are JAX-WS client proxies thread safe?
装载自这里:
http://cxf.apache.org/faq.html#FAQ-AreJAXWSclientproxiesthreadsafe%253F
Official JAX-WS answer: No. According to the JAX-WS spec, the client proxies are NOT thread safe. To write portable code, you should treat them as non-thread safe and synchronize access or use a pool of instances or similar.
CXF answer: CXF proxies are thread safe for MANY use cases. The exceptions are:
* Use of ((BindingProvider)proxy).getRequestContext() - per JAX-WS spec, the request context is PER INSTANCE. Thus, anything set there will affect requests on other threads. With CXF, you can do:
((BindingProvider)proxy).getRequestContext().put("thread.local.request.context", "true");
((BindingProvider)proxy).getRequestContext().put("thread.local.request.context", "true");
and future calls to getRequestContext() will use a thread local request context. That allows the request context to be threadsafe. (Note: the response context is always thread local in CXF)
* Settings on the conduit - if you use code or configuration to directly manipulate the conduit (like to set TLS settings or similar), those are not thread safe. The conduit is per-instance and thus those settings would be shared.
* Session support - if you turn on sessions support (see jaxws spec), the session cookie is stored in the conduit. Thus, it would fall into the above rules on conduit settings and thus be shared across threads.
For the conduit issues, you COULD install a new ConduitSelector that uses a thread local or similar. That's a bit complex though.
For most "simple" use cases, you can use CXF proxies on multiple threads. The above outlines the workarounds for the others.
2) cxf的wiki中谈到Client API中的Proxy-based API
wiki 地址:
http://cwiki.apache.org/CXF20DOC/jax-rs.html
Limitations
Proxy methods can not have @Context method parameters and subresource methods returning Objects can not be invoked - perhaps it is actually not too bad at all - please inject contexts as field or bean properties and have subresource methods returning typed classes : interfaces, abstract classes or concrete implementations.
Proxies are currently not thread-safe.
3) thread safe issue caused by XMLOutputFactoryImpl
找到的一个cxf的bug,
https://issues.apache.org/jira/browse/CXF-2229
Description
Currently CXF calls StaxUtils.getXMLOutputFactory() to get the cached instance of XMLOutputFactoryImpl. But XMLOutputFactoryImpl.createXMLStreamWriter is not thread-safe. See below.
javax.xml.stream.XMLStreamWriter createXMLStreamWriter(javax.xml.transform.stream.StreamResult sr, String encoding) throws javax.xml.stream.XMLStreamException {
try{
if(fReuseInstance && fStreamWriter != null && fStreamWriter.canReuse() && !fPropertyChanged){
fStreamWriter.reset();
fStreamWriter.setOutput(sr, encoding);
if(DEBUG)System.out.println("reusing instance, object id : " + fStreamWriter);
return fStreamWriter;
}
return fStreamWriter = new XMLStreamWriterImpl(sr, encoding, new PropertyManager(fPropertyManager)); -- this is not thread safe, since the new instance is assigned to the field fStreamWriter first, then it is possible that different threads get the same XMLStreamWriterImpl when they call this method at the same time.
}catch(java.io.IOException io){
throw new XMLStreamException(io);
}
}
The solution might be, StaxUtils.getXMLOutputFactory() method creates a new instance of XMLOutputFactory every time, don't cache it.