Sun's programmers are just programmers too! (continued)

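Having just finished mocking Sun, I went back to performance tuning and promptly hit another problem: during testing, some strange errors showed up occasionally. On inspection I found the following suspicious exception: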

[#|2010-05-05T14:27:37.295+0800|WARNING|sun-glassfish-comms-server2.0|com.sun.jbi.httpsoapbc.OutboundMessageProcessor|_ThreadID=53;_ThreadName=HTTPBC-OutboundReceiver-47;_RequestID=42321e60-6723-4831-a99a-b4dd1ac3e35f;|HTTPBC-E00759: An exception occured while processing a reply message. HTTP transport error: java.util.ConcurrentModificationException
com.sun.xml.ws.client.ClientTransportException: HTTP transport error: java.util.ConcurrentModificationException
        at com.sun.xml.ws.transport.http.client.HttpClientTransport.getOutput(HttpClientTransport.java:134)
        at com.sun.xml.ws.transport.http.client.HttpTransportPipe.process(HttpTransportPipe.java:143)
        at com.sun.xml.ws.transport.http.client.HttpTransportPipe.processRequest(HttpTransportPipe.java:89)
        at com.sun.xml.ws.transport.DeferredTransportPipe.processRequest(DeferredTransportPipe.java:91)
        at com.sun.xml.ws.api.pipe.Fiber.__doRun(Fiber.java:595)
        at com.sun.xml.ws.api.pipe.Fiber._doRun(Fiber.java:554)
        at com.sun.xml.ws.api.pipe.Fiber.doRun(Fiber.java:539)
        at com.sun.xml.ws.api.pipe.Fiber.runSync(Fiber.java:436)
        at com.sun.xml.ws.api.pipe.helper.AbstractTubeImpl.process(AbstractTubeImpl.java:106)
        at com.sun.xml.ws.tx.client.TxClientPipe.process(TxClientPipe.java:177)
        at com.sun.xml.ws.api.pipe.helper.PipeAdapter.processRequest(PipeAdapter.java:115)
        at com.sun.xml.ws.api.pipe.Fiber.__doRun(Fiber.java:595)
        at com.sun.xml.ws.api.pipe.Fiber._doRun(Fiber.java:554)
        at com.sun.xml.ws.api.pipe.Fiber.doRun(Fiber.java:539)
        at com.sun.xml.ws.api.pipe.Fiber.runSync(Fiber.java:436)
        at com.sun.xml.ws.client.Stub.process(Stub.java:248)
        at com.sun.xml.ws.client.dispatch.DispatchImpl.doInvoke(DispatchImpl.java:180)
        at com.sun.xml.ws.client.dispatch.DispatchImpl.invoke(DispatchImpl.java:206)
        at com.sun.jbi.httpsoapbc.OutboundMessageProcessor.outboundCall(OutboundMessageProcessor.java:1256)
        at com.sun.jbi.httpsoapbc.OutboundMessageProcessor.dispatch(OutboundMessageProcessor.java:1296)
        at com.sun.jbi.httpsoapbc.OutboundMessageProcessor.processRequestReplyOutbound(OutboundMessageProcessor.java:747)
        at com.sun.jbi.httpsoapbc.OutboundMessageProcessor.processMessage(OutboundMessageProcessor.java:257)
        at com.sun.jbi.httpsoapbc.OutboundAction.run(OutboundAction.java:63)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
        at java.util.HashMap$EntryIterator.next(HashMap.java:834)
        at java.util.HashMap$EntryIterator.next(HashMap.java:832)
        at com.sun.xml.ws.transport.http.client.HttpClientTransport.createHttpConnection(HttpClientTransport.java:364)
        at com.sun.xml.ws.transport.http.client.HttpClientTransport.getOutput(HttpClientTransport.java:118)
        ... 25 more

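Judging from the exception, while metro was making the web service call it iterated over a HashMap to read it, and a ConcurrentModificationException was thrown during that iteration. This feels like yet another bug; after all, every piece of code involved here is Sun's own: openESB, metro, and the JDK.

Here is the HashMap code that throws the exception: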

        final Entry<K,V> nextEntry() {
            if (modCount != expectedModCount)
                throw new ConcurrentModificationException();

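This ConcurrentModificationException is clearly thrown by HashMap itself on purpose; look at the condition if (modCount != expectedModCount).

Looking up expectedModCount: the HashIterator constructor initializes it to the HashMap instance's modCount at that moment and then leaves it alone (only the iterator's own remove() resyncs it). In effect it records the state of the HashMap when iteration began: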

        HashIterator() {
            expectedModCount = modCount;
            ...
        }

Now look at the code related to modCount:

    /**
     * The number of times this HashMap has been structurally modified
     * Structural modifications are those that change the number of mappings in
     * the HashMap or otherwise modify its internal structure (e.g.,
     * rehash).  This field is used to make iterators on Collection-views of
     * the HashMap fail-fast.  (See ConcurrentModificationException).
     */
    transient volatile int modCount;

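The Javadoc tells us that modCount records how many times the structure of the HashMap instance has been modified, and it states explicitly that this field is used to make iteration fail-fast.

The source of the problem is now clear: while one thread was iterating over the HashMap instance, another thread modified it, so modCount no longer matched expectedModCount and ConcurrentModificationException was thrown to fail fast.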
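The fail-fast check can be reproduced without any threads at all, which makes the mechanism easier to see. The following is a minimal sketch, not code from metro or openESB: the class name FailFastDemo, its method names, and the header names are all made up for illustration. It uses only java.util.HashMap and shows that adding a key during iteration trips the modCount check, while replacing the value of an existing key does not count as a structural modification.

```java
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; not part of metro, openESB, or the JDK.
final class FailFastDemo {

    // Builds a small header map similar in shape to reqHeaders.
    private static Map<String, List<String>> sampleHeaders() {
        Map<String, List<String>> headers = new HashMap<>();
        headers.put("Accept", Collections.singletonList("text/xml"));
        headers.put("SOAPAction", Collections.singletonList("urn:demo"));
        headers.put("User-Agent", Collections.singletonList("demo"));
        return headers;
    }

    // Returns true if adding a new key mid-iteration makes next() throw.
    static boolean structuralChangeDuringIterationFails() {
        Map<String, List<String>> headers = sampleHeaders();
        Iterator<Map.Entry<String, List<String>>> it = headers.entrySet().iterator();
        it.next(); // the iterator captured modCount when it was created
        headers.put("X-Added", Collections.singletonList("1")); // new key: modCount changes
        try {
            it.next(); // modCount != expectedModCount
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Returns true if overwriting an existing key's value mid-iteration is tolerated.
    static boolean valueUpdateDuringIterationSucceeds() {
        Map<String, List<String>> headers = sampleHeaders();
        Iterator<Map.Entry<String, List<String>>> it = headers.entrySet().iterator();
        it.next();
        headers.put("Accept", Collections.singletonList("*/*")); // existing key: not structural
        try {
            it.next();
            return true;
        } catch (ConcurrentModificationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(structuralChangeDuringIterationFails()); // prints "true"
        System.out.println(valueUpdateDuringIterationSucceeds());   // prints "true"
    }
}
```

In the multi-threaded case from the stack trace the same check fires, only the modification comes from another thread, so whether it fires on a given run depends on timing.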

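HashMap's behavior here is fine, so the problem lies with its user: why is the HashMap modified while it is being iterated? And there is clearly no synchronization, even though HashMap is explicitly documented as not thread-safe.

First, locate the code that uses this HashMap, HttpTransportPipe in metro. For convenience, open the following URL to view the code in FishEye: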

http://fisheye5.cenqua.com/browse/jax-ws-sources/jaxws-ri/rt/src/com/sun/xml/ws/transport/http/client/HttpTransportPipe.java?r=1.14

    public Packet process(Packet request) {
        HttpClientTransport con;
        try {
            // get transport headers from message
            Map<String, List<String>> reqHeaders = (Map<String, List<String>>) request.invocationProperties.get(MessageContext.HTTP_REQUEST_HEADERS);
            //assign empty map if its null
            if(reqHeaders == null){
                reqHeaders = new HashMap<String, List<String>>();
            }

        ......

        for (Map.Entry<String, List<String>> entry : reqHeaders.entrySet()) {
            httpConnection.addRequestProperty(entry.getKey(), entry.getValue().get(0));
        }

The problematic reqHeaders is obtained from the request, so evidently some other thread outside this method is modifying this HashMap.

A simple change to the code:

    public Packet process(Packet request) {
        HttpClientTransport con;
        try {
            // get transport headers from message
            Map<String, List<String>> reqHeaders = new HashMap<String, List<String>>();
            Map<String, List<String>> reqHeadersInRequest = (Map<String, List<String>>) request.invocationProperties.get(MessageContext.HTTP_REQUEST_HEADERS);
            // copy the request's headers, if present, into the private map
            if(reqHeadersInRequest != null){
                reqHeaders.putAll(reqHeadersInRequest);
            }

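We no longer use the original HashMap instance directly: since other threads may modify it at the same time, that instance is unsafe. Instead we create a new HashMap and copy the original's data into it with putAll. I overwrote the class of the same name in glassfish/lib/webservice-rt.jar with the recompiled class file and reran the test for 20 minutes; the ConcurrentModificationException above did not appear again.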
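The copy-before-iterate idea can be isolated into a small helper. This is only a sketch under assumptions: HeaderSnapshot and copyOf are hypothetical names, not anything in metro. One caveat worth stating plainly: putAll itself has to read the shared map, so if another thread is modifying it at exactly that moment the copy can still fail. The copy shortens the window considerably but does not replace proper coordination with the writer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the names are illustrative and not taken from metro.
final class HeaderSnapshot {

    // Returns a private copy of the shared header map, or an empty map if it is null.
    // Assumption: nobody mutates `shared` during the (short) putAll call.
    static Map<String, List<String>> copyOf(Map<String, List<String>> shared) {
        Map<String, List<String>> copy = new HashMap<>();
        if (shared != null) {
            copy.putAll(shared);
        }
        return copy;
    }

    public static void main(String[] args) {
        Map<String, List<String>> shared = new HashMap<>();
        shared.put("SOAPAction", Collections.singletonList("urn:demo"));
        shared.put("Accept", Collections.singletonList("text/xml"));

        Map<String, List<String>> snapshot = copyOf(shared);
        // Iterating the snapshot while the shared map changes is safe,
        // because the two maps no longer share any internal structure.
        for (Map.Entry<String, List<String>> entry : snapshot.entrySet()) {
            shared.put("X-Late-" + entry.getKey(), entry.getValue());
        }
        System.out.println(snapshot.size() + " " + shared.size()); // prints "2 4"
    }
}
```

Note that this is a shallow copy: the List<String> values are still shared, which is fine as long as they are only read.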

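To sum up what this bug reveals, this is how Sun's developers use HashMap in metro:
1. The HashMap is passed by reference between instances
2. Different threads read and write it at the same time in different places
3. Reads and writes take no lock and have no synchronization

I am speechless. Isn't this asking for trouble? HashMap is not thread-safe. When you use a HashMap and pass it around by reference, either guarantee that it is read-only, or guarantee that only one thread reads or writes it at a time; if you can guarantee neither, you must add locking yourself.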
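The options above map onto standard java.util facilities. The sketch below is illustrative only (SharedMapOptions and its method names are hypothetical): a read-only view built with Collections.unmodifiableMap, explicit locking around iteration of a Collections.synchronizedMap, and, as a further alternative not mentioned above, a ConcurrentHashMap, whose iterators are weakly consistent and never throw ConcurrentModificationException.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo class; not part of metro or the JDK.
final class SharedMapOptions {

    // Read-only: hand out an unmodifiable view of a private copy.
    static Map<String, String> readOnly(Map<String, String> source) {
        return Collections.unmodifiableMap(new HashMap<>(source));
    }

    // Locking: a synchronizedMap guards single calls, but iteration
    // must still be wrapped in a synchronized block on the map itself.
    static int countUnderLock(Map<String, String> synchronizedMap) {
        synchronized (synchronizedMap) {
            int count = 0;
            for (String ignored : synchronizedMap.keySet()) {
                count++;
            }
            return count;
        }
    }

    // Concurrent map: iterate while another thread keeps inserting.
    static int iterateWhileWriting() {
        final Map<String, String> map = new ConcurrentHashMap<>();
        for (int i = 0; i < 100; i++) {
            map.put("k" + i, "v");
        }
        Thread writer = new Thread(() -> {
            for (int i = 100; i < 10000; i++) {
                map.put("k" + i, "v");
            }
        });
        writer.start();
        int seen = 0;
        for (String ignored : map.keySet()) {
            seen++; // may or may not include keys added by the writer
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, String> locked = Collections.synchronizedMap(new HashMap<String, String>());
        locked.put("a", "1");
        System.out.println(countUnderLock(locked));      // prints "1"
        System.out.println(iterateWhileWriting() >= 100); // prints "true"
    }
}
```

Which option fits depends on the access pattern: the read-only copy is cheapest when writers are rare, while a concurrent map avoids having every caller remember to take the lock.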

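Later I found the latest version of this class and saw that the problem had already been fixed in a later revision. If you are interested, open the URL below to compare the revisions; Sun's official fix is exactly the same as mine above, heh.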

http://fisheye5.cenqua.com/browse/jax-ws-sources/jaxws-ri/rt/src/com/sun/xml/ws/transport/http/client/HttpTransportPipe.java?r1=1.14&r2=1.15&u=3&k=kv

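There is also a linked issue, https://jax-ws.dev.java.net/issues/show_bug.cgi?id=467 . Having read it, the scenario is completely different from ours, so it seems this spot was fixed purely by coincidence.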

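A bit off topic now, call it grumbling:

I somewhat suspect that metro was never performance-tested at all. In our test scenario, openESB invokes, through BPEL, four web services that we call common services. We currently reach about 1200 tps, which puts the common-service web services at about 1200 × 4 = 4800, close to 5K tps. At that load the problem above is very obvious; before the tps went up it was not this severe.

See one of my earlier blog posts, http://www.blogjava.net/aoxj/archive/2010/04/29/319706.html . Before we solved the HTTP long connection and TIME_WAIT problems described there, our tps was fairly low and we could not load up the CPU, and this problem did not seem noticeable then. It only surfaced once those were fixed and the tps went up.

Considering the bug in the previous post, where an ineffective == comparison defeated the cache, I really have little confidence in metro's code quality. A large project like this ought to run stress tests and stability tests before release, and those should easily catch problems like this one. I doubt every metro deployment only needs to run at a few dozen tps. On my ordinary development machine I could only reach about 100 tps and saw no errors. After moving to a more powerful machine, the errors above showed up as soon as tps reached 1000.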

posted on 2010-05-05 21:18 by sky ao. Views: 2899. Comments: 3. Category: Miscellaneous

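# re: Sun's programmers are just programmers too! (continued) 2010-05-06 12:07 乐蜂网精油

Sharing this!

# re: Sun's programmers are just programmers too! (continued) 2010-05-06 17:36 BeanSoft

... I strongly recommend using the commercial version of the ESB... In Oracle's strategy, every open-source version of its software is offered as a Lite edition.

# re: Sun's programmers are just programmers too! (continued) 2010-05-07 14:19 俏物悄语

Learning from this!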
