paulwong


Computing per-group percentages in Pig

http://stackoverflow.com/questions/15318785/pig-calculating-percentage-of-total-for-a-field

http://stackoverflow.com/questions/13476642/calculating-percentage-in-a-pig-query

posted @ 2013-04-10 14:13 paulwong

CombinedLogLoader

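A Pig load function can filter with a regular expression while the data is being loaded: each line is matched against the pattern and the captured groups become the fields of the record.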

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the
 * NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF
 * licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file
 * except in compliance with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software distributed under the License is
 * distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and limitations under the License.
 */

package org.apache.pig.piggybank.storage.apachelog;

import java.util.regex.Pattern;

import org.apache.pig.piggybank.storage.RegExLoader;

/**
 * CombinedLogLoader is used to load logs based on Apache's combined log format, based on a format like
 * 
 * LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
 * 
 * The log filename ends up being combined_log from a line like
 * 
 * CustomLog logs/combined_log combined
 * 
 * Example:
 * 
 * raw = LOAD 'combined_log' USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader AS
 * (remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes, referer, userAgent);
 */

public class CombinedLogLoader extends RegExLoader {
    // 1.2.3.4 - - [30/Sep/2008:15:07:53 -0400] "GET / HTTP/1.1" 200 3190 "-"
    // "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_4; en-us) AppleWebKit/525.18 (KHTML, like Gecko) Version/3.1.2 Safari/525.20.1"
    private final static Pattern combinedLogPattern = Pattern
        .compile("^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+\"(\\S+)\\s+(.+?)\\s+(HTTP[^\"]+)\"\\s+(\\S+)\\s+(\\S+)\\s+\"([^\"]*)\"\\s+\"(.*)\"$");

    public Pattern getPattern() {
        return combinedLogPattern;
    }
}
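
To see what the pattern captures without starting Pig, the same regular expression can be run in plain Java. This is a minimal sketch: the class name CombinedLogRegexDemo and the parse helper are made up for illustration, the user agent in the sample line is shortened, and nothing here goes through RegExLoader itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CombinedLogRegexDemo {
    // Same expression as combinedLogPattern in the listing above.
    static final Pattern COMBINED = Pattern.compile(
        "^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+\"(\\S+)\\s+(.+?)\\s+(HTTP[^\"]+)\"\\s+(\\S+)\\s+(\\S+)\\s+\"([^\"]*)\"\\s+\"(.*)\"$");

    /** Returns the captured fields in schema order, or null when the line does not match. */
    static String[] parse(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.find()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // regex groups are 1-based
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "1.2.3.4 - - [30/Sep/2008:15:07:53 -0400] "
            + "\"GET / HTTP/1.1\" 200 3190 \"-\" \"Mozilla/5.0\"";
        String[] f = parse(line);
        // Index order: remoteAddr, remoteLogname, user, time, method, uri,
        // proto, status, bytes, referer, userAgent
        System.out.println(f.length + " fields, method=" + f[4] + ", status=" + f[7]);
    }
}
```

The two unescaped dots around the time group stand in for the square brackets, which is why the captured time comes back without them.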

posted @ 2013-04-08 11:28 paulwong

Analyzing Apache logs with Pig



Analyzing log files and extracting meaningful information from them is a common use case for Hadoop. We don’t have to write MapReduce programs for this kind of analysis; tools like Pig and Hive can do the job instead. I’ll just give you a starting point here, using Pig for Apache log analysis. Pig ships libraries that load Apache log files and clean up string values taken from raw logs. These functions live in piggybank.jar, usually found under the pig/contrib/piggybank/java/ directory. As the first step we need to register this jar with our Pig session; only then can we use its functions in our Pig Latin
1.       Register PiggyBank jar
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
Once the jar is registered we need to define the functions we will use in our Pig Latin. Any basic Apache log analysis needs a loader that splits each log line into columns; we can define an Apache log loader as
2.       Define a log loader
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
(Piggy Bank has other log loaders as well)
In Apache log files the default date format is ‘dd/MMM/yyyy:HH:mm:ss Z’. A full timestamp like that is of little use when we analyze logs by day, so we need to extract the date without the time. For that we use DateExtractor()
3.       Define Date Extractor
DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
Once the required functions are defined, we first load the log file into Pig
4.       Load apachelog file into pig
--load the log files from hdfs into pig using CommonLogLoader
--CommonLogLoader yields nine fields (host, logname, user, time, method, uri, protocol, status, bytes); the common log format carries no referer or user agent
logs = LOAD '/userdata/bejoys/pig/p01/access.log.2011-01-01' USING ApacheCommonLogLoader AS (ip_address, rfc, userId, dt, method, uri, proto, serverstatus, returnobject);
Now we are ready to dive into the actual log analysis. There are many kinds of information you may want to extract from a log; we’ll look at a few common requirements here
Note: you need to first register the jar, define the classes to be used and load the log files into Pig before trying out any of the Pig Latin below
Requirement 1: Find unique hits per day
PIG Latin
--Extracting the day alone and grouping records based on days
--GROUP BY takes no AS clause; the group key is named in the GENERATE below
grpd = GROUP logs BY DayExtractor(dt);
--looping through each group to get the unique no of userIds
cntd = FOREACH grpd
{
                tempId =  logs.userId;
                uniqueUserId = DISTINCT tempId;
                GENERATE group AS day,COUNT(uniqueUserId) AS cnt;
}
--sorting the processed records based on no of unique user ids in descending order
srtd = ORDER cntd BY cnt desc;
--storing the final result into a hdfs directory
STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult1';
Requirement 2: Find unique hits to websites (IPs) per day
PIG Latin
--Extracting the day alone and grouping records based on days and ip address
grpd = GROUP logs BY (DayExtractor(dt), ip_address);
--looping through each group to get the unique no of userIds
cntd = FOREACH grpd
{
                tempId =  logs.userId;
                uniqueUserId = DISTINCT tempId;
                --the group key is a (day, ip_address) tuple here, so flatten it into two columns
                GENERATE FLATTEN(group) AS (day, ip_address),COUNT(uniqueUserId) AS cnt;
}
--sorting the processed records based on no of unique user ids in descending order
srtd = ORDER cntd BY cnt desc;
--storing the final result into a hdfs directory
STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult2';
Note: When you use Pig Latin in the grunt shell there are a few things to know
1.       When we enter a Pig statement in grunt and press enter, only the syntax and semantic checks are done; no execution is triggered.
2.       The Pig statements are executed only once a STORE (or DUMP) command is submitted, i.e. the MapReduce jobs are triggered only at that point
3.       Within a session you don’t have to load the log files again and again; once a relation is defined it can be reused for all related operations in that session. Once you leave the grunt shell the definitions are lost, and you’d have to perform the register and load steps all over again.
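
For readers who want to check the logic of Requirement 1 on a handful of rows before running a job, the same group-by-day and count-distinct steps can be sketched in plain Java. Everything named here (UniqueUsersPerDay, day, countUniqueUsers, the sample users) is hypothetical, and the day is taken in the timestamp's own UTC offset, which is an assumption and may differ from what DateExtractor produces.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class UniqueUsersPerDay {
    // Apache's default timestamp layout, as quoted in the post.
    static final DateTimeFormatter APACHE =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    /** Keeps only the yyyy-MM-dd part of an Apache timestamp. */
    static String day(String apacheTimestamp) {
        return OffsetDateTime.parse(apacheTimestamp, APACHE).toLocalDate().toString();
    }

    /** Each row holds {userId, dt}; the result maps day to its number of distinct user ids. */
    static Map<String, Integer> countUniqueUsers(List<String[]> rows) {
        Map<String, Set<String>> usersByDay = new TreeMap<>();
        for (String[] row : rows) {
            // GROUP ... BY day, then DISTINCT over the user ids inside each group
            usersByDay.computeIfAbsent(day(row[1]), d -> new TreeSet<>()).add(row[0]);
        }
        Map<String, Integer> counts = new TreeMap<>();
        usersByDay.forEach((d, users) -> counts.put(d, users.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"alice", "01/Jan/2011:10:15:00 -0500"},
            new String[] {"bob", "01/Jan/2011:11:00:00 -0500"},
            new String[] {"alice", "01/Jan/2011:12:30:00 -0500"},
            new String[] {"alice", "02/Jan/2011:09:00:00 -0500"});
        System.out.println(countUniqueUsers(rows));
    }
}
```

The sketch holds every user id in memory, which is fine for a sanity check but is exactly the part Pig distributes across the cluster.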

posted @ 2013-04-08 02:06 paulwong

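A short note on Pig

What is Pig
Pig is a data-flow language: you describe how the data should flow, and the engine turns that description into MapReduce jobs that run on Hadoop.
Pig vs. SQL
The two are alike in that you execute one or more statements and get results back.
The difference is that with SQL the data must first be loaded into tables, and SQL does not care how the intermediate work is done: you send one SQL statement and a result comes out.
With Pig there is no need to load the data into tables, but you have to design the intermediate steps that lead to the result yourself.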

posted @ 2013-04-05 21:33 paulwong

Pig resources

Hadoop Pig study notes (1): implementing common SQL statements in Pig
http://guoyunsky.iteye.com/blog/1317084

http://guoyunsky.iteye.com/category/196632

Hadoop study notes (9): introduction to Pig
http://www.distream.org/?p=385

[Hadoop series] Installing Pig and a simple example
http://blog.csdn.net/inkfish/article/details/5205999

Hadoop and Pig for Large-Scale Web Log Analysis
http://www.devx.com/Java/Article/48063

Pig in practice
http://www.cnblogs.com/xuqiang/archive/2011/06/06/2073601.html

[Original] Apache Pig tutorial in Chinese (advanced)
http://www.codelast.com/?p=4249

Analyzing Apache logs with the Pig language on the Hadoop platform
http://goodluck-wgw.iteye.com/blog/1107503

!!The Pig language
http://hi.baidu.com/cpuramdisk/item/a2980b78caacfa3d71442318

Embedding Pig In Java Programs
http://wiki.apache.org/pig/EmbeddedPig

A Pig example (REGEX_EXTRACT_ALL, DBStorage, storing the results in a database)
http://www.myexception.cn/database/1256233.html

Programming Pig
http://ofps.oreilly.com/titles/9781449302641/index.html

[Original] Summary of basic Apache Pig concepts and usage (1)
http://www.codelast.com/?p=3621

!Pig manual
http://pig.apache.org/docs/r0.11.1/func.html#built-in-functions

posted @ 2013-04-05 18:19 paulwong

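NIO sockets in non-blocking mode

With classic server-socket programming, serving a connection blocks the thread, so a server can handle only one connection at a time unless it gives each connection its own thread.

NIO can handle many connections with a single thread. It is event-driven: when an event occurs (a connection can be accepted, read from, or written to), the selector reports it and the calling code handles that event.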

1) Server code
    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CharsetEncoder;
    import java.util.Iterator;

    /**
     * @author Jeff
     */
    public class HelloWorldServer {

        static int BLOCK = 1024;
        static String name = "";
        protected Selector selector;
        protected ByteBuffer clientBuffer = ByteBuffer.allocate(BLOCK);
        protected CharsetDecoder decoder;
        static CharsetEncoder encoder = Charset.forName("GB2312").newEncoder();

        public HelloWorldServer(int port) throws IOException {
            selector = this.getSelector(port);
            Charset charset = Charset.forName("GB2312");
            decoder = charset.newDecoder();
        }

        // Open a selector and register a non-blocking server channel for accept events
        protected Selector getSelector(int port) throws IOException {
            ServerSocketChannel server = ServerSocketChannel.open();
            Selector sel = Selector.open();
            server.socket().bind(new InetSocketAddress(port));
            server.configureBlocking(false);
            server.register(sel, SelectionKey.OP_ACCEPT);
            return sel;
        }

        // Listen on the port: wait for ready keys and dispatch each one
        public void listen() {
            try {
                for (;;) {
                    selector.select();
                    Iterator<SelectionKey> iter = selector.selectedKeys().iterator();
                    while (iter.hasNext()) {
                        SelectionKey key = iter.next();
                        iter.remove();
                        process(key);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        // Handle one event
        protected void process(SelectionKey key) throws IOException {
            if (key.isAcceptable()) { // accept the connection request
                ServerSocketChannel server = (ServerSocketChannel) key.channel();
                SocketChannel channel = server.accept();
                // switch the new connection to non-blocking mode
                channel.configureBlocking(false);
                channel.register(selector, SelectionKey.OP_READ);
            } else if (key.isReadable()) { // read the client's message
                SocketChannel channel = (SocketChannel) key.channel();
                int count = channel.read(clientBuffer);
                if (count > 0) {
                    clientBuffer.flip();
                    CharBuffer charBuffer = decoder.decode(clientBuffer);
                    name = charBuffer.toString();
                    // System.out.println(name);
                    SelectionKey sKey = channel.register(selector, SelectionKey.OP_WRITE);
                    sKey.attach(name);
                } else {
                    channel.close();
                }
                clientBuffer.clear();
            } else if (key.isWritable()) { // write the reply
                SocketChannel channel = (SocketChannel) key.channel();
                String name = (String) key.attachment();
                ByteBuffer block = encoder.encode(CharBuffer.wrap("Hello !" + name));
                channel.write(block);
                // Close after replying (this call was commented out in the original).
                // Left open, the key stays write-ready and the reply is sent again on
                // every pass; closing also gives the client the end-of-stream it waits for.
                channel.close();
            }
        }

        public static void main(String[] args) {
            int port = 8888;
            try {
                HelloWorldServer server = new HelloWorldServer(port);
                System.out.println("listening on " + port);
                server.listen();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }


The server reads the message sent by the client and returns a reply.
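
The server's read branch depends on the flip()/clear() cycle of ByteBuffer: the channel fills the buffer, flip() switches it to reading, and clear() makes it ready for the next fill. A minimal sketch of that cycle with no sockets involved follows; the class BufferFlipDemo and its two helpers are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class BufferFlipDemo {
    /** Writes text into a buffer, flips it, and reads the same bytes back. */
    static String roundTrip(String text) {
        ByteBuffer buffer = ByteBuffer.allocate(1024);
        buffer.put(text.getBytes(StandardCharsets.US_ASCII)); // filling: position advances
        buffer.flip();                                        // limit = old position, position = 0
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);                                      // draining: reads up to the limit
        buffer.clear();                                       // position = 0, limit = capacity
        return new String(out, StandardCharsets.US_ASCII);
    }

    /** Returns {position, limit} right after n bytes were written and the buffer flipped. */
    static int[] stateAfterFlip(int n) {
        ByteBuffer buffer = ByteBuffer.allocate(1024);
        buffer.put(new byte[n]);
        buffer.flip();
        return new int[] {buffer.position(), buffer.limit()};
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("Hello !jeff[0]"));
        int[] state = stateAfterFlip(5);
        System.out.println("position=" + state[0] + " limit=" + state[1]);
    }
}
```

Forgetting flip() before decoding, or clear() after it, is the usual way this kind of server ends up reading empty or stale data.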

2) Client code
    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import java.util.Iterator;

    /**
     * @author Jeff
     */
    public class HelloWorldClient {

        static int SIZE = 10;
        static InetSocketAddress ip = new InetSocketAddress("localhost", 8888);

        static class Message implements Runnable {
            protected String name;
            String msg = "";
            // One encoder per task: a CharsetEncoder is not safe to share across threads
            private final CharsetEncoder encoder = Charset.forName("GB2312").newEncoder();

            public Message(String index) {
                this.name = index;
            }

            public void run() {
                try {
                    long start = System.currentTimeMillis();
                    // open the socket channel
                    SocketChannel client = SocketChannel.open();
                    // switch to non-blocking mode
                    client.configureBlocking(false);
                    // open a selector
                    Selector selector = Selector.open();
                    // register interest in the connect event
                    client.register(selector, SelectionKey.OP_CONNECT);
                    // start connecting
                    client.connect(ip);
                    // allocate the read buffer
                    ByteBuffer buffer = ByteBuffer.allocate(8 * 1024);
                    int total = 0;

                    _FOR: for (;;) {
                        selector.select();
                        Iterator<SelectionKey> iter = selector.selectedKeys().iterator();
                        while (iter.hasNext()) {
                            SelectionKey key = iter.next();
                            iter.remove();
                            if (key.isConnectable()) {
                                SocketChannel channel = (SocketChannel) key.channel();
                                if (channel.isConnectionPending())
                                    channel.finishConnect();
                                channel.write(encoder.encode(CharBuffer.wrap(name)));
                                channel.register(selector, SelectionKey.OP_READ);
                            } else if (key.isReadable()) {
                                SocketChannel channel = (SocketChannel) key.channel();
                                int count = channel.read(buffer);
                                if (count > 0) {
                                    total += count;
                                    buffer.flip();
                                    // byte-to-char cast is only correct for single-byte (ASCII) replies
                                    while (buffer.remaining() > 0) {
                                        byte b = buffer.get();
                                        msg += (char) b;
                                    }
                                    buffer.clear();
                                } else {
                                    // end of stream: the server closed the connection
                                    client.close();
                                    break _FOR;
                                }
                            }
                        }
                    }
                    double last = (System.currentTimeMillis() - start) * 1.0 / 1000;
                    System.out.println(msg + "used time :" + last + "s.");
                    msg = "";
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        public static void main(String[] args) throws IOException {
            String names[] = new String[SIZE];
            for (int index = 0; index < SIZE; index++) {
                names[index] = "jeff[" + index + "]";
                new Thread(new Message(names[index])).start();
            }
        }
    }




posted @ 2013-03-31 13:38 paulwong

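CSS selectors

A complete tag is called an element; an element carries attribute names and attribute values.

A selector works like a WHERE clause: it returns the elements that satisfy the condition, and there may be more than one.

.class
class value = class: elements that have a class attribute whose value includes the class name "class".

a
tag name = a: elements whose tag name is a.

#id
id value = id: elements that have an id attribute whose value is "id".

el.class
tag name = el AND class value = class: elements whose tag name is el and whose class attribute includes the class name "class".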
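
The WHERE-clause analogy can be made concrete with a tiny filter over in-memory elements. The sketch below is not a CSS engine: the SelectorDemo class, its Element type, and the matches/select helpers are invented for illustration and handle only the four selector forms listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SelectorDemo {
    /** A minimal element: tag name, id, and a space-separated class list. */
    static final class Element {
        final String tag;
        final String id;
        final String classes;

        Element(String tag, String id, String classes) {
            this.tag = tag;
            this.id = id;
            this.classes = classes;
        }

        boolean hasClass(String name) {
            return Arrays.asList(classes.split("\\s+")).contains(name);
        }
    }

    /** Supports only: tag, .class, #id, tag.class */
    static boolean matches(Element e, String selector) {
        if (selector.startsWith("#")) {
            return e.id.equals(selector.substring(1));
        }
        int dot = selector.indexOf('.');
        if (dot < 0) {
            return e.tag.equals(selector);
        }
        String tag = selector.substring(0, dot);  // empty for ".class"
        String cls = selector.substring(dot + 1);
        return (tag.isEmpty() || e.tag.equals(tag)) && e.hasClass(cls);
    }

    /** The "WHERE clause": keep every element the selector matches. */
    static List<Element> select(List<Element> all, String selector) {
        List<Element> result = new ArrayList<>();
        for (Element e : all) {
            if (matches(e, selector)) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Element> page = Arrays.asList(
            new Element("a", "home", "nav active"),
            new Element("a", "", "plain"),
            new Element("div", "main", "nav"));
        for (String s : new String[] {".nav", "a", "#main", "a.nav"}) {
            System.out.println(s + " matches " + select(page, s).size() + " element(s)");
        }
    }
}
```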

posted @ 2013-03-31 10:26 paulwong

HttpClient cookie resources

Get Cookie value and set cookie value
http://www.java2s.com/Code/Java/Apache-Common/GetCookievalueandsetcookievalue.hm

How can I get the cookies from HttpClient?
http://stackoverflow.com/questions/8733758/how-can-i-get-the-cookies-from-httpclient

HttpClient 4.x how to use cookies?
http://stackoverflow.com/questions/8795911/httpclient-4-x-how-to-use-cookies

Apache HttpClient 4.0.3 - how do I set cookie with sessionID for POST request
http://stackoverflow.com/questions/4166129/apache-httpclient-4-0-3-how-do-i-set-cookie-with-sessionid-for-post-request

!!HttpClient Cookies
http://blog.csdn.net/mgoann/article/details/4057064

Chapter 3. HTTP state management
http://hc.apache.org/httpcomponents-client-ga/tutorial/html/statemgmt.html

!!!Dependencies of the contact-list library: commons-httpclient
http://flyerhzm.github.com/2009/08/23/contact-list-library-dependencies-of-commons-httpclient/

posted @ 2013-03-31 09:18 paulwong

A good tutorial for learning Java

http://tutorials.jenkov.com/java-concurrency/index.html

posted @ 2013-03-29 13:47 paulwong

On forging IP addresses and cookies

http://www.udpwork.com/item/8135.html

http://wangjinyang.blog.sohu.com/101351399.html






posted @ 2013-03-28 11:17 paulwong
