Perl Print Duplicate Line

# Find duplicate lines in a file; if any exist, print them out.
AcceptEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
AcceptEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
AcceptEnv LC_IDENTIFICATION LC_ALL

# Example of overriding settings on a per-user basis
#Match User anoncvs
# X11Forwarding no
# AllowTcpForwarding no
# ForceCommand cvs server
AcceptEnv LC_IDENTIFICATION LC_ALL

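#!/usr/bin/perl
use strict;
use warnings;

# Count how many times each line occurs in the sample file.
open(my $fh, '<', 'dupLine.sample') or die "Cannot open dupLine.sample: $!";
my %seen;
while (my $line = <$fh>) {
    $seen{$line}++;
}
close($fh);

# Report every line that occurred more than once (hash order is arbitrary).
while (my ($line, $count) = each %seen) {
    print "$count: $line" if $count > 1;
}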

Using the standard Perl shorthands:

my %seen;
while ( <> ) { 
    print if $seen{$_}++;
}

As a "one-liner":

perl -ne 'print if $seen{$_}++'

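More data? This prints <file name>:<line number>:<line> for each repeated line (the line number keeps counting across files, because $. is only reset when the ARGV handle is closed):

perl -ne 'print +( $ARGV eq "-" ? "" : "$ARGV:" ), "$.:$_" if $seen{$_}++'

The unary + matters: without it, print ( ... ) is parsed as a function call whose argument list ends at the closing parenthesis, so the "$.:$_" part is never printed.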
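If per-file line numbers are preferred, one possible approach is to close ARGV at the end of each file, which resets $. to zero. The following is only a sketch of that idea; the helper name duplicate_report is hypothetical and not part of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns "file:line:text" for every line that has
# already been seen earlier in any of the given files.
sub duplicate_report {
    my @files = @_;
    my (%seen, @report);
    local @ARGV = @files;          # <> reads the files named in @ARGV
    while (my $line = <>) {
        push @report, join(':', $ARGV, $., $line) if $seen{$line}++;
    } continue {
        close ARGV if eof;         # end of current file: resets $. for the next one
    }
    return @report;
}

# Only run when file names were given, so the script never blocks on STDIN.
print duplicate_report(@ARGV) if @ARGV;
```

Note that the %seen hash is shared across all files, so a line that first appears in one file and again in another is still reported as a duplicate.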

Explanation on %seen:

%seen declares a hash. For each unique line in the input, $seen{$_} is a scalar slot in the hash, keyed by the text of the line.
Using the postfix increment operator (x++), we take the current value for our expression and increment the variable afterwards. So, if we haven't "seen" the line, $seen{$_} is undefined, but when forced into a numeric "context" like this it is taken as 0, which is false.
Then it's incremented to 1.

So the first time we see a line, the expression yields 0 (the undefined slot treated as a number), which fails the if. The increment then sets the count in that slot to 1. For any later occurrence the expression yields 1 or more, which passes the if condition.
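To make the behaviour described above concrete, here is a small sketch that records what $seen{$line}++ evaluates to for each input line. The helper name postfix_values is hypothetical and exists only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the value that $seen{$line}++ yields
# for each line, in input order.
sub postfix_values {
    my @lines = @_;
    my (%seen, @values);
    for my $line (@lines) {
        my $old = $seen{$line}++;   # old value is returned, then the slot is incremented
        push @values, $old;
    }
    return @values;
}

print join(',', postfix_values(qw(a b a a))), "\n";   # prints 0,0,1,2
```

The first occurrence of each line yields 0 (false), and every later occurrence yields a positive count (true), which is exactly what the if in the one-liner tests.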

Now, as I said above, %seen declares a hash, but with strict turned off (as in the one-liners), any variable can be created on the spot. So the first time perl sees $seen{$_}, it knows that I'm looking for %seen; it doesn't have it, so it creates it.

An added neat thing about this is that at the end, if you care to use it, you have a count of how many times each line was repeated.
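As a rough sketch of using those counts, the following collects them and reports each repeated line once, in the order it first appeared. The helper name repeat_counts and the extra @order array are assumptions added for illustration, not part of the original code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count every line, then return "count: line"
# for each line that occurred more than once, in first-seen order.
sub repeat_counts {
    my @lines = @_;
    my (%seen, @order);
    for my $line (@lines) {
        push @order, $line unless $seen{$line}++;   # remember first appearance only
    }
    return map { "$seen{$_}: $_" } grep { $seen{$_} > 1 } @order;
}

print repeat_counts("x\n", "y\n", "x\n", "x\n", "z\n", "z\n");
# prints:
# 3: x
# 2: z
```

Keeping a separate @order array is one way to get stable output, since iterating over the hash itself returns keys in arbitrary order.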

-----------------------------------------------------
Silence, the way to avoid many problems;
Smile, the way to solve many problems;

posted on 2012-04-26 11:01 by Chan Chen, category: Linux
