I have a list of URLs in a file named urls.list:
https://target.com/?first=one
https://target.com/something/?first=one
http://target.com/dir/?first=summer
https://fake.com/?first=spring
https://example.com/about/?third=three
https://example.com/?third=three

I want to make them unique based on their domain (like https://target.com), meaning each domain together with its protocol is printed once and any later URL with the same scheme+domain is skipped. So the result would be:
https://target.com/?first=one
http://target.com/dir/?first=summer
https://fake.com/?first=spring
https://example.com/about/?third=three

This is what I tried:
cat urls.list | cut -d"/" -f1-3 | awk '!a[$0]++' >> host_unique.del
for urls in $(cat urls.list); do
    for hosts in $(cat host_unique.del); do
        if [[ $hosts == *"$urls"* ]]; then
            echo "$hosts"
        fi
    done
done

Posted on 2021-05-24 05:30:02
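For what it's worth, the loop idea in the question can be made to work by keeping the first full URL for each unique scheme://host, rather than echoing the hosts themselves. A hedged sketch reusing the question's filenames (urls.list, host_unique.del):

```shell
#!/usr/bin/env bash
# Build the unique scheme://host list, then keep the first full URL per host.
cut -d'/' -f1-3 urls.list | awk '!a[$0]++' > host_unique.del
while IFS= read -r host; do
    # -F treats the host as a fixed string; -m1 stops at the first matching URL
    grep -m1 -F "$host/" urls.list
done < host_unique.del
```

Appending "/" to the host in the grep pattern keeps http://target.com from matching inside https://target.com URLs.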
This awk might do what you want:
awk -F'/' '!seen[$1,$3]++' urls.list

A bash alternative is very slow on large data sets/files, but here it is anyway.
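To see why the awk -F'/' one-liner works: with / as the field separator, $1 is the scheme ("https:") and $3 is the host, so the seen[$1,$3] key is exactly protocol+domain. Checked against the sample list, inlined here as a here-doc for a self-contained run:

```shell
# Keep only the first URL per scheme+host: $1 is "https:", $3 is the host.
awk -F'/' '!seen[$1,$3]++' <<'EOF'
https://target.com/?first=one
https://target.com/something/?first=one
http://target.com/dir/?first=summer
https://fake.com/?first=spring
https://example.com/about/?third=three
https://example.com/?third=three
EOF
# prints the four expected first-seen URLs
```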
It uses mapfile (aka readarray, a bash 4+ feature) and an associative array, plus a few other bash features.
#!/usr/bin/env bash
declare -A uniq
mapfile -t urls < urls.list
for uniq_url in "${urls[@]}"; do
    IFS='/' read -ra url <<< "$uniq_url"
    if ((!uniq["${url[0]}","${url[2]}"]++)); then
        printf '%s\n' "$uniq_url"
    fi
done

Posted on 2021-05-24 05:29:18
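The ((!uniq[key]++)) test in that script is the arithmetic "first seen" idiom: an unset associative-array entry evaluates to 0, so the negation is true only on a key's first occurrence, and the post-increment then marks it as seen. A minimal sketch of the idiom on its own (with hypothetical keys):

```shell
#!/usr/bin/env bash
# First-seen filter via an associative array counter (bash 4+).
declare -A seen
for key in a b a c b; do
    if ((!seen[$key]++)); then    # true only the first time a key appears
        printf '%s\n' "$key"
    fi
done
# prints a, b, c -- each key once, in first-seen order
```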
With the samples you have shown, please try the following.
awk 'match($0,/https?:\/\/[^/]*/){val=substr($0,RSTART,RLENGTH)} !arr[val]++' Input_file

Explanation: adding a detailed explanation for the above.
awk '                               ##Starting awk program from here.
match($0,/https?:\/\/[^/]*/){       ##Using match to match http or https followed by ://
  val=substr($0,RSTART,RLENGTH)     ##Creating val which has the matched string value here.
}
!arr[val]++                         ##Checking condition: if val is not present in arr then print the current line.
' Input_file                        ##Mentioning Input_file name here.

https://stackoverflow.com/questions/67667026
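In that program, match() records where the scheme://host prefix begins and how long it is in the built-in variables RSTART and RLENGTH, and substr() cuts out exactly that slice as the dedup key. A one-line illustration of the extraction step:

```shell
# Extract the scheme://host prefix that match() located via RSTART/RLENGTH.
echo 'https://target.com/something/?first=one' |
awk 'match($0,/https?:\/\/[^/]*/){print substr($0,RSTART,RLENGTH)}'
# prints: https://target.com
```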