Paradzap - Parallel Adzapper
Overview
Paradzap is a URL rewriter for Squid. It is based on adzapper rule syntax, but has none of adzapper's advanced features (such as stand-alone proxy mode). In addition, it supports concurrent requests and rule hashing by domain name or IP address.
Rule set
- Global (slow)
- Per-domain (hashed, very fast)
Matching
- Per-domain
- Global classful
First match is decisive
Hashing
- Domain name xxx.yyy.domain.com:
- look for xxx.yyy.domain.com
- look for yyy.domain.com
- look for domain.com
- look for com
- look for xxx.
- look for .yyy.
- look for .domain.
- global
- IPv4 address 1.2.3.4:
- look for 1.2.3.4
- look for 1.2.3
- look for 1.2
- look for 1
- global
- netmask matching might be implemented later
- IPv6 may be implemented later
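The lookup order above can be sketched in a few lines of Python. This is only an illustration, not Paradzap's own code: the function name `lookup_keys` is hypothetical, the `GLOBAL` key is borrowed from the Data section, and the handling of single-label hosts is an assumption, since the page gives only one domain example.

```python
import ipaddress


def lookup_keys(host):
    """Return the hash keys to probe for a host, most specific first."""
    try:
        ipaddress.IPv4Address(host)
        is_ipv4 = True
    except ValueError:
        is_ipv4 = False

    labels = host.split(".")
    keys = []
    if is_ipv4:
        # 1.2.3.4 -> 1.2.3.4, 1.2.3, 1.2, 1
        for n in range(len(labels), 0, -1):
            keys.append(".".join(labels[:n]))
    else:
        # Suffixes: xxx.yyy.domain.com, yyy.domain.com, domain.com, com
        for i in range(len(labels)):
            keys.append(".".join(labels[i:]))
        # Single labels: "xxx." for the first, ".yyy." for inner ones.
        # Assumption: a single-label host gets no label keys at all.
        if len(labels) > 1:
            keys.append(labels[0] + ".")
            for label in labels[1:-1]:
                keys.append("." + label + ".")
    keys.append("GLOBAL")  # global rules are always tried last
    return keys
```

Each key costs one hash probe, so the number of probes grows with the number of labels in the host name rather than with the number of domains in the rule set.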
Rule file syntax
[[domain]:]CLASS pattern [replacement]
CLASS<filename
Replacements support positional parameters ($[0-9]), but do not support other variables (like $&) for performance reasons.
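A rough Python sketch of how a pattern and a positional replacement might be applied. The wildcard semantics are an assumption taken from adzapper conventions and are not confirmed on this page: `**` matches anything, `*` matches anything except `/`, parentheses capture, `|` alternates, and every other character is literal. The names `compile_pattern` and `expand` are hypothetical.

```python
import re


def compile_pattern(pattern):
    """Translate a rule pattern into a compiled regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")        # assumed: ** crosses slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # assumed: * stops at a slash
            i += 1
        elif pattern[i] in "()|":
            out.append(pattern[i])  # grouping and alternation pass through
            i += 1
        else:
            out.append(re.escape(pattern[i]))  # everything else is literal
            i += 1
    return re.compile("".join(out))


_PARAM = re.compile(r"\$([0-9])")


def expand(replacement, match):
    """Substitute $0..$9 in a replacement; nothing else is interpreted."""
    def sub(m):
        n = int(m.group(1))
        if n > match.re.groups:
            return ""               # no such group in the pattern
        return match.group(n) or ""
    return _PARAM.sub(sub, replacement)
```

With the `bignews.com` rule shown in the examples, matching `/article.php?id=42` against `article.php?(**)` captures `id=42`, and the replacement expands to `http://bignews.com/print.php?id=42`.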
PASS **123**
AD **567**
bignews.com:PRINT article.php?(**) http://bignews.com/print.php?$1
domain.com:AD **jjj**
another.org:PASS **kkk**
multi.net:AD **abc**
:AD **nnn**
:PASS **ksjdkj**
orthisway.ru:
:AD **123**
:PASS **789**
PASS **globalagain**
adnet.com:AD
wikipedia.org:PASS
example.net:
:PRINT /relative_to_domain**
:ADHTML http://(absolute|ads).example.net/**
PRON</path/to/file/with/domains
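The examples above suggest a few parsing rules, which the Python sketch below tries to capture. These rules are inferred from the examples rather than stated anywhere: a `domain:` prefix sets the current domain, a bare `:` prefix reuses it, a line without a colon is global, a rule without a pattern covers the whole domain (`**`), and `CLASS<filename` pulls in a domain file. The function name `parse_rules` and the tuple layout are hypothetical.

```python
def parse_rules(lines):
    """Parse rule-file lines into (rules, includes).

    rules    -- list of (domain, class, pattern, replacement)
    includes -- list of (class, filename) from CLASS<filename lines
    """
    rules = []
    includes = []
    current = None  # domain set by the most recent "domain:" prefix
    for raw in lines:
        fields = raw.split(None, 2)
        if not fields:
            continue
        head = fields[0]
        if "<" in head and ":" not in head:
            cls, filename = head.split("<", 1)
            includes.append((cls, filename))
            continue
        if ":" in head:
            domain, cls = head.split(":", 1)
            if domain:
                current = domain
            if not cls:
                continue  # "domain:" alone only switches the domain
            target = current or "GLOBAL"
        else:
            cls = head
            target = "GLOBAL"
        pattern = fields[1] if len(fields) > 1 else "**"
        replacement = fields[2] if len(fields) > 2 else None
        rules.append((target, cls, pattern, replacement))
    return rules, includes
```

Only the first whitespace-separated field is checked for a colon, so patterns and replacements that contain `http://` are not mistaken for a domain prefix.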
Domain file syntax
very.simple.com
another.one.ru
and.so.on.org
possibly.with.regexps.net **bad_advert**
and.relative.co.uk /banner?**
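A domain file could be read along these lines. This is a sketch under assumptions: each line is taken to be a domain optionally followed by a pattern, the class comes from the `CLASS<filename` line that included the file, and a missing pattern is treated as `**`. The name `parse_domain_file` is hypothetical.

```python
def parse_domain_file(cls, lines):
    """Turn domain-file lines into {domain: [[class, pattern], ...]}."""
    rules = {}
    for raw in lines:
        fields = raw.split(None, 1)
        if not fields:
            continue  # skip blank lines
        domain = fields[0]
        # Assumption: a bare domain means "every URL on this domain".
        pattern = fields[1].strip() if len(fields) > 1 else "**"
        rules.setdefault(domain, []).append([cls, pattern])
    return rules
```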
paradzap.conf syntax
paradzap.conf is sourced as a shell script
cd /home/sat/paradzap/
ZAP_RULEFILES="zap.pass zap.ad zap.porn zap.adzap"
State
An alpha version is undergoing preliminary testing.
Architecture
Dependencies
- URI (URI::Escape, URI::Split)
Data
Rule set is one big hash
- hash ("GLOBAL", "domain.com", "1.2.3.4")
- list of lists
- ("AD", "**")
- ("AD", "**123**")
- ("PASS", "**567**")
- ("PRINT", "(*)article.php?(*)", "$1print.php?$2")
The problem is an utter lack of scalability: 600k domains take about 600 MB of RAM, roughly 1 KB per domain. Something should be done about it.
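For illustration, the layout described above maps onto a Python dictionary of rule lists, with a first-match traversal over a list of candidate keys. This is not the actual implementation: `RULES`, `simple_match`, and `first_match` are hypothetical names, and the matcher is deliberately simplified to treat any run of `*` as "match anything".

```python
import re

# One big hash: key -> list of [class, pattern, (replacement)] lists.
RULES = {
    "GLOBAL": [["AD", "**123**"], ["PASS", "**567**"]],
    "domain.com": [["AD", "**"]],
    "1.2.3.4": [["PASS", "**"]],
}


def simple_match(pattern, url):
    """Simplified matcher: every run of '*' matches any text."""
    regex = ".*".join(re.escape(part) for part in re.split(r"\*+", pattern))
    return re.search(regex, url) is not None


def first_match(rules, keys, url):
    """Probe keys in order; the first matching rule is decisive."""
    for key in keys:
        for rule in rules.get(key, ()):
            if simple_match(rule[1], url):
                return key, rule
    return None
```

Every domain carries its own list of small lists, so per-entry overhead dominates when most domains hold a single rule, which is consistent with the memory figure quoted above.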
Request flow
- uri_unescape
- case sensitivity
- maybe lowercase everything
- maybe just domain part
- maybe just case-insensitive matching
- traverse rule set
- maybe uri_escape (seems to work fine without it)
- output
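The first two steps of the flow might look like this in Python. The sketch picks one of the open options listed above, lowercasing only the domain part, purely as an assumption. It uses the standard `urllib.parse` module in place of the Perl URI modules; `normalize` is a hypothetical name, and user-info in the URL is dropped for simplicity.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def normalize(url):
    """Unescape a URL and lowercase only its host part.

    Returns (host, normalized_url); path and query keep their case.
    """
    parts = urlsplit(unquote(url))
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    normalized = urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
    return host, normalized
```

The returned host is what would feed the hash lookup, and the normalized URL is what the patterns would be matched against. Note that unescaping before matching turns `%2F` into a real slash, which changes how path-sensitive patterns see the URL.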
URL lists
Topic revision: r9 - 22 Oct 2007 - 03:14:02 - Main.AndrewPantyukhin
Cenkes.Paradzap moved from Cenkes.Paradzapper on 18 Oct 2007 - 17:04 by Main.AndrewPantyukhin