Cenkes Web > Paradzap (22 Oct 2007, Main.AndrewPantyukhin)

Paradzap - Parallel Adzapper

Overview

Paradzap is a URL rewriter for Squid. It is based on the adzapper rule syntax, but has none of adzapper's advanced features (such as stand-alone proxy mode). In addition, it supports concurrent requests and rule hashing by domain name (or IP address).

Rule set

  1. Global (slow)
  2. Per-domain (hashed, very fast)

Matching

  1. Per-domain
  2. Global classful
First match is decisive

Hashing

  1. Domain name xxx.yyy.domain.com:
    • look for xxx.yyy.domain.com
    • look for yyy.domain.com
    • look for domain.com
    • look for com
    • look for xxx.
    • look for .yyy.
    • look for .domain.
    • global
  2. IPv4 address 1.2.3.4:
    • look for 1.2.3.4
    • look for 1.2.3
    • look for 1.2
    • look for 1
    • global
    • (netmask matching might be implemented later)
  3. IPv6 may be implemented later
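
The lookup order above can be sketched as a function that lists the hash keys to try, most specific first. This is a minimal illustration in Python, not Paradzap's own code. The function name lookup_keys is hypothetical, and the "GLOBAL" key is taken from the Data section of this page.

```python
def lookup_keys(host):
    """Return the hash keys to try for a host, most specific first."""
    parts = host.lower().strip(".").split(".")
    if len(parts) == 4 and all(p.isdigit() for p in parts):
        # IPv4 address: 1.2.3.4, then 1.2.3, then 1.2, then 1
        keys = [".".join(parts[:n]) for n in range(4, 0, -1)]
    else:
        # Domain suffixes: xxx.yyy.domain.com, yyy.domain.com, domain.com, com
        keys = [".".join(parts[i:]) for i in range(len(parts))]
        if len(parts) > 1:
            # Single-label keys: "xxx." for the first label,
            # ".yyy." and ".domain." for the inner labels
            keys.append(parts[0] + ".")
            keys.extend("." + p + "." for p in parts[1:-1])
    # The global rule list is always tried last
    keys.append("GLOBAL")
    return keys


if __name__ == "__main__":
    for key in lookup_keys("xxx.yyy.domain.com"):
        print(key)
```

Each key costs one hash lookup, so the work per request grows with the number of labels in the host name rather than with the number of domains in the rule set.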

Rule file syntax

[[domain]:]CLASS pattern [replacement]
CLASS<filename
Replacements support positional parameters ($[0-9]), but do not support other variables (like $&) for performance reasons.
PASS **123**
AD **567**
bignews.com:PRINT article.php?(**) http://bignews.com/print.php?$1
domain.com:AD **jjj**
another.org:PASS **kkk**
multi.net:AD **abc**
:AD **nnn**
:PASS **ksjdkj**
orthisway.ru:
:AD **123**
:PASS **789**
PASS **globalagain**
adnet.com:AD
wikipedia.org:PASS
example.net:
:PRINT /relative_to_domain**
:ADHTML http://(absolute|ads).example.net/**
PRON</path/to/file/with/domains
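
One possible reading of the example above, sketched in Python for illustration only (this is not Paradzap's parser). The assumptions are inferred from the example and are not stated by the page: a bare "domain:" line sets the domain that later ":CLASS" lines refer to, a line without a colon in its first field is global, and a rule without a pattern covers the whole domain ("**"). The name parse_rules is hypothetical, and CLASS<filename includes are only recorded, not read.

```python
def parse_rules(lines):
    """Parse rule-file lines into ({key: [rule, ...]}, [(cls, filename), ...])."""
    rules = {}       # hash key -> list of (cls, pattern[, replacement])
    includes = []    # CLASS<filename entries, left for the caller to load
    current = None   # domain that a leading ":" refers back to (assumption)
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue
        head = fields[0]
        if "<" in head:
            cls, filename = head.split("<", 1)
            includes.append((cls, filename))
            continue
        if ":" in head:
            domain, cls = head.split(":", 1)
            if domain:
                current = domain
            key = current or "GLOBAL"
        else:
            cls, key = head, "GLOBAL"
        if not cls:
            continue  # bare "domain:" line, only sets the context
        # Assumption: a missing pattern means "match everything"
        pattern = fields[1] if len(fields) > 1 else "**"
        rule = (cls, pattern) + tuple(fields[2:3])
        rules.setdefault(key, []).append(rule)
    return rules, includes
```

Only the first whitespace-separated field is inspected for ":" and "<", so patterns that contain a URL scheme (such as the ADHTML rule above) are not split by mistake.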

Domain file syntax

very.simple.com
another.one.ru
and.so.on.org
possibly.with.regexps.net **bad_advert**
and.relative.co.uk /banner?**
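
A domain file of this shape could be loaded roughly as follows. This is a Python sketch for illustration, with parse_domain_file as a hypothetical name. It assumes that every line inherits the class of the CLASS<filename rule that included the file, and that a line without a pattern covers the whole domain ("**"). Neither assumption is confirmed by the page.

```python
def parse_domain_file(cls, lines):
    """Turn 'domain [pattern]' lines into {domain: [(cls, pattern), ...]}."""
    rules = {}
    for raw in lines:
        fields = raw.split(None, 1)
        if not fields:
            continue  # skip blank lines
        domain = fields[0].lower()
        # Assumption: no pattern means the whole domain is matched
        pattern = fields[1].strip() if len(fields) > 1 else "**"
        rules.setdefault(domain, []).append((cls, pattern))
    return rules
```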

paradzap.conf syntax

paradzap.conf is sourced as a shell script, for example:
cd /home/sat/paradzap/
ZAP_RULEFILES="zap.pass zap.ad zap.porn zap.adzap"

State

An alpha version is undergoing preliminary testing.

Architecture

Dependencies

  • URI (URI::Escape, URI::Split)

Data

The rule set is one big hash:
  • hash ("GLOBAL", "domain.com", "1.2.3.4")
    • list of lists
      • ("AD", "**")
      • ("AD", "**123**")
      • ("PASS", "**567**")
      • ("PRINT", "(*)article.php?(*)", "$1print.php?$2")
The problem is an utter lack of scalability: 600k domains take about 600 MB of RAM, or roughly 1 KB per domain. Something should be done about it.
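
The hash-of-lists layout and the "first match is decisive" rule can be sketched in Python as follows. This is an illustration under stated assumptions, not Paradzap's implementation: patterns are treated as adzapper-style globs (** matches anything, * matches anything except "/"), they are matched unanchored against the whole URL, and a replacement is taken to be the complete new URL. The names RULES, glob_to_regex and first_match are hypothetical, and the sample rules are taken from the rule file example on this page.

```python
import re

# Hypothetical in-memory rule set: hash key -> list of rule tuples
RULES = {
    "GLOBAL": [("AD", "**123**"), ("PASS", "**567**")],
    "bignews.com": [
        ("PRINT", "article.php?(**)", "http://bignews.com/print.php?$1"),
    ],
}


def glob_to_regex(pattern):
    """Compile a glob; (, ) and | keep their regex meaning (assumption)."""
    out = []
    for part in re.split(r"(\*\*|\*|[()|])", pattern):
        if part == "**":
            out.append(".*")
        elif part == "*":
            out.append("[^/]*")
        elif part in ("(", ")", "|"):
            out.append(part)
        else:
            out.append(re.escape(part))
    return re.compile("".join(out))


def first_match(url, keys, rules=RULES):
    """Try keys in order; return (cls, new_url or None) for the first hit."""
    for key in keys:
        for rule in rules.get(key, []):
            # A real rewriter would compile each pattern once at load time
            m = glob_to_regex(rule[1]).search(url)
            if m is None:
                continue
            if len(rule) < 3:
                return rule[0], None
            # Expand positional parameters $0..$9 only
            new_url = re.sub(r"\$(\d)",
                             lambda r: m.group(int(r.group(1))) or "",
                             rule[2])
            return rule[0], new_url
    return None
```

Here keys is the ordered list of hash keys for the request host, for example ["bignews.com", "com", "GLOBAL"], so per-domain rules are consulted before the global list.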

Request flow

  1. uri_unescape
  2. case sensitivity
    • maybe lowercase everything
    • maybe just domain part
    • maybe just case-insensitive matching
      • performance?
  3. traverse rule set
  4. maybe uri_escape (seems to work fine without it)
  5. output
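
Steps 1 and 2 could look like this in Python, shown here only as an illustration of one of the options listed above (lowercasing just the domain part). The function name normalize is hypothetical, and Python's urllib.parse stands in for the URI modules listed under Dependencies.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def normalize(url):
    """Unescape a URL and lowercase its host; return (url, host)."""
    # Step 1: undo percent-encoding so rules see the plain text
    parts = urlsplit(unquote(url))
    # Step 2: lowercase only the network location; path and query keep their case
    clean = urlunsplit((parts.scheme, parts.netloc.lower(),
                        parts.path, parts.query, parts.fragment))
    return clean, (parts.hostname or "")


if __name__ == "__main__":
    print(normalize("HTTP://Ads.Example.NET/Banner%3Fid=1"))
```

Unescaping before matching means an encoded "?" or "/" is treated like a literal one, which is why step 4 (escaping again on output) may matter for some URLs.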

URL lists

Topic revision: r9 - 22 Oct 2007 - 03:14:02 - Main.AndrewPantyukhin
Cenkes.Paradzap moved from Cenkes.Paradzapper on 18 Oct 2007 - 17:04 by Main.AndrewPantyukhin