TWiki > Cenkes Web > PortsTodo (08 Nov 2007, Main.AndrewPantyukhin)

Ports Todo

  • alternating dependencies
    • autodetect what to depend on based on pkg_info -W filename
  • PLIST= file dir/ dirtry/? dirforce/* dir/+
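
A minimal sketch of the autodetection idea, assuming that `pkg_info -W filename` prints a line of the form "<file> was installed by package <pkgname>" (that output format, and the names `owning_package` and `pick_dependency`, are assumptions made for illustration). Given several alternative files in order of preference, it picks the first one that some installed package already provides:

```python
import subprocess


def _pkg_info_w(path):
    """Run `pkg_info -W path` and return its stdout ('' if the tool is missing)."""
    try:
        proc = subprocess.run(["pkg_info", "-W", path],
                              capture_output=True, text=True)
    except FileNotFoundError:
        return ""
    return proc.stdout


def owning_package(path, run=_pkg_info_w):
    """Return the name of the package that installed `path`, or None."""
    # Assumed output format: "<path> was installed by package <pkgname>"
    marker = " was installed by package "
    for line in run(path).splitlines():
        if marker in line:
            return line.split(marker, 1)[1].strip()
    return None


def pick_dependency(alternatives, run=_pkg_info_w):
    """alternatives: list of (file, port_origin), most preferred first.

    Returns (port_origin, installed_pkgname) for the first alternative whose
    file is already installed, else (first_origin, None) as the default.
    """
    for path, origin in alternatives:
        pkg = owning_package(path, run)
        if pkg is not None:
            return origin, pkg
    return alternatives[0][1], None
```

The `run` parameter exists only so the lookup can be replaced by a stub when the real tool is not available.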

Automating, etc.

  • preferably fully distributed, but realistically the first step will be pretty centralized
    • at least each atomic task will be
    • but absolutely 100% open-source, so everyone can run it
      • good docs needed, too

Data input

  • cvs/svn/...
  • cvsup
  • rsync
  • http crawling
  • ftp crawling
  • commit mails
    • with/without diffs
  • keep-alives (icmp, http, ftp, tcp)
  • bug-tracking systems
    • gnats
    • Debian BTS
      • cool SOAP interface

Data storage

  • file systems and databases
  • first shot could be the default, most obvious format - plain files
    • slow
    • very slow
  • but we need indexing
    • pgsql
  • possibly, we may store plain files, and extract info from them into DB
    • store /usr/ports
    • run lots of make -V SOMEVAR
    • store results into indexed DB tables
    • freshports.org (and I'm sure many others) are already doing it
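
The "extract into DB" step above could look roughly like the following sketch. It uses SQLite from the Python standard library as a stand-in for pgsql so that it stays self-contained; the table name `port_vars`, the index name, and the helper names are hypothetical. It relies on make printing one value per line when given several `-V` flags.

```python
import sqlite3
import subprocess


def make_vars(port_dir, names):
    """Run make once in port_dir with one -V per variable; one value per line."""
    cmd = ["make"]
    for name in names:
        cmd += ["-V", name]
    out = subprocess.run(cmd, cwd=port_dir, capture_output=True,
                         text=True, check=True).stdout
    return dict(zip(names, out.splitlines()))


def store(conn, origin, values):
    """Store the variables of one port (e.g. origin 'editors/vim')."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS port_vars ("
        " origin TEXT, name TEXT, value TEXT,"
        " PRIMARY KEY (origin, name))")
    # The index is what makes "all ports with MAINTAINER = x" fast.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS port_vars_by_value"
        " ON port_vars (name, value)")
    conn.executemany(
        "INSERT OR REPLACE INTO port_vars VALUES (?, ?, ?)",
        [(origin, n, v) for n, v in values.items()])
    conn.commit()


def ports_where(conn, name, value):
    """Return the origins of all ports whose variable `name` equals `value`."""
    rows = conn.execute(
        "SELECT origin FROM port_vars WHERE name = ? AND value = ?"
        " ORDER BY origin", (name, value))
    return [r[0] for r in rows]
```

A full pass would call `make_vars` for every directory under the stored /usr/ports and feed the result to `store`.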

Data processing

  • probably perl
  • probably IPC should be done via pipes and sockets, so that any programs can be used
    • distributed computing should be possible in this way
    • distributed computing is vital, e.g. in large-scale crawling
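
To illustrate the pipe-based IPC idea, here is a small sketch in Python (the notes lean towards perl; Python is used here only to keep the example self-contained). The assumed convention is hypothetical: a worker is any program that reads one task per line on stdin and writes one result per line on stdout. The `WORKER` script is only a stand-in for a real processing step.

```python
import subprocess
import sys

# Stand-in worker: any program in any language would do, as long as it
# reads tasks line by line on stdin and answers line by line on stdout.
WORKER = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n"
)


def run_through_worker(argv, tasks):
    """Feed tasks to the worker process over a pipe and collect its answers."""
    data = "".join(task + "\n" for task in tasks)
    proc = subprocess.run(argv, input=data, capture_output=True,
                          text=True, check=True)
    return proc.stdout.splitlines()


if __name__ == "__main__":
    print(run_through_worker([sys.executable, "-c", WORKER],
                             ["editors/vim", "devel/gmake"]))
```

Because the contract is just lines over a byte stream, the same worker could sit behind a socket instead of a pipe, which is what would make distributing the work possible.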

Technical data mining

  • VCS
    • either mine from the tip
    • or mine from timeline
      • this is initially more complicated, but logically more clear
  • real-time or not
    • real-time is quite hard
      • may introduce lots of temporary inconsistencies (e.g. repocopies) which need to be cleaned up later (e.g. cvsup output)
    • non-real-time gets some impatient people so frustrated they can't use it
      • greylisting => grey hair
  • tracing commit mails
    • thanks to dfilter a real-time cvs mirror is possible
      • repocopies will lag, probably even if using cvsup-master
    • I'm always getting lured into thinking about a perfect DB-based VCS (or worse - perfect FS which includes VCS) - need to build some mental fence and keep within its boundaries
  • crawling is interesting
    • http://en.wikipedia.org/wiki/Category:Free_web_crawlers
    • make fetch checksum for all ports
      • takes just about 4 days over ADSL
      • takes just about a day over 10Mbit/s
      • so this stuff is not that scary
    • in most cases, all we need to do is to check status
      • if possible, use HTTP HEAD
      • if that fails, try HTTP GET, but send TCP RST right after we get the headers
        • in some cases the headers alone won't be enough to tell whether the file has changed
    • it's very important not to block; in theory even an old box can wait on thousands of HTTP HEADs at once
  • keep-alives are interesting
    • get a list of domain names
      • together with OSI layer 4 (transport) ports and protocols
    • convert them to a list of IP addresses
      • everything's important
      • multiple IPv4
      • multiple IPv6
      • bad DNS - tricky
    • check IPs
      • ICMP
      • TCP SYN
      • HTTP GET /
      • FTP
      • preferably record RTT for everything
        • not very relevant if done from home
        • cool graphs
      • check from one location may not be enough
      • extra points - automated traceroute analysis
      • don't be so persistent that we end up on blocklists
        • distribute pingers if we do get there :)
    • the keep-alive info should be reused when doing crawling and other stuff
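
The non-blocking and keep-alive points above can be sketched with Python's asyncio. This is only an illustration under stated assumptions: it does a full TCP connect rather than a bare SYN (raw SYN probes and ICMP need raw sockets and privileges), and the names `resolve`, `probe`, and `check_all` are hypothetical. It resolves each name to every IPv4 and IPv6 address, probes each address concurrently, and records the connect RTT.

```python
import asyncio
import socket
import time


async def resolve(host, port):
    """Return every distinct (family, address) that DNS reports for host."""
    loop = asyncio.get_running_loop()
    try:
        infos = await loop.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # bad DNS: report "no addresses" instead of failing
    seen = []
    for family, _type, _proto, _canon, sockaddr in infos:
        entry = (family, sockaddr[0])
        if entry not in seen:
            seen.append(entry)
    return seen


async def probe(address, port, timeout=5.0):
    """TCP connect to one address; return the RTT in seconds, or None."""
    start = time.monotonic()
    try:
        _reader, writer = await asyncio.wait_for(
            asyncio.open_connection(address, port), timeout)
    except (OSError, asyncio.TimeoutError):
        return None
    rtt = time.monotonic() - start
    writer.close()
    try:
        await writer.wait_closed()
    except OSError:
        pass
    return rtt


async def check_all(targets, limit=100):
    """targets: iterable of (host, port) pairs.

    Returns {(host, port, address): rtt_or_None}; address is None when the
    name did not resolve. `limit` caps the number of probes in flight.
    """
    gate = asyncio.Semaphore(limit)
    results = {}

    async def one(host, port):
        addresses = await resolve(host, port)
        if not addresses:
            results[(host, port, None)] = None
        for _family, address in addresses:
            async with gate:
                results[(host, port, address)] = await probe(address, port)

    await asyncio.gather(*(one(h, p) for h, p in targets))
    return results
```

A single process can keep many such probes pending at once, and the resulting table of reachable addresses and RTTs is the kind of keep-alive info the crawler could reuse.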

Mailing-list processing

  • people should be able to trace words of interest
    • like portnames they maintain
    • thread-oriented design is needed
  • this is long-term, apparently
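
A rough sketch of what "thread-oriented" word tracing might mean, assuming plain-text, single-part messages that carry Message-ID and In-Reply-To or References headers (the function names are hypothetical). A hit in any message selects the whole thread, so a maintainer sees the follow-ups even when they no longer mention the port name.

```python
from email import message_from_string


def thread_roots(messages):
    """Map each Message-ID to the Message-ID of the root of its thread."""
    parent = {}
    for msg in messages:
        refs = (msg.get("References", "") + " " +
                msg.get("In-Reply-To", "")).split()
        # The direct parent is In-Reply-To, else the last References entry.
        parent[msg["Message-ID"].strip()] = refs[-1] if refs else None

    def root(mid):
        seen = set()
        while parent.get(mid) and mid not in seen:  # `seen` guards cycles
            seen.add(mid)
            mid = parent[mid]
        return mid

    return {mid: root(mid) for mid in parent}


def threads_mentioning(messages, words):
    """Return {root_id: [message ids in that thread]} for threads with a hit."""
    roots = thread_roots(messages)
    wanted = [w.lower() for w in words]
    hit_roots = set()
    for msg in messages:
        text = (msg.get("Subject", "") + "\n" + str(msg.get_payload())).lower()
        if any(w in text for w in wanted):
            hit_roots.add(roots[msg["Message-ID"].strip()])
    result = {}
    for msg in messages:
        mid = msg["Message-ID"].strip()
        if roots[mid] in hit_roots:
            result.setdefault(roots[mid], []).append(mid)
    return result
```

Messages can be parsed with `message_from_string` from raw list archives; matching here is a plain case-insensitive substring test, which is the simplest thing that handles names like editors/vim.
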
Topic revision: r3 - 08 Nov 2007 - 23:10:03 - Main.AndrewPantyukhin