autodetect what to depend on based on pkg_info -W filename
PLIST= file dir/ dirtry/? dirforce/* dir/+
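A minimal sketch of the pkg_info -W idea, in Python. It assumes the output line looks like "/path/to/file was installed by package name-version"; that format, and the helper names parse_owner and owning_packages, are assumptions for illustration only.

```python
import re
import shutil
import subprocess

# Assumed output shape of `pkg_info -W <file>` (not verified here):
#   /usr/local/bin/foo was installed by package foo-1.2
OWNER_RE = re.compile(r"^(?P<path>\S+) was installed by package (?P<pkg>\S+)$")

def parse_owner(line):
    """Return (path, package) from one output line, or None if it does not match."""
    m = OWNER_RE.match(line.strip())
    return (m.group("path"), m.group("pkg")) if m else None

def owning_packages(paths):
    """Map each file to the package that owns it; unowned files are skipped."""
    owners = {}
    if shutil.which("pkg_info") is None:
        return owners  # tool not available on this machine
    for path in paths:
        proc = subprocess.run(["pkg_info", "-W", path],
                              capture_output=True, text=True)
        for line in proc.stdout.splitlines():
            parsed = parse_owner(line)
            if parsed:
                owners[parsed[0]] = parsed[1]
    return owners
```

The set of package names returned for the files a port actually uses would be the candidate dependency list.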
Automating, etc.
preferably fully distributed, but realistically the first step will be pretty centralized
at least each atomic task will be
but absolutely 100% open-source, so everyone can run it
good docs needed, too
Data input
cvs/svn/...
cvsup
rsync
http crawling
ftp crawling
commit mails
with/without diffs
keep-alives (icmp, http, ftp, tcp)
bug-tracking systems
gnats
Debian BTS
cool SOAP interface
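For the keep-alives item above, a minimal sketch of the plain TCP case in Python. The function name tcp_alive and the default timeout are assumptions. An http or ftp check can start from the same connect test on the service port; icmp normally needs raw sockets and extra privileges, so it is left out here.

```python
import socket

def tcp_alive(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # refused, unreachable, timed out, or name resolution failed
        return False
```

This only shows that something accepts connections on the port, not that the mirror behind it serves correct data.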
Data storage
file systems and databases
first shot could be the default, most obvious format - plain files
slow
very slow
but we need indexing
pgsql
possibly store plain files and extract info from them into a DB
store /usr/ports
run lots of make -V SOMEVAR
store results into indexed DB tables
freshports.org (and I'm sure many others) are already doing it
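The make -V extraction above could be sketched like this in Python. sqlite3 stands in for pgsql only so the example is self-contained; the table name port_vars, its columns, and the helper names are assumptions, not an existing schema.

```python
import sqlite3
import subprocess

def make_var(portdir, var):
    """Run `make -V var` inside one port directory and return the printed value."""
    proc = subprocess.run(["make", "-V", var], cwd=portdir,
                          capture_output=True, text=True, check=True)
    return proc.stdout.strip()

def open_store(path=":memory:"):
    """Open the DB and create the (hypothetical) indexed table if missing."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS port_vars ("
               "origin TEXT NOT NULL, var TEXT NOT NULL, value TEXT NOT NULL, "
               "PRIMARY KEY (origin, var))")
    # secondary index for lookups such as "which ports have this value?"
    db.execute("CREATE INDEX IF NOT EXISTS port_vars_by_value "
               "ON port_vars (var, value)")
    return db

def store_var(db, origin, var, value):
    """Insert or overwrite one variable value for one port."""
    db.execute("INSERT OR REPLACE INTO port_vars (origin, var, value) "
               "VALUES (?, ?, ?)", (origin, var, value))
    db.commit()
```

A run over the tree would call make_var for each port directory under /usr/ports and pass the result to store_var; the plain files stay the source of truth and the table is only a derived index.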
Data processing
probably perl
probably IPC should be done via pipes and sockets, so that any program can be used
distributed computing should be possible in this way
distributed computing is vital, e.g. in large-scale crawling
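A minimal sketch of the pipe-based IPC idea, written in Python rather than perl only to keep the example short. The name run_pipeline is an assumption. Each stage is an arbitrary program that reads stdin and writes stdout, so stages can be written in any language.

```python
import subprocess
import threading

def run_pipeline(stages, data):
    """Chain commands with OS pipes; feed data (bytes) in, return the last stage's stdout."""
    procs = []
    for argv in stages:
        stdin = procs[-1].stdout if procs else subprocess.PIPE
        proc = subprocess.Popen(argv, stdin=stdin, stdout=subprocess.PIPE)
        if procs:
            # the child owns this pipe end now; drop the parent's copy
            procs[-1].stdout.close()
        procs.append(proc)

    def feed():
        # write from a separate thread so a large input cannot deadlock
        # against reading the output below
        try:
            procs[0].stdin.write(data)
            procs[0].stdin.close()
        except BrokenPipeError:
            pass  # first stage exited before consuming all input

    writer = threading.Thread(target=feed)
    writer.start()
    out = procs[-1].stdout.read()
    writer.join()
    for proc in procs:
        proc.wait()
    return out
```

For example, run_pipeline([["sort"], ["uniq", "-c"]], b"b\na\nb\n") counts duplicate lines with two standard tools. Replacing a pipe with a socket carrying the same byte stream is what would let a stage run on another machine; that part is not shown.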
Technical data mining
VCS
either mine from the tip
or mine from timeline
this is initially more complicated, but logically clearer
real-time or not
real-time is quite hard
may introduce lots of temporary inconsistencies (e.g. repocopies) which need to be preened later (e.g. cvsup output)
non-real-time gets some impatient people so frustrated they can't use it
greylisting => grey hair
tracing commit mails
thanks to dfilter a real-time cvs mirror is possible
repocopies will lag, probably even if using cvsup-master
I'm always getting lured into thinking about a perfect DB-based VCS (or worse - a perfect FS that includes a VCS) - need to build a mental fence and keep within its boundaries