[PLUG] The Grand Unified Historical PLUG Mailing List Archive Project

Russell Senior seniorr at aracnet.com
Wed Feb 18 02:17:02 UTC 2004


I keep asking related questions, I should just come out and say what I
am working on...

Some considerable time ago (July 2001), I asked PLUG subscribers,
particularly those of ancient abidance, for their archives of PLUG
mailing list traffic.  PLUG began nearly 10 years ago (see Message-ID:
<By3OpeQ.spu at delphi.com>), and much of its early mailing list traffic
had scattered to the wind.  Skylab had started to get a little dodgy.
I wanted to sweep it all (as much as I could find) back up into a
coherent pile again.  I received contributions from 7 individuals
besides myself, and have drawn on two existing archives (skylab's and
drizzle's).

I have made a few runs at consolidating this stuff before, and in the
last few days I've started a new run at it.  The goal is to
consolidate, remove duplication, remove non-PLUG-list messages, and
provide clean messages (reconstructing to the extent possible the form
they had as they passed through the mailing list server by removing
individualized delivery and other extraneous headers).  In short,
generic but suitable for further archeology.

With a reasonably up-to-date drizzle archive (as of a few days ago)
and all my other sources, I have 218,925 total articles of which
90,821 are unique message-ids (one message-id has 17 copies!).  They
are all currently sitting in a provisional PostgreSQL database table.

Remaining (known) problems include:

  a) not all of those 90,000+ message-ids are PLUG list messages, and
     I don't want to pass along anything that wasn't originally
     public;

  b) many of the messages have delivery headers specific to those
     individuals that contributed their archive, as well as storage
     headers unique to their archive, whereas I want only the generic
     parts of the message headers;

  c) I want to be able to establish an accurate ordering of the
     messages;

  d) I want to identify missing messages; and

  e) I want to identify common authorship.

Each of these problems and my general associated thinking is discussed
below:

Problem a) will probably be dealt with by querying the headers of all
messages for plug-list signature items (To: lines, passing through the
mailing list host, that sort of thing) and identifying/excluding those
that lack "identifying marks".
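
A minimal sketch of that kind of filter, in Python.  The list address
and host below are placeholders (the real values are not given here),
and looks_like_list_message is a hypothetical helper name.  It treats
a message as list traffic if either mark is present:

```python
from email import message_from_string

# Assumed markers; substitute the real list address and list host.
LIST_ADDRESS = "plug@lists.example.org"   # hypothetical
LIST_HOST = "lists.example.org"           # hypothetical

def looks_like_list_message(raw):
    """Return True if the headers carry at least one list 'identifying mark'."""
    msg = message_from_string(raw)
    # Addressed to the list, either directly or by Cc.
    recipients = " ".join(
        str(v) for v in msg.get_all("To", []) + msg.get_all("Cc", [])
    ).lower()
    # Passed through the mailing list host on the way to the recipient.
    received = " ".join(str(v) for v in msg.get_all("Received", [])).lower()
    return LIST_ADDRESS in recipients or LIST_HOST in received
```

Messages that fail both checks would be candidates for exclusion, not
automatic deletions; a manual look at the rejects is probably wise.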

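One of the ideas I have for dealing with problem b) is, for messages
with multiple copies, to find the lines common to all copies' headers
and use those plus an appropriate '^From ' line (these will differ,
but we need to have one).  It's not quite as simple as that because of
line folding: the same header may be wrapped onto continuation lines
differently in different copies, so the comparison has to be made on
unfolded headers.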

 ** If anyone feels a burning need to suggest some Perl code for doing
    this, do _not_ allow me to stop you! ** (i.e., it is what I am
    working on now ;-)
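
One possible shape for this, sketched in Python rather than Perl.  The
function names are hypothetical, and the input is assumed to be the
raw header block of each copy as a string.  Continuation lines are
joined first, so that differently folded copies of the same header
compare equal:

```python
def unfold_headers(header_text):
    """Join folded continuation lines (those starting with a space or
    tab) onto the header line they belong to."""
    logical = []
    for line in header_text.splitlines():
        if line[:1] in (" ", "\t") and logical:
            logical[-1] += " " + line.strip()
        else:
            logical.append(line.rstrip())
    return logical

def common_headers(copies):
    """Return the unfolded header lines present in every copy, in the
    order they appear in the first copy.  The mbox 'From ' separator
    line is skipped; one has to be chosen separately."""
    unfolded = [unfold_headers(c) for c in copies]
    shared = set(unfolded[0]).intersection(*map(set, unfolded[1:]))
    return [h for h in unfolded[0]
            if h in shared and not h.startswith("From ")]
```

Note that joining with a single space discards the original fold
points, so the output is a normalized form, not a byte-exact copy of
any one source.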

This strategy isn't going to suffice for cases where we only have one
copy and it was delivered to an individual, but I am hoping that what
I learn from what gets "thrown away" from the multi-copy resolution
process will inform what I need to do in these remaining cases.

There is a similar, if less serious, problem for differing message
bodies.  For example, in the case of the 17-copy message mentioned
above, there are two variants.  It appears the difference is limited
to trailing white space (one variant looks like it has an extra \n).
Still, in multiple copy cases, the bodies should be compared.
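
A small sketch of such a comparison in Python (body_variants is a
hypothetical name).  It assumes that trailing whitespace on a line
and trailing blank lines are the only differences worth ignoring,
and groups the copies by what remains:

```python
from collections import defaultdict

def body_variants(bodies):
    """Group copies of one message's body by content, ignoring
    trailing whitespace on each line and trailing blank lines.
    Returns lists of indexes into `bodies`, one list per variant."""
    groups = defaultdict(list)
    for index, body in enumerate(bodies):
        lines = [line.rstrip() for line in body.splitlines()]
        while lines and lines[-1] == "":
            lines.pop()
        groups["\n".join(lines)].append(index)
    return list(groups.values())
```

More than one group for a given message-id means the copies differ in
some way that deserves a closer look.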

Problem c) is complicated by the observation that "people's clocks are
often incoherent".  The "Date:" header lines are unreliable (e.g. see
the current mailman monthly archives with messages listed in January
1980, September 2006 and August 2019).  A couple of things can help
here:

  i) using the timestamps from the "Received:" header associated with
     the mailing list host.  This clock might be wrong, but it is
     likely to be monotonically increasing;

  ii) check using a "parent_id" (either from "In-Reply-To:" or
      "References:" headers) to ensure that antecedent messages really
      do occur earlier in the ordering.
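
The two checks can be combined roughly as follows.  This is a Python
sketch under stated assumptions: the list host's "Received:" value
has already been picked out for each message, its timestamp follows
the final ';', and the function names and input layout are made up
for illustration:

```python
from email.utils import mktime_tz, parsedate_tz

def received_epoch(received_value):
    """Seconds since the epoch for the date after the final ';' of a
    Received: header value, or None if it cannot be parsed."""
    parsed = parsedate_tz(received_value.rsplit(";", 1)[-1])
    return mktime_tz(parsed) if parsed else None

def order_violations(messages):
    """`messages` maps message-id -> (list_received_value, parent_id
    or None).  Return the ids of replies stamped earlier than the
    message they reply to."""
    stamps = {mid: received_epoch(rv) for mid, (rv, _) in messages.items()}
    bad = []
    for mid, (_, parent) in messages.items():
        child_t, parent_t = stamps[mid], stamps.get(parent)
        if child_t is not None and parent_t is not None and child_t < parent_t:
            bad.append(mid)
    return bad
```

Replies whose parent is not in the table are simply skipped here.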

Problem d) can be addressed, if imperfectly, by looking at
"In-Reply-To:" and "References:" headers and seeing if those
message-ids appear in the database, and then perhaps creating
placeholder records for those not found.
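
As a sketch in Python, with the database table stood in for by a
plain dict and a hypothetical function name:

```python
import re

# Anything that looks like <...> with no spaces inside.
MSGID = re.compile(r"<[^<>\s]+>")

def missing_ids(messages):
    """`messages` maps message-id -> dict of header name -> value.
    Return, sorted, every id named in In-Reply-To: or References:
    that has no record of its own."""
    referenced = set()
    for headers in messages.values():
        for name in ("In-Reply-To", "References"):
            referenced.update(MSGID.findall(headers.get(name, "")))
    return sorted(referenced - set(messages))
```

One caveat: some mailers put free text in "In-Reply-To:", including
bracketed email addresses, so a few of the ids found this way will
not be real message-ids.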

Problem e) will have to be solved manually.  Many individuals appear with slightly
permuted "From:" lines.  I want to create a table which links the
variants to a common author_id.  This will allow amusements like
computation of posting frequency (FWIW, _not_ accounting for
acknowledged message duplicates, Rich Shepard is _way_ out front
quantitatively with 15,954 of the 218,925!).  Basically, this will
require looking through the list of "From:" lines and identifying the
permutations and modifying the author table (or equivalent)
accordingly.
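
The manual pass could be seeded with candidate groups.  The Python
sketch below (hypothetical function name) only groups "From:" lines
that share an address, so it will miss people who posted from several
addresses; those still need eyeballing:

```python
from collections import defaultdict
from email.utils import parseaddr

def candidate_author_groups(from_lines):
    """Group distinct From: values by lowercased address and return
    only the addresses that appear with more than one spelling."""
    groups = defaultdict(set)
    for line in from_lines:
        _, addr = parseaddr(line)
        groups[addr.lower()].add(line)
    return {addr: sorted(v) for addr, v in groups.items() if len(v) > 1}
```

Each group returned is a suggestion for one author_id, to be
confirmed or split by hand.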


My zero-th order approximate end point for this project is just to
recreate a global mbox, but it is easy to imagine some kind of
web-interface to the database to enable searching, or just some
html-ification that would allow Google to index it.

General and/or specific comments, ideas, suggestions (even additional
pre-1998-07 contributions) are welcome.

-- 
Russell Senior         ``I have nine fingers; you have ten.''
seniorr at aracnet.com



