02/27/04 Friday 11:41am EST

     Watkins Glen link down.

02/26/04 Friday 11:40am EST

     Watkins Glen link went down again.

     Nick and John G. went to CH and replaced everything; nothing
worked.  Put the original back in and it started working again.

     Suspect bad wire/amp/antenna.
 
02/23/04 Monday 1:23pm EST

     Watkins Glen link is down, CH <-> TH link finally
went into total failure.
 
     Replaced radio card at CH.  Link came back.
 
02/21/04 Saturday 5:53pm EST

    Choice One modem lines are giving busy signals.
 
    Portmaster 0 rebooted, seemed to clear the problem.
 
01/30/04 Friday 12:30pm EST

     Modem banks at 277 1228 were upgraded, total non-functionality
last night.  Working fine now...
 
01/18/04 Sunday 3:47pm EST

     primary mail spool drive lost, replaced with new drive.

     backups from last night are damaged, so taking from Saturday night.

     users sent logs of mail received so they can ask for important
stuff to be resent.

     Need to reconsider raid.
 
01/17/04 Saturday 12:25pm EST

    outgoing e-mail and webmail interface down while changing UPS that
went out.

01/10/04 Saturday 1:28pm EST

    Net is damaged out in the middle of the US somewhere, google, netscape,
yahoo, have only intermittent access.
 
01/07/04 Wednesday 4:23pm EST

    wvbr1 locked up, telebooted.
 
12/31/03 Wednesday 12:14pm EST

    cmc radio link locked up, rebooted fvradio.
 
12/24/03 Wednesday 11:50am EST

    rp3 link feeding DSL modems at Roy Parks locked up.  Router tweaked,
and rebooted.  If it does it again, we will replace it.
 
12/13/03 Saturday 6:00pm EST

     fvradio -> CMC1 and all related backbones are down while we fix
the cabling at fairview.
 
12/06/03 Saturday 1:58pm EST

     SSID on fvair changed to fvair from lightlink
     SSID on wvbrair changed to wvbrair from lightlink

     Users who have not registered their MAC address with us may be
blocked from using these two radios.
 
11/28/03 Friday 8:10pm EST

    Adore crashed panic on cpu 0 swtch
 
11/25/03 Tuesday 10:48am EST

    Light locked up at around 3 am causing failures on web and dialup.

    Beeper did not go off, apparently Arch Wireless is down also.

    Serious.
 
11/23/03 Sunday 3:19pm EST

    Major outage at a level3.net router in New York City is
making connectivity between RR customers and lightlink flakey at
best.  Been bad since Friday night.  No ETA
 
11/10/03 Monday 10:23am EST

    POP mail server crashed this morning.

    VFS No free inodes ask Linus
 
09/16/03 Tuesday 10:03am EST

    Adore locked up for unknown reasons.  Rebooted
 
09/04/03 Thursday 2:18pm EST

    rp2 router feeding paradyne and pairgain DSL out of Roy Parks
was rebooted a few times before we found a routing error causing
network confusion.

    We also want to get rid of the private IPs being given out
to people because infected machines fill up the masquerading
tables, making it impossible for anyone on a private IP to get out.

     I goofed that change up but good, so things are back where they
were for the moment.
 
08/28/03 Thursday 10:52pm EST

     Upstream down for about an hour.
 
08/23/03 Saturday 2:48pm EST

     mail server taken down at 2pm per scheduled down to replace
failing hard drive.  Drive was not replaced, but root partition
was moved to a different place on the drive.
 
     Went well.
 
08/15/03 Friday 11:41am EST

     Radius authentication failed for dialup modems this morning
because the authentication server (light) ran out of swap space
because the web server was not being hit upon probably due to outages
across the state.
 
     We were also not using our full battery of swap space, have added
2gig, so we should be fine :)
 
07/17/03 Thursday 1:07pm EST

    adore crashed on panic on cpu 0: swtch
 
07/14/03 Monday 10:47am EST

     adore crashed from panic on cpu 0: swtch

06/17/03 Tuesday 10:18am EST
 
    Mail server 'mx' locked up for unknown reasons.  Rebooted.

     This is a rare event, hope this is not a premonition of things to
come.
 
05/30/03 Friday 3:24pm EST

    Apparently gem was failing to do web logs properly starting
from May 19th, which may or may not have had anything to do with its
failure on the 24th.
 
05/24/03 Saturday 11:27pm EST

    Gem lost its primary hard drive tonight while replacing
a CDROM drive.  Turned machine off, put in DVD drive, turned it on,
and drive was totally dead.

    Web e-mail and USER AREA were off line.  Some web hits
will be permanently lost for Saturday.

05/06/03 Tuesday 2:20pm EST

    Mail crashed from errors on /dev/sdc.  /dev/sdd wouldn't
boot when rebooted.  Cold booted everything and reseated all the
drives, and it came back fine with minimal damage.

    I hate this job.

04/30/03 Wednesday 9:37pm EST

     Alpha test modems will be taken off line for a short time every
day at 12 noon, bouncing those on them.  These are the numbers starting
with 216 0008 for Ithaca and 387 7110 for Elmira.  Those using our normal
modems will not be affected.
 
04/20/03 Sunday 10:14pm EST

    adore crashed this morning at 8am from panic on cpu 2 swtch.  Seems
to be doing this about once a month.

     rebooted the new modem banks tonight bouncing everyone, testing
the second set of modems in the bank.
 
04/19/03 Saturday 12:22pm EST

     Massive DDOS attack started 1pm Thursday afternoon taking
our webserver and primary DNS off line.  Intermittent attacks
followed until late Friday afternoon.

     IP numbers of non IP domains have been moved to get out from
under attack.  Some pages may not work properly again until client
machines are rebooted.
 
04/10/03 Thursday 5:40pm EST

    isdn2 modem bank in ithaca locked up, rebooted.
 
03/20/03 Thursday 2:47pm EST

    adore crashed from swtch condition.
 
03/17/03 Monday 7:58pm EST

    isdn4 bad modem line 1/2
 
03/16/03 Sunday 7:08pm EST

    isdn2 modem bank rebooted bouncing 16 users.

    permanent bad modem at isdn2-8
 
03/14/03 Friday 12:21pm EST

    Bad ISDN lines on modem bank 2 caused dropped connections.

03/08/03 Saturday 1:56pm EST

    pop server taken off line at 12 pm to upgrade motherboard to 900MHz.

    Went smoothly.

03/01/03 Saturday 9:00pm EST

     Lost the power strip that powers the core router.   Replaced

     Lost connectivity for about 5 minutes.
 
02/21/03 Friday 9:30pm EST

     Power outage around 5pm took down entire system.  All systems
running again by 9pm.
 
02/18/03 Tuesday 08:42am EST

     Our firewall fw1 machine locked up this morning for unknown
reasons blocking traffic to all legacy services, mail, web, modem
authentication and news.
 
     Reboot fixed it.  Don't you just hate that?  Or maybe you
gotta love it because it wasn't anything worse.
 
     That's its first lockup ever in about half a year of service.
 
02/10/03 Monday 11:50pm EST

    Adore crashed, panic on cpuX swtch

02/03/03 Monday 11:54am EST

     Recent problems with our upstream resulted in periodic outages to
various parts of the net, google.com and others.  Fastnet claims to
have fixed the situation.  It has been recurring on and off for the
last week or so.
 
01/25/03 Saturday 11:01pm EST

     Fastnet, our upstream, is doing upgrades to their core routers to
protect against the latest SQL worm.  Outage should last about 30
minutes.
 
01/21/03 Tuesday 04:08am EST

    Lightlink was down starting at 12 midnight until 4am due
to a power failure.

01/04/03 Saturday 09:04am EST

    Power out at 5am or so.  Came back around 8am.  Gave it some time
to prove stable, everything back up around 9:30.

    Watkins Glen link down until Sunday due to power outage
at Conn Hill.
 
12/29/02 Sunday 3:39pm EST

     Gem, webmail and mailing list machine, out of control, sendmail's
taking 2 cpu seconds to verify.  Rebooted to clear out.
 
12/27/02 Friday 04:06am EST

    Light again crashed mbuf map full

     Did a netstat -na vmunix.6 vmcore.6 from /var/crash and found
multiple sendmail connects.  Have firewalled off the spammers.  This
may break something else, too tired to think straight at the moment.
I hate spammers.
 
12/27/02 Friday 02:49am EST

    Light crashed, mbuf map full

12/18/02 Wednesday 9:59pm EST

   Mail system died this morning from a spam attack.  7000 pieces of
bounced mail accumulated in the mail queue and the mail server ran
out of process space.

   We have taken measures to prevent this in the future.

     Also later in the afternoon the webmail interface web server died
for unknown reasons, so webmail was down until late afternoon.
 
12/12/02 Thursday 7:07pm EST

    Wireless link to Watkins Glen went down from ice at about 2pm.
Terry Hill Yagis were ice coated.
 
12/08/02 Sunday 12:14pm EST

    Modem 14 on 272 2284 not answering.
 
12/06/02 Friday 11:12pm EST

     Modem 8 on 272 2284 not answering.

11/26/02 Tuesday 11:39pm EST

   Another modem on the 5026 banks giving ring no answers.

11/26/02 Tuesday 1:16pm EST

     Modems on bank one of the V90 group locked up, had to reset the
modem bank a few times.  Something or someone is causing ping times to
isdn.lightlink.com to periodically get huge.
 
11/19/02 Tuesday 12:37pm EST

     Web server locked up from scsi errors at 11:30 or so.

     Lost /dev/sd9a on web server housing /usr/local and /w5 and /var

     Have temporarily replaced with mirror.

     Will rebuild the master from the tape backups.

11/09/02 Saturday 10:49am EST

   Outgoing mail is down cleaning up after a spam attack that
took out the mail server.

10/25/02 Friday 12:14pm EST

     All three mail servers were hosed this morning from a spam
attack.  User Area was off line, as was webmail and outgoing smtp
server.  All should be fine now.

     Really have to figure out some way to stop this particular kind
of attack, they are really deadly.

10/22/02 Tuesday 6:48pm EST

    UPS replaced, knocking off all Ithaca dialup users.
 
10/21/02 Monday 7:18pm EST

     UPS died on modem banks, everyone booted.

10/17/02 Thursday 11:08am EST

     News server died on own determinism, no idea why.

     Restarted.

10/16/02 Wednesday 3:46pm EST

    Jammed modem line today refusing to let people connect, got it
reset and it's working fine now.

10/06/02 Sunday 1:23pm EST

     Light web server taken down to add more disk space.
 
10/04/02 Friday 10:46am EST

    277 0356 rebooted to clear out lan failures.
 
09/29/02 Sunday 2:47pm EST

    Light, web server, and adore, shell server, taken off line
for 2 hours to upgrade tables and power that hold them.

    Dialup was unavailable during this period due to no DNS nor
radius authentication.
 
09/21/02 Saturday 2:10pm EST

    Majesty, light and adore taken down for 10 minutes to upgrade
resolver libc's.
 
09/16/02 Monday 7:15pm EST

     Mail taken down to increase mail and spam partitions to 4.5 gigs
each.
 
09/14/02 Saturday 9:03pm EST

     Incoming pop mail and news offline at 12pm to move computer tables.

     Outgoing smtp server offline this evening to replace dying
ethernet card.......
 
09/13/02 Friday 12:50am EST

    Adore shell server crashed Panic on CPU 2: swtch

    Becoming more often...

09/04/02 Wednesday 8:13pm EST

    Mail server mx taken down to increase memory to 512Megs.

08/31/02 Saturday 5:07pm EST

    Adore, the shell machine, crashed on panic on cpu2, swtch.

    Seems to enjoy doing this every couple of weeks.

08/26/02 Monday 11:47am EST

     There was a failure this morning in the program that
authenticates remote users to our smtp server.  Some users were not
able to send mail, got 'Lightlink Relaying Denied' error messages.

     New and improved program has bugs...  :)
 
08/10/02 Saturday 6:59pm EST

     Ithaca dialup modem banks taken off line to change IP range
from 205.232.34.x to 216.7.30.x.  All static IP's changed also for
dialup users only.
 
08/04/02 Sunday 12:34pm EST

     Name service was messed up last night during upgrades.
Intermittent failures to get out to the net would have been caused by
this.
 
07/16/02 Tuesday 12:06pm EST

     web mail interface down over night due to a failed upgrade.
 
     mail itself not affected.

     During the periods when gem was actually down, logins to adore
were not available due to nfs mounts timing out.

06/23/02 Sunday 11:59am EST

     newsfeeds down for 12 hours due to running out of disk space

06/06/02 Thursday 1:42pm EST

    Adore rebooted on panic on cpu 0 swtch
 
05/25/02 Saturday 12:10pm EST

    Light taken down at 12:00pm to add second CPU.

    No problems.
 
05/18/02 Saturday 6:39pm EST

     Adore taken down at 6pm to install second CPU on its motherboard.

     So far so good...

05/13/02 Monday 09:20am EST

    Adore/shell crashed for unknown reasons, cpu panic swtch

    Seems to be doing it every month or so.

04/27/02 Saturday 6:21pm EST

     All Ithaca V90 modems were successfully rebooted using automated
script which will run at 3:30 every Sunday morning.
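
     A minimal sketch of how the next run of such a weekly job could be
computed, in Python.  The entry does not say how the script is
scheduled; a cron line such as "30 3 * * 0" would be one common way.
The function name next_sunday_0330 is hypothetical.

```python
from datetime import datetime, timedelta


def next_sunday_0330(now: datetime) -> datetime:
    """Return the next Sunday 03:30 strictly after `now` (hypothetical helper)."""
    # datetime.weekday(): Monday is 0, Sunday is 6
    days_ahead = (6 - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=3, minute=30, second=0, microsecond=0
    )
    # If it is already Sunday and 03:30 has passed, move to next week
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```

     For this entry's timestamp (Saturday 04/27/02 6:21pm) the function
returns Sunday 04/28/02 at 3:30am.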
 
04/23/02 Tuesday 11:10pm EST

    isdn rebooted due to instability.  Users were being disconnected
on lines 1 and 3 repeatedly.
 
04/13/02 Saturday 3:23pm EST

     12pm started scheduled down.

     Light taken down, cpu upgraded to 150 MHz, second ethercard
added, power supply vacuumed.

     Adore taken down, added 4 64mb memory sticks, power supply
vacuumed.

     Light's UPS had its battery replaced.

     Modems taken off line to replace their UPS battery too.

     Wheels placed on table holding light and adore.

     Total down 3.5 hours.

04/06/02 Saturday 1:18pm EST

     smtp outgoing mail server off line for 10 minutes emergency upgrades.

     It's back now.
 
03/23/02 Saturday 11:29am EST

    web hits are off line for a while, while we practice an upgrade
to light on majesty.

03/19/02 Tuesday 3:13pm EST

     Elmira was down for a few minutes due to an EMI frame outage.
 
02/03/02 Sunday 7:47pm EST

     Mail server taken down at 1pm to upgrade memory.

     Didn't go well, machine turned to glue with 1024Megs, had
to remove memory and go back to old configuration.

     After much playing around mail was back up by 3:30pm.

     Gonna have to do this again next weekend.

01/28/02 Monday 4:16pm EST

    Elmira modems offline again from about 12 noon until 4pm due
to a mistake made by Verizon while doing upgrades on the circuits.
Virtual circuit DLCI's were lost :)
 
01/26/02 Saturday 3:27pm EST

     Elmira modems were off line from about 2am until morning due
to repeated power outages.

     Public radio static IP addresses should be working properly now,
subnet range was 'shared' by another machine offering bogus mac
address for default route 205.232.177.65
 
01/24/02 Thursday 7:07pm EST

    Harmony rebooted itself at 4pm for unknown reasons.

    This affected dialup users on 277 5026 only.

    In May of 2001 the config file was corrupted or copied over
and was no good, so users were able to dial up but not go anywhere
beyond local IP numbers.

    We rebooted again at 6:10pm and again at 7:10pm

    All should be well now.

12/09/01 Sunday 5:38pm EST

    Light rebooted last night to get rid of /usr/local/main
    Adore rebooted tonight for same reason.
  
12/04/01 Tuesday 2:09pm EST

     Last night ftp daemon was upgraded to 2.6.2 due to a root exploit,
unfortunately the new version did not work right and prevented the
uploading of pages.  This has been fixed.
 
10/29/01 Monday 12:24am EST

    Adore crashed, panic on cpu 0 swtch.

    Been up long time!
 
09/19/01 Wednesday 6:37pm EST

    Major web outages caused by Nimda worm attack on our webserver.

    Things are mostly under control at this time.

    Outgoing mail server rebooted to replace UPS.
 
08/02/01 Thursday 08:38am EST

     Mail server mx and pop locked up at 8:38 from process table full,
not sure why.  Rebooted.   Could have been a lock file problem with
the spam trap.
 
07/26/01 Thursday 12:45pm EST

    Adore, shell machine, locked up at 11:58am from memory chip
going bad.  Removed, not yet replaced.

07/19/01 Thursday 2:54pm EST

     DDOS against iron and seal from 213.148.3.210 using known
problems with web interface.  Turned off interface and blocked IP
range at router.
 
07/11/01 Wednesday 1:49pm EST

    Adore lost its root drive swap partition to errors.  In process
of installing new drive.

06/30/01 Saturday 6:53pm EST

    Mail taken down to increase disk space.  Gives us some time
to decide what to do about abandoned but growing mailboxes.

    Gem rebuilt with new mother board.

06/30/01 Saturday 05:03am EST

    Gem locked up again, multiple ethernet cards did not work,
probably pci bus gone. Will replace motherboard today.

    smtp server moved over from gem to emerald.
 
06/29/01 Friday 09:39am EST

     Gem (outgoing smtp server) locked up ether port.  Reseated
card in different slot, if continues will replace card.
 
06/24/01 Sunday 1:34pm EST

     Rain storms over the past few days have taken their toll on the
Fairview to CMC link.  It is presently down for repairs.
 
06/09/01 Saturday 09:16am EST

     Shell machine adore locked up due to failure of spamtrap to clear
lock.  Basically I think bad locking finally caught up with us, ran
out of processes.  Cleared them out and cleared the spamtrap lock and
all was well.
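
     One way to keep a lock like this from being left behind is to
release it on every exit path and to treat an old lock as stale.  A
minimal Python sketch under those assumptions, not the actual spamtrap
code; the name spool_lock, the lock path, and the 300 second staleness
limit are all hypothetical.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def spool_lock(path, stale_after=300.0):
    """Hold an exclusive lock file at `path`; always remove it on exit."""
    # Clear a lock left behind by a dead process (assumed stale after
    # `stale_after` seconds). This check-then-remove is not atomic.
    try:
        if time.time() - os.path.getmtime(path) > stale_after:
            os.remove(path)
    except FileNotFoundError:
        pass
    # O_EXCL makes creation fail with FileExistsError if the lock is held
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        # Runs even if the body raises, so the lock is not left behind
        os.close(fd)
        os.remove(path)
```

     A caller that cannot get the lock sees FileExistsError at once and
can give up, rather than piling up as a waiting process.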

05/25/01 Friday 8:03pm EST

     All web pages restored.  End of Event
 
05/25/01 Friday 10:50am EST

    Main web drive locked up on light, needs to be replaced.

    Presently running on mirror drive, do not upload web pages.
 
05/12/01 Saturday 1:31pm EST

     All majordomo mailing lists hosted by lightlink have been
down since about a day ago.  An upgrade to the sendmail.cf on gem
broke the majordomo.aliases.  Mail sent to the mailing lists
would have been returned with User Unknown.

     My apologies.
 
05/07/01 Monday 2:46pm EST

    two modems on isdn2 were giving busy signals, reset seems
to have cleared them out.

05/04/01 Friday 8:58pm EST

     Power strip blew out tonight from old age apparently, knocking
all modems off line, along with gem, emerald and majesty.  emerald is
the smtp server, so outgoing mail was interrupted.  Things should be
working again.
 
05/01/01 Tuesday 11:35am EST

     Between Friday and Monday there was a significant outage on the
net between lightlink and earthlink.  This was caused by a
malfunctioning router in our upstream's network.  Due to this network
outage, much mail may have been delayed getting to Earthlink, and some
mail from earthlink may have been returned to sender as undeliverable.
 
03/31/01 Saturday 6:48pm EST

     I crashed the outgoing smtp server accidentally by
hitting the wrong button on the UPS, trying to test it, I
turned it off.  Sigh.

03/18/01 Sunday 6:09pm EST

    Outgoing smtp server locked up its ethernet port for about 20 minutes
this afternoon for unknown reasons.

    Majesty failed to properly beep me when smtp went down due to incorrect
settings.  This has been fixed terminatedly.

03/07/01 Wednesday 7:52pm EST

    mail (mx) was taken down at 6pm for scheduled upgrades.

    Back up at 7pm.

    Scsi wiring was cleaned up on /dev/sda.

    spam filter was installed, with no filtering.
 
03/01/01 Thursday 3:56pm EST

     gem was rebuilt.  web mail was down for a short while.

02/27/01 Tuesday 8:15pm EST

     Adore load 120.0, large amounts of mail waiting in queue
for 5 days was suddenly dumped on me as postmaster as undeliverable.
Procmail running 5000 lines of spamtrap drove load high.

02/23/01 Friday 3:12pm EST

    Got mail bombed again by same guy from sympatico.

     This time we put in the DUL data base, and it is keeping it out.
Still lots of sendmail's opening up.  Not ideal.
 
02/20/01 Tuesday 7:29pm EST

     Two large mail bombs came in this afternoon bringing all
mail servers to their knees.  About 3:00 to 5:00pm.
 
     No mail lost, just some temper.

     While putting in filters for the spam I managed to block
out the smtp server from everyone for about 10 minutes at 18:14
 
02/11/01 Sunday 4:27pm EST

    Mail was down for 20 minutes to make emergency changes to disk
arrangements.  /var/log given own partition so it doesn't 'wear out'
the root drive.

02/05/01 Monday 10:03am EST

     Well after 340 days of uptime, adore finally locked up.  Had to
reboot.
 
10/20/00 Friday 6:52pm EST

    Light rebooted.  Error in rc.local required multiple reboots.

10/06/00 Friday 11:28am EST

    ftp www.lightlink.com locked up, for unknown reasons

    killed off inetd and restarted.  now it's fine.

    That's a first.

10/05/00 Thursday 9:22pm EST

    DSL modems back on line at about 10:30 am this morning.  Another
customer's traffic through same switch was triggering a bug in the
DSLAM firmware.  Moved customer to another switch and DSLAM
started to behave again.  Will upgrade firmware shortly.

    Sheesh, only 20 hours of down.
 
10/04/00 Wednesday 7:06pm EST

     At 12:39pm this afternoon our Paradyne DSL modem rack went into
continuous reboot mode.  We have spent 7 hours so far trying to debug
it to no avail.  Problem is even new chassis, cards and control cards
are doing the same thing!

     This is probably going to be a long down.

09/29/00 Friday 10:34am EST

     light load went high from SSL attack.

09/23/00 Saturday 8:43pm EST

     6:00pm Scheduled down, ftp server rebuilt with new disk drives
raising space from 2 gig to 8 gig and adding a 40 gig backup mirror
drive.

09/20/00 Wednesday 11:29am EST

     web server locked up from onslaught of ssl requests from rogue
site, blocked at border router.
 
09/12/00 Tuesday 11:41pm EST

     Lightning took out the Roy Parks network for 3 hours tonight
ending around 10:30pm.  Power came and went many times before
stabilizing.  The Roy Parks router is running at 96 percent cpu, and
showing signs of smoking.  Will replace the student backbone shortly
with dedicated line back to Fairview.
 
08/23/00 Wednesday 11:19am EST

    Out of control hits on the secure server locked up light with
load 131.  Modem dialups were blocked during this time.
Gonna have to do something about this, but not sure what.
 
08/17/00 Thursday 11:33am EST

    Romance had a hard drive partition crash last night, /dev/hda11.

     When rebooted, it didn't mount majesty's web data logs, so they
have been unavailable.  In trying to get them to mount, majesty locked
up and had to be rebooted.  Majesty had been up for 300 days.
 
     Machines get old and rickety if left up too long.

     Sort of like me.

08/15/00 Tuesday 10:23pm EST

     Light rebooted at 10:15pm, been up for 137 days, but was beginning
to show signs of irreparable corruption in core.  tty's that could
not be erased, and virtual domains that went to our home page rather
than their own, twice in as many mornings.

     Also some confusion on the UMG web page, we will see if all this
clears up.
 
07/24/00 Monday 4:37pm EST

     5026 modem bank:
 
     modems 67-69 reburned with firmware and reseated in card cage.
They were rejecting connections for at least 2 months.

07/18/00 Tuesday 1:15pm EST

    Load on light went to 110 this morning at around 11am from
a flood of secure server hits, probably an errant client.  Failures
of various warning systems, buddy in particular, caused a delay
in finding out about the condition.

    While load was high, dialup authentication and some DNS services
also failed.

07/09/00 Sunday 9:45pm EST

     news was down for most of today while we copied over
the entire news spool to the new machine.  It seems
to be working fine, little or no news should have been lost.

     aurora went from 200MHz 12 gig to 500MHz 120Gig.

07/07/00 Friday 10:13am EST

     news was down for about 2 hours yesterday afternoon preparing
for a new server.

06/25/00 Sunday 11:10pm EST

    ftpd upgraded to 2.6.0 plus patch to handle root exploit.
 
06/07/00 Wednesday 9:48pm EST

    A major network snafu on the morning of June 6th, caused
the net to slow down and become unreachable in many cases.

    Mail backlogs caused waves of incoming mail when the net
opened up again, causing all three of our mail servers to crash
from lack of process space.

    Some mail was lost.

06/02/00 Friday 10:16pm EST

    Power outages took out Pleasant Grove pop T1 at about 5pm, lasted
for about 1 hour.

    ftp server taken down for emergency repairs, disk upgrades
not completed.  Power supply replaced due to stuck fan.  Motherboard
needs to be upgraded to handle 40 gig drives.

04/05/00 Wednesday 7:05pm EST

     Pleasant Grove T1 suffered an event, becoming sticky and not
allowing pings to go through.  Went out there but found nothing wrong,
rebooted router, but it cleared itself up before I did that.
 
04/02/00 Sunday 6:30pm EST

    Light taken down to reinstall /dev/sd2, the log drive.

    Went without incident.

03/31/00 Friday 01:53am EST

    At about 1:30am, light lost its primary log drive /dev/sd2

    panic on cpu 0 brought the machine down.

    Replaced drive with existing empty partitions, will put in
new drive later.

    I do not love this job.

    It's making me old before my time.


03/30/00 Thursday 4:34pm EST

     277 5026 modem bank was rebooted a few times to stop corruption
caused by new monitoring program.

03/20/00 Monday 9:27pm EST

   news server hosed, lost news spool, had to rebuild from scratch.

   All extant news lost.

03/08/00 Wednesday 11:50am EST

     pop server mx rebooted to install linux 2.0.38

03/01/00 Wednesday 8:55pm EST

     At about 19:17 this evening, lightlink suffered a distributed
denial of service attack that saturated our bandwidth for about 30
minutes.  It was directed at a colo machine on our network.

02/29/00 Tuesday 9:18pm EST

    Earlier today at about 5pm we suffered an attack on our
secure server, address 192.248.252.133 was opening repeated
connections to port 443 on light.  Blocked at T3.  Load
on light went to 150, took about 30 minutes to find out what
was going on, penetrate the machine and block it.
 
     Tonight at about 9:15pm, harmony stopped authenticating properly,
people calling up with weird errors.  Rebooted harmony and the
erpcd's running on light.  No idea what happened.

02/28/00 Monday 11:48am EST

     mx, mail server, spawned multiple crond's this morning at about
10am.  This caused multiple buddy's to fire up, flooding our system
with buddy messages, and finally filling the process table on mx.

     Although there are no logs indicating mail bounced, it is
possible that some did.  Mail was also queued on the back up servers
and delivered later, but it is not clear that all mail was caught
properly.

     A monitor has been placed on crond on mx, if it goes over 5, I will
get beeped immediately.
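
     A minimal sketch of that kind of check in Python, not the actual
monitor.  It assumes process names come from `ps -e -o comm=` (one
command name per line, no header); the function names are hypothetical
and the limit of 5 is taken from the entry above.

```python
import subprocess

CROND_LIMIT = 5  # threshold from the log entry


def count_processes(ps_output: str, name: str) -> int:
    """Count lines of `ps -e -o comm=` output equal to `name`."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == name)


def over_limit(ps_output: str, name: str = "crond", limit: int = CROND_LIMIT) -> bool:
    """True when more than `limit` processes called `name` are running."""
    return count_processes(ps_output, name) > limit


def check_once() -> bool:
    """Run ps once and report whether crond is over the limit."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return over_limit(result.stdout)
```

     Whatever pages the operator would call check_once() on a timer and
send the beep when it returns True.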


01/23/00 Sunday 10:54am EST

     Due to an incorrect /etc/syslog.conf setting, news logs were
being dumped on wrong hard drive, resulting in failure to rotate them
and the hard drive filling up.  News has not been propagating for
maybe a day.  News was coming in, but not going out.
 
01/20/00 Thursday 6:06pm EST

    17:15

     pop server suffered more scsi errors this afternoon.  Took her
down to replace entire computer, case, power supply motherboard, cpu
and memory.  Running on pentium II now.

     No mail lost.

01/18/00 Tuesday 6:26pm EST

     17:10

     pop server crashed completely, main mail drive showing corrupted
FATS.  Attempts to save it were in vain.  Mail Spool saved to temp
directory first, should be little damage to most mailboxes.

     Swapped out entire bay of scsi drives with new one.

01/17/00 Monday 8:53pm EST

   pop server taken down to replace scsi cable 8:29pm

01/15/00 Saturday 1:16pm EST

     pop server taken down to swap out SCSI terminator

     Also taken down last night at about 5pm to

     Replace scsi jumpers
     Replace CPU fan
     Remove Tape drive
     Get rid of all real time mirrors
 
     Rebooted again at 6pm to install new hard drive partitions for
nightly backups.  We should be able to get about 24 days of full mail
spool backups with the present system.
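
     A minimal Python sketch of pruning nightly backups down to the
newest 24.  The naming scheme spool-YYYY-MM-DD and the function name
backups_to_prune are hypothetical; only the figure of 24 days comes
from the entry above.

```python
def backups_to_prune(names, keep=24):
    """Return the backup names to delete so only the newest `keep` remain.

    Assumes names end in an ISO date (e.g. spool-2000-01-15), so sorting
    the strings also sorts them oldest to newest.
    """
    ordered = sorted(names)
    if len(ordered) <= keep:
        return []
    return ordered[: len(ordered) - keep]
```

     With 30 nightly backups on disk this returns the 6 oldest names.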

01/03/00 Monday 6:22pm EST

    Light web server preemptively rebooted to avoid slow down.

    Every 30 days or so light becomes sticky, there is some
evidence this is caused by repeated ssh's into light.
 
01/03/00 Monday 2:39pm EST
 
    Mail server, mx, locked up again, this time we think
we caught the hard drive causing it, sdc1 running /var/spool/mqueue.

    Damage was minimal, all mail should be intact.

12/27/99 Monday 10:04am EST

     9:00am
 
     Harmony our primary terminal server lost its power supply,
replaced with backup.
 
12/24/99 Friday 12:21pm EST

    fvradio locked up, needed rebooting.

12/20/99 Monday 6:14pm EST

     Web mail interface demo version expired today, have installed
a new demo, and ordered the real thing.  Apparently *LOTS* of
people like it! :)
 
12/19/99 Sunday 6:30pm EST

     Mail taken down to replace scsi card, and create new disk
partitions on /dev/hda and /dev/hdc
 
     Popper also changed to server mode.  Really cuts down on the file
copying.

12/18/99 Saturday 2:49pm EST

     Mail suffered a severe hard drive crash early this morning and
had to be taken down at about 12:30pm to rebuild the mail spool
drives.

     Apparently one of the mirror drives went bad and the mirror
software did not disable it as might have been expected.  This caused
corruption in the file allocation table, which resulted in corrupted
mailboxes, and some bounced mail for some people.

     During the rebuild, some mailboxes were probably totally lost.

     We will go through the logs and inform those that had bounced
mail or lost mail boxes privately.

12/08/99 Wednesday 1:56pm EST

    External net looks down, can't ping rahul.net
 
12/08/99 Wednesday 07:20am EST

    Light load very high 100 or so.

    Rebooted, got trace back error during the core dump.

    Rebooted again.

    Hundreds of radius and httpsd's running.

    Httpsd seem to be coming from 208.150.209.130, filtered
at T3

    Rebooted all NetServers.  Radius chilled out.

    Only bad taste is the traceback we got during the first core
dump.  Could be coincidental, could be something more serious.  New
swap partition is in place if that matters.

12/05/99 Sunday 6:58pm EST

    Light and adore upgraded to Y2K patches and libc.

    Down time 6pm to 7pm

12/05/99 Sunday 2:07pm EST

    Testing backup dialup server in preparation for tonight's down
of light and adore, found that password files were not being updated,
so some people could not sign on.  That has been fixed.

    Elmira customers are having problems with a burnt out modem which
is being fixed shortly.

12/04/99 Saturday 6:10pm EST

     Mail was offline for 10 minutes tonight to upgrade its primary
spool drives from mirrored IDE's to mirrored fast wide SCSI
differentials.

12/04/99 Saturday 3:09pm EST

    Outgoing mail service interrupted for a few minutes, needed to reboot
the smtp server due to jammed processes.

12/02/99 Thursday 1:59pm EST

     Light rebooted.

     Getting out of memory errors, cpu %0, top sticking, other
anomalies.  halt locked up machine, had to cold boot.

     Adore was locked up while light was being rebooted.
 
11/25/99 Thursday 12:17pm EST

    Elmira modems were down for a number of hours this morning
due to a power cable cut in the neighborhood.

     Mail server, mx, was rebooted twice around 12:15 to install new
hard drives to hold mail spool.  It may be rebooted again through out
the day.

11/21/99 Sunday 7:07pm EST

     Starting about 5:30pm this afternoon news was down for a short
while as we moved the machines to a new room.  At about 6:30pm, mail
was down for the same reason.  No mail was lost.

     The changes put both news and mail on a 100 megabit full duplex
switched backbone, to help improve performance.

     News will be down again in a day or two when we move everything
over to scsi drives from IDE which really can't take the I/O.

11/20/99 Saturday 12:00pm EST

    mx ethernet locked up.  Did an eth0 down and up, and
it started up again.
 
    no idea why

11/19/99 Friday 9:48pm EST

     At approximately 17:28 this afternoon our network had an 'event',
data slowed down, people couldn't sign on, things came to a
standstill.

     Although it is still unclear what happened, at the same time we
were being massively overrun by spam coming into our system from one
of our own users, which caused our mail server to run out of process
room and crash.  Bringing this under control took about 2 hours,
during which time incoming mail may have been queued, and pop accounts
would have been slow to respond.

     Some mail may have been lost during the crash.

11/05/99 Friday 10:05pm EST

     secure shell keys changed across all machines.

     mx lost its trust to smtp (emerald) so pophash.db was not being
transferred, causing people to be unable to relay mail through
lightlink from remote places to remote places.  Started at about 17:00
until 22:00.

11/05/99 Friday 08:17am EST

    mx mail rebooted to reinstall 2.0.36 OS

    Addition of mqueue partition seems to have helped in the overload.
 
11/04/99 Thursday 7:38pm EST

    Light rebooted to clear out dying OS.

    top locking up rather than running smoothly, happens
every 30 days or so.
 
11/02/99 Tuesday 4:36pm EST

    New kernel on mail has made the situation worse.

    Have added a new partition for /var/spool/mqueue.
 
10/30/99 Saturday 12:12pm EST

     Mail was taken down to install new kernel, to see if we can
improve the performance issues.  However the new kernel refused to boot
properly, being unwilling to use one of the mirror drives (hdc) in DMA
mode.  Booted back to 2.0.36.
 
10/29/99 Friday 5:50pm EST

     Our T1 has been decommissioned.

     At 5pm the T1 was disconnected interrupting service to modems,
shell, web and incoming mail.
 
     A few minutes later connectivity was restored through our T3.

10/28/99 Thursday 09:49am EST

    Ethernet card on pop server mx locked up, machine unresponsive to
network requests.  Rebooted.  If it happens again, will replace the
ethernet card.

10/24/99 Sunday 2:22pm EST

     Romance lost its root hard drive.  This caused emerald (smtp
mail), gem (majordomo) and mx (pop mail) to lock up certain functions
and they needed to be rebooted.
 
     Nothing was lost, and romance's hard drive was replaced by its
mirror.
 
10/23/99 Saturday 1:40pm EST

   Majesty patched for Y2K and Libc.so/sa

   Rebooting may have caused lockups on light and adore.

 
10/15/99 Friday 8:14pm EST

    Adore locked up, out of swap space.  Don't know which
process.  Brought under control without rebooting.

10/10/99 Sunday 8:44pm EST

     V90 modems still giving busy signals every other call.

     Dial once, get a modem.  Dial again, it rings once, then turns
busy.
 
     Put in another trouble report.
 
10/07/99 Thursday 8:33pm EST

     V90 phone lines are returning random busy signals, looks like a
problem with Ma Bell.

     Just keep trying to get in.

     Trouble ticket opened.

     0141031
 
10/05/99 Tuesday 11:25pm EST

     mx, pop mail server rebooted.  High load, sticky controls,
multiple processes running, in particular crond spawning timely jobs,
but many of them.  Will have to watch for this.

     Also yesterday a dialup V90 NetServer 16-I was swapped out of the
Elmira pop due to failing modems.  This caused some problems when
customers were moved to another modem bank with incorrect settings.

     The hunt group remains down in Elmira for one of the v90 banks.

10/05/99 Tuesday 04:43am EST

     Denial of service attack from dannyboy.easynet.co.uk against port
443 httpsd caused load on light to go to 80, jamming it completely.  Took
me a while to figure it out, and get it blocked.

     Radius server was also jammed from USR.lightlink.com, creating
hundreds of radius daemons.

     People were not able to sign on from about 2am to 5am.

     Things should be stable now.
 
10/03/99 Sunday 2:19pm EST

    Adore crashed and rebooted itself for unknown reasons.

    panic cpu0 swtch
 
09/28/99 Tuesday 8:08pm EST

    Adore rebooted by me, for system upgrades.

    /usr/local split into /usr/local and /usr/local/main
    /dev/sd2g released
 
09/25/99 Saturday 2:23pm EST

     Never rains but pours.

     Romance lost its root drive last night, has been replaced with
the mirror drive.

     Majesty just lost its root swap partition.  Swap is presently on
the mirror, will probably move the mirror to root and toss the present
root drive.  /etc/mtab and /etc/dhcpd.leases were lost.
 
09/07/99 Tuesday 10:18pm EST

    Lost a modem card in the Multitech racks, caused no answers
on 3 modems and interrupted the hunt group.  Will busy out
and get card replaced.

     Nysernet had troubles installing a new OC3 fiber ring that hosed
some of their network over the past few days.  This has caused
intermittent connectivity problems to the net.
 
09/06/99 Monday 8:18pm EST

     A modem was giving busy signals earlier this afternoon on isdn2.
I rebooted the bank bouncing people who were on isdn2.

     The net seems to have a major outage going on, some sites are not
accessible, others are very slow.

     Some other ISP's are not affected, so it may be a problem with
Sprint.

09/01/99 Wednesday 2:45pm EST

     ftp daemon was upgraded to 2.5.0 today on adore, and promptly
broke uploads.  Reverted to old version until fixed.
 
08/21/99 Saturday 6:23pm EST

    Scheduled down time at 6pm.

    Adore upgraded to Sparc 20, 256Meg.

08/20/99 Friday 6:19pm EST

     Adore locked up around 10am this morning from uninterruptible
wait states on the mail partition.  I had to crash it and reboot.
During the process the password file was corrupted and some could not
sign on.

     Later in the afternoon, I hot swapped /dev/sd1a back into adore
so we could get a good root mirror going for tomorrow.  This also
crashed adore again because the second root drive steals the SCSI
target from the first root drive and bye bye system.
 
08/18/99 Wednesday 6:27pm EST

    Adore crashed at 17:43 from swtch panic.

    Its Sparc 5 will be replaced by a 20 on Saturday.

07/31/99 Saturday 09:24am EST

     Adore crashed at 3:33am from asynch memory faults, pretty much
proving it's not a memory problem as all chips are new.  So it's time to
replace the mother board.  At 9am this morning I pulled sd1a root
mirror drive in order to bring up a new Sparc 20 which will replace
adore for a while, and forgot that swap was on the drive, so adore
crashed again.
 
07/18/99 Sunday 12:55pm EST

     On Saturday night at about 5pm Fairview Square was hit directly
by lightning on the power pole that leads into the main power plant of
Fairview.  The fuses were burnt out on the power pole, and the main
circuit breaker inside of the Fairview Complex was blown to
smithereens.  Power was restored by about 10pm at which point we
brought up all systems only to find that the T3 router would not boot.
It was replaced by a backup router which took about 1.5 hours and
everything was up and running by about 11:30pm.  The router seems to
have a flakey flashrom card, perhaps a result of the strike, but also
perhaps simply a mechanical problem.
 
07/06/99 Tuesday 11:43pm EST

     Adore went load high at about 11:15pm.  I tried to play with it
extensively to see what might be causing it, no luck.  pff -a locks
up.  ps didn't show procmail this time as the first in line but login.
 
07/05/99 Monday 7:58pm EST

    Adore went load high again.  Changed procmail to latest
version 3.13.1
 
07/05/99 Monday 6:46pm EST

    Adore went load high with locked mail partition.  Took
a core dump and rebooted, will send to Sun.

07/04/99 Sunday 6:29pm EST

    6:00pm Tried to switch the bad UPS with the new one without
bringing the system down, needless to say it didn't work.  Adore
knocked off line, along with many modems.

07/01/99 Thursday 12:54pm EST

    Adore went load high at about 12:30pm this afternoon from
a problem we thought was only with the popper.  But the popper
was not running.

    This time we got a core dump, and will be sending it to
Sun for analysis.

06/26/99 Saturday 10:32am EST

    Power outage at 10:15am or so.  UPS decided to go bad at
just that moment, adore, and modems taken off line.

06/24/99 Thursday 11:30pm EST

     Shell users found fetchmail broken this morning after
an upgrade to imapd on mx last night around 12 midnight.

     In general people should not be using fetchmail, except once
after they first enable their shell mail with enableshellmail.
 
     MX has been stable, and adore has been stable with the popper
off.
 
06/21/99 Monday 4:05pm EST

    About 2pm we started a test version of the popper on adore,
one that ran as a daemon rather than through inetd.  About
two hours later adore's load started to climb.

    Every drive partition was listable except /var/spool/mail.

    This has happened on two different drives in two different
partitions, in two different drive trays in two different drive
bays.  This indicates it is not a drive problem.  Also I was
able to write to another partition on sd9 which holds
/var/spool/mail, so the drive was working fine.

    Since this follows the popper it seems to be a kernel/popper
problem.  One has to ask why this never occurred on light.

     MX

     Earlier today mx started to refuse popper connections because the
inetd daemon that listens on port 110 was hard coded to turn off if
too many connections came in at once.  Max Parke hacked the code to
get rid of that check, and things have been fine ever since.

06/20/99 Sunday 8:12pm EST

     Adore locked up from disk drive failure.  Always seems like it is
the new mail drive.  This time in sd9e.

     10:00pm

     Adore locked up again, this time no sign of what was wrong,
totally jammed, no cursor no nothing.  When I went to reboot, the
monitor wouldn't come on.

     I took out 4 pieces of the new memory and it would boot, I put
number 5 back in, and it gave asynch memory fault.

     So we are running on 4 for the moment.
 
06/20/99 Sunday 6:21pm EST

     Adore taken down for emergency service.

     All memory replaced and upgraded from 145M to 256Meg with
Sun bar code memory.

06/20/99 Sunday 12:35pm EST

     Saturday 6/19/1999 mail was moved from adore to mx starting
at 2pm.  Most systems were online again at 4pm, but some remote
users were not able to send through our smtp server until about 9pm
due to failure to copy the pophash.db file over to smtp.lightlink.com.

     Adore locked up at about 1am 6/20/99, and didn't notify
us of the failure until 4am when it was rebooted.  Crash was
asynch memory fault again.  Memory will be replaced tonight.

06/15/99 Tuesday 3:40pm EST

     Major net outage for most of 6/14/99

     Adore locked up at 8:45am this morning, then freed itself, popper
unavailable.

06/13/99 Sunday 9:03pm EST

     Adore locked up again at about 8:30pm.  Load 128.0

     Seems to be the mail drive getting stuck in a wait state and
everything piles up on top of it.  Everything else was running fine, but
anything accessing that drive failed, like ls or df etc.

     Mostly poppers were building up.  Couldn't kill them off with
killer.

     So...

     sendmail killed, inetd killed, everyone bounced.

     Mail partition moved from sd8a to sd9e and popdrop from sd8g to
sd9g.

    Sendmail restarted, inetd restarted, and logins allowed.

    Hopefully that will be the end of this.

06/11/99 Friday 4:15pm EST

     A mirror drive on adore locked up this morning at about 3am,
causing the load on adore to go to 630.0, preventing people
from getting their mail and jamming out shell users.

     This also locked up light, which shares adore's home directories,
which then prevented others from signing on.  This is in part why
light and adore must be separated.

     The drive is undergoing testing, it was the mail mirror drive,
and locked up during the nightly copy of the mail directories to the
mirror.

     A few users had their pop mail stuck in a half way state when
adore came back up, and could not get their mail.  These have all been
cleaned up.

06/05/99 Saturday 9:51pm EST

    Scheduled down at 6pm.

    Mail was moved from light to adore with little problem.

    Both MX and popper functions were moved.

    Spool directories were also moved.  Light no longer supports
mail.

06/04/99 Friday 6:31pm EST

    Scheduled down at 6pm.

    /var was moved from /dev/sd8a to /dev/sd9f
    /var/spool/mail was moved from /dev/sd8a to itself as an outer directory
 
    Everything went flawlessly.

05/23/99 Sunday 10:05pm EST

     Scheduled down at 6:00pm, news2, ftp, majordomo were off line
until about 7:30pm to put rollers on the tables that hold them.  Then
troubles with a hub resulted in news not drawing new news, so hub was
replaced.

04/29/99 Thursday 12:18pm EST

     SMTP server (emerald) went out of control at about 11:15am, no
known reason at the moment.  Load was at 15, process table full,
sendmail requests were being denied.

04/21/99 Wednesday 10:42pm EST

    Adore crashed hard.  New memory is on order.
 
04/11/99 Sunday 11:18am EST

    Adore rebooted itself this morning at 9:36am.

04/05/99 Monday 9:40pm EST

    Modem 3 causing ring no answers all day long.

    Also modem 54.

    Sigh

04/01/99 Thursday 12:14am EST

     admiral (news) was down since 22:00 3/30/99 due to a smoked root
drive.  pingers did not catch it because admiral had been removed from
the ping list for reasons lost to antiquity.

     No one reported that news was down probably because it did not
affect our main news reading machine, however no new news was coming
in for over 24 hours.

     Main drive has been replaced by the mirror, and more stringent
monitoring programs will be written to make sure this and other
machines do not go down without my finding out about it.

03/28/99 Sunday 10:37pm EST

     ftp was taken down to fix a recalcitrant ethernet card on the
private backbone.

03/28/99 Saturday 9:23am EST

     Remote relaying through our smtp server was intermittently broken
again for short whiles.  This was caused by two separate machines
trying to update the authorized IP database, thus wiping out each
other's work.
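
     A rough sketch of how two updaters can avoid wiping out each
other's work: take an exclusive lock, re-read the file, then write.
This is not the code we run.  The plain-text one-address-per-line
file and the name add_authorized_ip are made up for illustration, and
flock only protects updaters on the same host.  Across machines the
safer fix is to let only one machine write the database.

```python
import fcntl
import os
import tempfile


def add_authorized_ip(db_path, ip):
    """Add ip to a one-address-per-line file while holding an exclusive lock.

    Hypothetical helper: the real database format is not described here.
    """
    lock_path = db_path + ".lock"
    with open(lock_path, "w") as lock_file:
        # Block until no other updater on this host holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Re-read inside the lock so another updater's entries are kept.
            existing = []
            if os.path.exists(db_path):
                with open(db_path) as f:
                    existing = [line.strip() for line in f if line.strip()]
            if ip not in existing:
                existing.append(ip)
            # Write a new file and rename it into place so readers never
            # see a half-written database.
            fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(db_path) or ".")
            with os.fdopen(fd, "w") as tmp:
                tmp.write("\n".join(existing) + "\n")
            os.replace(tmp_path, db_path)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

     The read happens inside the lock, so an updater that was waiting
always starts from the other updater's finished result.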

03/26/99 Friday 11:00pm EST

     FTP and MAILING list machine were taken down for upgrades.

     CPU fan was replaced on emerald (ftp) and power supply was replaced
on gem (mailing list).

03/25/99 Thursday 1:33pm EST

    In preparation for upgrades to gem and emerald, smtp.lightlink.com
was moved to mx.  The anti spam database that updates legal IP
numbers for remote users to use our smtp server was not updating
properly on mx, so from about 8am until now, remote users have
not been able to send e-mail.

03/23/99 Tuesday 11:00pm EST

    Adore rebooted.  Telnet was not working, OS beginning to die?

    Light was still jammed, so rebooted also.


03/08/99 Monday 4:57pm EST

     Nysernet lost the routes to the T3, so although the
routers were up, our T3 customers were unable to get anywhere.

     207.127.235.0/24
     207.127.234.0/24
     207.127.233.64/26
 
03/08/99 Monday 10:02am EST

   External net was down from 4:55am til 10am.  Nysernet
replaced a major router downtown.

03/06/99 Saturday 7:28pm EST

     Momentary bug in new account creation programs caused /etc/passwd
permissions to be not world readable causing the id command on adore
to fail, causing shell users errors in the prompt script.

03/03/99 Wednesday 6:17pm EST

    adore crashed from asynch memory errors.  Probably needs
to be replaced.

02/28/99 Sunday 3:20pm EST

    web server was locked up, possibly due to not being restarted
properly.

02/28/99 Sunday 12:28pm EST

     Adore suffered a major down at 11:45am this morning causing light
to lock up.  It was a partial crash which did not set off the alarms,
so we didn't find out about it until 12 noon.  Normally adore reboots
itself after such events, but this time it just crashed and stayed
crashed.  Had to cold reboot both adore and light causing minor disk
damage, which took about 20 minutes to rebuild.

     It's just this kind of thing that demands that we separate adore
and light and all the machines from each other.

02/19/99 Friday 02:23am EST

     Web logs were lost for Feb 16 and 17 due to a program bug.

02/17/99 Wednesday 4:07pm EST

     Light locked up for unknown reasons, millions of popper and
sendmail processes all waiting to run, couldn't do top or anything.

     Rebooted.

02/16/99 Tuesday 11:00pm EST

    Single modem giving ring no answer on 5026.  It's been cleared.

02/15/99 Monday 11:40pm EST

     Light rebooted to rearrange ftp.  Mysterious failures in ftp have
occurred since the newest version was installed.  The newest version
did not work at all, and then the old version stopped doing lists
properly.

     We have light working properly now, although we still have no
idea why the original software started to fail as it is still failing.

     Been running dynamically loaded ls's forever, now they don't work
and have to use a static ls.  Don't ask me.

02/02/99 Tuesday 7:28pm EST

    Light rebooted, getting sluggish perhaps from long term
memory leaks.  71 days is too long to go without rebooting.
 
    Apparently radius authentication server did not start
properly causing people to not be able to sign on for
a while after the boot, until 8:01pm to be exact.

01/31/99 Sunday  6:00pm EST
 
    Scheduled down.  Adore rebooted with new tape drive in place.

    Added 3 9.1 gig drives to adore and 1 to light.
 
01/29/99 Friday 5:26pm EST

     Bad routing at AOL has caused ICQ and Instant Messenger to
fail across the Nysernet backbone.

01/23/99 Saturday 8:29pm EST

    ISDN2 modem bank rebooted to clear out bad modem, everyone
on that bank was bounced.

12/14/98 Monday 12:03pm EST

    Adore crashed this morning at 9:42am for unknown reasons.

12/13/98 Sunday 5:39pm EST

    We had momentary but major network outages today due to a failing
ethernet link and a hub.  Changes to the network settings made things
worse, and it took about 2 hours to undo the damage that we did
trying to fix the original problem.

    Momentary outages on adore and light would have been experienced.

12/06/98 Sunday 12:23am EST

     All modem banks have been changed to allow ascii users to
sign on, they are rlogined to adore at the password prompt.

     This means all ppp users *MUST* use non scripted PAP.

11/17/98 Tuesday 6:45pm EST

    We were spammed last night, and I set the mail limit
to 12 connections at a time, which was too low.  Today a number
of people were not able to send mail because of this, it has
been raised to 24.
 
11/15/98 Sunday 4:15pm EST

     gopher/help system on adore was left down after last night's
system down time.  It is working now.
 
11/14/98 Saturday 6:12pm EST

     Scheduled down lasted 6:00pm - 6:15pm.

     Light upgraded to kernel jesw which supports vifc.

     Adore also down during upgrade, no changes made.

11/13/98 Friday 11:07pm EST

    Web hit log files are being moved to majesty, there will
be some interruption in availability but we are trying to not
lose hits.  They should be functional again in a day or two.
 
11/08/98 Sunday 8:15pm EST

     We got isdn and isdn2 flashed for V90, took much longer than
anticipated.  All X2's are now X2/V90 capable.
 
10/27/98 Tuesday 3:32pm EST

     16 new modems in, hunt group seems ok, 2 modems were not set
properly, gave busy signals, all should be working properly now.
 
10/05/98 Monday 7:34pm EST

     Jammed modem on isdn3 cleared.  Netserver rebooted, bouncing
everyone on isdn3.

09/30/98 Wednesday 12:17pm EST

    Nysernet is having significant problems at various routing
points to the main backbone.  They are aware of the problem, and
are working on it.  Network slows have been happening for a few days,
and may continue to happen for a while.

09/29/98 Tuesday 1:04pm EST

     There appears to be a net outage of some kind, stopping
communications from going out or coming in.

09/20/98 Sunday 9:17pm EST

     System taken down for scheduled maintenance at 6pm until 7pm.

     Light's root drive was replaced by its mirror and a new mirror put
in place.

     Adore's mirror drive was replaced by a new mirror as big as the
root drive.

     Tape drive would not work with adore, tried multiple different
tape drives and cables and terminators.  It seems to be a problem with
the one drive that is on the scsi bus, it seems to screw up the
termination when the tape drive is on there.  Put the tape drive
before and after the disk drive, it made no difference.

     Adore is still without its own tape drive.

     One UPS powering top Multitech modem bank is definitely gone,
battery will be replaced next time around.

09/20/98 Sunday 1:15pm EST

     We are getting fatal disk errors on light's root drive.

     This is a very bad sign and may involve replacing the root
drive with its mirror.

09/19/98 Saturday 08:02am EST

     At 3:07am light started to get fatal errors on the swap partition
of its root drive.

     This apparently caused the failure of named and primary name
service.

     At the same time majesty seems to be unpingable from our Elmira
routers, so secondary name service for those people failed also,
preventing Elmira customers from getting on line.
 
     For some reason lost to antiquity the code that monitors named
was set to NOT try and restart it automatically, I have turned this
back on.
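
     A minimal sketch of the kind of check-and-restart loop meant
here, with the auto restart switch that had been left off.  It is not
the actual monitor code.  The names supervise, is_alive and restart
are made up, and the caller supplies whatever test and restart
commands fit the daemon being watched.

```python
import time


def supervise(is_alive, restart, checks, interval=0.0, auto_restart=True):
    """Poll is_alive() `checks` times and restart the service when it is down.

    is_alive and restart are caller-supplied callables (hypothetical names).
    Returns the number of restarts attempted.
    """
    restarts = 0
    for _ in range(checks):
        if not is_alive():
            # With auto_restart off the failure is seen but nothing is done,
            # which is the situation described in the entry above.
            if auto_restart:
                restart()
                restarts += 1
        time.sleep(interval)
    return restarts
```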
 
     I was not pinged because the pinger modem was off due to system
work on majesty last night.

     Sheesh.

     The fatal disk errors on the root swap partition are not a good
sign, and may result in a crash in the near future.  We have root
mirrors in place and fully updated in case we need to do a swap.

09/09/98 Wednesday 6:51pm EST

     Looks like all system jobs ran twice last night, causing
double entries in webstats and other areas.  Not sure why.
 
08/31/98 Monday 11:03pm EST

    Upgraded our news reading server to dnews 46r.  The upgrade went
smoothly, but an incorrect directory entry resulted in the server using
default files for expiration.  Thus all articles were expired.

    The spool will rebuild as time goes by.

08/29/98 Saturday 4:36pm EST

     News has been suffering a massive planet wide denial of service
attack involving an overload of newgroup and sendsys messages.  This
has caused the load on news servers to skyrocket, including ours, and
caused the flow of normal news to come to a crawl.

     Patches to the news server have helped to alleviate the problem,
but the attack continues.

08/20/98 Thursday 7:04pm EST

    Downtime 1 hour.

    Scheduled down for light and adore did not go well.

    64 meg was put in adore.

    Two new root mirror drives would not boot properly, on either
machine, possibly due to a jumper error which I hadn't accounted for.

    New Tape drive on adore seemed to cause scsi time outs on
sd0 which is real strange.

    Presently both systems are as they were except adore has 64 meg
more memory.

08/07/98 Friday 8:12pm EST

    Scheduled down for light and adore starting at 6:00pm.

    Home drive on light was moved to adore, and mirror home drive on
adore was moved to light.

    This broke cgi's which reside in the home directory stupidly,
which can not execute setuid from adore on light.

    This also broke listproc for the same reason.

    These are now fixed.

    /home/www -> /homewww

    Listproc moved to adore.

08/07/98 Friday 11:39am EST

    Link to Elmira was down from 8:35am until now.  Problem
with the Frame Relay line was called into EMI and fixed at about
11:30am.

07/19/98 Sunday 2:39pm EST

     X2 modems rebooted, kicking everyone off

     Apparently routing started to fail for some users at about 2am
Sunday morning.  Tech support got about 20 calls on it.  It may not
have affected everyone.

     One user reported trying all 3 X2 modem banks, failing on all of
them, which is real weird as they are independent machines.
 
     No idea what happened.

07/19/98 Sunday 2:14pm EST

     News was effectively down from about 11pm last night until now
due to a routing error during a reboot that went unseen.  Not related
to the X2 problem above.  (I think).

     News was coming into news1, but not being transferred to news2,
readers were able to read from news2, but weren't getting any new news
during this time.  Expiry on news1 has probably removed some articles
which are now lost.

07/05/98 Sunday 01:14am EST

    gopher and the help system did not restart properly during
last reboot.  Should be working now.

07/05/98 Sunday 12:27am EST

     isdn3 was off line due to a network lock up, perhaps for a few
days.  Everyone was bounced off of isdn2 accidentally in an effort
to clear this out.

07/04/98 Saturday 7:26pm EST

     Raw web access logs were lost last night due to a long standing
but subtle programming bug.

     Logs for 7/1 7/3 and 7/4 have been recovered, the rest are lost.
Only raw hit logs were affected.
 
07/04/98 Saturday 6:13pm EST

    Light taken down at 6pm to replace failing backup tape drive.

07/01/98 Wednesday 6:23pm EST

     adore crashed for unknown reasons, rebooted itself.

06/30/98 Tuesday 10:00pm EST

    world read perms turned back on ftp due to failure
of cd command to print directory name properly.  What a pain.

06/28/98 Sunday 4:24pm EST

    shell help and gopher were broken by last night's ftp move.  These
should be working now.
 
    It is not immediately clear that running files across NFS between
two different platforms (SunOS and Linux) is going to work properly.
At present the SunOS executables for ftp are on the Linux box, which
probably was not intended, but seems to work anyhow as they are
being read across the network before executed on the Sun.

     Probably what we want is for the /ftp/pub directory to be exported
and leave the rest as it is.

06/27/98 Saturday 5:56pm EST

    ftp on light disabled to move it to emerald

    Both light and adore needed to be rebooted to let go of
the original ftp directory.  7:30pm


06/26/98 Friday 4:50pm EST

    Harmony locked up, had to be cold booted, bouncing all users.
 
06/22/98 Monday 9:14pm EST

     Light was put through an emergency reboot to see if it would
clear out a persisting problem with excessive syslogs.
 
06/17/98 Wednesday 2:17pm EST

     Sendmail stopped allowing remote mail through our system to
remote users at about 10:11am this morning due to a system screwup.
It should be working fine now.
 
06/16/98 Tuesday 11:57am EST

     NetServer I isdn2 had a jammed modem.  Rebooted both isdn1 and isdn2,
bouncing everyone off.

06/13/98 Saturday 12:19pm EST

     Harmony bank 1 and usr and isdn modem banks were rebooted,
due to a jammed UPS which had to be rebooted.
 
06/11/98 Thursday 12:50pm EST

     2 modems on harmony were found to not be picking up.  They were
causing ring no answers for a few days.
 
06/06/98 Saturday 9:42pm EST

     We suffered a slow but catastrophic failure of majesty over the
past week.  It began crashing routinely, every few hours.  This has
caused news server to be spotty although news2 which is the reader
machine did not go down, it was not able to have all the articles
available that it would otherwise have.

     Presently majesty is in sick bay, and news.lightlink.com has been
replaced with a pentium 333 machine.  Things may be rough for a while,
but it probably won't crash.

06/03/98 Wednesday 7:30pm EST

     Router to Elmira rebooted, interruption of service for 3 minutes.

06/02/98 Tuesday 10:37pm EST

    Sendmail was rebooted improperly and started to fail
to let remote users post through lightlink to remote spots.

06/02/98 Tuesday 9:26pm EST

     Upgraded news software today on news.lightlink.com to make things
faster.  In process reduced size of active file getting rid of empty
groups.  This cut the size from almost 2 meg to less than 1 meg which
should make news reading faster.
 
05/30/98 Saturday 07:47am EST

     Harmony seems to have rebooted itself for unknown reasons.

     This kicked everyone off of the multitechs.
 
05/27/98 Wednesday 2:47pm EST

     news1 crashed at 2:19pm.  Probably from too many news feeds,
or a failing drive.  Haven't figured out which yet.

     In any case a new machine is being built which can handle
the load.

     news2 was rebooted in the process.

05/25/98 Monday 12:58pm EST

    News1 crashed again at about 12:30pm.  No warnings went off
because it seems to have half crashed leaving its ethernet port operational.
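
     A half crashed machine can still answer pings, so a check that
talks to the service itself catches more.  The sketch below connects
to a port and reads the first line the server sends.  It is
illustrative only, not our pinger.  The name read_greeting is made up
and it assumes the service sends a greeting line on connect, as news
and mail servers normally do.

```python
import socket


def read_greeting(host, port, timeout=5.0):
    """Connect and return the first line the server sends, or None on failure.

    Hypothetical helper: a dead or hung daemon gives None even when the
    machine itself still answers pings.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            data = conn.recv(512)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None
    if not data:
        return None
    return data.decode("ascii", errors="replace").splitlines()[0]
```

     A monitor would raise the alarm whenever this returns None or a
line that does not look like the expected greeting.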

    All drives have been reseated including power sockets.  If it crashes
again, we will reseat the memory, and if it crashes again, we will replace
it with a new super news machine that we are considering in the wings.

     News2 was not affected except some news may have been lost.

05/24/98 Sunday 4:35pm EST

     News1 suffered a major crash at 3am in the morning, wiping out
the history file.  The entire news system was rebuilt this afternoon.
News2 was not affected, except that incoming news will have been sparse.

     News is running properly again.

     It is possible that news1 is losing a hard drive, so this
may happen again.

05/18/98 Monday 11:18pm EST

    News2 crashed this afternoon and needed to be rebuilt.  The
history file was lost and rebuilt, and all binary files were lost.
Most other news was not lost.

    Posting was not working properly until now due to an incorrect
permission on one of the news drives.

05/14/98 Thursday 02:30am EST

     Adore's ethernet locked up tonight for unknown reasons.

     This tied up light because of the nfs mounted drives.

     Adore was rebooted but still would not telnet to itself,
indicating hardware failure in the ether port.  Turned it off and on,
and it started working again.

     If worst comes to worst, adore can have another ethercard put in it,
or people can use light for shell if adore dies off completely.

05/13/98 Wednesday 6:57pm EST

     Shell users were locked out of mail today, possibly for 2 hours.
The permissions on the password file were set to non world readable
for reasons that we suspect but don't know for sure, nothing serious,
more like a bug in the account creation process which has to write to
the file to create the new account.

     The problem occurred we believe starting at 2:41pm and lasted
until about 5:00pm

     Monitors have been put in place to beep us within one minute
should this happen again.
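
     The check itself is small.  A sketch, assuming the monitor only
needs to know whether the world read bit is set on the password file.
The names is_world_readable and check_and_alert are made up, and the
alert hook stands in for whatever does the beeping.

```python
import os
import stat


def is_world_readable(path):
    """Return True if the 'other' read permission bit is set on path."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


def check_and_alert(path, alert):
    """Call alert(path) when path is not world readable.

    alert is a caller-supplied callable (hypothetical hook).
    Returns True if the file is fine, False if an alert was raised.
    """
    if is_world_readable(path):
        return True
    alert(path)
    return False
```

     Run once a minute against /etc/passwd, this would catch the
condition within the one minute window mentioned above.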

 
05/11/98 Monday 01:35am EST

     isdn1 277 0356 rebooted to clear out jammed lines.

04/25/98 Saturday 8:46pm EST

     news1 and news2 were taken down this afternoon to install second
ether cards in them.  This will allow news2 to take news from news1
over a private internal network relieving some of the load on our main
network.

     news1 is being moved to the T3 shortly
 
04/20/98 Monday 3:12pm EST

    Modem 3 locked up on isdn1 2770356, picks up but doesn't answer.
Reset bouncing modems 3 and 4.

04/16/98 Thursday 6:17pm EST

     Scheduled Down.  Harmony modems and power supplies were
reseated a number of times bouncing everyone at least twice.

     Reseating needs to be done periodically to assure that the
contacts are clean to avoid spurious drops or modem failures.

04/16/98 Thursday 02:53am EST

     Light crashed during nightly mirroring due to out of mbuf errors.
This will probably go away with scheduled changes in which machine has
which drives.  Presently the mirroring takes place over the net and
stresses SunOS's ability to handle the traffic.

04/15/98 Wednesday 08:54am EST

    Mail loop in the ccounsel mailing list caused a crash
of gem at about 8am which was not caught until about 2pm.
 
    Have installed monitoring to restart mail and beep
if it dies.

04/04/98 Saturday 3:07pm EST

    isdn1 lost its ethernet address, causing people to not be
able to go anywhere on the net.  Rebooted, bouncing everyone.

04/04/98 Saturday 12:21am EST

    eggbots and irc servers killed on adore to track down source of
DNS hits coming in on light.  DNS installed on adore, so it can
handle its own name service now.

04/01/98 Wednesday 4:06pm EST

     Web server taken off line for 5 minutes to track down what's
driving light out of control.

03/23/98 Monday 3:06pm EST

     Modems 20 and 26 on Harmony have been giving ring no answers for
a few days.  Have busied them out.  The entire box needs to have all
its cards reseated and rebooted.  This will happen next down time.

03/22/98 Sunday 6:22pm EST

    Light and adore taken down from 6:00pm to 6:25pm to add memory

    Memory increased from 128M to 320M
 
03/14/98 Saturday 5:56pm EST

     News2 was unscheduled down for about 15 minutes to allow rewiring
of the machine room.

03/11/98 Wednesday 12:36am EST

     cisco router rebooted, internet connection lost for 2 minutes.

     Router had a bad arp entry for 205.232.34.91 which is modem 9 on
the USR 277 1076 modem banks.  People signing onto that modem could
not get out into the internet.  Bad arp entry came from a typo I made
over a week ago, so this has been going on for a long time.
 
03/08/98 Sunday 1:03pm EST

     277 1076 modem bank rebooted, bouncing everyone

03/05/98 Thursday 12:29am EST

    News2 was down from 9pm to 12am due to a faulty ether connection.

    The ether connections were changed to add in a new hub.

    I have no idea why the system didn't beep me.

     It looks like the monitor program 'pinger' was locked up, not a
good sign.

     What shall monitor the monitor program?
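
     One common answer is a heartbeat: the monitor touches a file on
every pass, and a second, much simpler job checks how old that file
is.  The sketch below assumes that arrangement.  It is not how
'pinger' works today, and the name heartbeat_is_stale and its
parameters are hypothetical.

```python
import os
import time


def heartbeat_is_stale(path, max_age_seconds, now=None):
    """Return True if the heartbeat file is missing or too old."""
    if now is None:
        now = time.time()
    try:
        last_beat = os.path.getmtime(path)
    except OSError:
        # No heartbeat file at all is treated as a dead monitor.
        return True
    return (now - last_beat) > max_age_seconds
```

     The monitor would call os.utime(path, None) each time around its
loop.  The checker is small enough to run from cron, so it does not
share the monitor's failure modes, although cron itself can still die.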
 
02/23/98 Monday 5:00pm EST
 
     Nynex installed 4 new modems which are now on line.

     Part of modem bank 1 was knocked off line due to a loose cable
twice bouncing everyone on that section.

     T1 was physically moved to a better location causing a 4 minute
network outage.

02/22/98 Sunday 8:30pm EST

     Cisco router upgraded to 11.1(17).  Network down from about 6pm
to 6:45pm.
 
02/21/98 Saturday 2:28pm EST

    Momentary net outage for unknown reasons.

    Rebooted cisco router and kentrox CSU/DSU.
 
    That didn't clear it up, but then it cleared itself up shortly after.
 
02/20/98 Friday 09:00am EST

    X2 NetServer was rebooted by accident, everyone bounced.

02/18/98 Wednesday 5:17pm EST

     Adore was under attack today from someone at 130.238.203.192.

     Rebooting the system did not end the attack.

     Used tcpdump to sniff his packets and then blocked him at the border
router.

02/17/98 Tuesday 9:27pm EST

    The X2 NetServer was producing busy signals even though there
were modems free.  I rebooted the first X2 NetServer and the
USR V34 Server bouncing everyone off.

    It seems to have cleared up.

02/14/98 Saturday 6:31pm EST

     277 0356 X2 lines locked up for unknown reasons this afternoon.
It may have had to do with a test Cisco 2501 we put on the network to
test RIP2, a routing protocol.  I presently do not know why that would
have caused the problem, particularly since the new router was on line
since last night with no reported troubles.  Rebooting the X2 banks
did not fix the problem, but taking the Cisco off line did.
 
02/11/98 Wednesday 12:19am EST

    Light rebooted itself last night at around 3am.
This caused authentication services to switch over to majesty.  When
light came back up, I forgot to reset authentication back to light.
For unknown reasons, majesty is refusing to authenticate some people,
so for much of the morning many people were not able to sign on.

     Mystery solved.  A newer version of the authentication code was
running on majesty, that was never supposed to be put in service, and
it was dying on config files from the older version that are copied
over from light to majesty every night to keep the password files in
sync.


02/07/98 Saturday 10:52pm EST

    Adore crashed again.  Took out another memory chip.
Gonna keep doing this until we find out which one is bad.

02/05/98 Thursday 11:36pm EST

     Modems 49-72 rebooted, bouncing everyone off.  Bad modem 68
is working again.

02/04/98 Wednesday 12:46pm EST

    Adore locked up but didn't crash.

     This caused the modem authentication software to fail on light so
some people were not able to get on.

    Sun tells us it's possibly a bad memory chip.  I have replaced
the chip in slot 0.

01/05/98 Monday 9:25pm EST

     Majesty was down from 2:41pm this afternoon.  News2 was not affected,
but new news coming in was stopped until now.  Various cross checking
pingers failed to go off, new code with bugs etc.  My fault.

01/01/98 Thursday 3:05pm EST

     A number of people have not been able to get signed on to 277
5026.  When light crashed this morning, authentication shifted to the
back up server, which for reasons still unknown is rejecting certain
people's passwords.

01/01/98 Thursday 04:10am EST

     Light crashed due to mbuf map full error.

12/08/97 Monday 10:51pm EST

     Light was crashed by a user playing with an exploit.  The
exploit has been closed on all three machines.

     The anti relaying code was not restarted when light rebooted
so some remote users were not able to send mail.

     My apologies.

12/08/97 Monday 3:25pm EST

    Bad modem was causing ring no answers, taken off line.
 
12/05/97 Friday 5:37pm EST

     X2 rebooted itself.

12/05/97 Friday 02:20am EST

    X2 modems rebooted,  bouncing everyone about 12 midnight.
 
11/28/97 Friday 2:41pm EST

    pm2-elmira is down.  Frame connection to elmira rebooted, no
change.

    pm-elmira is fine.

11/24/97 Monday 01:26am EST

     Aurora (news2) rebooted a number of times at around 11 to 12am,
in order to test various things.

     Apparently it hasn't been posting properly to usenet for about 3
days since the 20th.  This was a result of starting it from the
startup profile /etc/rc.d/rc.local.  For unknown reasons it runs but
won't post.

11/22/97 Saturday 7:39pm EST

     Aurora (news2) taken down for 3 minutes to replace ethernet
card.  If this doesn't stop the crashing, we will replace the mother board.

11/21/97 Friday 7:15pm EST

    Tested an attack against the Cisco border router, and brought
it down for about 2 minutes.

11/21/97 Friday 12:38pm EST

     Aurora locked up at 8am.  Because of the nature of lock up, no
warnings went off.  I believe it is caused by bad ethernet code for
the card we are using.  The card will be replaced shortly.
 
11/20/97 Thursday 10:38am EST

     Aurora (news2) locked up its ether port again.  Various beeping
mechanisms in place to warn me and reboot the system automatically
failed.  Will need to do more testing on this.  Don't know why the
etherport is locking up.  Have replaced the ethernet card, the mother
board may be going bad.

11/17/97 Monday 2:01pm EST

    Overview database on news2 has been erased.  It will rebuild slowly
as people download articles.  Downloads may be slower for a while.  Hopefully
this will fix the missing news article problem.

     The Overview database had headers in it for articles that were
expired so they show as available in netscape but return ARTICLE
EXPIRED when customers hit on them.
 
11/16/97 Sunday 10:14pm EST

     News2 rebooted.  Was running with 64meg instead of 128meg due to
improper boot last time around.  This is fixed.
 
11/16/97 Sunday 9:27pm EST

     News2 suffered a few crashes over the past few days from ethernet
lockup.  Apparently some of its history, index and overview files got
corrupted.  This may have resulted in headers being downloaded without
bodies.  We have rebuilt the data base and run an expire, this
hopefully will clean it up.  If not, it should clean up by itself as
time progresses and things get naturally expired.

11/14/97 Friday 6:25pm EST

     Modems and news stopped for 5 minutes to reset a warning light on
a UPS.  Everyone bounced.  The UPS seems to have cleared out its bad
battery indicator light, but it may come back.
 
11/14/97 Friday 4:16pm EST

     News2 down for about 15 minutes to replace a bad ethernet card.
 
11/06/97 Thursday 11:57pm EST

    X2 rebooted itself.

11/04/97 Tuesday 4:48pm EST

    I had to take light down to clear out a module that was
misbehaving badly.  
 
     I was trying to trace a hacker who was trying out passwords on
our system, and in loading the tracing software, it took over the
console and wouldn't give me control back.

10/31/97 Friday 12:24pm EST

     Majesty locked up from about 5:45am from scsi disk failure,
eventually dying from process table full.

10/22/97 Wednesday 1:27pm EST

    Aurora locked up at 5:30am this morning and did not beep me because
the modem was off.

    News restored at 1:30pm
 
10/20/97 Monday 11:29pm EST

    Aurora (news2) locked up for unknown reasons, rebooted.

10/19/97 Sunday 7:02pm EST

    Majesty (news) crashed from mbuf map full.

10/17/97 Friday 5:47pm EST

    X2 rebooted again to clear polluted arp cache.

10/17/97 Friday 3:41pm EST

     USR modems down, box not responding.

     X2 modems rebooted by accident.

10/12/97 Sunday 10:23am EST

    Looks like there was an external network outage at around 8:30am.

10/12/97 Sunday 10:19am EST

     Romance's network locked up for unknown reasons, for about 3
minutes. Rebooting cleared it out.

10/12/97 Sunday 05:23am EST

     Internal network was locked up by unknown causes from 3:15am
to now.  Incoming traffic went to 100 percent and outgoing went to 0.

     Shutting down all machines did not clear the condition although
during the final reboot of all of them it managed to clear itself.

     Nysernet and Sprint are looking into it.

10/08/97 Wednesday 10:39pm EST

    X2/ISDN server totally locked up, rebooted bouncing everyone

    Talked to USR today, they have escalated the service call.


10/07/97 Tuesday 9:48pm EST
 
    External network seems to be partially down.

10/07/97 Tuesday 7:41pm EST

    Modem 12 on NetServer locked up.

10/06/97 Monday 8:11pm EST

    X2/ISDN server locking up routinely.  Have asked USR to escalate
the complaint.

10/04/97 Saturday 12:52pm EST

    News crashed at 6am from disk full errors.

10/03/97 Friday 11:41am EST

    X2 modem server rebooted to clear out jammed lines.

10/02/97 Thursday 11:58pm EST

    Main router locked up a few times tonight for unknown reasons.

    All access to the net was blocked out.

10/01/97 Wednesday 4:30pm EST

     News2 was down for about 30 minutes at 2pm.  Loose cable on
a hard drive caused it to crash, cable is damaged but working.

09/29/97 Monday 9:13pm EST

    News2 rebooted by accident.

09/29/97 Monday 6:28pm EST

     Modem 12 on the X2 server is locked up.

     Evidence indicates that the NetServer rebooted itself at 16:05
this afternoon, bouncing everyone.

     Called USR, they said modem is picking up on digital calls but
not analogue calls, that this is a Nynex problem.  USR said the ISDN
line is presenting an analogue call as a digital call which is
incorrect.

     Called Nynex, reported the line that is not picking up.

     Gotta leave the modem locked up so Nynex can see it fail.


09/26/97 Friday 3:46pm EST

    Modem 2 on the *NEW* Netserver locked up.

09/21/97 Sunday 12:19am EST

    Yesterday when light crashed and was rebooted, an older
version of the apache web server was started which caused
some anomalies including failures of freebie domains to go
to their proper pages.

    This is fixed.

    X2's seem stable.

    X2's rebooted to install newest code.

    Everyone bounced 12:24am


09/20/97 Saturday 9:16pm EST

     News has been jammed most of the day.  It was running but
throttled due to a binary fill up this morning.  When I restarted
it this morning, it didn't restart properly, but kept running anyhow,
so I never got beeped.

09/19/97 Friday 9:47pm EST

     Called Nynex about the 8 ISDN lines going up and down all night.

09/19/97 Friday 8:48pm EST

    Light crashed.  I quit out of Xwindows and got a cpu panic.

    Been up way too long.

09/19/97 Friday 6:50pm EST

     Nynex ISDN seems to have gone down.  All X2 channels got dropped
and gave busy signals.

     NetServer won't resynch up.

     Nynex ISDN seems to be going down every 10 minutes for about 5
minutes.

09/18/97 Thursday 9:38pm EST

     X2/ISDN server replaced.  Down for about 3 hours.

     Took it back out because it wasn't working, but it was
my problem, so the new one is back in again.  Looks like it's working.

09/16/97 Tuesday 2:33pm EST

    X2 server locked up.  Worked with USR on it for a number
of hours.  Doing one last test before we replace it.
 
    Please continue to report modem failures.

09/13/97 Saturday 4:17pm EST

    NetServer Modem 2 locked up, call into USR to study it.

09/05/97 Friday 7:39pm EST

    X2/ISDN NetServer locked up on modem 1, giving no answers.

    Rebooted.

09/04/97 Thursday 1:32pm EST

    External net is down.

08/31/97 Sunday 2:58pm EST

    Nynex Frame Relay line died at 11:50am.  They are working on it.
 
08/25/97 Monday 5:02pm EST

    Harmony password file was lost at 3:50pm for unknown reasons.

    Signons failed until about 4:10pm when we noticed it.

08/22/97 Friday 12:14am EST

    X2 modems rebooted twice for system work.
 
    Everyone was bounced

08/19/97 Tuesday 02:13am EST

    Password file for X2 modems was destroyed again at 12 midnight.  This
time I found out what did it.

08/18/97 Monday 11:23am EST

     Password file for X2 modem banks was corrupted last night.
Down since 12 midnight.

08/16/97 Saturday 4:57pm EST

     X2 port 4 is apparently hung, I did a hard reset on it, which
booted 3 too.

08/11/97 Monday 4:18pm EST

    Majesty crashed from out of memory errors.  This caused light
to stick during shell logins from NFS mounts.

08/09/97 Saturday 4:48pm EST

    Majesty crashed, out of mbufs.

08/07/97 Thursday 12:09am EST

     I tried to regather articles from the last day to make up for the
0 day expire of this afternoon.  news2 promptly went and filled all 4
disks all the way and bombed, something I don't quite understand
because there wasn't 12 gig of news to steal from news1.

     So I had to expire that too, but by this time all those articles
were in the history file, even though they were now expired, so I
couldn't go back and reget them again without starting from scratch
which is what annoyed everyone last time.

     So, right now everything looks stable and is working properly.
Most of this was my unfamiliarity with the new software, and its lousy
manual.

     The articles will fill in again as new ones come in.

08/06/97 Wednesday 8:06pm EST

    news2 was put through a 0 day expire this afternoon after
it filled up one drive leaving the rest empty.  This bug is
fixed but to restart it properly we had to empty the one drive.

    All four drives are filling up now.

08/04/97 Monday 6:29pm EST

    news2 rebuilt from scratch using latest, greatest 4.2g

    Articles will refill in.

08/02/97 Saturday 10:54pm EST

     news seems to be stabilized.

     news2 however got hopelessly corrupted.

     It has been wiped clean, and groups will have to reload
themselves.

07/26/97 Saturday 8:08pm EST

    inn1.6b2 has destroyed the news history file.  news will be down
for a number of hours.  news2 is fine.

    news will probably be up again around 9am plus or minus.

07/25/97 Friday 10:43pm EST

    Getting ring no answer on modem 43.  Taken off line.

07/24/97 Thursday 4:42pm EST

     Apparently we suffered some kind of SYN attack causing the
web server to lock up and name service to freeze.

07/23/97 Wednesday 8:56pm EST

    News has been down for a few hours while we install inn 1.6b1.

    Not going smoothly, tin doesn't seem to want to read.  rn works fine.

07/23/97 Wednesday 5:32pm EST

    NetServer rebooted itself.
 
07/22/97 Tuesday 2:38pm EST

    Significant net outage.

07/22/97 Tuesday 12:44am EST

     ISDN/X2 rebooted multiple times.  Uploaded newest code;
Courier I works worse than before.

07/20/97 Sunday 4:51pm EST

     The system password file was corrupted at 5:10am this morning due
to a bug in our own security software.  Logins to light and e-mail
retrieval were prevented until about 6:00am when we finally figured
out what had happened.

07/19/97 Saturday 1:24pm EST

    NetServer ISDN cold booted to clear out two jammed channels, s11 s12

07/13/97 Sunday 2:25pm EST

    News2 will be slow today, we are rebuilding the history database.

     While the rebuild is taking place, many articles will show up as
there while in fact not being there, this may cause problems with your
newsreaders.

     Lots of spam in alt.sex is rendering the groups totally useless.

    Expiration on alt.sex* has been set to 0 days, expiration on all
other groups has been set to 7.  Binaries are still 4.

    0 days doesn't mean we aren't carrying them, it means they
get expired fully every night.

     Filtering of postings crossposted to more than 10 groups has been
turned off on news2.  News hasn't been filtering.

07/09/97 Wednesday 01:03am EST

     ISDN/X2 server rebooted at about 11pm or so.  Then again a while
later.  It first locked up with endless radius daemons sending majesty
to load 40, then after it was rebooted, it wouldn't do ISDN any more,
X2 worked fine, and *V2=0 worked fine, but *V2=5 wouldn't work at all.

     Turning it off and on cleared it out.

07/08/97 Tuesday 12:20am EST

     ISDN/X2 server rebooted.

     It was in some terrible loop with the radius server on majesty,
causing MILLIONS of radiusd's to be spawned driving the load on
majesty up to 50.

06/30/97 Monday 9:01pm EST

    Net Server ISDN rebooted to set speeds to 115200.

    Testing slow speeds.
 
06/28/97 Saturday 2:14pm EST

     Nynex stopped by to check errors on the T1, trouble report
originated by Nynex monitoring.  T1 was down for about 5 minutes causing
interruption in connection to the net.

06/27/97 Friday 12:40am EST

     An experimental version of sendmail died silently at around
11:17pm for unknown reasons.

     Outgoing mail was down for about 30 minutes until I was
informed of the down.

     I will place sendmail in the system monitor so this won't
happen in the future.

06/21/97 Saturday 10:27pm EST

     Light taken down due to network instability.  Getting le0 out of
mbuf errors, tremendous activity coming in on le0 or going out, not
sure.  ciscoin and ciscoout don't show anything special.

     Getting mbuf errors even while in single user mode, that
indicates tremendous hits on light coming from outside.

     Installed new kernel while we are at it, jesv.  Light booted
3 times without incident, pray hard.

     Looks like it was a ping attack, we will get the joker the
next time it happens.

06/15/97 Sunday 10:27pm EST

    Light crashed again at 18:57 and would not reboot at all.

    After multiple failures to reboot, I turned it off, pulled the
memory and reseated it, and did the same with all cables and internal
disks.  Then it rebooted, but it may simply have changed its mind.

     It is possible the crashes are from a dying root drive, there is
some indication that there were problems with the swap space, and boot
sectors.  There is another mirror drive ready to go if it crashes
again.

     If that's not it, it's going to be rough finding out what it is.

     We have a second server being prepared which can take over
the functions of light if necessary, but it is far from ready to
do this on short notice at this time.

06/15/97 Sunday 3:41pm EST

     Light crashed for unknown reasons.

     Watchdog reset.

     Came back automatically and tried to reboot, said bad boot block
or something, I stupidly didn't write it down.

     Then I rebooted it by hand and it booted and ran a *LOT* of
fsck's on all drives.  When it went to reboot, it failed with MEMORY
OUT OF ALIGNMENT, no tracebacks this time.

     Then it booted by hand fine.

     We may have a bad boot block on the main root disk, but this
doesn't explain why light crashed in the first place.

06/12/97 Thursday 6:35pm EST

    Light and related systems taken down to replace UPS.

    Light did not boot smoothly, it gave MEMORY OUT OF ALIGNMENT followed
by a traceback.  This is new and not good.

06/11/97 Wednesday 7:55pm EST

     Network interference from romance being down caused aurora to
lock up due to NFS mounts.  News2 down for about 20 minutes.

06/06/97 Friday 11:31am EST

     Elmira link was down for about 45 minutes for unknown reasons.
Rebooting the frame relay router at our end seems to have
reestablished the link.
 
06/02/97 Monday 3:15pm EST

    Light crashed from running out of mbufs, a bug in the apache
web server.

    It crashed at 15367/15744 mbufs.

    finwait was 2474
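
     Those figures put the mbuf pool at roughly 97.6 percent full at
the time of the crash.  A warning threshold could be sketched as
below.  The 90 percent default and both function names are assumptions
for illustration, and the counts would have to be gathered separately
(for example from netstat -m output).

```python
def mbuf_utilization(in_use, total):
    """Return the fraction of the mbuf pool currently allocated."""
    if total <= 0:
        raise ValueError("total must be positive")
    return in_use / total


def should_alert(in_use, total, threshold=0.90):
    """Return True once utilization reaches the warning threshold."""
    return mbuf_utilization(in_use, total) >= threshold
```

     With the numbers above, should_alert(15367, 15744) is True, so a
beep would have gone out some time before the pool ran dry.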

05/24/97 Saturday 11:50pm EST

     Majesty (news.lightlink.com) was down for about 6 hours from
a subtle news crash.  I did not notice and no one called to inform me.

     Aurora (news2) was not affected.

     Some articles may have been lost.

05/20/97 Tuesday 6:14pm EST

     Light taken down to install new kernel and make security patches.

     Running Kernel jess

     6:00pm -> 6:15pm
 
05/18/97 Sunday 6:40pm EST

    Light taken down at 6:00pm to rewire part of machine room.
 
05/18/97 Sunday 3:00pm EST

     News (majesty) was down for about an hour to patch the OS.

     Majesty is theoretically fully patched at this point.

05/07/97 Wednesday 01:01am EST

    Sendmail and web taken off line at about 12:30am for a few minutes
to deal with an incoming relay spam.

05/05/97 Monday 02:09am EST

    Harmony modem bank one reset, bouncing everyone.
 
04/30/97 Wednesday 11:32am EST

    Light crashed.
 
04/29/97 Tuesday 6:20pm EST

    Nynex has taken down the huntgroup to fix an earlier problem.

    Everyone is getting busy signals on 5026.
 
04/28/97 Monday 12:54pm EST

     Harmony reset bouncing everyone
 
04/25/97 Friday 12:24am EST

    Nynex has destroyed the hunt group again.  Line is
giving busies.  Dial in on 277 4940.
 
04/22/97 Tuesday 6:10pm EST

    Load driven to 30 by errant cgi on the part of local user.

04/21/97 Monday 3:51pm EST

     Elmira POP was offline for about 3 minutes, for unknown reasons.
 
04/20/97 Sunday 12:29am EST

    I locked up light by accident by bringing up adore with same
IP number!

    Sorry.

04/19/97 Saturday 7:02pm EST

    Light taken down to replace failing boot drive.

    Harmony rebooted with new operational image 11.2.3

    6:00pm -> 6:30pm
 
04/19/97 Saturday 3:20pm EST

    Netserver 16-I rebooted at 3:10pm.

04/17/97 Thursday 01:47am EST

     Cisco router got locked up around 12:30am, process table full.
Rebooted at 01:45am

     Apparently we suffered a syn flood attack from a disgruntled irc
user.

04/16/97 Wednesday 6:02pm EST

     Majesty (news) is down for upgrades.

     6:00pm -> 7:00pm

04/15/97 Tuesday 3:03pm EST

     Spam filter is causing more damage than it's worth.  Have removed
it.  sendmail was down for 10 minutes due to an error restarting the
non spam filter version.

04/08/97 Tuesday 2:36pm EST

    Light is losing its main root drive.  There is a backup mirror
drive if the main drive fails totally.  A new drive is on order.

    Light rebooted at 2:30pm to determine which drive was failing.
 
04/08/97 Tuesday 11:03am EST

     Elmira off line for about 30 minutes due to network lockup
at elmira end.
 
04/05/97 Saturday 11:57pm EST

     Elmira Frame Relay to Nynex is down, lights are dead.  Called into
Nynex Repair.

     Nynex T1 Box was burnt out, replaced and line came back up.

     Downtime 24 hours.

04/05/97 Saturday 12:09pm EST

     Rebooted light by accident. Sorry.

04/05/97 Saturday 01:00am EST

     Binary partitions on aurora (news2) rearranged to provide more
space.  All articles lost in alt.binaries.pictures.* and a number of
warez and games groups.

04/04/97 Friday 8:45pm EST

     Light rebooted at 6:00pm to install new 9 Gig drive.
     Rebooted a second time to clear out jammed format session on drive.

     6:00pm -> 6:07pm
     8:45pm -> 8:50pm
 
04/01/97 Tuesday 5:40pm EST

     News (majesty) rebooted with new kernel with statd security patch.

04/01/97 Tuesday 5:30pm EST

     Light rebooted with new kernel with statd security patch.

04/01/97 Tuesday 3:29pm EST

     News2 taken down to install 128Meg of memory.

     15:22 -> 15:30
 
03/31/97 Monday 5:47pm EST

    Light went out of control, load went to 20.  I halted the
system and rebooted.

    Earlier had 'out of mbuf' errors.

    Probably caused by Apache 1.2b4, we are now back on 1.1.1

03/28/97 Friday 1:40pm EST

     Network rewiring may have caused intermittent freezes between
11am and 1:00pm.

03/28/97 Friday 11:05am EST

     News died silently at 4:42am probably from an errant control
message with too long a header.

    11:44am news rebooted

03/27/97 Thursday 2:15pm EST

     News and news2 off line for 15 minutes at 2:00pm to rewire.

     Failed, have to do it again.

03/27/97 Thursday 09:15am EST

     Network locked up this morning at 8:12am necessitating a reboot.
It then locked up again at 8:26am needing another reboot.

     There was very strange activity on one of the network hubs (BNC
connector to downstairs lan) even though there should have been no
traffic at all going through it.

     Unfortunately when the network is locked up, network sniffers to
view traffic don't show much!

     We will be separating the 4 ethernet hubs into their own areas,
making it easier to tell which line may be in trouble.  Presently they
are all stacked in one central site making it impossible to quickly
trace the wires from blinking lights to machines they are connected
to.
 
03/26/97 Wednesday 9:32pm EST

    Majesty (news) locked up the ethernet at 9:11pm causing all machines
to stop responding.

    Light, majesty and aurora rebooted.

03/26/97 Wednesday 4:01pm EST

     Nynex had a major event last night, causing our roll over
to go off line.  All modems were working, but were not rolling over
on busy.  People calling in on 277 5026 were getting busy signals,
because that modem was busy, although many others were not.

     This was fixed around 1am, but then broke again this afternoon around
2pm.  All modems work, and most roll over properly, but the roll over
on 5026 has gone off line again.  Dial in on 277 4940.

03/19/97 Wednesday 5:50pm EST

     Aurora down: 4:00 -> 5:50pm
 
     News2 (aurora) down for memory upgrades.  512K Cache installed,
128Meg of memory still not being recognized by Linux.

03/19/97 Wednesday 12:59am EST

    USR Couriers rebooted with new kernel image, bouncing everyone.

03/11/97 Tuesday 1:54pm EST

     Aurora (news2) rebooted.
 
03/11/97 Tuesday 12:14pm EST

     Aurora (news2) taken down to install 128Meg.  Rebooted fine but
wouldn't recognize the memory.  Still running on an effective 64Meg.

03/10/97 Monday 10:45am EST

     Light semi jammed at about 8:46am for unknown reasons, causing
various network malfunctions.

     Rebooted at 10:45am.

     News was down for the period of the reboot.

03/08/97 Saturday 12:17am EST

    News outsend queue was jammed, some articles were not being
posted to the 15 swap sites.  They were going out the main feeds.
The queue has been freed, if articles are still on our system,
they will be sent, if not please resend.

03/06/97 Thursday 12:20am EST

     News taken down because of full partitions.  Expire set to
1 day to clear out space so that we can clean up the mess.

     News2 running fine, expire is 14 days.

02/26/97 Wednesday 08:40am EST

    Light became crippled at 5:19am with a Process Table Full error.

    Beepers did not go off because it was still responding to pings.

    Rebooted at 8:35am.
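
     A host with a full process table can still answer pings, so a
check that talks to an actual service catches more.  The sketch below
is one assumed way to do that, not the beeper code we run.  The name
service_answers and its defaults are hypothetical.

```python
import socket


def service_answers(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable and timed out all count as a failure.
        return False
```

     A bare connect can still succeed on a crippled machine, because
the kernel may accept the connection before the daemon does anything
with it.  Reading the service's greeting line afterwards would be a
stronger test.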

02/24/97 Monday 8:21pm EST

    Majesty (news) rebooted at 8:06pm to install new kernel.

02/23/97 Sunday 10:15pm EST

     Light was taken down this afternoon for about 20 minutes due to a
jammed process that drove the load to 8.

     News was rebooted twice at 22:00, to install new kernels.

02/23/97 Sunday 2:47pm EST

    News crashed from filled up warez partition.
 
02/22/97 Saturday 9:53pm EST

     Amphenol connector pulled out of the modem banks taking modems
33 through 48 off line for a while.  Don't know how long it was going
on.

02/20/97 Thursday 9:13pm EST

    Majesty (news) rebooted with new kernel.

02/20/97 Thursday 05:12am EST

    Majesty's etherport crashed for the second time.  Took
about an hour to get majesty and light disentangled, had to reboot
both.  I hate NFS.

    Have no idea what is wrong with Majesty.

02/17/97 Monday 9:29pm EST

    Light was taken down at 6:00pm to install new lib.c and
security patches.  It did not go smoothly, and we are still
getting bad signal stack errors every once in a while.

    Came back on line at 7:00pm and had to reboot a few times
after.

02/15/97 Saturday 4:31pm EST

    News2 was down for an hour to add new disk space.  The entire news
spool was wiped clean.  It will refresh as you hit on groups again.

02/14/97 Friday 12:03am EST

    Majesty (news) rebooted to install new security upgrades.

02/13/97 Thursday 1:53pm EST

     Majesty (news) taken down for an hour for maintenance of news
spool.

02/10/97 Monday 11:59pm EST

     Majesty (news) locked up and wouldn't communicate over the ether
port.  Routes were all right, but it wouldn't telnet in or out or even
to itself.  I had a hard time getting control back.  Finally able to
do a clean reboot.  It's working now.

02/08/97 Saturday 5:26pm EST

    News2 restarted with clean spool, replicate true is on.

    Expect it to bomb.


02/07/97 Friday 01:19am EST

     News2 taken down for 1 hour to add new drive.  Have 12 gig on
board.
 
02/06/97 Thursday 6:42pm EST

     News2 was taken down for about 15 minutes to install
another hard drive.  One more to go in later.

     No news was lost.

02/05/97 Wednesday 9:59pm EST

    news2 got completely munged with the replicate true setting.  I have
turned it off and had to rebuild the news spool.

02/04/97 Tuesday 4:28pm EST

    News2 taken down to move machine to machine room.  News spool
was wiped clean to install "replicate true" which will force news2
to follow article numbers on news.

02/02/97 Sunday 6:29pm EST

     News was down for one hour due to a mistake.

01/30/97 Thursday 2:14pm EST

     Cisco router locked up.

     Had to reboot it.

     Down for 10 minutes.

     Turns out Jane had set a wrong IP in her work machine!  Not a
fault of the Cisco, easy to replicate.


01/26/97 Sunday 10:37pm EST

    Light rebooted at 10:00pm with corrected version of new
kernel.  SOMAXCONN set to 127 in header.h files and binary edit
of uipc_socket.o
 
01/25/97 Saturday 5:33pm EST

     Light rebooted with new kernel.  jesi contains SOMAXCONN 127
which may help web server jams.

01/21/97 Tuesday 03:34am EST

     Couriers rebooted to clear out jammed modem.

01/14/97 Tuesday 2:03pm EST

     Light taken down to install new virtual domain code.  I screwed
up on taking light down (forgot to disengage majesty) and what should
have been a 5 minute cycle turned into a 20 minute cycle.

     Installation went smoothly and it seems to work at first glance.
Light may crash.

01/09/97 Thursday 8:32pm EST

     Modem rack 1 has a bad power supply.  It is presently running on
its spare.  I found this out by accident when I went to reseat both
power supplies alternately.  The modems died when set to power supply 1.

     Everyone on rack one got bounced.

     Will replace.
 
01/06/97 Monday 10:26pm EST

     Sprint has taken over management of our primary router to the
internet.  I don't know if this is a good thing or not.  I get to not
have the write password, but apparently they also changed the read
password, so the monitoring software is registering throughput as
zero.  Not.

01/03/97 Friday 2:51pm EST

     Courier modem number 4 was producing ring no answer.

     The rack has been reset.

12/30/96 Monday 7:22pm EST

    Light crashed.  I have reinstalled the original kernel using
the older virtual interface code, kernel jesf.  We started crashing
with jesg.

12/28/96 Saturday 4:51pm EST

    Light crashed.  Time to get rid of the new vif code.

    Don't know if the old code supports more than 128 virtual domains.

12/28/96 Saturday 11:12am EST

     Connection to external network was down from 4:00 to about 10:30.

     This is usually caused by problems external to lightlink, but
this time for some reason the cisco router had locked up.

     Sprint is "looking into it".
 
     I have placed a pinger on the cisco from majesty so that if it
happens again I will be beeped immediately.
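
     A minimal sketch of such a pinger, in Python rather than whatever
was actually run on majesty.  The host name "cisco", the three-miss
threshold, the 60 second delay and the alert callback are all
assumptions for illustration; the log does not say how the real one
works or how the beep is sent.

```python
import subprocess
import time


def host_is_up(host, ping=None):
    """Return True if a single ping to host succeeds."""
    if ping is None:
        # Assumes a ping command that accepts "-c 1" (send one packet).
        def ping(h):
            return subprocess.call(
                ["ping", "-c", "1", h],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            ) == 0
    return ping(host)


def watch(host, alert, ping=None, max_failures=3, checks=None, delay=60):
    """Ping host repeatedly; call alert(host) once after max_failures
    consecutive misses.  checks=None means loop forever."""
    failures = 0
    done = 0
    while checks is None or done < checks:
        if host_is_up(host, ping):
            failures = 0          # any success resets the count
        else:
            failures += 1
            if failures == max_failures:
                alert(host)       # hypothetical hook: pager, mail, beep
        done += 1
        if checks is None or done < checks:
            time.sleep(delay)
```

     Waiting for several misses in a row before alerting keeps a single
dropped packet from setting off the pager.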

12/27/96 Friday 6:11pm EST

     Light crashed.

12/26/96 Thursday 3:40pm EST

     News taken down for upgrades.
 
12/26/96 Thursday 1:41pm EST

     I crashed light installing new software.
 
12/23/96 Monday 11:40am EST

     Light crashed.

     Probably caused by new virtual domain code.

12/20/96 Friday 5:00pm EST

    Modems 49-72 upgraded to flash prom 28MR113D.HEX
    Modems 04-12 upgraded to flash prom 28MR113D.HEX
    All modems upgraded.

12/20/96 Friday 07:41am EST

     Light crashed, cpu panic on mget.

12/17/96 Tuesday 6:17pm EST

     Vif 1.1 installed, virtual domain code.  Running kernel jesg

     Light taken down at 6:00pm to install code.

     USR's taken off line to switch from coax to twisted pair, then
rebooted.

12/12/96 Thursday 7:16pm EST

     Virtual domains moved to 205.232.88.xx

     Web was down from about 6:00pm until now.
 
12/05/96 Thursday 5:49pm EST

     Majesty was accidentally halted by one of our staff.

     News was down from 4:15pm until about 5:00pm.  This caused
anomalies in the mail server and shell accounts due to NFS mounted
drives between light and majesty.

12/02/96 Monday 12:42pm EST

     Web server was down for 10 minutes due to a bad domain name.

12/01/96 Sunday 6:23pm EST

     Light was down for scheduled maintenance starting at about 6:10pm.

     /tmp was cleaned out, and various things checked to make sure
they work properly (rootenv etc.)

11/26/96 Tuesday 2:42pm EST

     I crashed light by accident with an incorrect kill command.

11/26/96 Tuesday 1:20pm EST

     One line fixed, one to go.

11/26/96 Tuesday 12:34pm EST

     2 lines are still bad, one is producing ring no answers at around
modem 45.
 
     Just keep trying, you will eventually hop over someone getting
the ring no answer and get in.
 
11/25/96 Monday 7:51pm EST

     Well there were 6 dead lines.  Now there are 8.  Making
progress.

     Nynex just showed up, we are down to 2 bad lines, one can
be busied out, the other is still ring no answer.

11/25/96 Monday 3:14pm EST

     Presently the rotary seems to be working properly except we have
found 6 dead lines in the middle of it that were working fine
a few days ago.  Nynex has been informed.

11/25/96 Monday 1:12pm EST

     Hunt group is in the process of being repaired.  In the meantime 18
modems are off line.

     ftp filled up over night due to failure of maintenance program
over the past few days.  This has been fixed to beep me.

     web server did not restart smoothly this morning after web stat
runs.  No reason determined.  Results are cgi failures, and other
anomalies.

11/22/96 Friday 6:40pm EST

     Nynex installed the remaining 9 lines for new modems and managed
to stick the new numbers in the middle of the hunt group.  Because the
numbers were not working and had no modems on them if they did, people
were unable to dialup after the first 30 modems were filled up.

     This produced long term ring no answers all afternoon.

11/18/96 Monday 9:06pm EST

    News crashed due to disk full errors, and then later again
due to my own errors.

11/17/96 Sunday 11:47pm EST
 
     News taken down for security enhancements.

11/16/96 Saturday 7:01pm EST

     News was down for about 45 minutes.  A simple reboot to install
new security software turned into a nightmare when I couldn't sign on
at all.  A bit too much security I would say.

     Actually the cause was a bug in some commented lines in rc.local
that prevented it from running fully during boot up, and the password
authentication daemon was not getting started.

     It's a very eerie feeling not being able to sign on to your own
machine.

11/03/96 Sunday 2:44pm EST

     All shell access has been closed off to accounts that have not
used shell in the past month.
 
10/31/96 Thursday 8:59pm EST

    I erased the password file accidentally while cleaning up for
security sweeps.

    e-mail and shell were unavailable for about 10 minutes.

10/25/96 Friday 5:54pm EST

     Light and majesty both rebooted to add security features.

     Running jesf and jojof

10/24/96 Thursday 12:24pm EST

    Light rebooted at 11:00pm last night to clear out remains
of intruder.  Forgot to install new named, so at 12 midnight
when named was reset it bombed out.  Name service was defective
from about 12 midnight to 11am this morning.
 
10/24/96 Thursday 12:24pm EST

     A trojan login program was found on light recording shell
passwords.  Trojan seems to have been installed Aug 16th.

     Not doing too well.


10/18/96 Friday 11:15pm EST

     We are under the influence of a *VERY* bad mail loop between
Cornell listproc, the department of mathematics and ourselves, namely
my own mail.  A listserve I run at cornell is sending error messages
to my account at the math department which was just closed.  The
messages are bouncing back to the listproc which is then bouncing them
back again to math.  Each time the files get bigger.  They started off
about 20K, now they are 600K.  I am going to have to disable my mail
lest the mail spool fill up.  Cornell has been notified, but no
response yet.

10/17/96 Thursday 10:18pm EST

    Problems with named have been fixed.

10/17/96 Thursday 10:17pm EST

    Cornell has asked me to remove all cornell.* groups from
our newsfeeds.  They are meant for cornell only.

10/17/96 Thursday 7:04pm EST

     At 2:07pm the web server suffered an unclean restart which
resulted in it rejecting all cgi execution for a few hours, before I
was notified.

     Don't know what caused the problem, except that the server was
restarted by hand and then the error logs show cgi's failing left and
right.

10/16/96 Wednesday 12:48pm EST

     External net is down.

10/12/96 Saturday 11:46pm EST

     External net is down from about 11pm to 2am

     They are working on fixing the router problems
that have plagued Ithaca for a long time.

10/10/96 Thursday 11:19pm EST

    Light rebooted to install corrected named files.
 
10/09/96 Wednesday 11:59pm EST

    I screwed up and killed a whole mess of processes that should have
been left running.  Rebooted light to restabilize system.

10/09/96 Wednesday 2:07pm EST

     External net is going up and down while they work on a bad
primary Ithaca router.

10/07/96 Monday 11:08pm EST

    Light rebooted to clear out named.  We will be installing a new
version of named and the vif code shortly.  Please bear with us, presently
every time I add a new virtual domain I have to reboot the system.

10/05/96 Saturday 9:08pm EST

    Light rebooted to clear out named.  Named is due for replacement
soon.

10/02/96 Wednesday 10:18pm EST

     The root partition filled up this morning causing the loss of web
hit stats from about 4:20 in the morning to about 11:00am.

10/02/96 Wednesday 1:03pm EST

     Light was rebooted to clear out a full root partition.

     The root system filled up during the mirror run.  This occurred
because the mirror drives are not mounted during normal reboots such
as the one last night, for reasons as yet undetermined.  Since the
mirroring is to directories attached to the root partition, when the
mirror partitions are not attached, it merely fills up the very small
root partition.
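
     A guard against this could look roughly like the sketch below:
refuse to mirror into a directory unless a partition is actually
mounted there.  The function names and the mirror callback are
hypothetical; the log does not say how the real mirror run is driven.

```python
import os


def safe_to_mirror(target, ismount=os.path.ismount):
    """True only if target is a mount point, i.e. a real partition
    is attached there rather than a bare directory on root."""
    return ismount(target)


def run_mirror(targets, mirror, ismount=os.path.ismount):
    """Call mirror(t) for each mounted target; return the targets
    that were skipped because nothing was mounted on them."""
    skipped = []
    for t in targets:
        if safe_to_mirror(t, ismount):
            mirror(t)
        else:
            skipped.append(t)   # would otherwise fill the root partition
    return skipped
```

     The skipped list is worth reporting, since an unmounted mirror
drive also means last night's mirror did not happen.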

10/02/96 Wednesday 12:49am EST

     Light was down for two hours starting at about 11pm.

     We ran into a problem with named (NAME-D) and virtual domains.

     If named is loaded first and then /etc/rc.vif is run, it will
work properly.  vifs are virtual interfaces, we need one for each
virtual domain.  They assign the virtual IP number to the ethernet
port so it can respond to more than one IP number.

     But after the vifs are loaded, if named is killed outright and
restarted, it runs out of file descriptors and complains about bad
file numbers and refuses to load any name service records.

     There seems to be a limit of about 40 to 50 virtual domains,
before named starts to go bad.  We have over 100.
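
     A rough sketch of the arithmetic behind a limit like this,
assuming (and this is only a guess, the log does not confirm it) that
named holds one descriptor open per virtual interface plus a fixed
overhead.  The overhead of 16 and the soft limit of 64 used below are
illustrative numbers, not measured ones.

```python
import resource


def descriptors_needed(n_vifs, overhead=16):
    # Assumption: one socket per virtual interface, plus a fixed
    # overhead for logs, zone files and the standard streams.
    return n_vifs + overhead


def fits_in_limit(n_vifs, soft_limit=None, overhead=16):
    """True if the estimated descriptor count fits under the soft limit."""
    if soft_limit is None:
        # Ask the running system for this process's open-file limit.
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return descriptors_needed(n_vifs, overhead) <= soft_limit
```

     Under these assumed numbers, fits_in_limit(40, soft_limit=64) is
true and fits_in_limit(100, soft_limit=64) is false, which is at least
in line with trouble starting somewhere past 40 to 50 domains.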

     Most of the down time was spent trying to corner the problem and
determine the max number of vifs.

     I was playing with sendmail on majesty and had killed off named
and restarted it a few times on both machines.  This is what started
the problem.  When I restarted it on light, it died and it took me a
good half hour to figure out what was going on.

     It's also really hard to think straight with fear running through
your stomach.

09/29/96 Sunday 6:53pm EST

     Light was down for scheduled maintenance from 6:00pm to 6:50pm.

     Disk drive bay filters were cleaned.

     A broken SCSI cable feeding the tape backup drive was replaced.

     Harmony was extensively tested for name service and signon
authentication while light was down.  The RX11.1 code works much
better than the old R9.0.

     Trumpet winsock 2.0B seems to ignore the second DNS number in ethernet
mode, but works fine during dialup.

     Timing differences however seem to trip up the script.  This will
need more study.

     Win95 dialup worked fine.

     Next scheduled down time will be next Sunday at 6:00pm to move
the web and log directories to their own partition.

09/27/96 Friday 7:21pm EST

    News is down to repair a disk drive bay.

09/24/96 Tuesday 1:59pm EST

     External net is not fully up.

     1:45pm external net seems back up.
 
09/23/96 Monday 2:26pm EST

     Web stats for 19960922 and 19960923 were munged and had
to be redone.  They should be correct now.

09/23/96 Monday 10:52am EST

     Light's /home directories were filled up today by a user who had
a 600 meg file in his home directory.  This was going on since 5:45am
until now.

09/21/96 Saturday 10:18pm EST

     News was restarted due to system weirdnesses with the w command.

09/20/96 Friday 7:54pm EST

    News rebooted to find which fan is going bad.  It is going to have
to be replaced soon.

09/19/96 Thursday 12:48am EST

     News was rebooted a few times around 11:00pm to clear out jammed
serial ports from uucp experiments.

09/15/96 Sunday 9:57pm EST

     News rebooted to clear out jammed processes on serial ports.

09/15/96 Sunday 11:38am EST

     External net has been down since

     06:43am -> 3:00pm

     Ithaca 1 Router was down at Nysernet.


09/12/96 Thursday 11:25am EST

     External net is down.

09/11/96 Wednesday 10:32pm EST

     The web server went out of control at 8:20pm.  We spent about 30
minutes trying to do an autopsy on the dying processes, only to find
that some could not be killed off.  Since one remaining process was
tying up port 80 in a (D) uninterruptible I/O wait, we had to reboot
light to clear it out.

09/09/96 Monday 12:59pm EST

     News was down for about 30 minutes to remove an air conditioner
with a blown compressor.

09/08/96 Sunday 8:48pm EST

    Ithaca-1 router is off line (external network).

09/06/96 Friday 11:05am EST

     Web server has been stabilized to a large degree.  Source of load
spikes has been found, and hung children and root servers are being
worked on.

     Web stats are being moved to 6:30am once a day so that the server
is not killed every hour interrupting ftp downloads etc.
 
09/03/96 Tuesday 4:31pm EST

     Power outage at about 4:10pm -> 4:42pm
 
09/03/96 Tuesday 01:10am EST

    Web hit logs program is causing the server to die.  I will
be running web hits by hand periodically until I figure out why.

09/02/96 Monday 3:43pm EST

     Web server logs from 7:00pm or so last evening until now have
been lost due to a programming error on my part.

09/02/96 Monday 12:08am EST

     Modems were reset bouncing everyone.

08/31/96 Saturday 1:01pm EST

     Working on the web server, it will be up and down repeatedly.

     13:01 -> 15:00

08/30/96 Friday 9:47pm EST

    Light crashed due to a mistake I made.

08/23/96 Friday 12:29am EST

     Perl5 was down until now from earlier upgrades today.

     Links in /usr/local/bin were nuked, and permissions were set wrong.

08/22/96 Thursday 9:55pm EST

     We are presently running vanilla apache 1.1.1.  The web stats are
still being accumulated but they are not being distributed to users.
This will start happening again when things stabilize.

     The Secure server is working again (the documentation was wrong)
but is presently down.
 
08/22/96 Thursday 6:54pm EST

     ApacheSSL 1.3 replaced with vanilla apache 1.1.1

     18:55 ->
 
     Keep Alive turned off to see if it affects spiking.

08/22/96 Thursday 10:08am EST

     The Apache web server is not working perfectly.  It is causing
load spikes on the system which make light stick momentarily every
once in a while.  It may be related to cgi's.  Apache is aware
of the problem and is 'working on it'.

     1.1.1, however, so far seems to have fixed the main problem
with 1.05, which is chronic freezing.

     The Secure server is not responding yet, probably due to a
configuration error from the upgrade.

08/21/96 Wednesday 2:22pm EST

     Apache web server is down for upgrades.

     14:22 -> Running 1.1.1 -> 16:52
     16:52 -> Running 1.05  -> 18:00
     18:00 -> Running 1.1.1 -> *****
 
08/19/96 Monday 09:24am EST

     News crashed at 6:07am from a news posting collision between
two groups alt.binaires and alt.binaries.

08/15/96 Thursday 6:11pm EST

     The following newsgroups have been moved to a new partition.  All
articles in these groups have been temporarily lost.

     /alt/binaries/multimedia
     /alt/binaries/games
     /alt/binaries/mac
     /alt/binaries/sounds
     /alt/binaries/misc
     /alt/fan
     /alt/mac

 
08/13/96 Tuesday 5:34pm EST

    News is down for upgrades

     17:33 -> 19:07

     Big 7 news groups now belong on their own partition which will
allow longer expire times for them and the world wide groups that
remain on the original partition.

     soc sci comp rec talk misc news
 
08/13/96 Tuesday 11:23am EST

    Last night at about 1:00am, there was a strange anomaly in the
harmony password authentication process.  A number of people were
blocked from getting on.  Causes are as yet unknown and the anomaly
has not been repeated.

08/10/96 Saturday 12:37pm EST

     Majesty/News are going to be up and down all afternoon for
upgrades.

     Down:
 
     12:37 -> 12:43
     16:18 -> 20:16

     I tried to move the rec group from the root news partition to its
own partition pending moving all of the Big 7 to their own space.

     The rec partition has 400 meg in it, and it took hours and hours
and hours to copy it over.  I finally cut it short and proceeded to
erase the original files on the root directory, and THAT took hours
and hours and hours.

     So, this is not going to work.

     I took this course of action because I wanted to save the present
spool of news in rec (and the big 7 when I go to move them) but it
just takes too long.  So probably what I am going to do is just nuke
the news spool, reformat and rearrange it the way it ought to be, and
then let it rebuild itself with new articles over the ensuing days.

     We either suffer long down times and preserve articles (although
losing new articles coming in during the down times), or we nuke the
whole spool and get it running again in about 1 hour, losing
everything on it, but having little down time and losing few new
articles.

08/10/96 Saturday 01:33am EST

     news reset twice to clear out expiration confusion resulting from
earlier upgrade.
 
08/09/96 Friday 1:49pm EST

     news is down for about 2 hours for upgrades.

     13:49 -> 15:41

     The overview data base was moved from /sd4a to /sd5h to even out
the load on our news drives.

08/02/96 Friday 10:42pm EST

     The external net is down.

07/26/96 Friday 2:52pm EST

     The external net is down.
 
07/25/96 Thursday 4:44pm EST
 
     Jason,

     Light suffered an event a few moments ago, the load went to 50
and people started calling in thinking we were down.
 
     During such high load periods, cron kicks in and starts taking
snap shots of the system resulting in the following top display.
 
     You will notice mogrify is running at 99 percent CPU, using 781
Meg of swap space, 41meg of which was resident.
 
     You must put limits on this puppy as it is highly destructive in
the hands of your users.
 
    Homer

Thu Jul 25 16:26:39 EDT 1996
last pid: 29724;  load averages: 47.44, 36.88, 20.52    16:27:09
186 processes: 160 sleeping, 24 running, 1 zombie, 1 stopped

Memory: 111M available, 53M in use, 58M free, 8984K locked

  PID USERNAME PRI NICE  SIZE   RES STATE   TIME   WCPU    CPU COMMAND
29114 nobody   102   10  781M   41M run     1:09 99.64% 18.75% mogrify
29709 sysdiag  -12  -20 1704K 1428K run     0:00  0.00%  0.00% top
29168 sysdiag   25    0  652K  212K run     0:00  0.00%  0.00% 
28724 nobody    25    0  448K  328K run     0:00  0.00%  0.00% httpd
   63 sysdiag   25    0   88K  104K run    43:39  0.00%  0.00% portmap
17674 docdoo    -5    2 5780K 1412K run    17:07  0.00%  0.00% parse
12119 sysdiag   -5    0  352K  200K run     7:01  0.00%  0.00% erpcd
28828 nobody    -5    0  448K  176K run     0:00  0.00%  0.00% httpd
28778 gila      -5    0 2532K  164K run     0:00  0.00%  0.00% pine3.95
22690 maxho     -5    4 1296K  128K run     3:36  0.00%  0.00% bawt
  455 root      25    0   11M  136K run    63:01  0.00%  0.00% Xsun
29153 nobody    -5    0  436K  312K run     0:00  0.00%  0.00% httpd
29079 nobody    -5    0  436K  236K run     0:00  0.00%  0.00% httpd
29109 nobody    -5    0  440K  220K run     0:00  0.00%  0.00% httpd
 7979 slam      -5    0 1080K  216K run     3:30  0.00%  0.00% eggdrop

     Homer
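
     A sketch of the kind of check a cron job could make to produce a
snapshot like the one above.  The threshold and the "top -b -n 1"
invocation (batch mode in procps top) are assumptions; the log does
not show the real script, and top on other systems takes different
flags.

```python
import os
import subprocess
import time


def snapshot_if_loaded(threshold, getload=os.getloadavg, take=None):
    """Return a timestamped process listing if the 1 minute load
    average is at or above threshold, otherwise None."""
    one_minute = getload()[0]
    if one_minute < threshold:
        return None
    if take is None:
        # Assumed command: one batch-mode pass of procps top.
        def take():
            return subprocess.run(
                ["top", "-b", "-n", "1"],
                capture_output=True, text=True,
            ).stdout
    return time.strftime("%c") + "\n" + take()
```

     Run from cron every minute or so, this only writes anything when
the machine is already in trouble, so the snapshots stay rare enough
to read.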



07/24/96 Wednesday 4:41pm EST

     htaccess passwords for web pages are presently not working.

     We are looking into it.

     This is fixed.  There is a new module called unix_auth_module,
that is supposed to use /etc/passwd passwords.  It conflicts
with the normal dbm passwords.  The server was recompiled without
the unix_auth_module and dbm passwords started working again.

07/22/96 Monday 6:27pm EST

    Web stats were down for the past day.  No web logs were lost,
but they weren't being collated every hour as usual.  Things should
be back to normal and all past web logs should be in today's files.
 
07/22/96 Monday 6:25pm EST

     News was down for about an hour to do routine maintenance
on disk drives and to install a new mirror drive.

07/19/96 Friday 12:53pm EST

     News and Majesty were down from 11:12am to 12:52pm for
maintenance and upgrades.

07/15/96 Monday 1:44pm EST

    Web server was down for about 15 minutes due to DNS failure on
a virtual domain.  My fault.

07/04/96 Thursday 10:47pm EST

    News server crashed from full partition on alt.binaries.
 
06/27/96 Thursday 7:44pm EST

     Last night at 12:00am or so the web server crashed and was
down all night.

     Since there were no new hits, the web stat accumulation program
simply kept adding the last batch of stats to the log files every hour
on the half hour, producing multiple entries for the time between
11:30pm and 12:00am when the server crashed.

     I have cleaned up all the log files for 19960627, so there should
be no duplicates in them.

     It is unknown why the server crashed, but it may have been
related to a domain name change I made, failing to inform the virtual
domain server of the change.  The apache server is very sensitive to
dns failures, it will crash the whole web server if it fails to get
dns on any of its virtual domains.

     I have fixed the bug in the stats program that wrote duplicate
stats to people's log files, and the apache server is due for an
upgrade shortly.
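
     One way to avoid this kind of duplicate, sketched under
assumptions: remember how far into the server log the last run read,
and collate only what was added since.  The function name and the
idea of storing a byte offset are hypothetical; the log does not
describe how the real stats program was fixed.

```python
def new_lines(log_path, offset):
    """Return (lines added since byte offset, new offset)."""
    with open(log_path, "rb") as f:
        f.seek(0, 2)              # jump to the current end of the log
        end = f.tell()
        if end < offset:          # log was truncated or rotated
            offset = 0
        f.seek(offset)
        data = f.read()
    lines = data.decode("utf-8", "replace").splitlines()
    return lines, offset + len(data)
```

     If the server is down and nothing new has been logged, the
function returns an empty list, so an hourly run has nothing to add a
second time.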

06/27/96 Thursday 12:05pm EST

     Light was rebooted to install jese kernel, which includes patch
102264-02 ufs_lockf.o to fix the crash we suffered yesterday.

06/27/96 Thursday 11:20am EST

     Harmony was rebooted to install fix for failure to reprompt.

06/26/96 Wednesday 11:46am EST

    Light crashed and rebooted itself for unknown reasons, probably not
related to scsibus errors.

    Lots of changes have been made to the kernel recently so some
instability may have been introduced into the system.


06/25/96 Tuesday 4:18pm EST

    Harmony reset to set auto_detect_timeout to 10 from 30.

    This will prevent some ring no answers.

06/25/96 Tuesday 3:47pm EST

     Harmony was rebooted at around 12:30pm to install a new setting
to help fix some of the scripting failures the new code is causing.
This fixed a lot of scripts and promptly broke a few more.

06/24/96 Monday 10:08pm EST

    News/Majesty was down for 15 minutes at 10:00pm or so to
add 64 meg of memory bringing total to 160meg.

06/23/96 Sunday 4:52pm EST

     News/majesty was down for about 20 minutes at 4:10pm, for
upgrades.

     /usr/local software drive on light was copied over to majesty
/sd7.  Majesty is now running on identical software to light.

06/22/96 Saturday 11:32pm EST

     Alt.binaries filled up the news partition and bombed out the
news server.  I have nuked the alt.binaries group, starting with a clean
slate.  They will fill back up again over the next few days.

     Someone is flooding the net with binaries, possibly in a.b.misc,
and this is filling up partitions and using up extreme amounts of
T1 bandwidth.


06/22/96 Saturday 7:13pm EST

    Harmony was taken down at 6:00pm until 6:30pm to install
RX11.1.3.

    First indications are that this has NOT fixed the sign on problem.
Further testing will be necessary to see if it helps the upload problem.

    Homer

06/19/96 Wednesday 10:14pm EST

    News was down for 10 minutes.  Rebuilt kernel on majesty for
127 max users instead of 200.  Told that 200 wraps around at 128!

    Rebooted.

06/19/96 Wednesday 12:30pm EST

     A number of user perl scripts went out of control at 10:00pm last
night producing high loads and sticky response.  The situation
lasted for about an hour before I caught it and killed them.

     The alt.binaries groups have been flooded recently, tying up both
our incoming bandwidth and disk space.  I cleaned out the alt.binaries
groups to one day (although expire is still 3) and I have upped our T1
to 768K effective in about a week.

06/16/96 Sunday 11:47pm EST

     Light was quarantined for 2 hours from 6:00pm to 8:00pm or so to
move the ftp/web directories to a new 4 Gig drive.

     The web server was down for this time, and shell usage was locked
out.  E-mail, news, dialup and web surfing should all have been
working fine.

     The upgrade went without incident.

06/11/96 Tuesday 12:59pm EST

     Incremental backups as of 8am are in /backup.

06/11/96 Tuesday 12:19pm EST

     /etc/motd was corrupted.  I replaced the kernel and motd and
rebooted.

06/11/96 Tuesday 12:06pm EST

     We lost the home drive sd6 at about 8:30am this morning.  Light
was down until now restoring the home directories to a new drive.

     sd6 is the drive that has been having so many errors over the
past year, I think it finally decided to die for good.

     Home directories are current as of 12 midnight or so.
Incrementals will be made available shortly if the system proves
stable.

06/10/96 Monday 5:52pm EST

     Web server pages were messed up for about 30 minutes due to
a typo I made.

06/09/96 Sunday 8:24pm EST

     ftp brought back on line.  Everything looks like it is working.

06/09/96 Sunday 7:13pm EST

     Light was taken offline for scheduled maintenance at 6:45pm.

     /home/ftp was moved to /ftp

     A link was placed in /home/ftp -> /ftp, so nothing should break
from this.
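
     The move-and-link step can be sketched as below.  This is an
illustration in Python, not the commands actually used on light, and
the function name is made up.

```python
import os
import shutil


def move_and_link(old, new):
    """Move directory old to new, then leave a symbolic link at the
    old path so existing references keep working."""
    shutil.move(old, new)     # also works across partitions (copies, then removes)
    os.symlink(new, old)      # old path now points at the new location
```

     For the case above that would be move_and_link("/home/ftp",
"/ftp").  Anything that opens files under the old path follows the
link without noticing the change.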

     During this time I was able to sign on to harmony using winsock,
and surf the net although www.lightlink.com responded with a socket
error.  I also was not able to read or send mail, Eudora responded
with "Connection refused by lightlink.com".

     This means that majesty was working as both nameserver and
security server for those using Harmony to sign on.

     By changing the definition of smtp.lightlink.com and
mail.lightlink.com from light to majesty, it might be possible to keep
outgoing mail functional during these down times.  But this is
possibly more dangerous than it is worth.

06/09/96 Sunday 11:17am EST

     News throttled itself from a full news partition over night, at
least that is what it says it did.  The expire run has apparently
cleaned out the over full articles but did not restart news.

     Recent spams and cross postings between news hierarchies have
caused strain on the various news partitions, I am cutting back all
news to 4 days expire time to clear things out, and will open them
back up again according to our disk space availability.

06/07/96 Friday 10:37am EST

     ftp daemon jammed from too many failed uploads.  I have increased
the permissible jobs to 100 and installed a monitor that will warn
me when it gets above 20.
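
     A sketch of a monitor along these lines.  The thresholds of 20
and 100 come from the entry above; the process name "ftpd", the use
of "ps -e -o comm=" to count jobs, and the status strings are
assumptions.

```python
import subprocess


def count_processes(name, ps_lines=None):
    """Count running processes whose command name equals name."""
    if ps_lines is None:
        # Assumes a POSIX ps that supports "-e -o comm=".
        out = subprocess.run(
            ["ps", "-e", "-o", "comm="],
            capture_output=True, text=True, check=True,
        ).stdout
        ps_lines = out.splitlines()
    return sum(1 for line in ps_lines if line.strip() == name)


def ftp_status(count, warn_at=20, limit=100):
    """Classify the current job count against the two thresholds."""
    if count >= limit:
        return "full"         # daemon will refuse further jobs
    if count > warn_at:
        return "warn"         # time to beep someone
    return "ok"
```

     Warning well below the hard limit leaves time to clear out failed
uploads before the daemon jams again.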

06/05/96 Wednesday 10:40am EST

     Light crashed from cpu panic after /sd6a went out of control.

     First look shows no damage.

06/04/96 Tuesday 11:35pm EST

     Web server killed and restarted.  It was beginning to stick
again.

06/03/96 Monday 12:42pm EST

    Ciscoout shows that external net was down for a short while
at 12 noon.

05/29/96 Wednesday 12:55am EST

     web server is acting up.  keeps jamming for unknown reasons.

     This started tonight, but has been reported sporadically in the
past.

05/23/96 Thursday 01:13am EST

    Unscheduled reset of Harmony bouncing everyone.

     There is a setting called forwarding_timer that is supposed to be
set to 5 (25ms) but is off instead.  This is supposed to be the cause
of much upset, perhaps even ftp upload problems.  However I was unable
to turn the setting on.

05/16/96 Thursday 4:37pm EST

     beta6 ftpd jammed with failed sessions.  Rebooted with beta11.

05/16/96 Thursday 12:00am EST

     Unscheduled downtime.

     All modem ports were reset at 12 midnight, to reset the subnet
mask of each port from 255.255.255.0 to 0.0.0.0.

     Xylogics says the prior setting would cause trouble.

05/14/96 Tuesday 11:52pm EST

     News server was shut off from Clarity tonight for a few hours
while we tried to determine where the bandwidth is going.

     (It's not Clarity.  6/4/96)
 
05/14/96 Tuesday 11:49pm EST

     The web server shut itself down tonight for a few minutes after I
deleted an account that had moved on.  The account had a virtual
domain with us, and when the directories were deleted, the httpd
server decided it didn't want to run any more.
 
     Not sure if that is a feature or a bug.

05/14/96 Tuesday 1:34pm EST

    News should be fully back on line.

05/14/96 Tuesday 12:08pm EST

    News is down for testing.  Back at 12:45pm.

    Remote domain reading is still down.

05/13/96 Monday 3:06pm EST

     News is down for testing.

     You will notice from the sharp fall off on the ciscoout numbers,
that news is somehow using up all our bandwidth.

     News is presently back up, but various feeds are locked out,
until we find which one is causing the problem.

     Problem seems to have been in part all the new micro feeds
we have added.  We were sending too much to all of them at once.
Things are more in control now.

05/10/96 Friday 11:44am EST

     Harmony rebooted itself at 8:36am under the new operational image.
So much for that.

05/10/96 Friday 01:19am EST

     Harmony rebooted itself.  It is now running the new operational
image which Xylogics says won't reboot itself any more.

05/07/96 Tuesday 3:42pm EST

     Harmony rebooted itself at 7:25am this morning bouncing everyone
off.

     Xylogics says that if we install the new R9.2.21 operational
image, it will stop doing this.  I have installed this in the proper
directory, but it involves taking harmony down in order to load it.
Probably I will simply let harmony fail again and reboot itself with
the new image.

05/06/96 Monday 7:08pm EST

     News was down for about 45 minutes due to a bug in an upgrade
that I did not spot.  I have reverted the code to the original until
a fix is forthcoming.

     Xylogics has acknowledged that the recent reboots of harmony are
the result of a bug in the R9.2.7 code we are running, and has offered
a fix in R9.2.21, which will not compile.  They are aware of that too.

05/04/96 Saturday 12:04am EST

     Harmony rebooted itself at 2:46pm on Friday, probably bouncing
everyone off.

05/03/96 Friday 8:23pm EST

     News will be up and down for short periods as we try to work the
bugs out of new news software.  We are trying to install software that
will allow us to feed remote sites at 10 times our present rate, thus
making a fair exchange for our 5 redundant newsfeeds (that are
presently not enabled because we couldn't feed them as fast as they
were feeding us by a factor of 10.)

04/29/96 Monday 4:37pm EST

     The Sprint network was apparently down from 4:30am to about 7:30am
this morning.

04/29/96 Monday 11:43am EST

    Light/web/mail/ftp was taken down for scheduled down time from
11:00am to 11:32am.

    Installations went without incident.

04/28/96 Sunday 11:23pm EST

     Majesty/news taken off line for about 45 minutes at 10:23pm for
hardware upgrades.

     16 gig of mirror drives were installed on the fast wide bus, and
4 gig of news space was reinstalled.

     Installation went without incident.

04/25/96 Thursday 7:26pm EST

     Harmony rebooted itself at 7:02pm for unknown reasons, booting
everyone off.
 
04/23/96 Tuesday 2:59pm EST

     Scheduled Down Time

     Light was taken down at noon for 15 minutes to install a new
mirror root drive.

     Majesty was taken down at 12:15pm for 45 minutes to install a new
mirror root drive, and a fast wide scsi card.

     Both installations went without incident.

04/22/96 Monday 12:36pm EST

     System taken down for scheduled down time at 11am.

     Could not get light to fail to boot no matter how I reset the
kernel to the way it was during the crash.  This is not a good sign.

     Harmony will boot properly from Majesty, but will not
authenticate.  Probably a simple configuration error.

     Root mirror drives were not installed, not enough time.

     System will be down on Tuesday during Scheduled down time from
11am to 1pm.

04/20/96 Saturday 8:58pm EST

     Presently playing with mirroring name service on majesty/light
for both forward and reverse lookups.  Name server may be sticky or
momentarily non existent during these tests.

04/20/96 Saturday 4:43pm EST

     Earlier today while Light was down, out of the corner of my eye I
saw Harmony reset the modem banks.  Apparently during this time it
also lost its default route, preventing dialup users from going out
onto the net.

     The mean time to failure around here is getting ridiculous.

     Update: Apparently for whatever as yet undetermined reason,
Harmony rebooted itself while light was down.  Since the new
experimental Harmony code WAS resident on Majesty, Harmony rebooted
itself from Majesty using a vanilla config.annex file which did not
have the default route to the cisco in it.  Thus we lost our default
route.

     Amazing.  These machines are talking to each other behind my back.

     Anyhow the upshot of all this is Harmony is now running X9.2.21 rather
than X9.2.7.


04/20/96 Saturday 2:47pm EST

     Crashed again.   This time out of the blue for no good reason.

     Panic on CPU 0, almost definitely on sd5.

     Some damage to web hit files.

     Perhaps the drive is bad.  I do have another one, but it would
take significant down time to swap them over.

     I see it's going to be a hard life.

     However I also see that people were signing on while light was
down, so the harmony software is working on majesty.

     Also placed the kernel back into asynch mode which is
usually more stable.


04/20/96 Saturday 2:03pm EST

     At about 1:50pm I walked in on light to find it generating
repeated errors on sd5.  These were not sufficient to crash it, nor
for anyone to report it, but the tape backup had failed and it was
clearly going down.

     I halted light gracefully and rebooted.

     I took the opportunity to remove the second drive bay, which is now
destined for Majesty as mirror, and I installed a new $100 gold plated
radium cored forced perfect terminator on the fast wide bus.

     There was no damage and light rebooted flawlessly twice.

     We may have fixed the 'refusing to boot more than 3 drives' problem,
but we have not fixed the scsi bus error problems at all.

     The procedure I used to bring light down gracefully was:

     1.) halt command which would not complete because so many errors were
happening on sd5 it couldn't sync the drives.

     2.) Stop-A which forcibly halts the processor in mid stride.

     3.) Turned the drive bay off and then on again and let the drives come
up to speed, which apparently cleans out the jammed bus.

     4.) Go, which starts the processor again allowing it to complete
the syncing and halting process.

04/19/96 Friday 10:57pm EST

     Light crashed while formatting a recalcitrant new disk.  

     "CPU panic on 0"

     Not sure why.  Possibly the disk was not properly registered with
the kernel, it had been complaining about read errors prior to trying
to format it.  Normally I would have rebooted the system at this
point, but we have a moratorium on rebooting in place.
 
     The kernel complaint was possibly caused by having pulled the
disk to get its serial number and replugging it.  They are supposed to
be hot pluggable though.

     Light rebooted just fine by the way, no damage.  Moment of terror
though.

     Most people probably can't conceive of being in fear when their
computer crashes.  But I go through it every time Light burps.

     Sometimes I just have to walk around until the adrenaline
ceases.

     I can't continue living this way.

     Panic on cpu 0 indeed.  
 
     Panic on operator 0 is more like it.

04/19/96 Friday 7:51pm EST

    Getting intermittent ring no answers.  On Monday I am going
to have call forwarding on ring no answer installed.  It's going to
cost $3/line per month or about $210/month for our 72 modems.

************************************************************************
******************  CRASH OF 4/17/96 ***********************************
 
     RECENT EVENTS of 4/17/96
 
     During the scheduled down time of 6pm to 7pm, after I had swapped
in the new scsi bus, the tape drive and sd8 failed to come on line
during the first boot attempt.

     There was some other impropriety in how I had configured the
system which I do not remember now, and I thought the booting failure
was possibly related to it (NOT).  I fixed the minor error, and
rebooted and it booted fine and the system came up at 7pm.

     Then I started a tape backup, which promptly jammed.

     So I rebooted Light, clearing out the jammed tape process, and
started another backup.  This one bombed out with a serious scsi
error.

     Then I reset the drives to asynch mode in the kernel and rebooted
again.  Again the tape backup got a scsi error.

     Then I figured well maybe the cable has gotten ruffled, so I took
the system down, and replaced the $100 gold plated radium cored cable
with a $10 cheapie that had been working fine on Majesty.  When I went
to reboot, the tape drive wouldn't come on line at all, and neither
would sd8d.

     Now this is exactly what happened two weekends ago during the
crash, only that time I was not smart enough to realize what was going
on, and whatever I did in my hurry, I managed to wipe out the FATS on
all of the drives.  This time I was very careful not to do anything
stupid, and the drives so far have remained whole throughout all
this.

     OK, so at about 11pm the system would not boot at all, the tape
drive and sd8 wouldn't come on line no matter what I did.

     I then started thinking it might be various things and set about
to test them.

     First thing I did was put the old scsi card back in.  No change.
 
     The next thing I did was swap the drive bay which holds the 4
drives.  We just happened to have a new one lying around which is
destined to hold the web log stats and more drives.  That was
convenient.  No change.

     Then I took all drives out of their drawers (these are hot
pluggable drives) and checked and reseated their scsi jumpers and
internal cables, all to no avail.

     Then I swapped the CPU board from Majesty to see if it was the
CPU card.  No change.

     Then I noticed that if I booted with one of the drives off, the
missing drive would come on line.  Thus we have sd5 sd6 sd7 sd8, and
sd8 was not coming on line, but if I turned off either 5 6 or 7, it
would come on line just fine.  (sd = scsi drive).

     Then I stuck both drive bays on the system and put in a fifth
drive that was destined for web stats.

     Again, only three drives would come on line, I could turn off any
two and the other three would come on line.

     Something has to be working overtime to do this.

     I saw that probably we could bring the system back up, if I moved
the contents of sd8 to the single wide bus, and left only 5 6 and 7 on
the fast wide bus.

     At this time it was about 2am, I was getting very tired and
beginning to make mistakes (deadly mistakes), so I called it quits and
went to bed for 4 hours, leaving the system off.

     When I got up at 6:45am, I went in to start putting together the
new configuration, and it just booted all the way just fine.

     Somehow my getting some sleep made the system work.

     Now this in fact has been the story of this ISP, and I am
beginning to think that we have something REALLY FLAKEY going on here
that has not been spotted.  Time and time again, over the past 9
months, I mean repeatedly, when the system was rebooted it was hell
to bring back up, then it would suddenly start working again for no
reason, particularly after having been left off for a while.

     Then it would run fine for a while, even days, then suddenly
start getting scsi errors again, causing numerous crashes etc.

     OK, so in summary this is what has been swapped.

     1.) The Sparc 20 mother board (months ago). No change.
     2.) The scsi card.  No change.
     3.) The CPU card.  No change.
     4.) The drive bay. No change.
     5.) The drives themselves, one at a time.  No change.
     6.) ALL cables, many times.  No change.
     7.) All terminators.  No change.

     Presently, light is running, but I am afraid to take it down lest
it not come up again.  A tape job is fully jammed so backups can't be
done without rebooting.  Of course if I reboot, the tape drive won't
come on line at all.  Majesty is down because for some reason it won't
boot with Light's CPU card in it.

     So things are a mess.

     04/18/96 Thursday 10:43am EST

     I called the makers of the scsi card PTISP, and told them how the
bus was only finding 3 drives out of 4.  The guy told me to twiddle a
number in the driver and recompile the kernel, which would allow the
driver to recognize more than 3 drives (apparently 3 is all it was set
to recognize).  I asked him, well how come we have been running on 4
drives all these months (albeit flakily), and he said he didn't know.

     I made the change and rebooted.  All 4 drives were already on
line from when I woke up, and they stayed on line during the reboot,
so not much is proven there yet.  However the full system tape backup
completed, which is a major improvement over 5 failures in a row.

     I had asked the guy, how could your fast wide bus software affect
the single wide bus that the tape drive was on, and he said he didn't
know.  However I must point out that the tape drive is backing up
drives on the fast wide bus and thus there might be some interaction
between them.

     We did however get one minor scsi error during the backup, but not
enough to bomb it out.  So things are not working perfectly.

     04/18/96 Thursday 11:49am EST

     Light is back up with its own CPU card, and Majesty and News are
back up.

     Tape backups are working.

     Gonna skip tonight's down time.

     Been down enough.

     Homer
 
************************************************************************
************************************************************************
 
04/18/96 Thursday 08:46am EST

     Made change to scsi driver as suggested by tech support.

     scsi_ncmds_per_dev from 3 to 32.

     It is supposed to allow recognition of more than 3 drives.  This
doesn't explain why we have been running 4 drives for months.

04/18/96 Thursday 07:34am EST

     Light was down all night, and may be up and down all day.

     It is presently running in a crippled mode, tape backups are NOT
working.

     Majesty and news are also down as it won't boot with Light's CPU
board.

     I can't take Light down to swap the boards just yet because it
probably won't reboot.

     The SCSI errors have finally reared their ugly head and are no
longer playing patsy with me.

     More to come but I have to get on with debugging the system.

04/17/96 Wednesday 9:19pm EST

     Got a scsi bus error on single wide bus within 5 minutes of
starting tape job.  Taking light down to reset kernel to asynchronous
mode.

04/17/96 Wednesday 8:52pm EST

    Light rebooted to clear out a jammed tape process.
 
04/17/96 Wednesday 7:41pm EST

     There is some evidence that there was an 'event' of some sort
around 5:30pm this afternoon.  We were out and when we came back, the
number of people on the modems was real low, like everyone had been
kicked off or Harmony had rebooted itself for no reason.  The modem
stats chart shows when this happened.  I have no idea what it was.

     Some reports are coming in that lines are disconnecting and then
people are getting ring no answers, then it clears up.  Until we track
this down I can't say much about it except keep on trying and
reporting the problems.

04/17/96 Wednesday 7:10pm EST

    Lightlink was taken down at 6pm for scheduled hardware upgrades.

    There will be further down times this week at the same time period.

     1.) New fast wide scsi card was swapped in for the original card
to see if this affects the prevalence of scsi bus errors that have
plagued us from the beginning.  They are mostly occurring on sd5 now,
which is the first of the fast wide drives.

     When the single wide drives for news were on light, most of the
scsi errors were happening on the single wide bus, so we never tested
a new double wide bus card.  But as soon as news was moved to Majesty,
the errors moved over to the double wide bus.

     If swapping the card fixes the scsi bus errors, then the old card
will be sent back for repair or burial.  If scsi bus errors continue
to happen, adjustments will be made to the scsi programming interface,
to see if that changes anything.

     1a.) New kernel (jesb) was installed which allows for additions
of more fast wide drives, and increases number of supportable virtual
domains from 64 to 128.  Scsi drives are presently in synchronous
mode, which generally exacerbates scsi errors, and precipitates
crashes from those that occur.

     Full tape backups are being done at least once and sometimes
twice a day, and incremental backups are being done hourly.

     There are presently no hard drive backups; those will start when
we get the next 4 Gig drive on line.

     2.) Drive sd4 was taken off the single wide scsi bus and returned
to Majesty for use as another 4 gig News drive.  Majesty now has its
full 16 gig news complement back in order.  sd4 was recruited from
Majesty during the crash of two weekends ago to hold crash remains.
The crash remains are now on tape and will remain so for a long time.

     3.) Harmony was reburned with new configuration settings, one of
which will hopefully allow Majesty to act as a password server for
Harmony when Light is down.  This will allow users to sign on to
Harmony from whence they can surf the web and read news.  If Light is
down they will not have access to their home, ftp or web directories,
nor their e-mail.
 
     4.) A new $100 gold plated radium cored SCSI cable was installed
on Majesty for the news drives.  Majesty by the way has never had a
scsi error, but it doesn't have any fast wides either.

     5.) Secondary name service was not working properly on Majesty
while Light was down, this will need to be fixed.

04/16/96 Tuesday 12:57pm EST

     Light rebooted to clear out jammed /dev/ttyp1's

04/11/96 Thursday 5:19pm EST

     Scsi bus went out of control at about 1:30pm.  Light was halted
and rebooted.  No damage.

     Scsi card is scheduled for replacement.

04/10/96 Wednesday 10:00pm EST

    ftpd daemon reset
 
04/10/96 Wednesday 2:47pm EST

     Modem banks reset themselves for unknown reasons bouncing
everyone.

     I was playing with modem 40 (14A) 277 9530 trying to see why it
was giving a ring no answer when suddenly the whole modem rack went
off line.  I was merely dialing in.

     When I first stuck the phone in 14A it was dead.  Then when I
tried it again, I had a dial tone.  I am going to repunch it just to
make sure the wires are in tight.
 
     There is also an odd occurrence in the modems.stat home page where
every once in a while the stats page registers zero people on line for
one 15 minute cycle.  Is it possible that Harmony is down during these
times and rebooting itself?
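
     One way to hunt for these dropouts would be to scan the 15 minute
counts for a lone zero with people on line on both sides of it.  What
follows is only a sketch in Python, assuming the counts have already
been pulled out of the stats page into a list.  The function name and
the sample numbers are made up for illustration and are not part of
the real stats software.

```python
def isolated_zero_cycles(counts):
    """Return the positions of single zero readings that have users
    on line in the cycle just before and the cycle just after."""
    hits = []
    for i in range(1, len(counts) - 1):
        if counts[i] == 0 and counts[i - 1] > 0 and counts[i + 1] > 0:
            hits.append(i)
    return hits


# Hypothetical users-on-line readings, one per 15 minute cycle.
sample = [12, 15, 0, 14, 13, 0, 0, 9]
suspect_cycles = isolated_zero_cycles(sample)
```

     A lone zero between two busy cycles looks like a reboot or a missed
poll.  A run of zeros is left alone here because it could be a real
outage that is already logged elsewhere.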

04/10/96 Wednesday 12:40am EST

     Got a ring no answer today, and a few reports of ring no answer or
pickup after 3 or 4 rings, which is really strange.

04/09/96 Tuesday 8:04pm EST

     News server reset with rebuilt install.

04/09/96 Tuesday 12:40am EST

    ftp daemon reset
 
04/08/96 Monday 5:03pm EST

    Majesty and News were taken down for 30 minutes to oil a squealing
fan cooling our news disk drive bay.

04/07/96 Sunday 11:42pm EST

    ftpd daemon reset.
 
04/06/96 Saturday 12:03pm EST

     ftpd daemon reset
     news server reset
 
04/05/96 Friday 3:31pm EST

     I crashed the modem bank (I think) looking for the bad modem.
Anyhow the whole bank went down, and harmony jammed.  Harmony was
rebooted, and modem banks 1 through 8 have been reburned with correct
settings.  To my chagrin the MultiTech management software does not
load the modems with config files dependably.  It leaves settings out
randomly as it does all the modems.  So each one has to be checked
carefully by hand and set again if it didn't take hold on the first
pass.

04/05/96 Friday 2:10pm EST

     Getting reports of a bad modem not letting people on.  Will
try to hunt down.
 
04/04/96 Thursday 6:03pm EST

     Light was rebooted to clear out a jammed tape process.

     I have been playing with the tape backup machine extensively to get a
good feeling for it while we program a system to automate the handling
and recording of tape backups.

     I did a tape erase just to see what it would do.  It started to
erase the tape (very slowly) so I decided to kill it.  Turns out it is
an unkillable process.  Rather than wait 5 hours for the job to
finish, I rebooted the system.  As light was halting it said,

    "Some processes wouldn't die."

     No sh*t Sherlock.

04/03/96 Wednesday 11:09am EST

    News should be fully functional at this time.

    Going to swap out the fast wide scsi board with a new one.

    If that don't work, we are going to swap out the CPU card.

     If that don't work, we are going to get rid of the fast wide bus
and see if a Sun can work with its own hardware.

04/03/96 Wednesday 02:56am EST

     Crashed again.

04/03/96 Wednesday 02:34am EST

     Light taken down to clear out jammed scsi bus.

     /dev/sd5a home directories caught in endless error loop.

04/02/96 Tuesday 9:01pm EST

    ftp daemon reset.

04/02/96 Tuesday 1:02pm EST

    Deer Park router is still down, some parts of the world
may not be accessible.

04/02/96 Tuesday 12:38pm EST

    Nysernet Ithaca-1 router was down from 10:57 to 11:23am this morning.
 
04/01/96 Monday 12:28pm EST

     Sprint is accepting mail again.
 
04/01/96 Monday 12:15pm EST

     News is still flakey.  Sporadic reports of people not being able
to access server.  Also Sprint is refusing to accept postings, they
are looking into it.  None are being lost, but none are being sent.
Postings are being sent to uunet so we are not totally locked out.
News is receiving fine.

     Same for secondary news feed, receiving fine, refusing to accept
outgoing.  They have been informed.

     New backup drives for light are on order.

     Hourly tape backups are being done.


03/31/96 Sunday 5:20pm EST

    Light rebooted to bring drives back up in asynch mode.

03/31/96 Sunday 3:06pm EST

    cron on light was down, web stats weren't being compiled, although
none were lost (except when we were down).  They are now up to date
running every hour.

    uucp was also down, it should be working again.

03/31/96 Sunday 11:37am EST

     News is presently up and down as we try to fix various
things with the history file.

03/31/96 Sunday 11:37am EST

     News was down over night due to incorrect permissions on alt.sex

03/31/96 Sunday 11:36am EST

     Lightlink was down from 7:30am to about 10pm on Saturday due to
loss of File Allocation Tables across 4 main drives.

03/29/96 Friday 2:44pm EST

     The external network connection to nysernet was down due to their
hardware failure for a few minutes between 2:00pm and 2:30pm.  They called
and said it was fixed and apologized for the outage.

03/29/96 Friday 12:10pm EST

     Mail was down for a few hours due to /var filling up from an
incoming spam of a user.

03/28/96 Thursday 12:25pm EST

     Light crashed during news expire.

     News will be moved off of light today some time.

03/28/96 Thursday 11:32am EST

    ftp jammed, filled up with interrupted processes,
preventing people from logging on to ftp.

     This has to be fixed!

03/22/96 Friday 1:49pm EST

     /var ran out of disk space.  news throttled, was down until now.

03/20/96 Wednesday 6:51pm EST

    Modem 3 producing ring no answer, now off line.
 
03/20/96 Wednesday 10:19am EST

    ftp server reset.

03/19/96 Tuesday 12:44pm EST

    Light crashed at 12:25pm during news expire.
 
03/19/96 Tuesday 09:00pm EST

     ftpd daemon jammed with broken jobs.  Reset

03/14/96 Thursday 09:39am EST

     Light crashed and rebooted itself at 1:10am last night.

     News was down for the night.

03/14/96 Thursday 12:48am EST

     ftp daemon reset.

03/12/96 Tuesday 2:25pm EST

    News was throttled from alt/binaries filling up from 12:00pm
this afternoon.
 
03/11/96 Monday 7:47pm EST

     /etc/nologin removed.
 
03/11/96 Monday 7:27pm EST

     Modem bank was down from 6:00pm til now.

     Preparing for installing of new modem bank.
 
03/09/96 Saturday 7:22pm EST

    Light reboot to clear out jammed tape job.

    Nynex busied out the bad modem, so the rack should be fine for the
moment.  He said there was a definite break at the Central Office somewhere.

03/09/96 Saturday 2:47pm EST

    It's a bad phone line, not a bad modem rack!
 
03/09/96 Saturday 11:14am EST

     The modem rack is divided into two sections by a bad slot at
modem 26.

     Modems 1-25 start at 277-5026

     Modems 28-48 start at 277-3567
 
     Modem 27 is dedicated and not available for public use.
 
03/09/96 Saturday 01:15am EST

     The modem rack was reset at about 12:30am bouncing everyone off.

     Slot 26 is bad, it's not the modem.  The modem tries to answer,
but the slot doesn't convey the information.  This produces a ring no
answer when you run into that modem.

     This divides the rack into two sections of about 25 modems each,
one starting at 277-5026 and one at 273-3567.  If you get
ring-no-answer on the first number try the second.  If you still get
ring no answer, then all modems are busy.

     Modems 9 and 17 were bad a few days ago, and resetting the rack
fixed that.  This indicates that there was nothing wrong with the
modems which had been individually reset to no avail, and that the
rack mother board is going bad.

     Today's travails are just more of the same.

03/08/96 Friday 5:43pm EST

     Modem 26 was the cause of the ring no answer.

     It would light up and pretend to answer, but it wouldn't.

     It is presently off line.

03/08/96 Friday 4:28pm EST

     We are having definite problems with ring no answer.

     That means you call up the modem banks, there are modems free,
but the phone just rings.  If you call up on one phone it will just
ring, but a second phone will get a modem.

     This is often caused by a modem refusing to answer in the middle
of a rotary, but every time I try to narrow it down to a modem, it
picks up and someone gets on!

     The ring no answer also happens when all modems are busy, because
there are 12 new lines at the end of the rotary that (sniff sniff)
don't have any modems on them yet.
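
     The behavior described above can be modeled roughly as follows.
This is a sketch in Python of how a rotary is assumed to hunt, not a
description of the actual phone company equipment.  The state names
("busy", "idle", "dead", "empty") and the function are invented for
the example.

```python
def place_call(rotary):
    """Hunt down the rotary and report what the caller experiences.

    rotary is a list of line states in hunt order:
      "busy"  - line in use, the hunt skips it
      "idle"  - working modem waiting, it answers
      "dead"  - modem or slot that rings but never picks up
      "empty" - line with no modem on it yet
    """
    for index, state in enumerate(rotary):
        if state == "busy":
            continue                       # hunt moves on to the next line
        if state == "idle":
            return ("connected", index)
        return ("ring no answer", index)   # dead or empty line just rings
    return ("busy signal", None)           # every line is in use


rotary = ["busy", "dead", "idle", "empty"]
first = place_call(rotary)

# While the first caller is ringing the dead line, that line counts as
# in use, so a second caller hunts past it.
rotary[first[1]] = "busy"
second = place_call(rotary)
```

     Under these assumptions the first phone lands on the dead line and
just rings, while the second phone gets the next working modem.  With
every real modem busy, the hunt falls through to an empty line at the
end of the rotary and the caller gets ring no answer instead of a busy
signal.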

03/08/96 Friday 1:14pm EST
 
     Light crashed from news expire at 12:36pm.

03/05/96 Tuesday 5:11pm EST

    Light rebooted to clear out jammed tape backup process.

03/04/96 Monday 11:08am EST

    Light taken down to remove 64meg of memory to move over to Majesty
for news server.

     Modems cold booted to help clear out 9 and 17.  Seems to have
worked.

    Harmony rebooted.

03/03/96 Sunday 1:03pm EST

    We have two modems that have gone sour, 9 and 17.

    They are presently off line.

    Apparently modem 17 has been bad since 3/1/96


03/03/96 Sunday 11:53am EST

     Apache web server shifted over to new logging.

03/03/96 Sunday 11:36am EST

     We suffered a bum modem this morning that prevented people
from getting on beyond that modem.  The phone would just ring and not
pickup.  That modem is now off line.

02/29/96 Thursday 09:26am EST

     Light crashed.

     A number of monitoring procedures we had in place to warn us that
light was down failed.  This probably resulted in a number of hours of
down time without awareness that we were down.  Just before light
crashed, the load had gone very high for unknown reasons, possibly
massive spamming through our remailer, which is now off line until
further notice.

     Unfortunately two web hit log files were lost during the crash,
and we are not presently keeping backups of them because they are so
massive.  With the new web hit software in place this will change.  In
the meanwhile my apologies to those who lost web hit files.

02/27/96 Tuesday 3:50pm EST

    Light was rebooted to clear out misbehavior of the server
that wouldn't fix on its own.

02/27/96 Tuesday 3:08pm EST

     All referer and agent logs have been temporarily turned off as
the server is crashing from them apparently.

02/27/96 Tuesday 2:01pm EST

     The apache server has run out of file handles.

     This will take some serious revamping to fix.

     This caused the server to not respond.

02/27/96 Tuesday 1:11pm EST

     apache web server reset

     Added new mime.types for x-director
 
02/26/96 Monday 5:00pm EST

     ftp server reset
 
02/23/96 Friday 9:22pm EST

     Apache server reset.
 
02/23/96 Friday 12:31pm EST

     Light crashed at 12:07pm during news expire.
 
02/22/96 Thursday 1:29pm EST

     FTP server reset, everyone bounced.
 
02/22/96 Thursday 01:58am EST

     News was killed momentarily and rebooted.

02/20/96 Tuesday 12:58pm EST

     I had to kill news a few times to get the expire to start working.

     Something has to change around here.

02/20/96 Tuesday 12:21pm EST

     Man was that a mess.  The process which crashed light last night
managed to start itself up from cron again, even though I had disabled
the cron file.  I guess I failed to restart cron so it had it in
memory.

     This process took the load to 6.0 at around 2pm.  This slowed
everything down to a point where the various hard drive backups and
tape backups didn't complete on time.  Thus the tape backup started
while the hard drive backup was still going, causing things to slow
down even more.  Then the tape backup failed to complete by the time
other jobs started at 6am which failed to complete by the time news
expire started at 11am, and by 11:30 the load was 12.0 and EVERYTHING
was still running and nothing was getting done.
 
     I rebooted light to clear it all out, after trying to tear it
apart process by process.  There is only so much you can do with an
autopsy.

02/19/96 Monday 11:27pm EST

    Light went out of control again for reasons that are not clear.

    (This was later determined to be caused by a user program scanning
the news spool for data.)

02/19/96 Monday 10:57pm EST

     Light went out of control.  I tried to bring it down gracefully
and was barely able to halt it in time before it crashed for real.

    Reasons unknown.

02/15/96 Thursday 12:14pm EST

    Light was down for 10 minutes to oil a noisy fan in the news drive
bay.  It may still need to be replaced.

02/14/96 Wednesday 9:56pm EST

    Light crashed at 9:34pm probably caused by my moving a disk drive
bay around.  SCSI cables suck even when they cost $99 each.

02/13/96 Tuesday 01:24am EST

     Web server was down for 10 minutes for a recompile, to add
referer and agent logs.
 
02/10/96 Saturday 11:43pm EST

     News was taken down for 5 minutes to check load averages used
by news.
 
02/10/96 Saturday 01:58am EST

     The outbound mail queue was lost last night due to a spam.

     The mail queue holds outgoing mail that can not be delivered for
various reasons at the receiving end.  The sendmail program tries
every hour to send the mail for many days and then quits.

02/08/96 Thursday 11:51pm EST

     Apparently news throttled itself when /var filled up.  I did
not see this and so news was down all afternoon.

02/08/96 Thursday 4:03pm EST

     /var ran out of spool space today for a few minutes.  This
has been temporarily fixed.  I also moved the www log directories
to a new partition, so the web server was down for about 5 minutes.

02/06/96 Tuesday 12:09pm EST

    News was down for about an hour as we made preparations for moving
it to Majesty.

02/05/96 Monday 12:09pm EST

     Light crashed at 11:45am during news expire.  There was a huge
load spike just prior indicating that the apache server had gone out
of control; however, all log records from 11:03am on were eradicated
when the disks were fsck'd.

     The apache server is supposed to kill off excess servers and not
allow them to go over 50 in any case; usually such a crash indicates
300 or more.

     I am going to install a little script that will reset the apache
server any time the servers go over 50 as part of the monitor program.
Monitor data can be found in /var/log/monitor and is open to public
view through our http://www.lightlink.com/stats.html home page.
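
     The check itself could be as simple as the sketch below.  It is
written in Python for illustration and is not the script that was
actually installed.  It assumes the monitor can hand it a list of
running command names, and that the server processes show up as
"httpd".  Both of those, and the function names, are assumptions.  The
restart step is left out since it depends on how the server is
started.

```python
APACHE_LIMIT = 50   # ceiling on server processes, per the entry above


def count_servers(command_names, name="httpd"):
    """Count how many running processes are web server children."""
    return sum(1 for command in command_names if command.strip() == name)


def needs_reset(command_names, limit=APACHE_LIMIT, name="httpd"):
    """True when the server count has gone over the limit."""
    return count_servers(command_names, name) > limit


# Hypothetical process list: 51 servers plus a couple of other daemons.
running = ["httpd"] * 51 + ["sendmail", "innd"]
reset_now = needs_reset(running)
```

     Keeping the decision separate from the restart means the same
check can be logged by the monitor on every pass, whether or not a
reset was needed.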
 
02/04/96 Sunday 1:08pm EST

     Kernel reinstalled yet again, light rebooted.
 
     Now we are running in Asynch mode again.

     OK hopefully this will handle the crashes.

02/04/96 Sunday 1:00pm EST

     jes kernel reinstalled scsi set to 0x58 and light rebooted.

02/04/96 Sunday 12:19pm EST

    Light crashed 11:46am.

    Tried to reset scsi bus to asynch mode, but it doesn't
seem to be taking.

    It got reset from asynch back to synch during the last install
of the new kernel.

02/03/96 Saturday 4:55pm EST

    Light crashed 4:31pm
 
02/03/96 Saturday 1:41pm EST

    Top 18 modems were disconnected and reseated in the correct
order.

02/01/96 Thursday 4:05pm EST

     Light crashed 3:51pm
 
02/01/96 Thursday 11:26am EST

     Harmony is failing to do name service properly for unknown reasons.

     OK, this is fixed.

02/01/96 Thursday 12:34am EST

     Web server was up and down a few times as we played with new
installation.

     No permanent changes yet.

01/31/96 Wednesday 10:29am EST
 
     Light was rebooted to install new kernel to fix bug in tape drive
software, and install NFS to bring Majesty on line.

01/30/96 Tuesday 12:32pm EST

     ftp jammed for a while due to too many zombie processes left
over from something or other.  I killed all ftp processes, perhaps
bumping a few real ones in present time.

01/26/96 Friday 11:26am EST

     11:00am

     Modems were rebooted and reburned with correct init strings.  All
modems visually checked for correct init speed and speed.  Please report
incidents of garbage on the screen during login immediately.

      Lower 39 modems have retraining off, top 9 modems have retrain on.

      Harmony rebooted to clear out defunct routes in the cache.

      Light rebooted.

01/25/96 Thursday 4:17pm EST

     Light rebooted.  We were getting load spikes, signifying
impending crash.  Have moved swap space back off of /sd3b to /sd0b.
If this clears up the spikes then that means sd3 is bad.

01/25/96 Thursday 11:38am EST

     Light crashed at 11:25am from news expire during expireover.

     We are getting a second Sparc 20 which will carry news, so
hopefully this will move the crashes off of light to that machine,
until we can find out what is going on.

01/24/96 Wednesday 7:04pm EST

     CSU/DSU was brought off line for 2 minutes by accident via a
loose power plug while moving it.
 
01/21/96 Sunday 10:50pm EST
 
     Light went unstable with spike loads of 101 or so.  I was
barely able to halt it gracefully, nothing would 'vfork', and we
were getting stack errors on the swap partition.

     This was AFTER the move to /sd4.

01/21/96 Sunday 7:36pm EST

     After the crash this morning during news overview expire, we
started to get repeated load spikes indicating near crashes.

     Since the only change that was made was to move the overview data
base to /sd3a, this indicates that /sd3a is causing troubles.

     So now we are moving it to /sd4a.

     At the same time the erotica groups will be moved from /sd3d to /sd4h,
and /sd3 will be taken fully off line to see if the system
stabilizes.

     No news will be lost, but news will be fully down while the
transfer takes place.
 
01/21/96 Sunday 11:56am EST

    Light crashed during news expire at 11:56am.

     It crashed during expiry of the overview data base, which is
presently on /sd3a.  When I first moved it to /sd3a a few days ago I
noticed a number of high load spikes every few minutes which is
usually a sign of instability and 'near' crashes.

     It is possible that /sd3a is causing some of our problems.  I may
consider moving the overview data base once again to /sd4a, taking
/sd3a out of the loop.
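
     Spotting these spikes by eye gets old.  A rough sketch in Python
of how they might be flagged from a list of load average samples is
below.  The function name, the factor of 3 and the floor of 1.0 are
all guesses chosen for illustration, not settings from our monitor
program.

```python
from statistics import median


def find_spikes(samples, factor=3.0, floor=1.0):
    """Return positions of samples well above the typical load.

    A sample counts as a spike when it exceeds the larger of
    factor times the median load and the fixed floor.
    """
    if not samples:
        return []
    threshold = max(median(samples) * factor, floor)
    return [i for i, load in enumerate(samples) if load > threshold]


# Hypothetical load averages taken a minute apart.
loads = [0.8, 0.9, 1.0, 7.5, 0.9, 1.1, 8.2, 1.0]
spikes = find_spikes(loads)
```

     The median is used as the baseline so that a few spikes do not
drag the baseline up and hide themselves.  The floor keeps an idle
machine from flagging every small blip.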
 
01/19/96 Friday 6:19pm EST

     News overview data base is being rebuilt, to allow threaded news
readers like tin to access old news.

01/19/96 Friday 3:39pm EST

     Light taken down for 10 minutes to move news overview data base
to /sd3a.  This caused the destruction of the old data base.

     No news was lost, however old news is not accessible until
overview data base is rebuilt.

01/18/96 Thursday 9:50pm EST

     News was down for posting for 7 hours while the history data base
was rebuilt.  Obnoxious.

01/13/96 Saturday 3:45pm EST

     Light went out of control at 3:33pm from SCSI Bus errors on sd5a.

     I tried to take it down gracefully, but to no avail.

01/12/96 Friday 5:30pm EST

     News has been unreadable from modems 37 through 48 for a while.

     We have been running an open nntp server until recently, which
means everyone in the world can read news from our server.

     In trying to locate where our outgoing bandwidth is going, I
turned this off temporarily.  By mistake modems 37 through 48 had
never been enabled to read news, and were only enabled by default of
everyone being able to read.  When I closed off the open access, those
modems were locked out also.  They are now properly enabled as
themselves.

     I may reopen the news port in the future; our bandwidth is NOT
being consumed by news reading by external sites.
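
     The lockout above can be sketched as a first-match access list.
This is only a minimal illustration: the host names, the pattern
syntax and the function name can_read are hypothetical, and it is an
assumption that modems 1 through 36 had explicit entries.  It is not
the actual nntp server configuration.

```python
from fnmatch import fnmatch


def can_read(host, rules):
    """Return the permission of the first rule whose pattern matches host."""
    for pattern, allowed in rules:
        if fnmatch(host, pattern):
            return allowed
    return False  # nothing matched: deny


# Hypothetical names; assumption: only modems 1-36 were listed explicitly.
explicit = [("modem%02d.example" % n, True) for n in range(1, 37)]

open_rules = explicit + [("*", True)]     # open server: everyone may read
closed_rules = explicit + [("*", False)]  # open access turned off
fixed_rules = [("modem%02d.example" % n, True) for n in range(1, 49)] + [("*", False)]

print(can_read("modem40.example", open_rules))    # allowed only via the default
print(can_read("modem40.example", closed_rules))  # locked out with the default
print(can_read("modem40.example", fixed_rules))   # enabled as itself
```

     The point of the sketch is that a host which is allowed only by
the catch-all entry loses access the moment the catch-all is closed.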

01/12/96 Friday 4:02pm EST

     Light was taken down at 2:40pm to install new Forced Perfect
Terminator on SCSI bus.

     Rebooting failed a few times due to typos in rc.local.

     Later permissions were incorrectly changed on root directory
locking a few people out for a few minutes.
 
01/11/96 Thursday 6:16pm EST

     News expire is presently acting strange.  Expireover went for
7 hours and did not complete.  I killed it.

01/06/96 Saturday 12:29pm EST

     Light crashed at 12:06pm during news expire.

     Installed new kernel with vif=64.

     adb -w /vmunix
     scsi_options?W 5*
     $q

     Presumably that sets the scsi bus to asynch mode.

12/31/95 Sunday 11:12am EST

     Web server was down from 11:00pm last night due to a failure
to restart it properly.

12/27/95 Wednesday 07:35am EST

     Light taken down for 10 minutes to replace scsi cable.

     We are now running with two Gold Cables.

     Tape drive is first in the chain.

12/26/95 Tuesday 12:09pm EST

     Light crashed at 11:55am during news expire.
 
12/24/95 Sunday 12:07pm EST

     Light crashed at 11:58am during news expire presumably from scsi
bus error.

12/20/95 Wednesday 09:25am EST

     Light taken down at 8:00am for system work on scsi bus.
 
12/19/95 Tuesday 10:17am EST

     Light crashed from scsi bus errors during news expire at 10:00am.
 
12/14/95 Thursday 09:28am EST

     Light crashed from Scsi bus errors during news expire at 9:07am.

     I took the opportunity to install two new gold plated cables ($99
each), so that sun techies can stop asking us if we are using lousy
cables.

12/12/95 Tuesday 3:59pm EST

     Light taken down to bring news back on line.

     Before putting the disks on line, I put the main scsi cable in
light and terminated it with the active terminator.  Light would not
boot, but gave nasty errors resulting in needing to manually fsck
/dev/rsd0g.

     I replaced the cable with an earlier 6 foot cable, and did the
same thing and light booted fine.  Then I attached the news drives,
and booted again.  The tape drive is still off line.
 
12/12/95 Tuesday 12:17pm EST

     I took light down to take off the news drives which are suspected
of causing the scsi bus errors.  When I went to reboot, it wouldn't
reboot.  It wouldn't load vmunix, when it did load vmunix it said it
couldn't find /sd8d which is on another bus entirely.

     I finally had to take the tape machine off the external scsi bus,
and then it booted fine.

     Something is very strange in Denmark.

12/12/95 Tuesday 09:17am EST

     Harmony taken down to install 12 new lines.
 
12/10/95 Sunday 08:45am EST

     Light taken down to reinstall sd0.  Going to go back to original
factory configuration and take external drives off line to see if
errors still happen.  This will require that news be down.
 
12/09/95 Saturday 12:04pm EST

     After replacing the internal cable and terminating board, light
became more unstable than before, producing load peaks of 20 (near
crashes) every 5 minutes or so during news expire.  I took light down
and put back in the old terminator and cable, and also took the drive
tower apart to verify that all termination settings were correct, and
reseated all scsi cables.
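
     Spotting these load peaks can be sketched as a scan over load
average samples.  This is a minimal sketch: the sample values are made
up for illustration, the function name is hypothetical, and the
threshold of 20 is simply the peak value mentioned above.

```python
def find_load_spikes(samples, threshold=20.0):
    """Return (index, load) for each sample where the load first
    reaches the threshold after having been below it."""
    spikes = []
    previous = 0.0
    for i, load in enumerate(samples):
        if load >= threshold and previous < threshold:
            spikes.append((i, load))
        previous = load
    return spikes


# Made-up load averages, one per minute.
samples = [1.2, 1.5, 22.0, 23.5, 2.0, 1.8, 21.0, 1.1]
print(find_load_spikes(samples))
```

     Consecutive samples above the threshold count as one spike, so a
peak that lasts a few minutes is only reported once.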
 
12/09/95 Saturday 09:38am EST

     Light crashed, and then crashed again on reboot.

     Anyone want an ISP cheap?

12/09/95 Saturday 09:08am EST

    Light taken down to replace internal scsi terminator back plane.

    Got a scsi error 10 minutes after coming up again.

    Took light down again to replace internal scsi cable.

    That's the last of it, there is nothing more to replace except
for the entire external disk drive set.

    Homer

12/08/95 Friday 2:37pm EST

     Sendmail was down for about an hour from a uucp typo.

     Sorry.
 
12/07/95 Thursday 12:51am EST

     Light crashed at 11:36pm.
 
12/05/95 Tuesday 08:13am EST

     Light taken down to unplug and reseat internal scsi cable from
motherboard and pc hard drive termination board.  System running with
cable in place but no floppy nor CDROM (and no internal hard drive).

     Found permissions set wrong on web counter directory, causing web
counters to fail.  Web counters should be working correctly now.
 
12/05/95 Tuesday 02:29am EST

     Coffee break 1:53am
 
12/04/95 Monday 12:55am EST

    Light took coffee break at 12:14am
 
12/03/95 Sunday 1:55pm EST

     Light was rebooted with new kernel.

     At the same time, just after compiling new kernel, tape backup
became jammed, couldn't kill the process at all.

     Then shutdown jammed, until I turned off tape drive.

     Then boot -s jammed with "can't find swap on sd8b, no such etc."

     Turned everything off, and on, and boot worked fine.

     I changed the MAXUPROC from 25 to 256 in sys/param.h in kernel.
Apache .8 has an apparent limit of 26 virtual hosts, but the change to
256 did not fix the problem.

     I also changed the max process id from 30000 to 300000.

     I wonder what happens when the machine runs into a max process id?
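
     As a rough answer to that question: Unix kernels typically wrap
the process id counter around and skip ids that are still in use.  The
sketch below models that general behaviour only.  It is not the kernel
code on light, and restarting at 1 is an assumption (real kernels may
restart at a different low number).

```python
def next_pid(last_pid, in_use, max_pid=30000):
    """Return the next free process id after last_pid, wrapping
    around once max_pid has been passed."""
    candidate = last_pid
    for _ in range(max_pid):
        candidate += 1
        if candidate > max_pid:
            candidate = 1  # assumption: wrap back to the bottom
        if candidate not in in_use:
            return candidate
    raise RuntimeError("no free process ids")


print(next_pid(29999, {1, 2}))  # still below the limit
print(next_pid(30000, {1, 2}))  # wraps and skips ids 1 and 2
```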
 
12/03/95 Sunday 11:44am EST

     Tape backup failed again with sd3 out of the loop.

     Now we take sd1 off line, which is the main news drive, so
news will be down for a while.

     Homer
 
12/03/95 Sunday 09:27am EST

     Light took a coffee break at 5:37am, rebooting itself.  The tape
drive was off line, so it is not the tape drive electronics.

     I took light down at 7:00am to move /usr/local/ off of /sd3a and
over to /sd8d.  sd3 has been taken off line, as it is suspected as
being the culprit for the crashes.

     News is a mess because light keeps crashing during expires, so we
may have to clean the slate and start over.  I may institute a
policy whereby newsgroups are kept for a shorter period of time,
except for those explicitly being read or requested.  This will
greatly aid in cleaning up the mess after such crashes.
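
     Such a policy could be sketched as follows.  The retention
periods, the group names and the function names are all hypothetical
and only illustrate the idea; no actual numbers have been decided.

```python
def retention_days(group, requested, default_days=3, requested_days=14):
    """Groups being read or requested are kept longer than the rest."""
    return requested_days if group in requested else default_days


def should_expire(group, age_days, requested):
    """True if an article of this age in this group should be expired."""
    return age_days > retention_days(group, requested)


# Hypothetical set of groups users have asked for.
requested = {"comp.sys.sun.admin"}
print(should_expire("misc.test", 5, requested))
print(should_expire("comp.sys.sun.admin", 5, requested))
```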

     cgi's are still not working.
 
12/02/95 Saturday 8:21pm EST

     Light was rebooted to try and solve cgi failures by increasing
kernel allocation tables (maxusers = 200).  Cisco was brought off line
to stop incoming hits on web site.  Neither worked.

     At this time we have no idea why cgi's are not working.

     Homer
 
12/02/95 Saturday 7:57pm EST

    Cisco was rebooted again.
 
12/02/95 Saturday 3:26pm EST

     Cisco was rebooted and Harmony was rebooted, kicking everyone
off, in order to clear out the old arp tables and install the new
ethernet address of the motherboard.

     Prior to this the virtual domains were not working at all.
 
12/02/95 Saturday 2:22pm EST

     Light was taken down for 10 minutes to swap /sd3d and /sd7a.
sd3d was the backup partition and sd7a was the erotica partition.  Now
it's the other way around.  The purpose of this is to decrease the
backup time as /sd3d is single wide and /sd7a is double wide.

     In the process all .pictures.erotica were lost.
 
     News is still down and will continue to be down for a while.

12/02/95 Saturday 1:36pm EST

     Light was down to replace the mother board for a few
hours this morning.  Unfortunately, this did not repair the problem.

11/28/95 Tuesday 07:59am EST

     System down for lab time from 7:30 till now.

     Main boot drive sd4 was moved from scsi id 5 to 3.  The system
boots by default from 3.  This increases the chances the system will
reboot itself after a severe crash.

     All swap space has been removed from the single wide scsi bus, to
take the pressure off the bus and to decrease the likelihood that a
scsi error will happen on the swap partition, an event most likely to
crash the system.

     Sun has offered to send us a new motherboard which will be
installed when it arrives.

11/27/95 Monday 2:07pm EST

     Due to a typo, sendmail was not functioning correctly for about
3 hours until now.
 
11/27/95 Monday 12:22pm EST

     Light crashed at 2:00am from scsi bus errors.

     We were down until 7:00am.
 
     Home drive is out again, however it is unlikely it is the source
of the problem.

     Sun is sending a new mother board.
 
11/26/95 Sunday 08:37am EST

     System down for lab time from 8 until now.

     We put the home drive back in to see if it would fail to
boot because it was cold.  It booted fine.

     We then took light down for 30 minutes and turned it off to
see if it would fail to boot because it was cold.  It booted fine.

     The cold idea came from the fact that, on the night of the
blackout during the storm, light would not boot after being off for
about 30 minutes.

     At this time I have no idea what is going on, we are still
getting errors and are considering how to best approach the problem
with minimum down time.


11/25/95 Saturday 08:37am EST

     System was down for lab time from 7:30 until now.

     We successfully moved all of the home drive partitions to the
sd4, and got the machine to boot from sd4.  Not obvious.

     If the system proves stable in this configuration, the home drive
will be reinstalled, formatted to erase all its data, and sent off for
replacement.
 
      News is still down for another 30 minutes or so.

      Harmony was stable over night.

11/24/95 Friday 08:53am EST

     The system was down for lab time from 7:30am until now.

     Harmony was stable over night.

     We moved the /usr and /var (mail) partitions to sd4 to get them
off the failing home drive sd0.  We are still booting from sd0 but all
important partitions have been moved in anticipation of its
replacement.

     Harmony was rebooted to set its configuration file correctly, for
the new subnets.  However apparently a bug in the harmony code
prevents it from reading in subnetted masks properly, so they all have
to be added by hand after the booting takes place.  This will require
more work and a few questions to Tech Support at Xylogics.

11/23/95 Thursday 08:55am EST

     System was down for lab time.  Harmony again lost its default
route to the internet twice over night.  We think we know what is
causing it, but not why.

      The route loss seems to be connected with hand entering the route
after deleting it for test purposes.  When the route is added through
the normal configuration procedure during boot up it is stable, if it
is then deleted and reentered by hand it seems to have a time to live.

      We will query Xylogics tech support about this on Monday.

11/22/95 Wednesday 10:05am EST

     For reasons that are still unclear, Harmony lost its default
route out to the internet this morning at about 8:30am.  It should be
working again at the moment.  We may have to reboot harmony if things
continue to be unstable.

     Thanks to all who brought this to my attention.

11/20/95 Monday 08:55am EST

     System was down from 7am until now for lab time.

11/17/95 Friday 12:19pm EST

     CSU/DSU was brought off line resulting in a 20 second
disconnection from the internet.

     Since the storm it has been getting errors, Sprint will be
testing it this morning between 6am and 7am.  During that time there
will be interruption of service to the internet.

11/16/95 Thursday 10:22am EST

     System was down from 8am to 9am for system work to rebuild new
home drive.

     Expect continued down times during early morning hours.

     News was down until now.

11/15/95 Wednesday 04:58am EST

     System was down for 2 hours due to a power outage in upper
college town.  We then had a LOT of trouble bringing it back up again,
possibly due to the same problem causing intermittent scsi bus
errors, which are going to be dealt with terminatedly in short order.

11/12/95 Sunday 07:38am EST

    Cisco router was rebooted intentionally to clear out a problem
with config file transfers.  It didn't fix it.  Connection to the internet
was down for about 20 seconds.

11/08/95 Wednesday 07:18am EST

     System was down for an hour until now to add the new disk drive.
All four 4gigs are now in one tower, this got rid of a SCSI cable
bringing the total to two on scsi bus one.  The system has been stable
for a while, but it is not yet clear why, I believe the scsi bus
errors have merely gone dormant.  We have taken a load off the home
drive by moving the swap partition to another drive, this may have
contributed to the quieting of the errors.

     Please continue to expect down times during early morning hours
between 3am and 7am.
 
11/06/95 Monday 09:17am EST

    System was down from about 6:00am to 7:00am to begin move of
home drive over to new drive.  It is suspected that the home drive
is going bad which is causing all the SCSI bus errors.

     The system will be down every morning for some time until this is
resolved, between 3am and about 8am.

11/05/95 Sunday 10:09am EST

     An error has been found in the Virtual Domain code, rebooting
to repair.

11/05/95 Sunday 07:32am EST

    Going down again.

     Swap partition moved to /dev/sd7b in preparation for moving root
directory to /dev/sd7.
 
     My guess presently is that sd0 main drive is bad.

11/05/95 Sunday 06:52am EST

     Going down to replace swap partition.

11/04/95 Saturday 03:06am EST

     System down for 10 minutes to change SCSI cables.  All cables
are brand new.  Next check will be to change the swap partition off the
home drive.

11/03/95 Friday 05:12am EST

     System was down for about 1 hour while we worked on scsi cables.
 
11/02/95 Thursday 12:02pm EST

     The system crashed for unknown reasons, and rebooted itself,
probably due to sporadic I/O errors we have been getting on the
primary SCSI bus.  The system will be down tonight to work on the
problem.
 
10/30/95 Monday 04:58am EST

     News was down for 2 hours due to disk full errors.  Sprint has
been down again, and maybe when they came back up, we got flooded with
backlog that would have otherwise expired.
 
10/25/95 Wednesday 02:14am EST

     System was down for 2 hours due to system crash.  Apparent cause
was badly seated cables, complicated by resulting corrupted super
blocks on 2 drives.

10/21/95 Saturday 04:51am EST

     Light rebooted to increase virtual domains to 32.

     Down for 20 minutes while I forgot to bring it back up again,
having too much fun with new color scanner.
 
10/04/95 Wednesday 12:45am EST

    I crashed light playing with security holes.
 
10/02/95 Monday 04:48am EST

    Harmony was rebooted to increase time out to 40 minutes.
 
09/27/95 Wednesday 01:52am EST

     Harmony was down for about 15 minutes while we tried a security
upgrade.  The upgrade failed.  We are running as normal.

     Light was rebooted during the process.

09/26/95 Tuesday 03:35am EST

     We had a runaway process last night send the system load to 4.0
and then to 7.0.  This made things very slow.

09/16/95 Saturday 01:38am EST

     System was down for 2 hours for upgrades.  See news for details.
 
09/13/95 Wednesday 01:04am EST

     Routes out of light to the internet were down for about
30 minutes by accident.

09/08/95 Friday 9:15pm EST

    web server was down for about 2 hours by accident.
 
09/02/95 Saturday 4:59pm EST

     An error in programming caused all password files to be renamed
improperly, resulting in momentary loss of the ability to sign on to
the system, and/or execute certain programs.
 
08/31/95 Thursday 3:23pm EST

     System rebooted to physically rearrange drives back into place
after yesterday's fiasco.
 
08/30/95 Wednesday 8:46pm EST

     We suffered a total power outage this afternoon around 3pm.  A
transformer caught fire down the block towards college town.  The
system stayed up for 20 minutes into the power blackout, and then we
had to shut down as the backup batteries had run out.

     During this time I chose to do some back up maintenance on two
micro fans cooling disk drives that were beginning to whine and make
noise, a sign that they would soon jam risking overheating and damage
to the power supplies.

     When I went to hook the whole system back together about 30
minutes after the power came on, I found that it would not boot.  This
was apparently caused by SCSI connections gone bad.  It took me three
hours or so to get the system working again, after trying many
different cables.  The final answer was to cable the various
components in a different order.  A very dissatisfactory solution.

08/28/95 Monday 1:30pm EST

     System rebooted at 2am this morning to check out new virtual
domain kernel.  It didn't work, so system rebooted again with normal
kernel.

08/24/95 Thursday 1:56pm EST

     News was down until 8am this morning, to rebuild the
history data base.

08/23/95 Wednesday 8:18pm EST

     News has been down for 2 hours and will continue to be down
for another 2 hours or so.

08/23/95 Wednesday 02:34am EST

     We managed to get the modems converted to 115200 without bringing
down the system, however during the transitions some of the modems
were set wrong, and some people may have gotten rings without picking
up, or gotten garbage on the screen from mis-set modems.  All should
be working now.

08/21/95 Monday 6:45pm EST

     System was down for 2 hours for installation of 16 more modem lines.

     New modems have not yet been configured.

08/20/95 Sunday 5:43pm EST

     System was down for 3 hours for upgrades.

     Annex was rebooted at end of upgrade.

     See 'news' for details.


08/13/95 Sunday 7:50pm EST

     News was not responding to slip requests since last night
due to an error in a configuration file.

08/12/95 Saturday 2:26pm EST

     System rebooted to clear out some things.

08/11/95 Friday 4:35pm EST

     FTP was broken for about 2 hours today while we tried to install
the new version.  We are presently running the old version.
 
08/10/95 Thursday 1:47pm EST

     System was taken down for about 30 minutes for upgrades.  The
annex was also rebooted to install new settings.
 
08/09/95 Wednesday 2:42pm EST

     News was down from about 12:00am last night till 2:30am when I
spotted it.  Reading was not affected, but incoming articles were
stopped.  The cause was an incorrect permission on the news directory
junk which caused a failure to write to it.
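
     A quick check for this kind of problem can be sketched as below.
It is a minimal sketch: the function name is hypothetical, and a
temporary directory stands in for the real junk directory, whose path
is not given here.

```python
import os
import stat
import tempfile


def describe_dir(path):
    """Return the directory's mode string and whether the current
    user can create files in it (needs write and search permission)."""
    mode = stat.filemode(os.stat(path).st_mode)
    writable = os.access(path, os.W_OK | os.X_OK)
    return mode, writable


# Stand-in for the news junk directory.
with tempfile.TemporaryDirectory() as spool:
    os.chmod(spool, 0o755)
    mode, writable = describe_dir(spool)
    print(mode, writable)
```

     To be meaningful the check has to run as the user that owns the
news processes, since os.access answers for whoever calls it.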

08/06/95 Sunday 8:24pm EST

     System down for reboot to bring two new fast wide scsi drives on
line.

08/05/95 Saturday 8:04pm EST

     System down for 30 minutes to install fast wide scsi bus card.

08/05/95 Saturday 7:32pm EST

     I created my first mail loop today.  Took the system to load 31
or so, for a few minutes.

08/04/95 Friday 7:08pm EST

     T1 was down for 5 minutes while we rebooted the T1 CSU/DSU modem.
We have a defective T1 modem, nothing serious, but I can't log on to
it.

     We are also getting data errors higher than normal which is a
completely separate matter.

     The modem will be replaced shortly, and the data errors are being
looked into by Sprint.

08/04/95 Friday 3:34pm EST

     System down for 30 minutes.
 
     I had to bring the system down to clean up an error I made.

08/03/95 Thursday 7:16pm EST

     System rebooted to clean things up.
 
08/03/95 Thursday 7:05pm EST

     I crashed the system.

07/30/95 Sunday 7:07pm EST

     News was down for about an hour.
 
07/29/95 Saturday 8:25pm EST

     System rebooted.