September 22, 2000: Brief AMAP outage in the wee hours of the morning
Shortly after 3:00am, our DS3 line to the Austin Metro Access Point (AMAP) lost its connection. This line provides local peering with other providers in the Austin area. While the line was down, access to the rest of the 'Net was not affected in any way. After contacting the provider for the line, and some troubleshooting, the connection was restored at approximately 5:45am.
September 17, 2000: Network connectivity outage
A DoS (Denial of Service) attack against one of our userhosts (fnord.io.com) caused IO's upstream connectivity to be flooded. After a bit of troubleshooting, the 'Net connections were returned to normal operations. This outage lasted from approximately 12:30am to 2am Austin time, Sunday morning.
September 16, 2000: Compuserve responds!
Routes to compuserve.com and associated hosts appear to have been mostly fixed. IO sent out requests to our upstream providers, and had them contact their upstream hosts, finally determining that someone, somewhere on the 'Net, was broadcasting bad routes as being 'good.' Looks like someone fixed one or more of their routers, as we are now able to connect (but we never heard anything back from whoever made this fix!)
September 15, 2000: Compuserve unreachable
This is apparently also affecting Exodus' Austin facility as well as Insync's Houston facility (because those are the two upstream providers our packets are going through.)
Looks like someone out there on the 'Net is broadcasting bad routes for uunet and/or compuserve. There's currently no way to reach compuserve.com via any of our upstream providers. We've tried routes both from Austin and Houston (two separate networks), and each ends up trying to cross the same broken bridge.
The nameservers for compuserve.com are located internally on their network, and because we (io.com) cannot reach their network, compuserve.com appears as an 'unknown host or domain.'
The stopping point between our networks is at a router owned and operated by UUNET/Alter.net/Worldcom. Notices have been sent to all known contacts for that network, as well as to IO's upstream providers.
September 13, 2000: Filesystem Reboot
An unscheduled reboot of bavaria, our main filesystem, occurred this morning at 10:18. Bavaria came back online at 10:23am. During the 5 minutes of downtime, all files were inaccessible. Customers logged into the userhosts may have had to drop their connection and log in again when their home directories became available. Mail has queued up on deliverator as a result, so there may be some latency in mail delivery over the next hour or so.
September 6, 2000: Webserver problems
One of the three machines that collectively act as www.io.com was malfunctioning this morning. After about a 5-minute reboot time, www-03.io.com has recovered. Before the reboot, problems were spotty because two of the machines had no problems, and the third one that was malfunctioning would show errors only when it happened to be the particular machine being called. Uptime previous to this problem for that particular machine was 128 days.
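The "spotty" behavior described above can be illustrated with a small simulation. This is only a sketch: it assumes requests rotate evenly across the three machines, which the entry does not state, and the names www-01 and www-02 are hypothetical (only www-03.io.com is named above).

```python
from itertools import cycle

# Hypothetical backend names; only www-03 is named in the entry above.
BACKENDS = ["www-01", "www-02", "www-03"]
BROKEN = {"www-03"}


def simulate(requests: int) -> float:
    """Return the fraction of requests that land on a broken backend.

    Assumes a strict round-robin rotation, which is an illustration only.
    """
    rotation = cycle(BACKENDS)
    failures = sum(1 for _ in range(requests) if next(rotation) in BROKEN)
    return failures / requests


if __name__ == "__main__":
    print(f"{simulate(300):.0%} of requests hit the malfunctioning machine")
```

Under that assumption roughly one request in three fails, which would match customers seeing errors on some page loads and not on others.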
August 30, 2000: Network Outage
12:56pm: All services are back online. Total approximate downtime was 15 minutes. Not yet certain exactly what happened, but it looks like a flood of traffic came into eris for some reason.
We are currently (12:32pm) experiencing network problems. This is affecting inbound as well as outbound traffic. This page will be updated when more information becomes available.
August 18, 2000: Dialup Outage -- Follow-up
After speaking with Time Warner about the brief outage we experienced, they have diagnosed the problem as having been caused by a signal, sent at 6:30pm, telling the south switch that our PRIs are on to terminate. At that time, all connections to our Austin dialup servers were dropped and no one was able to log in for approximately 10 minutes. Connectivity returned to normal once Time Warner corrected their error. We have no information at present regarding what caused the termination signal to be sent, but have opened a ticket with Time Warner. They will investigate and we should hear something tonight or over the weekend.
August 18, 2000: Dialup Outage
We experienced a brief dialup outage in Austin that has thus far been attributed to a momentary glitch with our PRIs. The matter is being investigated, but all lines seem to be answering. Exact length of outage uncertain, approximated at 2-5 minutes. More information will be posted as it becomes available.
August 1, 2000: FTP changes
Due to high load issues on our userhosts, we have updated wu-ftp in order to no longer allow anonymous FTP access to our userhosts, i.e. io.com. Anonymous FTP access is still available at ftp.io.com.
July 19, 2000: News continued
Due to the disk failure on hiram, we've switched the news server to use the Giganews news service. An announcement about this appears in IO Revealed! at http://www.io.com/revealed
July 19, 2000: News/hiram.io.com disk failure
Hiram, our main news server, is suffering from a disk problem in the array of disks that hold news articles. We are currently working to try and resolve this problem. If this problem continues to take up more time than it is worth, we will be escalating the move to our new news service (currently known by the servername betanews.io.com.)
July 18, 2000: Sshd daemon restart
We will be making some changes to our ssh configuration, and in order for them to go into effect the ssh daemon will be restarted at approximately 1am on the 19th. All current ssh connections will be dropped and will need to reconnect. Total downtime should be less than a minute, and five minutes at most.
July 14, 2000: Mail delays in mid afternoon
Deliverator ran out of available processes earlier this afternoon and caused mail to queue on the mailservers. Once deliverator was brought back to operational status, mail began to flow through very quickly, but many messages had built up in the queues. As of 5pm, there are only about 300 messages left in the queues and they are depleting very quickly.
July 10, 2000: Houston access interruption
The facility where we house our Houston access servers suffered a power failure at about 1:30pm. Our UPSs served power to those systems until approximately 2:26pm. At that point, the batteries were drained, and all of those systems lost power and connectivity. Power to the building was restored at about 2:45pm, and our services are now back up.
July 9, 2000: Deliverator To Receive Hardware Upgrade
As expected, downtime was minimal (approximately 26 minutes). We will be monitoring the new hardware configuration over the next few days to see how well it performs.
July 8, 2000: Deliverator To Receive Hardware Upgrade
At approximately 1:00am Sunday morning, Deliverator.io.com (mail.io.com) will be upgraded from its current hardware configuration (PII 400MHz) to a Pentium III 450MHz processor. Expected downtime should be minimal.
June 29, 2000: Houston Downtime/Switchover III
Currently our Houston routers can see the rest of the world but not our Austin routers. This is being cleared up with Exodus currently in order for our routes to be allowed since the traffic is coming in from a new location and line from Houston.
June 29, 2000: Houston Downtime/Switchover II
All equipment has been moved, but traffic is not routing properly. Brent is on site in Houston working on the problem.
June 29, 2000: Houston Downtime/Switchover
We are moving our facilities in Houston to a new location down the hall from our current one. During this switch there will be some downtime. This downtime should be fairly short as they move the T1 over. This is expected to occur in the next hour.
June 28, 2000: Houston Routing Resolved
The routing problems affecting Houston have been resolved. All customers should be able to dial up, connect, and access the internet.
June 28, 2000: Houston Routing
We are experiencing a problem with routing in Houston. Customers presently dialed up in Houston will experience problems getting out to the rest of the internet. This includes sending/receiving email. Customers attempting to dial up in Houston will not be able to connect, as the password server (in Austin) cannot be reached by the Houston dialup server.
June 27, 2000: Austin Dialup - Recap
Yesterday, we experienced some problems with our dialup lines that were not documented in the NOC page as well as we would have liked to document them. What follows is a timeline of the events as they occurred.
The customers affected most grievously were our Dedicated and LAN Dial on Demand ISDN customers. As of now, all outstanding issues on our end are confirmed resolved. The only remaining issues might be with customers who inadvertently scrambled their ISDN configuration while trying to fix a problem they thought to be theirs and not ours.
1:00am - Dialup lines went down, as scheduled, to be moved from Time Warner's North DS3 location to Time Warner's South DS3 location.
5:30am - Move complete. Dialup lines back online.
8:00am - Started noticing problems with ISDN dialups. Customers dialing up to 495-7460 were coming in on PRIs that were supposed to be allocated to 485-7440 and vice versa. This caused ISDN routing problems, as the 7440 ASes are not configured for Dedicated and LAN ISDN. Contacted Time Warner.
1:00pm - Time Warner figured out which PRIs were routed to the two trunks; we switched PRI cables around on our end so that we would not have to wait for them to put the PRIs back the way they should be.
1:30pm - ISDN PRIs maxed out at 29 connections, much lower than they should be. Contacted Time Warner again.
2:00pm - A configuration change was made that rolled all PRIs into the 7440 trunk group, despite the insistence of our network administrators that such a change was neither desired nor acceptable to our needs.
2:30pm - We were told by Time Warner that they were waiting for information to update elsewhere and that they would call us back.
3:30pm - Received callback from a technician who had received a voice mail left earlier in the day. Brought him up to date on the situation. He told us to call him at 4:30pm if the PRIs were still not routed properly.
4:30pm - Called and left voice mail for technician.
5:00pm - Brent called and spoke with our Time Warner sales representative. She got us in touch with a technician at their NOC in Denver. Denver tech discovered 6 of our PRIs were set to "cadence disable" (i.e. turned off). He enabled the PRIs and fixed our PRI routing so that the PRIs rolled into the proper trunks.
5:30pm - All lines restored to full dialup capacity.
June 26, 2000: Austin Dialup
Our Austin analog and ISDN lines are continuing to experience problems from the telephone company service earlier this morning. Symptoms include hearing a recording stating that our dialup number is "no longer available," or ISDN service that authenticates but does not properly pass TCP/IP traffic.
These are call routing and trunking issues that the telephone company is working to resolve.
June 21, 2000: Mail Server Reboot
We had unscheduled downtime with the mail server at about 11:00am while checking some of the hardware. It was down for approximately 20 minutes due to needing some drive checks.
June 10, 2000: Power Outage
We experienced a power outage in our Austin offices today. More details have been posted to the IO Revealed! website (http://www.io.com/revealed/)
June 8, 2000: Brief News Outage
At approximately 7:00pm Central, hiram.io.com was rebooted because of a kernel error. The reboot itself took approximately 10 minutes and news service was restarted. Total downtime was approximately 15 minutes.
June 6, 2000: News update II
Hiram.io.com (AKA news.io.com) is experiencing severe memory and processor use from time to time as we are running processes to make news 'catch up.' We're expiring many of the articles in the rec.* hierarchy to get a better grasp of exactly what has happened to these groups and their specific spool. The spool holding all of the rec.* newsgroups is still showing numerous index problems. It's already doing a lot better than it was yesterday, but the problem is still at the point of making most rec.* newsgroups almost completely unusable.
June 5, 2000: News update
The problem with no new articles appearing last week appeared to be an article numbering problem. The immediate resolution seemed to be simply 'unreading' news (previous article numbers were apparently being used.) The final resolution to this problem was made on Wednesday night/Thursday morning, when we were able to run an expire process and fix the history files.
At some time yesterday a problem appeared with no new articles appearing in some groups. So far, today, we've found this problem in the rec.* hierarchy, and are trying to resolve that issue. The articles appear to be making it through our news servers...they just aren't being handled properly by the history file or are not being archived in the correct place in their spools.
May 31, 2000: News again
We think the problem with news stems from a change made to one of the cron jobs that runs nightly - we've reverted it back to the prior state, and it should clear up the current issue when it runs the expire process tonight.
If not, we'll keep looking - let us know whether or not you see an improvement, and we'll check it ourselves as well.
May 31, 2000: News Update
News is currently acting up for some people - typical symptoms are groups which don't show any new articles since the evening of the 29th - if this is happening to you, there are a couple of fixes - unsubscribing and resubscribing to your news groups has worked, as has resetting the .newsrc file, either by hand or by running /usr/local/sbin/unread-news - both of these should pull up both old and new articles.
This is not the solution - we're working on the problem itself, and have some suggestions from our consultants we're going to try, but these should be good workarounds for the moment.
May 30, 2000: Mutt Update
Mutt has been upgraded, and there is now only one copy of mutt on Eris and Fnord. ;) The old mutt was coming up for most people - now the only mutt is 1.2i-2. If you need a new .muttrc, there's a basic version in /home/f/flatline/tmp/muttrc (no period), which you can copy over to your home dir as .muttrc (backup your current one, of course, before doing so). It should work for you. The sample muttrc provided by mutt is in /usr/doc/mutt-1.2i/ if you want to check that one out, and mutt.netliberte.org has a web based muttrc generator, if you have the time to walk through all the sections. And, as always, www.mutt.org has more information than you could ever want.
May 26, 2000: Mutt Upgrade
Mutt has been upgraded on Eris and Fnord to version 1.2i-2, along with all the relevant files needed by mutt. If you're a mutt user, go check it out and be sure everything still works, and start using those spiffy new features. krb and glibc were also upgraded, so if you see any errors from that, or from mutt, let us know.
May 3, 2000: Mail Backlog
Due to the flood of email generated by the 'I LOVE YOU' virus running rampant today, the mail queues have been backlogged for most of the afternoon. At present, there are approximately 10,000 messages in the queue. While we did put filters in place as soon as we could in an effort to keep the virus from spreading to our customers, our mail server still has to reject every message that is tagged by the filter. The amount of traffic through the servers slows down mail delivery, but all mail will eventually be delivered throughout the evening.
April 10, 2000: Nameserver errors
A bad DNS record was entered into our nameservers at approximately 3:55pm today. This bad record propagated to the rest of our nameservers and slowly killed all nameserver lookups. Nameservice was fixed by about 4:05pm with the exception of our secondary nameserver (ns2.io.com), which completely locked up due to the bad record and had to be rebooted. The secondary machine was back up and answering queries about 10 minutes later.
April 5, 2000: Brief News Outage
This afternoon, around 1500 CST, we had a very brief news outage. The news server seemed to begin refusing access to the server for a reason as yet unknown to us. Outage lasted approximately five minutes.
Our news specialist has been working on the news server all day to increase performance and repair recent problems.
April 4, 2000: Hiram/news running slow
Hiram, the main news server, has been giving intermittent problems with very little troubleshooting info. Throughout the weekend and past couple of days, we've found that the machine will give various errors that point to problems that don't exist! Finally, we've been able to troubleshoot it quite intensely and have pinpointed some overview problems to tackle (we already knew that hiram had overview problems, but we've found even bigger ones than we've had before.)
At this time, 7pm, we're running an expire process on hiram to eliminate these major overview problems. This is causing the machine to run very slowly at the moment, and may extend over a period of several hours. Once finished, we believe the server will be in much better shape and will make further evaluations of its performance.
April 4, 2000: Corrupt Mail Aftermath
The issues that caused mail delays and a deliverator reboot yesterday appear to have cleared up for the most part. We have seen some minor e-mail delays, which may be the leftovers caused by yesterday's outage, but they have not delayed e-mail more than a handful of hours, and the backlogged queues have been much smaller. We have not received any reports of deliverator.io.com refusing connections, as it did yesterday.
It appears that some people who sent e-mail to io.com addresses may have received warning messages back stating that their e-mail could not be delivered for 4 hours. However, no e-mail should have been lost as a result of this: the problem was cleared up well before the traditional 5-day limit when most mail servers give up attempting to send messages.
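The warning-then-give-up behavior described above can be sketched as a simple decision on how long a message has been queued. This is a simplified illustration, not the logic of any particular mail server: the 4-hour and 5-day figures come from the entry above, while the function name and return values are hypothetical.

```python
from datetime import timedelta

WARN_AFTER = timedelta(hours=4)    # delay after which senders saw a warning
GIVE_UP_AFTER = timedelta(days=5)  # the "traditional 5-day limit"


def queue_action(age: timedelta) -> str:
    """Decide what a sending server does with a message queued for `age`.

    Hypothetical labels; real servers typically warn once, not on every retry.
    """
    if age >= GIVE_UP_AFTER:
        return "bounce"          # give up and return the message to the sender
    if age >= WARN_AFTER:
        return "warn-and-retry"  # notify the sender, but keep trying
    return "retry"


if __name__ == "__main__":
    for hours in (1, 6, 24 * 6):
        print(hours, "hours queued:", queue_action(timedelta(hours=hours)))
```

A message delayed by a few hours falls in the middle branch, which is why senders may have seen a warning even though nothing was lost.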
April 3, 2000: Corrupt Mail Adventures
In recent weeks, we have seen some intermittent problems where mail has been delayed for several hours in a 'queue' and not been delivered normally to our users' inboxes. We made a series of configuration changes that appeared to have alleviated the problem.
However, the problem returned today, when deliverator.io.com, our main mail server, experienced extremely high loads and a massive backlog of e-mail. After rebooting the machine to allow users to check their e-mail, Brent Oswald noticed a corrupt e-mail message that appeared to be holding things up. He removed that file, and we have already delivered several thousand e-mails that had been waiting to be delivered. We will continue to monitor the mail server to ensure top performance and prevent further holdups of this type.
It should be emphasized that these problems do not entail any lost e-mail, simply delays in delivery.
April 3, 2000: News Issues this Weekend
Our news server experienced difficulties this weekend. During some periods of time, users connecting to news.io.com received a 'No Spool Inodes' error message. We are still looking into what caused this, and will place further updates on this page.
April 3, 2000: Where'd the Internet go to?
At approximately 1pm Austin time, it looks like our routers received some very bad routing information from one of our upstream providers or peering points. Our core router held these bad routes in memory and wasn't able to send data out through the normal connections. We believe we have narrowed down where the bad routing information came from and have opened up a trouble ticket with them to further investigate what may have happened (and what can be done to repair some of the faster routes that we appear to have lost for the moment.)
We're once again sending a normal load of bandwidth out to the rest of the Internet. A few of the routes are a bit more distant than what we'd like, but they will resolve themselves as new paths open up and maybe our upstream connections can find a problem or two to fix.
March 31: Router Memory Upgrade
At midnight (March 30th), we successfully installed the additional memory to our core router for Austin. Downtime was minimal, and the router is again running smoothly. Total downtime, as expected, was roughly 15 minutes.
March 30: Whitehall lock-up part 2
This afternoon at approx 2:10pm, whitehall began exhibiting the same high loads and problems as it did last week. Looking at the error messages being displayed, we took whitehall offline to quickly swap out what we believe to be a possibly bad network card. We will continue to closely monitor whitehall.
March 27: Whitehall lock-ups
Beginning late Thursday evening and continuing into Friday morning, one of our virtual hosts (whitehall.io.com) began to lock up periodically and, on a couple of occasions, simply rebooted by itself. Logging into the server and checking on the running processes initially didn't reveal anything out of the ordinary. Upon closer inspection, it seemed one of the domains hosted on this machine was generating a larger amount of traffic than normal. This extra traffic was enough to take up almost all of the server's memory and other system resources, causing it to lock up and freeze.
After shutting down the domain in question, we juggled things around and moved the domain to another server. This took much of the load and strain off of whitehall, and as a result made whitehall run much smoother and happier. This morning, however, at approximately 7:56am, whitehall had to be reset. Since then, no other problems or system lock-ups have occurred. We will be looking at whitehall much more closely to better determine the cause of these recent system problems.
March 26: News server unresponsive
Hiram died around 11:30 this morning and was rebooted. When it came back online, it wasn't serving news to readers (but still taking its feeds from upstream.) After restarting news several times on the machine, the actual problem that was hindering progress was discovered at about 4:20pm. We've changed the startup scripts so this problem should no longer inhibit the starting of news.
March 16: News server slowness/unresponsiveness...
I've restarted news after manually rotating the news logfile (~news/log/news). At first glance, the 'news' log file has grown exponentially over the past several days due to massive amounts of cancel messages. The bigger the logfile, the more syslogd has to work. The more syslogd has to work, the higher the load on hiram. The higher the load on hiram, the less responsive innd is to people reading news. I'm trying to block off the massive amounts of cancel messages, but they're not all coming from the same sources (thus...if I blocked them *all*, I'd also be blocking legitimate posts from certain servers.) :(
March 7: News server upgrade
The news server upgrade scheduled for 10pm March 7 has failed due to hardware problems. We've reverted solomon back to the original configuration it had before the upgrade. We'll plan for and announce another upgrade window as soon as it becomes possible.
February 26: Continuing issues...
We currently have two continuing issues affecting us.
February 18: Mail error on the 17th
We had a serious mail problem on the 17th, caused by a misconfiguration in one of our spam filtering programs. All customers who were affected by this were sent a copy of the following notice. Our customer support team is ready to take any questions you may have about this error.
It is a standard procedure to check on mail bounced to email@example.com and add appropriate spamming sites to our spam filter. A typo in the editing of this spam filter, today, caused *ALL* mail at that point to be treated as spam (by this filter).
This was an engineering (human) mistake made while editing our spam filters. Safety checks have already been put in place to prevent this type of human error from happening again.
The filter in question is our 'nospam' filter. This filter only applies to those customers who have subscribed to the filter, and to all POP (mail only) accounts on our system.
The e-mail outage lasted from approximately 2:54pm till 6:34pm this evening. All mail received on our system during this time was delivered to the computer equivalent of a shredding machine and can not be recovered.
If you were expecting mail that you did not receive, you will want to contact the person sending the mail and ask them to resend it.
We deeply apologize for this inconvenience.
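The notice above does not describe the safety checks that were put in place. As one possible illustration only, a filter update could be tested against mail that must never be treated as spam before it goes live. Everything in this sketch is hypothetical: the pattern format, the sample addresses, and the function name are assumptions, not the actual 'nospam' filter.

```python
import re

# Hypothetical known-good senders that must never be classified as spam.
KNOWN_GOOD = ["alice@example.org", "bob@example.net"]


def is_safe(patterns: list[str]) -> bool:
    """Return False if any pattern is empty, invalid, or matches known-good mail."""
    for pattern in patterns:
        if not pattern.strip():
            return False  # an empty pattern would match every sender
        try:
            compiled = re.compile(pattern)
        except re.error:
            return False  # a typo that breaks the pattern syntax
        if any(compiled.search(sender) for sender in KNOWN_GOOD):
            return False  # the pattern is too broad
    return True


if __name__ == "__main__":
    print(is_safe([r"@spammer\.example$"]))  # a narrow pattern passes
    print(is_safe([r".*"]))                  # a catch-all pattern is refused
```

A check of this kind would refuse a filter that treats all mail as spam, at the cost of maintaining a list of sample addresses.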
February 16: Brief mail server downtime
Our main mail server, Deliverator, encountered some minor problems around 5:00 pm today, requiring us to reboot it. Upon coming back online, it resumed delivering mail as normal, but was displaying intermittent login problems for people trying to retrieve their mail. To resolve this, the server had to be taken offline for about fifteen minutes while we worked to solve the problem. The server came back online at approximately 6:00pm and appears to be operating normally again. We apologize for any inconvenience this may have caused.
February 15: News server progress report
Three of our most outstanding issues with news are:
We may have found an answer. It appears as if authentication programs for innd are, at best, beta quality. Digging through the source code and making some tweaks, it looks like we'll be able to solve most of this problem...at least it works under test environments.
Many thanks go to The Wharf Rat (firstname.lastname@example.org) for helping us find this problem.
2: propagation of articles (news.io.com appears to miss many articles).
This appears to be a problem upstream from our main news server. It is possibly a problem with our filtering news server (solomon.io.com.) Current belief is that adding more disk space to solomon will solve the problem.
Solomon holds and filters all articles before sending them to hiram. These articles are spooled locally on solomon, but it looks like too many are getting spooled, so the first articles on that spool "drop off the edge" of the disk before they get sent to hiram. Adding more disk space to solomon means extending that 'virtual edge.'
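The "drop off the edge" behavior described above can be sketched as a fixed-size first-in, first-out spool that is emptied only periodically. This is a toy model under stated assumptions: the capacity, the forwarding interval, and the function name are hypothetical, and solomon's actual spool works on disk space rather than article counts.

```python
from collections import deque


def spool_and_forward(articles, capacity, forward_every):
    """Spool articles in a bounded FIFO and forward the spool periodically.

    The spool is flushed after every `forward_every` arrivals.
    Returns (forwarded, lost) as two lists.
    """
    spool = deque()
    forwarded, lost = [], []
    for count, article in enumerate(articles, start=1):
        if len(spool) == capacity:
            lost.append(spool.popleft())  # oldest article "drops off the edge"
        spool.append(article)
        if count % forward_every == 0:
            forwarded.extend(spool)       # hand everything still spooled downstream
            spool.clear()
    return forwarded, lost


if __name__ == "__main__":
    print(spool_and_forward(range(1, 11), capacity=3, forward_every=5))
    print(spool_and_forward(range(1, 11), capacity=5, forward_every=5))
```

In this model, raising the capacity until it covers everything that arrives between two forwarding runs removes the losses, which corresponds to the belief above that more disk space on solomon would help.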
3: overview file problems (news articles are listed in the overview file, but don't exist on the server.)
This is still an outstanding issue that we believe will be partially fixed when the solomon<->hiram communication is fixed. We're still looking into the issue before the other fixes go into place, as we're not going to be satisfied until the overview problems are resolved (and we don't believe there's only one problem here.)
February 7: Austin Dialup Problems
If you dialed our Austin modem number between 9 and 9:45pm, you may have encountered an operator message as follows:
"You have reached Time Warner's blocking verification. Caller ID information is blocked by dialing a *67 before each call, or by subscribing to permanent line blocking. Your telephone number was blocked on this call. TWC37"
A few *very* unlucky callers would have received the message from SouthWestern Bell:
"We're sorry, all circuits are busy now. Will you please try your call again later."
We have called TimeWarner (TWC, our telephone provider) about this a few times in the past week. We have again opened a trouble ticket with them to have them explain the situation.
They have called back once tonight to update the ticket stating: "We have tried calling your number and we only get some sort of fax machine answering." They are keeping this ticket open for future resolution as they are unable to duplicate the circumstances.
If you encounter either of these messages on a dialup attempt (or any other non-modem answering), please let us know either by voice (our Customer Support number is handled by SouthWestern Bell : 462-0999) or by e-mail to email@example.com. Please also give the phone number from which you are calling and getting these non-modems (we will only release the prefix of your number and the number of instances per prefix to the phone company.)
Again, it seems, the infamous 'hipcrime' found a way to attack our main news server, and as a result caused the load on the server to rise to a high level. We've since added additional filters in hopes this will better deal with these attacks.
February 4: Offsite news reading re-enabled
We have taken care of the security problems that caused us to shut down offsite user authentication last evening. Offsite users should be able to read news remotely again with their news username and password.
February 3: News takes another hit from 'hipcrime'
Everyone (especially news admins) all across the 'Net is tired of the infamous 'hipcrime' attacks. Our news server has received yet another several hundred control messages to create bogus groups such as alt.h0pcr0me.nanotech . The safeguards in place on our news server automatically filter out 99% of these (bad) newgroup control messages, but in doing so, the load average on the machine rises dramatically. As many of you know, our news server rejects connections at high processor load.
In addition, we've found a few spammers abusing some very new holes in the security of the new news server. One of these holes has forced us to reject offsite user authentication. This is a temporary measure and we will be working to resolve this issue around the clock until it is working again.
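The load-based rejection mentioned in this entry can be sketched as a threshold check on the system load average. This is a minimal illustration, not how innd implements it: the threshold value and the function name are hypothetical, since the entry does not say what limit the server uses.

```python
import os

# Hypothetical threshold; the entry does not state the server's actual limit.
MAX_LOAD = 8.0


def should_accept(load_average: float, max_load: float = MAX_LOAD) -> bool:
    """Accept a new reader connection only while the load is below the limit."""
    return load_average < max_load


if __name__ == "__main__":
    one_minute_load, _, _ = os.getloadavg()  # available on Unix-like systems
    print("accept" if should_accept(one_minute_load) else "reject")
```

The trade-off is visible in the entry itself: the check protects the machine during a flood of control messages, but legitimate readers are turned away for as long as the load stays high.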
February 3: News Upgrade Performed
The news server upgrade was performed last night, and news is back in service again. However, we do have one ongoing problem with offsite user authentication that developed. We are working to get that resolved, and already have some workarounds in place that shouldn't block anyone's news access in the meantime. We hope that the overall upgrade of innd will fix things such as the 'blind cross post' issues, where articles appear in groups they were not actually posted to, as well as other overview problems. Articles that are currently 'blind cross posted' in the overview file will not be fixed without a rebuild of the entire file, but new occurrences of this phenomenon shouldn't happen. The backup plan, if overview problems continue to happen, is to either:
B) Wipe both the overview file and all news articles and let it all start from scratch with the new articles that come in.
We hope it doesn't come to either of those.
February 2: Delayed mail delivery
Incoming mail has seen many delays recently (especially in the past two weeks). Mail delays have been varied in many ways, the result of many causes, and we are taking many different actions to remedy this situation as quickly and efficiently as possible.
A bulk of the delays have occurred during weekdays around 10am. At that time, it appears as if "the rest of the 'Net" suddenly wakes up and starts sending mail to our servers. Our primary mailserver, deliverator, starts accepting as many messages as it can, until it begins to become overwhelmed by the daily floods. At this point, mail is redirected to our secondary and tertiary mailservers (mx2 and mx3.io.com.)
Deliverator has safeguards in place to keep it from crashing due to mail overload. These safeguards have been stretched to their limits, so we are refining the way mail will be delivered on our systems.
Another cause of recent mail delays has been due to a number of external domains that have 'no' mailservers responding (one in particular has been home.com: they have 8 mailservers setup for receiving mail, but not one of those responded for a period of at least 6 days!) Mail on our system that is waiting to go to those domains has been a clog in the queueing system.
The final problem of note is the one that we're currently in the midst of addressing: deliverator is just doing too much! As you will note in the rest of this post, deliverator is serving the following functions:
-POP server (most mail retrieval on our system is through POP..or Post
-IMAP server (this is the second most widely used method of retrieving
-Primary MX host (every mailserver on the 'Net that sends mail to us tries
to connect to deliverator first!)
-NFS server (this is the *nix method of sharing disk drives across a network.)
-Aliases and Virtual Aliases (aliases translate firstname.lastname@example.org into my real
address, email@example.com ... virtual aliases translate all mail
for virtual domain addresses (such as firstname.lastname@example.org) into
other delivery addresses (such as email@example.com .)
Many plans are 'on the table,' and the current work in progress is as follows:
Current mail hosts that are advertised to the rest of the 'Net are: deliverator (primary), mx2 (secondary), and mx3 (tertiary). In this scheme, any mail coming to our system first tries deliverator. If deliverator doesn't respond, mx2 takes the mail, and so forth.
We're currently refining a mailserver to act as 'mx.io.com.' This mailserver will be the primary mailserver that is advertised to the rest of the 'Net. It will be able to handle many more incoming connections than deliverator is capable of handling. Once mail comes to mx.io.com, the mail will then be forwarded directly to deliverator in a single outbound connection (instead of several incoming connections from many different hosts). This should greatly improve deliverator's performance as it will only have our internal mail connections to deal with.
At the same time, mx.io.com will be put in charge of handling all virtual aliases (mail aliases for the many domains that we host). At that time, we'll be re-reviewing mail delivery for further enhancements.
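As a rough illustration of the fallback order described above, here is a minimal sketch of how a sending mailserver picks among advertised mail hosts. The preference numbers (10, 20, 30) and the function names are made up for the example; only the ordering (deliverator, then mx2, then mx3) comes from our setup.

```python
def delivery_order(mx_records):
    """Return hosts in the order a sender would try them (lowest preference first)."""
    return [host for _, host in sorted(mx_records)]


def pick_host(mx_records, is_responding):
    """Return the first host that responds, or None if every host is down."""
    for host in delivery_order(mx_records):
        if is_responding(host):
            return host
    return None  # the sender would queue the message and retry later


# Hypothetical preference values; lower numbers are tried first.
records = [(20, "mx2"), (10, "deliverator"), (30, "mx3")]

print(delivery_order(records))
print(pick_host(records, lambda host: host != "deliverator"))
```

With deliverator marked as not responding, the sketch falls through to mx2, which is the behavior we see every weekday morning.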
Other plans that are currently slated for future development:
all handled by deliverator.)
February 2: News upgrade tonight
The bug that was found in the version of the news server software that we were planning to upgrade to has been fixed, and we are now ready to proceed with the upgrade. We have scheduled it for midnight tonight, with a downtime window of 15-30 minutes.
We apologize for all the delays in getting this upgrade done, and we thank you for your patience.
January 29: News server upgrade delayed further
Because of the recent D.O.S. attack, Hipcrime flood, and a bug in the version of the news software that we were planning to upgrade to, the upgrade will be delayed a bit longer. A patch has come out that is supposed to fix the bug, and we are waiting on reports to see if the patch takes care of the problem or not. This should probably only be a few days, so look for another announcement near the beginning of the week.
January 29: News server slowness/unresponsiveness
You might be connecting intermittently today, and getting a lot of refused connects due to high load average. This is the effect of (so far) over two thousand 'new group' control messages from everyone's most unfavorite person/people: hipcrime. We can't stop the newsgroup control messages from getting to us, but at least we aren't adding 2000 new newsgroups today such as alt.h.i.p.c.r.i.m.e.congress.
(I only wish other news providers would put up filters on these control messages to prevent them from propagating...it could be that he's just using too many variations in his spelling and punctuation).
January 28: Routing problems, D.O.S. attack
Starting around 11pm, January 27, IO became the target of a Denial Of Service (D.O.S.) attack. During the beginning phase, most of the data was concentrated on a collocated server on our network. This machine (irc.io.com) was removed from the network, but the attack continued until all routes were flushed from our router and our upstream providers at about 2:30am.
As a result of this outage, irc.io.com will be out of service until we are able to consult with those system administrators and guarantee a method of preventing this from happening in the future.
January 27: News Upgrade Delayed
Due to some unforeseen events, last night's upgrade of the news daemon had to be postponed. We apologize for this delay, but it will be a short one. The upgrade has been rescheduled for midnight tonight with the same downtime window, a max of thirty minutes, but probably closer to fifteen.
January 24: Mail delivery slow
Update as of 10PM Austin time: The mail queues have quieted for local traffic (going to io.com addresses). There are still several mail messages waiting to be delivered to non-IO controlled systems, but these are normal effects of other mailservers not responding or mail that really isn't deliverable (typos in e-mail addresses/nonexistent domain names.) We feel that it is because of outgoing (undeliverable) mail that incoming mail may be queueing, and will be keeping our eye in the pyramid staring at the state of our mail queues and delivery times to see how these problems can be avoided in the future.
We have been experiencing some major slowdowns to mail delivery today. Mail is not being lost, but is being queued for future delivery. The mail is being delivered as we push the mail through our servers with the computer equivalent of a 'big stick.' More news about mail will be posted later Monday evening as we resolve these queue problems.
January 18: BGP Routing Issues Update
As of 8pm, we still have open tickets with two of our upstream Internet providers. At around 5:15pm, the problem appeared to have resolved itself, but we are still pressing our providers for answers as to what happened, why it happened, and what measures are being taken to prevent this situation from happening again.
The gory details: Our router communicates via BGP (Border Gateway Protocol) to our upstream providers to announce to the rest of the Internet "we are right here." As this issue is not yet resolved, we do not have complete details of what happened, but we believe that one of our upstream providers had a misconfiguration that either pushed our announcement too much, or they had an internal routing problem (they kept mentioning an unknown surge of Internet traffic right about the time our problems started).
January 18: BGP Routing Issues
Around 3:55pm today, an issue with our routing caused quite a few problems with all traffic entering and leaving IO. We have opened tickets with our two major upstream providers regarding the issue. As of this writing, some of the routes have been corrected and are working again, but we are still waiting on updates from our upstream providers on the rest. We will post more information here when it is available.
January 16: Houston power outage
The facility where our Houston access servers are located had a power failure around 10am. UPS power kept our equipment running until about 10:32am, but the power backup for the phone lines apparently didn't work. This caused a recording to replace our modems answering. We have requested a detailed report from our phone provider of this outage and will see if they can't provide somewhat better backups to their services. Power was restored at about 10:45am and our dialup lines are working again.
January 12-13: Router Upgrade Problems
The new Cisco 7206 VXR was successfully installed the night of January 11, over the course of several hours of troubleshooting miscellaneous problems with one of our DS3 circuits. We were never completely off the net, though you might have experienced some abnormal routing.
Our goal was to have both Cisco 7206 routers working together in a redundancy configuration. Unfortunately, there have been many routing problems.
After much troubleshooting, many calls, and lots of headaches for everyone involved (including our customers), we've reverted our configuration, for now. In the new configuration (made active the evening of January 12), the new 7206 VXR works as the main router and the old 7206 (not a VXR) is handling only our T1 connections.
We are still troubleshooting routing problems from our Houston network to Austin, as well as working with our upstream providers to clear old route caches on their routers.
January 6: News server upgrades and performance issues
The news changes began a few months ago when we attempted to switch from using INN 2.2 with traditional storage volumes to INN 2.2 with CNFS volumes. Using traditional volumes was causing newsreading to be slow with the current load of news, as well as forcing us to manually expire the alt.binaries hierarchy every day, as normal expires couldn't keep up.
Upon upgrading, we encountered an overview formatting problem between INN 2.2 and CNFS storage volumes. This caused excruciatingly slow news reading times, and the only solution we could find was to upgrade INN to version 2.3, the development-path server. There were some minor problems with this upgrade, due to an incomplete overview rebuild, but we got that smoothed out.
Newsreading is speedy again, and aside from an overview problem on Christmas weekend, which caused quite a few articles to turn up 'missing' (They were there, but not in the overview, so they did not appear to users), we haven't encountered a large amount of missing articles.
Expires operate differently than before, so some articles that are reported 'missing' may just be expired. Expiration no longer runs nightly, weekly, or hourly; rather, whenever a volume fills up, the oldest posts in that volume (regardless of newsgroup) are expired. Low-traffic groups may appear empty frequently if they share a volume with high-traffic groups, as those will make the volume cycle through its available space more quickly.
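To picture why a quiet group can go empty, here is a toy model of a volume that expires its oldest posts when it fills up. It is only a sketch: it counts articles instead of bytes, and the class and group names are invented for the example, not taken from innd.

```python
from collections import deque


class CyclicVolume:
    """Toy fixed-size volume: storing into a full volume expires the oldest post."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumption: capacity counted in articles, not bytes
        self.articles = deque()

    def store(self, group, article_id):
        """Store one article and return whatever had to be expired to make room."""
        expired = []
        while len(self.articles) >= self.capacity:
            expired.append(self.articles.popleft())  # oldest first, any newsgroup
        self.articles.append((group, article_id))
        return expired

    def groups_present(self):
        """Return the set of newsgroups that still have articles in the volume."""
        return {group for group, _ in self.articles}


volume = CyclicVolume(capacity=3)
volume.store("quiet.group", 1)
for n in (1, 2, 3):
    expired = volume.store("busy.group", n)

print(expired)
print(volume.groups_present())
```

After three posts to the busy group, the single post in the quiet group is the oldest thing in the volume, so it is the one that gets expired.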
Actual missing articles could either be caught by our filters, or more likely, never reached us through our upstream news feeds. This last one is the most likely cause of articles that are missing and not expired. Now that we have the bugs worked out of the system we're adding a non-binary news feed from Supernews, which should pretty much eliminate the "missing articles" problem.
We apologize for our recent problems, and we should have been more diligent in posting current information. We're spending a lot of time and money on our news service these days, and we want you to be satisfied. One small caveat: we're NEVER going to have all of the articles that Deja News does (they're 1000 times larger than we are), but we should have all of the articles in all the popular newsgroups for a minimum of a week. We appreciate your feedback. We'll get it there.
January 2: Broadwing DS3 failures.
12:40pm: Broadwing, aka IXC, aka Eclipse, has been reporting sporadic problems with their DS3 to us. Apparently they lost power at their Stone Hollow location, which affected their Kramer location... talk about great architecture there. -zippo
12:55pm: Power is back at Stone Hollow, so Kramer works now. We will keep an eye on them for a while.
December 29: Denial of Service attack
Around 4am and 3:30pm IO was under a denial of service attack. Each attack lasted about 15-30 minutes. We have located the source and contacted our uplink providers so they may add appropriate filters. To see how we were impacted, refer to http://www.io.com/traffic. -zippo
December 17: Various machine reboots
Some time after the fileserver reboot this morning, the distributed password database on our systems became corrupt. In order to rectify the situation, each affected machine (mainly userhosts and webservers) had to be manually rebooted and have a new password database copied to it.
December 17: Fileserver reboot
At about 10AM, our fileserver, bavaria.io.com, decided to dump core and reboot. This caused many services to hang while they waited for bavaria to come back to life. The fileserver took about 10 minutes for the complete coredump and reboot process and is back alive again.
(And NetApp has been notified that their instructions saying "this is a harmless and unintrusive procedure" are incorrect for the procedure performed about 20 minutes before this happened.)
December 17: News update
Midnight - The issue we are seeing on about 30 groups, including incorrect posts and articles that cannot be retrieved, has been tracked down. The issue is with the beta version of innd running on the server. I will begin the upgrade of this shortly. This requires a rebuild of the overview, which might take a while. Article numbers will be preserved. -zippo
5:28am - I have been in contact with the innd maintainer all night regarding some of the issues we encountered. He suggested upgrading and rebuilding the overview. The upgrade went fine, but when it came to rebuilding the database, "makehistory" did not finish properly, and core dumped. I was able to get innd running again and it is accepting new articles once more. Older articles still exist on the server but not in the overviews. I will be working on restoring the complete overview today. -zippo
December 16: News update... again
6:47am - Hiram complained of too many open files when it ran its nightly stats and expired the overview. This caused innd to die. I recompiled the kernel at that time to have more file handles but hiram was cranky. Hiram was finally brought up around 5:30am. This new kernel takes advantage of both CPUs in the machine, as well as more file handles. (still working on that weird austin.general type problem). Donations of coffee beans would be great :) -zippo
December 15: Eris drive failure
1:07pm - The main Seagate Barracuda hard drive on eris.io.com has completely failed. All attempts to revive it have been unsuccessful. All userhost traffic is being redirected to fnord.io.com. We will be attempting to mirror fnord throughout the day in an effort to bring eris.io.com back into service as soon as possible. Due to the sheer number of users on this one remaining host, as well as the mirroring process, fnord will be very loaded and slow. We apologize for any inconvenience this has caused. -dupre
7:00pm - Eris' disk was unrecoverable. We took a mirror copy of fnord and changed the hostname and IP address to make eris. This took a bit longer than we liked... but fnord could handle it! Eris is not yet back in rotation; we'd like to watch it for a day or two just to be sure it was mirrored successfully. -zippo
December 15: News update
10:20am - The majority of the newsgroups transferred over from the beta server intact. We have had reports of some corrupt groups. These groups include: austin.general, rec.arts.sf.written, alt.books.david-weber, and comp.lang.perl.misc. There was another report that stray articles from one group ended up in another group. We plan on rebuilding the overview database to correct this issue. We are currently looking for an efficient way of doing so with minimal downtime on the news server. -zippo
6:00pm - Rebuild of the history and the overviews did not fix the problems seen in about 20-30 groups which have been reported in io.admin or firstname.lastname@example.org. We are still looking for a solution to this issue; sorry for the trouble this is causing some of you. Enjoy the speed though! -zippo
December 14: Upgrade Status
6:00pm - Looks like we've gotten everything up and running. We had a few RAID problems, but those are resolved. We have slightly less space than we published yesterday, 100GB instead of 116GB, but other than that, the rest is correct. If you have any problems with news, please contact customer support by phone or email to email@example.com.
10:00am - We've run into a few snags that are making things take longer than we expected. We don't have a current ETA, but we'll keep you posted.
December 13: News Upgrade
This Tuesday morning, at 6:00 AM we will begin an upgrade on news.io.com. Downtime for news is expected to be about two hours.
We are upgrading innd, the news server daemon, to version 2.3. In addition to this, we are also increasing the amount of storage space allowed for the news spools. We will have 92GB for articles and 24GB for overview files, for a total of 116GB!
We have been testing the setup of innd that we are about to implement for about two months now, and the only problem that we're still having is that it runs out of overview space quickly. However, we are testing it on a *much* smaller file system, 1.5GB instead of the 24GB that it will have.
There is no efficient way to preserve the article numbers from the current news system. Article numbers are markers that make each posting unique and allow your news program to figure out what you have and have not already read. With the server swap, all of the old articles were copied over, but the article numbers have all been reset. The result is that you will probably not see any new news posts in the groups.
If you are using a Unix news reader on one of our userhosts, you can run a script that we have written that will reset the article numbers in your news program as well, and allow you to see new postings again. You can run this script by typing the following at the Unix command prompt:
This script will make a backup copy of your news configuration file in your home directory named .newsrc.saved . Then it will reset the article numbers in your config file. When it is finished, you can re-open your news program and you should see all of the messages again, and have some news to catch up on.
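For the curious, the behavior described above (save a copy as .newsrc.saved, then reset the article numbers) can be sketched roughly as follows. This is not the script we installed on the userhosts; it is an illustration that assumes the common .newsrc layout of one group per line, a ':' or '!' marker after the group name, and a list of read article ranges after that. The function names are invented for the example.

```python
import re
import shutil
from pathlib import Path

# Group name, then ':' (subscribed) or '!' (unsubscribed); read ranges follow.
GROUP_LINE = re.compile(r"^(\S+?)([:!])")


def clear_read_ranges(newsrc_text):
    """Drop the read-article ranges from every group line, keeping the marker."""
    cleaned = []
    for line in newsrc_text.splitlines():
        match = GROUP_LINE.match(line)
        cleaned.append(match.group(1) + match.group(2) if match else line)
    return "".join(line + "\n" for line in cleaned)


def reset_newsrc(path):
    """Back up the file next to itself with a .saved suffix, then reset it."""
    path = Path(path)
    shutil.copyfile(path, path.with_name(path.name + ".saved"))
    path.write_text(clear_read_ranges(path.read_text()))


sample = "austin.general: 1-500,502\nalt.test! 1-20\n"
print(clear_read_ranges(sample))
```

With the ranges cleared, a newsreader treats every article in the group as unread, which is why you end up with some news to catch up on.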
For those using Netscape or Outlook to read news, you will need to unsubscribe and resubscribe to your newsgroups to reset them.
If you are using Agent or Free Agent, first select all of your newsgroups. Then go to the "Messages" menu and select "Mark All Unread".
If you are a Mac user who is reading news with Newswatcher, you can pull down the "Edit" menu and choose "Select All". Then pull down the "News" menu and select "Mark Unread".
We apologize if this causes any inconveniences to you. We should all see some notably increased performance over the old setup once this is done, so that should soften the blow a bit.
November 19: Dialup lines and DS3/T3
This morning Time Warner expanded capacity on their SONET ring. This process included downtime on all our dialup lines and our DS3 connection to IXC. The downtime began at 5:09am and ended at 5:29am. Total downtime was 20 minutes. With this upgrade complete, we can fully utilize the OC-12 and OC-3 into our building to Time Warner.
November 18: IRC Server Drive Failure
Irc.io.com (aka: austin.tx.us.undernet.org) lost its /usr partition last night and is down for a reinstall. Estimates are that it should be back sometime tomorrow.
November 18: News Troubles
Hiram.io.com was having a slew of problems starting late last night. It has since been brought back up after the problem was located and fixed. We will be observing its status throughout the day.
November 17: Tuesday's dialup problems
Our Austin dialup lines were unavailable to many customers for about two hours on Tuesday evening. We have a trouble ticket open with Time Warner on the issue, which is currently being investigated. They have not come back with an explanation of the problem as of yet. The symptoms appear to be the same as last week's outage, described below.
November 8: Sunday's dialup problems
Our Austin dialup lines were unavailable to many customers for about four hours on Sunday. The explanation received from Time Warner on this issue follows:
The majority of our customers with any traffic from any end office and intermittently through Greenwood experienced call blocking yesterday from 11:00 am to approximately 2:30 pm. This was a result of a failed STP processor in the SWBT 5 ESS in the Tennyson Central Office. Because the STP link was good from TWTC to SWBT we did not stop sending the calls, however the SWBT switch was not able to process them. Our technicians immediately escalated this within SWBT to have the correct resources available to fix the problem on Sunday. We are asking SWBT for a complete Reason for Outage to document this problem.
October 12: More webserver modifications
The main counter script (http://www.io.com/cgi-bin/counter ) has been modified so it now uses a shared file between all three servers instead of three separate files that differed on each webserver. The result is that counts should only go 'up' and won't fluctuate depending on which webserver actually answers the call to the counter. Also, all pages using the shared counter script have been reset to '1' as it would be somewhat more difficult to merge the three previously used files and over 1100 individual page counts.
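The idea of one shared count file can be sketched as below. This is not the counter script itself: the JSON file layout, the function name, and the use of flock are all assumptions for illustration, and whether flock-style locks are honored on a shared network mount depends on the setup.

```python
import fcntl
import json
import os


def bump_counter(path, page):
    """Increment and return the hit count for `page` in the shared file at `path`."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as shared:
        fcntl.flock(shared, fcntl.LOCK_EX)  # one writer at a time
        raw = shared.read()
        counts = json.loads(raw) if raw else {}
        counts[page] = counts.get(page, 0) + 1
        shared.seek(0)
        shared.truncate()  # rewrite the whole file with the new counts
        json.dump(counts, shared)
        shared.flush()
        return counts[page]  # the lock is released when the file is closed
```

Because every webserver reads and writes the same file under a lock, a page's count only ever moves upward no matter which server answers the request.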
A few links have been made for backwards compatibility with many Perl scripts/CGI's. The officially supported path to perl is /usr/bin/perl, but we went ahead and added /usr/bin/perl5 and /usr/local/bin/perl after finding so many scripts that used those paths. More information on Perl scripts and CGI's can be found on the 'web helpdesk' pages at http://www.io.com/help/helpdesk/customcgi.html .
October 11: htaccess files under the new webserver
The Apache directive 'AuthName' has had a slight change between the versions that we've upgraded. The value for 'AuthName' must now be enclosed in quotes when it is more than one word. See http://www.apache.org/docs/mod/core.html#authname for specific reference to this directive.
An example 'bad line' looks like:
AuthName Secured Web Pages
An example corrected .htaccess file would look like:
AuthName "Secured Web Pages"
require user fnord
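If you have many .htaccess files to check, a small helper along these lines could flag and fix the unquoted form. It is only a sketch with a made-up function name; it handles the simple case shown above and nothing more.

```python
def quote_authname(line):
    """Wrap a multi-word AuthName value in double quotes; leave other lines alone."""
    parts = line.strip().split(None, 1)
    if len(parts) == 2 and parts[0].lower() == "authname":
        value = parts[1].strip()
        already_quoted = value.startswith('"') and value.endswith('"')
        if " " in value and not already_quoted:
            return f'{parts[0]} "{value}"'
    return line


print(quote_authname("AuthName Secured Web Pages"))
print(quote_authname("require user fnord"))
```

Single-word values and lines that are already quoted pass through unchanged, so the helper is safe to run over a file more than once.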
October 11: Webserver Upgrade/swapout
The webserver upgrade/swapout has been completed with little (under 10 minutes) downtime for any of the machines. Aside from the normal slew of 404 errors that we serve on a regular basis, I have not found any other problems/conflicts with the new configurations.
October 7: News Update
Thanks to a little help, we've tracked down the source of the news slowness. Unfortunately, there is no fix under the current news server software.
The slow performance is a problem with INN 2.2, caused by the unified overview database (uniover), which was optimized for creating OV records, not reading them.
In 2.3, uniover is replaced by two different overview db formats, both of which greatly improve reader performance and still do a good job keeping up with full newsfeeds when creating the OV records.
We're peering some of the news to another machine so we can do a test setup of INN 2.3. An upgrade to the news servers will follow after that. Dates and times for the upgrade will be posted in advance.
October 5, 1:45PM: Password burps..
A small problem caused the distributed password files to be cut at line 666. We're performing an exorcism on the program that distributes passwords and it should be ok by about 2:15PM.
September 28: News update - final?
The board was swapped, and aside from a slight SCSI id problem that was remedied, it went smoothly. News is back up and running, and appears to be loading the article lists for newsgroups much faster now, at least by our tests.
September 28: News update continued
We will begin the board swap in 10 minutes. This process should not take more than 20 minutes. Thank you for your patience once again.
September 28: News update
The new machine (a Pentium II 400MHz) is not handling the news services nearly as efficiently as the old machine (a dual-processor Pentium Pro 200MHz) did. After looking over the server and kernel configs, it appears that news was designed to work better on multiple processors rather than a single one. We will be switching back to the dual-processor machine later today or tomorrow. We know it is a pain, and we're sorry, but the end is in sight. The new file storage method is working well, so once we get the machine serving news back up to par we should be in the clear. Thanks for your patience.
September 27: News outage
The drives containing alt.binaries for our news server filled up around 10 AM Sunday morning. The volumes attempted to exceed their individual boundaries of 2 gigs, which throttled news. We are currently working hard to restore news to working condition as soon as possible and we will post updates to the NOC page. We apologize for any inconvenience this has caused you and we are working to ensure that this will not occur in the future.
September 21: News server upgrade/changes
The news server was (for the most part) successfully upgraded last night, but a few problems remained. First, offsite newsreading wasn't enabled. This was re-enabled about 10AM local Austin time. Next is the fact that all news articles now have different numbers.
Most articles will appear as 'already read' in your news program. A simple script has been made available on our userhosts (dillinger, atlantis, fnord, eris):
When run, this script will copy your .newsrc file to .newsrc.saved, and reset all article numbers in your .newsrc to begin with '1'. Then, when you re-open tin/trn/favorite-shell-newsreader-here, you should see all messages and have lots to catch up on.
As far as other clients (windows, mac, non-IO shell clients), you will need to find the feature in your program to 'unread' news or reset your newsgroups.
September 20: Eris reboot
Eris.io.com was having some strange problems this morning that were slowing the machine down to a crawl. The data we collected pointed to SCSI drive errors or kernel errors in reading the drives. The simplest fix was a reboot of the machine. We'll be monitoring the machine for any further problems.
September 1: Webserver Beta Testing
webserver configuration is well past due for an upgrade. So..we've compiled
a newer version of the Apache webserver: version 1.3.9. This version differs
from the current version 1.2.6 in many ways. Most importantly, this newer version
has PHP3 installed in place of PHP/FI (a.k.a. PHP2). The PHP3 module has internal
functions available for interfacing with MySQL databases
(see http://www.io.com/help/helpdesk/mysql.html ).
Initial announcement of the 'beta-test' of our webserver was made to the newsgroup io.help.www and the io-beta mailing list (send mail to firstname.lastname@example.org with 'subscribe' in the body of your message to subscribe).
A summary of the changes and upgrades will be posted as soon as we've deemed the new server 'adequate'.
September 1: Quota problems
Slight problems with the disk quota settings today as we were enabling improved functions. Quotas were all reset to 0 for approximately 5 minutes. They have been fixed.
July 20: Ascend issue resolved, Mac trouble persists
We've resolved the Ascend compatibility problem we were having on Thursday. However, the Macintosh Open Transport PPP dialer is still having problems with the server.
There was a point this weekend where some of the changes made in an attempt to correct this problem caused login problems for many dialup users. We've corrected that problem, but we're still working on the Mac dialup issue.
After some testing this weekend between ourselves and Cisco, we have narrowed the problem down to a portion of the OT/PPP dialup sequence. Cisco has escalated our trouble ticket and they are currently reviewing dialup logs that we have generated ourselves by dialing up with OT/PPP.
In the meantime, we have come up with a workaround for our customers using Mac Open Transport PPP to connect. For the time being, a modem script should be used to connect to IO. The script and instructions for downloading and setting it up can be found at the following address:
We thank you for your patience with this situation.
July 14: Mac and Ascend Dialup Troubles
The upgrade to the dialup servers has caused two problems to crop up. The first is that customers connecting with Apple's Open Transport PPP (included with MacOS 8 and higher) are unable to establish a connection in several cases. We have a trouble ticket open with Cisco on this but have not yet been able to pinpoint the source of the problem.
The second problem is with dial-on-demand ISDN customers using Ascend Pipeline hardware. Some of these users have been able to connect, but not route any traffic. We have spoken with both Cisco and Ascend on this matter and have some suggestions. If you have Pipeline hardware and are having difficulty connecting, please contact our customer support technicians at (512) 485-7440. They have been given special instructions to work with you on this issue.
July 12: Dialup Server Reboots
We will be upgrading the operating system firmware on our Cisco 5300 dialup servers over the next couple days to bring them all up to the most current version. This will require a reboot of each unit after the new firmware is installed.
Because of time constraints, we will be spacing the upgrade out over a few days. The first server, as1.io.com, will be rebooted at 12:00AM CDT on Tuesday, July 13th. As2.io.com will be rebooted at 12:00AM CDT on Wednesday, July 14th, and as5.io.com will be rebooted at 12:00AM CDT on Thursday, July 15th.
When the reboot occurs, any customers dialed into that server will be dropped offline, and there is a possibility that you may receive busy signals while the system reboots, which can take up to 12 minutes.
July 7: Brief Downtimes
At approximately 14:45 the secondary mail server and primary DNS server will be taken offline briefly while we do an emergency replacement of a failed UPS. Downtime should be less than 10 minutes.
July 7: Catalyst 5000 Card Swap
Replaced a card on the Cisco Catalyst 5000 that was giving us some trouble around 11:00AM. There was a 2-3 minute network interruption for mail servers and the list server. All machines picked back up as soon as the network links were restored.
June 16: Mail Server Drive Failure
On Wed. June 16, approximately 3:15-3:30PM we began to have problems with our main mail server, Deliverator. Something was locking the machine up completely and forcing us to reboot the system. We first thought it was a load issue, but after further examination, realized that one of the hard drives was failing. The drive in question just happened to be the one where users' mail is stored.
We keep 3 days of backups of the mail spools, in case of emergencies such as this. The most recent one at the time of this problem (about 4:00PM) was from 6:00AM that morning, so we began a level 1 backup of the failing drive in an effort to save mail that had been received after that time. The drive failed several times during this process, and finally died completely before we could finish it.
During the time that the server was down, incoming mail was being spooled up on our secondary server and held until the mail server came back online. By the time the main server was back up, approximately 12,000 messages were waiting.
What does this mean for users? Plain and simple, some of your mail may have been lost. If it was received between 6:00AM and 5:00PM, there is a chance that the message was lost. We apologize for this. We did everything that we could to rescue the mail on the failing drive, but could not complete the backup before it failed completely.
If you were expecting a message, and did not receive it, we recommend contacting the person you were expecting it from and letting them know you did not get it.
June 15: Langley RAM Swap update
Langley was shut down for approximately 10 minutes this morning while new RAM was placed in the machine. This procedure went very smoothly and will hopefully resolve some problems we've been experiencing with that machine.
June 14: Langley RAM Swap
Langley.io.com, one of our virtual hosting webservers, will be powered down for about 10 minutes while we switch out RAM in the machine. Estimated start time is 5:30AM, Tuesday. All virtual domains hosted on this machine should be back in service by 5:45AM. This downtime will only affect those virtual domains specifically running on Langley.
May 28: Schultz going offline
The new hosts are up and answering to io.com. One of the old FreeBSD hosts, schultz.io.com, will be taken out of service today at 1:00PM. If you need to use a FreeBSD host instead of a Linux host, please use dillinger.io.com.
May 27: Impending Userhost Changes
This evening or tomorrow we will be replacing the current userhosts, dillinger and schultz, with two new Linux userhosts since the old ones have become increasingly unstable. The new hosts will be Fnord.io.com and Eris.io.com and will be what answer when one telnets to io.com. We will be keeping Dillinger around as a FreeBSD machine for those people that wish to use it. We are planning a hardware swap on it to help with the stability problems; that should happen shortly after the two new hosts come up.
May 25: Mail Server Changes
This afternoon we moved POP and IMAP service to a different machine in our mail cluster, in an attempt to put a stop to the POP lock problems. Customers should still use mail.io.com for incoming mail and smtp.io.com for outgoing mail.
May 14: ISDN and Houston Routes
UPDATE 2:50pm CDT: We have gotten both problems resolved and back
running in stable condition. The ISDN problem turned out to
be an address (mis)assignment issue. The Houston problem
was an issue with the actual T1 itself, requiring the telco's
UPDATE 12:00pm CDT: We have some quick-and-dirty fixes in place for
both the ISDN problem and the Houston routes. We are still
working on pinning down and correcting the sources of these
The routing work yesterday went off with only a few minor disturbances. However, this morning one of the ISDN servers is having some problems (we've directed calls to the other server) and our T1 to Houston just went into alarm at about 11:45 CDT. We're on the phone with people trying to get both of these things corrected. We'll try to keep updates posted here as well.
May 13: Network Problems
We are having to do some work on our main router today. This may cause periodic interruptions in some services. Most issues that arise should be related to routing and getting out to the internet. We thank you in advance for your patience.
May 12: Atlantis replacement
Atlantis.io.com, our web development userhost, was replaced this morning. The new machine is much more powerful, has much more RAM, and is much more similar to our current webservers. If you find anything missing on this host that you normally use, please send e-mail to email@example.com. Priority will be given to applications used in web development.
Apr 23: Main Router Offline
Beginning at about 2:55pm CDT our main router stopped carrying network traffic. After not finding an obvious solution, we contacted our Cisco representative and began troubleshooting the situation with him. The router was brought back online at approximately 3:45pm CDT. We are still attempting to fully diagnose the cause of the outage.
Apr 23: Drop in Incoming News
There appears to have been a significant drop in the number of articles transferred to news.io.com last night and this morning. The cause is currently unknown, but we are investigating it.
Apr 20: News Downtime
News was offline for approximately an hour while a corrupt active file was being fixed. The corruption was preventing recently added groups from receiving new posts.
Apr 19: Router Downtime and Filtering
The router for our Austin network was taken down early this morning to swap in a new network interface card. After it was brought back up, we enabled some filtering that we had set up last week to improve the security of our network and to block some common denial-of-service attacks. If you are attempting to run services that require the use of a port below 1000, please try to reconfigure your software to use a port of a higher number. If this is not possible, please contact us.
Mar 25: Unscheduled downtime of anonymous ftp and proxy/cache
The machine babbage.io.com handles all of our anonymous FTP access for ftp.io.com and proxy/cache service for proxy.io.com. Due to failed upgrades on the machine, these services will be down while the machine is quickly rebuilt. We apologize for any inconvenience this may cause.
Mar 25: Scheduled Maintenance Friday
This Friday morning beginning at 1:00AM CST we will be doing another round of maintenance and inventory on our servers. Several services may be unavailable for brief periods during this time, but none should be out for more than five to ten minutes. All work should be completed by 7:00AM CST at the latest. We thank you in advance for your patience.
Mar 16: Houston Outage
At approximately 4:00AM this morning Taylor Communications in Houston suffered a power outage, which in turn caused our dialup lines and our T1 between the dialups and our servers to go down. It looks like power was restored around 6:00AM, since the T1 came back up then. Dialup lines remained down until about 11:00AM, when a Taylor tech re-initialized the PRI lines to our servers. Everything is up and running now, and we have a few queries in at Taylor regarding the situation that we are waiting to hear back on.
Mar 14: Replaced Bavaria's NIC card
Last week, we noticed several errors being reported from the NIC in one of our network file servers. We contacted Network Appliance and had them ship us a replacement card, which arrived on Friday. The cards were swapped out early this morning and the new card appears to be running perfectly.
Feb 23: News 'active' file rebuilt
The 'active' file on the news servers, which tells the server which newsgroups to work with, had gotten corrupted and nothing after a certain point was being read. The servers were taken offline around 15:30 and the file was rebuilt, and the duplicate entries sorted out. News was back up by 16:30 and is now serving the recently added groups.
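For readers curious what "sorting out the duplicate entries" can look like, here is a minimal sketch. It assumes the common INN-style active file layout of four whitespace-separated fields per line (group name, high-water mark, low-water mark, flag); the entry above does not say which news software or format was in use, and the function name and sample data are purely illustrative.

```python
def dedupe_active(lines):
    """Keep the first entry for each newsgroup; drop duplicates and malformed lines."""
    seen = set()
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            # Not a well-formed "name high low flag" entry (assumed layout): skip it.
            continue
        name = fields[0]
        if name in seen:
            # A later duplicate of a group we already kept.
            continue
        seen.add(name)
        kept.append(line.rstrip("\n"))
    return kept


# Hypothetical sample data, not taken from the real news server.
sample = [
    "comp.lang.c 0000012345 0000012000 y",
    "io.admin 0000000420 0000000400 y",
    "comp.lang.c 0000012345 0000012000 y",
    "garbage line",
]
cleaned = dedupe_active(sample)
```

A real rebuild would also need to decide which duplicate carries the correct article numbers; this sketch simply keeps the first one it sees.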
Feb 19: Password file truncated
On the afternoon of the 19th something went wrong when the accounting system made an update to the main password file and pretty much ate it instead of updating it. This happened sometime between 14:00 and 15:00. By 15:00 the corrupted password file had been parsed and distributed across our network. This, to use a technical phrase, completely hosed the password file on most of the systems, which could only see about 450 users instead of the normal 7700. It took about two or three minutes before the calls started coming in, and we were digging out the backup password file about five after. After finding out what changes happened between the backup and the present, we manually made the changes and then redistributed the file across the network (which takes about 30 minutes to complete) and the new file was in place by roughly 15:45. This fixed a majority of the users, but some were still having problems. At this point, we found that the shadowed file had been corrupted as well, restored it, manually copied it to a few hosts to ease the problem more quickly (userhosts, dialup authentication) and let the updater take care of the rest, which was done around 16:30.
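One way to keep a truncated file like this from being pushed out is a size check before distribution. The sketch below is only an illustration of that idea, not a description of IO's actual updater: the function name, the 5% threshold, and the sample entries are all assumptions. Only the 450 and 7700 figures come from the incident above.

```python
def safe_to_distribute(new_entries, old_entries, max_drop=0.05):
    """Return True unless the new file is empty or has shrunk suspiciously.

    max_drop is the largest fraction of entries allowed to disappear
    between two updates (an assumed threshold, chosen for illustration).
    """
    if not new_entries:
        # A 0-length file is never safe to push out.
        return False
    if not old_entries:
        # Nothing to compare against, so accept the new file.
        return True
    drop = (len(old_entries) - len(new_entries)) / len(old_entries)
    return drop <= max_drop


# Hypothetical data shaped like the incident: 7700 users cut down to 450.
old = [f"user{i}:x:{1000 + i}:100::/home/user{i}:/bin/sh" for i in range(7700)]
truncated = old[:450]
grown = old + ["newuser:x:9000:100::/home/newuser:/bin/sh"]

print(safe_to_distribute(truncated, old))  # the truncated file is rejected
print(safe_to_distribute(grown, old))      # a normal update is accepted
```

A check like this does not explain why the file was damaged, but it turns a network-wide outage into a failed update that someone can look at.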
Feb 13: Outgoing mail problem / Mail loss
From the evening of the 12th to the afternoon of the 13th there were intermittent problems with mail looping between the two hosts answering to mail.io.com and mx2.io.com (the primary outgoing server). This is also the period during which several people reported that mail they had sent out was never reaching its destination, so it's reasonable to assume that the two are related. We've looked over the system and tried either to find where this mail may have gotten stored, or to find out what happened to it. Neither has been very successful. The most likely scenario that we could come up with after looking at the situation is that the messages that were lost entered the mail loop and bounced back and forth between the three servers. After a certain number of hops, sendmail will declare a message undeliverable and bounce it back to the sender, which is probably what happened. However, the bounce originated from inside the previously mentioned loop and was bounced back to MAILER-DAEMON as undeliverable as well, for the same reason.
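The hop limit described above can be sketched roughly as follows. Each relay adds a Received: header, so counting those headers gives the hop count. This is a simplified model, not sendmail's actual code: the limit of 25, the function names, and the "double-bounce" label are assumptions made for illustration. An empty envelope sender is the usual marker for a bounce message.

```python
MAX_HOPS = 25  # assumed limit; the real value depends on the mail server's configuration


def count_hops(raw_headers):
    """Count Received: headers. Continuation lines start with whitespace and are not counted."""
    return sum(
        1 for line in raw_headers.splitlines()
        if line.lower().startswith("received:")
    )


def disposition(raw_headers, envelope_sender, max_hops=MAX_HOPS):
    """Decide what happens to a message under this simplified model."""
    if count_hops(raw_headers) < max_hops:
        return "relay"
    if envelope_sender == "":
        # The message is itself a bounce, so it cannot be bounced to anyone.
        # It ends up with the local mail administrator instead of the sender.
        return "double-bounce"
    return "bounce"


# Hypothetical message that has gone around the loop 25 times.
looped = "".join(
    f"Received: from hop{i}.example\n\tby hop{i + 1}.example\n" for i in range(25)
)

print(disposition(looped, "user@example.com"))  # the original message is bounced
print(disposition(looped, ""))                  # the bounce itself never reaches the sender
```

In this model the original sender never sees an error, which matches the reports of mail silently disappearing.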
Feb 10: New mail servers online
The new mail cluster went online early this morning. It consists of a central mail server, with two hosts for reading and receiving mail. Deliverator acts as a central MX server which NFS-mounts the mail spools for anarchy and discordia, which answer to mail.io.com. With this cluster online, mail reading speed should increase greatly.
Feb 5: Dillinger-2 mail fixed
Once it was put into service yesterday, it was found that Dillinger-2 was not relaying mail sent from users on that machine to the main mail server, Deliverator. We found a configuration error and corrected it, and mail is now sending normally from this host.
Feb 4: New Host in Rotation
One of the two new userhosts that have been in testing has been moved into the regular rotation of userhosts. Now Dillinger, Schultz, and Dillinger-2 are the hosts answering to connections made to 'io.com'. Please let us know if you notice any problems with this new host so that we can get any final kinks worked out and swap the userhosts out completely.
Jan 27: Web Server Upgrades - A Timeline
A week in the life of a web server:
16:00 - Pentagon.io.com (a virtual domain server) completely failed.
The server had been giving us many problems and we had a replacement
server in the process of being built. With the machine down, the
priority was given to completing the replacement machine.
16:00 - While the replacement pentagon.io.com was being completed, the
engineers tweaked with the old server and finally brought it back to
life. This life was very weak and the machine was destined to die
again under the load it had. We continued work on the replacement
server, but under the circumstances of having a machine that was
actually working, we decided to fully develop the replacement machine
instead of putting it in service ASAP and then tweaking the
06:00 - The old (near-death) pentagon.io.com (a PPRO 200) was powered
down and the replacement machine (a PII 350) was powered on in its
09:04 - Pentagon (the new machine) ran out of processes and was reset.
11:33 - Pentagon once again ran out of processes and this time had
a tough time coming back on the 'net. Seems to be something weird
with the network card (3c905B).
15:45 - The old PPRO 200 machine was brought into service as
hakiriya.io.com (hakiriya is the headquarters of the Mossad). This is
now our fifth virtual domain webserver (pentagon, kremlin, langley,
whitehall, and now hakiriya). Due to loads experienced on pentagon,
we moved one half of pentagon's virtual domains onto hakiriya (thinking
that it is possibly one or two virtual domains that are causing the
problems...if we move them around enough and watch what machines die,
we'll be able to narrow down which virtual domain may be the problem).
01:59 - langley.io.com fell off the 'net. It was reset and came back up
just fine. The admin on duty noted that the machine room wasn't quite
as cold as normal and propped open the doors to ventilate cooler air
into the room.
16:55 - www.hoboes.com was down. Apparently, the IP address dropped
off of langley. ifcfg'd the IP address and restarted that virtual domain.
No explanation found as to why this happened.
19:36 - Engineer on call was called about the warmth building up in the
machine room. We found the circuit breaker for our dedicated AC unit
had flipped. Reset the circuit breaker and the AC came on. Airflow was
cool (not cold) and weak (not much), so we shut the doors and the admins
on duty were instructed to check the room and airflow every five minutes
20:40 - langley completely died again and was reset. Upon boot of the
machine, it dumped core and was rebooted again.
21:10 - Inspection of langley's logfiles showed many nfs errors. The nfs
clients were upgraded to latest release and the config files were
reset so langley would now mount nfs using version 3 versus version 2
(version 3 is faster and more reliable..version 2 is still default
in most systems, but we've had lots of luck with version 3). An early
morning reboot was scheduled to force the new nfs configs to take effect
(the motto here is "Don't reboot a machine if it appears to be working
21:55 - langley died on its own (thus the morning reboot was cancelled).
'Engineer on call' was back in the office about the AC problem again.
22:30 - AC repairman arrived on the scene to fix the unit. Engineer on
call was trying to debug further langley problems on top of
coordinating with the AC repairman and venting air into the machine room
(which was somewhere between 95 and 100 degrees F).
01:05 - named (a caching-only nameserver configuration on langley) was
freaking out and causing a great deal of network traffic. A caching
nameserver running locally normally helps a virtual webserver, but in
this case, it wasn't. Deactivated named on langley (as well as several
other services which were running...including mouse drivers, nfs server,
portmapper, and snmpd).
01:38 - AC unit repaired and cooling. The 2.5 ton unit dedicated to only
our machine room and telco closet cooled the rooms to about 70F within
about an hour.
02:40 - 03:20 - Several redundant and/or unused services turned off on
whitehall, kremlin, pentagon, and hakiriya. This seems to have cleared
up memory and processor power on langley, so it should help the rest of
the virtual servers.
17:00 - The replacement webserver that has been in testing and finishing
phase was setup on a rack in the machine room. The new PII 400 with
384MB of RAM was to replace the old www-01 (answering as www.io.com),
which is a PII 300 with 256MB of RAM. Replacement was being planned for
the early morning hours of 01/27/99.
20:50 - www-01.io.com (which answers all calls as www.io.com) died and
was reset. The machine would not boot. Since we already had another
machine handling all of the higher-traffic websites (www-03.io.com), we
redirected DNS configurations so www-03 would now answer all calls for
www.io.com. Due to the total amount of traffic on www-03 (now serving
duties for two machines), we stepped up priority on enabling the
replacement www-01 (was simpler to complete configurations on the new
machine rather than try and fix an old and broken one).
22:59 - The replacement for www-01 was deemed 'ready for traffic'. DNS
propagation had not made it that far out on the 'net, so we went ahead
with the next step: dueling webservers (round-robin via DNS). Both
machines are PII 400's with 384MB of RAM each. Each is running identical
versions of Apache (the world's most popular webserver), RealServer 5.0
(the RealAudio and RealVideo server by Progressive Networks), and our
Stronghold Secure Server (Licensed by C2.net) was replaced by the RedHat
Secure Webserver (www.redhat.com).
23:00 - Server modifications continue on both www-01 and www-03. Some
operations of the two machines seem to differ and they are being
reconfigured identically. The webmaster would really like to name
these (and future webservers), and is currently taking suggestions.
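The "round-robin via DNS" arrangement mentioned in the 22:59 entry above can be pictured with a small sketch. A nameserver holding several address records for one name hands them out in a rotating order, so successive clients tend to land on different machines. The class name and the addresses below are hypothetical (192.0.2.x is a range reserved for documentation); this models the idea only and is not how named itself is implemented.

```python
from collections import deque


class RoundRobinRecord:
    """Toy model of a DNS name with several address records."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        """Return all addresses, then rotate so the next answer starts with a different one."""
        result = list(self._addresses)
        self._addresses.rotate(-1)
        return result


# Two hypothetical webservers answering for one name.
www = RoundRobinRecord(["192.0.2.1", "192.0.2.3"])

# Most clients use the first address in each answer.
first_picks = [www.answer()[0] for _ in range(4)]
print(first_picks)
```

Because the split relies on clients picking the first address, and on caches expiring, the load is shared only roughly evenly, and a dead machine keeps receiving its share until its record is removed.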
Jan 22: ISDN Recap
Here is a complete evaluation of the ISDN problems earlier this year.
Dec 30, 1998- Authentication on all IO dialups became 'flaky'. After
investigating, we found that FastEthernet0/0 on isdn1 was
mysteriously "shutting down" to new connections, but
keeping alive its current connections. This caused
radius to die, and not authenticate anyone. In order to
isolate the chaos of not being able to log in, we needed
to make the ISDN machines authenticate using a separate
radius server than the normal analog dialups. Soon
after doing this, we discovered radius would crash because
of the strange FE0/0 "shutdown". Thinking it was a
hardware issue, Cisco sent us a new FE card.
Dec 31, 1998- When the new card arrived, we installed it, and things
were stable for about 2 hours, then it started all over
again. We requested a new unit, chassis and PRI cards.
It took a bit of "talking" to find a way to get the
equipment to us on New Year's Day, but it was shipped.
Jan 1, 1999- The new chassis and cards came in, but they sent the
wrong kind of PRI cards. We took the newest FE card,
and the old PRI cards and placed them into the new
chassis... after swapping the RAM on the boards, because
they sent us half the RAM we requested. It was stable
for about 30 minutes, then the same symptoms occurred.
We had them send us the correct cards the next day.
Jan 2, 1999- New cards arrived, installed, same problem. We took the
opportunity of this downtime to correct some routing
and arp issues we saw in our tables, which magically broke.
We also made it possible for many units to work in
Jan 3, 1999- The unit we ordered about a month ago arrived. We brought
it on the net, using the edited configurations (to conform
to the fixed routing/arps) and it seemed to be stable for
about 2 hours. We loaded an edited configuration to the
older unit and brought it online to see how long it'd
stay on the net. All the PRI's were in the newer unit.
We noticed old arps on the core router and switch, so we
cleared the arps on those, and rebooted both ISDN machines.
Both seemed to be stable with each other. We then took 2 of
the 6 PRI's and put them on the older box. After clearing
the arp again on the core router, switch, and both ISDN
boxes, things worked like they should and have been up since.
Jan 4, 1999- Some routes were still incorrect, fixed all routing to ISDN
Three good side effects of this 'malfunction' were the fixing of routing, the addition of the second server working in parallel, and new hardware. The bad side effects are that we and Cisco are both still baffled as to why it suddenly broke, why FE0/0 was acting the way it was, and why radius was actually dying, plus 4 1/2 days of downtime for ISDN users.
Dec 31: Radius dialup problems
Beginning about 9am this morning the radius authentication server for our Austin dialup lines started crashing whenever an ISDN user attempted to log in. We finally found that one of the password files was corrupted, which was causing the ISDN server to send bad signals to the radius server and crash it. We've got the password file fixed now and are looking to see how it was corrupted in the first place.
Dec 27: ISDN PRI still giving us trouble
We've opened up a trouble ticket with Time Warner about these PRIs acting strangely. We show the lines to be perfect on our interfaces. Time Warner reports that the D channels are up fine, but they are having problems with the B channels. If you are getting connected, but can't go anywhere, this is why. Please refer back here for updates on this issue.
Dec 11: ISDN and dialup sluggishness
We've had many reports of 'sluggishness' with ISDN connections. They range across all of our Cisco access servers. After troubleshooting every possible local problem with our equipment, we have contacted both Cisco and our telephone circuit providers. The problem is with "Line Code Violations" on the incoming phone circuits. As of 1PM, we have about 24 open trouble tickets (one for each circuit) with the providers. They are actively troubleshooting each circuit until they all check out ok.
Nov 19: RAM Problems
The RAM added to password.io.com apparently has some compatibility problems with the system, causing it to segfault. We are removing said RAM.
The drives and PCI cards from password were moved into another case with a motherboard and RAM that we knew were compatible and the server came back up fine. Password is now running at 233MHz with 256MB of RAM.
Nov 19: RAM Added to Password
Password.io.com was given an extra 64MB of RAM today, bringing it up to 128MB total. The upgrade was made to accommodate some additional processing we plan to do on that host.
Nov 18: IXC Maintenance
IXC Internet Services, our main upstream provider, will be doing some maintenance on their primary routers early on Thursday morning. They will be beginning the work around 4:30am, and expect to be down no more than 15 minutes. During this time, io.com customers may be unable to reach certain internet sites over the web or ftp, through email, or by telnet. The following explanation of the maintenance was given by IXC.
SMARTNAP Technical Contacts: ;
Subject: IXCIS Maintenance (Thurs, 19-Nov-1998 @ 04:30)
Date: Tue, 17 Nov 1998 20:18:48 -0600
From: "David P. Maynard" <firstname.lastname@example.org>
IXC Internet Services will be performing scheduled maintenance on the primary Cisco 7513 routers at the Stonehollow (former SMARTNAP) site on Thursday morning, November 19, starting around 4:30am. Overall connectivity should not be disrupted for more than 15 minutes. (Any outage should be closer to the 5 minutes needed for the aus1 router to reboot.)
The Cable & Wireless (formerly MCI) connection may be down for around 30 minutes while it is moved to another router. Traffic will automatically reroute across the other connections during the C&W work.
We will be performing the first in a series of upgrades to increase the overall capacity and reliability of the Austin site. Once the changes have been completed, traffic will be split across two upgraded Cisco 7513 routers that have the latest interface adapters and maximal memory configurations. There will probably be 1-2 additional router reboots scheduled over the next few weeks to complete the upgrades.
In the next few weeks, we will also begin passing traffic across the IXC connection to the public exchange points at MAE-East, MAE-West, SprintNAP, and Ameritech NAP. The exchange point connections will be used to carry traffic that already passes through the public exchanges and should increase performance to those sites. The private connections to UUNET, Sprint, C&W, and AT&T will still be carrying most of the traffic that can bypass the public exchanges.
Additional bandwidth and equipment upgrades will be announced and brought online over the next few months as we begin to realize more benefits from the IXC acquisition.
David P. Maynard, Senior Manager, Internet Services IXC Communications, Inc.
Nov 16: Sprint outage
Part of the Sprint network connection will be down tonight. Sprint has provided the following explanation. IO traffic will not be affected as we have multiple connections to Sprint. The only traffic that will be affected will be to other service providers connected on the other side of this specific Sprint router (and presumably only those that do not have multiple backbone connections).
Nov 16 19:42:10 1998 Date: Mon, 16 Nov 1998 17:58:51 -0500 (EST) From: Outage
EMERGENCY MAINTENANCE ANNOUNCEMENT
Purpose for maintenance: Network fix
Maintenance date: 11/17/1998
Maintenance start time: 00:01 EST
Estimated end time: 01:00 EST
Details of maintenance: On 11/17/1998 at 00:01 EST, hardware on sl-gw3-rly will be replaced. The router will be powered down for the duration of the maintainance.
Impact to customers: Customers directly connected to this router will lose connectivity for the duration of this maintainance.
Please bear with us as we work to resolve this issue. We thank you for your patience and patronage.
Nov 10: Houston Back In Service
We have all of the phone company problems resolved and all of our Houston dialup lines should be back in service. If you are still experiencing any difficulties connecting to us from Houston, please contact us at (800) 294-6266 at any time and we will be happy to help you.
Nov 10: Further Update
The cable that connected the dialup server to our T1 server was found to have a short in it. That has been replaced, and the backup server is currently connected, but only one PRI is active. We have contacted Taylor Communications, who is working on the clocking problem that the PRI they provide us is having. We expect a resolution by late this afternoon or early this evening.
Nov 10: Houston Update
Our engineer in Houston suspects that the problem may not be the dialup server itself, but rather the cable that connects it to the rest of our hardware. At last report, he was on his way in this morning to verify this. If this is the case, repair time will be approximately as long as it takes to find a local retailer who sells Cisco supplies.
Nov 9: Dialup Outage in Houston.
The dialup access server in Houston has had a system failure. We have tried to recover the system remotely with no results. We are configuring a backup server currently and it will be driven to the Houston office for immediate installation. Configuration of the server should be complete in approximately 45 minutes. Transportation time will be about two and a half hours, plus whatever Houston rush hour adds to that.
Nov 5: Deliverator motherboard swap
Deliverator got hung up again and had to be rebooted. Upon reboot there were problems with the motherboard that prevented the system from rebooting properly. We are transferring the filesystems onto the old server, which is slower, but more reliable than this one. With the old server in place, we will be rebuilding the new one, after which we will put it through another round of testing before it goes into service. An announcement will be made when we plan to put the new machine back into service.
Nov 5: Schultz down for new power supply
Schultz, one of our two userhosts, was down briefly this morning so we could replace a failing power supply. Downtime was about 15 to 20 minutes.
Nov 4: Deliverator kernel rebuild
Deliverator was rebooted last evening to correct a minor problem. After this reboot, it came up under its default kernel, which is not what it had been running on, since we had been using special kernel builds for the RAM testing and problems last week. The default kernel had some problems with the new setup, so it was rebuilt this morning and a more appropriate one was put in place. The server was again rebooted and is now running on the new kernel.
Nov 2: A few MH kinks to work out
We were noticing a problem with some of the MH commands on the new mail server, 'inc' in particular. We have gotten that one taken care of, but if you encounter any other MH problems, please let us know in the io.admin newsgroup or at email@example.com.
Oct 30: Mail Problem Found
After attempting several software fixes for the mail server problem, none of which solved it, we looked into the possibility that we might have a bad stick of RAM in the server. We tested swapping the RAM into the server in different configurations to determine which stick, if any, was the cause of the problem. Sure enough, one of the 128M sticks of RAM was found to be the source of the problem. The server has been up and running fairly well since as a Pentium II 400MHz.
Oct 29: Mail Server Difficulties
Over the past week or so, we have been noticing an increasing number of errors and corrupted files on the mail filesystem. In an effort to remedy this before it caused more serious problems, the mail spool was rebuilt onto another computer, which was in turn brought up as the new mail server early this morning. There were some configuration problems earlier in the morning which have been sorted out now, but the filesystem is still giving us some serious difficulties.
As a result of this, the server may be down intermittently while we go through the process of trying to correct this. If the server is down, customers will be unable to download their email, but should still be able to send email, as this is handled by a different server. Mail that comes in during this time will be received and spooled on a secondary server, which will deliver it once the main server is back online.
We apologize for this inconvenience. Our engineering staff will be working on this problem until it is resolved. Once we have this system in working order, we have made plans to build a separate server that will handle all POP services. This will mean that even if the main mail server goes down, customers would still be able to read their mail that was currently on the server. This will add an extra level of redundancy to the system, much like what we currently have for sending mail.
Oct 26: Scheduled Downtime for Services
With 1998 drawing to a close, everyone is picking up the pace on their year 2000 conversions, including us. We will be performing some compliance tests on all of our servers this week to ensure that IO will still be online 14 months down the road. Each of these services will be affected for brief periods during the time they are scheduled for this maintenance.
Userhosts - Telnet logins to our userhosts will be unavailable.
News - News service will be interrupted briefly.
Web - Our main web server will be offline briefly.
FTP - Our anonymous FTP server will be offline for a short time.
List Server - Mailing lists will be temporarily unavailable.
Password - Password changes will be unavailable for a time.
Proxy Server - Web proxies will be off during this time.
Mail - Customers will not be able to send or retrieve mail for a
Name Servers - Some lookup requests may fail, but this should be mostly
Virtual Domains - These servers will be down briefly, which will cause
short periods where the domains will not be available.
We apologize for any inconvenience that this may cause you, and we thank you for your patience as we go through this process.
Oct 20: Read Our Lips, "No New (V.90) Tests"
There will not be an upgrade or line switch early tomorrow morning as announced. A hardware issue was encountered when we were going over what we would have to do for the upgrade. This will be postponed for a later date, which will be announced in advance here in IO Revealed! and in the io.admin.announce newsgroup.
Oct 20: New V.90 Test
We finally have a test version of the V.90 software for our newer model of dialup server, the Cisco AS5300. We will be moving the current V.90 test line to this server so that we can begin testing the new server for compatibility issues and other problems.
This change will be taking place early Wednesday morning, and there will be brief dialup downtime while servers are swapped. If you are currently dialing the test line (493-9999) or if you wish to test your modem on the line, please report any problems you encounter to firstname.lastname@example.org or to the io.admin newsgroup.
Oct 19: Austin to Houston Connection Disruption (Part II)
The direct connection from Austin to Houston is still down as of 01:15 Monday. IXC communications has determined the cause to be a fiber cut about 11 miles South of Bastrop, Texas. Due to the heavy rains and dangerous flooding conditions, they cannot give an estimate to time of repair. Routing of data between Illuminati Online's Austin and Houston connections has been redirected through other service providers, but these routes send data through many other networks and not all of those routers are reliably transmitting packets.
Illuminati Online will continue to monitor the progress of repairs to the Austin-Houston circuits and will be able to return to these direct routes as soon as our circuit comes back to life.
Oct 18: Austin to Houston Connection Disruption
The T1 circuit that connects our Austin and Houston offices failed shortly after 2PM, Sunday. Hardware tests at both ends indicate that the failure is in the line. Traffic trying to connect between Houston and Austin will be slow as the data tries to find alternate routes. Our T1 provider (Southwestern Bell) has been notified of the trouble and is in the process of fixing the line.
Oct 15: Password Update
We have restored service to most accounts now. Users who started their account in the last five days, or those who changed their password in the last five days, may have problems getting connected, as we had to use an older backup. These accounts are being manually added over the next couple of hours.
Oct 15: Password Problems
Shortly after 10:30am the system stopped recognizing the passwords of the majority of our users. We first thought this was due to a failure of the system that updates passwords across the network, but found that the main password file had been corrupted. Our engineers attempted to reconstruct the password file, which appeared to work initially, but only partially fixed the situation. Currently we are attempting to retrieve a backup copy of the file from the tape backup that ran overnight last night. We currently do not have an estimate on how long this procedure will take. We apologize for the inconveniences this causes you, and assure you that all of our efforts are currently directed towards resolving this problem.
Oct 11: FTP server and userhosts
At about 2:50PM, the power supply on babbage.io.com failed. Babbage hosts both our anonymous FTP server (ftp.io.com) and our caching proxy server (proxy.io.com). The power supply was replaced and the machine was operational after about an hour of downtime.
At about 4:15PM, both userhosts (dillinger.io.com and schultz.io.com) became unusable when both machines received 0-length password files. After manually recreating the password files and rebooting each machine, they were once again available around 6:10PM.
Sept 18: Taylor adds more connections
Taylor Communications, our digital line provider for our Houston dial-up servers, is scheduled to increase the number of connections between their switches and Southwestern Bell's switches in the Houston calling area. This increase should decrease the frequency of busy signals when trying to dial our servers from a SWB phone line, or remove the problem entirely.
If you do continue to receive busy signals, please call or email us and let us know at what time, and provide us with your area code and 3-digit phone prefix. We give this information to Taylor so that they know which switches are having problems.
Sept 3: V.90 Comes to Houston
Houston's Cisco AS5200 server was upgraded to the V.90 56k protocol yesterday afternoon. We have been testing this upgrade on the Austin lines, and have not found any problems with it. If your modem is not currently capable of V.90 performance, you might want to visit http://www.56k.com to find out how you can upgrade your modem to take advantage of the faster speeds offered.
Sept 3: Houston Busy Signals
On another Houston note, we have been getting reports that some users are receiving busy signals when dialing into our lines in the evenings. The odd part is, our servers have not even gotten filled to half capacity during the reported times. We have checked our hardware, and all of the modems are answering calls properly. This leads us to suspect a telco problem might be the source of our woes. If you do get a busy signal, please note the time and either call us at 800-294-6266 or send us email (when you get through) at email@example.com, and let us know the time the problem occurred and your area code and phone number. This will help us to try and determine if the problem is with the phone companies.
Sept 2: News Rebuild
We attempted to change our news setup to make use of the new CNFS file storage method in combination with the old storage methods early this morning. Combining the two storage methods did not work. News groups would list the correct number of articles, but no actual articles would appear. Our previous setup was restored, but all the news in the alt.binaries.* hierarchy was removed. New articles should be coming into those groups now. Please report any other problems you encounter with news to firstname.lastname@example.org.
Sept 1: Pentagon Retired
Pentagon was retired from service at 10:00am this morning. Our current userhosts are Dillinger and Schultz, both of which run FreeBSD. Please let us know of any problems or conflicts that you encounter on these hosts at email@example.com.
Aug 27: Removing Pentagon from Service
Pentagon.io.com will be removed from service on Sept. 1, 1998, which is next Tuesday. If you have crontabs, please migrate them to Dillinger or Schultz. If you have need of a specific program which exists only on Pentagon, please notify firstname.lastname@example.org and we will install it on Schultz and Dillinger.
Aug 18: V.90 Testing Available
The V.90 dialup is now available for testing. You can reach this server in Austin by dialing 493-9999. Please do not idle for long periods on this line. You are welcome to test it and use it as much as you like, but it is primarily a testing line, and a 10-minute dialup timeout will be strictly enforced on it. Please report any problems you experience with this server to email@example.com, the io.admin newsgroup, or to our customer support department at 462-0999.
Aug 18: News Downtime
We took down the incoming news server, solomon, for an OS downgrade from FreeBSD 3.0 to 2.2.7 to resolve some ongoing problems we have been having with it. During the downtime, which was about two hours, news did not come into, or leave, io.com, but was spooled on either end. The OS downgrade did not resolve the problem. The server is back up and running at previous conditions, and our next step is to consider an OS change. We will likely install Linux on solomon in the near future.
Aug 14: V.90 Testing
We've placed an order for a new PRI to be turned on in the office and will connect it to one of our AS5200s for testing of V.90 connections. We expect the new PRI to be activated on Tuesday the 18th. We'll make further announcements when we have the equipment ready for testing.
Jul 28: Postal Script
We apologize for the odd mail happenings. Last night, an error in the script that deletes the mailboxes of old accounts caused several people who still had current accounts to lose their mailboxes. To correct this, we restored the mail spool from yesterday's backups. This has probably caused you to receive multiple copies of messages this morning. There is also a possibility that mail received after the time of the backup (about 12am Monday) and before the time of the restore (about 10pm Monday) could have been lost in the restore, if you had not downloaded your mail during that time.
Jun 03: News Service
Today, starting at roughly 2:00pm CDT, hiram ceased to answer news queries. The active file, which contains the active newsgroups and their corresponding article numbers, had somehow acquired a bogus entry. Innd 2.0 subsequently decided to hang rather than crash or report an error. I corrected the error and restarted the server around 2:13pm. Everything seems fine. I also performed a quick upgrade to the latest release (6/2/98 vs. 5/11/98) to correct the error which caused the problem in the active file. -firstname.lastname@example.org
May 31 (12:30PM CDT): Dialup Outage for Austin
As of approximately 12:15PM, all of our access servers were powered back on and connections once again appear to be stable.
May 31 (11:30AM CDT): Dialup Outage for Austin
Waller Creek Communications (1801 N. Lamar) suffered a power outage this morning sometime around 9am. They lost electrical power at their pole (a squirrel fried itself across two phases). City of Austin Electrical company will have to perform the repair to restore power.
Dialup access for our Austin customers as well as our DS-3 connection and 100Mb fiber link go through equipment at Waller Creek. IO.COM's equipment lasted as long as the UPS we were plugged into, probably 30 to 40 minutes. At the present time IO.COM's equipment is powered off of Waller Creek's backup generator, but power is less than stable. The Waller Creek equipment room is very hot and dark; they are working on getting air conditioning powered off of their generator. To reduce heat load in that room we have switched off as4.wc-aus.io.com and as5.wc-aus.io.com (two of the dialup access servers). Best estimate for restoration of power is sometime this afternoon.
May 18: News
Incoming news came to a halt for a brief period of time over the weekend. As soon as the trouble was reported, our news server was restarted and appears to be catching up on articles. Everything should be fine by 7PM.
May 13: New News Server
Our new news server has been put into full service. More details on this server can be found in the newsgroups io.admin and io.admin.announce, and on the Revealed homepage.
Apr 27: Dillinger
Dillinger ran out of swap space at approximately 6:00PM. It was rebooted. We will be adding more memory and swap in the near future.
Apr 21: Deliverator
We rebooted Deliverator this afternoon. Mail was unavailable for approximately 15 minutes.
Apr 16: Deliverator
Deliverator suffered a kernel panic for no discernible reason, and it took about twenty minutes to get it a drink of water, calm it down, and convince it to return to its duties of delivering mail. It seems quite relaxed now. We're not sure what spooked it. A psychiatrist has been consulted. A *very* few e-mails may have been lost or corrupted. As of this time, we have received no reports of such an occurrence.
Apr 09: Dillinger
Today at roughly 5:00pm Dillinger ran out of RAM and swap space and promptly froze. It had 104 users online at the time and didn't seem to be slow from a local standpoint.
The crash was unfortunate, and to remedy it for the time being, I have added pentagon back into the io.com rotation. On Monday, I will either replace the motherboard with one that can support more RAM (this one only goes to 128M SDRAM) or, quite possibly, swap the K6 233 with the PII 233 that I have slated for the new news server. The PII's motherboard can handle 384M of RAM and it currently has 256M of SDRAM.
Apr 03: Elm: we no longer support elm, as it is no longer being maintained. However, elm has evolved into mutt, which we do support. Elm is still available on some userhosts, but will soon go away entirely.
Mar 26: NNTP-2 is back up.
Mar 26: NNTP-2 is being taken down to clear its cache. The news servers somehow got out of sync, and there were reports of duplicate articles. News will be sluggish while NNTP-2 is down.
Mar 25: An obscure bug is causing certain attached files to very occasionally crash the mail server. We are monitoring the situation. However, as of this afternoon, there was considerable mail spooled (while the mail server was down) and it will take about a day for it to be cleared out.