GIUZ Status
May 16th, 2012
17:28 – Matlab licence shortage resolved
Currently there are again ~30 Matlab licences available.
11:16 – No more Matlab licences
All Matlab licences of the UZH are in use (we are talking about Matlab core licences here, not Toolbox licences, which are known to be scarce).
We informed the Informatikdienste about it (around a quarter of the licences seem to be used by a course) and they promised to have a look at it.
May 5th, 2012
01:00 – All services up and running again
Finally … postmortem follows.
May 4th, 2012
14:51 – LDAP Problems causing server outages
We have identified the problem: our LDAP server. Estimated time to fix: 30 to 60 minutes.
13:52 – Massive Problems with the whole infrastructure
Currently there is a problem affecting almost all services. We are investigating the problem.
March 20th, 2012
11:11 – User webspaces restored
All the user webspaces have been restored. We are still investigating the root cause of the problem.
09:44 – Userweb migration problem
Last night, the shadow migration of user web spaces to the new storage infrastructure failed partially. Currently we are working on restoring the datasets on the old storage servers.
November 14th, 2011
09:43 – Service Hiccups due to Power loss
Following an explosion in an EWZ transformer station, the floor switches in Y25 rebooted themselves, leading to a service interruption of about 5 minutes. The server rooms were unaffected thanks to their UPS systems.
November 9th, 2011
09:43 – Services are back
All services should be back.
09:05 – Virtual machine host down
It appears that the virtual machine host ‘cyllene’ is down again, and with it some of our linux-server infrastructure and incoming/outbound mail. We’re investigating.
November 5th, 2011
23:45 – All hosts up and running
Some filesystems had to be fixed, due to the system halt. But otherwise the host rebooted cleanly. Applause for the work horse ‘cyllene’ with an uptime of over 450 days!
22:48 – KVM node crash
Our host cyllene crashed, affecting these virtual machines: geoscience, ls2, ls3, ls7, ls10, ls11, ls17, ls18, geo441, permosdata and one of our mail hosts. The host is currently rebooting.
November 1st, 2011
08:31 – Disk failures, reduced service
Two disks in two independent servers failed yesterday/this morning. We’re recovering, but file services might be slower than usual. Email might be affected too.
Your data is secure, since we have proper redundancy at all levels.
October 25th, 2011
11:10 – Host rose successfully updated
The climap access server is up and running again.
10:14 – Climap Server rose down due to upgrade
The Windows Terminal Server “rose” is currently down due to a disk upgrade.
August 30th, 2011
15:40 – Server back again
The problem mysteriously disappeared. We hope to replace the server with new hardware soon.
15:27 – Server crash, many services affected
We’ve just had one of our main servers crash. Many services such as mail, file sharing and login are affected. We’re looking for the cause right now.
August 18th, 2011
12:09 – All services working fine
We assume that a network-related glitch caused a great deal of our IT infrastructure to stall.
11:54 – Serious problem affecting all the infrastructure
We are investigating the issue.
July 8th, 2011
16:11 – Helpdesk Phone Number working again
The issue with the helpdesk phone number has been resolved. You can happily call 55888 again.
July 1st, 2011
16:33 – Helpdesk Phone Number Out of Order
The IT helpdesk phone number (044 63 55888) is currently out of order due to electrical work conducted in the server room. Please send an email to computer-help@ or use the phone number 044 635 52 43 instead.
June 14th, 2011
17:38 – Source Control Server up again
We’ve restored operation of the source control server.
09:30 – Source Control Server inoperative
Today, our source control system at ‘scm.geo.uzh.ch’ will be inoperative. This is caused by a system failure. No data is lost and we expect to have this service up again by the end of the day.
April 1st, 2011
17:22 – Matlab License Server up and running
The UZH computing services got the Matlab license server up and running again.
14:56 – UZH Matlab licence server is down
The Informatikdienste UZH are working on it.
December 14th, 2010
08:56 – Reboot has solved most of the problems
We’ve rebooted the file server that was down; most of the problems have been solved this way. We’re looking for a cause.
08:08 – Fileservers down
Some of our file servers have developed an acute condition overnight. We’re investigating the issue and will try to get them up and running as soon as possible.
Thank you for your patience.
November 24th, 2010
12:44 – Sunray servers should be working again
We have rebooted the Sunray Servers. All logins should be fine now. If any problems persist, please stop by.
11:56 – Sunray trouble
Sunray servers are having trouble when new users log in. We’re working on the problem. If you’re experiencing login issues, you’re not alone.
November 22nd, 2010
08:47 – Downtime for scm.geo.uzh.ch
Our repository server at scm.geo.uzh.ch had been down since Sunday 12:00. The server had a read-only root disk because of a SCSI link failure. Everything should be good now.
November 16th, 2010
10:46 – New Mailadmin Webinterface working
The new mailadmin web interface is now running stably.
November 12th, 2010
23:27 – New Mailadmin Webinterface not working
The new mail administration interface, where you can set your spam settings and the vacation message, is not running stably. We are still investigating the cause of the problem. Meanwhile, if you want any of your email settings to be changed, please send an email to GIUZ IT, so we can make the changes for you. Sorry for the inconvenience.
09:09 – All problems have been attended to
Both issues of the past few days have been attended to. The power outage hasn’t caused any permanent damage, thanks to the UPS equipment that now keeps our servers happy in such a case. We could not connect, but haven’t lost any data, unlike the last time about a year ago… As soon as ID had sorted out the trouble with devices not booting properly, the whole infrastructure was working again.
Course Disks (K:\ on wsc.geo.uzh.ch) have been extended with a new server and new disk space. There is now enough space for even more courses than we currently have. We’ll try to improve the situation further by integrating that storage architecture with all our other servers.
As usual, if any problems remain, please don’t stop by; write us an email at computer-help@geo.uzh.ch instead. This is really the quickest way to get our attention.
November 11th, 2010
15:04 – Technical Problems in Building Y25
Technical problems, possibly due to a short power outage, are causing massive trouble. All the network switches rebooted, the fire doors closed and many other systems are behaving strangely. The Service Center Irchel is still investigating the issue. GIUZ IT is slowly recovering, but some services are still not back to normal.
November 10th, 2010
10:28 – Course disks are full
The K: disk on the course cluster (wsc.geo.uzh.ch) is full. We’re working on extending the disk. Until then, you should save your files elsewhere; in your home or on your desktop. (File > Save as…)
October 15th, 2010
12:40 – WSC - all nodes up and running
We’ve replaced the broken server (one of the wsc nodes). The full capacity of the Windows Course Cluster (wsc) is available again.
11:00 – Service interruption for a few machines
We’ve just had a short unexpected downtime for the following machines:
geo441, ls0, ls1, ls2, ls3, ls7, ls10, ls11, permosdata
Everything should be up and running again.
October 13th, 2010
14:50 – Reduced service
We’re experiencing high load on the Windows Course Cluster (wsc). This is due to one of the machines that broke down this morning. We’re setting up new machines to take over. In the meantime, we apologize for the delays and login problems you might experience.
September 28th, 2010
14:15 – DHCP server crash
Our DHCP server was away for about ten minutes just now. You probably had reduced network connectivity. The problem is resolved. Try renewing your DHCP lease or contact us should your machine not reconnect by itself.
September 27th, 2010
17:52 – Repository server up
The web interface of our repository service was down the whole day. It should now be working again.
September 1st, 2010
10:40 – Server problems solved
The reboot of a server room switch this morning caused the network bridges on two of our KVM hosts to fail. The hosts had to be rebooted and are up and running again.
09:55 – Half of the servers down due to network problems
August 4th, 2010
15:46 – Fiber Optic Network problems solved
A faulty GBIC Fiber Optic Transceiver was causing link errors. The defective part has been replaced and the network seems to be stable now.
14:14 – Downtime of several hours expected
Contrary to the last post, we are unable to migrate the affected servers to the other server room. The transfer would have to take place over the NUZ network, which would take too much time. Instead, we opted to have the UZH Informatikdienste fix the link. This work is currently underway.
Affected services and servers:
- www.geo.uzh.ch
- dsgz.geo.uzh.ch
- geo441.geo.uzh.ch
- permosdata.geo.uzh.ch
- tripodsrv
- Comsol licence server
- ls2
- ls3
- ls4
- ls7
- ls8
- ls9
- ls11
14:03 – Several systems down due to network problems
A broken fiber-optic link between our two server rooms is causing massive trouble with many of our services. We are transferring the affected services away from the affected infrastructure.
July 13th, 2010
20:24 – Services Up
The migration to the new switches has completed. The linux KVM infrastructure should be up again. All servers are online.
18:54 – Migration to new switches running
The linux KVM infrastructure is currently being migrated to the new switches. Some servers are currently offline.
14:54 – Services UP
The file server problem has been fixed.
14:23 – Fileserver Problems
Affecting the Windows Terminal Servers (wsc.geo.uzh.ch and wsr.geo.uzh.ch)
July 3rd, 2010
08:38 – Reboot of several servers
Due to storage space issues, we’ve had to reboot several servers this morning:
- dsgz
- geo441
- keskonrix
- ls1
- ls2
- ls3
- ls5
- ls6
- ls7
- permosdata
We apologize for this. We have now taken more drastic measures to prevent this kind of thing from happening again.
June 16th, 2010
15:39 – Services up
All services should be up again. We’re still looking for the cause.
15:21 – High NET load, possible attack
Server ‘callisto’ has very high incoming traffic and might be under attack. We’ve had to reboot it twice as a consequence. The WWW and Subversion/git services are affected. Other servers (ls5, ls8, ls9) are also affected.
We’re investigating the issue and will post an update later on.
09:40 – WLAN and VPN maintenance
ID (Informatikdienste UZH) report intermittent failures in WLAN and VPN service at the institute. This is due to a software bug. They are working on a solution.
Update 11:25: ID reports that the problems have been solved.
June 15th, 2010
10:18 – VM infrastructure up and running
The VM infrastructure is up and running again. We had to replace a server with a faulty CPU. As far as we can tell, no data should have been lost. If you experience any issues at all, please write to computer-help@geo.uzh.ch.
The server issues started at 4:40 am and have just been resolved, making for a downtime of about 5 hours.
08:13 – VM infrastructure unresponsive
Our virtual machine infrastructure is currently unresponsive. We’re investigating the issue.
Update (8:50): One of our storage servers is faulty. We’re switching all VMs over to backup storage. This will take a few hours.
Update (9:55): Most of the servers are up again. The remaining machines will be restored in a few minutes.
June 1st, 2010
18:46 – Failure reason: Server was full
The storage servers were just plain full and our storage surveillance has proved ineffective. As a consequence they refused writes and the corresponding number cruncher / linux servers remounted their root disks read-only.
We have now made some room for your writes and all servers should be up again. Once we have a new storage server set up, we will be transferring some disks. As a second measure, we are also improving our monitoring to show these errors early.
Linux servers (number crunchers and web servers) had intermittent service all day long. As of now, they are up and running again. We are sorry for the inconvenience we have caused.
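The improved monitoring mentioned above is not described in detail. As a rough illustration only, a minimal fill-level check could look like the following Python sketch. The function names, the 90% threshold and the checked path are assumptions made for this example, not the GIUZ setup.

```python
import shutil


def used_fraction(path):
    """Return the used share (0.0 to 1.0) of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # pseudo-filesystems may report a size of zero
        return 0.0
    return usage.used / usage.total


def check_filesystem(path, warn_at=0.90):
    """Build a one-line status report; WARN once usage reaches `warn_at`."""
    fraction = used_fraction(path)
    status = "WARN" if fraction >= warn_at else "OK"
    return f"{status} {path} {fraction:.0%} used"


# Hypothetical usage: check the filesystem of the current directory.
print(check_filesystem("."))
```

Run periodically (for example from cron) and wired to an alert, a check like this warns before a storage server starts refusing writes.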
14:57 – Write timeouts on storage (again)
We’re moving the storage to the failover server.
12:14 – Issues resolved
We’ve resolved the issues with our storage servers. We suspect that we’ve had a short network outage around 9:30 this morning which led to write time-outs on the number cruncher setup.
10:25 – Timeout issues with Numbercruncher storage
A few number cruncher machines are currently having issues with their attached storage. We’re working on the problem. Some of the machines might have to be rebooted.
May 21st, 2010
11:52 – Windows TS Course Cluster - all nodes up and running
The reboot of the failing cluster node has completed, the Windows TS course cluster (wsc.geo.uzh.ch) is back with its full capacity.
11:49 – Windows TS Course Cluster Server Node crashed
One of our Windows Terminal Server Course Cluster nodes crashed with a blue screen. Reboot is in process.
May 17th, 2010
09:28 – Mailservers are back up
We had to reboot the mailhost. Investigation still running.
09:00 – Outgoing/Incoming Mailservers down
We are working on it.
May 10th, 2010
12:20 – KVM Storage is back up
We’ve recovered all data on the storage. The affected machines should be back up and running. We’re currently looking for the cause of this downtime and taking actions to prevent it.
10:46 – KVM Storage down
We’re having trouble with one of the backend servers of our virtual linux server infrastructure. This affects the host ‘ls4’.
May 5th, 2010
10:02 – Migration of User Homes Done
We’re done migrating user homes. The new homes have been integrated into our new storage infrastructure. You may have noticed that you now receive status reports every month. If you need more space in your home, write a mail to computer-help@geo.uzh.ch.
April 6th, 2010
13:46 – Storage server replaced
We were unable to boot the storage server and have replaced it with our hot standby server. All services are restored for now, but will undergo maintenance downtime this afternoon / tomorrow morning. This should be quick and affected users will be notified beforehand.
We’re sorry for the downtime. If anything still doesn’t work, please contact the helpdesk.
11:13 – Storage server trouble
One of our storage servers is currently not answering requests. This affects the web site of the institute, source code repositories and the following server machines: ls1, ls2, ls4, ls5, ls6, geo441 and dsgz. We’re rebooting the machine.
March 30th, 2010
15:27 – Migration of User Homepages DONE
We’re done migrating user web spaces. You can access your web space (if you had one) as before by accessing http://www.geo.uzh.ch/~GIUZ_LOGIN. If you have any questions regarding this, please read the wiki page and then write a mail to computer-help@geo.uzh.ch.
15:09 – Scheduled downtime for fs
Host fs will be down from 18:00 till 22:00 today. Windows & Sunray login and file serving are affected. Please also see the mail you have received.
12:12 – KVM Host rebooted
One of our KVM hosts just lost all networking. We’ve rebooted the host. As a consequence, the hosts ls6, ls7, permosdata and tiva have also been restarted.
March 29th, 2010
12:30 – tethys up again
Host tethys has booted cleanly. We’re still trying to find the cause of this interruption, but everything should be back to normal in the meantime.
12:18 – Reduced Service
One of our file servers (tethys) has just spontaneously rebooted. This impacts many services including windows login and unix login. We’re working on the issue.
11:25 – Migration of user homepages
We will migrate all web spaces that are below ‘www’ in your unix home to a new infrastructure. This means that starting about 14:00 today you won’t be able to make changes to your personal website.
March 19th, 2010
09:04 – Printing issues this morning
One of our virtualisation hosts went down this morning at 2 am. All services came up after reboot. During the downtime, printing didn’t work. The issue has been resolved now.
March 9th, 2010
09:47 – All good
March 3rd, 2010
14:58 – test
14:20 – New Status messages
This is the new GIUZ status page where we will be publishing service interruptions and service announcements.
What's new?
May 7th, 2012
It's all about security, stupid!
Last Friday afternoon, May 4 2012, we had the most severe outage of the GIUZ IT infrastructure since the disastrous LDAP crash of August 10 2009. Once again, the unavailability of both our LDAP servers grounded almost every IT service at the department. But this time the root cause was not a software bug (as it was back in 2009), but human error.
The disaster began to unfold 10 years ago, precisely on May 7 2002, 11:13:32 GMT. Back then, GIUZ IT established its own so-called “Certificate Authority” (short: GIUZ CA) to issue its own SSL certificates in order to secure the retrieval of emails via the POP or IMAP protocol (and, more importantly, to encrypt the transferred passwords!). Over the years, the GIUZ CA was used to issue many more self-signed SSL encryption certificates to protect other parts of the infrastructure, among them the two LDAP directory servers we set up back in 2009.
Every SSL certificate has an expiry date, and so did our two LDAP SSL certs: some day in the year 2019. While we were monitoring the expiry of our SSL certificates, we failed to monitor the expiry of the underlying CA certificate. The SSL system is hierarchical in nature, with a root certificate at the top (the GIUZ CA certificate in our case), and a certificate is only accepted as long as every certificate above it is still valid. Back in 2002 this certificate was established with a validity of 10 years, resulting in an expiry date of … Friday May 4 2012, 13 minutes past 1pm (and 32 seconds, local time).
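To illustrate the monitoring gap described above: the date that matters is the earliest expiry anywhere in the chain, not only that of the server certificate. The following Python sketch is a simplified illustration, not the GIUZ monitoring. The certificate names are hypothetical, the 2019 date is a placeholder (the text only says "some day in the year 2019"), and the CA expiry in GMT is derived from the local time given above, assuming CEST (UTC+2).

```python
import ssl
from datetime import datetime, timezone


def parse_not_after(openssl_time):
    """Parse an OpenSSL-style timestamp such as 'May  4 11:13:32 2012 GMT'."""
    seconds = ssl.cert_time_to_seconds(openssl_time)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def effective_expiry(chain):
    """Return (name, expiry) of the chain member that expires first.

    `chain` maps a certificate name to its notAfter string.
    """
    name = min(chain, key=lambda n: parse_not_after(chain[n]))
    return name, parse_not_after(chain[name])


# Hypothetical chain; names and the 2019 date are placeholders.
chain = {
    "ldap-server-cert": "Jan  1 00:00:00 2019 GMT",
    "giuz-ca": "May  4 11:13:32 2012 GMT",
}

name, expires = effective_expiry(chain)
print(f"chain stops validating when {name} expires: {expires.isoformat()}")
```

A check that only looks at the leaf certificate would have reported years of remaining validity, while the chain as a whole expired in 2012.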
The expiry of the certificate resulted in a full stop of our infrastructure. The LDAP servers stopped answering authentication requests, so no one could log in anywhere. The NFS servers serving files couldn’t fetch the mount tables (stored in LDAP) and stopped serving files. And so on and so forth.
After determining the root cause, we were confident we could fix the problem within 30 to 60 minutes: all we had to do was create new self-signed certificates for our LDAP servers (by now we have automated this task, spawning ad-hoc CAs as needed) and install them. Unfortunately, this assumption could not have been more wrong. Looking back, it was pretty naive to think one could easily replace a 10-year-old technology (our root CA) with the current state of the art. For the technically interested: don’t expect your CA created with gnutls to behave like one created with a ten-year-old OpenSSL. Some hours later we could finally produce valid SSL CA certificates that our Sun DSEE server would happily accept. LDAP service was restored around 6pm last Friday.
Unfortunately, this didn’t help much. Only the Linux servers could query the LDAP servers by then. All the Solaris hosts (meaning the core of our infrastructure) needed to have the new CA certificate locally on disk before being able to talk to the LDAP servers! But since the file servers were not working properly, our configuration management system wasn’t either, meaning we had to distribute the certificates manually to all our Solaris hosts.
Now Murphy’s law struck, too: about three weeks earlier we had ordered a replacement for the defective power supply of our KVM switch. It was out of stock and had not yet arrived. We had not used the KVM switch for years, but still had some elderly Sun hardware running that did not provide a good remote console. That shouldn’t have been a problem, until it became one last Friday …
Sparing you the details, suffice it to say: it took us literally hours to deploy the certificates on all these servers, involving various reboots and single-user boots of large parts of the infrastructure. Finally, at 1am early Saturday morning, all the services were up and running again.
BTW, in case you wonder about the expiry date of our new LDAP CA certificate: May 2 17:06:13 2022 GMT. It’s a Monday.
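The dates in this post can be checked with a little date arithmetic. The sketch below assumes that the "10 years" of validity were issued as 3650 days (a common way to request ten years, but an assumption here) and that local time was CEST (UTC+2). Under these assumptions it reproduces the Friday expiry and confirms the Monday.

```python
from datetime import datetime, timedelta, timezone

# Creation time of the old GIUZ CA certificate, as given in the post.
created = datetime(2002, 5, 7, 11, 13, 32, tzinfo=timezone.utc)

# Assumption: "10 years" was issued as 3650 days, which ignores leap days.
old_expiry = created + timedelta(days=3650)

# Assumption: Zurich local time in May is CEST, i.e. UTC+2.
cest = timezone(timedelta(hours=2), "CEST")
old_expiry_local = old_expiry.astimezone(cest)

# Expiry of the new LDAP CA certificate, as given in the post.
new_expiry = datetime(2022, 5, 2, 17, 6, 13, tzinfo=timezone.utc)

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

print(old_expiry_local.isoformat(), WEEKDAYS[old_expiry_local.weekday()])
print(new_expiry.isoformat(), WEEKDAYS[new_expiry.weekday()])
```

The three leap days between 2002 and 2012 explain why the old certificate ran out on May 4 rather than on the May 7 anniversary.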
March 14th, 2012
New storage servers online
Today we’re putting new storage servers into production, adding about 40 TiB to the network. The new servers have a custom-built architecture and management tools, streamlining operations and management. Failover should be quicker as well, as the new setup is highly redundant.
March 21st, 2011
Licensing changes
We no longer offer work@home Microsoft licenses for private use. Such licenses must be bought directly from Microsoft using the “Select Plus Student Option”. Please see the wiki page.
December 20th, 2010
New Box System
We have re-implemented the box system to work smoothly with Mac OS X, Windows and Linux/Unix. The box system can be used to send files to a co-worker and replaces the old scratch space. See https://it.wiki.geo.uzh.ch/Boxspace for further instructions on how to set it up and use it.
December 8th, 2010
Atom Feed for announcements
Starting today, we also provide an Atom feed of this news. Point your feed reader at http://it.geo.uzh.ch/atom.xml.
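For those who would rather process the feed with a script than with a feed reader, here is a minimal Python sketch that lists the entries of an Atom document. The sample feed is hand-written for this example (its timestamps are placeholders) and is not a copy of the real feed. Only the Atom namespace is standard.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"  # namespace of the Atom 1.0 format


def list_entries(feed_xml):
    """Return (updated, title) pairs for every <entry> in an Atom document."""
    root = ET.fromstring(feed_xml)
    return [
        (entry.findtext(ATOM_NS + "updated"), entry.findtext(ATOM_NS + "title"))
        for entry in root.iter(ATOM_NS + "entry")
    ]


# Hand-written sample; the real feed may carry more elements per entry.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>GIUZ Status</title>
  <entry>
    <title>Atom Feed for announcements</title>
    <updated>2010-12-08T00:00:00Z</updated>
  </entry>
  <entry>
    <title>DNS update: geo.unizh.ch will not work anymore</title>
    <updated>2010-12-07T00:00:00Z</updated>
  </entry>
</feed>"""

for updated, title in list_entries(SAMPLE_FEED):
    print(updated, title)
```

To use it on the live feed, the downloaded XML text would be passed to `list_entries` instead of the sample.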
December 7th, 2010
DNS update: geo.unizh.ch will not work anymore
We are in the process of replacing our DNS servers. As a consequence, all references to ‘geo.unizh.ch’ will stop working. You should replace them with the more modern ‘geo.uzh.ch’.
If you need help with reconfiguring your mail client, please refer to the IT wiki.
November 6th, 2010
New Mailserver Infrastructure
A great deal of the mailserver infrastructure was replaced last night. In theory, you shouldn’t have noticed anything about this move. Nevertheless, if you experience email problems, please contact the IT helpdesk or send an email to computer-help (if you still can ;-), so we can sort it out.
September 15th, 2010
POP/POP3 only with SSL
We’ve upgraded our old POP/POP3 server. To enforce a higher security level, the new server only allows encrypted connections (SSL) on port 995.
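Most mail clients only need the "use SSL" option and port 995 set. For scripts, a connection could be opened as in the following Python sketch. The function name and the optional `cafile` parameter are choices made for this example, and the host name in the usage note is a placeholder, not the actual GIUZ mail server.

```python
import poplib
import ssl


def open_pop3s(host, port=poplib.POP3_SSL_PORT, cafile=None, timeout=10.0):
    """Open an SSL-encrypted POP3 connection (port 995 by default).

    `cafile` may point to a CA certificate file; this is needed when the
    server certificate was issued by a private CA that the system does not
    trust by default.
    """
    context = ssl.create_default_context(cafile=cafile)
    return poplib.POP3_SSL(host, port, context=context, timeout=timeout)
```

A call such as `open_pop3s("pop.example.org")` returns a `poplib.POP3_SSL` object on which `user()`, `pass_()` and `stat()` can be used as usual. An unencrypted connection on port 110 would be refused by a server configured as described above.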
April 6th, 2010
Post Mortem on the incident this morning
We’ve conducted a small investigation into why the server crashed this morning and found that it had a faulty CPU. We will replace the defective parts and put the server back into operation as soon as possible.
March 2nd, 2010
Virtual Server Infrastructure operational
As of today, our virtual server infrastructure is operational. Behind the scenes, we’re using KVM to provide dedicated Linux hosts to the GIUZ.
February 1st, 2010
Web server migration & Microsites
We’re in the process of migrating all old web servers to new (and more sustainable) infrastructure. This affects the hosts cheese and cyclon. As soon as we finish the migration, we will turn those off.
Microsites
As a first benefit of this migration, we’re launching a new service called microsites. A microsite is simply a web space that can be managed by a group of people. You can get more information here. (needs login)
Web service hiccups
We’ve had a few hiccups during the transition to the new system. This should be over now and the infrastructure should be stable again.