Wednesday, 9 June 2010

Managing RAID with VMware ESXi 4 and Fujitsu Servers

Managing RAID is an integral part of any server installation. On a Fujitsu server with a typical Windows or Linux installation there is no virtualisation layer between the operating system and the hardware, so the RAID subsystem can be managed with ServerView RAID Manager, which is provided by Fujitsu.

If, however, you install the server with VMware ESXi, the guest operating systems do not have direct access to the hardware, so ServerView RAID Manager will not correctly display the RAID subsystem after a default install. This was seen as a big negative for using VMware ESXi 4, as ServerView RAID Manager can be critical when troubleshooting, planning or fixing anything to do with RAID.

After trying to find a way around this situation I stumbled across a brief mention of a way to connect to the CIM API that is provided by VMware. This allows developers to create software that can talk to the hardware of the server via a CIM broker. ServerView RAID can take advantage of this, and a VMware server can be added via the amCLI command, as shown below (Updated: This works on Windows and Linux):

amCLI -e 21/0 add_server name=1.1.1.1 port=5989 username=root password=*****


Change the server name to the IP address or DNS name of your server, and the username and password to those matching your VMware installation.

Confirm the addition by running:
amCLI -e 21/0 show_server_list


Delete the server by running the following, changing the name as appropriate:
amCLI -e 21/0 delete_server name=1.1.1.1
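
If you have several ESXi hosts to register, the three amCLI calls above can be wrapped in a small script. This is only a minimal Python sketch: it assumes amCLI is on the PATH and accepts exactly the arguments shown above, and the helper names are hypothetical ones I made up for illustration.

```python
import subprocess

AMCLI_TARGET = "21/0"  # the -e value used in the commands above


def add_server_cmd(name, username, password, port=5989):
    """Build the amCLI argument list that registers an ESXi host."""
    return ["amCLI", "-e", AMCLI_TARGET, "add_server",
            f"name={name}", f"port={port}",
            f"username={username}", f"password={password}"]


def show_server_list_cmd():
    """Build the amCLI argument list that lists the registered hosts."""
    return ["amCLI", "-e", AMCLI_TARGET, "show_server_list"]


def delete_server_cmd(name):
    """Build the amCLI argument list that removes a registered host."""
    return ["amCLI", "-e", AMCLI_TARGET, "delete_server", f"name={name}"]


def run(cmd):
    """Run one amCLI command and return its exit code (assumes amCLI is installed)."""
    return subprocess.run(cmd, check=False).returncode
```

For example, run(add_server_cmd("1.1.1.1", "root", "secret")) followed by run(show_server_list_cmd()) mirrors the first two commands above. Passing the arguments as a list avoids shell quoting problems with unusual passwords.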


Log into the ServerView RAID web interface as normal (https://IP Address:3173) using the superuser name and password for the OS. You should now see the RAID adapter from the VMware ESXi 4 server. This has been tested on a TX150 S6 and a TX200 S5 with an LSI1078 RAID card.

If the adapter does not appear, make sure there is a hosts file entry for the IP address of the guest OS that is running ServerView RAID.

Monday, 10 May 2010

Windows Server 2003 R2 Terminal Services and TWAIN Drivers.

I had an interesting problem the other day which is definitely worth a post. We have a customer who runs entirely on thin clients but needed a document scanner to improve their business. Unfortunately, due to budget constraints (as usual), buying a PC was not an option, so I was left with the task of configuring the Fujitsu 5120C scanner so the customer could run it at the server console.

I downloaded the latest Fujitsu TWAIN driver, which lists Server 2003 as a supported operating system. The installation went smoothly and I could successfully use the scanner via the Scanners and Cameras option in the Control Panel. However, we needed more functionality than this, so I installed the ScandAll21 software that comes with the scanner. After starting ScandAll21 I was not able to select the scanner as the source. I was running as an Administrator so I didn't suspect permissions.

Some Googling later I came across a reference which then led me to Microsoft article KB186499.

As explained in the Microsoft article, I created a new registry key with the name of the ScandAll21 executable (FIMAGE), then created a new DWORD value called Flags and gave it a hexadecimal value of 40c. Once I had completed these steps I was able to scan successfully, which made both me and the customer very happy.
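
To make the registry change repeatable, it can be written out as a .reg file and imported. The Python sketch below only generates the file text. The parent key path is my recollection of where KB186499 places per-application compatibility flags and is an assumption that should be checked against the article before importing anything; build_reg_file is a hypothetical helper name.

```python
# ASSUMPTION: parent key as I recall it from KB186499; verify before use.
COMPAT_KEY = (r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion"
              r"\Terminal Server\Compatibility\Applications")


def build_reg_file(exe_name, flags):
    """Return .reg file text that sets the Flags DWORD for one executable."""
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{COMPAT_KEY}\\{exe_name}]\n"
        f'"Flags"=dword:{flags:08x}\n'
    )


if __name__ == "__main__":
    # FIMAGE and 0x40c are the key name and value used in this post.
    print(build_reg_file("FIMAGE", 0x40C))
```

Saving the output as something like scandall.reg and double-clicking it (or importing it from regedit) should have the same effect as creating the key by hand.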

Saturday, 17 April 2010

Windows 2003 R2 crashing every two days - Event ID 2019 The server was unable to allocate from the system NonPaged pool because the pool was empty.

This was an interesting problem I had recently and is well worthy of a blog post! I was dealing with a server that would stop functioning on the network roughly every two days. There was nothing extraordinary about this server and we have quite a few with very similar configurations. The customer would reboot the server to get it functioning again and we would log on remotely to try to determine the cause. After a few crashes we noticed it was always preceded by Event ID 2019: The server was unable to allocate from the system NonPaged pool because the pool was empty.

I started watching the server's non-paged pool usage with Task Manager and Poolmon but was not able to determine what was causing the problem. At this stage I still wasn't sure whether it was a hardware or software issue, so I decided to restore the server onto one of ours in the office and let it run for two days. This was over the bank holiday weekend and, lo and behold, the restored server experienced the same issue. This was great news because it ruled out the hardware and gave me the opportunity to do further analysis. I ran Process Explorer, Task Manager and Poolmon but still could not determine the cause (I'm not sure I was using Poolmon correctly). I have had experience with analysing minidumps, so I thought it would be a good idea to get a full memory dump, but I needed a way to create a BSOD. In the back of my head I was thinking Sysinternals, and I found a reference to NotMyFault.exe, which has a /crash switch. I was able to use this to create a BSOD and get a much needed memory dump. You can also use Ctrl+ScrlLck+ScrlLck, but this must first be enabled in the registry.

Opening this memory dump (C:\WINDOWS\MEMORY.DMP) in Debugging Tools for Windows allowed me to do some further analysis. Running the !vm command gave me the following information:-

1: kd> !vm

*** Virtual Memory Usage ***
Physical Memory: 524002 ( 2096008 Kb)
Page File: \??\C:\pagefile.sys
Current: 2095104 Kb Free Space: 1766344 Kb
Minimum: 2095104 Kb Maximum: 4190208 Kb
Available Pages: 178832 ( 715328 Kb)
ResAvail Pages: 439715 ( 1758860 Kb)
Locked IO Pages: 3528 ( 14112 Kb)
Free System PTEs: 234209 ( 936836 Kb)
Free NP PTEs: 319 ( 1276 Kb)
Free Special NP: 0 ( 0 Kb)
Modified Pages: 229 ( 916 Kb)
Modified PF Pages: 229 ( 916 Kb)
NonPagedPool Usage: 64932 ( 259728 Kb)
NonPagedPool Max: 65536 ( 262144 Kb)

This shows that my NonPagedPool Usage is very close to the NonPagedPool Max value. I then ran !poolused 2 which gave me the following:-

kd> !poolused 2
Sorting by NonPaged Pool Consumed

Pool Used:
            NonPaged               Paged
Tag      Allocs      Used      Allocs      Used
AvgU     401672  86761152           0         0   UNKNOWN pooltag 'AvgU', please update pooltag.txt

Although AvgU is an unknown pooltag, it was logical to guess that it was related to the antivirus product AVG 9, and a reference I found online backed this up. Uninstalling AVG from our test server led to the problem disappearing.
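
The numbers in the two debugger outputs are worth a quick sanity check. The Python sketch below only does arithmetic on the figures quoted above; the 4 Kb page size is inferred from the !vm output itself (64932 pages shown as 259728 Kb), and the variable names are mine.

```python
PAGE_KB = 4  # inferred from the !vm output: 64932 pages = 259728 Kb

nonpaged_used_pages = 64932   # NonPagedPool Usage
nonpaged_max_pages = 65536    # NonPagedPool Max
avgu_allocs = 401672          # AvgU allocation count from !poolused 2
avgu_bytes = 86761152         # AvgU nonpaged bytes from !poolused 2

used_kb = nonpaged_used_pages * PAGE_KB
pool_used_fraction = nonpaged_used_pages / nonpaged_max_pages
avgu_share = avgu_bytes / (used_kb * 1024)
avg_alloc_bytes = avgu_bytes / avgu_allocs

print(f"NonPaged pool in use: {used_kb} Kb ({pool_used_fraction:.1%} of max)")
print(f"AvgU share of used pool: {avgu_share:.1%}")
print(f"Average AvgU allocation: {avg_alloc_bytes:.0f} bytes")
# prints:
# NonPaged pool in use: 259728 Kb (99.1% of max)
# AvgU share of used pool: 32.6%
# Average AvgU allocation: 216 bytes
```

So the pool was about 99% full, and the single AvgU tag accounted for roughly a third of it, spread across a very large number of small, identically sized allocations.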

The customer had purchased and installed AVG 9 themselves, so we told them to log a support call with AVG to get a resolution.

Getting to the root cause of the problem in this case was very rewarding, and it highlighted the importance of being able to restore the machine elsewhere, both to rule out hardware and to allow further diagnosis.

Wednesday, 7 April 2010

Format an RDX cartridge from the command line / scheduled task

Due to the bugginess of Acronis 10.0.11345 I had a situation where I needed to format a removable storage device before a backup plan was scheduled to run. After a quick play with the format command I couldn't get it to run without user interaction. After a bit of research I came across the diskpart command, which can be scripted using the /s switch. Create a file containing the commands you would like to run and save it as format.txt, e.g.:

Select Volume E:
format FS=NTFS QUICK NOERR OVERRIDE


It goes without saying to change the volume letter to one that matches your configuration. All you have to do is run diskpart /s format.txt and the specified volume letter will be formatted.
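
If the drive letter changes between machines, the script file can be generated rather than edited by hand. This is a minimal Python sketch that writes the same two diskpart commands shown above and then calls diskpart /s; the function names and the refusal to touch C: are my own hypothetical additions, and it needs to run elevated on Windows.

```python
import subprocess
from pathlib import Path


def build_diskpart_script(volume_letter):
    """Return the two-line diskpart script from this post for one volume."""
    letter = volume_letter.rstrip(":").upper()
    if len(letter) != 1 or not letter.isalpha():
        raise ValueError(f"not a drive letter: {volume_letter!r}")
    if letter == "C":
        # Hypothetical safety net: never format the usual system volume.
        raise ValueError("refusing to format C:")
    return f"Select Volume {letter}:\nformat FS=NTFS QUICK NOERR OVERRIDE\n"


def format_volume(volume_letter, script_path="format.txt"):
    """Write the script and run 'diskpart /s <script>'; returns the exit code."""
    Path(script_path).write_text(build_diskpart_script(volume_letter))
    return subprocess.run(["diskpart", "/s", script_path], check=False).returncode
```

Calling format_volume("E") from a scheduled task just before the backup plan runs would match what I did by hand. This destroys everything on the volume, so test build_diskpart_script on its own first.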

This is not limited to RDX devices and so may come in handy for formatting other devices.

Tuesday, 16 March 2010

Is being helpful more trouble than it's worth?

After a particularly heavy day at work I got thinking about workloads and the time it was taking me to do some tasks in comparison to my colleagues. I work in a team of 7 where there are varying degrees of knowledge. A working day is pretty flexible and there are no specific tasks, apart from one day a week where an individual aids the support department with operating system and hardware calls. However, different individuals treat some of these problems with different attitudes, and whilst I may spend up to 30 minutes (or a lot more) trying to find the cause of a problem, others may just reboot the server and close the call. Obviously, this can lead to a problem (for me), especially if the problem recurs on a day when I'm assisting support. I've lost count of the number of memory leaks I've tracked down or configuration changes I've made after spending the time to understand and diagnose a problem that otherwise seems to have been bouncing around support for days or maybe months.

It then got me thinking about how often my phone rings during the day. Because I actually spend time diagnosing problems, I have a greater understanding of how things work and am therefore better placed to answer specific configuration or scalability questions. I also have a good memory, which means that throughout the day I am asked what the IP address of this is, or how that is set up, etc. Answering these questions further cements the answers into memory and the circle continues!

I have lots of ideas of how to improve things but a lot of these need time to be researched and implemented properly. However, I feel as though I spend the majority of my time helping others with their problems or answering questions to things that have been said and documented hundreds of times before!

I love working in IT but some days I feel like I haven't done anything because I have spent more time helping others than doing any tangible work myself. It does sometimes make you think, what is the incentive to be helpful and do a good job? The people at the top are blind to this because it's not quantifiable, i.e. spending 60 minutes now to understand something can save you a lot of time in the future.

It's always easier to ask the guy with the good memory, than to spend some time finding out something for yourself. I'm sure this is true for almost every profession.

Monday, 22 February 2010

Complete server restore to different hardware using BackupExec 11d SP5 and Windows 2003 R2 SP2

I haven't mentioned it earlier, but I work for a relatively small IT company, so a lot of things are done on a finite budget with a finite amount of time. This means that things are not always documented and servers are not always specced as well as they should be. It also means things like disaster recovery are just, well, overlooked. The powers that be don't see any £s from disaster recovery planning, therefore it just never really happens.

When a single machine acting as Terminal Server, Mail Server, Database Server and Print Server failed to boot recently it was up to me to try to pick up the pieces. It was pretty clear from the off that things were not in a good state. Although the server was RAID 5 there was no MBR, and after booting into the Windows Recovery Console and running fixmbr there was still no joy. I knew the previous night's backup was good and decided to take the plunge and perform a complete server restore.

The server was a Fujitsu Econel 200 and we had a spare Fujitsu TX150 S5, clearly not the same hardware. The restore was pretty tedious so I decided to blog about it here.

Here is what I did. Please note, I do not take responsibility for anything that goes wrong after following these steps.

  • Partition the HDD the same and install the same version of Windows (R2 etc) and update to the same service pack. Also give the server the same name because when you restore with BackupExec later it looks at the name of the server. Also if possible write protect the backup tape/cartridge, just in case!
  • Install BackupExec (11d in this case) but make sure you CHANGE the install path. Choose something like C:\Program Files\SymantecTemp
  • I updated BackupExec to the latest version (SP5 at time of writing), just in case there were any restore bugs that may have bitten me.
  • The backup device was a Tandberg RDX and in BackupExec you have to create a Backup to Disk device. I recreated this and pointed it at my B2D folder.
  • Once you have added the B2D folder, perform an inventory so BackupExec queries the backups.
  • By default BackupExec splits the "Media" into 1GB files and so after the inventory I was left with lots of "Media" in my B2D media set.
  • This part was very tedious. I couldn't find a way to associate these individual "Media" with a specific backup so I had to select them all, right click and select Catalog Media. I had to wait quite a while for BackupExec to go through each one and work its magic. Don't be too alarmed if a lot of them fail, they did for me.
  • I then selected New restore job using wizard, clicked Next and looked through each Media Label until I found the backup that I wanted to restore. Be aware that different drive letters and the System State may appear in Media with different labels.
  • Click Next and if you are not sure of the logon credentials for restoring the data you can test them on this page. If you're confident click Next, give the restore job a name, select the relevant device, select Overwrite the file on disk and click finish to run the job now.
  • Allow the restore job to run (restore time varies greatly depending on the amount of data and type of backup device). When the job is complete BackupExec will prompt to restart the machine.
  • If you're lucky the server will boot. For me it didn't. I was greeted with :-
Windows could not start because the following file is missing or corrupt:
<Windows root>\system32\ntoskrnl.exe
Please reinstall a copy of the above file
  • My first thought was bugger, the hardware must be too different. Then I thought no, too big a hardware difference is more likely to manifest as a BSOD. A little bit of research led me to believe the boot.ini file must be different between the servers. To get around this, you will need your Windows 2003 installation media. Boot from the CD and after the drivers have loaded press R to enter the Recovery Console.
  • If the server is relatively recent it is likely that the Windows CD does not have the correct drivers for your RAID/SATA controller. If this is the case download and install nLite. This app is superb and, amongst other things, allows you to slipstream service packs and drivers into a Windows install. Copy your Windows installation CD to a directory on your machine, open nLite and point it to that copy. Follow the wizard, point nLite to the drivers for your RAID controller etc. and then either create a new ISO or burn the modified OS directly. It is so straightforward that I'm not going to bother describing the process here.
  • Log on to your Windows installation using the admin password from the original server. Then run bootcfg /rebuild. This command took a few minutes to finish but when complete it should find your Windows installation (probably C:\WINDOWS) and ask if you want to add this installation to the boot list. Press Y and Enter. For the load identifier enter something like "Windows 2003 Standard Edition R2" and for OS Load Options enter the default "/fastdetect".
  • Type exit; the server should reboot and hopefully load into Windows. How Windows behaves now is highly dependent on how different the hardware is from the original installation. I was lucky because the server booted up, albeit very slowly. I disabled some hardware specific services, installed a chipset driver, checked the Event Viewer and everything seemed OK. After another reboot the server was as good as the original!
  • If you want to free up some hard disk space you are free to delete C:\Program Files\SymantecTemp that we created earlier. The restored OS knows nothing about this install because we have restored the System State.
Hopefully, if you are reading this I have helped you restore a stricken server. If not maybe you have learnt something new.

I am aware this is not an exhaustive step by step guide to restoring a server and I'm sure this procedure could fall down in lots of other places. If and when I experience these different scenarios it is likely that I will update this post. Sometime in the future this may become a very useful resource.

Regards

Monday, 15 February 2010

Enable local relay on a Microsoft Exchange 2007 Server

We have an application that sends email by relaying through an SMTP server and unfortunately it's quite basic, so you cannot specify any logon credentials. Therefore I needed to allow the application to relay through a locally running Microsoft Exchange Server. This was the first time I'd used Microsoft Exchange 2007, but I thought this should be easy as I knew how to do it on Microsoft Exchange 2003. How wrong was I! This is when working in IT becomes really frustrating, when things appear to be changed just for the sake of it with no apparent improvement in functionality. An hour or so later I had the solution, which was a lot more long-winded than I was expecting.

First open the Exchange Management Console, expand server configuration and click on hub transport. On the right hand side click New Receive Connector and a New SMTP Receive Connector wizard will open. Give the connector a name and leave Select the intended use for this Receive Connector set to Custom. If the server is multi-homed set the next page so the connector is only listening on the LAN adapter. The next part is important because you want to restrict relaying as much as possible. In this case it is a single IP address so the Start and End IP address will be the same. 127.0.0.1 didn't appear to work for me, so I used the LAN IP address of the server. Click Next and then New to create the new connector.

We now need to configure authentication parameters for this connector. Highlight the newly created connector and click on Properties. Leave the Authentication tab at its defaults (Transport Layer Security ticked), then click on the Permission Groups tab and ensure only Anonymous users is ticked.

Anonymous users are not granted the relay permission by default. Run the following command in the Exchange Shell but replace *NAME* with the name of the Receive Connector created earlier.

Get-ReceiveConnector "*NAME*" | Add-ADPermission -User "NT AUTHORITY\ANONYMOUS LOGON" -ExtendedRights "ms-Exch-SMTP-Accept-Any-Recipient"

That's it, you should now be able to relay locally, which you can test using telnet. When server applications are supposed to be moving forward I find it absolutely incomprehensible that an admin needs to go through this process to configure relaying.
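
As an alternative to typing SMTP commands into telnet, the relay can be checked with a few lines of Python run on the server itself. This is a minimal sketch using the standard smtplib module with no authentication, just as the application would connect; the host, addresses and function names are placeholders I chose for illustration.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Build a small plain-text message for the relay test."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Relay test"
    msg.set_content("Test message sent through the new Receive Connector.")
    return msg


def send_via_relay(host, msg, port=25):
    """Send without logging in; raises an smtplib exception if relay is refused."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # Placeholder values: use the server's LAN IP and an external recipient.
    message = build_test_message("app@example.local", "someone@example.com")
    send_via_relay("192.0.2.10", message)
```

Using a recipient outside your own domain is the real test, because accepting mail for local recipients does not prove that relaying is allowed.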

Regards