Overview

Normally, the DirXML modules (the vrdim modules) load whenever they are installed and ndsd is started.
We have observed issues where a problem in either a DirXML driver or the installation prevented the NIDM product modules (the vrdim modules) from loading properly.
You may need to manually unload and load vrdim.
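For reference, a minimal sketch of the manual unload and load, assuming ndstrace accepts a single command through its -c option (the same way the '-c threads' call mentioned further down is made). The function name reload_vrdim, the NDSTRACE override, and the 5 second pause are my own illustrative choices, not anything Novell documents.

```shell
#!/bin/sh
# Sketch: unload and reload the DirXML (vrdim) module without restarting ndsd.
# NDSTRACE may be pointed at a stub for a dry run; it defaults to ndstrace.
NDSTRACE="${NDSTRACE:-ndstrace}"

reload_vrdim() {
    # Assumption: ndstrace -c "<command>" hands one command to the running ndsd.
    "$NDSTRACE" -c "unload vrdim" || return 1
    # Assumption: a short pause gives the unload time to finish.
    # The 5 second default is arbitrary; set RELOAD_PAUSE to change it.
    sleep "${RELOAD_PAUSE:-5}"
    "$NDSTRACE" -c "load vrdim"
}

# Usage (as the user that runs ndsd):
#   reload_vrdim
```

If the unload fails, the function stops there rather than attempting the load, so the failure is visible in the exit status.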
Server ndstest1 provides LDAP connectivity for SiteMinder in QA, and it also runs two DirXML eDirectory to eDirectory drivers. Over the past few months, the DirXML drivers on this server have occasionally stopped functioning; the only way to start them back up is to restart ndsd, or to unload and load vrdim.
Because of this, I've been thinking we are running on the edge of a problem similar to what we've seen in production -- which is also why I'm such a stickler for keeping SiteMinder and DirXML separated in production.
Last night (4/13), the ndstest1 server had the following changes installed:
- eDirectory 126.96.36.199 TID-2970239
- Applying eDir Authen Server Modules 2.3.6 (NMAS) - TID-2970663
- NICI 2.6.5 upgrade
- The patch Novell provided to turn off the 3 second delay for failed logins was also installed:
  -rwxr-xr-x 1 root other 5293440 Mar 29 10:54 /usr/lib/nds-modules/libnds.so.1.0.0
- Implemented a script that summarizes and logs ndstrace '-c threads' and ndstrace '-c connections' every 10 minutes (ndsthrds.pl).
Since we made all of those changes, the DirXML driver problems have increased to the point that the drivers are nearly non-functional.
Check out the /var/nds/ndsd.log file entries from ndstest1 this week:
Note the large number of "Ignoring unexpected signal 13" messages. I've asked about these before, but nobody ever came up with any explanation for them, and the Great Oracle (Google) points me in the direction that the message is related to a missing file or directory. (Hmm.... file descriptors.)
There are groupings of errors at regular intervals... at 05, 15, 25, 35, 45 and 55 minutes past each hour. Those correlate to when the ndsthrds.pl script is being run. But that isn't the only time the error appears... and if you look early in the log, quite a few of the messages happened long before the patches... just not as frequently.
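One way to check that correlation is to count the signal 13 messages by minute of the hour. The sketch below assumes each log line carries an HH:MM:SS timestamp somewhere in it; I haven't confirmed the exact ndsd.log layout, so the pattern may need adjusting. The function name signal13_by_minute is made up for illustration.

```shell
#!/bin/sh
# Sketch: count "Ignoring unexpected signal 13" lines per minute-of-hour.
# Assumption: every matching line contains an HH:MM:SS timestamp; the first
# one found on the line is used. Lines without a timestamp are skipped.
# Note: on Solaris the default awk may lack match(); use nawk there.
signal13_by_minute() {
    grep 'Ignoring unexpected signal 13' |
    awk '{
        if (match($0, /[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/)) {
            minute = substr($0, RSTART + 3, 2)   # the MM part of HH:MM:SS
            count[minute]++
        }
    }
    END { for (m in count) print m, count[m] }' |
    sort
}

# Usage:
#   signal13_by_minute < /var/nds/ndsd.log
```

Each output line is a minute value followed by the number of messages seen in that minute across the whole log, so spikes at 05, 15, 25 and so on would stand out against the background.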
These messages appear on the DirXML driverset status log:
Error 1: Thu Apr 14 07:04:22 EDT 2005 Fatal No description provided. Code(-9057) Unable to read persistent writeback queue: java.io.FileNotFoundException: dx32882 (No such file or directory)
Error 2: Thu Apr 14 07:05:44 EDT 2005 Fatal No description provided. Code(-9057) Unable to read persistent writeback queue: java.io.FileNotFoundException: dx32882 (No such file or directory)
Error 3: Thu Apr 14 09:25:42 EDT 2005 Fatal No description provided. Code(-9057) Unable to read persistent writeback queue: java.io.FileNotFoundException: dx32882 (No such file or directory)
Found this TID... but there is only one *.wbq file; it is 0 bytes and hasn't been touched for quite a long time:
-rw-rw-rw- 1 root other 0 Aug 12 2004 /var/nds/dib/32882.wbq
The really unexpected thing is that, with the ndsthrds.pl script turned off, DirXML has now run for several hours without further problems. From that I can only conclude that the ndstrace commands it calls are somehow exacerbating an underlying problem... maybe the file descriptor thing. From ndstest1:
pfiles `cat /var/nds/ndsd.pid` | grep ": S_IF" | wc -l
339
From ndstest2 (provides LDAP for SiteMinder only, no DirXML):
pfiles `cat /var/nds/ndsd.pid` | grep ": S_IF" | wc -l
140
UPDATE: The file descriptors, if related, may not have been due to a Solaris bug. It seems that some version of our eDirectory build placed a limit of 1024 file descriptors on the shell that launches eDirectory. This limit has been changed to 8192 in build versions >= 20050802.
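To keep an eye on how close ndsd gets to that limit, something like the following could be fed the count from the pfiles pipeline above. It is only a sketch: the function name fd_headroom and the 80% warning threshold are arbitrary choices of mine, and a single sample says nothing about peak usage.

```shell
#!/bin/sh
# Sketch: report how much of a file descriptor limit is in use.
# fd_headroom <open_count> <limit>
# Prints a one-line summary and returns non-zero once usage reaches
# WARN_PCT percent (default 80, an arbitrary threshold).
fd_headroom() {
    open_count=$1
    limit=$2
    pct=$(( open_count * 100 / limit ))   # integer percentage, rounded down
    echo "$open_count of $limit descriptors in use (${pct}%)"
    [ "$pct" -lt "${WARN_PCT:-80}" ]
}

# Usage with the ndstest1 sample and the old 1024 limit:
#   fd_headroom 339 1024
```

With the ndstest1 sample this prints "339 of 1024 descriptors in use (33%)", so at the moment it was taken the process was well under the old limit; the count would need to be sampled while the drivers are failing to say more.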