Friday, October 13, 2017

openSUSE 42.2 to 42.3 upgrade booting into emergency mode

A few days ago I ran the normal "zypper up" on my 42.2 system and received a lovely update from nvidia for their G03 driver on the 4.4.27-2 kernel.  After this, I noticed vlc stopped working due to some plugin.  I rebooted my machine and X windows no longer started up.  The error was something about the nvidia driver not being able to load.  Many hours of trying to get that fixed, and trying alternatives like Nouveau, only got me to a graphical interface that couldn't seem to do more than an 800x600 resolution.  I noticed there was a new distro update with a different nvidia driver and kernel, so I gave that a try.

The upgrade went ok and, as usual (for the past 10 distro updates this machine has gone through), I expected some problems.  Usually it's my bootloader pointing to the wrong drive and not being able to start up, but not this time.  For the first time, I had my machine booting into emergency mode with no obvious errors in the output on the screen.  All file systems (root, /tmp, swap, windows partitions) were mounted as RW.  I had no networking, but otherwise could do pretty much anything from the command line except start a graphical interface.  Checking the output of the "journalctl -xb" command showed 2 errors for systemd targets, one for local-fs.target and one that I can't remember at the moment, but I think it was usb related or something else that looked like file systems.  After googling around, I couldn't find anything specific to this problem, though a few posts mentioned looking at fstab for partition issues.  I found one line in there which seemed suspicious given the errors in systemd:

usbfs               /proc/bus/usb       usbfs       auto,devmode=0666     0 0

I commented this out, rebooted and all was good for boot.  Later, installing the nvidia driver and downgrading the kernel to the version that the nvidia driver was created for resolved my original problem.  I spent some more time recreating my desktop environment for the new KDE version.  Altogether it was about a 12 hour recovery process.  So thanks Nvidia, you guys are awesome.
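For anyone wanting to check their own fstab before rebooting, here is a minimal Python sketch of the check I ended up doing by eye: it looks for active (uncommented) entries whose file system type is usbfs. The function name and the sample text are my own for illustration; only the usbfs line comes from my actual fstab, and the /dev/sda2 line is a made-up placeholder.

```python
# Sample fstab text. The usbfs entry is the one from this post; the root
# entry is a hypothetical placeholder so the example has more than one line.
SAMPLE_FSTAB = """\
/dev/sda2  /  ext4  defaults  1 1
usbfs  /proc/bus/usb  usbfs  auto,devmode=0666  0 0
# usbfs  /proc/bus/usb  usbfs  auto,devmode=0666  0 0
"""


def find_usbfs_entries(fstab_text):
    """Return (line_number, mount_point) for each active usbfs entry."""
    hits = []
    for number, line in enumerate(fstab_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and entries that are already commented out.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        # fstab columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "usbfs":
            hits.append((number, fields[1]))
    return hits


for number, mount_point in find_usbfs_entries(SAMPLE_FSTAB):
    print(f"line {number}: usbfs mounted at {mount_point}")
```

To run it against a real system, pass in the contents of /etc/fstab instead of the sample text. It only reports lines; commenting them out is still a manual step.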

Thursday, October 5, 2017

Master list of Domain Join errors

This article is a collection of error messages from the domain join process, the Windows event viewer and general observations.  All of these were tested on a Windows 2012R2 server joining a single 2012R2 domain controller over a simulated router.  The domain is testforest.local and the domain controller IP is 10.1.1.50.  Various ports were blocked for each test and the results are recorded below.



Main Error Message on client: "An Active Directory Domain Controller (AD DC) for the domain 'test.local' could not be contacted.  Ensure that the domain name is typed correctly"



Situation: No functional DNS.  That is, the client has no DNS server IPs configured, the configured IPs are not valid DNS servers, they are not accessible to this client, etc.

Sub Error Message when Details are expanded:

Note: This information is intended for a network administrator.  If you are not your network's administrator, notify the administrator that you received this information, which has been recorded in the file C:\Windows\debug\dcdiag.txt.

The following error occurred when DNS was queried for the service location (SRV) resource record used to locate an Active Directory Domain Controller (AD DC) for domain "testforest.local":

The error was: "This operation returned because the timeout period expired."
(error code 0x000005B4 ERROR_TIMEOUT)

The query was for the SRV record for _ldap._tcp.dc._msdcs.testforest.local

The DNS servers used by this computer for name resolution are not responding. This computer is configured to use DNS servers with the following IP addresses:

10.1.1.50

Verify that this computer is connected to the network, that these are the correct DNS server IP addresses, and that at least one of the DNS servers is running.

Steps to perform: Ensure the client is pointing to a valid DNS server that can resolve this Active Directory domain.  Using nslookup as a troubleshooting tool, or nltest /dnsgetdc:, will help test connectivity.
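As a rough illustration of the step above, this Python sketch builds the SRV record name that the error details quote and assembles the two lookup commands as argument lists. The function names are made up for this example, and nothing is executed; the commands are only printed so they can be pasted into a prompt on the client.

```python
def dc_locator_srv_name(domain):
    """Build the SRV record name shown in the join error details."""
    # Drop surrounding whitespace and any trailing dot from the domain.
    return "_ldap._tcp.dc._msdcs." + domain.strip().rstrip(".")


def lookup_commands(domain):
    """Return the nslookup and nltest invocations as argument lists."""
    clean = domain.strip().rstrip(".")
    return [
        ["nslookup", "-type=SRV", dc_locator_srv_name(clean)],
        ["nltest", "/dnsgetdc:" + clean],
    ]


for command in lookup_commands("testforest.local"):
    print(" ".join(command))
```

If the nslookup query times out or returns a server failure, the problem is on the DNS side; if it returns a domain controller name, the join failure is more likely a connectivity problem to that controller.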



Situation:  An RODC is accessible, however a RW domain controller is not accessible.  Your machine may be at a branch office with a local RODC that is handling DNS queries, however the link connecting back to a writable domain controller is down.  Additionally, this error could come up if the client has a functioning DNS server to query that does provide answers, but due to some connectivity problem, the machine can't connect to a domain controller.

Sub Error Message when Details are expanded:

DNS was successfully queried for the service location (SRV) resource record used to locate a domain controller for domain "testforest.local":

The query was for the SRV record _ldap._tcp.dc._msdcs.testforest.local

The following domain controllers were identified by the query:
forest1dc1.testforest.local

However no domain controllers could be contacted.



Situation: Functional DNS server, however the server doesn't cover this zone.  This means the DNS server is accessible and is providing answers, however it cannot resolve anything in this Active Directory zone.  It does not host the zone, it does not forward to another server that can answer, nor does it do any recursion to find the answer.


Sub Error Message when Details are expanded:
Note: This information is intended for a network administrator.  If you are not your network's administrator, notify the administrator that you received this information, which has been recorded in the file C:\Windows\debug\dcdiag.txt.

The following error occurred when DNS was queried for the service location (SRV) resource record used to locate an Active Directory Domain Controller (AD DC) for domain "testforest2.local":

The error was: "DNS server failure."
(error code 0x0000232A RCODE_SERVER_FAILURE)

The query was for the SRV record for _ldap._tcp.dc._msdcs.testforest2.local

Common causes of this error include the following:

- The DNS servers used by this computer contain incorrect root hints. This computer is configured to use DNS servers with the following IP addresses:

10.1.1.50

- One or more of the following zones contains incorrect delegation:

testforest2.local
local
. (the root zone)

Steps to perform: 1) Ensure that the domain name typed in on the client is the correct name. 2) Check the DNS infrastructure to find a server that is capable of resolving the Active Directory domain's DNS zone.



Situation: Port 389 blocked (LDAP udp/tcp) 

Sub Error Message when Details are expanded:

Note: This information is intended for a network administrator.  If you are not your network's administrator, notify the administrator that you received this information, which has been recorded in the file C:\Windows\debug\dcdiag.txt.

DNS was successfully queried for the service location (SRV) resource record used to locate a domain controller for domain "testforest.local":

The query was for the SRV record for _ldap._tcp.dc._msdcs.testforest.local

The following domain controllers were identified by the query:
forest1dc1.testforest.local




## This ends the above section, where the primary error message is that the domain controller could not be contacted.  In all of these cases, there will be no prompt for credentials.
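When DNS resolves but the join still reports that no domain controller could be contacted, a quick TCP probe from the client can help separate a blocked port from a name resolution problem. The sketch below is only an assumption of how one might script that check; the function name is hypothetical, and it tests TCP only, so it says nothing about UDP traffic on the same port.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For the lab in this article, calling tcp_port_open("10.1.1.50", 389) from the client would show whether LDAP over TCP is reachable on the domain controller. A firewall that drops packets will make the call wait for the full timeout before returning False, while a reject comes back right away.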


Error:  the RPC Server is unavailable

Situation: Port 135 blocked.

What is seen:  User is prompted for credentials.  Domain join is slow but works eventually with a welcome to the domain message.  After the success, it may pop up: "Changing the primary domain dns name of this computer to "" failed.  The name will remain "testforest.local"."




Error:  Extremely slow domain join and everything else (boot up, logon, etc)


Situation: Kerberos blocked (port 88 with DROP by firewall)

What is seen: Domain join still works but it is much slower, boot up is very slow, logons are very slow, GP update is very slow

Causes errors in system log
-lsasrv 6038  Microsoft Windows Server has detected NTLM authentication is presently being used between clients and this server....

-GroupPolicy 1055  Windows could not resolve the computer name

-TerminalServices-RemoteConnectionManager  1067   The RD Session Host server cannot register 'TERMSRV' Service Principal Name to be used for server authentication.  The following error occured: The system cannot contact a domain controller to service the authentication request.

-DNS Client Events 8019.  The system failed to register host (A or AAAA) resource records (RRs) for network adapter with settings:...

In the application log
-Winlogon 6006 GPClient errors


Situation: Kerberos blocked with ICMP reject (port unreachable).  Same slowness as with DROP.


Error:  none

Situation: port 137 is blocked

What is seen:  Prompts for credentials, no problem in domain join, works quickly, no issues.



Situation: port 445 blocked

What is seen: Domain join works quickly, boot speed is fine, and logon speed is fine.  Gpupdate seems to work over port 137/139 (further blocking these ports breaks group policy with eventID 1096 in the system log).  TCP 139 is the primary backup to 445, though the other ports may be required to get the connection started.


Situation: port 3268  (AD global catalog) blocked

What is seen: No problem, fast join, no obvious problems after join



Situation: All ICMP traffic is blocked

What is seen: Join is fast, boot is fine, logon is fine.  Nothing significant seen here.  The firewall didn't catch any packet drops.



Situation: Clock time of machine doesn't match domain controller (large skew >5min)

What is seen:  No problem in domain join.  System reboot and logon are both fine.  Clock time syncs after the domain join reboot.

Error: "An Active Directory Domain Controller (AD DC) for the domain 'test.local' could not be contacted.  Ensure that the domain name is typed correctly"


 Sub error message in Details:

Note: This information is intended for a network administrator.  If you are not your network's administrator, notify the administrator that you received this information, which has been recorded in the file C:\Windows\debug\dcdiag.txt.

The following error occurred when DNS was queried for the service location (SRV) resource record used to locate an Active Directory Domain Controller (AD DC) for domain "testforest.local":

The error was: "This operation returned because the timeout period expired."
(error code 0x000005B4 ERROR_TIMEOUT)

The query was for the SRV record for _ldap._tcp.dc._msdcs.testforest.local

The DNS servers used by this computer for name resolution are not responding. This computer is configured to use DNS servers with the following IP addresses:

10.1.1.50

Verify that this computer is connected to the network, that these are the correct DNS server IP addresses, and that at least one of the DNS servers is running.

Situation:  all dynamic ports above 1023 dropped in both directions.

Causes: DNS traffic is dropped on return, which produces the error above.  If DNS return traffic is working, domain join is fine, but boot is slow and logon is slow.

System log:

Group policy 1053.  The processing of Group Policy failed.  Windows could not resolve the user name.  This could be caused by ...

Group policy 1055.  The processing of Group policy failed.  Windows could not resolve the computer name.  This could be caused by ...

TerminalServices-RemoteConnectionManager 1067   The RD Session Host server cannot register 'TERMSRV' Service Principal Name to be used for server authentication. The following error occured: The RPC server is unavailable.

Service control manager 7022  The Network Location Awareness service hung on starting.

Windows Remote Management 10154

The WinRM service failed to create the following SPNs: WSMAN/Slave1.testforest.local; WSMAN/Slave1.

Additional Data
 The error received was 1722: %%1722.

User Action
 The SPNs can be created by an administrator using setspn.exe utility.

Application Log - winlogon 6006  GPClient taking a long time
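The numeric codes quoted in the messages above show up in both hex (the join dialog) and decimal (the WinRM event), which makes them awkward to search for. Here is a small Python sketch that normalises them; the table holds only the three codes that appear in this article, the function name is made up, and pairing 1722 with "The RPC server is unavailable" relies on the standard Win32 meaning of that code rather than on anything the WinRM event spells out.

```python
# Codes seen in this article, keyed by decimal value.
KNOWN_CODES = {
    1460: "ERROR_TIMEOUT",                 # 0x000005B4 in the join dialog
    9002: "RCODE_SERVER_FAILURE",          # 0x0000232A in the join dialog
    1722: "The RPC server is unavailable", # WinRM event 10154
}


def describe_code(raw):
    """Return (decimal_value, description) for a hex or decimal code."""
    text = str(raw).strip()
    # Values such as 0x000005B4 are hex; anything else is read as decimal.
    value = int(text, 16) if text.lower().startswith("0x") else int(text)
    return value, KNOWN_CODES.get(value, "unknown")


for code in ("0x000005B4", "0x0000232A", "1722"):
    print(code, "->", describe_code(code))
```

Codes that are not in the table come back as "unknown", so the table can be extended as new errors turn up.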