Chapter 15: WLAN Troubleshooting

We haven’t really gotten into how COVID has affected my studies to this point.  To be honest, it should have made things better, but it didn’t.  Having a lot of time at home means the kids are here, the wife is here, and the pets are here.  They all need different things at different times.  It makes getting things done during the day a challenge.  That is why I do my studies and blog posts at night... until I give myself an excuse to pass on working tonight because I have already spent my whole day in my office.  Rinse and repeat.  Well, things are starting to pick up from a project perspective, which is getting me out of the house more, so staring at these walls all day is less of an issue.  Only a couple more chapters to go.  Lastly, I have some exciting news that I found out today... I don’t think I want to share it right this minute, but I will in time.  It is another avenue for continued learning of WiFi.

That doesn’t have a darn thing to do with WLAN Troubleshooting, so let’s move on.  We have all been privy to WiFi issues, whether the fix is as simple as unplugging and plugging something back in, bouncing a port, or fixing an incorrectly typed PSK.  We will cover items as simple as that in this chapter, as well as some more complex issues that may occur.  We will dive into how to narrow down where the issue is by asking the right questions before we start blindly troubleshooting.  We will also determine what to do if it isn’t WiFi, since we know WiFi can only be at fault at Layers 1 and 2.

Five Tenets of WLAN Troubleshooting

Troubleshooting Best Practices
  • When is the problem happening?
    • What time did the problem occur?
    • Check log files of APs, WLCs, and applicable servers.
    • Make sure NTP and time zones are set properly.
  • Where is the problem happening?
    • Widespread or just one area?
    • Single floor or entire building?
  • Does the problem reoccur or is it a single event?
    • Single time problems are difficult to examine.  You may need to enable debugs to catch it in a log file.
  • Did you make changes recently?
    • Check logs for recent configuration changes.
    • Best practice is to plan and document changes.

Once you have gathered the info, you can decide on your plan of attack.

  1. Identify the issue:  Determine if a problem truly does exist, as WiFi often takes the blame.
  2. Re-Create the problem:  Try to duplicate the issue either on-site or in the lab.
  3. Locate and isolate the cause:  Troubleshoot up the OSI layers to help identify the issue.  At what layer of the network does the issue reside?
  4. Solve the problem:  Create and implement a resolution, network changes, firmware updates, etc.
  5. Test the solution:  Spread out testing over different times and locations.
  6. Document the problem and solution:  It is best practice to document all problems, diagnostics, and resolutions.  Keeping a database will help in the future.
  7. Provide feedback:  Follow up with the person who reported the initial issue.

Troubleshooting the OSI Model

Keep in mind that 802.11 operates at the first two layers of the OSI Model.  If the issue exists outside those layers, it is not a WiFi problem.  Layer 1 is often the culprit, with problems such as non-powered APs or client radio driver problems.  Additionally, WLAN coverage, capacity, and performance problems are often indicative of a poor WLAN design.  If you have exhausted Layer 1 issues, then you should move on to the Data-Link layer.  Here you will look at discovery, authentication, association, and roaming.  We also have WLAN security mechanisms here.  When we take a look at encryption methods, they must match on the AP and the client.

Most WiFi Problems are Client Issues

WiFi Client Issue

The first step with a client would be to disable and re-enable the WLAN network adapter.  This should reset the driver that resides between the radio and the OS.  We also want to make sure the client is running the latest drivers.  It is also a good idea to delete and re-create the WLAN profile.  A lot of times your issue could be as simple as a mistyped PSK, as we alluded to earlier.

Proper WLAN Design Reduces Problems

Users will typically blame WiFi first.  Doing your due diligence will be important to isolate where the issues are.  Working your way up the OSI layers is very important.  This will be a recurring theme.  Drill it into your head.

OSI Model WiFi

Layer 1 Troubleshooting

WLAN Design

The most common issues are coverage holes and co-channel interference (CCI).  Keep in mind the AP’s receive sensitivity is usually much better than the client device’s receive sensitivity.  Improper AP placement and antenna orientation can also contribute to poor WLAN coverage.  CCI is the top cause of needless airtime consumption.  The radios are doing exactly as they should, using CSMA/CA, but poor design makes it worse.  Proper channel reuse design solves this issue.

Transmit Power

A common issue is to have APs transmitting at full power, resulting in oversized coverage and poor capacity.  This also increases the odds of CCI and sticky clients.  Typical designs would call for APs set to one fourth to one third power.
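To put the one fourth to one third guidance into numbers, here is a quick sketch that converts a fraction of full power into mW and dBm.  The 100 mW maximum is just a hypothetical value I picked for illustration, so plug in whatever your APs actually support.

```python
import math


def mw_to_dbm(milliwatts: float) -> float:
    """Convert absolute power in milliwatts to dBm."""
    return 10 * math.log10(milliwatts)


def dbm_to_mw(dbm: float) -> float:
    """Convert dBm back to milliwatts."""
    return 10 ** (dbm / 10)


# Hypothetical AP maximum transmit power, for illustration only.
full_power_mw = 100.0

for label, fraction in (("full", 1.0), ("one third", 1 / 3), ("one fourth", 1 / 4)):
    power_mw = full_power_mw * fraction
    print(f"{label:>10}: {power_mw:6.1f} mW = {mw_to_dbm(power_mw):5.1f} dBm")
```

Dropping to one fourth power is roughly a 6 dB reduction, and one third is a little under 5 dB.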

RF Interference

Non-802.11 transmitters are the most common external cause of WLAN problems at Layer 1 and can cause a denial of service.  If the clear channel assessment always shows RF energy, 802.11 transmissions will be deferred entirely.  Corrupted frames due to RF interference will cause high re-transmissions and kill throughput.

Narrow-band Interference

A narrow-band signal is usually very high amplitude and will disrupt communications in the frequency space in which it is being transmitted.  It can disrupt one or several 802.11 channels.  Additionally it can cause corrupted frames and Layer 2 re-transmissions.  A resolution can be had by using a spectrum analyzer to find the source and either removing the offending device or eliminating the channels being disrupted.

Wide-band Interference

If the transmitting signal has the ability to disrupt the communications of an entire band, it is considered to be wide-band.  Wide-band interference can create a DoS for the entire 2.4GHz band.  The only resolution is to use a spectrum analyzer and locate and remove the offending device.

All Band Interference

All band interference is typically associated with frequency hopping spread spectrum (FHSS) communications that usually disrupt 802.11 communications at 2.4GHz.  While hopping and dwelling, an FHSS device will transmit in sections of the frequency space occupied by an 802.11b/g/n channel.  This will not typically result in a DoS, but the frame transmissions from the 802.11b/g/n device can be corrupted.  Bluetooth is a short-distance RF technology used in WPANs.  It uses FHSS and hops across the 2.4GHz ISM band at 1600 hops per second.  Older Bluetooth devices can cause problems, while newer ones utilize adaptive mechanisms to avoid interfering with 802.11 WLANs.  A single device will typically not cause major issues, but if you have a high number of FHSS transmitters in a space, the results will be devastating.

Drivers

It is vitally important to ensure your client devices are using the latest drivers for their radios.  Also a consideration is to ensure that legacy clients are compatible with newer technology APs.  802.11k/r/v mechanisms can create roaming and connectivity problems if the client doesn’t support them.  The simple solution to legacy devices is to upgrade them.

PoE

Ensure you have the proper PoE power budget on your switches or you will encounter AP reboots over and over.  I have personally encountered this with Cisco 4800 series APs.  They seem like they are booting, but they continually blink red, green, blue over and over and never come online.  In newer environments this has been due to a bad pair on the Ethernet side of the house.  As we learned back in Chapter 12, PoE can be supplied over multiple pairs.  Keep in mind that as 802.11ax becomes commonplace, 802.3at power will be necessary.  A good troubleshooting step with hung APs that are using PoE is to bounce their port, effectively powering them off and back on.

Firmware Bugs

When upgrading AP firmware, there is always a possibility that a firmware bug exists.  A recommended approach is to upgrade a small subset of your APs for testing before a full rollout.  If you think you have found a bug, contact the vendor for assistance.  Most vendors have recommended versions that have been vetted extensively.  If the bug is service impacting, rolling back to a previous version is likely the recommended course.

Layer 2 Troubleshooting

Layer 2 Re-transmissions

Excessive Layer 2 re-transmissions adversely affect the WLAN in two ways.  First, Layer 2 re-transmissions increase airtime consumption overhead and therefore decrease throughput.  Second, if application data has to be re-transmitted at Layer 2, the delivery of application traffic becomes delayed or inconsistent.  We already know how disruptive that is to VoIP.

  • Latency:  Latency is the time it takes to deliver a packet from the source device to the destination device.  It should not exceed 50ms for a VoIP packet.  Increased latency can cause echo issues.
  • Jitter:  Jitter is a variation of latency.  Jitter is how much latency of each packet varies from the average.  A high variance in latency can be a sign of 802.11 Layer 2 re-transmissions.  Results will be choppy audio, and all the re-transmissions will draw down battery life of VoWiFi phones.  For VoWiFi, jitter of less than 5ms variance is ideal.
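Here is a minimal sketch of jitter as described above, meaning how far each packet’s latency sits from the average.  The sample latencies are made up for illustration.  Note that this is the simple "average deviation" view from these notes; RTP stacks compute interarrival jitter differently (RFC 3550), so don’t expect the numbers to match what a phone reports.

```python
from statistics import mean


def jitter_ms(latencies_ms):
    """Average absolute deviation of each latency sample from the mean latency."""
    average = mean(latencies_ms)
    return mean([abs(sample - average) for sample in latencies_ms])


# Made-up one-way latency samples in milliseconds; one packet was delayed,
# as might happen after a few Layer 2 retries.
samples = [20, 22, 19, 35, 21]

print(f"average latency: {mean(samples):.1f} ms")
print(f"jitter:          {jitter_ms(samples):.2f} ms")
```

A single delayed packet is enough to pull the jitter figure close to the 5ms mark.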

A good 802.11 protocol analyzer can track Layer 2 retry statistics for the entire WLAN.  It can also track retry stats for each AP and client station.  Retry statistics can usually be monitored centrally from a WLC or NMS as well.  Your goal is to be under 10% for data, and under 5% if VoWiFi is in play.
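As a rough sketch of how those targets get applied, the snippet below turns frame counters into a retry percentage and checks it against the 10% (data) and 5% (VoWiFi) goals.  The function names and the counter values are hypothetical; in practice you would pull the totals from your analyzer, WLC, or NMS.

```python
def retry_rate(total_frames: int, retried_frames: int) -> float:
    """Percentage of transmitted frames that were Layer 2 retries."""
    if total_frames == 0:
        return 0.0
    return 100.0 * retried_frames / total_frames


def within_target(rate_percent: float, voice: bool = False) -> bool:
    """Targets from the notes: under 10% for data, under 5% when VoWiFi is in play."""
    limit = 5.0 if voice else 10.0
    return rate_percent < limit


# Hypothetical counters for one AP.
rate = retry_rate(total_frames=1000, retried_frames=80)
print(f"retry rate: {rate:.1f}%")
print("OK for data: ", within_target(rate))
print("OK for voice:", within_target(rate, voice=True))
```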

RF Interference

RF Interference from a non-802.11 transmitter is the number 1 cause of Layer 2 re-transmissions.

Low SNR

Second on the list of Layer 2 re-transmissions is a low SNR.  When the received signal is too close to the background noise, data can become corrupted. 
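SNR is just the gap in dB between the received signal and the noise floor, so the arithmetic is a subtraction.  In the sketch below the readings are made up, and the 20 dB minimum is an assumption I am using as a placeholder, not a number from this chapter; use whatever your design requirements call for.

```python
def snr_db(signal_dbm: float, noise_dbm: float) -> float:
    """Signal-to-noise ratio in dB: received signal minus noise floor."""
    return signal_dbm - noise_dbm


def snr_is_sufficient(signal_dbm: float, noise_dbm: float, min_snr_db: float = 20.0) -> bool:
    """min_snr_db is an assumed placeholder threshold, not a value from the chapter."""
    return snr_db(signal_dbm, noise_dbm) >= min_snr_db


# Made-up readings: the same -70 dBm signal against two different noise floors.
print(snr_db(-70, -95), snr_is_sufficient(-70, -95))
print(snr_db(-70, -80), snr_is_sufficient(-70, -80))
```

The signal strength is identical in both cases; only the noise floor changed, which is why signal strength alone doesn’t tell you much.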

Adjacent Channel Interference

Adjacent channel interference refers to the degradation of performance resulting from overlapping frequency space, due to an improper channel reuse design.  An adjacent channel would be a channel one above or below the current channel, so if you are on channel 6, channels 5 and 7 would be adjacent.  Adjacent channel interference can cause corrupted data and Layer 2 re-transmissions.
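To see why channels 5, 6, and 7 step on each other, it helps to look at the center frequencies.  The sketch below assumes 2.4GHz channels 1 through 13, which are spaced 5 MHz apart, and treats each channel as 22 MHz wide (the classic DSSS figure; pass 20 for OFDM).  The function names are my own.

```python
def center_mhz(channel: int) -> int:
    """Center frequency of a 2.4GHz channel (channels 1-13 are spaced 5 MHz apart)."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers 2.4GHz channels 1-13")
    return 2407 + 5 * channel


def channels_overlap(a: int, b: int, width_mhz: int = 22) -> bool:
    """Two channels overlap when their centers are closer than one channel width."""
    return abs(center_mhz(a) - center_mhz(b)) < width_mhz


for other in (5, 7, 1, 11):
    print(f"channel 6 vs {other:>2}: overlap = {channels_overlap(6, other)}")
```

Channels 1, 6, and 11 sit 25 MHz apart, which is why they are the usual reuse plan at 2.4GHz.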

Hidden Node

Hidden node issues occur when the AP can hear both clients, but the clients cannot hear each other.  Neither client detects the other when doing a clear channel assessment, which leads to both clients transmitting at the same time and corrupting data at the AP.  Hidden node issues drive re-transmission rates up to 15-20 percent or higher.  It is typically a result of poor WLAN design or obstructions like a new wall or piece of furniture that alters RF propagation.  It can also happen if a client is moved to a remote area or put inside a desk, or if the AP is transmitting at way too high a power, creating huge cells.  Distributed antenna systems can also become a problem if only a single AP is connected.  Multiple APs will still be required.

To troubleshoot a hidden node issue, using a protocol analyzer is recommended.  If you see a higher re-transmission rate for the MAC of one station when compared to others, it is likely you have found a hidden node.  You can additionally use RTS/CTS to diagnose the problem.  On devices that support it, try lowering the RTS/CTS threshold on the suspected hidden node to about 500 bytes.  You will have to play with this setting depending on the environment.  Doing so will cause the hidden node to reserve the medium and force all other stations to pause, effectively decreasing collisions and re-transmissions.  This setting is typically only available on legacy devices.
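The "one station retries far more than the others" check can be sketched like this.  The 15 percent floor comes from the range mentioned above; the "twice the median" factor and the MAC addresses are purely my own assumptions for illustration, and a flagged station is only a suspect, not a confirmed hidden node.

```python
from statistics import median


def suspected_hidden_nodes(retry_rates, factor=2.0, floor_percent=15.0):
    """Return MACs whose retry rate is at least floor_percent AND more than
    factor times the median retry rate of all stations on the AP.

    factor is an assumed tuning knob, not a value from the chapter.
    """
    baseline = median(retry_rates.values())
    return sorted(
        mac
        for mac, rate in retry_rates.items()
        if rate >= floor_percent and rate > factor * baseline
    )


# Hypothetical per-station retry percentages taken from a capture on one AP.
rates = {
    "00:11:22:33:44:01": 4.0,
    "00:11:22:33:44:02": 5.5,
    "00:11:22:33:44:03": 18.0,
    "00:11:22:33:44:04": 6.0,
}

print(suspected_hidden_nodes(rates))
```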

Fixes for Hidden Node Issues:

  1. Use RTS/CTS: Discussed above
  2. Increase power to all stations:  Most clients have fixed power, but if it is adjustable, increasing it may help.
  3. Remove the obstacles:  Removing walls will prove difficult, but something like a metal cabinet could be removed.
  4. Move the hidden node station:  Move the station to an area more likely to be heard.
  5. Add an additional AP:  Typically the best fix.

Mismatched Power

If an AP’s power exceeds that of a client, the client will be able to hear the AP, but won’t have the power to reply.  This results in Layer 2 re-transmissions because the AP never hears a reply.  This is more likely to be an outdoor problem than an indoor one, given the AP density normally found in indoor environments.  Due to antenna reciprocity, antennas will amplify received signals just as they do for transmitted signals.  A high gain antenna on the AP can help in this situation, as it will amplify the weak client’s transmission.  A protocol analyzer will show you if an issue exists: the client’s frame transmissions will be corrupted when captured near the AP, but not when captured near the client.  Matching the client and AP transmit power settings is the best way to alleviate this issue.  In dense environments, clients can cause CCI issues due to their power being high enough to be heard by multiple APs.  APs with 802.11k capabilities enabled can inform clients to use Transmit Power Control to change their transmit power dynamically to match the AP’s power.

Multipath

Multipath can cause intersymbol interference, which causes data corruption.  Using directional antennas can help alleviate multipath by focusing the RF to where it is needed.  802.11a/b/g radios are the most susceptible to multipath.  802.11n, 802.11ac, and 802.11ax radios support MIMO, which takes advantage of multipath rather than suffering from it.

Security Troubleshooting

802.11 security defines L2 authentication methods and Layer 2 dynamic encryption.  It can normally be diagnosed using an AP, WLC, or NMS.  Security and AAA log files from the WLAN hardware and RADIUS server are a great place to start when troubleshooting either PSK or 802.1X/EAP authentication problems.

PSK Troubleshooting

PSK can be troubleshot using a WLAN vendor diagnostic tool, log files, or a protocol analyzer.  PSK requires the 4-way handshake to be successful.  The PSK credentials must match on both the AP and the client.  If the 4-way handshake was successful, the unicast PTK (Pairwise Transient Key) is installed on the AP and the client.  Next the client gets its IP from DHCP.  If problems still exist at this point, we know it is now a networking issue.  The most common PSK issue is mismatched credentials.  If they don’t match, the AP and client end up with different Pairwise Master Keys, thus the 4-way handshake fails, and the client will never attempt to get an IP.  Additionally, there can be encryption mismatches, for example when the AP is set up for WPA2 only and a legacy client only supports WPA.
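To show why one mistyped character is fatal, here is a small sketch of how a WPA/WPA2-Personal passphrase is turned into the 256-bit PMK: PBKDF2 with HMAC-SHA1, the SSID as the salt, and 4096 iterations.  The SSID and passphrases below are made up.  This only covers passphrase-based WPA/WPA2; WPA3 SAE works differently.

```python
import hashlib


def derive_pmk(passphrase: str, ssid: str) -> bytes:
    """Derive the 32-byte PMK from a WPA/WPA2-Personal passphrase and SSID."""
    return hashlib.pbkdf2_hmac(
        "sha1",
        passphrase.encode("utf-8"),
        ssid.encode("utf-8"),  # the SSID acts as the salt
        4096,                  # iteration count used for WPA/WPA2 passphrases
        32,                    # 256-bit key
    )


# Hypothetical SSID and passphrases; the client has a one-character typo.
ap_pmk = derive_pmk("CorrectHorse42", "LabWLAN")
client_pmk = derive_pmk("CorrectHorse43", "LabWLAN")

print("AP PMK:    ", ap_pmk.hex())
print("Client PMK:", client_pmk.hex())
print("Match:", ap_pmk == client_pmk)
```

Since the SSID is part of the calculation, the same passphrase on a different SSID also produces a different PMK.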

802.1X/EAP Troubleshooting

802.1X is a port-based access control standard that defines the mechanisms necessary to authenticate and authorize devices to network resources.  It consists of three main components, the supplicant, authenticator, and authentication server.

  • Supplicant:  User or device that is requesting access.
  • Authenticator:  Gateway device that sits in the middle between the supplicant and authentication server, controlling or regulating the supplicant’s access to the network.
  • Authentication Server:  Its job is to validate the supplicant’s credentials.

802.1X/EAP Troubleshooting Zones

Zone 1 Backend Communication Problems

Zone 1 should always be investigated first.  If the AP and RADIUS server cannot communicate with each other, the entire authentication process will fail.  Same goes for the RADIUS server to the LDAP database.

4 Points of Failure in Zone 1

  1. Shared Secret Mismatch:  The authenticator and RADIUS server must validate each other with a shared secret.  Check for mistyped secrets.
  2. Incorrect IP Settings on the AP or the RADIUS Server:  The AP must know the proper IP of the RADIUS server, and the RADIUS server must be configured with the IPs of any APs or WLCs functioning as authenticators.
  3. Authentication Port Mismatch:  The authenticator and the RADIUS server must use the same authentication port.  UDP 1812 is the standard RADIUS authentication port, although some older servers use UDP 1645.
  4. LDAP Query Failure:  The last point of failure is a failure of the LDAP query between the RADIUS server and the LDAP database.

Zone 2 Supplicant Certificate Process

If all is well with Zone 1, then we need to look at the supplicant.  Issues here are typically related to certificates or client credentials.  To verify if there is a certificate issue, you can edit the supplicant client software settings and temporarily disable the validation of the server certificate.  If authentication is now successful, you have confirmed a problem with how the certs are implemented within the 802.1X/EAP framework.  Remember to turn validation back on once you have your answer.

Most Common Certificate Issues

  1. The root CA cert is installed in the incorrect certificate store
  2. The incorrect root cert is chosen
  3. The server cert has expired
  4. The root cert has expired
  5. The supplicant clock settings are incorrect

The root CA cert needs to be installed in the Trusted Root Certification Authorities store of the supplicant device.  It is also possible that back in Zone 1, the server cert configuration may be incorrect.  Authentications will also fail if the EAP protocols do not match on both the client and the authentication server.  If one uses PEAPv0 and the other PEAPv1, the authentication will fail.

Zone 2 Supplicant Credential Problems

If you have verified that no cert issues exist, then you are left with supplicant credential failures.

Possible Credential Problems

  1. Expired password or user account
  2. Wrong password
  3. User account does not exist in LDAP
  4. Machine account has not been joined to the Windows domain

One last consideration when troubleshooting 802.1X/EAP is RADIUS attributes.  These provide role-based access control, providing custom settings for different groups of users or devices.  If the RADIUS attribute configuration does not match on the authenticator and the RADIUS server, users might be assigned to a default role or VLAN.  The worst case might result in an authentication failure.

VPN Troubleshooting

VPNs are rarely used as the primary method of security for WLANs at this point.  IPSec VPNs are still commonly used to connect remote branch offices with corporate offices across WAN links.

IPSec Tunnel Creation

  1. IKE Phase 1: two VPN endpoints authenticate one another and negotiate keying material.  This results in an encrypted tunnel used by Phase 2 for negotiating the ESP (Encapsulating Security Payload) Security associations.
  2. IKE Phase 2:  two VPN endpoints use the secure tunnel created in phase 1 to negotiate ESP security associations (SAs).  The ESP SAs are used to encrypt user traffic that traverses between the endpoints.

Common Issues if IKE Phase 1 Fails

  1. Certificate problems
  2. Incorrect network settings
  3. Incorrect NAT settings on external firewall

Common ports that need to be opened on any firewall a VPN tunnel may traverse are UDP 500 (IPSec IKE) and UDP 4500 (NAT Traversal).

Common Issues if IKE Phase 2 Fails

  1. Mismatched transform sets between the client and server (encryption algorithm, hash algorithm, and so forth).
  2. Mixing different vendor solutions.

Roaming Troubleshooting

The most common roaming problems are the result of bad client drivers or bad WLAN design.  Sticky clients are a result of APs in close physical vicinity with transmit power levels that are too high.  This causes clients to stay connected to their original AP and not roam.  Proper roaming design is a result of proper primary and secondary coverage from the client perspective.  Clients decide when to roam, not the APs.  No secondary coverage creates a dead zone, where connectivity might be lost.  If too many APs are heard, the client may be constantly roaming, which is an issue for voice if the client needs to reauthenticate each time.  Changes in the environment can also create issues.  If new construction occurs, new walls can attenuate the RF and create dead zones.

Most security-related roaming problems are based on the fact that many clients do not support either OKC or fast BSS transition (FT).  If the AP supports FT and the client doesn’t, the client will still need to reauthenticate for each roam.  As mentioned previously, enabling Voice-Enterprise mechanisms on an AP may create connectivity issues for legacy clients that can’t process the new information elements in the management frames.  Always test legacy clients before deploying FT.  Adding an SSID for just fast BSS transition devices is an option if you don’t mind the extra overhead.

Channel Utilization

Good channel utilization thresholds to live by

  • 80 percent channel utilization impacts all 802.11 data transmissions
  • 50 percent channel utilization impacts video traffic
  • 20 percent channel utilization impacts voice traffic
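Those three thresholds are easy to turn into a quick check.  The sketch below is just my own helper (the function name is made up) that takes a channel utilization percentage and reports which traffic types the list above says are impacted.

```python
def impacted_traffic(utilization_percent: float) -> list:
    """Map channel utilization to the traffic types impacted, per the thresholds above."""
    impacted = []
    if utilization_percent >= 20:
        impacted.append("voice")
    if utilization_percent >= 50:
        impacted.append("video")
    if utilization_percent >= 80:
        impacted.append("all data")
    return impacted


for utilization in (10, 35, 60, 85):
    print(f"{utilization:>3}% utilization -> impacted: {impacted_traffic(utilization) or 'nothing yet'}")
```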

High CCI is often a result of improper channel planning.  Oversaturation of clients and high-bandwidth applications can consume too much airtime on a channel as well.  Proper capacity planning is important.  Having too many SSIDs, low basic rates, and a lot of legacy devices can also contribute to channel utilization.  The QBSS Load information element can often help you to see what channel utilization looks like from the AP’s perspective.

Layers 3-7 Troubleshooting

The first step here is to check Layer 3 connectivity.  Is the client getting an IP address in the expected subnet?  A VLAN probe tool on the AP can be helpful by probing all the designated VLANs with multiple DHCP requests.  If a lease is offered to the AP, it will decline it, but you will know that DHCP is working on that VLAN.  If the DHCP request fails, you will need to look at the upstream router to ensure it has the proper IP helper address configured.  If it doesn’t, requests will never make it to the DHCP server.  The DHCP server itself could also be down, out of leases, or not configured properly.  Next up are issues at the switch level.  You may have VLANs not configured on the access switch, VLANs not tagged on the 802.1Q port, or a switch port configured as an access port instead of a trunk.  Beyond the switch, we could also have firewall issues.  Firewalls can block specific applications, or groups of applications.  Reviewing log files may be necessary.
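The "is the client in the expected subnet?" question can be answered with Python’s standard ipaddress module.  The addresses and the expected subnet below are hypothetical.  I also check for a link-local (169.254.x.x) address, which usually means the client never got a DHCP answer at all.

```python
import ipaddress


def in_expected_subnet(client_ip: str, expected_subnet: str) -> bool:
    """True when the client's address falls inside the subnet planned for its VLAN."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(expected_subnet)


def describe_client_ip(client_ip: str, expected_subnet: str) -> str:
    """Give a first-pass Layer 3 verdict for a client address."""
    if ipaddress.ip_address(client_ip).is_link_local:
        return "self-assigned address: DHCP never answered"
    if in_expected_subnet(client_ip, expected_subnet):
        return "address is in the expected subnet"
    return "address is in the wrong subnet: check VLAN assignment and tagging"


# Hypothetical user VLAN subnet and client addresses.
expected = "10.20.30.0/24"
for ip in ("10.20.30.57", "10.20.40.57", "169.254.12.7"):
    print(f"{ip:>14}: {describe_client_ip(ip, expected)}")
```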

WLAN Troubleshooting Tools

WLAN Discovery Applications

To troubleshoot WLANs, you will need an 802.11 client NIC and a WLAN discovery application like WiFi Explorer.  These will give you a broad overview of an existing WLAN.  They send out null probe request frames, and listen for the 802.11 probe response frames and beacon frames sent by the AP.

Spectrum Analyzers

Spectrum analyzers are frequency domain measurement devices that can measure the amplitude and frequency of electromagnetic signals.  The Ekahau Sidekick is an example of this, as is MetaGeek’s Wi-Spy.

Protocol Analyzers

Protocol analyzers provide network visibility into exactly what traffic is traversing a network.  They capture and store network packets, providing you with a protocol decode for each packet captured.  You get a display with all the fields broken down.  Wireshark is one of the most popular applications for packet analysis.  Most often these are used to look at Layer 2 802.11 frame exchanges between the APs and client devices.  Radiotap headers provide additional link-layer information that is added to each 802.11 frame when it is captured.  Using filters to see the information you are after is key to streamlining the troubleshooting process.

Throughput Test Tools

Throughput test tools are used to evaluate bandwidth and performance throughout a network.  They normally work in a client/server model to measure data streams in one direction or in both directions between two endpoints.  When testing downlink WLAN throughput, the 802.11 client should be configured as the server.  When testing uplink WLAN throughput, the 802.11 client should be configured as the client communicating with a server behind the AP.  iPerf is an open-source command line utility that is used to generate TCP or UDP data streams to test throughput.

Standard IP Network Test Commands

  • Ping:  tests basic connectivity between source host and destination host
  • Arp:  used to display the Address Resolution Protocol cache, which is the mapping of IP addresses to MAC addresses.  Viewing the Arp cache on the AP is often helpful.
  • Tracert/Traceroute:  determines detailed information about the path to a destination host, including the route an IP packet takes, number of hops, and response time.
  • Nslookup:  used to troubleshoot problems with Domain Name System (DNS) address resolution.  Many WLAN captive portals used for WLAN guest access rely on DNS redirection.
  • Netstat:  displays network statistics for active TCP sessions for both incoming and outgoing ports, Ethernet stats, IPv4 and IPv6 stats and more.  Helpful when troubleshooting firewall issues.

Secure Shell

SSH and Telnet are methods to connect to network equipment.  Secure Shell is the secure alternative to Telnet.  SSH 2 is the current version and the one that should be used.  PuTTY and SecureCRT are popular programs for terminal emulation.

Chapter Review

This chapter review of WLAN Troubleshooting was a monster filled with a ton of information.  I am starting to wonder if my longer blog posts are the content I need the most help with.  There has to be a correlation there.  In this chapter we went over how we start the troubleshooting process by asking questions to help isolate a starting point.  We then talked about the OSI model and how we start from Layer 1 and work our way up.  Then we blamed the client!  We know that most issues lie in the lap of the client.  Something that should be obvious is that if your WLAN design is poor your troubleshooting tickets will be high.  An area I am a bit weak on is security and the areas in which we need to look to ensure it is configured properly so I will personally be digging into that deeper.  Lastly we learned about the tools at our disposal that can be used to assist in our troubleshooting process.  Tools will only go so far if you don’t know what you are looking for.  Next up we will be going over Wireless Attacks, Intrusion Monitoring and Policy.  I sense another long review coming.

Interesting Link of the Day

It seems every conference is being cancelled or postponed and WTF20 put on by CWNP was almost an exception to the rule.  But alas today they announced that the conference is going virtual, which is for us a blessing if you weren’t able to attend in person.  Head over to the WTF20 site and get yourself registered for the event on September 27th to October 2nd.