NXP SJA1105 switch driver

Overview

The NXP SJA1105 is a family of 6 devices:

  • SJA1105E: First generation, no TTEthernet
  • SJA1105T: First generation, TTEthernet
  • SJA1105P: Second generation, no TTEthernet, no SGMII
  • SJA1105Q: Second generation, TTEthernet, no SGMII
  • SJA1105R: Second generation, no TTEthernet, SGMII
  • SJA1105S: Second generation, TTEthernet, SGMII

These are SPI-managed automotive switches, with all ports being gigabit capable, and supporting MII/RMII/RGMII and optionally SGMII on one port.

Being automotive parts, their configuration interface is geared towards set-and-forget use, with minimal dynamic interaction at runtime. They require a static configuration to be composed by software and packed with CRC and table headers, and sent over SPI.

The static configuration is composed of several configuration tables. Each table takes a number of entries. Some configuration tables can be (partially) reconfigured at runtime, some not. Some tables are mandatory, some not:

Table Mandatory Reconfigurable
Schedule no no
Schedule entry points if Scheduling no
VL Lookup no no
VL Policing if VL Lookup no
VL Forwarding if VL Lookup no
L2 Lookup no no
L2 Policing yes no
VLAN Lookup yes yes
L2 Forwarding yes partially (fully on P/Q/R/S)
MAC Config yes partially (fully on P/Q/R/S)
Schedule Params if Scheduling no
Schedule Entry Points Params if Scheduling no
VL Forwarding Params if VL Forwarding no
L2 Lookup Params no partially (fully on P/Q/R/S)
L2 Forwarding Params yes no
Clock Sync Params no no
AVB Params no no
General Params yes partially
Retagging no yes
xMII Params yes no
SGMII no yes

Also the configuration is write-only (software cannot read it back from the switch except for very few exceptions).

The driver creates a static configuration at probe time, and keeps it at all times in memory, as a shadow for the hardware state. When required to change a hardware setting, the static configuration is also updated. If that changed setting can be transmitted to the switch through the dynamic reconfiguration interface, it is; otherwise the switch is reset and reprogrammed with the updated static configuration.

Traffic support

The switches do not support switch tagging in hardware. But they do support customizing the TPID by which VLAN traffic is identified as such. The switch driver is leveraging CONFIG_NET_DSA_TAG_8021Q by requesting that special VLANs (with a custom TPID of ETH_P_EDSA instead of ETH_P_8021Q) are installed on its ports when not in vlan_filtering mode. This does not interfere with the reception and transmission of real 802.1Q-tagged traffic, because the switch does no longer parse those packets as VLAN after the TPID change. The TPID is restored when vlan_filtering is requested by the user through the bridge layer, and general IP termination becomes no longer possible through the switch netdevices in this mode.

The switches have two programmable filters for link-local destination MACs. These are used to trap BPDUs and PTP traffic to the master netdevice, and are further used to support STP and 1588 ordinary clock/boundary clock functionality.

The following traffic modes are supported over the switch netdevices:

  Standalone ports Bridged with vlan_filtering 0 Bridged with vlan_filtering 1
Regular traffic Yes Yes No (use master)
Management traffic (BPDU, PTP) Yes Yes Yes

Switching features

The driver supports the configuration of L2 forwarding rules in hardware for port bridging. The forwarding, broadcast and flooding domain between ports can be restricted through two methods: either at the L2 forwarding level (isolate one bridge's ports from another's) or at the VLAN port membership level (isolate ports within the same bridge). The final forwarding decision taken by the hardware is a logical AND of these two sets of rules.

The hardware tags all traffic internally with a port-based VLAN (pvid), or it decodes the VLAN information from the 802.1Q tag. Advanced VLAN classification is not possible. Once attributed a VLAN tag, frames are checked against the port's membership rules and dropped at ingress if they don't match any VLAN. This behavior is available when switch ports are enslaved to a bridge with vlan_filtering 1.

Normally the hardware is not configurable with respect to VLAN awareness, but by changing what TPID the switch searches 802.1Q tags for, the semantics of a bridge with vlan_filtering 0 can be kept (accept all traffic, tagged or untagged), and therefore this mode is also supported.

Segregating the switch ports in multiple bridges is supported (e.g. 2 + 2), but all bridges should have the same level of VLAN awareness (either both have vlan_filtering 0, or both 1). Also an inevitable limitation of the fact that VLAN awareness is global at the switch level is that once a bridge with vlan_filtering enslaves at least one switch port, the other un-bridged ports are no longer available for standalone traffic termination.

Topology and loop detection through STP is supported.

L2 FDB manipulation (add/delete/dump) is currently possible for the first generation devices. Aging time of FDB entries, as well as enabling fully static management (no address learning and no flooding of unknown traffic) is not yet configurable in the driver.

A special comment about bridging with other netdevices (illustrated with an example):

A board has eth0, eth1, swp0@eth1, swp1@eth1, swp2@eth1, swp3@eth1. The switch ports (swp0-3) are under br0. It is desired that eth0 is turned into another switched port that communicates with swp0-3.

If br0 has vlan_filtering 0, then eth0 can simply be added to br0 with the intended results. If br0 has vlan_filtering 1, then a new br1 interface needs to be created that enslaves eth0 and eth1 (the DSA master of the switch ports). This is because in this mode, the switch ports beneath br0 are not capable of regular traffic, and are only used as a conduit for switchdev operations.

Offloads

Time-aware scheduling

The switch supports a variation of the enhancements for scheduled traffic specified in IEEE 802.1Q-2018 (formerly 802.1Qbv). This means it can be used to ensure deterministic latency for priority traffic that is sent in-band with its gate-open event in the network schedule.

This capability can be managed through the tc-taprio offload ('flags 2'). The difference compared to the software implementation of taprio is that the latter would only be able to shape traffic originated from the CPU, but not autonomously forwarded flows.

The device has 8 traffic classes, and maps incoming frames to one of them based on the VLAN PCP bits (if no VLAN is present, the port-based default is used). As described in the previous sections, depending on the value of vlan_filtering, the EtherType recognized by the switch as being VLAN can either be the typical 0x8100 or a custom value used internally by the driver for tagging. Therefore, the switch ignores the VLAN PCP if used in standalone or bridge mode with vlan_filtering=0, as it will not recognize the 0x8100 EtherType. In these modes, injecting into a particular TX queue can only be done by the DSA net devices, which populate the PCP field of the tagging header on egress. Using vlan_filtering=1, the behavior is the other way around: offloaded flows can be steered to TX queues based on the VLAN PCP, but the DSA net devices are no longer able to do that. To inject frames into a hardware TX queue with VLAN awareness active, it is necessary to create a VLAN sub-interface on the DSA master port, and send normal (0x8100) VLAN-tagged towards the switch, with the VLAN PCP bits set appropriately.

Management traffic (having DMAC 01-80-C2-xx-xx-xx or 01-19-1B-xx-xx-xx) is the notable exception: the switch always treats it with a fixed priority and disregards any VLAN PCP bits even if present. The traffic class for management traffic has a value of 7 (highest priority) at the moment, which is not configurable in the driver.

Below is an example of configuring a 500 us cyclic schedule on egress port swp5. The traffic class gate for management traffic (7) is open for 100 us, and the gates for all other traffic classes are open for 400 us:

#!/bin/bash

set -e -u -o pipefail

NSEC_PER_SEC="1000000000"

gatemask() {
        local tc_list="$1"
        local mask=0

        for tc in ${tc_list}; do
                mask=$((${mask} | (1 << ${tc})))
        done

        printf "%02x" ${mask}
}

if ! systemctl is-active --quiet ptp4l; then
        echo "Please start the ptp4l service"
        exit
fi

now=$(phc_ctl /dev/ptp1 get | gawk '/clock time is/ { print $5; }')
# Phase-align the base time to the start of the next second.
sec=$(echo "${now}" | gawk -F. '{ print $1; }')
base_time="$(((${sec} + 1) * ${NSEC_PER_SEC}))"

tc qdisc add dev swp5 parent root handle 100 taprio \
        num_tc 8 \
        map 0 1 2 3 5 6 7 \
        queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
        base-time ${base_time} \
        sched-entry S $(gatemask 7) 100000 \
        sched-entry S $(gatemask "0 1 2 3 4 5 6") 400000 \
        flags 2

It is possible to apply the tc-taprio offload on multiple egress ports. There are hardware restrictions related to the fact that no gate event may trigger simultaneously on two ports. The driver checks the consistency of the schedules against this restriction and errors out when appropriate. Schedule analysis is needed to avoid this, which is outside the scope of the document.

At the moment, the time-aware scheduler can only be triggered based on a standalone clock and not based on PTP time. This means the base-time argument from tc-taprio is ignored and the schedule starts right away. It also means it is more difficult to phase-align the scheduler with the other devices in the network.

Device Tree bindings and board design

This section references Documentation/devicetree/bindings/net/dsa/sja1105.txt and aims to showcase some potential switch caveats.

RMII PHY role and out-of-band signaling

In the RMII spec, the 50 MHz clock signals are either driven by the MAC or by an external oscillator (but not by the PHY). But the spec is rather loose and devices go outside it in several ways. Some PHYs go against the spec and may provide an output pin where they source the 50 MHz clock themselves, in an attempt to be helpful. On the other hand, the SJA1105 is only binary configurable - when in the RMII MAC role it will also attempt to drive the clock signal. To prevent this from happening it must be put in RMII PHY role. But doing so has some unintended consequences. In the RMII spec, the PHY can transmit extra out-of-band signals via RXD[1:0]. These are practically some extra code words (/J/ and /K/) sent prior to the preamble of each frame. The MAC does not have this out-of-band signaling mechanism defined by the RMII spec. So when the SJA1105 port is put in PHY role to avoid having 2 drivers on the clock signal, inevitably an RMII PHY-to-PHY connection is created. The SJA1105 emulates a PHY interface fully and generates the /J/ and /K/ symbols prior to frame preambles, which the real PHY is not expected to understand. So the PHY simply encodes the extra symbols received from the SJA1105-as-PHY onto the 100Base-Tx wire. On the other side of the wire, some link partners might discard these extra symbols, while others might choke on them and discard the entire Ethernet frames that follow along. This looks like packet loss with some link partners but not with others. The take-away is that in RMII mode, the SJA1105 must be let to drive the reference clock if connected to a PHY.

MDIO bus and PHY management

The SJA1105 does not have an MDIO bus and does not perform in-band AN either. Therefore there is no link state notification coming from the switch device. A board would need to hook up the PHYs connected to the switch to any other MDIO bus available to Linux within the system (e.g. to the DSA master's MDIO bus). Link state management then works by the driver manually keeping in sync (over SPI commands) the MAC link speed with the settings negotiated by the PHY.