Can SPI chip select timing be ensured without delay_ns? #9

Closed
opened 2021-04-23 12:45:50 +08:00 by harry · 7 comments
Contributor

The ENC424J600 requires that the chip select CS_n (known as `NSS` on STM32 datasheets) be toggled between SPI transactions: it must go low before and return high after each transaction, as described on p.41 of its datasheet:

> Commands and data are sent to the device on the SI pin. Data is driven out on the SO line on the falling edge of SCK. The CS pin must be held low while any operation is performed, and returned to logic high when finished.

The timing diagram on p.148 might also suggest a 60 ns delay is needed after data reception and before CS_n goes from low to high.

Currently (since `1add94c12e77924747f6bc5e9a3858f85fe3115d`), we add this delay by calling a user-provided delay function right after `embedded_hal::blocking::spi::transfer()` returns. It has been discussed on [embedded_hal's repo](https://github.com/rust-embedded/embedded-hal/issues/264) that `transfer()` doesn't guarantee that the whole SCK period has elapsed for the final byte read (NB the definition of "final" is determined by the user, since the SPI slave usually keeps shifting out more data from subsequent addresses), especially on STM32F4xx, so without a delay the chip might get deselected too early and the SPI transaction might be incomplete.

Although we have implemented the delay mechanism, it may not be the only solution, especially when the user application doesn't want to provide a hard-coded delay function to the driver.
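
For reference, a minimal sketch of the pattern described above, built on the embedded-hal 0.2 blocking traits. The type and field names are hypothetical and do not mirror the driver's actual API:

```rust
use embedded_hal::blocking::spi::Transfer;
use embedded_hal::digital::v2::OutputPin;

/// Hypothetical wrapper; names and structure are illustrative only.
struct SpiPort<SPI, NSS, F> {
    spi: SPI,
    nss: NSS,
    delay_ns: F, // user-provided delay closure
}

impl<SPI, NSS, F, E> SpiPort<SPI, NSS, F>
where
    SPI: Transfer<u8, Error = E>,
    NSS: OutputPin,
    F: Fn(u32),
{
    fn transfer<'a>(&mut self, buf: &'a mut [u8]) -> Result<&'a [u8], E> {
        self.nss.set_low().ok();      // assert CS_n
        let result = self.spi.transfer(buf);
        (self.delay_ns)(60);          // hold CS_n low after the final SCK pulse
        self.nss.set_high().ok();     // deassert CS_n
        (self.delay_ns)(60);          // keep CS_n high before the next transaction
        result
    }
}
```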

Author
Contributor

@occheung When you're free again, would you please show us how to test that this behaviour has been completely eliminated?

Take a step back and get a feel for the timescales involved! How long does one CPU clock cycle (instruction) take? How long is the SPI clock period? How many instructions do you think are there between asserting CS and the SPI transaction starting or between it ending and CS deassertion? Or just take a scope and look at the timing.

Author
Contributor

Hi Robert, it was stupid of me to phrase my question like this. Clearly, a scope or logic analyser is enough to check how CS changes w.r.t. data transactions (indicated by SCK pulses).

Actually, what I intended to ask is: under what situations would SPI transactions fail if there is not enough delay before CS is deasserted, or before CS is asserted for the next contiguous transaction? Obviously, timing issues don't always happen **(empirically speaking)**: using my PmodNIC100 module on an STM32F407 board, the Ethernet controller can always be initialised reliably, and incoming/outgoing packets seem to be fine. It should be possible to test the packets thoroughly by having the board repeatedly transmit a packet at regular intervals and using `tshark` to check for any lost or malformed packets.

Still, I used a logic analyser to capture the SPI signals under two types of situation: during initialisation, and during smoltcp's polling. The captures reveal the following: (edit: these are empirical findings)

  1. The current code adds a 60ns delay in each of the following:

    • between final byte (i.e. final clock pulse) and CS deassertion, and
    • between CS deassertion and function returning the RX byte.

    Without changing these delays, CS always got deasserted at least 400ns after the final SCK pulse (min. req. 50ns), and CS always stayed deasserted (high) for at least 400ns in-between contiguous SPI transactions (min. req. 20ns).

  2. When these additional delays were removed, the CS deassertion between transactions was often skipped (i.e. NSS stayed low across two back-to-back SPI commands). However, the controller still sent the correct data (aligned to SCK) for every command -- presumably only because an SPI transaction is always initiated by the master, which marks the beginning of each transaction.

  3. On the occasions where the CS deassertion wasn't skipped despite the additional delays being removed, CS always got deasserted ~250ns after the final SCK pulse and stayed high for ~250ns, which still meets the timing requirements.

  4. From a quick skim, there was no difference in the data received from the controller when the delays were removed.

Author
Contributor

To clarify my thoughts and doubts:

  1. The issue on [embedded_hal](https://github.com/rust-embedded/embedded-hal/issues/264) might still hold. On both Booster and my own testing firmware, the CPU is clocked at 168MHz. In my firmware, the SPI SCK is set to 5MHz (capped at 14MHz by the ENC424J600). The HAL blocks while polling the RXNE flag, which is asserted at the final sampling edge on SCK (RM0090 Rev 18, p.883), and there is only a volatile read of the RX buffer (via SPI_DR) right before the transfer function returns the read bytes to our driver, after which CS is deasserted. Theoretically, CS could therefore be deasserted before the complete SPI SCK cycle of the slave's final output stage has elapsed. This issue could very well still happen if we kept only the delay that holds CS high, though I haven't tested this.

  2. Because I saw no difference in the data emitted by the slave, I am confused by the ENC424J600 datasheet's statement:

    > The CS pin must be held low while any operation is performed, and returned to logic high when finished.

    I really doubt that we would ever need to deassert/re-assert CS between back-to-back SPI transactions. Eliminating such operations would greatly simplify our SPI implementation and reduce spinning time.

    (edit: observing empirical results and disregarding actual specs is often dangerous)

  3. I made a mistake writing the title for this open issue. It's not intended to say that @occheung's delay logic is going to fail - I'm rather confident it works to guarantee the SPI timing. I simply meant to say that this might not be the only way to meet the timing requirements. And folks over on Booster's repo seem to want us to use embedded_hal's `DelayUs`, which is coarser than both the CPU clock and the ENC424J600's SPI clock.

  4. Alternatively, we could use cortex_m's `asm::delay` to spin for a fixed number of CPU cycles around CS assertion/deassertion (sketched below). But I don't think it's good to assume that every user application using `embedded_hal` runs at a CPU clock of several hundred MHz. `DelayUs` would be better.
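
For scale, a rough sketch of what point 4 would look like; the `CPU_HZ` constant and the rounding are my own assumptions, not anything the driver does. At 168 MHz one CPU cycle is roughly 6 ns, so a 60 ns delay is on the order of 10-11 cycles:

```rust
use cortex_m::asm;

// Assumed, not provided by the driver: e.g. STM32F407 at 168 MHz (~5.95 ns/cycle).
const CPU_HZ: u64 = 168_000_000;

/// Spin for at least `ns` nanoseconds by burning CPU cycles.
fn delay_ns(ns: u32) {
    // Round up so we never undershoot: 60 ns * 168 MHz = 10.08 -> 11 cycles.
    let cycles = (ns as u64 * CPU_HZ + 999_999_999) / 1_000_000_000;
    asm::delay(cycles as u32);
}
```

The granularity is still a few CPU cycles at best, and the constant silently becomes wrong if the firmware clocks the CPU differently, which is exactly the objection in point 4.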

harry changed title from SPI NSS (chip select) might go high too early after data received to Can SPI chip select timing be ensured without delay_ns? 2021-04-26 16:34:36 +08:00
Author
Contributor

To all reading this: most of my comments are for my own record and future reference. I am not requesting any comments or answers from others, but if someone does reply I would greatly appreciate their help!

Author
Contributor

Update:

  1. I read the datasheet more carefully and found two different types of valid CS behaviour:

    1. If there are contiguous transactions that are not fixed-length, CS must be deasserted when the transaction ends.

      > (p.45) *... no other opcode may follow any N-byte instruction until the CS[_n] line is driven high. Driving CS[_n] high terminates the instruction and then places the SO pin in a high-impedance state.*

      Currently, all SPI transactions are done this way. Reads/writes of register data are done in the typical "Command - Address - Data" sequence. Reads/writes of SRAM data (i.e. Tx/Rx buffers) are done by sending a specific 8-bit opcode, followed by the data. Thus, in our current code, as long as CS is asserted, data keeps getting shifted out/in while the address keeps incrementing.

    2. If there are contiguous transactions (aka "instructions" on the datasheet) that are fixed-length, CS can be kept asserted.

      > (p.43) *The host controller may issue another SPI instruction or multiple fixed length instructions without deasserting chip select.*

      This actually applies to several SPI transactions we need during init as well as packet Tx/Rx. It means CS can indeed be kept asserted for certain sequences of SPI transfers. The total duration of such SPI commands can be shortened both by the simpler, fixed-length transactions and by the reduced CS deassertion.

    Considering these two types of behaviour, I am going to keep the delays for type-(i) transactions and remove the CS deassertion logic for consecutive type-(ii) transactions (a rough sketch of both types follows at the end of this update). I'll test my implementation repeatedly, but that would only give empirical results.

  2. To answer my original question: enforcing embedded_hal's `DelayUs` is still too coarse, so I am in favour of keeping the delays at the nanosecond scale. The problem is that cortex-m's `asm::delay()`, which is based on instruction cycles, is effectively the only API for achieving nanosecond delays, and it depends on the CPU speed set by the user.

    (Side-note: `asm::delay()` takes 3-4 cycles while `asm::nop()` takes 1, but `NOP` might not necessarily consume time on certain platforms, including STM32F4xx's Cortex-M4 (see [ARM's manual ARM DUI 0553B, section 3.12.8](https://developer.arm.com/documentation/dui0553/latest/)). So we can't be more precise than several CPU cycles, irrespective of the CPU speed.)

    Because this driver is meant to be "universal", I am against hardcoding CS delays this way, as it would be specific to STM32F4xx's max SYSCLK frequency and to Cortex-M cores. I will approach the [rust-embedded community](https://github.com/rust-embedded/wg) to try to bring the lack of nanosecond delay support to their attention.
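
As promised above, a rough sketch of the two CS_n behaviours, written against the embedded-hal 0.2 blocking traits. The opcode constants are placeholders and the function names are invented for illustration; they are not the driver's real API:

```rust
use embedded_hal::blocking::spi::Transfer;
use embedded_hal::digital::v2::OutputPin;

// Placeholder opcodes for illustration only; real values come from the
// ENC424J600 instruction set table.
const OPCODE_NBYTE_READ: u8 = 0x00;
const OPCODE_FIXED_A: u8 = 0x00;
const OPCODE_FIXED_B: u8 = 0x00;

/// (i) N-byte instruction (opcode + address + data): CS_n has to be driven
/// high at the end to terminate the instruction (datasheet p.45).
fn nbyte_read<SPI: Transfer<u8>, CS: OutputPin>(
    spi: &mut SPI,
    cs: &mut CS,
    addr: u8,
    delay_ns: impl Fn(u32),
) -> Result<u8, SPI::Error> {
    cs.set_low().ok();
    let mut buf = [OPCODE_NBYTE_READ, addr, 0x00];
    spi.transfer(&mut buf)?;
    delay_ns(60);       // CS_n hold time after the final SCK pulse
    cs.set_high().ok(); // terminates the instruction; SO goes high-impedance
    delay_ns(60);       // CS_n disable time before the next transaction
    Ok(buf[2])
}

/// (ii) Fixed-length instructions: several may be issued back-to-back
/// without deasserting CS_n in between (datasheet p.43).
fn fixed_length_pair<SPI: Transfer<u8>, CS: OutputPin>(
    spi: &mut SPI,
    cs: &mut CS,
    delay_ns: impl Fn(u32),
) -> Result<(), SPI::Error> {
    cs.set_low().ok();
    spi.transfer(&mut [OPCODE_FIXED_A])?;
    spi.transfer(&mut [OPCODE_FIXED_B])?; // no CS_n toggle in between
    delay_ns(60);
    cs.set_high().ok();
    Ok(())
}
```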

Author
Contributor

Hi, for all who are interested, I've tried my best to properly address this issue in #11. Indeed, despite the empirical findings, this driver should observe the timing requirements as stated by the datasheets. Thus, I kept the nanosecond delays for CS_n, but switched from a user-provided `delay_ns` closure to a Cortex-M instruction-based delay (provided by the cortex-m crate). I also reduced the overall SPI transfer durations by using the fixed-length instructions, which don't require CS_n deassertion at the end of the transfer.

Although #11 only adds the delay for Cortex-M platforms, it can easily be extended to other CPUs with feature gates. Most platforms run at clocks of hundreds of MHz, so the instruction-based delays should be precise enough.
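
For illustration, a minimal sketch of what such a feature gate could look like; the feature name `cortex-m-cpu` and the fallback stub are my assumptions, not necessarily what #11 implements:

```rust
#[cfg(feature = "cortex-m-cpu")]
fn cs_delay(cycles: u32) {
    // On Cortex-M, spin for at least `cycles` instruction cycles.
    cortex_m::asm::delay(cycles);
}

#[cfg(not(feature = "cortex-m-cpu"))]
fn cs_delay(_cycles: u32) {
    // Placeholder: other CPUs would provide their own cycle-accurate spin here,
    // or fall back to a coarser embedded_hal `DelayUs` implementation.
}
```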

I have already tested this new delay implementation with a logic analyser, so the empirical evidence should also be good enough. Therefore, we can close this discussion for now.

harry closed this issue 2021-06-25 11:56:59 +08:00