nal: Fix loops & socket handle management #7

Merged
harry merged 1 commits from fix-nal into master 2024-08-17 17:37:23 +08:00
Contributor

Summary

  1. Fix connection attempt blocking forever until it succeeds
    • NetworkStack::connect():
      • Add timeout for connection attempt
      • Now returns the socket at TCP ESTABLISHED or CLOSED states, or after connection timeout
    • Split NetworkStack::update() into update() (for controlling the clock) and poll() (for polling the smoltcp's EthernetInterface)
      • Additional clock handling in update(): When embedded_time::clock::Clock::try_now() returns a timestamp earlier than the current timestamp time_ms, it takes into consideration where try_now() were non-sensical (e.g. the main application would reset the hardware cycle counter once during runtime as in cortex-m-rtic), and so advance time by the minimal amount (1ms for the smoltcp::EthernetInterface::poll()) instead of returning an error
      • update() also now returns the amount of time advanced
    • Also remove option auto_time_update; the main application is responsible for what values embedded_time::clock::Clock::try_now() should return
      • For example, the main app can return a value from a temporary counter as Clock::try_now() until the hardware cycle counter is ready
  2. Fix buffer reading/writing blocking leading to unhandled behaviour in minimq (issue was only relevant on 83e946544c) Fix read/write not pushing unused sockets (due to network errors) back to the stack
    • Based on quartiq's minimq as of 933687c2e4
    • In minimq applications, a socket is expected to be returned when nal::TcpStack::open() is called
      • MqttClient::read()/write() takes away the TCP socket handle (wrapped as an Option) from its RefCell, and then calls nal::TcpStack::read()/write(); if NAL returns nb::Error, then the MQTT client will propagate and return the error, leaving None behind
        // minimq/src/mqtt_client.rs
        fn read(&self, mut buf: &mut [u8]) -> Result<usize, Error<N::Error>> {
            let mut socket_ref = self.socket.borrow_mut();                          // Borrow self.socket as mutable refcell,
            let mut socket = socket_ref.take().unwrap();                            // and then take it; socket unwraps the Some(...),
                                                                                    // while self.socket now contains None().
            let read = match self.network_stack.read(&mut socket, &mut buf) {       // If NAL's read() returns nb::Error,
                Ok(count) => count,
                Err(nb::Error::WouldBlock) => 0,
                Err(nb::Error::Other(error)) => return Err(Error::Network(error)),  // then MqttClient::read() returns Err.
            };
        
            // Put the socket back into the option.
            socket_ref.replace(socket);                                             // Then this can't be reached;
                                                                                    // self.socket is kept None().
        
            Ok(read)
        }
        
      • Afterwards, when MqttClient::socket_is_connected() gets called (e.g. while polling the interface), it will detect that the socket handle is None, and attempt to call nal::TcpStack::open()
        // minimq/src/mqtt_client.rs
        fn socket_is_connected(&self) -> Result<bool, N::Error> {
            let mut socket_ref = self.socket.borrow_mut();                          // Borrow self.socket as mutable refcell.
        
            if socket_ref.is_none() {                                               // Given self.socket is None(),
                // Attempt to open a socket from the network stack.
                socket_ref.replace(self.network_stack.open(Mode::NonBlocking)?);    // NetworkStack::open() is called
                                                                                    // to return a smoltcp socket handle.
                                                                                    // If open() returns Err, then this
                                                                                    // function also returns the Err;
                                                                                    // Otherwise, self.socket is replaced
                                                                                    // with the socket opened.
            }
        
            let socket = socket_ref.take().unwrap();                                // If self.socket is replaced and isn't
                                                                                    // None(), then the rest can run as expected.
            let result = self.network_stack.is_connected(&socket);
            socket_ref.replace(socket);
        
            result
        }
        
      • Since open() pops a socket handle from the array (unused_handles), we should ensure that any future calls of open() will not pop from an empty array and return NetworkError::NoSocket
      • For example, assuming that a socket handle has been obtained by the client with open():
        • If the main application loops to call MqttClient::poll(), in which socket_is_connected() is called, then as soon as MqttClient::read()/write() returns Err (e.g. due to network disconnection), the next iteration of poll() must call open(), trying to pop from the array
        • As long as MqttClient remains unable to replace the socket reference (e.g. by MqttClient::connect(), MqttClient::poll() would continue to loop and invoke open() until all socket handles have been popped from the array
      • Thus, we should push the socket handle back to the array inside all NAL methods that involves the socket handle, which are read() and write().
  3. Prevent close() from pushing duplicate handles for the same smoltcp::socket::TcpSocket
    • smoltcp::socket::SocketHandle implements Copy, so pushing the handle to usused_handles does not actually take ownership of the handle but copies it
    • If open() and close() calls are imbalanced, then the program might start pushing the same socket multiple times; while Vec::pop() automatically returns Err when the array is empty, Vec::push() does not return Err when there are duplicates and so this should be handled

Explanation

Currently, we primarily wish to use this driver to port the NGFW Booster to ENC424J600. Since quartiq's minimq is used, we want to provide an adapter in the ENC424J600 library for the network stack control required by minimq's MQTT client, such as connection to a broker at the client's instantiation, reconnection during polling, reading and writing to the data buffers. Therefore, we should aim to ensure that any call of the minimq functions would not lead to unexpected behaviour such as potential hanging and runtime errors.

As a proposed fix for hanging due to unsuccessful connection to the broker, the idea of adding a connection timeout is to let the function return at appropriate timing while still allowing some number of repeated attempts before relinquishing control over the main program. I am also allowing connect() to return the socket even if it is at TCP CLOSED state or if it still cannot connect to the broker before the timeout period. This was taking W5500's NAL as reference such that the main program or the MQTT client can try to re-connect in the future, instead of getting hanged by the network stack.

In practice, the user might want to set the connection timeout to 0 such that connect() would not even try to take control over other important routines in the main application. For example, in the RTIC model both the MQTT client and a USB terminal are polling in parallel, and each of these tasks is monitored by the STM32 independent watchdog. This means each task is only allocated with a fixed time interval to return, and risking to loop would be undesirable.

As a proposed fix for panicking due to blocking errors, the idea of replacing the blocking nb::Error with nb::WouldBlock can be seen as a temporary solution customised for minimq applications. WouldBlock simply cannot give the most useful information about "what's wrong?" like Error do.

The proposed fix that enforces pushing the socket handle back to the array is mostly targeted on minimq applications, but I believe this also fits the general case because our read/write functionalities are already reporting network errors, which should signal the main application to restart connection on the socket.

Testing

This as been tested on NGFW Booster firmware with no occurence of any Ethernet-related errors or issues.

## Summary 1. Fix connection attempt blocking forever until it succeeds * `NetworkStack::connect()`: * Add timeout for connection attempt * Now returns the socket at TCP ESTABLISHED or CLOSED states, or after connection timeout * Split `NetworkStack::update()` into `update()` (for controlling the clock) and `poll()` (for polling the smoltcp's `EthernetInterface`) * Additional clock handling in `update()`: When `embedded_time::clock::Clock::try_now()` returns a timestamp **earlier** than the current timestamp `time_ms`, it takes into consideration where `try_now()` were non-sensical (e.g. the main application would reset the hardware cycle counter once during runtime [as in `cortex-m-rtic`](https://docs.rs/cortex-m-rtic/0.5.6/rtic/trait.Monotonic.html#correctness)), and so advance time by the minimal amount (1ms for the `smoltcp::EthernetInterface::poll()`) instead of returning an error * `update()` also now returns the amount of time advanced * Also remove option `auto_time_update`; the main application is responsible for what values `embedded_time::clock::Clock::try_now()` should return * For example, the main app can return a value from a temporary counter as `Clock::try_now()` until the hardware cycle counter is ready 2. ~~Fix buffer reading/writing blocking leading to unhandled behaviour in minimq **(issue was only relevant on https://github.com/quartiq/minimq/commit/83e946544cebd09c9dd07ff1271be639138ec1ce)**~~ Fix read/write not pushing unused sockets (due to network errors) back to the stack * Based on quartiq's minimq as of https://github.com/quartiq/minimq/commit/933687c2e4bc8a4d972de9a4d1508b0b554a8b38 * In minimq applications, a socket is expected to be returned when `nal::TcpStack::open()` is called * `MqttClient::`[`read()`](https://github.com/quartiq/minimq/blob/933687c2e4bc8a4d972de9a4d1508b0b554a8b38/src/mqtt_client.rs#L114-L127)/[`write()`](https://github.com/quartiq/minimq/blob/933687c2e4bc8a4d972de9a4d1508b0b554a8b38/src/mqtt_client.rs#L129-L142) takes away the TCP socket handle (wrapped as an `Option`) from its `RefCell`, and then calls `nal::TcpStack::read()`/`write()`; if NAL returns `nb::Error`, then the MQTT client will propagate and return the error, leaving `None` behind ```rs // minimq/src/mqtt_client.rs fn read(&self, mut buf: &mut [u8]) -> Result<usize, Error<N::Error>> { let mut socket_ref = self.socket.borrow_mut(); // Borrow self.socket as mutable refcell, let mut socket = socket_ref.take().unwrap(); // and then take it; socket unwraps the Some(...), // while self.socket now contains None(). let read = match self.network_stack.read(&mut socket, &mut buf) { // If NAL's read() returns nb::Error, Ok(count) => count, Err(nb::Error::WouldBlock) => 0, Err(nb::Error::Other(error)) => return Err(Error::Network(error)), // then MqttClient::read() returns Err. }; // Put the socket back into the option. socket_ref.replace(socket); // Then this can't be reached; // self.socket is kept None(). Ok(read) } ``` * Afterwards, when `MqttClient::`[`socket_is_connected()`](https://github.com/quartiq/minimq/blob/933687c2e4bc8a4d972de9a4d1508b0b554a8b38/src/mqtt_client.rs#L239-L252) gets called (e.g. while polling the interface), it will detect that the socket handle is `None`, and attempt to call `nal::TcpStack::open()` ```rs // minimq/src/mqtt_client.rs fn socket_is_connected(&self) -> Result<bool, N::Error> { let mut socket_ref = self.socket.borrow_mut(); // Borrow self.socket as mutable refcell. if socket_ref.is_none() { // Given self.socket is None(), // Attempt to open a socket from the network stack. socket_ref.replace(self.network_stack.open(Mode::NonBlocking)?); // NetworkStack::open() is called // to return a smoltcp socket handle. // If open() returns Err, then this // function also returns the Err; // Otherwise, self.socket is replaced // with the socket opened. } let socket = socket_ref.take().unwrap(); // If self.socket is replaced and isn't // None(), then the rest can run as expected. let result = self.network_stack.is_connected(&socket); socket_ref.replace(socket); result } ``` * Since `open()` pops a socket handle from the array (`unused_handles`), we should ensure that any future calls of `open()` will not pop from an empty array and return `NetworkError::NoSocket` * For example, assuming that a socket handle has been obtained by the client with `open()`: * If the main application loops to call `MqttClient::poll()`, in which `socket_is_connected()` is called, then as soon as `MqttClient::read()`/`write()` returns Err (e.g. due to network disconnection), the next iteration of `poll()` must call `open()`, trying to pop from the array * As long as `MqttClient` remains unable to replace the socket reference (e.g. by `MqttClient::connect()`, `MqttClient::poll()` would continue to loop and invoke `open()` until all socket handles have been popped from the array * Thus, we should push the socket handle back to the array inside all NAL methods that involves the socket handle, which are `read()` and `write()`. 3. Prevent `close()` from pushing duplicate handles for the same `smoltcp::socket::TcpSocket` * `smoltcp::socket::SocketHandle` implements `Copy`, so pushing the handle to `usused_handles` does not actually take ownership of the handle but copies it * If `open()` and `close()` calls are imbalanced, then the program might start pushing the same socket multiple times; while `Vec::pop()` automatically returns Err when the array is empty, `Vec::push()` does not return Err when there are duplicates and so this should be handled ## Explanation Currently, we primarily wish to use this driver to port the NGFW Booster to ENC424J600. Since [quartiq's minimq](https://github.com/quartiq/minimq) is used, we want to provide an adapter in the ENC424J600 library for the network stack control required by minimq's MQTT client, such as connection to a broker at the client's instantiation, reconnection during polling, reading and writing to the data buffers. Therefore, we should aim to ensure that any call of the minimq functions would not lead to unexpected behaviour such as potential hanging and runtime errors. As a proposed fix for hanging due to unsuccessful connection to the broker, the idea of adding a connection timeout is to let the function return at appropriate timing while still allowing some number of repeated attempts before relinquishing control over the main program. I am also allowing `connect()` to return the socket even if it is at TCP CLOSED state or if it still cannot connect to the broker before the timeout period. This was taking W5500's NAL as reference such that the main program or the MQTT client can try to re-connect in the future, instead of getting hanged by the network stack. In practice, the user might want to set the connection timeout to 0 such that `connect()` would not even try to take control over other important routines in the main application. For example, in the RTIC model both the MQTT client and a USB terminal are polling in parallel, and each of these tasks is monitored by the STM32 independent watchdog. This means each task is only allocated with a fixed time interval to return, and risking to loop would be undesirable. ~~As a proposed fix for panicking due to blocking errors, the idea of replacing the blocking `nb::Error` with `nb::WouldBlock` can be seen as a temporary solution customised for minimq applications. `WouldBlock` simply cannot give the most useful information about **"what's wrong?"** like `Error` do.~~ The proposed fix that enforces pushing the socket handle back to the array is mostly targeted on minimq applications, but I believe this also fits the general case because our read/write functionalities are already reporting network errors, which should signal the main application to restart connection on the socket. ## Testing This as been tested on NGFW Booster firmware with no occurence of any Ethernet-related errors or issues.
harry added 1 commit 2021-03-05 17:34:32 +08:00
d9e50bbcb6 nal: Prevent looping until the stack successfully connects to remote
* `NetworkStack::connect()`:
  * Add timeout for connection attempt
  * Now returns the socket at TCP ESTABLISHED or CLOSED states, or after connection timeout
* Split `NetworkStack::update()` into `update()` (for controlling the clock) and `poll()` (for polling the smoltcp EthernetInterface)
* Also remove option `auto_time_update`; the main application is responsible for what values `embedded_time::clock::Clock::try_now()` should return
sb10q requested review from astro 2021-03-05 17:44:40 +08:00
Author
Contributor

More on nb::Error::WouldBlock vs nb::Error::Other: I just found there's an active PR on minimq which would eliminate the issue of panicking trying to unwrap() a None, such that we can keep @occheung's original code to return nb::Error::Other(NetworkError::ReadFailure) and nb::Error::Other(NetworkError::WriteFailure). The portion of minimq's code concerned is through this link. I'm making a follow-up discussion with QUARTIQ's people over the comment threads of the said PR.

More on `nb::Error::WouldBlock` vs `nb::Error::Other`: I just found there's an [active PR on minimq](https://github.com/quartiq/minimq/pull/35) which would eliminate the issue of panicking trying to `unwrap()` a `None`, such that we can keep @occheung's original code to return `nb::Error::Other(NetworkError::ReadFailure)` and `nb::Error::Other(NetworkError::WriteFailure)`. The portion of minimq's code concerned is through this [link](https://github.com/quartiq/minimq/blob/e686ccd89090f32990661fe49416d570e2e80835/src/mqtt_client.rs#L242-L245). ~~I'm making a follow-up discussion with QUARTIQ's people over the comment threads of the said PR.~~
Author
Contributor

In the minimq's PR I just mentioned, it also seems to prevent MqttClient's constructor from calling NAL's connect(), which would keep looping until the connection is established. However, the PR doesn't change the way to handle failed re-connection attempts during the client's polling period, which means that my proposal of adding a time-out and/or returning a TCP CLOSED socket would still be helpful and might still be needed.

In the minimq's PR I just mentioned, it also seems to prevent `MqttClient's` constructor from calling NAL's `connect()`, which would keep looping until the connection is established. However, the PR doesn't change the way to handle failed re-connection attempts during the client's polling period, which means that my proposal of adding a time-out and/or returning a TCP CLOSED socket would still be helpful and might still be needed.
harry force-pushed fix-nal from 29c8078ef9 to b372c37ae8 2021-03-11 17:15:44 +08:00 Compare
harry force-pushed fix-nal from b372c37ae8 to 2d3d4004e8 2021-03-11 17:19:18 +08:00 Compare
harry force-pushed fix-nal from 2d3d4004e8 to e7e08948d1 2021-03-11 17:22:46 +08:00 Compare
harry force-pushed fix-nal from e7e08948d1 to 99899e6657 2021-03-11 17:32:48 +08:00 Compare
astro approved these changes 2021-03-12 02:05:24 +08:00
astro left a comment
First-time contributor

I found a few small nitpicks. I do not have experience with minimq/rtic.

I found a few small nitpicks. I do not have experience with minimq/rtic.
@ -53,0 +54,4 @@
interface: NetworkInterface<SPI, NSS>,
sockets: net::socket::SocketSet<'a>,
clock: IntClock,
connection_timeout_ms: u32,
First-time contributor

Can you try to use an embedded_time::duration type wherever possible?

Can you try to use an `embedded_time::duration` type wherever possible?
Author
Contributor

I defined the connection timeout parameter as a u32 because it is used to compare with the network stack struct's time_ms "time now", which is also stored as u32. If I'm following @occheung 's logic correctly, we want time_ms to stay as u32 rather than a embedded_time::duration::Milliseconds because every time the Ethernet interface is busy polling, it has to borrow the "time now" value and conversion of the timestamp among u32, the embedded_time duration type and the smoltcp duration type might consume more CPU cycles. Plus, I made update() return a u32 representing the number of microseconds advanced, and so counting the total duration would be simpler.

I defined the connection timeout parameter as a `u32` because it is used to compare with the network stack struct's `time_ms` "time now", which is also stored as `u32`. If I'm following @occheung 's logic correctly, we want `time_ms` to stay as `u32` rather than a `embedded_time::duration::Milliseconds` because every time the Ethernet interface is busy polling, it has to borrow the "time now" value and conversion of the timestamp among `u32`, the `embedded_time` duration type and the `smoltcp` duration type might consume more CPU cycles. Plus, I made `update()` return a `u32` representing the number of microseconds advanced, and so counting the total duration would be simpler.
astro marked this conversation as resolved
src/nal.rs Outdated
@ -149,0 +171,4 @@
{
let mut sockets = self.sockets.borrow_mut();
let internal_socket: &mut net::socket::TcpSocket = &mut *sockets.get(socket);
First-time contributor

This isn't "internal" as in "some inner struct". I think it is more obvious to name this to what is behind this SocketRef::T, a tcp_socket.

This isn't "internal" as in "some inner struct". I think it is more obvious to name this to what is behind this `SocketRef::T`, a `tcp_socket`.
Author
Contributor

Perhaps @occheung meant that the internal_socket is the socket stored "internally" in the smoltcp::socket::SocketSet within the NetworkStack, which returns the socket needed using the socket parameter passed to the method, which is actually a smoltcp::socket::SocketHandle that was previously popped from the array of unused handles, done "externally" by calling open()?

Anyway, since Rust doesn't do matching of names of method parameters between the trait and the implementation, in 66c3aa534f I renamed the socket to handle and internal_socket to socket for clarity.

Perhaps @occheung meant that the `internal_socket` is the socket stored "internally" in the `smoltcp::socket::SocketSet` within the `NetworkStack`, which returns the socket needed using the `socket` parameter passed to the method, which is actually a `smoltcp::socket::SocketHandle` that was previously popped from the array of unused handles, done "externally" by calling `open()`? Anyway, since Rust doesn't do matching of names of method parameters between the trait and the implementation, in 66c3aa534f4b4a7628dae70700017164a9c5fc1d I renamed the `socket` to `handle` and `internal_socket` to `socket` for clarity.
@ -187,0 +220,4 @@
// "transient", so this function should keep waiting and let smoltcp poll
// (e.g. for handling echo reqeust/reply) at the same time.
timeout_ms += self.update()?;
self.poll()?;
First-time contributor

While all the other steps are very nicely documented, this fairly important self.poll() disappears to my eyes. How about a pair of newlines around it?

While all the other steps are very nicely documented, this fairly important `self.poll()` disappears to my eyes. How about a pair of newlines around it?
Author
Contributor

Right. Before, poll() was simply part of update(), and I decided to separate them simply for the sake of clarity.

What do you think: is it good to keep them separated?

Right. Before, `poll()` was simply part of `update()`, and I decided to separate them simply for the sake of clarity. What do you think: is it good to keep them separated?
First-time contributor

Smaller chunks with separated logic are always better.

Smaller chunks with separated logic are always better.
Author
Contributor

In 232a08f110, I added a pair of newlines, one before the comment line (that explains polling), and one after self.poll(). Would that be good?

In 232a08f11012fd749b68882140a8b9dc1a653363, I added a pair of newlines, one before the comment line (that explains polling), and one after `self.poll()`. Would that be good?
harry added 1 commit 2021-03-12 11:44:35 +08:00
harry added 1 commit 2021-03-12 12:36:38 +08:00
Author
Contributor

@astro Thanks a lot for reviewing this. I didn't update the PR description early enough before you pushed your review, and I just updated it to show some corrections.

The main thing I've changed is related to the minimq PR being merged to master recently, and thus has avoided the issue of not returning proper Err() because of the WouldBlock issue.

And with respect to this minimq commit, I've proposed closing the socket when read()/write() returns error, since the MQTT client would now simply try to open() a new socket, without knowing that such a socket needs to be taken from a list of unusued socket handles in the smoltcp implementation of the TCP stack. And because of this, I've also proposed a way to avoid adding duplicate socket handles when calling close(). These changed are reflected on points 2 and 3 of the summary.

@astro Thanks a lot for reviewing this. I didn't update the PR description early enough before you pushed your review, and I just updated it to show some corrections. The main thing I've changed is related to the [minimq PR]( https://github.com/quartiq/minimq/pull/35) being merged to master recently, and thus has avoided the issue of not returning proper `Err()` because of the `WouldBlock` issue. And with respect to this minimq commit, I've proposed closing the socket when `read()`/`write()` returns error, since the MQTT client would now simply try to `open()` a new socket, without knowing that such a socket needs to be taken from a list of unusued socket handles in the `smoltcp` implementation of the TCP stack. And because of this, I've also proposed a way to avoid adding duplicate socket handles when calling `close()`. These changed are reflected on points 2 and 3 of the summary.
harry added 1 commit 2021-03-16 10:25:27 +08:00
2eadb652ff nal: Fix infinite loop when TX buffer is full
* For example, if the PHY linkup is down, instead of looping until resumption of the linkup, a write operation now closes the socket for re-connection in the future
harry changed title from nal: Fix blocking behaviours to nal: Fix loops & socket handle management 2021-03-16 10:32:37 +08:00
Author
Contributor

Just added 2eadb652ff to deal with full TX buffer. We could check smoltcp::socket::TcpSocket::can_send() before doing send_slice(), but by just checking the number of bytes enqueued looks more straightforward than can_send(), which checks the state of the TX end on top of checking the buffer fullness.

Testing with the Booster NGFW shows that it returns the WriteFailure error whenever the TCP connection has been closed, or the PHY linkup on the controller is down.

Just added 2eadb652ff1d8e043736006aa74170dbb6d67f15 to deal with full TX buffer. We could check `smoltcp::socket::TcpSocket::can_send()` before doing `send_slice()`, but by just checking the number of bytes enqueued looks more straightforward than `can_send()`, which checks the state of the TX end on top of checking the buffer fullness. Testing with the Booster NGFW shows that it returns the `WriteFailure` error whenever the TCP connection has been closed, or the PHY linkup on the controller is down.
astro reviewed 2021-03-17 05:56:12 +08:00
src/nal.rs Outdated
@ -163,4 +194,3 @@
}
}
};
// Blocking connect
First-time contributor

Please retain the wording blocking as it is exactly what this is.

Please retain the wording **blocking** as it is exactly what this is.
Author
Contributor

Resolved in 232a08f110 by copying this original line of comment verbatim.

Resolved in 232a08f11012fd749b68882140a8b9dc1a653363 by copying this original line of comment verbatim.
harry force-pushed fix-nal from ca2fea4ed4 to 232a08f110 2021-03-17 10:21:26 +08:00 Compare
harry merged commit 232a08f110 into master 2021-04-12 09:28:56 +08:00
Sign in to join this conversation.
No reviewers
No Label
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: renet/ENC424J600#7
No description provided.