Resync TLS sends the complete file to the destination when a few characters are added at the end of the file #1028

Closed
dheerajjoshim opened this issue Dec 15, 2023 · 9 comments
Labels: bug (Something isn't working)

@dheerajjoshim commented Dec 15, 2023

We use VolSync to send files from the primary cluster to the secondary cluster. One of our volumes contains a single large file, roughly 50 GiB or more in size.

We were testing how such a large file is handled by VolSync. We initially generated the file, and the entire file was replicated to the remote site. We then appended a few characters to the file, and VolSync again replicated the entire file to the remote site rather than sending only the modified chunks.

This can be reproduced within a single cluster. Use two namespaces, nsa and nsb, to simulate source and destination, where nsa is the source and nsb is the destination.

Create a 100Gi PVC in each namespace:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-mount-nsa
  namespace: nsa
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: thin
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-mount-nsb
  namespace: nsb
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: thin
  volumeMode: Filesystem

Create a pod in the source namespace and attach it to the PVC:

apiVersion: v1
kind: Pod
metadata:
  name: pvc-inspector
  namespace: nsa
spec:
  containers:
  - image: busybox
    name: pvc-inspector
    command: ["tail"]
    args: ["-f", "/dev/null"]
    volumeMounts:
    - mountPath: /pvc
      name: pvc-mount-nsa
  volumes:
  - name: pvc-mount-nsa
    persistentVolumeClaim:
      claimName: pvc-mount-nsa

Log in to the pod in the source namespace and generate a large file (a kubectl exec sketch follows the commands):

cd /pvc
dd if=/dev/urandom of=sample.txt bs=64M count=512 iflag=fullblock
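
For reference, a shell in the inspector pod can be opened with kubectl exec before running the commands above (a minimal sketch, assuming the pvc-inspector pod defined earlier):

# Open an interactive shell in the inspector pod
kubectl -n nsa exec -it pvc-inspector -- sh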

Create the rsync-TLS pre-shared key secret in both the source and destination namespaces (a sketch for generating such a key follows the manifests):

apiVersion: v1
kind: Secret
metadata:
  name: tls-key-secret
  namespace: nsa
data:
  psk.txt: MToyM2I3Mzk1ZmFmYzNlODQyYmQ4YWMwZmUxNDJlNmFkMQ==
type: Opaque
---
apiVersion: v1
kind: Secret
metadata:
  name: tls-key-secret
  namespace: nsb
data:
  psk.txt: MToyM2I3Mzk1ZmFmYzNlODQyYmQ4YWMwZmUxNDJlNmFkMQ==
type: Opaque
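
For reference, the decoded psk.txt value above has the form 1:<hex digits>. One way to generate a key of that form and create the secrets (a sketch; check the VolSync rsync-TLS docs for the exact key requirements):

# Generate a pre-shared key of the form "<id>:<hex digits>"
echo -n "1:$(openssl rand -hex 32)" > psk.txt

# Create the same secret in both namespaces
kubectl -n nsa create secret generic tls-key-secret --from-file=psk.txt=psk.txt
kubectl -n nsb create secret generic tls-key-secret --from-file=psk.txt=psk.txt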

Create the ReplicationSource and ReplicationDestination in the source and destination namespaces:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: pvc-destination
  namespace: nsb
spec:
  rsyncTLS:
    accessModes:
      - ReadWriteOnce
    capacity: 100Gi
    copyMethod: Direct
    destinationPVC: pvc-mount-nsb
    serviceType: ClusterIP
    storageClassName: thin
    keySecret: tls-key-secret
---
kind: ReplicationSource
apiVersion: volsync.backube/v1alpha1
metadata:
  name: pvc-source
  namespace: nsa
spec:
  rsyncTLS:
    address: volsync-rsync-tls-dst-pvc-destination.nsb.svc.cluster.local
    copyMethod: Direct
    keySecret: tls-key-secret
  sourcePVC: pvc-mount-nsa
  trigger: {}

Monitor the source mover pod and check that the file is completely transferred to the destination PVC.
Attachment: source mover 1st run.log
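
A rough sketch of watching the transfer from the CLI; the job name below is an assumption based on the service naming seen in this issue (volsync-rsync-tls-src-<ReplicationSource name>), and .status.lastSyncTime comes from the ReplicationSource status:

# Follow the source mover logs (job/pod name assumed from the naming convention above)
kubectl -n nsa get pods | grep volsync-rsync-tls-src
kubectl -n nsa logs -f job/volsync-rsync-tls-src-pvc-source

# Check when the source last completed a sync
kubectl -n nsa get replicationsource pvc-source -o jsonpath='{.status.lastSyncTime}{"\n"}'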

Afterwards, log in to the BusyBox pod and append a few characters to the end of the generated file:

cd /pvc
echo 'text' >> sample.txt

Monitor the source and destination mover pods and observe that the entire file is transferred again.
Attachments: source mover 2nd run.log, destination mover 2nd run.log

Destination copy
================
drwxrwsr-x. 3 root 1000770000        4096 Dec 15 10:26 .
dr-xr-xr-x. 1 root root                41 Dec 15 10:20 ..
-rw-------. 1 root 1000770000  2881911808 Dec 15 10:26 .sample.txt.rijUOG
drwxrws---. 2 root 1000770000       16384 Dec 15 06:48 lost+found
-rw-rw-rw-. 1 root 1000740000 27850178560 Dec 15 10:10 sample.txt
sh-5.1# 
sh-5.1# 
sh-5.1# ls -la
total 32495028
drwxrwsr-x. 3 root 1000770000        4096 Dec 15 10:26 .
dr-xr-xr-x. 1 root root                41 Dec 15 10:20 ..
-rw-------. 1 root 1000770000  5424701440 Dec 15 10:27 .sample.txt.rijUOG
drwxrws---. 2 root 1000770000       16384 Dec 15 06:48 lost+found
-rw-rw-rw-. 1 root 1000740000 27850178560 Dec 15 10:10 sample.txt
sh-5.1# 
sh-5.1# ls -la
total 32851700
drwxrwsr-x. 3 root 1000770000        4096 Dec 15 10:26 .
dr-xr-xr-x. 1 root root                41 Dec 15 10:20 ..
-rw-------. 1 root 1000770000  5789931520 Dec 15 10:27 .sample.txt.rijUOG
drwxrws---. 2 root 1000770000       16384 Dec 15 06:48 lost+found
-rw-rw-rw-. 1 root 1000740000 27850178560 Dec 15 10:10 sample.txt

The expectation was that only the last chunk would be copied to the destination PVC. Is there a way to enable more verbose logging for the rsync command itself that might indicate why the entire file gets copied to the destination even though only a few characters were added at the end of the file?

dheerajjoshim added the bug label on Dec 15, 2023
@tesshuflower (Contributor)

@dheerajjoshim I'm wondering if something else is actually going wrong here.

In the 2nd source mover log I can see some errors like this:

2023.12.15 10:20:56 LOG7[1]: Service [rsync] finished (0 left)
rsync: [sender] safe_read failed to read 1 bytes: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.2.3]
Syncronization failed. Retrying in 2 seconds. Retry 1/5.

Eventually it does complete:

<f.st...... sample.txt

Number of files: 1 (reg: 1)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 1
Total file size: 27.85G bytes
Total transferred file size: 27.85G bytes
Literal data: 5 bytes
Matched data: 27.85G bytes
File list size: 0
File list generation time: 0.003 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 131
Total bytes received: 1.70M

sent 131 bytes  received 1.70M bytes  2.18K bytes/sec
total size is 27.85G  speedup is 16,381.44

I would interpret this as: it actually only needed to send 131 bytes, but it did need to receive 1.7M from the destination in order to do the comparison. However, as you've noticed, it does take a long time, so perhaps something is happening while rsync is computing checksums on the destination file to figure out what to send.

Do you see anything happening to the destination mover pods while this is going on? I'm wondering if they were getting killed or throttled somehow. We only seem to have destination logs starting at 2023.12.15 10:20:57, which makes me wonder if a previous pod was killed.

@dheerajjoshim (Author)

Hi. Yes. After the first successful sync (the complete file transfer), the destination mover pod exited successfully and was marked Completed. Then a new pod was created.

@dheerajjoshim (Author) commented Dec 15, 2023

I also observed the 1/5, 2/5 retry messages in the source pod while the destination mover pods were being terminated and restarted.

If the destination mover pod comes back up before the source pod exhausts its retry count, the logs look like this:

rsync: [sender] safe_read failed to read 1 bytes: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.2.3]
Syncronization failed. Retrying in 8 seconds. Retry 3/5.
2023.12.15 16:53:38 LOG7[main]: Found 1 ready file descriptor(s)
2023.12.15 16:53:38 LOG7[main]: FD=4 events=0x2001 revents=0x0
2023.12.15 16:53:38 LOG7[main]: FD=8 events=0x2001 revents=0x1
2023.12.15 16:53:38 LOG7[main]: Service [rsync] accepted (FD=3) from 127.0.0.1:38138
2023.12.15 16:53:38 LOG7[6]: Service [rsync] started
2023.12.15 16:53:38 LOG7[6]: Setting local socket options (FD=3)
2023.12.15 16:53:38 LOG7[6]: Option TCP_NODELAY set on local socket
2023.12.15 16:53:38 LOG5[6]: Service [rsync] accepted connection from 127.0.0.1:38138
2023.12.15 16:53:38 LOG6[6]: s_connect: connecting 172.30.61.209:8000
2023.12.15 16:53:38 LOG7[6]: s_connect: s_poll_wait 172.30.61.209:8000: waiting 10 seconds
2023.12.15 16:53:38 LOG7[6]: FD=6 events=0x2001 revents=0x0
2023.12.15 16:53:38 LOG7[6]: FD=10 events=0x2005 revents=0x0
2023.12.15 16:53:38 LOG5[6]: s_connect: connected 172.30.61.209:8000
2023.12.15 16:53:38 LOG5[6]: Service [rsync] connected remote server from 10.131.0.13:46196
2023.12.15 16:53:38 LOG7[6]: Setting remote socket options (FD=10)
2023.12.15 16:53:38 LOG7[6]: Option TCP_NODELAY set on remote socket
2023.12.15 16:53:38 LOG7[6]: Remote descriptor (FD=10) initialized
2023.12.15 16:53:38 LOG6[6]: SNI: sending servername: volsync-rsync-tls-dst-pvc-destination.nsb.svc.cluster.local
2023.12.15 16:53:38 LOG6[6]: Peer certificate not required
2023.12.15 16:53:38 LOG7[6]: TLS state (connect): before SSL initialization
2023.12.15 16:53:38 LOG7[6]: Initializing application specific data for session authenticated
2023.12.15 16:53:38 LOG6[6]: PSK client configured for identity "1"
2023.12.15 16:53:38 LOG7[6]: Initializing application specific data for session authenticated
2023.12.15 16:53:38 LOG7[6]: TLS state (connect): SSLv3/TLS write client hello
2023.12.15 16:53:38 LOG7[6]: TLS state (connect): SSLv3/TLS write client hello
2023.12.15 16:53:38 LOG7[6]: Deallocating application specific data for session connect address
2023.12.15 16:53:38 LOG7[6]: Initializing application specific data for session authenticated
2023.12.15 16:53:38 LOG7[6]: Deallocating application specific data for session connect address
2023.12.15 16:53:38 LOG7[6]: TLS state (connect): SSLv3/TLS read server hello
2023.12.15 16:53:38 LOG7[6]: TLS state (connect): TLSv1.3 read encrypted extensions
2023.12.15 16:53:38 LOG7[6]: TLS state (connect): SSLv3/TLS read finished
2023.12.15 16:53:38 LOG7[6]: TLS state (connect): SSLv3/TLS write change cipher spec
2023.12.15 16:53:38 LOG7[6]: TLS state (connect): SSLv3/TLS write finished
2023.12.15 16:53:38 LOG7[6]:      1 client connect(s) requested
2023.12.15 16:53:38 LOG7[6]:      1 client connect(s) succeeded
2023.12.15 16:53:38 LOG7[6]:      0 client renegotiation(s) requested
2023.12.15 16:53:38 LOG7[6]:      1 session reuse(s)
2023.12.15 16:53:38 LOG6[6]: TLS connected: previous session reused
2023.12.15 16:53:38 LOG6[6]: TLSv1.3 ciphersuite: TLS_AES_128_GCM_SHA256 (128-bit encryption)
2023.12.15 16:53:38 LOG6[6]: Peer temporary key: X25519, 253 bits
2023.12.15 16:53:38 LOG7[6]: Compression: null, expansion: null
2023.12.15 16:53:38 LOG6[6]: Session id: 
2023.12.15 16:53:38 LOG7[6]: TLS state (connect): SSL negotiation finished successfully
2023.12.15 16:53:38 LOG7[6]: TLS state (connect): SSL negotiation finished successfully
2023.12.15 16:53:38 LOG7[6]: Initializing application specific data for session authenticated
2023.12.15 16:53:38 LOG7[6]: Deallocating application specific data for session connect address
2023.12.15 16:53:38 LOG7[6]: New session callback
2023.12.15 16:53:38 LOG6[6]: No peer certificate received
2023.12.15 16:53:38 LOG6[6]: Session id: 8401831BFAF4030AAA3BF2A578DCAD104376695FE082DC13ADDDBE5684A08668
2023.12.15 16:53:38 LOG7[6]: TLS state (connect): SSLv3/TLS read server session ticket

Number of files: 1 (reg: 1)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 0
Total file size: 27.85G bytes
Total transferred file size: 0 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.004 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 63
Total bytes received: 12

sent 63 bytes  received 12 bytes  150.00 bytes/sec
total size is 27.85G  speedup is 371,335,714.27

Attachment: Mover pod idle state success.log

Otherwise, after 5 failed retries, the mover pod log looks like this:

rsync: [sender] safe_read failed to read 1 bytes: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.2.3]
Syncronization failed. Retrying in 32 seconds. Retry 5/5.
rsync completed in 62s
Synchronization failed. rsync returned: 1212
2023.12.15 16:54:52 LOG7[main]: Found 1 ready file descriptor(s)
2023.12.15 16:54:52 LOG7[main]: FD=4 events=0x2001 revents=0x1
2023.12.15 16:54:52 LOG7[main]: FD=8 events=0x2001 revents=0x0
2023.12.15 16:54:52 LOG7[main]: Dispatching a signal from the signal pipe
2023.12.15 16:54:52 LOG7[main]: Processing SIGNAL_TERMINATE
2023.12.15 16:54:52 LOG5[main]: Terminated
2023.12.15 16:54:52 LOG7[main]: Leak detection table utilization: 13/997, 1.30%
2023.12.15 16:54:52 LOG7[main]: Removed pid file /tmp/stunnel-client.pid
2023.12.15 16:54:52 LOG7[main]: Terminating the cron thread
2023.12.15 16:54:52 LOG6[main]: Terminating 1 service thread(s)
2023.12.15 16:54:52 LOG6[main]: Service threads terminated

Attachment: Mover pod idle state failure.log

@tesshuflower (Contributor)

I'm not sure I understand. If you're starting another replication while the previous destination is still shutting down, then I think we'd expect to see connection errors.

Is there a reason you're starting the source replications so quickly? Normally I'd expect you to use a trigger, either a schedule at specific times/dates/intervals (for example, every 5 minutes) or a manual trigger to sync at a time you choose (see the sketch after this comment). Either way, we do not expect the full file to be re-sent. The latest logs don't really show your original issue anymore, as they don't appear to show anything being sent or updated.

To separate the issues, maybe you can reproduce by scheduling the replication source to run while the destination is already running, and capture the logs.
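
For illustration, a minimal sketch of the two trigger styles mentioned above, applied to the ReplicationSource from this issue (the trigger fields follow the VolSync ReplicationSource API; adjust names and schedules as needed):

# Schedule-based trigger: sync every 5 minutes
kubectl -n nsa patch replicationsource pvc-source --type merge \
  -p '{"spec":{"trigger":{"schedule":"*/5 * * * *"}}}'

# Manual trigger: change the string whenever a sync should run, then wait for
# .status.lastManualSync to report the same value before triggering again
kubectl -n nsa patch replicationsource pvc-source --type merge \
  -p '{"spec":{"trigger":{"manual":"sync-1"}}}'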

@dheerajjoshim (Author)

Yes, the last error logs were just to show the errors we sometimes see in the system. Anyway, I can set up a manual trigger, test, and provide the logs.

@dheerajjoshim (Author)

It looks like rsync is not copying the entire file, but rather just the delta, as expected.

I tried rsync on my local machine between two directories (with --no-W to prevent whole-file copies, which rsync defaults to for local transfers):
/usr/local/bin/rsync -a -h -H -x -S -z -vvvvvvvvvvvvvvvvvvvvvvvvvvv --no-W --exclude=lost+found --itemize-changes /path/to/src /path/to/dst
I saw the same temporary file as in the VolSync case. After a closer look at the man pages and other guides, I understood that without the --inplace option rsync creates a temporary copy, merges the changes into it, and then renames it over the destination file.
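
For anyone reproducing this locally, a minimal sketch (arbitrary paths and sizes) that shows the delta transfer, the hidden temporary file, and the --inplace alternative:

mkdir -p /tmp/src /tmp/dst
dd if=/dev/urandom of=/tmp/src/sample.txt bs=1M count=1024

# Initial run copies the whole file
rsync -a --no-W --stats /tmp/src/ /tmp/dst/

# Append a few bytes; the second run sends only a small delta, but rsync still
# rebuilds the destination via a hidden temp file (.sample.txt.XXXXXX)
echo 'text' >> /tmp/src/sample.txt
rsync -a --no-W --stats /tmp/src/ /tmp/dst/

# With --inplace, the destination file is updated directly and no temp copy is made
rsync -a --no-W --inplace --stats /tmp/src/ /tmp/dst/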

Armed with this information, I redid the whole test. During the first replication the entire file was copied, as expected. This was also evident from the metrics collected from the pods.

Source receive and transmit bandwidth: [screenshot: SourcePodTransmitReceiveBW]

Destination receive and transmit bandwidth: [screenshot: DestinationPodTransmitReceiveBW]

I then appended a few characters to the source file, and the network metrics showed no large data transfer between the source and destination pods. Only a few bytes were transferred, corresponding to the file list and other metadata exchanged between the source and the destination.

Source receive and transmit bandwidth: [screenshots: SourcePodReceiveBW, SourcePodTransmitBW]

Destination receive and transmit bandwidth: [screenshots: DestinationPodReceiveBW, DestinationPodTransmitBW]

@tesshuflower (Contributor)

Sounds good. @dheerajjoshim, do you have any outstanding concerns, or should I close this issue?

@dheerajjoshim (Author) commented Dec 18, 2023

A related follow-up: an improvement idea to allow in-place updates to the file by passing rsync the --inplace CLI option.

Would it be possible to add in-place update support, or would you accept a PR for it?
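
For context, the request would roughly amount to passing --inplace in the mover's rsync invocation. A hypothetical sketch only (not the actual VolSync mover command), with the usual caveats: --inplace leaves a partially updated destination file if a transfer is interrupted, and it interacts with sparse-file handling on older rsync versions:

# Hypothetical example only, not the actual mover script
rsync -a -H -x --no-W --inplace --stats --exclude=lost+found --itemize-changes /path/to/src/ /path/to/dst/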

@tesshuflower (Contributor)

Thanks @dheerajjoshim, I've created an enhancement issue here: #1044

Closing this bug as I think it's working as designed.
