Add lima provider for macOS #1536
Merged
Conversation
nirs commented on Sep 2, 2024
nirs force-pushed the lima branch 12 times, most recently from 2f27e17 to dd490b8 on September 8, 2024 at 22:13
nirs force-pushed the lima branch 5 times, most recently from d3c076a to c14094f on September 10, 2024 at 01:00
nirs commented on Sep 10, 2024
nirs force-pushed the lima branch 5 times, most recently from 9e2d2c2 to 9b57c91 on September 10, 2024 at 20:27
Similar to minikube, starting a stopped cluster is more flaky. Even when k8s reports that everything is ready, some components are not ready, and running the start hooks can fail randomly. Example failure:

    Error from server (InternalError): Internal error occurred: failed calling webhook
    "managedclustermutators.admission.cluster.open-cluster-management.io": failed to call webhook:
    Post "https://cluster-manager-registration-webhook.open-cluster-management-hub.svc:9443/mutate-cluster-open-cluster-management-io-v1-managedcluster?timeout=10s":
    dial tcp 10.110.203.24:9443: connect: no route to host

Try to avoid this by adding a short delay after starting a stopped cluster, before we start to run the hooks. This change affects only developers that stop the environment and start it again.

In minikube we added the delay in configure(), but for lima it is better done in start(), since there we can tell if this is a start of a stopped cluster.

Signed-off-by: Nir Soffer <[email protected]>
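The change is described above only in prose; this is a minimal sketch of where such a delay could live, assuming drenv's commands module returns the child's output as a string. START_DELAY, _is_stopped() and the delay value are illustrative, not taken from the patch.

```python
import json
import time

from drenv import commands

START_DELAY = 30  # seconds, assumed value


def _is_stopped(name):
    # Assumed helper: `limactl list --json` prints one JSON object per VM.
    for line in commands.run("limactl", "list", "--json").splitlines():
        vm = json.loads(line)
        if vm.get("name") == name:
            return vm.get("status") == "Stopped"
    return False


def start(profile):
    was_stopped = _is_stopped(profile["name"])
    commands.run("limactl", "--log-format=json", "start", profile["name"])
    if was_stopped:
        # k8s may report everything Ready before all components accept
        # requests, so wait a bit before running the start hooks.
        time.sleep(START_DELAY)
```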
limactl logs everything to stderr, and we watch stderr to consume the logs. Since we drop limactl logs when running at the normal log level, when limactl fails we don't have any info on the error, and the only way to debug a limactl error is to run in verbose mode. With this change we extract the limactl log level and log errors as drenv errors, so the last log before the error provides some info on the error.

Tested by uninstalling socket_vmnet:

    sudo make uninstall.launchd

With this limactl fails to connect to the vmnet socket:

    % drenv start envs/vm.yaml
    2024-09-09 22:30:07,354 INFO [vm] Starting environment
    2024-09-09 22:30:07,376 INFO [cluster] Starting lima cluster
    2024-09-09 22:30:26,490 ERROR [cluster] exiting, status={Running:false Degraded:false Exiting:true Errors:[] SSHLocalPort:0} (hint: see "/Users/nsoffer/.lima/cluster/ha.stderr.log")
    2024-09-09 22:30:26,492 ERROR Command failed
    Traceback (most recent call last):
    ...
    drenv.commands.Error: Command failed:
    command: ('limactl', '--log-format=json', 'start', 'cluster')
    exitcode: 1
    error:

The lima error message is not very useful, but this is what we have. This should be improved in lima. If we inspect the log file mentioned in the hint, we can see the actual error:

    % tail -3 ~/.lima/cluster/ha.stderr.log
    {"level":"debug","msg":"Start tcp DNS listening on: 127.0.0.1:51618","time":"2024-09-09T22:30:26+03:00"}
    {"level":"info","msg":"new connection from to ","time":"2024-09-09T22:30:26+03:00"}
    {"level":"fatal","msg":"dial unix /var/run/socket_vmnet: connect: connection refused","time":"2024-09-09T22:30:26+03:00"}

Signed-off-by: Nir Soffer <[email protected]>
Usually this is an early usage error, written before the logger is configured to JSON format, so we get a text log instead of a JSON message. Log this line as is at debug level to allow debugging the issue.

Example error when using an older lima version that does not support --log-format:

    2024-09-12 22:25:55,637 DEBUG [drenv-test-cluster] time="2024-09-12T22:25:55Z" level=fatal msg="unknown flag: --log-format"

Without this change, this error is dropped and we don't have a clue what went wrong.

Signed-off-by: Nir Soffer <[email protected]>
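A minimal sketch of the log handling described in the last two commits, assuming the JSON record format shown above ({"level", "msg", "time"}); the function name and exact mapping are illustrative, not the actual drenv code.

```python
import json
import logging


def _log_limactl_line(name, line):
    """Forward one limactl stderr line to drenv logging (sketch)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        # Lines written before limactl switches to the JSON logger
        # (e.g. "unknown flag: --log-format" from an old limactl) are
        # plain text; keep them at debug level as-is.
        logging.debug("[%s] %s", name, line)
        return
    if record.get("level") in ("error", "fatal"):
        # Surface errors so the last message before a failure is visible
        # without running in verbose mode.
        logging.error("[%s] %s", name, record.get("msg", ""))
    else:
        logging.debug("[%s] %s", name, record.get("msg", ""))
```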
We access the cluster via the IP address on the shared network. Port forwarding cannot work for multiple clusters, since the same port from multiple clusters would be mapped to the same host port.

Signed-off-by: Nir Soffer <[email protected]>
Without this the API server will listen on the user network, which is not accessible from the host. Lima tries to mitigate this by changing the address to 127.0.0.1, but this does not work for multiple clusters. With this change we can access all clusters from the host.

Signed-off-by: Nir Soffer <[email protected]>
Without this configuration the rook-ceph pods listen on the user network (192.168.5.0/24) instead of the shared network (192.168.105.0/24), and rbd-mirror is broken. With this change we can run the rook environment.

Thanks: Raghavendra Talur <[email protected]>
Signed-off-by: Nir Soffer <[email protected]>
Previously this was configured via minikube --extra-config.

Signed-off-by: Nir Soffer <[email protected]>
With minikube this is set in the profile and configured via the --feature-gates flag. With lima we can configure this directly in KubeletConfiguration. Currently the feature gates are hard coded in the configuration for all clusters. We can configure them based on the profile later if needed.

Signed-off-by: Nir Soffer <[email protected]>
Currently we have:

    $ sysctl fs.inotify
    fs.inotify.max_queued_events = 16384
    fs.inotify.max_user_instances = 128
    fs.inotify.max_user_watches = 45827

And we see errors like this on managed clusters even with trivial busybox workloads:

    failed to create fsnotify watcher: too many open files

We use OpenShift worker defaults, already used for minikube[1].

[1] kubernetes/minikube#18832

Signed-off-by: Nir Soffer <[email protected]>
We used a 6 month old release; time to upgrade.

Signed-off-by: Nir Soffer <[email protected]>
This makes it work in a lima cluster without deploying a csi-hostpath driver. We can add such a driver later if there is a real need. With this change we can run the minio environment.

Signed-off-by: Nir Soffer <[email protected]>
This allows using commands.run() and commands.watch() with an open file connected to the child process stdin. We will use this to load images into lima clusters.

Signed-off-by: Nir Soffer <[email protected]>
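A sketch of how the new stdin argument could be used for loading an image into a lima cluster. The exact command run inside the guest (nerdctl in the k8s.io namespace) is an assumption for illustration, not taken from this patch.

```python
from drenv import commands


def load_image(cluster, tar_path):
    # Connect an open tar file to the child process stdin (the new
    # stdin argument described above).
    with open(tar_path, "rb") as f:
        commands.run(
            "limactl", "shell", cluster,
            "sudo", "nerdctl", "--namespace=k8s.io", "load",
            stdin=f,
        )
```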
Some commands like drenv must run in a specific location. Add a cwd argument allowing this when using commands.run() and commands.watch(). We will use this to run `drenv load` in `ramenctl deploy`, which may run in any directory.

Signed-off-by: Nir Soffer <[email protected]>
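A sketch of the new cwd argument from the ramenctl side: `ramenctl deploy` may be invoked from any directory, while `drenv load` is run from the drenv source tree. The function and the drenv_source_dir parameter are hypothetical names for this example.

```python
from drenv import commands


def load_into_clusters(env_file, image_tar, drenv_source_dir):
    # Run `drenv load` from the drenv source tree regardless of the
    # caller's current directory (the new cwd argument).
    commands.run("drenv", "load", "--image", image_tar, env_file,
                 cwd=drenv_source_dir)
```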
This command loads an image in tar format into all clusters. This will be used in ramenctl to load images into the clusters, and can also be used manually.

The environment may use one or more providers, and each one will use the right command to load the image. External providers do not support loading images; pushing the ramen image to a registry will work.

Usage:

    % drenv load -h
    usage: drenv load [-h] [-v] [--name-prefix PREFIX] --image IMAGE filename

    positional arguments:
      filename              path to environment file

    options:
      -h, --help            show this help message and exit
      -v, --verbose         be more verbose
      --name-prefix PREFIX  prefix profile names
      --image IMAGE         image to load into the cluster in tar format

Example run:

    % drenv load --image /tmp/image.tar envs/regional-dr.yaml
    2024-09-10 22:33:30,896 INFO [rdr] Loading image '/tmp/image.tar'
    2024-09-10 22:33:30,902 INFO [dr1] Loading image
    2024-09-10 22:33:30,902 INFO [dr2] Loading image
    2024-09-10 22:33:30,902 INFO [hub] Loading image
    2024-09-10 22:33:33,314 INFO [dr1] Image loaded in 2.41 seconds
    2024-09-10 22:33:33,407 INFO [dr2] Image loaded in 2.50 seconds
    2024-09-10 22:33:33,628 INFO [hub] Image loaded in 2.73 seconds
    2024-09-10 22:33:33,628 INFO [rdr] Image loaded in 2.73 seconds

Signed-off-by: Nir Soffer <[email protected]>
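A hypothetical sketch of the per-provider dispatch described above; the real `drenv load` code is organized differently. provider_for is an assumed helper mapping a profile to its provider module (minikube, lima, external).

```python
import concurrent.futures
import logging


def load_image(profiles, image, provider_for):
    # Fan the load out to all clusters, one thread per cluster.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = []
        for profile in profiles:
            provider = provider_for(profile)
            if provider.name == "external":
                # External clusters are not managed by drenv; pushing the
                # image to a registry is required instead.
                logging.info("[%s] external cluster, skipping", profile["name"])
                continue
            futures.append(executor.submit(provider.load, profile, image))
        for future in futures:
            future.result()  # propagate errors from the workers
```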
With this ramenctl can deploy ramen on any cluster type without knowing anything about the cluster provider.

Example run:

    % ramenctl deploy --source-dir .. envs/regional-dr-lima.yaml
    2024-09-09 00:52:14,231 INFO [ramenctl] Starting deploy
    2024-09-09 00:52:14,234 INFO [ramenctl] Preparing resources
    2024-09-09 00:52:18,192 INFO [ramenctl] Loading image 'quay.io/ramendr/ramen-operator:latest'
    2024-09-09 00:52:22,023 INFO [ramenctl] Deploying ramen operator in cluster 'hub'
    2024-09-09 00:52:22,023 INFO [ramenctl] Deploying ramen operator in cluster 'dr1'
    2024-09-09 00:52:22,025 INFO [ramenctl] Deploying ramen operator in cluster 'dr2'
    2024-09-09 00:52:22,600 INFO [ramenctl] Waiting until 'ramen-hub-operator' is rolled out in cluster 'hub'
    2024-09-09 00:52:22,687 INFO [ramenctl] Waiting until 'ramen-dr-cluster-operator' is rolled out in cluster 'dr1'
    2024-09-09 00:52:22,697 INFO [ramenctl] Waiting until 'ramen-dr-cluster-operator' is rolled out in cluster 'dr2'
    2024-09-09 00:52:29,893 INFO [ramenctl] Finished deploy in 15.65 seconds

Signed-off-by: Nir Soffer <[email protected]>
There may be a better way, but for a testing setup we could not care less about certificate checks. We can try to improve this later if we think that drenv will be used on real clusters.

Thanks: Raghavendra Talur <[email protected]>
Signed-off-by: Nir Soffer <[email protected]>
On lima clusters submariner uses the public IP of the host (the address assigned by your ISP) as the public IP of the clusters, and all clusters get the same IP:

    % subctl show connections --context dr1
    ✓ Showing Connections
    GATEWAY    CLUSTER   REMOTE IP        NAT   CABLE DRIVER   SUBNETS        STATUS      RTT avg.
    lima-dr2   dr2       93.172.220.134   yes   vxlan          242.1.0.0/16   connected

    % subctl show endpoints --context dr1
    ✓ Showing Endpoints
    CLUSTER   ENDPOINT IP    PUBLIC IP        CABLE DRIVER   TYPE
    dr1       192.168.5.15   93.172.220.134   vxlan          local
    dr2       192.168.5.15   93.172.220.134   vxlan          remote

With this change it uses the actual IP address of the cluster in the vmnet network:

    % subctl show connections --context dr1
    ✓ Showing Connections
    GATEWAY    CLUSTER   REMOTE IP        NAT   CABLE DRIVER   SUBNETS        STATUS      RTT avg.
    lima-dr2   dr2       192.168.105.10   yes   vxlan          242.1.0.0/16   connected

    % subctl show endpoints --context dr1
    ✓ Showing Endpoints
    CLUSTER   ENDPOINT IP    PUBLIC IP        CABLE DRIVER   TYPE
    dr1       192.168.5.15   192.168.105.11   vxlan          local
    dr2       192.168.5.15   192.168.105.10   vxlan          remote

Thanks: Raghavendra Talur <[email protected]>
Signed-off-by: Nir Soffer <[email protected]>
After provisioning a lima vm we have 2 default routes:

    % limactl shell dr1 ip route show default
    default via 192.168.5.2 dev eth0 proto dhcp src 192.168.5.15 metric 100
    default via 192.168.105.1 dev lima0 proto dhcp src 192.168.105.11 metric 100

192.168.5.0/24 is the special user network used by lima to bootstrap the VM. All VMs have the same IP address (192.168.5.15), so this network cannot be used to access the vm from the host. 192.168.105.0/24 is the vmnet shared network, providing access from host to vm and from vm to vm. We want to use only this network.

Without this change submariner uses the special user network (192.168.5.0/24) for the endpoints, which cannot work for accessing the other clusters:

    % subctl show connections --context dr1
    ✓ Showing Connections
    GATEWAY    CLUSTER   REMOTE IP        NAT   CABLE DRIVER   SUBNETS        STATUS      RTT avg.
    lima-dr2   dr2       192.168.105.10   yes   vxlan          242.1.0.0/16   connected

    % subctl show endpoints --context dr1
    ✓ Showing Endpoints
    CLUSTER   ENDPOINT IP    PUBLIC IP        CABLE DRIVER   TYPE
    dr1       192.168.5.15   192.168.105.11   vxlan          local
    dr2       192.168.5.15   192.168.105.10   vxlan          remote

I tried to fix this issue by deleting the default route via 192.168.5.2. This works for deploying submariner, but the route is recreated later, and this breaks the submariner gateway and connectivity between the clusters.

Changing the order of the default routes seems to work, both for deploying submariner and for running tests on the running clusters. We do this by modifying the metric of the preferred route so it becomes first:

    % limactl shell dr1 ip route show default
    default via 192.168.105.1 dev lima0 proto dhcp src 192.168.105.11 metric 1
    default via 192.168.5.2 dev eth0 proto dhcp src 192.168.5.15 metric 100

With this change the endpoints listen on the public IP (in the vmnet network), allowing access to other clusters:

    % subctl show connections --context dr1
    ✓ Showing Connections
    GATEWAY    CLUSTER   REMOTE IP        NAT   CABLE DRIVER   SUBNETS        STATUS      RTT avg.
    lima-dr2   dr2       192.168.105.10   no    vxlan          242.1.0.0/16   connected

    % subctl show endpoints --context dr1
    ✓ Showing Endpoints
    CLUSTER   ENDPOINT IP      PUBLIC IP        CABLE DRIVER   TYPE
    dr1       192.168.105.11   192.168.105.11   vxlan          local
    dr2       192.168.105.10   192.168.105.10   vxlan          remote

Signed-off-by: Nir Soffer <[email protected]>
With 0.17.0 and 0.17.2 the globalnet pod fails with:

    2024-09-08T15:42:25.498Z FTL ../gateway_monitor.go:286 Globalnet Error starting the controllers error="error creating the Node controller: error retrieving local Node \"lima-dr1\": nodes \"lima-dr1\" is forbidden: User \"system:serviceaccount:submariner-operator:submariner-globalnet\" cannot get resource \"nodes\" in API group \"\" at the cluster scope"

This worked with minikube clusters, so maybe it is related to some difference in the way the cluster is deployed, but we want to upgrade to the latest submariner anyway to detect regressions early.

Signed-off-by: Nir Soffer <[email protected]>
When testing the small submariner environment, we may start deploying one cluster before the other cluster is ready. This fails randomly with lima clusters when submariner uses the wrong interface. This may happen if we install submariner before flannel is ready.

Signed-off-by: Nir Soffer <[email protected]>
Remove the nslookup step, since it is problematic:

- nslookup and curl use different DNS resolvers, so when nslookup succeeds it does not mean that curl will succeed.
- nslookup sometimes returns a zero exit code with a message that the lookup failed! Then we try to access the DNS name with curl with a short timeout (60 seconds) and fail.

Simply check only with curl, increasing the timeout to 300 seconds.

Signed-off-by: Nir Soffer <[email protected]>
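A minimal sketch of such a curl-only check with a longer deadline, assuming drenv's commands module; the curl flags and retry interval are illustrative, not copied from the test.

```python
import time

from drenv import commands


def wait_for_url(url, timeout=300, delay=5):
    # Retry curl until it succeeds or the deadline passes; no separate
    # DNS lookup step.
    deadline = time.monotonic() + timeout
    while True:
        try:
            return commands.run("curl", "--silent", "--show-error", "--fail", url)
        except commands.Error:
            if time.monotonic() >= deadline:
                raise
            time.sleep(delay)
```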
It was configured to exit after 300 seconds, which makes it hard to test when waiting for connectivity takes a lot of time.

Signed-off-by: Nir Soffer <[email protected]>
Submariner works now, so we can enable it in regional-dr-lima.yaml.

Signed-off-by: Nir Soffer <[email protected]>
This addon replaces the minikube volumesnapshot addon, and is needed for cephfs, volsync, and for testing volume replication of snapshots and PVCs created from snapshots.

Signed-off-by: Nir Soffer <[email protected]>
This addon works on both minikube and lima clusters. It is used by cephfs and volsync, and will be used for testing DR for workloads using an rbd pvc restored from a snapshot. To use snapshots with rbd storage, a snapshot class was added based on the rook 1.15 example. With this change we can enable cephfs and volsync in regional-dr-lima.yaml.

Signed-off-by: Nir Soffer <[email protected]>
We did not flatten the config since it is not needed in minikube, which uses paths to the certificate files. But in lima we get the actual certificate from the guest, and without flattening we get:

    clusters:
    - name: drenv-test-cluster
      cluster:
        server: https://192.168.105.45:6443
        certificate-authority-data: DATA+OMITTED
    users:
    - name: drenv-test-cluster
      user:
        client-certificate-data: DATA+OMITTED
        client-key-data: DATA+OMITTED
    ...

`DATA+OMITTED` is not a valid certificate, so argocd fails to parse it. With this change argocd works, and we can use regional-dr.yaml on macOS.

Signed-off-by: Nir Soffer <[email protected]>
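A sketch of producing a self-contained kubeconfig for a cluster; the exact kubectl invocation used by drenv is an assumption here, but without --flatten kubectl replaces certificate data with DATA+OMITTED, which argocd cannot parse.

```python
from drenv import commands


def write_flat_kubeconfig(context, path):
    # --flatten embeds certificate data so the kubeconfig is usable
    # outside the host it was created on; --minify keeps only the
    # entries for the given context.
    config = commands.run(
        "kubectl", "config", "view", "--flatten", "--minify",
        "--context", context,
    )
    with open(path, "w") as f:
        f.write(config)
```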
Limactl is racy, trying to access files in other clusters' directories and failing when the files were deleted. Until this issue is fixed in lima, ensure that only a single vm can be deleted at a time.

Example failure:

    % drenv delete envs/regional-dr.yaml
    2024-09-13 05:59:57,159 INFO [rdr] Deleting environment
    2024-09-13 05:59:57,169 INFO [dr1] Deleting lima cluster
    2024-09-13 05:59:57,169 INFO [dr2] Deleting lima cluster
    2024-09-13 05:59:57,169 INFO [hub] Deleting lima cluster
    2024-09-13 05:59:57,255 WARNING [dr2] no such process
    2024-09-13 05:59:57,265 WARNING [dr2] remove /Users/nsoffer/.lima/dr2/ssh.sock: no such file or directory
    2024-09-13 05:59:57,265 WARNING [hub] remove /Users/nsoffer/.lima/hub/ssh.sock: no such file or directory
    2024-09-13 05:59:57,297 ERROR [dr1] open /Users/nsoffer/.lima/dr2/lima.yaml: no such file or directory
    2024-09-13 05:59:57,297 ERROR [hub] open /Users/nsoffer/.lima/dr2/lima.yaml: no such file or directory
    2024-09-13 05:59:57,298 ERROR Command failed
    Traceback (most recent call last):
    ...
    drenv.commands.Error: Command failed:
    command: ('limactl', '--log-format=json', 'delete', '--force', 'dr1')
    exitcode: 1
    error:

Note how the delete commands for "dr1" and "hub" fail to read lima.yaml of cluster "dr2":

    2024-09-13 05:59:57,297 ERROR [dr1] open /Users/nsoffer/.lima/dr2/lima.yaml: no such file or directory
    2024-09-13 05:59:57,297 ERROR [hub] open /Users/nsoffer/.lima/dr2/lima.yaml: no such file or directory

With the lock, we run a single limactl process at a time, so it cannot race with other clusters.

Signed-off-by: Nir Soffer <[email protected]>
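A minimal sketch of the serialization described above, assuming the lima provider deletes clusters from worker threads; the names are illustrative, not the actual drenv code.

```python
import threading

from drenv import commands

# limactl races when several clusters are deleted concurrently, so
# serialize deletion with a module level lock.
_delete_lock = threading.Lock()


def delete(profile):
    with _delete_lock:
        commands.run("limactl", "--log-format=json", "delete", "--force",
                     profile["name"])
```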
Currently we can run basic-test with regional-dr.yaml on macOS.
Needs more work:
Temporarily based on #1534
Fixes #1513