Add lima provider for macOS #1536

Merged
merged 33 commits into RamenDR:main from lima on Sep 18, 2024

Conversation


@nirs nirs commented Sep 2, 2024

Currently we can run basic-test with regional-dr.yaml on macOS.

Needs more work:

  • move image loading to drenv load command
  • enable submariner (requires labelling the nodes)
  • enable cephfs (requires external snapshotter)
  • enable volsync (requires cephfs and submariner)
  • enable argocd (fails to parse certificate in PEM format, need investigation)

Temporarily based on #1534

Fixes #1513

@nirs nirs force-pushed the lima branch 12 times, most recently from 2f27e17 to dd490b8 on September 8, 2024 22:13
@nirs nirs force-pushed the lima branch 5 times, most recently from d3c076a to c14094f on September 10, 2024 01:00
@nirs nirs force-pushed the lima branch 5 times, most recently from 9e2d2c2 to 9b57c91 on September 10, 2024 20:27
@nirs nirs marked this pull request as ready for review September 10, 2024 20:27
nirs added 27 commits September 18, 2024 19:37
As with minikube, starting a stopped cluster is flakier than starting a
fresh one. Even when k8s reports that everything is ready, some
components are not ready, and running the start hooks can fail randomly.

Example failure:

    Error from server (InternalError): Internal error occurred: failed
    calling webhook "managedclustermutators.admission.cluster.open-cluster-management.io":
    failed to call webhook:
    Post "https://cluster-manager-registration-webhook.open-cluster-management-hub.svc:9443/
    mutate-cluster-open-cluster-management-io-v1-managedcluster?timeout=10s":
    dial tcp 10.110.203.24:9443: connect: no route to host

Try to avoid this by adding a short delay after starting a stopped
cluster, before we start to run the hooks. This change affects only
developers that stop the environment and start it again.

In minikube we added the delay in configure(), but for lima it is better
done in start(), since there we can tell whether we are starting a
stopped cluster.
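
A minimal sketch of the idea; the status query, command wiring, and
delay value are illustrative assumptions, not the actual drenv lima
provider code:

    import json
    import time

    from drenv import commands

    START_DELAY = 30  # seconds; illustrative value, not the actual delay

    def start(name):
        # Check whether the vm is currently stopped; "limactl list --json"
        # and the "status" field are assumed here, the real provider may
        # query the status differently.
        out = commands.run("limactl", "list", "--json", name)
        was_stopped = json.loads(out)["status"] == "Stopped"

        commands.run("limactl", "start", name)

        if was_stopped:
            # Let cluster components settle before running the start hooks.
            time.sleep(START_DELAY)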

Signed-off-by: Nir Soffer <[email protected]>
limactl logs everything to stderr, and we watch stderr to consume the
logs. Since we drop limactl logs when running at the normal log level,
when limactl fails we don't have any info on the error, and the only way
to debug a limactl error is to run in verbose mode.

With this change we extract the limactl log level and log errors as drenv
errors, so the last log before the failure provides some info on the
error.
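
A minimal sketch of the idea; the function name and logger wiring are
illustrative, not the actual drenv code:

    import json
    import logging

    def emit(name, line):
        """Log one limactl stderr line at a level based on its JSON "level"."""
        try:
            info = json.loads(line)
        except ValueError:
            # Not JSON (e.g. early usage errors); logged at debug level,
            # see the next commit.
            logging.debug("[%s] %s", name, line)
            return
        if info.get("level") in ("error", "fatal"):
            logging.error("[%s] %s", name, info.get("msg", line))
        else:
            logging.debug("[%s] %s", name, info.get("msg", line))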

Tested by uninstalling socket_vmnet:

    sudo make uninstall.launchd

With this limactl fails to connect to the vmnet socket:

    % drenv start envs/vm.yaml
    2024-09-09 22:30:07,354 INFO    [vm] Starting environment
    2024-09-09 22:30:07,376 INFO    [cluster] Starting lima cluster
    2024-09-09 22:30:26,490 ERROR   [cluster] exiting, status={Running:false Degraded:false
        Exiting:true Errors:[] SSHLocalPort:0} (hint: see "/Users/nsoffer/.lima/cluster/ha.stderr.log")
    2024-09-09 22:30:26,492 ERROR   Command failed
    Traceback (most recent call last):
       ...
    drenv.commands.Error: Command failed:
       command: ('limactl', '--log-format=json', 'start', 'cluster')
       exitcode: 1
       error:

The lima error message is not very useful but this is what we have. This
should be improved in lima.

If we inspect the log file mentioned we can see the actual error:

    % tail -3 ~/.lima/cluster/ha.stderr.log
    {"level":"debug","msg":"Start tcp DNS listening on: 127.0.0.1:51618","time":"2024-09-09T22:30:26+03:00"}
    {"level":"info","msg":"new connection from  to ","time":"2024-09-09T22:30:26+03:00"}
    {"level":"fatal","msg":"dial unix /var/run/socket_vmnet: connect: connection refused","time":"2024-09-09T22:30:26+03:00"}

Signed-off-by: Nir Soffer <[email protected]>
Sometimes limactl writes a log line that is not valid JSON. Usually this
is an early usage error, written before the logger is configured to JSON
format, so we get a plain text line instead of a JSON message. Log this
line as is at debug level to allow debugging the issue.

Example error when using an older lima version that does not support
--log-format:

    2024-09-12 22:25:55,637 DEBUG   [drenv-test-cluster]
    time="2024-09-12T22:25:55Z" level=fatal msg="unknown flag:
    --log-format"

Without this change, this error is dropped and we don't have a clue what
went wrong.

Signed-off-by: Nir Soffer <[email protected]>
We access the cluster via the IP address on the shared network. Port
forwarding cannot work for multiple clusters since the same port from
every cluster would be mapped to the same host port.

Signed-off-by: Nir Soffer <[email protected]>
Without this the API server will listen on the user network, which is
not accessible from the host. Lima tries to mitigate this by changing the
address to 127.0.0.1, but this does not work for multiple clusters.

With this change we can access all clusters from the host.

Signed-off-by: Nir Soffer <[email protected]>
Without this configuration the rook-ceph pods are listening on the user
network (192.168.5.0/24) instead of the shared network (192.168.105.0/24),
and rbd-mirror is broken.

With this change we can run the rook environment.

Thanks: Raghavendra Talur <[email protected]>
Signed-off-by: Nir Soffer <[email protected]>
Previously this was configured via minikube --extra-config.

Signed-off-by: Nir Soffer <[email protected]>
With minikube this is set in the profile and configured via the
--feature-gates flag. With lima we can configure this directly in
KubeletConfiguration. Currently the feature gates are hard coded in the
configuration for all clusters. We can configure them based on the
profile later if needed.

Signed-off-by: Nir Soffer <[email protected]>
Currently we have:

    $ sysctl fs.inotify
    fs.inotify.max_queued_events = 16384
    fs.inotify.max_user_instances = 128
    fs.inotify.max_user_watches = 45827

And we see errors like this on managed clusters even with trivial
busybox workloads:

    failed to create fsnotify watcher: too many open files

We use OpenShift worker defaults, already used for minikube[1].

[1] kubernetes/minikube#18832

Signed-off-by: Nir Soffer <[email protected]>
We used a 6 month old release; time to upgrade.

Signed-off-by: Nir Soffer <[email protected]>
This makes it work in a lima cluster without deploying a csi-hostpath
driver. We can add such a driver later if there is a real need.

With this change we can run the minio environment.

Signed-off-by: Nir Soffer <[email protected]>
This allows using commands.run() and commands.watch() with an open file
connected to the child process stdin. We will use this to load images
into lima clusters.
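
Illustrative usage; the keyword argument name (stdin) and the in-guest
load command are assumptions:

    from drenv import commands

    with open("/tmp/image.tar", "rb") as f:
        # Feed the tar file to a load command running inside the guest;
        # the in-guest command shown here is only an example.
        commands.run(
            "limactl", "shell", "cluster",
            "sudo", "nerdctl", "--namespace", "k8s.io", "load",
            stdin=f,
        )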

Signed-off-by: Nir Soffer <[email protected]>
Some commands like drenv must run in a specific location. Add a cwd
argument allowing this when using commands.run() and commands.watch(). We
will use this to run `drenv load` in `ramenctl deploy`, which may run
from any directory.
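
Illustrative usage of the new argument; the directory and image path are
examples:

    from drenv import commands

    # Run drenv load from the ramen "test" directory, regardless of the
    # caller's working directory.
    commands.run(
        "drenv", "load", "--image", "/tmp/image.tar", "envs/regional-dr.yaml",
        cwd="test",
    )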

Signed-off-by: Nir Soffer <[email protected]>
This command loads an image in tar format into all clusters. This will
be used in ramenctl to load images into the clusters, and can also be
used manually. The environment may use one or more providers, and each
one will use the right command to load the image.

The external provider does not support loading images; pushing the ramen
image to a registry will work instead.
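
A hypothetical sketch of the dispatch; the provider module layout and the
environment dictionary keys are assumptions, not the actual drenv
implementation:

    import concurrent.futures

    from drenv import providers  # assumed module layout

    def load_image(env, image):
        # Load the image into all clusters in parallel; each provider
        # picks the right load command for its cluster type.
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [
                executor.submit(providers.get(p["provider"]).load, p, image)
                for p in env["profiles"]
            ]
            for f in futures:
                f.result()  # propagate the first error, if any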

Usage:

    % drenv load -h
    usage: drenv load [-h] [-v] [--name-prefix PREFIX] --image IMAGE filename

    positional arguments:
      filename              path to environment file

    options:
      -h, --help            show this help message and exit
      -v, --verbose         be more verbose
      --name-prefix PREFIX  prefix profile names
      --image IMAGE         image to load into the cluster in tar format

Example run:

    % drenv load --image /tmp/image.tar envs/regional-dr.yaml
    2024-09-10 22:33:30,896 INFO    [rdr] Loading image '/tmp/image.tar'
    2024-09-10 22:33:30,902 INFO    [dr1] Loading image
    2024-09-10 22:33:30,902 INFO    [dr2] Loading image
    2024-09-10 22:33:30,902 INFO    [hub] Loading image
    2024-09-10 22:33:33,314 INFO    [dr1] Image loaded in 2.41 seconds
    2024-09-10 22:33:33,407 INFO    [dr2] Image loaded in 2.50 seconds
    2024-09-10 22:33:33,628 INFO    [hub] Image loaded in 2.73 seconds
    2024-09-10 22:33:33,628 INFO    [rdr] Image loaded in 2.73 seconds

Signed-off-by: Nir Soffer <[email protected]>
With this ramenctl can deploy ramen on any cluster type without knowing
anything about the cluster provider.

Example run:

    % ramenctl deploy --source-dir .. envs/regional-dr-lima.yaml
    2024-09-09 00:52:14,231 INFO    [ramenctl] Starting deploy
    2024-09-09 00:52:14,234 INFO    [ramenctl] Preparing resources
    2024-09-09 00:52:18,192 INFO    [ramenctl] Loading image 'quay.io/ramendr/ramen-operator:latest'
    2024-09-09 00:52:22,023 INFO    [ramenctl] Deploying ramen operator in cluster 'hub'
    2024-09-09 00:52:22,023 INFO    [ramenctl] Deploying ramen operator in cluster 'dr1'
    2024-09-09 00:52:22,025 INFO    [ramenctl] Deploying ramen operator in cluster 'dr2'
    2024-09-09 00:52:22,600 INFO    [ramenctl] Waiting until 'ramen-hub-operator' is rolled out in cluster 'hub'
    2024-09-09 00:52:22,687 INFO    [ramenctl] Waiting until 'ramen-dr-cluster-operator' is rolled out in cluster 'dr1'
    2024-09-09 00:52:22,697 INFO    [ramenctl] Waiting until 'ramen-dr-cluster-operator' is rolled out in cluster 'dr2'
    2024-09-09 00:52:29,893 INFO    [ramenctl] Finished deploy in 15.65 seconds

Signed-off-by: Nir Soffer <[email protected]>
There may be a better way, but for a testing setup we could not care
less about certificate checks. We can try to improve this later if we
think that drenv will be used on real clusters.

Thanks: Raghavendra Talur <[email protected]>
Signed-off-by: Nir Soffer <[email protected]>
On lima clusters submariner uses the public IP of the host (the address
assigned by your ISP) as the public IP of the clusters, so all clusters
get the same IP:

    % subctl show connections --context dr1
     ✓ Showing Connections
    GATEWAY    CLUSTER   REMOTE IP        NAT   CABLE DRIVER   SUBNETS        STATUS      RTT avg.
    lima-dr2   dr2       93.172.220.134   yes   vxlan          242.1.0.0/16   connected

    % subctl show endpoints --context dr1
     ✓ Showing Endpoints
    CLUSTER   ENDPOINT IP    PUBLIC IP        CABLE DRIVER   TYPE
    dr1       192.168.5.15   93.172.220.134   vxlan          local
    dr2       192.168.5.15   93.172.220.134   vxlan          remote

With this change it uses the actual IP address of the cluster in the
vmnet network:

    % subctl show connections --context dr1
     ✓ Showing Connections
    GATEWAY    CLUSTER   REMOTE IP        NAT   CABLE DRIVER   SUBNETS        STATUS      RTT avg.
    lima-dr2   dr2       192.168.105.10   yes   vxlan          242.1.0.0/16   connected

    % subctl show endpoints --context dr1
     ✓ Showing Endpoints
    CLUSTER   ENDPOINT IP    PUBLIC IP        CABLE DRIVER   TYPE
    dr1       192.168.5.15   192.168.105.11   vxlan          local
    dr2       192.168.5.15   192.168.105.10   vxlan          remote
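
One way to get this behavior is Submariner's public-ip override via a
node annotation on the gateway node; whether the addon uses exactly this
mechanism is an assumption, and the node name, address, and context
below are examples:

    from drenv import commands

    node = "lima-dr1"          # example gateway node
    addr = "192.168.105.11"    # example vmnet address of the node

    commands.run(
        "kubectl", "annotate", "node", node,
        f"gateway.submariner.io/public-ip=ipv4:{addr}",
        "--overwrite", "--context", "dr1",
    )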

Thanks: Raghavendra Talur <[email protected]>
Signed-off-by: Nir Soffer <[email protected]>
After provisioning a lima vm we have 2 default routes:

    % limactl shell dr1 ip route show default
    default via 192.168.5.2 dev eth0 proto dhcp src 192.168.5.15 metric 100
    default via 192.168.105.1 dev lima0 proto dhcp src 192.168.105.11 metric 100

192.168.5.0/24 is the special user network used by lima to bootstrap the
VM. All vms have the same IP address (192.168.5.15), so this network
cannot be used to access the vm from the host.

192.168.105.0/24 is the vmnet shared network, providing access from host
to vm and from vm to vm. We want to use only this network.

Without this change submariner uses the special user network
(192.168.5.0/24) for the endpoints, which cannot work for accessing the
other clusters:

    % subctl show connections --context dr1
     ✓ Showing Connections
    GATEWAY    CLUSTER   REMOTE IP        NAT   CABLE DRIVER   SUBNETS        STATUS      RTT avg.
    lima-dr2   dr2       192.168.105.10   yes   vxlan          242.1.0.0/16   connected

    % subctl show endpoints --context dr1
     ✓ Showing Endpoints
    CLUSTER   ENDPOINT IP    PUBLIC IP        CABLE DRIVER   TYPE
    dr1       192.168.5.15   192.168.105.11   vxlan          local
    dr2       192.168.5.15   192.168.105.10   vxlan          remote

I tried to fix this issue by deleting the default route via 192.168.5.2.
This works for deploying submariner, but the route is recreated later,
which breaks the submariner gateway and connectivity between the
clusters.

Changing the order of the default routes seems to work, both for
deploying submariner and for running tests on the running clusters. We do
this by modifying the metric of the preferred route so it becomes first:

    % limactl shell dr1 ip route show default
    default via 192.168.105.1 dev lima0 proto dhcp src 192.168.105.11 metric 1
    default via 192.168.5.2 dev eth0 proto dhcp src 192.168.5.15 metric 100

With this change the endpoint listens on the public IP (in the vmnet
network), allowing access to the other clusters:

    % subctl show connections --context dr1
     ✓ Showing Connections
    GATEWAY    CLUSTER   REMOTE IP        NAT   CABLE DRIVER   SUBNETS        STATUS      RTT avg.
    lima-dr2   dr2       192.168.105.10   no    vxlan          242.1.0.0/16   connected

    % subctl show endpoints --context dr1
     ✓ Showing Endpoints
    CLUSTER   ENDPOINT IP      PUBLIC IP        CABLE DRIVER   TYPE
    dr1       192.168.105.11   192.168.105.11   vxlan          local
    dr2       192.168.105.10   192.168.105.10   vxlan          remote

Signed-off-by: Nir Soffer <[email protected]>
With 0.17.0 and 0.17.2 the globalnet pod fails with:

    2024-09-08T15:42:25.498Z FTL ../gateway_monitor.go:286 Globalnet
    Error starting the controllers error="error creating the Node
    controller: error retrieving local Node \"lima-dr1\": nodes \"lima-dr1\"
    is forbidden: User
    \"system:serviceaccount:submariner-operator:submariner-globalnet\"
    cannot get resource \"nodes\" in API group \"\" at the cluster scope"

This worked with minikube clusters, so maybe this is related to some
difference in the way the cluster is deployed, but we want to upgrade to
the latest submariner anyway to detect regressions early.

Signed-off-by: Nir Soffer <[email protected]>
When testing the small submariner environment, we may start deploying
one cluster before the other cluster is ready. This fails randomly with
lima clusters when submariner uses the wrong interface.

This may happen if we install submariner before flannel is ready.

Signed-off-by: Nir Soffer <[email protected]>
Remove the nslookup step, since it is problematic:
- nslookup and curl use different DNS resolvers, so when nslookup
  succeeds it does not mean that curl will succeed.
- nslookup sometimes returns a zero exit code with a message that the
  lookup failed! Then we try to access the DNS name with curl with a
  short timeout (60 seconds) and fail.

Simply check only with curl, increasing the timeout to 300 seconds.
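
A minimal sketch of the simplified check, assuming a plain retry loop
around curl; the URL and retry cadence are illustrative, not the actual
addon values:

    import time

    from drenv import commands

    URL = "http://example-service.test/"  # illustrative URL
    TIMEOUT = 300
    DELAY = 5

    deadline = time.monotonic() + TIMEOUT
    while True:
        try:
            commands.run("curl", "--fail", "--silent", "--show-error", URL)
            break
        except commands.Error:
            if time.monotonic() > deadline:
                raise
            time.sleep(DELAY)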

Signed-off-by: Nir Soffer <[email protected]>
It was configured to exit after 300 seconds, which makes it hard to test
when it takes a lot of time to wait for connectivity.

Signed-off-by: Nir Soffer <[email protected]>
Submariner works now, so we can enable it in regional-dr-lima.yaml.

Signed-off-by: Nir Soffer <[email protected]>
This addon replaces the minikube volumesnapshot addon, and is needed for
cephfs, volsync, and for testing volume replication of snapshots and
pvcs created from snapshots.

Signed-off-by: Nir Soffer <[email protected]>
This addon works on both minikube and lima clusters. It is used by the
cephfs and volsync addons and will be used for testing DR for workloads
using an rbd pvc restored from a snapshot.

To use snapshots with rbd storage, a snapshot class was added based on
the rook 1.15 example.

With this change we can enable cephfs and volsync in
regional-dr-lima.yaml.

Signed-off-by: Nir Soffer <[email protected]>
We did not flatten the config since it is not needed with minikube,
which uses paths to the certificates. But with lima we get the actual
certificate data from the guest, and without flattening we get:

    clusters:
    - name: drenv-test-cluster
      cluster:
        server: https://192.168.105.45:6443
        certificate-authority-data: DATA+OMITTED
    users:
    - name: drenv-test-cluster
      user:
        client-certificate-data: DATA+OMITTED
        client-key-data: DATA+OMITTED
    ...

`DATA+OMITTED` is not a valid certificate, so argocd fails to parse it.

With this change argocd works, and we can use regional-dr.yaml on macOS.
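
For reference, a minimal sketch of exporting a self-contained config with
kubectl; the actual drenv code may also minify or select a specific
context:

    from drenv import commands

    # --flatten embeds the certificate data instead of the DATA+OMITTED
    # placeholders, producing a self-contained kubeconfig.
    kubeconfig = commands.run("kubectl", "config", "view", "--flatten", "-o", "yaml")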

Signed-off-by: Nir Soffer <[email protected]>
Limactl is racy, trying to access files in other clusters' directories
and failing when files were deleted. Until this issue is fixed in lima,
ensure that only a single vm can be deleted at a time.

Example failure:

    % drenv delete envs/regional-dr.yaml
    2024-09-13 05:59:57,159 INFO    [rdr] Deleting environment
    2024-09-13 05:59:57,169 INFO    [dr1] Deleting lima cluster
    2024-09-13 05:59:57,169 INFO    [dr2] Deleting lima cluster
    2024-09-13 05:59:57,169 INFO    [hub] Deleting lima cluster
    2024-09-13 05:59:57,255 WARNING [dr2] no such process
    2024-09-13 05:59:57,265 WARNING [dr2] remove /Users/nsoffer/.lima/dr2/ssh.sock: no such file or directory
    2024-09-13 05:59:57,265 WARNING [hub] remove /Users/nsoffer/.lima/hub/ssh.sock: no such file or directory
    2024-09-13 05:59:57,297 ERROR   [dr1] open /Users/nsoffer/.lima/dr2/lima.yaml: no such file or directory
    2024-09-13 05:59:57,297 ERROR   [hub] open /Users/nsoffer/.lima/dr2/lima.yaml: no such file or directory
    2024-09-13 05:59:57,298 ERROR   Command failed
    Traceback (most recent call last):
      ...
    drenv.commands.Error: Command failed:
       command: ('limactl', '--log-format=json', 'delete', '--force', 'dr1')
       exitcode: 1
       error:

Note how the delete commands for "dr1" and "hub" fail to read the
lima.yaml of cluster "dr2":

    2024-09-13 05:59:57,297 ERROR   [dr1] open /Users/nsoffer/.lima/dr2/lima.yaml: no such file or directory
    2024-09-13 05:59:57,297 ERROR   [hub] open /Users/nsoffer/.lima/dr2/lima.yaml: no such file or directory

With the lock, we run a single limactl process at a time, so it cannot
race with other clusters.
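
A minimal sketch of the serialization, assuming drenv drives the clusters
from threads; the lock placement in the real provider may differ:

    import threading

    from drenv import commands

    _limactl_lock = threading.Lock()

    def delete(name):
        # Serialize limactl invocations to avoid the race described above.
        with _limactl_lock:
            commands.run("limactl", "--log-format=json", "delete", "--force", name)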

Signed-off-by: Nir Soffer <[email protected]>
@nirs nirs merged commit c021197 into RamenDR:main Sep 18, 2024
13 of 18 checks passed
@nirs nirs deleted the lima branch September 18, 2024 16:38
Development

Successfully merging this pull request may close these issues.

drenv on Apple silicon via lima
2 participants