Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SONiC Container Hardening #1364

Merged
merged 9 commits into from
Feb 29, 2024
Merged
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
349 changes: 349 additions & 0 deletions doc/Container Hardening/SONiC_container_hardening_HLD.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,349 @@
# SONiC Container Hardening #

## Table of Content
- [SONiC Container Hardening](#sonic-container-hardening)
- [Table of Content](#table-of-content)
- [List of Tables](#list-of-tables)
- [Revision](#revision)
- [Scope](#scope)
- [Definitions/Abbreviations](#definitionsabbreviations)
- [1. Overview](#1-overview)
- [2. Requirements](#2-requirements)
- [3. Architecture Design](#3-architecture-design)
- [3.1 Root privileges](#31-root-privileges)
- [3.2 net=host](#32-nethost)
- [4. High-Level Design](#4-high-level-design)
- [4.1 Root privileges removal](#41-root-privileges-removal)
- [Docker privileges](#docker-privileges)
- [4.2 net=host optimization](#42-nethost-optimization)
- [How to check?](#how-to-check)
- [5. SAI API](#5-sai-api)
- [6. Configuration and management](#6-configuration-and-management)
- [6.1. Manifest (if the feature is an Application Extension)](#61-manifest-if-the-feature-is-an-application-extension)
- [6.2. CLI/YANG model Enhancements](#62-cliyang-model-enhancements)
- [6.3. Config DB Enhancements](#63-config-db-enhancements)
- [7. Warmboot and Fastboot Design Impact](#7-warmboot-and-fastboot-design-impact)
- [8. Restrictions/Limitations](#8-restrictionslimitations)
- [9. Testing Requirements/Design](#9-testing-requirementsdesign)
- [9.1 Unit Test cases](#91-unit-test-cases)
- [9.2 System Test cases](#92-system-test-cases)
- [10. Open/Action items - if any](#10-openaction-items---if-any)
- [Appendix A: Further reading](#appendix-a-further-reading)
- [Appendix B: Linux Capabilities](#appendix-b-linux-capabilities)

## List of Tables
* [Table 1: Revision](#table-1-revision)
* [Table 2: Abbreviations](#table-2-abbreviations)
* [Table 3: Default Linux capabilities](#table-3-default-linux-capabilities)
* [Table 4: Extended Linux capabilities](#table-4-extended-linux-capabilities)

## Revision
###### Table 1: Revision
| Rev | Date | Author | Change Description |
|:---:|:-----------:|:------------------:|-----------------------------------|
| 0.1 | | | Initial version |

## Scope

This section describes the requirements, goals, and recommendations of the container hardening item for SONiC.

## Definitions/Abbreviations
###### Table 2: Abbreviations
| Definitions/Abbreviation | Description |
|--------------------------|--------------------------------------------|
| OS | Operating System |
| API | Application Programmable Interface |
| SAI | Swich Abstraction Interface |

## 1. Overview

Containers is a method of creating virtualization and abstraction of an OS for a subset of processes/service on top of a single host with the purpose of giving it an environment to run and execute its tasks without effect of nearby containers/processes.

In SONiC, we are deploying containers with full visibility and capabilities as the host Linux.

This poses a security risk and vulnerability as a single breached container means that the whole system is breached.

Addressing this issue – we have composed this doc for container hardening, describing the security hardening requirements and definitions for all containers on top of SONiC.

## 2. Requirements

What are we trying to achieve here?

We would like to increase the security in SONiC so that an attack on a specific container will not compromise the whole system.

To do so, we'll tackle the following areas:
1. Privileges
2. Network
3. Capabilities
4. Mount namespace
5. Cgroups
6. Etc

For now, we will focus on #1 & #2

Further guidelines and requirements will be brought upon in the future on-demand.

## 3. Architecture Design

### 3.1 Root privileges
maipbui marked this conversation as resolved.
Show resolved Hide resolved

When removing the root privileges from a specific container - we are required to remove the `--privileged` flag and add the required missing Linux capabilities to the docker, or alternatively adjust the container so that it does not require root privileges to perform any action.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is similar work done with Multi ASIC design, it is better to converge before making any decision changing net=host -> bridge etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not familiar with it, what is the work being done there?


### 3.2 net=host

Removing the `net=host` is required to prevent the container from accessing the full network scope of the host and system.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe, BGP container needs full network access in order to function properly, do we have the list of containers that can be migrated without net=host? Some of the container (DHCP relay, SNMP..etc) directly work on the Linux networking, removing net=host, might require additional changes at the application level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the SNMP container we have managed to remove the net=host.
For BGP - we also have a PR up with @maipbui changes that we can review.

For DHCP - we'll need to check this.
I have a composed list of containers and the suggestions on whether we can remove the flags or not, I'll add it to the HLD as an appendix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please help review code PRs which we included in this HLD PR's description. @venkatmahalingam Yes, BGP container needs host network mode.

When doing this removal - we will start getting failures from devices that require external access and packet transfers between the container and the host to the interfaces.
In order to overcome this obstacle - we have a few options here:
- using `--net=bridge` and port forwarding
Yarden-Z marked this conversation as resolved.
Show resolved Hide resolved

## 4. High-Level Design

### 4.1 Root privileges removal
Removing the `--privileged` flag is done by editing the docker_image_ctl.j2 file:

docker_image_ctl.j2 file

docker create {{docker_image_run_opt}} \ # *Need to modify this parameter "docker_image_run_opt" to not contain the --privileged flag*
{%- if docker_container_name != "database" %}
--net=$NET \
--uts=host \{# W/A: this should be set per-docker, for those dockers which really need host's UTS namespace #}
{%- endif %}
{%- if docker_container_name == "database" %}
-p 6379:6379 \
{%- endif %}
-e RUNTIME_OWNER=local \
{%- if install_debug_image == "y" %}
-v /src:/src:ro -v /debug:/debug:rw \
{%- endif %}
{%- if '--log-driver=json-file' in docker_image_run_opt or '--log-driver' not in docker_image_run_opt %}
--log-opt max-size=2M --log-opt max-file=5 \
{%- endif %}

This will cause the docker file to be altered in the following manner:

**database.sh file**

docker create --privileged -t -v /etc/sonic:/etc/sonic:ro \ # *Need to remove the --privileged flag*
-p 6379:6379 \
-e RUNTIME_OWNER=local \
--log-opt max-size=2M --log-opt max-file=5 \
--tmpfs /tmp \
$DB_OPT \
$REDIS_MNT \
-v /usr/share/sonic/device/$PLATFORM:/usr/share/sonic/platform:ro \
--tmpfs /var/tmp \
--env "NAMESPACE_ID"="$DEV" \
--env "NAMESPACE_PREFIX"="$NAMESPACE_PREFIX" \
--env "NAMESPACE_COUNT"=$NUM_ASIC \
--name=$DOCKERNAME \
docker-database:latest \
|| {
echo "Failed to docker run" >&1
exit 4
}

#### Docker privileges
Removing the root privileges from the docker container - will remove some Linux capabilities that are inherited from the root level permissions.

Running the capabilities list command on a privileged container, this includes all capabilities captured in both [Table 3: Default Linux capabilities](#table-3-default-linux-capabilities) and [Table 4: Extended Linux capabilities](#table-4-extended-linux-capabilities)
Yarden-Z marked this conversation as resolved.
Show resolved Hide resolved

root@ce2c36a0b20c:/# capsh --print
Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read+eip

Running the capabilities list command on an un-privileged container, this includes all capabilities captured in [Table 3: Default Linux capabilities](#table-3-default-linux-capabilities):

root@ce2c36a0b20c:/# capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=eip

If, for some reason, a docker must retain a specific capablity functionality on top of the container (which is removed after removing the `--privileged` flag), we can do that with the following:

In the docker-database.mk file adjust this line:

$(DOCKER_DATABASE)_RUN_OPT += -t –-cap-add NET_ADMIN # Changed by removing the --privileged flag and adding --cap-add flag

### 4.2 net=host optimization

Here we will provide a detailed example of how to switch from the `--net=host` configuration (host network) to the `--net=bridge` configuration paired with port forwarding in a specific container. We are using the database container as an example for this item.
Copy link
Contributor

@Blueve Blueve Jul 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this design cover SDK change? SONiC containers are massively interactive with Redis database, it is required to let our SDK adapt bridge mode container

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this require SDK changes?
Regarding the SDK - since it resides in a single container, we have decided against reducing the privileges and capabilities of this container, as this will require massive SDK re-design (and this is not our purpose).
Also - the SDK container is not at a high risk of compromise since it does not interact with the outside network.

Copy link
Contributor

@qiluo-msft qiluo-msft Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Blueve Which SDK? Is it switch ASIC vendor SDK or swss-common library?
Could you give specific sample function in the SDK you are worrying about?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft swss-common, do we still have a unified way to interact with Redis database with this design?

Current we can use two approaches to interact Redis in sdk:

  • local IP
  • Unix sock file

With this design, both of them are not available I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are still applicable with this design (tested).
UDS should not be affected, since reducing the privileges of the redis container does not affect the Unix socket (on a different level of permissions - file access permissions).
Local IP might require a slight change, but you can check it in the relevant PR and it is currently working.


The original docker creation should be like in the example below:

docker with host sharing:
Copy link
Contributor

@qiluo-msft qiluo-msft Oct 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docker with host sharing

As feedback in sonic community review meeting, please considering first unifying single-asic to multi-asic namespace design, and then leverage the current bridge net mode already implemented on multi-asic system.

ref: sonic-net/sonic-buildimage#3856 #Pending

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Will review this and apply the relevant fixes


docker create --privileged -t -v /etc/sonic:/etc/sonic:ro \
--net=$NET \
-e RUNTIME_OWNER=local \
--uts=host \
--log-opt max-size=2M --log-opt max-file=5 \
--tmpfs /tmp \
$DB_OPT \
$REDIS_MNT \
-v /usr/share/sonic/device/$PLATFORM:/usr/share/sonic/platform:ro \
--tmpfs /var/tmp \
--env "NAMESPACE_ID"="$DEV" \
--env "NAMESPACE_PREFIX"="$NAMESPACE_PREFIX" \
--env "NAMESPACE_COUNT"=$NUM_ASIC \
--name=database_no_net \
--cap-drop=NET_ADMIN \
docker-database:latest

To disable the sharing of the networking stack between the host and a container we need to remove the flag: `--net=host`. Because we have not specified any `--network` flag, the containers connect to the default bridge network `--net=bridge`.
To support port forwarding we are required to add the flag:  `-p <port>:<port>`

The "new" docker creation file database.sh can be seen in the code block below:

Docker with port forwarding and default bridge network

docker create --privileged -t -v /etc/sonic:/etc/sonic:ro \
**-p 6379:6379** \
-e RUNTIME_OWNER=local \
--uts=host \
--log-opt max-size=2M --log-opt max-file=5 \
--tmpfs /tmp \
$DB_OPT \
$REDIS_MNT \
-v /usr/share/sonic/device/$PLATFORM:/usr/share/sonic/platform:ro \
--tmpfs /var/tmp \
--env "NAMESPACE_ID"="$DEV" \
--env "NAMESPACE_PREFIX"="$NAMESPACE_PREFIX" \
--env "NAMESPACE_COUNT"=$NUM_ASIC \
--name=$DOCKERNAME \
docker-database:latest \

**How we did it?**

To create a docker with the flags above it is required to set the "new" flag in the file docker_image_ctl.js. Follow the call `docker create {{docker_image_run_opt}} \`: 
and replace the `–--net=$NET`.
docker flag generation

{%- if docker_container_name != "database" %}
--net=$NET \
{%- endif %}
{%- if docker_container_name == "database" %}
-p 6379:6379 \
{%- endif %}

#### How to check?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the user experience with this change with the current container capabilities? What is the plan to mitigate this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User should not experience any changes.
If there are, then one of the following should occur:

  1. Adjust the user action/container action so that it does not require the elevated privileges or specific net host capabilities.
  2. Add to the container the required capabilities it needs to perform.

We should go with option 1 first, and only after we have decided option 1 is not feasible - option 2 is on the table.


Go into the docker - `docker exec -it docker bash`
Run `ifconfig`.

On a docker with host network - you'll be able to view all physical interfaces.
On a docker without host network - we'll see only eth0 and lo.

Copy link
Contributor

@yaqiangz yaqiangz Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether syslog(especially send syslog message to remote rsyslog server) can work well with this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Syslog can function with this.
Also, the syslog is currently running on top of the host - where nothing is changed there.
So I'm going to split my answer:

  1. There should be no impact on rsyslog and none of these items affect it as it runs on top of the host
  2. In the same manner that we changed SNMP - we can change rsyslog (if there is a decision to move it to a container based solution)

Copy link
Contributor

@yaqiangz yaqiangz Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be clarified the syslog I mean is generated inside container. I just tried image built by this snmp network mode related PR: sonic-net/sonic-buildimage#15176 (SONiC.master-15176.360116-6bb24c135). And cannot find any syslog generated by snmp container (contains 'snmp#') in local host syslog file or our syslog server, and process to dynamically generate rsyslog inside container containercfgd doesn't startup. Not sure whether you have verified this, could you confirm that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please refer this item to the specific PR?
If there is some implementation concern that you have noticed - I'd appreciate it if you could direct it to the relevant PR.
This is a good item and I do not recall if this was reviewed thoroughly (might have missed something)

## 5. SAI API

N/A

## 6. Configuration and management

N/A - no configuration management/changes are required.

### 6.1. Manifest (if the feature is an Application Extension)

N/A

### 6.2. CLI/YANG model Enhancements

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a tool that knows what containers are running privileged mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be, requires a bit of research.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can check by docker inspect container_name | grep Privileged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be applicable for the --privileged flag, but this HLD is also discussing host network.
So while this might work for one of the HLD item (removing containers from privilege mode) it does not scope the whole feature list.
So this command might be a good start, but we should look at a more encompassing approach to check this (and not stop here).

N/A
We are not adding CLI commands or management capabilities to the system with this item.

### 6.3. Config DB Enhancements

N/A - DB should remain the same

## 7. Warmboot and Fastboot Design Impact

No impact on all boot sequences, as this item should be seemlessly integrated into the system and achieve the same functionality level as before.

## 8. Restrictions/Limitations

## 9. Testing Requirements/Design

To define this item completed - we are required to run the full CI and check that nothing has been broken from the changes proposed in this HLD.
In addition - we should test that the mitigations are applicable for the relevant containers.

### 9.1 Unit Test cases

N/A, this feature will be checked on a system level.

### 9.2 System Test cases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any impact w.r.t memory and cpu as it is running in different network name spaces?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Negligible, as it is part of the infra.
I'll be more precise - the impact is as much of how much the Linux ACL takes, but this should not be visible to any user.

For general fucntionality flows- running the same test cases that we currently have on top of our system and verifying that nothing broke.

For adidtional security test cases, we should check that priviliges and network capabilities have been removed.
Net=$HOST removal test:
1. Login to container with removed network capabilities
2. Run ls /dev/
3. Check that we do not have visibility to all network devices (no tty9/8 no sda, etc')

Privilege removal test:
1. Login to container without --privileged flag
2. Check that you cannot access /etc/shadow
3. Check that you cannot perform vim for /boot folder or any file in it

## 10. Open/Action items - if any

Currently, Nvidia and MSFT have scoped commitment for specific containers.
Redis and SNMP already have these adjustments.
Copy link
Collaborator

@venkatmahalingam venkatmahalingam Sep 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if SNMP container wants to listen on some particular interface on the host? container hardening will be disabled by default for now? can this be driven based on FEATURE table config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the SNMP container would like to listen to a specific interface - you can check this:
https://github.com/sonic-net/sonic-buildimage/pull/15176/files

I think that this PR will showcase your specific use-case.

What remains is to perform this container hardening for all other containers in the system so that the whole echo-system will comply to these security hardening requirements.

## Appendix A: Further reading

[Linux Capabilities 101](https://linux-audit.com/linux-capabilities-101/)

[Understanding Linux Capabilities](https://tbhaxor.com/understanding-linux-capabilities/)

[Linux Namespaces Wiki](https://en.wikipedia.org/wiki/Linux_namespaces)

## Appendix B: Linux Capabilities

The following table lists the Linux capability options which are allowed by default and can be dropped.
###### Table 3: Default Linux capabilities
| Capability Key | Capability Description |
| ----------- | ----------- |
| AUDIT_WRITE | Write records to kernel auditing log |
| CHOWN | Make arbitrary changes to file UIDs and GIDs (see chown(2)). |
| DAC_OVERRIDE | Bypass file read, write, and execute permission checks. |
| FOWNER | Bypass permission checks on operations that normally require the file system UID of the process to match the UID of the file. |
| FSETID | Don’t clear set-user-ID and set-group-ID permission bits when a file is modified. |
| KILL | Bypass permission checks for sending signals |
| MKNOD | Create special files using mknod(2). |
| NET_BIND_SERVICE | Bind a socket to internet domain privileged ports (port numbers less than 1024). |
| NET_RAW | Use RAW and PACKET sockets |
| SETFCAP | Set file capabilities |
| SETGID | Make arbitrary manipulations of process GIDs and supplementary GID list. |
| SETPCAP | Modify process capabilities |
| SETUID | Make arbitrary manipulations of process UIDs. |
| SYS_CHROOT | Use chroot(2), change root directory. |

The next table shows the capabilities which are not granted by default and may be added.
###### Table 4: Extended Linux capabilities
| Capability Key | Capability Description |
| ----------- | ----------- |
| AUDIT_CONTROL | Enable and disable kernel auditing; change auditing filter rules; retrieve auditing status and filtering rules. |
| AUDIT_READ | Allow reading the audit log via multicast netlink socket |
| BLOCK_SUSPEND | Allow preventing system suspends. |
| BPF | Allow creating BPF maps, loading BPF Type Format (BTF) data, retrieve JITed code of BPF programs, and more. |
| CHECKPOINT_RESTORE | Allow checkpoint/restore related operations. Introduced in kernel 5.9. |
| DAC_READ_SEARCH | Bypass file read permission checks and directory read and execute permission checks. |
| IPC_LOCK | Lock memory (mlock(2), mlockall(2), mmap(2), shmctl(2)). |
| IPC_OWNER | Bypass permission checks for operations on System V IPC objects. |
| LEASE | Establish leases on arbitrary files (see fcntl(2)). |
| LINUX_IMMUTABLE | Set the FS_APPEND_FL and FS_IMMUTABLE_FL i-node flags. |
| MAC_ADMIN | Allow MAC configuration or state changes. Implemented for the Smack LSM. |
| MAC_OVERRIDE | Override Mandatory Access Control (MAC). Implemented for the Smack Linux Security Module (LSM). |
| NET_ADMIN | Perform various network-related operations. |
| NET_BROADCAST | Make socket broadcasts, and listen to multicasts. |
| PERFMON | Allow system performance and observability privileged operations using perf_events, i915_perf and other kernel subsystems |
| SYS_ADMIN | Perform a range of system administration operations. |
| SYS_BOOT | Use reboot(2) and kexec_load(2), reboot and load a new kernel for later execution. |
| SYS_MODULE | Load and unload kernel modules. |
| SYS_NICE | Raise process nice value (nice(2), setpriority(2)) and change the nice value for arbitrary processes. |
| SYS_PACCT | Use acct(2), switch process accounting on or off. |
| SYS_PTRACE | Trace arbitrary processes using ptrace(2). |
| SYS_RAWIO | Perform I/O port operations (iopl(2) and ioperm(2)). |
| SYS_RESOURCE | Override resource Limits |
| SYS_TIME | Set system clock (settimeofday(2), stime(2), adjtimex(2)); set real-time (hardware) clock. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we know if the appliation does not use SYS_TIME?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't, but if a specific container requires this - it can add it as a specific flag only for this container.

| SYS_TTY_CONFIG | Use vhangup(2); employ various privileged ioctl(2) operations on virtual terminals. |
| SYSLOG | Perform privileged syslog(2) operations. |
| WAKE_ALARM | Trigger something that will wake up the system |