While we wait for Al's answer, it seems to me that filtering CDBs with cgroups is a
pretty natural extension of filtering devices with cgroups. So here is a possible
specification for such a cgroup.
[CCing the libvirt mailing list since they could be one of the first clients]
Paolo
SG_IO Filter Controller ("cdb")
1. Description
The cdb cgroup implements a way to filter allowed SCSI commands according
to one or more Berkeley Packet Filter programs associated with the cgroup
and its parents. BPF programs have access to the CDB and various
ancillary data about the device.
To be allowed, a command must be allowed by at least one program for each
cgroup from the current task's to the root. In addition, as a general
rule it must pass the regular check on privileged commands that is done
even without cgroups. Groups with no programs are handled specially so
that the default configuration is the same as without cgroups.
Privileged tasks may install programs that bypass the usual check on
"dangerous" SCSI commands. Non-privileged tasks in the same cgroup will
also be able to bypass the check, but they may not widen their privileged
abilities beyond what the cgroup already has.
Administrators can replace the current entries, or add new ones. Replacing
the entries in a cgroup will never affect those that are inherited from
the parent. However when a parent cgroup is changed, the new filters
will also apply to the children.
2. Operation
The BPF program can return one of the following values:
* 0: the CDB is denied by this program. Any other programs in the
cgroup are still tried; the SG_IO ioctl fails with EPERM if none of
them allows the CDB.
* 1: the CDB is allowed, but it is still subject to the bitmap that is
used in the absence of cgroups.
* 2: the CDB is allowed, and the generic filter may be bypassed.
Programs that contain a "RET #2" or "RET A" instruction, and so may
return 2, are called privileged in the remainder of this document.
BPF programs used with the cdb cgroup have access to the following
ancillary values:
* ANC_MAJOR (45): the major number of the device
* ANC_MINOR (46): the minor number of the device
* ANC_BLOCK (47): 1 if the device is a block device, 0 if it is a
character device
* ANC_PART (48): the partition number of the device; 0 if it is a
character device
* ANC_MODE (49): one of O_RDONLY/O_WRONLY/O_RDWR depending on how
the file was opened.
* ANC_RAWIO (50): 1 if the current process has CAP_SYS_RAWIO, 0
otherwise.
Evaluation goes through all filters in each cgroup and picks the most
permissive (largest) value. It also goes through all cgroups from the
current task's up to the root, and executes filters in there; but here
it picks the most restrictive value. In other words the result from
multiple filters is "ORed", while the result from multiple cgroups
is "ANDed".
Cgroups with no filters are skipped, with one exception: if the current
task is in a cgroup with no filters, it will behave as if it had this
special filter:
    pseudocode:                         | BPF:
       if capable(CAP_SYS_RAWIO)        |    ANC RAWIO
          return 2                      |    ADD #1
       else                             |    RET A
          return 1                      |
This has two effects:
1) when a non-privileged task is moved from a privileged cgroup to a
new cgroup, it will be subject to the generic filter;
2) when a task is in the root cgroup, and the root cgroup has no
filters, it behaves as if the cdb cgroup did not exist at all.
This maps to the following algorithm:
    privileged = YES
    allowed = YES
    for each cgroup C from the current task cdb cgroup to the root
        if no filters in C
            if C is the current task cdb cgroup
                privileged &= capable(CAP_SYS_RAWIO)
            continue
        privileged_this_cgroup = NO
        allowed_this_cgroup = NO
        for each filter F in C
            ret = run_filter(F, cdb)
            if ret != 0 then
                allowed_this_cgroup = YES
            if ret == 2 then
                privileged_this_cgroup = YES
        privileged &= privileged_this_cgroup
        allowed &= allowed_this_cgroup
    if !allowed then
        return EPERM
    if !privileged then
        test CDB against bitmap
    execute CDB
(Of course some short-circuiting is possible).
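
For illustration only, here is the same algorithm as a small Python model. It is a sketch, not proposed kernel code: filters are modelled as plain callables returning 0, 1 or 2, and the names check_cdb, has_rawio and the string results are made up for this example.

```python
from typing import Callable, List

DENY, ALLOW, ALLOW_PRIVILEGED = 0, 1, 2

# Assumption: a filter is modelled as a callable taking the CDB bytes and
# returning 0, 1 or 2, standing in for a real BPF program.
Filter = Callable[[bytes], int]


def check_cdb(cgroups: List[List[Filter]], cdb: bytes, has_rawio: bool) -> str:
    """Model of the evaluation order described above.

    cgroups[0] holds the filters of the task's own cdb cgroup and
    cgroups[-1] those of the root. Returns "EPERM", "bitmap" (the generic
    bitmap check still applies) or "bypass" (the bitmap check is skipped).
    """
    privileged = True
    allowed = True
    for depth, filters in enumerate(cgroups):
        if not filters:
            # Empty cgroups are skipped, except the task's own cgroup, which
            # acts like the implicit "2 if CAP_SYS_RAWIO else 1" filter.
            if depth == 0:
                privileged = privileged and has_rawio
            continue
        results = [f(cdb) for f in filters]
        # Within one cgroup the results are ORed ...
        allowed_here = any(r != DENY for r in results)
        privileged_here = any(r == ALLOW_PRIVILEGED for r in results)
        # ... and across cgroups they are ANDed.
        allowed = allowed and allowed_here
        privileged = privileged and privileged_here
    if not allowed:
        return "EPERM"
    return "bypass" if privileged else "bitmap"
```

For example, with a filter that returns 2 only for PERSISTENT RESERVE IN/OUT in the task's cgroup and an empty root, an unprivileged task gets "bypass" for opcode 0x5e and "bitmap" for everything else.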
3. User Interface
The cgroup provides three files:
* cdb.filter: entries are modified using this file. Entries are
added if the file was opened with O_APPEND, otherwise they are replaced.
Opening the file with O_TRUNC immediately removes all filters. These
rules are chosen so that shell redirections (including ":>cdb.filter")
will do the right thing.
Adding or replacing programs requires CAP_SYS_ADMIN. Adding
privileged programs *in addition* requires CAP_SYS_RAWIO.
Each entry is one program, represented as a sequence of the following
structure; the whole sequence must be written with a single system call:
    struct bpf_insn {
        u16 code;
        u8 jt;
        u8 jf;
        u32 k;
    };
in the native endianness of the running architecture. A zero-length
write will do nothing if the file was opened with O_APPEND, and
remove all entries if it wasn't.
* cdb.list: entries are retrieved using this file. All filters
are preceded by a 32-bit value counting the number of bpf_insn
structs in the program, and concatenated.
* cdb.priv: returns 1 if the cgroup is privileged (has at least one
privileged filter). This is true if at least one filter includes
a "RET #2" or "RET A" instruction.
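
To make the file formats a bit more concrete, here is a hedged userspace sketch in Python. The helper names (encode_program, decode_list, install_filter) and the path passed to install_filter are hypothetical, and the sketch assumes that the 32-bit count in cdb.list uses the native byte order like the instructions do; the text above does not say so explicitly.

```python
import os
import struct

# Native-endian layout of struct bpf_insn: u16 code, u8 jt, u8 jf, u32 k.
BPF_INSN = struct.Struct("=HBBI")


def encode_program(insns):
    """Pack a list of (code, jt, jf, k) tuples into one cdb.filter entry."""
    return b"".join(BPF_INSN.pack(code, jt, jf, k) for code, jt, jf, k in insns)


def decode_list(data):
    """Split cdb.list contents into programs.

    Assumption: each program is preceded by a native-endian u32 holding
    the number of bpf_insn structs that follow.
    """
    programs, off = [], 0
    while off < len(data):
        (count,) = struct.unpack_from("=I", data, off)
        off += 4
        prog = [BPF_INSN.unpack_from(data, off + i * BPF_INSN.size)
                for i in range(count)]
        off += count * BPF_INSN.size
        programs.append(prog)
    return programs


def install_filter(path, insns, append=True):
    """Write one program with a single write(2) call.

    'path' is a hypothetical location of cdb.filter. With append=True the
    entry is added (O_APPEND); otherwise it replaces the existing entries.
    """
    flags = os.O_WRONLY | (os.O_APPEND if append else 0)
    fd = os.open(path, flags)
    try:
        os.write(fd, encode_program(insns))
    finally:
        os.close(fd)
```

The single os.write() call matters here: splitting a program across several writes would, per the rule above, not describe one entry.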
4. Security
Filters that include the "RET A" or "RET #2" instructions can only
be added by a task that has CAP_SYS_RAWIO; thus only tasks with
CAP_SYS_RAWIO, who could bypass the bitmap themselves, can also
let other processes do so. Such cgroups are marked as privileged;
CAP_SYS_RAWIO is required to attach a process to a privileged cgroup.
The privileged status is visible in the "cdb.priv" file.
While such filters let non-privileged processes and their children
bypass the bitmap, this only holds as long as the non-privileged
process does none of the following operations (which by themselves
require CAP_SYS_ADMIN):
* replace all filters from the cgroup
* create a new sub-cgroup and move itself to it
This is because, in either case, the empty cgroup will behave as if it
had "RET #1".
In addition, new filters added to the cgroup will never widen the
privileged abilities of the process, because filters with "RET #2"
or "RET A" will not be allowed.
5. Examples of filters
5.1. Persistent reservations
This filter lets a program use persistent reservations, plus any
other command that is allowed without CAP_SYS_RAWIO:
            LD_B  0                      ; A = cdb[0]
            JGT   #0x5f, Lpass, 1f       ; pass if > PR OUT
    1:      JGE   #0x5e, Lpr, Lpass      ; pass if < PR IN
    Lpass:  RET   #1                     ; go to bitmap check
    Lpr:    RET   #2                     ; bypass bitmap check
A program could put itself in a new cgroup, add this filter and then
drop CAP_SYS_RAWIO/CAP_SYS_ADMIN.
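
As a sanity check of the jump offsets, the filter can be hand-assembled into (code, jt, jf, k) tuples and run through a toy interpreter. This is only a sketch: the opcode values are the classic BPF ones from <linux/filter.h>, but run_filter is a made-up model that handles just the four instructions used here and does not model out-of-bounds loads.

```python
# Classic BPF opcode values, as defined in <linux/filter.h>.
BPF_LD_B_ABS = 0x30  # BPF_LD | BPF_B | BPF_ABS
BPF_JGT_K = 0x25     # BPF_JMP | BPF_JGT | BPF_K
BPF_JGE_K = 0x35     # BPF_JMP | BPF_JGE | BPF_K
BPF_RET_K = 0x06     # BPF_RET | BPF_K

# jt/jf are offsets relative to the instruction after the jump.
PR_FILTER = [
    (BPF_LD_B_ABS, 0, 0, 0),   # A = cdb[0]
    (BPF_JGT_K, 1, 0, 0x5f),   # > PR OUT: Lpass, else fall through to 1:
    (BPF_JGE_K, 1, 0, 0x5e),   # >= PR IN: Lpr, else Lpass
    (BPF_RET_K, 0, 0, 1),      # Lpass: go to bitmap check
    (BPF_RET_K, 0, 0, 2),      # Lpr: bypass bitmap check
]


def run_filter(prog, cdb):
    """Toy interpreter for the subset of classic BPF used by PR_FILTER."""
    a, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_B_ABS:
            a = cdb[k]
        elif code == BPF_JGT_K:
            pc += jt if a > k else jf
        elif code == BPF_JGE_K:
            pc += jt if a >= k else jf
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError("unsupported opcode 0x%02x" % code)
```

Only opcodes 0x5e and 0x5f reach "RET #2"; everything below or above that range ends at "RET #1".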
5.2. Arbitrary bitmap
This filter could be used as a template to convert a 256-bit bitmap
to a BPF program.
            LD_B  0
            AND   #31
            TAX                          ; X = cdb[0] & 31
            LD    #1
            LSH   X
            TAX                          ; X = 1 << (cdb[0] & 31)
            LD_B  0                      ; A = cdb[0]
            JSET  #128, L1xx, L0xx       ; Decode bit 7 of the opcode
    L0xx:   JSET  #64, L01x, L00x        ; Decode bit 6
    L1xx:   JSET  #64, L11x, L10x
    L00x:   JSET  #32, L001, L000        ; Decode bit 5
    L01x:   JSET  #32, L011, L010
    L10x:   JSET  #32, L101, L100
    L11x:   JSET  #32, L111, L110
    L000:   TXA; JSET #..., Lpass, Lfail ; fill in bitmap values here
    L001:   TXA; JSET #..., Lpass, Lfail
    L010:   TXA; JSET #..., Lpass, Lfail
    L011:   TXA; JSET #..., Lpass, Lfail
    L100:   TXA; JSET #..., Lpass, Lfail
    L101:   TXA; JSET #..., Lpass, Lfail
    L110:   TXA; JSET #..., Lpass, Lfail
    L111:   TXA; JSET #..., Lpass, Lfail
    Lpass:  RET   #1                     ; could also be RET #2
    Lfail:  RET   #0
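
The eight constants to fill in can be derived mechanically from a set of allowed opcodes. The following Python sketch shows one way to do that; bitmap_words and template_result are hypothetical helper names, and the opcodes in the usage note are just sample values.

```python
def bitmap_words(allowed_opcodes):
    """Return the eight 32-bit masks for labels L000 .. L111.

    words[i] is the constant for the label whose three bits spell i, that
    is, for opcodes with (opcode >> 5) == i. Bit (opcode & 31) of that
    word is set when the opcode is allowed.
    """
    words = [0] * 8
    for op in allowed_opcodes:
        if not 0 <= op <= 0xFF:
            raise ValueError("opcode out of range: %r" % (op,))
        words[op >> 5] |= 1 << (op & 31)
    return words


def template_result(words, opcode):
    """Model of the template: 1 for Lpass, 0 for Lfail."""
    x = 1 << (opcode & 31)              # what the program keeps in X
    return 1 if words[opcode >> 5] & x else 0
```

For instance, allowing opcodes 0x00, 0x12 and 0x28 gives 0x40001 for L000 and 0x100 for L001, with the remaining six words left at 0.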