On 01/29/2018 08:03 AM, Michal Privoznik wrote:
On 01/26/2018 04:07 AM, John Ferlan wrote:
>
>
> On 01/18/2018 11:04 AM, Michal Privoznik wrote:
>> This is a definition that holds information on SCSI persistent
>> reservation settings. The XML part looks like this:
>>
>> <reservations enabled='yes' managed='no'>
>> <source type='unix' path='/path/to/qemu-pr-helper.sock' mode='client'/>
>> </reservations>
>>
>> If @managed is set to 'yes' then the <source/> is not parsed.
>> This design was agreed on here:
>>
>> https://www.redhat.com/archives/libvir-list/2017-November/msg01005.html
>>
>> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
>> ---
>> docs/formatdomain.html.in | 25 +++-
>> docs/schemas/domaincommon.rng | 19 +--
>> docs/schemas/storagecommon.rng | 34 +++++
>> src/conf/domain_conf.c | 36 +++++
>> src/libvirt_private.syms | 3 +
>> src/util/virstoragefile.c | 148 +++++++++++++++++++++
>> src/util/virstoragefile.h | 15 +++
>> .../disk-virtio-scsi-reservations-not-managed.xml | 40 ++++++
>> .../disk-virtio-scsi-reservations.xml | 38 ++++++
>> .../disk-virtio-scsi-reservations-not-managed.xml | 1 +
>> .../disk-virtio-scsi-reservations.xml | 1 +
>> tests/qemuxml2xmltest.c | 4 +
>> 12 files changed, 348 insertions(+), 16 deletions(-)
>> create mode 100644 tests/qemuxml2argvdata/disk-virtio-scsi-reservations-not-managed.xml
>> create mode 100644 tests/qemuxml2argvdata/disk-virtio-scsi-reservations.xml
>> create mode 120000 tests/qemuxml2xmloutdata/disk-virtio-scsi-reservations-not-managed.xml
>> create mode 120000 tests/qemuxml2xmloutdata/disk-virtio-scsi-reservations.xml
>>
>
> Before digging too deep into this...
>
> - I assume we're avoiding <disk> iSCSI mainly because those
> reservations would take place elsewhere, safe assumption?
I believe so, but I'll let Paolo answer that. The way I understand
reservations is that qemu needs to issue 'privileged' SCSI commands, and
thus for regular SCSI (which for the purposes of this argument includes
iSCSI emulated by the kernel) either qemu needs CAP_SYS_RAWIO or a helper
process to which it'll pass the FD and which will issue the 'privileged'
SCSI commands on qemu's behalf.
Just trying to cover all the bases... Disks, Hostdevs, Storage Pools,
SCSI, iSCSI, NPIV, etc. A few different ways to configure LUNs.

I think you're right regarding the privileged commands - it's what I
recall from dealing with PR before, with the rawio and sgio properties.
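For anyone following along, this is the sort of 'privileged' command in
question. A rough sketch (not from this series, just an illustration) of
issuing PERSISTENT RESERVE IN (READ KEYS) through SG_IO; whether it works
without CAP_SYS_RAWIO depends on the kernel's command filtering and how
the device node was opened:

/* Rough sketch, not from this series: PERSISTENT RESERVE IN (READ KEYS)
 * issued via SG_IO.  Depending on kernel command filtering and how the
 * device was opened, this may fail with EPERM unless the caller has
 * CAP_SYS_RAWIO - which is the whole reason for the helper process. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    unsigned char cdb[10] = { 0x5e, 0x00 };   /* PR IN, READ KEYS */
    unsigned char buf[512] = { 0 };
    unsigned char sense[32] = { 0 };
    struct sg_io_hdr hdr = { 0 };
    int fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        return 1;

    cdb[7] = sizeof(buf) >> 8;                /* ALLOCATION LENGTH */
    cdb[8] = sizeof(buf) & 0xff;

    hdr.interface_id = 'S';
    hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    hdr.cmdp = cdb;
    hdr.cmd_len = sizeof(cdb);
    hdr.dxferp = buf;
    hdr.dxfer_len = sizeof(buf);
    hdr.sbp = sense;
    hdr.mx_sb_len = sizeof(sense);
    hdr.timeout = 5000;                       /* milliseconds */

    if (ioctl(fd, SG_IO, &hdr) < 0) {
        perror("SG_IO");                      /* EPERM == the privilege problem */
        close(fd);
        return 1;
    }

    printf("PRgeneration: %02x%02x%02x%02x\n", buf[0], buf[1], buf[2], buf[3]);
    close(fd);
    return 0;
}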
>
> - What about using lun's from a storage pool (and what could become
> your favorite, NPIV devices ;-))
>
> <disk type='volume' device='lun'>
> <driver name='qemu' type='raw'/>
> <source pool='sourcepool' volume='unit:0:4:0'/>
> <target dev='sda' bus='scsi'/>
> </disk>
These should work too with my patches (not tested though - I don't have
any real SCSI machine).
I have access to the NPIV environment I use and can let you use it as
well... So that part is not a problem. IIRC, this was a two-pronged
question, but I neglected to go back and be more explicit. The second
half of the question is: when using a storage pool for your LUN, how is
the PR added? The <reservations> element is only being processed for one
very specific disk source type. For reference, see how many times
encryption is added to various RNG definitions - including
diskSourceVolume, which is what I think you'd want in order to process a
'srcpool' definition.

virStorageTranslateDiskSourcePool handles the magic of generating the
disk-like structure used when building the command line. There's also a
'mode' attribute which, if set to 'host' for an iSCSI LUN, will result
in using the "local" filesystem path to the LUN rather than going
through the iSCSI server. I would think in that case one would have to
have the PR set up (if configured, of course).
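To make that thought concrete, a hypothetical sketch (not code from the
series - "src->pr" is an assumed field name for wherever the parsed
<reservations> config ends up, and the function name is made up) of the
kind of check that could only be made after the pool translation, once
the real source type/path are known:

/* Hypothetical sketch only - "src->pr" is an assumed name for wherever
 * the parsed <reservations> config ends up.  The point is the decision
 * can only be made after the pool volume has been translated into its
 * actual (local) source. */
static bool
qemuDiskNeedsPRHelper(const virDomainDiskDef *disk)
{
    return disk->device == VIR_DOMAIN_DISK_DEVICE_LUN &&
           disk->src->pr != NULL &&
           virStorageSourceIsLocalStorage(disk->src);
}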
>
> - What about <hostdev>'s?
>
> <hostdev mode='subsystem' type='scsi'>
>
> but not iSCSI or vHost hostdev's. I think that creates the SCSI
> generic LUN, but it's been a while since I've thought about the
> terminology used for hostdev's...
I think these don't need the feature since qemu can access the device
directly.
Well that'll make life a bit easier. IIRC, this creates a /dev/sgN on
the guest.
>
> I also have this faint recollection of PR related to sgio filtering
> and perhaps even rawio, but dredging that back up could send me down the
> path of some "downstream only" type bz's. Although searching on just
> VIR_DOMAIN_DISK_DEVICE_LUN does bring up qemuSetUnprivSGIO.
>
> And finally... I assume there is one qemu-pr-helper (qemu.conf changes
> eventually)... Eventually there's magic that allows/adds per domain
> *and* per LUN some socket path. If libvirt-provided, it's generated via
> the domain temporary directory; however, what's not really clear is how
> that unmanaged path really works. Need a virtual whiteboard...
So, in the case of an unmanaged path, here are the assumptions that my
patches are built on:

1) the unmanaged helper process (UHP) is spawned by somebody other than
libvirtd (hence unmanaged) - it doesn't have to be a user, it can be
systemd for instance (see the sketch after this list).

2) the path to the UHP's socket has to be labeled correctly - libvirt
doesn't touch that, because it knows nothing about the usage scenario:
whether the sysadmin intended one UHP for the whole host and thus
configured the label that way, or whether it is spawned by a mgmt app
(or systemd, or whomever) per domain, or even per disk. Therefore, we
can do nothing more than shrug shoulders and require users to label the
socket correctly. Or use the managed helper.

3) in the future, when the UHP dies, libvirt will NOT spawn it again.
It's unmanaged after all. It's the user's/sysadmin's responsibility to
spawn it again.
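Regarding (1), a minimal sketch of what that "somebody else" might do -
e.g. a mgmt app spawning qemu-pr-helper on a well-known socket path (the
path here is made up; -k/--socket is the helper's documented option):

/* Sketch of what "somebody other than libvirtd" might do: spawn
 * qemu-pr-helper on a well-known socket path and point the domain XML's
 * <source path='...'/> at it.  The socket path is made up; -k/--socket
 * is qemu-pr-helper's documented option.  Labeling the path correctly
 * remains the caller's problem. */
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    const char *sock = "/run/my-pr-helper/pr.sock";   /* hypothetical path */

    execlp("qemu-pr-helper", "qemu-pr-helper",
           "-k", sock,                  /* listen on this UNIX socket */
           (char *) NULL);

    perror("execlp");                   /* only reached if exec failed */
    return 1;
}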
Now, for the managed helper process (MHP), the assumptions are:

1) there's one MHP per domain (all SCSI disks in the domain share the
same MHP).

2) the MHP runs as root, but is placed into the same CGroup and mount
namespace as the qemu process it serves.

3) the MHP lives and dies with the domain it is associated with.

The code might be more complicated than needed - it is prepared to have
one MHP per disk rather than per domain (should we ever need it).
Therefore, instead of storing a single pid_t, we store the PIDs in a
hash table where more than one can be stored.
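For illustration, a minimal sketch of that (assuming virHashAddEntry()
from src/util/virhash.h and VIR_ALLOC()/VIR_FREE() from viralloc.h; the
function name and keying by disk alias are made up here):

/* Sketch only: box each pid_t and key it by disk alias (or a single
 * fixed key for the one-MHP-per-domain case), so moving to per-disk
 * helpers would not need a different data structure. */
static int
prHelperRegister(virHashTablePtr table, const char *alias, pid_t pid)
{
    pid_t *val;

    if (VIR_ALLOC(val) < 0)
        return -1;

    *val = pid;

    if (virHashAddEntry(table, alias, val) < 0) {
        VIR_FREE(val);
        return -1;
    }

    return 0;
}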
I know the design was agreed upon previously, but nothing like actual
patches to open Pandora's Box again ;-). I might be off in the weeds
here, but taking a step back and typing thoughts...
If the MHP is a QEMU-related process - IOW, something QEMU creates - then
why is it libvirt's responsibility to manage it? Why would/should libvirt
be responsible for restarting it if it goes away? The death would be
noticeable if we were notified that the socket closed, or perhaps if QEMU
sent an event when it noticed. If the daemon was having a problem, then
wouldn't restarting it possibly cause a never-ending fail/restart cycle?
What/who stops that?
Perhaps the UHP model is how we should be treating things - that is
"mostly" not our problem.
Shouldn't the extent of libvirt's "management" be nothing more than
using it and handling when the connection dies via some action defined
for the device that's using it (e.g. domain shutdown, domain pause,
device hot unplug)? Subsequent hot-plugs would need to determine that
the connection to the daemon can be made; otherwise, the attempt fails.
That may need to be something checked after the device add. Defining the
close action on a device-by-device basis allows the domain configurer to
make a choice based on whether it's a boot/system/important disk/LUN or
just something less critical.
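To put a name on that, per-device close actions could map onto something
like this internally (purely hypothetical naming, just mirroring the
usual libvirt enum style):

/* Purely hypothetical naming, just mirroring the usual libvirt enum
 * style for what a per-device "closeaction" could map to internally. */
typedef enum {
    VIR_DOMAIN_SOCKET_CLOSE_ACTION_DEFAULT = 0,
    VIR_DOMAIN_SOCKET_CLOSE_ACTION_SHUTDOWN,   /* shut the domain down */
    VIR_DOMAIN_SOCKET_CLOSE_ACTION_PAUSE,      /* pause the domain */
    VIR_DOMAIN_SOCKET_CLOSE_ACTION_UNPLUG,     /* hot-unplug the device */

    VIR_DOMAIN_SOCKET_CLOSE_ACTION_LAST
} virDomainSocketCloseAction;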
Another random, but related thought... QEMU has made this change to add
this PR daemon to manage something and communicate with libvirt by some
defined socket... Someone else says, great idea, let's copy that and
libvirt will be able to manage it. After all imitation is a form of
flattery. That leaves libvirt trying to fit this new thing into the
existing model - either by liberally copying the same code and changing
the names to protect the innocent or refactoring to generate common
paths for separate technologies.
Still, in the long run, what we have for the current mechanism is a
socket identified by path. So if we consider that the common point, why
not have a way to define sockets to external daemons by path? IOW,
trying to make this generic enough that we can just have a "type" that
would be able to label the kind of socket (e.g. "PR"):
  <disk>...
    <source>...
      <socket type='unix' path='/path/to/socket.sock'
              mode='client|server'
              closeaction='shutdown|pause|unplug'
              [daemon='/path/to/daemon'
               args='list of daemon args']>NAME</socket>
Where:
  the optional daemon & args will be used to spin up a new daemon
  NAME is just some string, but must be UNIQUE amongst other sockets
  (we could store any number of sockets in a hash table)
The thought behind the 'daemon' attribute is to avoid the qemu.conf
changes... The args attribute allows defining a way to start the daemon
in debug mode or however it's desired - it's just a string copied/passed
to the daemon.
I would think it's up to the daemon code to make all the rules and
manage everything including security, cgroups, ACLs, etc. We don't care
if it's a LUN, a disk, an LV, a block device, etc. If the daemon wants
to somehow support migration, then have at it. We probably should be as
hands-off as possible.
OK - rambled on long enough ;-)
John
FWIW: Coverity issues found in the current series...
Patch 9:
* In qemuProcessSetupOnePRDaemon, @pidfile is still NULL if we jump to
cleanup because prHelperName is not executable, causing an unlink(NULL),
which is bad. Also, @cfg is leaked.
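IOW, the cleanup path probably wants something along these lines (only
@pidfile and @cfg are names taken from the patch; the rest is assumed):

/* Sketch of the cleanup path - only @pidfile and @cfg are names from
 * the patch; everything else here is assumed. */
 cleanup:
    if (pidfile) {
        unlink(pidfile);        /* avoids unlink(NULL) */
        VIR_FREE(pidfile);
    }
    virObjectUnref(cfg);        /* safe on NULL, plugs the leak */
    return ret;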
Patch 10:
* In qemuDomainObjPrivateXMLParsePRs, virXPathNodeSet can return -1,
which is stored into an unsigned @n value, so the subsequent for loop
will take a while.
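i.e. keep the virXPathNodeSet() return value in a signed variable and
bail out before looping, roughly (the XPath expression below is only a
placeholder):

/* Sketch: virXPathNodeSet() returns int, so check it before using it
 * as a loop bound.  The XPath expression is only a placeholder. */
xmlNodePtr *nodes = NULL;
int n;
size_t i;

if ((n = virXPathNodeSet("./pr", ctxt, &nodes)) < 0)
    goto error;

for (i = 0; i < n; i++) {
    /* parse nodes[i] ... */
}

VIR_FREE(nodes);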