[libvirt] [RFC PATCH 0/2] Introduce RDT memory bandwidth allocation support

bing.niu＠intel.com

29 May 2018 29 May '18

2:58 a.m.

From: Bing Niu <bing.niu@intel.com> This series is to introduce RDT memory bandwidth allocation support by extending current virresctrl implementation. The Memory Bandwidth Allocation (MBA) feature provides indirect and approximate control over memory bandwidth available per-core. This feature provides a method to control applications which may be over-utilizing bandwidth relative to their priority in environments such as the data-center. The details can be found in Intel's SDM 17.19.7. Kernel supports MBA through resctrl file system same as CAT. Each resctrl group have a MB parameter to control how much memory bandwidth it can utilize in unit of percentage. In this series, MBA is enabled by enhancing existing virresctrl implementation. The policy employed for MBA is similar with CAT: The sum of each MBA group's bandwidth dose not exceed 100%. The enhancement of virresctrl include two parts: Patch 1: Add two new structure virResctrlInfoMB and virResctrlAllocMB for collecting host system MBA capability and domain memory bandwidth allocation. Patch 2: On frontend XML parsing, add new element "llc" in cachetune section for MBA allocation. Bing Niu (2): util: Add memory bandwidth support to resctrl conf: Extend cputune/cachetune to support memory bandwidth allocation src/conf/domain_conf.c | 86 +++++++++- src/util/virresctrl.c | 422 ++++++++++++++++++++++++++++++++++++++++++++++--- src/util/virresctrl.h | 5 + 3 files changed, 487 insertions(+), 26 deletions(-) -- 2.7.4

Show replies by date

bing.niu＠intel.com

29 May 29 May

2:58 a.m.

New subject: [libvirt] [RFC PATCH 1/2] util: Add memory bandwidth support to resctrl

From: Bing Niu <bing.niu@intel.com> Add memory bandwidth allocation support basing on existing virresctrl implementation. Two new structures virResctrlInfoMB and virResctrlAllocMB are introduced. virResctrlInfoMB is used to record host system MBA supporting information, e.g., minimum bandwidth allowed, bandwidth granularity, number of bandwidth controller (same as number of last level cache). virResctrlAllocMB is used for allocating memory bandwidth. Following virResctrlAllocPerType, it also employs a nested sparse array to indicate whether allocation is available for particular llc. Overall, the memory bandwidth allocation follows the same sequence as existing virresctrl cache allocation model. It exposes interfaces for probing host's memory bandwidth capability on initialization time and memory bandwidth allocation on runtime. Signed-off-by: Bing Niu <bing.niu@intel.com> --- src/util/virresctrl.c | 415 +++++++++++++++++++++++++++++++++++++++++++++++--- src/util/virresctrl.h | 5 + 2 files changed, 401 insertions(+), 19 deletions(-) diff --git a/src/util/virresctrl.c b/src/util/virresctrl.c index fc11635..ad050c7 100644 --- a/src/util/virresctrl.c +++ b/src/util/virresctrl.c @@ -86,6 +86,20 @@ struct _virResctrlInfoPerType { virResctrlInfoPerCache control; }; +/* Information about memory bandwidth allocation */ +typedef struct _virResctrlInfoMB virResctrlInfoMB; +typedef virResctrlInfoMB *virResctrlInfoMBPtr; +struct _virResctrlInfoMB { + /* minimum memory bandwidth allowed*/ + unsigned int min_bandwidth; + /* bandwidth granularity */ + unsigned int bandwidth_gran; + /* level number of llc*/ + unsigned int llc; + /* number of last level cache */ + unsigned int max_id; +}; + typedef struct _virResctrlInfoPerLevel virResctrlInfoPerLevel; typedef virResctrlInfoPerLevel *virResctrlInfoPerLevelPtr; struct _virResctrlInfoPerLevel { @@ -97,6 +111,8 @@ struct _virResctrlInfo { virResctrlInfoPerLevelPtr *levels; size_t nlevels; + + virResctrlInfoMBPtr mb_info; }; static virClassPtr virResctrlInfoClass; @@ -126,6 +142,7 @@ virResctrlInfoDispose(void *obj) VIR_FREE(level); } + VIR_FREE(resctrl->mb_info); VIR_FREE(resctrl->levels); } @@ -201,6 +218,18 @@ struct _virResctrlAllocPerType { size_t nmasks; }; +/* + * virResctrlAllocMB represents one memory bandwidth allocation. Since it can have + * multiple last level caches in a NUMA system, it is also represented as a nested + * sparse arrays as virRestrlAllocPerLevel + */ +typedef struct _virResctrlAllocMB virResctrlAllocMB; +typedef virResctrlAllocMB *virResctrlAllocMBPtr; +struct _virResctrlAllocMB { + unsigned int **bandwidth; + size_t nsizes; +}; + typedef struct _virResctrlAllocPerLevel virResctrlAllocPerLevel; typedef virResctrlAllocPerLevel *virResctrlAllocPerLevelPtr; struct _virResctrlAllocPerLevel { @@ -214,7 +243,7 @@ struct _virResctrlAlloc { virResctrlAllocPerLevelPtr *levels; size_t nlevels; - + virResctrlAllocMBPtr mba; /* The identifier (any unique string for now) */ char *id; /* libvirt-generated path in /sys/fs/resctrl for this particular @@ -259,6 +288,13 @@ virResctrlAllocDispose(void *obj) VIR_FREE(level); } + if (resctrl->mba) { + virResctrlAllocMBPtr mba = resctrl->mba; + for (i = 0; i < mba->nsizes; i++) + VIR_FREE(mba->bandwidth[i]); + } + + VIR_FREE(resctrl->mba); VIR_FREE(resctrl->id); VIR_FREE(resctrl->path); VIR_FREE(resctrl->levels); @@ -364,6 +400,9 @@ virResctrlInfoIsEmpty(virResctrlInfoPtr resctrl) if (!resctrl) return true; + if (resctrl->mb_info) + return false; + for (i = 0; i < resctrl->nlevels; i++) { virResctrlInfoPerLevelPtr i_level = resctrl->levels[i]; @@ -379,11 +418,62 @@ virResctrlInfoIsEmpty(virResctrlInfoPtr resctrl) return true; } +static int +virResctrlGetMemoryBandwidthInfo(virResctrlInfoPtr resctrl) +{ + DIR *dirp = NULL; + int ret = -1; + int rv = -1; + virResctrlInfoMBPtr i_mb = NULL; -#ifdef __linux__ + rv = virDirOpenIfExists(&dirp, SYSFS_RESCTRL_PATH "/info"); + if (rv <= 0) { + ret = rv; + goto cleanup; + } + /* query memory bandwidth allocation info */ + if (VIR_ALLOC(i_mb) < 0) { + ret = -1; + goto cleanup; + } + rv = virFileReadValueUint(&i_mb->bandwidth_gran, + SYSFS_RESCTRL_PATH "/info/MB/bandwidth_gran"); + if (rv == -2) { + /* The file doesn't exist, so it's unusable for us, + * properly mba unsupported */ + VIR_WARN("The path '" SYSFS_RESCTRL_PATH "/info/MB/bandwidth_gran'" + "does not exist"); + ret = 0; + goto cleanup; + } else if (rv < 0) { + /* File no existing or other failures are fatal, so just quit */ + goto cleanup; + } -int -virResctrlGetInfo(virResctrlInfoPtr resctrl) + rv = virFileReadValueUint(&i_mb->min_bandwidth, + SYSFS_RESCTRL_PATH "/info/MB/min_bandwidth"); + if (rv == -2) { + /* The file doesn't exist, so it's unusable for us, + * properly mba unsupported */ + VIR_WARN("The path '" SYSFS_RESCTRL_PATH "/info/MB/min_bandwidth'" + "does not exist"); + ret = 0; + goto cleanup; + } else if (rv < 0) { + /* File no existing or other failures are fatal, so just quit */ + goto cleanup; + } + + VIR_STEAL_PTR(resctrl->mb_info, i_mb); + ret = 0; + cleanup: + VIR_DIR_CLOSE(dirp); + VIR_FREE(i_mb); + return ret; +} + +static int +virResctrlGetCacheInfo(virResctrlInfoPtr resctrl) { DIR *dirp = NULL; char *endptr = NULL; @@ -512,6 +602,23 @@ virResctrlGetInfo(virResctrlInfoPtr resctrl) return ret; } +#ifdef __linux__ +int +virResctrlGetInfo(virResctrlInfoPtr resctrl) +{ + int ret; + + ret = virResctrlGetMemoryBandwidthInfo(resctrl); + if (ret < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Get Memory Bandwidth fail")); + goto out; + } + ret = virResctrlGetCacheInfo(resctrl); + out: + return ret; +} + #else /* ! __linux__ */ int @@ -534,12 +641,26 @@ virResctrlInfoGetCache(virResctrlInfoPtr resctrl, { virResctrlInfoPerLevelPtr i_level = NULL; virResctrlInfoPerTypePtr i_type = NULL; + virResctrlInfoMBPtr mb_info = NULL; size_t i = 0; int ret = -1; if (virResctrlInfoIsEmpty(resctrl)) return 0; + /* Let's take the opportunity to update the number of + * last level cache. This number is used to calculate + * free memory bandwidth */ + if (resctrl->mb_info) { + mb_info = resctrl->mb_info; + if (level > mb_info->llc) { + mb_info->llc = level; + mb_info->max_id = 1; + } else if (mb_info->llc == level) { + mb_info->max_id++; + } + } + if (level >= resctrl->nlevels) return 0; @@ -600,6 +721,9 @@ virResctrlAllocIsEmpty(virResctrlAllocPtr resctrl) if (!resctrl) return true; + if (resctrl->mba) + return false; + for (i = 0; i < resctrl->nlevels; i++) { virResctrlAllocPerLevelPtr a_level = resctrl->levels[i]; @@ -855,16 +979,218 @@ virResctrlAllocGetID(virResctrlAllocPtr alloc) } -char * -virResctrlAllocFormat(virResctrlAllocPtr resctrl) +int +virResctrlSetMemoryBandwidth(virResctrlAllocPtr resctrl, + unsigned int id, + unsigned int memory_bandwidth) +{ + virResctrlAllocMBPtr mba = resctrl->mba; + + if (!mba) { + if (VIR_ALLOC(mba) < 0) + return -1; + resctrl->mba = mba; + } + + if (mba->nsizes <= id && + VIR_EXPAND_N(mba->bandwidth, mba->nsizes, + id - mba->nsizes + 1) < 0) + return -1; + + if (mba->bandwidth[id]) { + virReportError(VIR_ERR_XML_ERROR, + _("Collision Memory Bandwidth on node %d"), + id); + return -1; + } + + if (VIR_ALLOC(mba->bandwidth[id]) < 0) + return -1; + + *(mba->bandwidth[id]) = memory_bandwidth; + return 0; +} + + +static int +virResctrlAllocMemoryBandwidthFormat(virResctrlAllocPtr alloc, virBufferPtr buf) +{ + int id; + + if (!alloc->mba) + goto out; + + virBufferAddLit(buf, "MB:"); + + for (id = 0; id < alloc->mba->nsizes; id++) { + if (alloc->mba->bandwidth[id]) { + virBufferAsprintf(buf, "%u=%u;", id, + *(alloc->mba->bandwidth[id])); + } + } + + virBufferTrim(buf, ";", 1); + virBufferAddChar(buf, '\n'); + virBufferCheckError(buf); + out: + return 0; +} + +static int +virResctrlAllocMemoryBandwidth(virResctrlInfoPtr resctrl, + virResctrlAllocPtr alloc, virResctrlAllocPtr free) +{ + int id; + int ret; + virResctrlAllocMBPtr mb_alloc = alloc->mba; + virResctrlAllocMBPtr mb_free = free->mba; + virResctrlInfoMBPtr mb_info = resctrl->mb_info; + + if (!mb_alloc) + return 0; + + if (mb_alloc && !mb_info) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("RDT Memory Bandwidth allocationi" + " unsupported")); + ret = -1; + goto out; + } + + for (id = 0; id < mb_alloc->nsizes; id ++) { + if (!mb_alloc->bandwidth[id]) + continue; + + if (*(mb_alloc->bandwidth[id]) % mb_info->bandwidth_gran) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Memory Bandwidth allocation of size " + "%u is not divisible by granularity %u"), + *(mb_alloc->bandwidth[id]), + mb_info->bandwidth_gran); + ret = -1; + goto out; + } + if (*(mb_alloc->bandwidth[id]) < mb_info->min_bandwidth) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Memory Bandwidth allocation of size " + "%u is smaller than the minimum " + "allowed allocation %u"), + *(mb_alloc->bandwidth[id]), + mb_info->min_bandwidth); + ret = -1; + goto out; + } + if (id >= mb_info->max_id) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("bandwidth controller %u not exist," + " max controller id %u"), + id, mb_info->max_id - 1); + return -1; + } + if (*(mb_alloc->bandwidth[id]) > *(mb_free->bandwidth[id])) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Not enough room for allocation of %u " + "bandwidth for node %u%%, freed %u%%"), + id, *(mb_alloc->bandwidth[id]), + *(mb_free->bandwidth[id])); + return -1; + } + } + ret = 0; + out: + return ret; +} + + +static int +virResctrlAllocParseMemoryBandwidthLine(virResctrlInfoPtr resctrl, + virResctrlAllocPtr alloc, + char *line) +{ + char **mbs = NULL; + char *tmp = NULL; + unsigned int bandwidth; + size_t nmb = 0; + unsigned int id; + size_t i; + int ret; + + /* For no reason there can be spaces */ + virSkipSpaces((const char **) &line); + + if (STRNEQLEN(line, "MB", 2)) + return 0; + + if (!resctrl || !resctrl->mb_info || + (!resctrl->mb_info->min_bandwidth) + || !(resctrl->mb_info->bandwidth_gran)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Missing or inconsistent resctrl info for " + "memory bandwidth allocation")); + } + + if (!alloc->mba) + if (VIR_ALLOC(alloc->mba) < 0) + return -1; + + tmp = strchr(line, ':'); + if (!tmp) + return 0; + tmp++; + + mbs = virStringSplitCount(tmp, ";", 0, &nmb); + if (!nmb) + return 0; + + for (i = 0; i < nmb; i++) { + tmp = strchr(mbs[i], '='); + *tmp = '\0'; + tmp++; + + if (virStrToLong_uip(mbs[i], NULL, 10, &id) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Invalid node id %u "), id); + ret = -1; + goto clean; + } + if (virStrToLong_uip(tmp, NULL, 10, &bandwidth) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Invalid bandwidth %u"), bandwidth); + ret = -1; + goto clean; + } + if (alloc->mba->nsizes <= id && + VIR_EXPAND_N(alloc->mba->bandwidth, alloc->mba->nsizes, + id - alloc->mba->nsizes + 1) < 0) { + ret = -1; + goto clean; + } + if (!alloc->mba->bandwidth[id]) { + if (VIR_ALLOC(alloc->mba->bandwidth[id]) < 0) { + ret = -1; + goto clean; + } + } + + *(alloc->mba->bandwidth[id]) = bandwidth; + } + ret = 0; + clean: + virStringListFree(mbs); + return ret; +} + + +static int +virResctrlAllocCacheFormat(virResctrlAllocPtr resctrl, virBufferPtr buf) { - virBuffer buf = VIR_BUFFER_INITIALIZER; unsigned int level = 0; unsigned int type = 0; unsigned int cache = 0; + int ret = 0; if (!resctrl) - return NULL; + return 0; for (level = 0; level < resctrl->nlevels; level++) { virResctrlAllocPerLevelPtr a_level = resctrl->levels[level]; @@ -878,7 +1204,7 @@ virResctrlAllocFormat(virResctrlAllocPtr resctrl) if (!a_type) continue; - virBufferAsprintf(&buf, "L%u%s:", level, virResctrlTypeToString(type)); + virBufferAsprintf(buf, "L%u%s:", level, virResctrlTypeToString(type)); for (cache = 0; cache < a_type->nmasks; cache++) { virBitmapPtr mask = a_type->masks[cache]; @@ -889,18 +1215,49 @@ virResctrlAllocFormat(virResctrlAllocPtr resctrl) mask_str = virBitmapToString(mask, false, true); if (!mask_str) { - virBufferFreeAndReset(&buf); - return NULL; + ret = -1; + goto out; } - virBufferAsprintf(&buf, "%u=%s;", cache, mask_str); + virBufferAsprintf(buf, "%u=%s;", cache, mask_str); VIR_FREE(mask_str); } - virBufferTrim(&buf, ";", 1); - virBufferAddChar(&buf, '\n'); + virBufferTrim(buf, ";", 1); + virBufferAddChar(buf, '\n'); } } + out: + return ret; +} + + +static void +virResctrlMemoryBandwidthSubstract(virResctrlAllocPtr free, + virResctrlAllocPtr used) +{ + size_t i; + + if (used->mba) { + for (i = 0; i < used->mba->nsizes; i++) { + if (used->mba->bandwidth[i]) + *(free->mba->bandwidth[i]) -= *(used->mba->bandwidth[i]); + } + } +} + + +char * +virResctrlAllocFormat(virResctrlAllocPtr resctrl) +{ + virBuffer buf = VIR_BUFFER_INITIALIZER; + + if (!resctrl) + return NULL; + + virResctrlAllocCacheFormat(resctrl, &buf); + + virResctrlAllocMemoryBandwidthFormat(resctrl, &buf); virBufferCheckError(&buf); return virBufferContentAndReset(&buf); @@ -957,9 +1314,8 @@ virResctrlAllocParseProcessCache(virResctrlInfoPtr resctrl, return ret; } - static int -virResctrlAllocParseProcessLine(virResctrlInfoPtr resctrl, +virResctrlAllocParseCacheLine(virResctrlInfoPtr resctrl, virResctrlAllocPtr alloc, char *line) { @@ -1029,7 +1385,9 @@ virResctrlAllocParse(virResctrlInfoPtr resctrl, lines = virStringSplitCount(schemata, "\n", 0, &nlines); for (i = 0; i < nlines; i++) { - if (virResctrlAllocParseProcessLine(resctrl, alloc, lines[i]) < 0) + if (virResctrlAllocParseMemoryBandwidthLine(resctrl, alloc, lines[i]) < 0) + goto cleanup; + if (virResctrlAllocParseCacheLine(resctrl, alloc, lines[i]) < 0) goto cleanup; } @@ -1165,6 +1523,23 @@ virResctrlAllocNewFromInfo(virResctrlInfoPtr info) } } + if (info->mb_info) { + if (VIR_ALLOC(ret->mba) < 0) + goto error; + + if (info->mb_info->max_id && + VIR_EXPAND_N(ret->mba->bandwidth, ret->mba->nsizes, + info->mb_info->max_id - ret->mba->nsizes) < 0) + goto error; + + for (i = 0; i < ret->mba->nsizes; i++) { + if (VIR_ALLOC(ret->mba->bandwidth[i]) < 0) + goto error; + + *(ret->mba->bandwidth[i]) = 100; + } + } + cleanup: virBitmapFree(mask); return ret; @@ -1228,6 +1603,7 @@ virResctrlAllocGetUnused(virResctrlInfoPtr resctrl) goto error; } + virResctrlMemoryBandwidthSubstract(ret, alloc); virResctrlAllocSubtract(ret, alloc); virObjectUnref(alloc); alloc = NULL; @@ -1389,7 +1765,6 @@ virResctrlAllocFindUnused(virResctrlAllocPtr alloc, return ret; } - static int virResctrlAllocCopyMasks(virResctrlAllocPtr dst, virResctrlAllocPtr src) @@ -1451,6 +1826,9 @@ virResctrlAllocMasksAssign(virResctrlInfoPtr resctrl, if (!alloc_default) goto cleanup; + if (virResctrlAllocMemoryBandwidth(resctrl, alloc, alloc_free) < 0) + goto cleanup; + if (virResctrlAllocCopyMasks(alloc, alloc_default) < 0) goto cleanup; @@ -1524,7 +1902,6 @@ virResctrlAllocDeterminePath(virResctrlAllocPtr alloc, return 0; } - /* This checks if the directory for the alloc exists. If not it tries to create * it and apply appropriate alloc settings. */ int diff --git a/src/util/virresctrl.h b/src/util/virresctrl.h index 5368ba2..82c7ab1 100644 --- a/src/util/virresctrl.h +++ b/src/util/virresctrl.h @@ -89,6 +89,11 @@ virResctrlAllocSetSize(virResctrlAllocPtr resctrl, unsigned long long size); int +virResctrlSetMemoryBandwidth(virResctrlAllocPtr resctrl, + unsigned int id, + unsigned int mb); + +int virResctrlAllocForeachSize(virResctrlAllocPtr resctrl, virResctrlAllocForeachSizeCallback cb, void *opaque); -- 2.7.4

Ján Tomko

2 Jun 2 Jun

6:50 a.m.

New subject: [libvirt] [RFC PATCH 1/2] util: Add memory bandwidth support to resctrl

On Tue, May 29, 2018 at 06:58:02PM +0800, bing.niu@intel.com wrote:

...

From: Bing Niu <bing.niu@intel.com>

Add memory bandwidth allocation support basing on existing virresctrl implementation. Two new structures virResctrlInfoMB and virResctrlAllocMB are introduced.

virResctrlInfoMB is used to record host system MBA supporting information, e.g., minimum bandwidth allowed, bandwidth granularity, number of bandwidth controller (same as number of last level cache).

virResctrlAllocMB is used for allocating memory bandwidth. Following virResctrlAllocPerType, it also employs a nested sparse array to indicate whether allocation is available for particular llc.

Overall, the memory bandwidth allocation follows the same sequence as existing virresctrl cache allocation model. It exposes interfaces for probing host's memory bandwidth capability on initialization time and memory bandwidth allocation on runtime.

Signed-off-by: Bing Niu <bing.niu@intel.com> --- src/util/virresctrl.c | 415 +++++++++++++++++++++++++++++++++++++++++++++++--- src/util/virresctrl.h | 5 + 2 files changed, 401 insertions(+), 19 deletions(-)

diff --git a/src/util/virresctrl.c b/src/util/virresctrl.c index fc11635..ad050c7 100644 --- a/src/util/virresctrl.c +++ b/src/util/virresctrl.c @@ -86,6 +86,20 @@ struct _virResctrlInfoPerType { virResctrlInfoPerCache control; };

+/* Information about memory bandwidth allocation */ +typedef struct _virResctrlInfoMB virResctrlInfoMB; +typedef virResctrlInfoMB *virResctrlInfoMBPtr; +struct _virResctrlInfoMB { + /* minimum memory bandwidth allowed*/ + unsigned int min_bandwidth; + /* bandwidth granularity */ + unsigned int bandwidth_gran;

bandwidth_granularity is more descriptive

...

+ /* level number of llc*/ + unsigned int llc;

cryptic three-letter arconym, you can spell it out

...

+ /* number of last level cache */

Based on the comment, I don't understand the difference from the previous value.

...

+ unsigned int max_id; +}; + typedef struct _virResctrlInfoPerLevel virResctrlInfoPerLevel; typedef virResctrlInfoPerLevel *virResctrlInfoPerLevelPtr; struct _virResctrlInfoPerLevel { @@ -97,6 +111,8 @@ struct _virResctrlInfo {

virResctrlInfoPerLevelPtr *levels; size_t nlevels; + + virResctrlInfoMBPtr mb_info; };

static virClassPtr virResctrlInfoClass; @@ -126,6 +142,7 @@ virResctrlInfoDispose(void *obj) VIR_FREE(level); }

+ VIR_FREE(resctrl->mb_info); VIR_FREE(resctrl->levels); }

@@ -201,6 +218,18 @@ struct _virResctrlAllocPerType { size_t nmasks; };

+/* + * virResctrlAllocMB represents one memory bandwidth allocation. Since it can have + * multiple last level caches in a NUMA system, it is also represented as a nested + * sparse arrays as virRestrlAllocPerLevel + */ +typedef struct _virResctrlAllocMB virResctrlAllocMB; +typedef virResctrlAllocMB *virResctrlAllocMBPtr; +struct _virResctrlAllocMB { + unsigned int **bandwidth; + size_t nsizes; +}; + typedef struct _virResctrlAllocPerLevel virResctrlAllocPerLevel; typedef virResctrlAllocPerLevel *virResctrlAllocPerLevelPtr; struct _virResctrlAllocPerLevel { @@ -214,7 +243,7 @@ struct _virResctrlAlloc {

virResctrlAllocPerLevelPtr *levels; size_t nlevels; - + virResctrlAllocMBPtr mba; /* The identifier (any unique string for now) */ char *id; /* libvirt-generated path in /sys/fs/resctrl for this particular @@ -259,6 +288,13 @@ virResctrlAllocDispose(void *obj) VIR_FREE(level); }

+ if (resctrl->mba) { + virResctrlAllocMBPtr mba = resctrl->mba; + for (i = 0; i < mba->nsizes; i++) + VIR_FREE(mba->bandwidth[i]); + } + + VIR_FREE(resctrl->mba); VIR_FREE(resctrl->id); VIR_FREE(resctrl->path); VIR_FREE(resctrl->levels); @@ -364,6 +400,9 @@ virResctrlInfoIsEmpty(virResctrlInfoPtr resctrl) if (!resctrl) return true;

+ if (resctrl->mb_info) + return false; + for (i = 0; i < resctrl->nlevels; i++) { virResctrlInfoPerLevelPtr i_level = resctrl->levels[i];

@@ -379,11 +418,62 @@ virResctrlInfoIsEmpty(virResctrlInfoPtr resctrl) return true; }

+static int +virResctrlGetMemoryBandwidthInfo(virResctrlInfoPtr resctrl) +{ + DIR *dirp = NULL; + int ret = -1; + int rv = -1; + virResctrlInfoMBPtr i_mb = NULL;

-#ifdef __linux__

This ifdef removal looks suspicious

...

+ rv = virDirOpenIfExists(&dirp, SYSFS_RESCTRL_PATH "/info"); + if (rv <= 0) { + ret = rv; + goto cleanup; + } + /* query memory bandwidth allocation info */ + if (VIR_ALLOC(i_mb) < 0) { + ret = -1;

ret is initialized to -1, no need to set it again

...

+ goto cleanup; + } + rv = virFileReadValueUint(&i_mb->bandwidth_gran, + SYSFS_RESCTRL_PATH "/info/MB/bandwidth_gran"); + if (rv == -2) { + /* The file doesn't exist, so it's unusable for us, + * properly mba unsupported */ + VIR_WARN("The path '" SYSFS_RESCTRL_PATH "/info/MB/bandwidth_gran'" + "does not exist");

Is there a chance the directory SYSFS_RESCTRL_PATH will exist, but not this file? We should not spam the logs with warnings on older kernels.

...

+ ret = 0; + goto cleanup; + } else if (rv < 0) { + /* File no existing or other failures are fatal, so just quit */

This comment conflicts with the above statement.

...

+ goto cleanup; + }

-int -virResctrlGetInfo(virResctrlInfoPtr resctrl) + rv = virFileReadValueUint(&i_mb->min_bandwidth, + SYSFS_RESCTRL_PATH "/info/MB/min_bandwidth"); + if (rv == -2) { + /* The file doesn't exist, so it's unusable for us, + * properly mba unsupported */ + VIR_WARN("The path '" SYSFS_RESCTRL_PATH "/info/MB/min_bandwidth'" + "does not exist"); + ret = 0; + goto cleanup; + } else if (rv < 0) { + /* File no existing or other failures are fatal, so just quit */ + goto cleanup; + } + + VIR_STEAL_PTR(resctrl->mb_info, i_mb); + ret = 0; + cleanup: + VIR_DIR_CLOSE(dirp); + VIR_FREE(i_mb); + return ret; +} + +static int +virResctrlGetCacheInfo(virResctrlInfoPtr resctrl) { DIR *dirp = NULL; char *endptr = NULL; @@ -512,6 +602,23 @@ virResctrlGetInfo(virResctrlInfoPtr resctrl) return ret; }

+#ifdef __linux__ +int +virResctrlGetInfo(virResctrlInfoPtr resctrl) +{ + int ret;

Our convention is to initialize ret to -1 and then overwrite it exactly once before returning from the function, othrewise it gets hard to follow.

...

+ + ret = virResctrlGetMemoryBandwidthInfo(resctrl); + if (ret < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Get Memory Bandwidth fail")); + goto out; + } + ret = virResctrlGetCacheInfo(resctrl); + out: + return ret; +} + #else /* ! __linux__ */

int @@ -534,12 +641,26 @@ virResctrlInfoGetCache(virResctrlInfoPtr resctrl, { virResctrlInfoPerLevelPtr i_level = NULL; virResctrlInfoPerTypePtr i_type = NULL; + virResctrlInfoMBPtr mb_info = NULL; size_t i = 0; int ret = -1;

if (virResctrlInfoIsEmpty(resctrl)) return 0;

+ /* Let's take the opportunity to update the number of + * last level cache. This number is used to calculate + * free memory bandwidth */ + if (resctrl->mb_info) { + mb_info = resctrl->mb_info; + if (level > mb_info->llc) { + mb_info->llc = level; + mb_info->max_id = 1; + } else if (mb_info->llc == level) { + mb_info->max_id++; + } + } + if (level >= resctrl->nlevels) return 0;

@@ -600,6 +721,9 @@ virResctrlAllocIsEmpty(virResctrlAllocPtr resctrl) if (!resctrl) return true;

+ if (resctrl->mba) + return false; + for (i = 0; i < resctrl->nlevels; i++) { virResctrlAllocPerLevelPtr a_level = resctrl->levels[i];

@@ -855,16 +979,218 @@ virResctrlAllocGetID(virResctrlAllocPtr alloc) }

-char * -virResctrlAllocFormat(virResctrlAllocPtr resctrl)

The rename should be in a separate patch, to make the diff easier to read.

...

+int +virResctrlSetMemoryBandwidth(virResctrlAllocPtr resctrl, + unsigned int id, + unsigned int memory_bandwidth) +{ + virResctrlAllocMBPtr mba = resctrl->mba; + + if (!mba) { + if (VIR_ALLOC(mba) < 0) + return -1; + resctrl->mba = mba; + } + + if (mba->nsizes <= id && + VIR_EXPAND_N(mba->bandwidth, mba->nsizes, + id - mba->nsizes + 1) < 0) + return -1; + + if (mba->bandwidth[id]) { + virReportError(VIR_ERR_XML_ERROR, + _("Collision Memory Bandwidth on node %d"), + id); + return -1; + } + + if (VIR_ALLOC(mba->bandwidth[id]) < 0) + return -1; + + *(mba->bandwidth[id]) = memory_bandwidth; + return 0; +} + + +static int +virResctrlAllocMemoryBandwidthFormat(virResctrlAllocPtr alloc, virBufferPtr buf)

Usually we put the output variables first.

...

+{ + int id; + + if (!alloc->mba) + goto out; + + virBufferAddLit(buf, "MB:"); + + for (id = 0; id < alloc->mba->nsizes; id++) { + if (alloc->mba->bandwidth[id]) { + virBufferAsprintf(buf, "%u=%u;", id, + *(alloc->mba->bandwidth[id])); + } + } + + virBufferTrim(buf, ";", 1); + virBufferAddChar(buf, '\n'); + virBufferCheckError(buf); + out:

pointless label

...

+ return 0; +} + +static int +virResctrlAllocMemoryBandwidth(virResctrlInfoPtr resctrl, + virResctrlAllocPtr alloc, virResctrlAllocPtr free) +{ + int id; + int ret; + virResctrlAllocMBPtr mb_alloc = alloc->mba; + virResctrlAllocMBPtr mb_free = free->mba; + virResctrlInfoMBPtr mb_info = resctrl->mb_info; + + if (!mb_alloc) + return 0; + + if (mb_alloc && !mb_info) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("RDT Memory Bandwidth allocationi"

allocation

...

+ " unsupported")); + ret = -1;

ret should be initialized to -1

...

+ goto out; + } + + for (id = 0; id < mb_alloc->nsizes; id ++) { + if (!mb_alloc->bandwidth[id]) + continue; + + if (*(mb_alloc->bandwidth[id]) % mb_info->bandwidth_gran) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Memory Bandwidth allocation of size " + "%u is not divisible by granularity %u"), + *(mb_alloc->bandwidth[id]), + mb_info->bandwidth_gran); + ret = -1; + goto out; + } + if (*(mb_alloc->bandwidth[id]) < mb_info->min_bandwidth) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Memory Bandwidth allocation of size " + "%u is smaller than the minimum " + "allowed allocation %u"), + *(mb_alloc->bandwidth[id]), + mb_info->min_bandwidth); + ret = -1; + goto out; + } + if (id >= mb_info->max_id) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("bandwidth controller %u not exist," + " max controller id %u"), + id, mb_info->max_id - 1); + return -1; + } + if (*(mb_alloc->bandwidth[id]) > *(mb_free->bandwidth[id])) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Not enough room for allocation of %u " + "bandwidth for node %u%%, freed %u%%"), + id, *(mb_alloc->bandwidth[id]), + *(mb_free->bandwidth[id])); + return -1; + } + } + ret = 0; + out:

Pointless label.

...

+ return ret; +} + + +static int +virResctrlAllocParseMemoryBandwidthLine(virResctrlInfoPtr resctrl, + virResctrlAllocPtr alloc, + char *line) +{ + char **mbs = NULL; + char *tmp = NULL; + unsigned int bandwidth; + size_t nmb = 0; + unsigned int id; + size_t i; + int ret; + + /* For no reason there can be spaces */ + virSkipSpaces((const char **) &line); + + if (STRNEQLEN(line, "MB", 2)) + return 0; + + if (!resctrl || !resctrl->mb_info || + (!resctrl->mb_info->min_bandwidth)

|| should be at the end of the line

...

+ || !(resctrl->mb_info->bandwidth_gran)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Missing or inconsistent resctrl info for " + "memory bandwidth allocation")); + } + + if (!alloc->mba) + if (VIR_ALLOC(alloc->mba) < 0) + return -1;

The outer if will need {} around the body. Is there a chance of alloc->mba being non-NULL here?

...

+ + tmp = strchr(line, ':'); + if (!tmp) + return 0; + tmp++; + + mbs = virStringSplitCount(tmp, ";", 0, &nmb); + if (!nmb) + return 0; + + for (i = 0; i < nmb; i++) { + tmp = strchr(mbs[i], '='); + *tmp = '\0'; + tmp++; + + if (virStrToLong_uip(mbs[i], NULL, 10, &id) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Invalid node id %u "), id); + ret = -1;

ret should be initialized to -1 instead of setting it to -1 after every error

...

+ goto clean;

cleanup is the standard label name

...

+ } + if (virStrToLong_uip(tmp, NULL, 10, &bandwidth) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Invalid bandwidth %u"), bandwidth); + ret = -1; + goto clean; + }

...

+ if (alloc->mba->nsizes <= id && + VIR_EXPAND_N(alloc->mba->bandwidth, alloc->mba->nsizes, + id - alloc->mba->nsizes + 1) < 0) { + ret = -1; + goto clean; + } + if (!alloc->mba->bandwidth[id]) { + if (VIR_ALLOC(alloc->mba->bandwidth[id]) < 0) { + ret = -1; + goto clean; + } + } + + *(alloc->mba->bandwidth[id]) = bandwidth;

Why does the array contain pointer to the value, instead of the value itself? Also, VIR_APPEND_ELEMENT is nicer than VIR_EXPAND + manual assignment.

...

+ } + ret = 0; + clean: + virStringListFree(mbs); + return ret; +} + + +static int +virResctrlAllocCacheFormat(virResctrlAllocPtr resctrl, virBufferPtr buf) { - virBuffer buf = VIR_BUFFER_INITIALIZER; unsigned int level = 0; unsigned int type = 0; unsigned int cache = 0; + int ret = 0;

-1 is the default

...

if (!resctrl) - return NULL; + return 0;

The return type change should be in a separate patch.

...

for (level = 0; level < resctrl->nlevels; level++) { virResctrlAllocPerLevelPtr a_level = resctrl->levels[level]; @@ -878,7 +1204,7 @@ virResctrlAllocFormat(virResctrlAllocPtr resctrl) if (!a_type) continue;

- virBufferAsprintf(&buf, "L%u%s:", level, virResctrlTypeToString(type)); + virBufferAsprintf(buf, "L%u%s:", level, virResctrlTypeToString(type));

And so should the switch from local virBuffer to virBufferPtr

...

for (cache = 0; cache < a_type->nmasks; cache++) { virBitmapPtr mask = a_type->masks[cache]; @@ -889,18 +1215,49 @@ virResctrlAllocFormat(virResctrlAllocPtr resctrl)

mask_str = virBitmapToString(mask, false, true); if (!mask_str) { - virBufferFreeAndReset(&buf); - return NULL; + ret = -1; + goto out; }

- virBufferAsprintf(&buf, "%u=%s;", cache, mask_str); + virBufferAsprintf(buf, "%u=%s;", cache, mask_str); VIR_FREE(mask_str); }

- virBufferTrim(&buf, ";", 1); - virBufferAddChar(&buf, '\n'); + virBufferTrim(buf, ";", 1); + virBufferAddChar(buf, '\n'); } } + out:

pointless label

...

+ return ret; +} + + +static void +virResctrlMemoryBandwidthSubstract(virResctrlAllocPtr free, + virResctrlAllocPtr used) +{ + size_t i; + + if (used->mba) { + for (i = 0; i < used->mba->nsizes; i++) { + if (used->mba->bandwidth[i]) + *(free->mba->bandwidth[i]) -= *(used->mba->bandwidth[i]); + } + } +} + + @@ -1389,7 +1765,6 @@ virResctrlAllocFindUnused(virResctrlAllocPtr alloc, return ret; }

-

Unrelated whitespace change.

...

static int virResctrlAllocCopyMasks(virResctrlAllocPtr dst, virResctrlAllocPtr src) @@ -1451,6 +1826,9 @@ virResctrlAllocMasksAssign(virResctrlInfoPtr resctrl, if (!alloc_default) goto cleanup;

+ if (virResctrlAllocMemoryBandwidth(resctrl, alloc, alloc_free) < 0) + goto cleanup; + if (virResctrlAllocCopyMasks(alloc, alloc_default) < 0) goto cleanup;

@@ -1524,7 +1902,6 @@ virResctrlAllocDeterminePath(virResctrlAllocPtr alloc, return 0; }

-

Same here

...

/* This checks if the directory for the alloc exists. If not it tries to create * it and apply appropriate alloc settings. */ int

Jano

bing.niu

3 Jun 3 Jun

7:49 p.m.

New subject: [libvirt] [RFC PATCH 1/2] util: Add memory bandwidth support to resctrl

Hi Jano, Thanks a lot for the review! please see my reply below. On 2018年06月02日 22:50, Ján Tomko wrote:

...

On Tue, May 29, 2018 at 06:58:02PM +0800, bing.niu@intel.com wrote:

...
From: Bing Niu <bing.niu@intel.com>

Add memory bandwidth allocation support basing on existing virresctrl implementation. Two new structures virResctrlInfoMB and virResctrlAllocMB are introduced.

virResctrlInfoMB is used to record host system MBA supporting information, e.g., minimum bandwidth allowed, bandwidth granularity, number of bandwidth controller (same as number of last level cache).

virResctrlAllocMB is used for allocating memory bandwidth. Following virResctrlAllocPerType, it also employs a nested sparse array to indicate whether allocation is available for particular llc.

Overall, the memory bandwidth allocation follows the same sequence as existing virresctrl cache allocation model. It exposes interfaces for probing host's memory bandwidth capability on initialization time and memory bandwidth allocation on runtime.

Signed-off-by: Bing Niu <bing.niu@intel.com> --- src/util/virresctrl.c | 415 +++++++++++++++++++++++++++++++++++++++++++++++--- src/util/virresctrl.h |   5 + 2 files changed, 401 insertions(+), 19 deletions(-)

diff --git a/src/util/virresctrl.c b/src/util/virresctrl.c index fc11635..ad050c7 100644 --- a/src/util/virresctrl.c +++ b/src/util/virresctrl.c @@ -86,6 +86,20 @@ struct _virResctrlInfoPerType {     virResctrlInfoPerCache control; };

+/* Information about memory bandwidth allocation */ +typedef struct _virResctrlInfoMB virResctrlInfoMB; +typedef virResctrlInfoMB *virResctrlInfoMBPtr; +struct _virResctrlInfoMB { +    /* minimum memory bandwidth allowed*/ +    unsigned int min_bandwidth; +    /* bandwidth granularity */ +    unsigned int bandwidth_gran;

bandwidth_granularity is more descriptive will revise in next version

...
+    /* level number of llc*/ +    unsigned int llc;

cryptic three-letter arconym, you can spell it out will update in next version

...
+    /* number of last level cache */

Based on the comment, I don't understand the difference from the previous value. sorry for making confuse here, properly due to fogged comments.

The llc element will record the level of last level cache(usually it's L3 cache, however theoretically we may have different level number of llc). The max_id is used to track how many of llc host have. In a NUMA system, we will have more than one llc. Those two elements are populated inside virResctrlInfoGetCache() when libvirt initialize host cache capability information by virCapabilitiesInitCaches(). We use those information for: 1. Guarantee user allocating memory bandwidth at a valid llc. 2. Create memory allocation info only on valid llc when virResctrlAllocNewFromInfo. Since the virResctrlAllocPtr allocated from virResctrlAllocNewFromInfo is used to calculate unused resource.

...

...
+    unsigned int max_id; +}; + typedef struct _virResctrlInfoPerLevel virResctrlInfoPerLevel; typedef virResctrlInfoPerLevel *virResctrlInfoPerLevelPtr; struct _virResctrlInfoPerLevel { @@ -97,6 +111,8 @@ struct _virResctrlInfo {

    virResctrlInfoPerLevelPtr *levels;     size_t nlevels; + +    virResctrlInfoMBPtr mb_info; };

static virClassPtr virResctrlInfoClass; @@ -126,6 +142,7 @@ virResctrlInfoDispose(void *obj)         VIR_FREE(level);     }

+    VIR_FREE(resctrl->mb_info);     VIR_FREE(resctrl->levels); }

@@ -201,6 +218,18 @@ struct _virResctrlAllocPerType {     size_t nmasks; };

+/* + * virResctrlAllocMB represents one memory bandwidth allocation. Since it can have + * multiple last level caches in a NUMA system, it is also represented as a nested + * sparse arrays as virRestrlAllocPerLevel + */ +typedef struct _virResctrlAllocMB virResctrlAllocMB; +typedef virResctrlAllocMB *virResctrlAllocMBPtr; +struct _virResctrlAllocMB { +    unsigned int **bandwidth; +    size_t nsizes; +}; + typedef struct _virResctrlAllocPerLevel virResctrlAllocPerLevel; typedef virResctrlAllocPerLevel *virResctrlAllocPerLevelPtr; struct _virResctrlAllocPerLevel { @@ -214,7 +243,7 @@ struct _virResctrlAlloc {

    virResctrlAllocPerLevelPtr *levels;     size_t nlevels; - +    virResctrlAllocMBPtr mba;     /* The identifier (any unique string for now) */     char *id;     /* libvirt-generated path in /sys/fs/resctrl for this particular @@ -259,6 +288,13 @@ virResctrlAllocDispose(void *obj)         VIR_FREE(level);     }

+    if (resctrl->mba) { +        virResctrlAllocMBPtr mba = resctrl->mba; +        for (i = 0; i < mba->nsizes; i++) +            VIR_FREE(mba->bandwidth[i]); +    } + +    VIR_FREE(resctrl->mba);     VIR_FREE(resctrl->id);     VIR_FREE(resctrl->path);     VIR_FREE(resctrl->levels); @@ -364,6 +400,9 @@ virResctrlInfoIsEmpty(virResctrlInfoPtr resctrl)     if (!resctrl)         return true;

+    if (resctrl->mb_info) +        return false; +     for (i = 0; i < resctrl->nlevels; i++) {         virResctrlInfoPerLevelPtr i_level = resctrl->levels[i];

@@ -379,11 +418,62 @@ virResctrlInfoIsEmpty(virResctrlInfoPtr resctrl)     return true; }

+static int +virResctrlGetMemoryBandwidthInfo(virResctrlInfoPtr resctrl) +{ +    DIR *dirp = NULL; +    int ret = -1; +    int rv = -1; +    virResctrlInfoMBPtr i_mb = NULL;

-#ifdef __linux__

This ifdef removal looks suspicious

This should be caused by diff format. The pragma is still here. Hopefully we can get rid of that this by separate renaming to a individual patch.

...

...
+ rv = virDirOpenIfExists(&dirp, SYSFS_RESCTRL_PATH "/info"); + if (rv <= 0) { + ret = rv; + goto cleanup; + } + /* query memory bandwidth allocation info */ + if (VIR_ALLOC(i_mb) < 0) { + ret = -1;

ret is initialized to -1, no need to set it again

Will update in v2

...

...
+ goto cleanup; + } + rv = virFileReadValueUint(&i_mb->bandwidth_gran, + SYSFS_RESCTRL_PATH "/info/MB/bandwidth_gran"); + if (rv == -2) { + /* The file doesn't exist, so it's unusable for us, + * properly mba unsupported */ + VIR_WARN("The path '" SYSFS_RESCTRL_PATH "/info/MB/bandwidth_gran'" + "does not exist");

Is there a chance the directory SYSFS_RESCTRL_PATH will exist, but not this file? We should not spam the logs with warnings on older kernels.

Yes. It's possible. if resctrl fs mount successfully. That means system have CAT or MBA or CAT&MBT. Better we change VIR_WARN to VIR_DEBUG for MBA and CAT information query part. How about this? ;)

...

...
+ ret = 0; + goto cleanup; + } else if (rv < 0) { + /* File no existing or other failures are fatal, so just quit */

This comment conflicts with the above statement.

Will update in v2

...

...
+        goto cleanup; +    }

-int -virResctrlGetInfo(virResctrlInfoPtr resctrl) +    rv = virFileReadValueUint(&i_mb->min_bandwidth, +                              SYSFS_RESCTRL_PATH "/info/MB/min_bandwidth"); +    if (rv == -2) { +        /* The file doesn't exist, so it's unusable for us, +         * properly mba unsupported */ +        VIR_WARN("The path '" SYSFS_RESCTRL_PATH "/info/MB/min_bandwidth'" +                 "does not exist"); +        ret = 0; +        goto cleanup; +    } else if (rv < 0) { +        /* File no existing or other failures are fatal, so just quit */ +        goto cleanup; +    } + +    VIR_STEAL_PTR(resctrl->mb_info, i_mb); +    ret = 0; + cleanup: +    VIR_DIR_CLOSE(dirp); +    VIR_FREE(i_mb); +    return ret; +} + +static int +virResctrlGetCacheInfo(virResctrlInfoPtr resctrl) {     DIR *dirp = NULL;     char *endptr = NULL; @@ -512,6 +602,23 @@ virResctrlGetInfo(virResctrlInfoPtr resctrl)     return ret; }

+#ifdef __linux__ +int +virResctrlGetInfo(virResctrlInfoPtr resctrl) +{ +    int ret;

Our convention is to initialize ret to -1 and then overwrite it exactly once before returning from the function, othrewise it gets hard to follow.

Will update in v2

...

...
+ +    ret = virResctrlGetMemoryBandwidthInfo(resctrl); +    if (ret < 0) { +        virReportError(VIR_ERR_INTERNAL_ERROR, "%s", +                       _("Get Memory Bandwidth fail")); +        goto out; +    } +    ret = virResctrlGetCacheInfo(resctrl); + out: +    return ret; +} + #else /* ! __linux__ */

int @@ -534,12 +641,26 @@ virResctrlInfoGetCache(virResctrlInfoPtr resctrl, {     virResctrlInfoPerLevelPtr i_level = NULL;     virResctrlInfoPerTypePtr i_type = NULL; +    virResctrlInfoMBPtr mb_info = NULL;     size_t i = 0;     int ret = -1;

    if (virResctrlInfoIsEmpty(resctrl))         return 0;

+    /* Let's take the opportunity to update the number of +     * last level cache. This number is used to calculate +     * free memory bandwidth */ +    if (resctrl->mb_info) { +        mb_info = resctrl->mb_info; +        if (level > mb_info->llc) { +            mb_info->llc = level; +            mb_info->max_id = 1; +        } else if (mb_info->llc == level) { +            mb_info->max_id++; +        } +    } +     if (level >= resctrl->nlevels)         return 0;

@@ -600,6 +721,9 @@ virResctrlAllocIsEmpty(virResctrlAllocPtr resctrl)     if (!resctrl)         return true;

+    if (resctrl->mba) +        return false; +     for (i = 0; i < resctrl->nlevels; i++) {         virResctrlAllocPerLevelPtr a_level = resctrl->levels[i];

@@ -855,16 +979,218 @@ virResctrlAllocGetID(virResctrlAllocPtr alloc) }

-char * -virResctrlAllocFormat(virResctrlAllocPtr resctrl)

The rename should be in a separate patch, to make the diff easier to read.

Will separate in v2.

...

...
+int +virResctrlSetMemoryBandwidth(virResctrlAllocPtr resctrl, + unsigned int id, + unsigned int memory_bandwidth) +{ + virResctrlAllocMBPtr mba = resctrl->mba; + + if (!mba) { + if (VIR_ALLOC(mba) < 0) + return -1; + resctrl->mba = mba; + } + + if (mba->nsizes <= id && + VIR_EXPAND_N(mba->bandwidth, mba->nsizes, + id - mba->nsizes + 1) < 0) + return -1; + + if (mba->bandwidth[id]) { + virReportError(VIR_ERR_XML_ERROR, + _("Collision Memory Bandwidth on node %d"), + id); + return -1; + } + + if (VIR_ALLOC(mba->bandwidth[id]) < 0) + return -1; + + *(mba->bandwidth[id]) = memory_bandwidth; + return 0; +} + + +static int +virResctrlAllocMemoryBandwidthFormat(virResctrlAllocPtr alloc, virBufferPtr buf)

Usually we put the output variables first.

Will update in v2

...

...
+{ + int id; + + if (!alloc->mba) + goto out; + + virBufferAddLit(buf, "MB:"); + + for (id = 0; id < alloc->mba->nsizes; id++) { + if (alloc->mba->bandwidth[id]) { + virBufferAsprintf(buf, "%u=%u;", id, + *(alloc->mba->bandwidth[id])); + } + } + + virBufferTrim(buf, ";", 1); + virBufferAddChar(buf, '\n'); + virBufferCheckError(buf); + out:

pointless label

Will update in v2

...

...
+ return 0; +} + +static int +virResctrlAllocMemoryBandwidth(virResctrlInfoPtr resctrl, + virResctrlAllocPtr alloc, virResctrlAllocPtr free) +{ + int id; + int ret; + virResctrlAllocMBPtr mb_alloc = alloc->mba; + virResctrlAllocMBPtr mb_free = free->mba; + virResctrlInfoMBPtr mb_info = resctrl->mb_info; + + if (!mb_alloc) + return 0; + + if (mb_alloc && !mb_info) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("RDT Memory Bandwidth allocationi"

allocation

Will update in v2

...

...
+ " unsupported")); + ret = -1;

ret should be initialized to -1

same as above.

...

...
+ goto out; + } + + for (id = 0; id < mb_alloc->nsizes; id ++) { + if (!mb_alloc->bandwidth[id]) + continue; + + if (*(mb_alloc->bandwidth[id]) % mb_info->bandwidth_gran) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Memory Bandwidth allocation of size " + "%u is not divisible by granularity %u"), + *(mb_alloc->bandwidth[id]), + mb_info->bandwidth_gran); + ret = -1; + goto out; + } + if (*(mb_alloc->bandwidth[id]) < mb_info->min_bandwidth) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Memory Bandwidth allocation of size " + "%u is smaller than the minimum " + "allowed allocation %u"), + *(mb_alloc->bandwidth[id]), + mb_info->min_bandwidth); + ret = -1; + goto out; + } + if (id >= mb_info->max_id) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("bandwidth controller %u not exist," + " max controller id %u"), + id, mb_info->max_id - 1); + return -1; + } + if (*(mb_alloc->bandwidth[id]) > *(mb_free->bandwidth[id])) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Not enough room for allocation of %u " + "bandwidth for node %u%%, freed %u%%"), + id, *(mb_alloc->bandwidth[id]), + *(mb_free->bandwidth[id])); + return -1; + } + } + ret = 0; + out:

Pointless label.

Will remove in v2

...

...
+ return ret; +} + + +static int +virResctrlAllocParseMemoryBandwidthLine(virResctrlInfoPtr resctrl, + virResctrlAllocPtr alloc, + char *line) +{ + char **mbs = NULL; + char *tmp = NULL; + unsigned int bandwidth; + size_t nmb = 0; + unsigned int id; + size_t i; + int ret; + + /* For no reason there can be spaces */ + virSkipSpaces((const char **) &line); + + if (STRNEQLEN(line, "MB", 2)) + return 0; + + if (!resctrl || !resctrl->mb_info || + (!resctrl->mb_info->min_bandwidth)

|| should be at the end of the line

Will update in v2

...

...
+ || !(resctrl->mb_info->bandwidth_gran)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Missing or inconsistent resctrl info for " + "memory bandwidth allocation")); + } + + if (!alloc->mba) + if (VIR_ALLOC(alloc->mba) < 0) + return -1;

The outer if will need {} around the body.

Will fix in v2.

...

Is there a chance of alloc->mba being non-NULL here?Theoretically yes.

This function is used to pars string line queried from resctrl file schemata. Each resource is in a individual line. So the function will be invoked for each line. Only first time, We need allocate mba structure. I take a look at RDT kernel driver. In current implementation, MBA information will be formated into one single line, so this function is validly executed only once. If removing above if(!alloc->mba) also work. However, I prefer to keep it to get rid of any assumptions of kernel driver's behavior. How do you think, Jano? :)

...

...
+ + tmp = strchr(line, ':'); + if (!tmp) + return 0; + tmp++; + + mbs = virStringSplitCount(tmp, ";", 0, &nmb); + if (!nmb) + return 0; + + for (i = 0; i < nmb; i++) { + tmp = strchr(mbs[i], '='); + *tmp = '\0'; + tmp++; + + if (virStrToLong_uip(mbs[i], NULL, 10, &id) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Invalid node id %u "), id); + ret = -1;

ret should be initialized to -1 instead of setting it to -1 after every error

Will fix in v2.

...

...
+            goto clean;

cleanup is the standard label name

...
+        } +        if (virStrToLong_uip(tmp, NULL, 10, &bandwidth) < 0) { +            virReportError(VIR_ERR_INTERNAL_ERROR, +                           _("Invalid bandwidth %u"), bandwidth); +            ret = -1; +            goto clean; +        }

...
+        if (alloc->mba->nsizes <= id && +            VIR_EXPAND_N(alloc->mba->bandwidth, alloc->mba->nsizes, +                         id - alloc->mba->nsizes + 1) < 0) { +            ret = -1; +            goto clean; +        } +        if (!alloc->mba->bandwidth[id]) { +            if (VIR_ALLOC(alloc->mba->bandwidth[id]) < 0) { +                ret = -1; +                goto clean; +            } +        } + +        *(alloc->mba->bandwidth[id]) = bandwidth;

Why does the array contain pointer to the value, instead of the value itself?

This is referenced from virresctrl CAT part. It is possible allocation only happened in some of llc. If the pointer is NULL, that means there is no memory allocation on that llc.

...

Also, VIR_APPEND_ELEMENT is nicer than VIR_EXPAND + manual assignment.

Will fix in v2.

...

...
+ } + ret = 0; + clean: + virStringListFree(mbs); + return ret; +} + + +static int +virResctrlAllocCacheFormat(virResctrlAllocPtr resctrl, virBufferPtr buf) { - virBuffer buf = VIR_BUFFER_INITIALIZER; unsigned int level = 0; unsigned int type = 0; unsigned int cache = 0; + int ret = 0;

-1 is the default

...
if (!resctrl) - return NULL; + return 0;

The return type change should be in a separate patch.

Will fix in v2.

...

...
    for (level = 0; level < resctrl->nlevels; level++) {         virResctrlAllocPerLevelPtr a_level = resctrl->levels[level]; @@ -878,7 +1204,7 @@ virResctrlAllocFormat(virResctrlAllocPtr resctrl)             if (!a_type)                 continue;

-            virBufferAsprintf(&buf, "L%u%s:", level, virResctrlTypeToString(type)); +            virBufferAsprintf(buf, "L%u%s:", level, virResctrlTypeToString(type));

And so should the switch from local virBuffer to virBufferPtr

...
            for (cache = 0; cache < a_type->nmasks; cache++) {                 virBitmapPtr mask = a_type->masks[cache]; @@ -889,18 +1215,49 @@ virResctrlAllocFormat(virResctrlAllocPtr resctrl)

                mask_str = virBitmapToString(mask, false, true);                 if (!mask_str) { -                    virBufferFreeAndReset(&buf); -                    return NULL; +                    ret = -1; +                    goto out;                 }

-                virBufferAsprintf(&buf, "%u=%s;", cache, mask_str); +                virBufferAsprintf(buf, "%u=%s;", cache, mask_str);                 VIR_FREE(mask_str);             }

-            virBufferTrim(&buf, ";", 1); -            virBufferAddChar(&buf, '\n'); +            virBufferTrim(buf, ";", 1); +            virBufferAddChar(buf, '\n');         }     } + out:

pointless label Will fix in v2.

...
+    return ret; +} + + +static void +virResctrlMemoryBandwidthSubstract(virResctrlAllocPtr free, +                                  virResctrlAllocPtr used) +{ +    size_t i; + +    if (used->mba) { +        for (i = 0; i < used->mba->nsizes; i++) { +            if (used->mba->bandwidth[i]) +                *(free->mba->bandwidth[i]) -= *(used->mba->bandwidth[i]); +        } +    } +} + + @@ -1389,7 +1765,6 @@ virResctrlAllocFindUnused(virResctrlAllocPtr alloc,     return ret; }

-

Unrelated whitespace change. Will fix in v2.

...
static int virResctrlAllocCopyMasks(virResctrlAllocPtr dst,                          virResctrlAllocPtr src) @@ -1451,6 +1826,9 @@ virResctrlAllocMasksAssign(virResctrlInfoPtr resctrl,     if (!alloc_default)         goto cleanup;

+    if (virResctrlAllocMemoryBandwidth(resctrl, alloc, alloc_free) < 0) +        goto cleanup; +     if (virResctrlAllocCopyMasks(alloc, alloc_default) < 0)         goto cleanup;

@@ -1524,7 +1902,6 @@ virResctrlAllocDeterminePath(virResctrlAllocPtr alloc,     return 0; }

-

Same here Will fix in v2.

...
/* This checks if the directory for the alloc exists. If not it tries to create * it and apply appropriate alloc settings. */ int

Jano

bing.niu＠intel.com

29 May 29 May

2:58 a.m.

New subject: [libvirt] [RFC PATCH 2/2] conf: Extend cputune/cachetune to support memory bandwidth allocation

From: Bing Niu <bing.niu@intel.com> Extend current cachetune section to support memory bandwidth allocation. Add a new cachetune element llc for memory allocation. As the example below: <cachetune vcpus='0'> <llc id='0' bandwidth='30'/> </cachetune> id --- on which last level cache memory bandwidth to be set bandwidth --- the memory bandwidth percent to set. Signed-off-by: Bing Niu <bing.niu@intel.com> --- src/conf/domain_conf.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++---- src/util/virresctrl.c | 7 ++++ 2 files changed, 86 insertions(+), 7 deletions(-) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index d6ac47c..aba998d 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -18578,6 +18578,55 @@ virDomainDefParseBootOptions(virDomainDefPtr def, return ret; } +static int +virDomainCachetuneDefParseMemoryBandwidth(xmlXPathContextPtr ctxt, + xmlNodePtr node, + virResctrlAllocPtr alloc) +{ + xmlNodePtr oldnode = ctxt->node; + unsigned int id; + unsigned int bandwidth; + char *tmp = NULL; + int ret = -1; + + ctxt->node = node; + + tmp = virXMLPropString(node, "id"); + if (!tmp) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Missing cachetune attribute 'id'")); + goto cleanup; + } + if (virStrToLong_uip(tmp, NULL, 10, &id) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid cachetune attribute 'id' value '%s'"), + tmp); + goto cleanup; + } + VIR_FREE(tmp); + + tmp = virXMLPropString(node, "bandwidth"); + if (!tmp) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Missing cachetune attribute 'bandwidth'")); + goto cleanup; + } + if (virStrToLong_uip(tmp, NULL, 10, &bandwidth) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid cachetune attribute 'bandwidth' value '%s'"), + tmp); + goto cleanup; + } + VIR_FREE(tmp); + if (virResctrlSetMemoryBandwidth(alloc, id, bandwidth) < 0) + goto cleanup; + + ret = 0; + cleanup: + ctxt->node = oldnode; + VIR_FREE(tmp); + return ret; +} static int virDomainCachetuneDefParseCache(xmlXPathContextPtr ctxt, @@ -18670,6 +18719,7 @@ virDomainCachetuneDefParse(virDomainDefPtr def, ssize_t i = 0; int n; int ret = -1; + bool mba_available = false; ctxt->node = node; @@ -18701,12 +18751,26 @@ virDomainCachetuneDefParse(virDomainDefPtr def, goto cleanup; } + if ((n = virXPathNodeSet("./llc", ctxt, &nodes)) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot extract Memory Bandwidth nodes under " + "cachetune. try cache allocation")); + } + + for (i = 0; i < n; i++) { + if (virDomainCachetuneDefParseMemoryBandwidth(ctxt, nodes[i], alloc) < 0) + goto cleanup; + } + + if (n) + mba_available = true; + if ((n = virXPathNodeSet("./cache", ctxt, &nodes)) < 0) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("Cannot extract cache nodes under cachetune")); - goto cleanup; + if (!mba_available) + goto cleanup; } - for (i = 0; i < n; i++) { if (virDomainCachetuneDefParseCache(ctxt, nodes[i], alloc) < 0) goto cleanup; @@ -26394,11 +26458,19 @@ virDomainCachetuneDefFormatHelper(unsigned int level, virBufferPtr buf = opaque; unsigned long long short_size = virFormatIntPretty(size, &unit); - virBufferAsprintf(buf, - "<cache id='%u' level='%u' type='%s' " - "size='%llu' unit='%s'/>\n", - cache, level, virCacheTypeToString(type), - short_size, unit); + /* If type is VIR_CACHE_TYPE_LAST, this means it's a memory + * bandwidth allocation formatting request */ + if (type == VIR_CACHE_TYPE_LAST) + virBufferAsprintf(buf, + "<llc id='%u' bandwidth='%llu'/>\n", + cache, size); + + else + virBufferAsprintf(buf, + "<cache id='%u' level='%u' type='%s' " + "size='%llu' unit='%s'/>\n", + cache, level, virCacheTypeToString(type), + short_size, unit); return 0; } diff --git a/src/util/virresctrl.c b/src/util/virresctrl.c index ad050c7..c720074 100644 --- a/src/util/virresctrl.c +++ b/src/util/virresctrl.c @@ -954,6 +954,13 @@ virResctrlAllocForeachSize(virResctrlAllocPtr resctrl, } } + if (resctrl->mba) { + virResctrlAllocMBPtr mba = resctrl->mba; + for (cache = 0; cache < mba->nsizes; cache++) + if (mba->bandwidth[cache]) + cb(0, VIR_CACHE_TYPE_LAST, cache, *mba->bandwidth[cache], opaque); + } + return 0; } -- 2.7.4

Ján Tomko

2 Jun 2 Jun

6:54 a.m.

New subject: [libvirt] [RFC PATCH 2/2] conf: Extend cputune/cachetune to support memory bandwidth allocation

On Tue, May 29, 2018 at 06:58:03PM +0800, bing.niu@intel.com wrote:

...

From: Bing Niu <bing.niu@intel.com>

Extend current cachetune section to support memory bandwidth allocation. Add a new cachetune element llc for memory allocation. As the example below:

<cachetune vcpus='0'> <llc id='0' bandwidth='30'/> </cachetune>

id --- on which last level cache memory bandwidth to be set bandwidth --- the memory bandwidth percent to set.

Signed-off-by: Bing Niu <bing.niu@intel.com> --- src/conf/domain_conf.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++---- src/util/virresctrl.c | 7 ++++ 2 files changed, 86 insertions(+), 7 deletions(-)

diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index d6ac47c..aba998d 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -18578,6 +18578,55 @@ virDomainDefParseBootOptions(virDomainDefPtr def, return ret; }

static int virDomainCachetuneDefParseCache(xmlXPathContextPtr ctxt, @@ -18670,6 +18719,7 @@ virDomainCachetuneDefParse(virDomainDefPtr def, ssize_t i = 0; int n; int ret = -1; + bool mba_available = false;

ctxt->node = node;

@@ -18701,12 +18751,26 @@ virDomainCachetuneDefParse(virDomainDefPtr def, goto cleanup; }

+ if ((n = virXPathNodeSet("./llc", ctxt, &nodes)) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot extract Memory Bandwidth nodes under " + "cachetune. try cache allocation")); + } + + for (i = 0; i < n; i++) { + if (virDomainCachetuneDefParseMemoryBandwidth(ctxt, nodes[i], alloc) < 0) + goto cleanup; + } + + if (n) + mba_available = true; + if ((n = virXPathNodeSet("./cache", ctxt, &nodes)) < 0) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("Cannot extract cache nodes under cachetune")); - goto cleanup; + if (!mba_available) + goto cleanup;

Is it okay to skip cache parsing in that case?

...

} - for (i = 0; i < n; i++) { if (virDomainCachetuneDefParseCache(ctxt, nodes[i], alloc) < 0) goto cleanup; @@ -26394,11 +26458,19 @@ virDomainCachetuneDefFormatHelper(unsigned int level, virBufferPtr buf = opaque; unsigned long long short_size = virFormatIntPretty(size, &unit);

- virBufferAsprintf(buf, - "<cache id='%u' level='%u' type='%s' " - "size='%llu' unit='%s'/>\n", - cache, level, virCacheTypeToString(type), - short_size, unit); + /* If type is VIR_CACHE_TYPE_LAST, this means it's a memory + * bandwidth allocation formatting request */

_TYPE_LAST values are sentinels used to iterate over all enum values and should never be given any other meaning. Jano

...

+ if (type == VIR_CACHE_TYPE_LAST) + virBufferAsprintf(buf, + "<llc id='%u' bandwidth='%llu'/>\n", + cache, size); +

bing.niu

3 Jun 3 Jun

8:11 p.m.

New subject: [libvirt] [RFC PATCH 2/2] conf: Extend cputune/cachetune to support memory bandwidth allocation

Hi Jano, Thanks for review, please see my response. On 2018年06月02日 22:54, Ján Tomko wrote:

...

On Tue, May 29, 2018 at 06:58:03PM +0800, bing.niu@intel.com wrote:

...
From: Bing Niu <bing.niu@intel.com>

Extend current cachetune section to support memory bandwidth allocation. Add a new cachetune element llc for memory allocation. As the example below:

       <cachetune vcpus='0'>                <llc id='0' bandwidth='30'/>        </cachetune>

id        --- on which last level cache memory bandwidth to be set bandwidth --- the memory bandwidth percent to set.

Signed-off-by: Bing Niu <bing.niu@intel.com> --- src/conf/domain_conf.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++---- src/util/virresctrl.c | 7 ++++ 2 files changed, 86 insertions(+), 7 deletions(-)

diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index d6ac47c..aba998d 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -18578,6 +18578,55 @@ virDomainDefParseBootOptions(virDomainDefPtr def,     return ret; }

static int virDomainCachetuneDefParseCache(xmlXPathContextPtr ctxt, @@ -18670,6 +18719,7 @@ virDomainCachetuneDefParse(virDomainDefPtr def,     ssize_t i = 0;     int n;     int ret = -1; +    bool mba_available = false;

    ctxt->node = node;

@@ -18701,12 +18751,26 @@ virDomainCachetuneDefParse(virDomainDefPtr def,         goto cleanup;     }

+    if ((n = virXPathNodeSet("./llc", ctxt, &nodes)) < 0) { +        virReportError(VIR_ERR_INTERNAL_ERROR, "%s", +                       _("Cannot extract Memory Bandwidth nodes under " +                         "cachetune. try cache allocation")); +    } + +    for (i = 0; i < n; i++) { +        if (virDomainCachetuneDefParseMemoryBandwidth(ctxt, nodes[i], alloc) < 0) +            goto cleanup; +    } + +    if (n) +        mba_available = true; +     if ((n = virXPathNodeSet("./cache", ctxt, &nodes)) < 0) {         virReportError(VIR_ERR_INTERNAL_ERROR, "%s",                        _("Cannot extract cache nodes under cachetune")); -        goto cleanup; +        if (!mba_available) +            goto cleanup;

Is it okay to skip cache parsing in that case?

Yes, but we need vcpu_str part below the cache parsing part. I saw the vcpu_str is used to create folder in virresctrl for a rdt group. Maybe we can add inline wrapper functions to skip cache parsing part. How do you think?

...

...
} - for (i = 0; i < n; i++) { if (virDomainCachetuneDefParseCache(ctxt, nodes[i], alloc) < 0) goto cleanup; @@ -26394,11 +26458,19 @@ virDomainCachetuneDefFormatHelper(unsigned int level, virBufferPtr buf = opaque; unsigned long long short_size = virFormatIntPretty(size, &unit);

- virBufferAsprintf(buf, - "<cache id='%u' level='%u' type='%s' " - "size='%llu' unit='%s'/>\n", - cache, level, virCacheTypeToString(type), - short_size, unit); + /* If type is VIR_CACHE_TYPE_LAST, this means it's a memory + * bandwidth allocation formatting request */

_TYPE_LAST values are sentinels used to iterate over all enum values and should never be given any other meaning.

hmm. Let me adjust callback functions to handle this explicitly.

...

Jano

...
+ if (type == VIR_CACHE_TYPE_LAST) + virBufferAsprintf(buf, + "<llc id='%u' bandwidth='%llu'/>\n", + cache, size); +

bing.niu

2 Jun 2 Jun

4:11 a.m.

ping for this series..... thanks a lot bing On 2018年05月29日 18:58, bing.niu@intel.com wrote:

...

From: Bing Niu <bing.niu@intel.com>

This series is to introduce RDT memory bandwidth allocation support by extending current virresctrl implementation.

The Memory Bandwidth Allocation (MBA) feature provides indirect and approximate control over memory bandwidth available per-core. This feature provides a method to control applications which may be over-utilizing bandwidth relative to their priority in environments such as the data-center. The details can be found in Intel's SDM 17.19.7. Kernel supports MBA through resctrl file system same as CAT. Each resctrl group have a MB parameter to control how much memory bandwidth it can utilize in unit of percentage.

In this series, MBA is enabled by enhancing existing virresctrl implementation. The policy employed for MBA is similar with CAT: The sum of each MBA group's bandwidth dose not exceed 100%. The enhancement of virresctrl include two parts:

Patch 1: Add two new structure virResctrlInfoMB and virResctrlAllocMB for collecting host system MBA capability and domain memory bandwidth allocation.

Patch 2: On frontend XML parsing, add new element "llc" in cachetune section for MBA allocation.

Bing Niu (2): util: Add memory bandwidth support to resctrl conf: Extend cputune/cachetune to support memory bandwidth allocation

src/conf/domain_conf.c | 86 +++++++++- src/util/virresctrl.c | 422 ++++++++++++++++++++++++++++++++++++++++++++++--- src/util/virresctrl.h | 5 + 3 files changed, 487 insertions(+), 26 deletions(-)

Pavel Hrdina

5 Jun 5 Jun

4:11 a.m.

On Tue, May 29, 2018 at 06:58:01PM +0800, bing.niu@intel.com wrote:

...

From: Bing Niu <bing.niu@intel.com>

This series is to introduce RDT memory bandwidth allocation support by extending current virresctrl implementation.

The Memory Bandwidth Allocation (MBA) feature provides indirect and approximate control over memory bandwidth available per-core. This feature provides a method to control applications which may be over-utilizing bandwidth relative to their priority in environments such as the data-center. The details can be found in Intel's SDM 17.19.7. Kernel supports MBA through resctrl file system same as CAT. Each resctrl group have a MB parameter to control how much memory bandwidth it can utilize in unit of percentage.

In this series, MBA is enabled by enhancing existing virresctrl implementation. The policy employed for MBA is similar with CAT: The sum of each MBA group's bandwidth dose not exceed 100%. The enhancement of virresctrl include two parts:

Patch 1: Add two new structure virResctrlInfoMB and virResctrlAllocMB for collecting host system MBA capability and domain memory bandwidth allocation.

Patch 2: On frontend XML parsing, add new element "llc" in cachetune section for MBA allocation.

Hi, Thanks for the patches. Before we start with the actual implementation it would be nice to agree on the design. ------------------------------------------------------------------------ So first point is that we should do it similarly as the cache allocation, we will not allow to "share" the bandwidth so the sum should be 100% as you already have that in your patches, but we need to do it in a way that in the future we can allow to "share" the bandwidth. Second point is how the XML will look like. There are two parts, one is the capabilities XML and second one is domain XML. It looks like that your patches don't expose any information in capabilities, we should do that in order to let management applications know that the feature is available and what are the possible values that they can use. ------------------------------------------------------------------------ I've tried to configure MBA on one machine that I have access to witch has this cpu: 'Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz' and it behaves strangely. If I configure 'schemata' the output of 'pqos -s' command is in some situations different: schemata pqos -s output MB:0=10 MBA COS0 => 10% available MB:0=20 MBA COS0 => 20% available MB:0=30 MBA COS0 => 30% available MB:0=40 MBA COS0 => 40% available MB:0=50 MBA COS0 => 50% available MB:0=60 MBA COS0 => 60% available MB:0=70 MBA COS0 => 90% available MB:0=80 MBA COS0 => 90% available MB:0=90 MBA COS0 => 90% available MB:0=100 MBA COS0 => 100% available If you look at the table you can see that for values 70-90 the pqos shows that the available bandwidth is 90%. Tested using Fedora 28: kernel-4.16.13-300.fc28.x86_64 intel-cmt-cat-1.2.0-2.fc28.x86_64 ------------------------------------------------------------------------ Since CAT (cache allocation technology) and MBA (memory bandwidth allocation) are unrelated and they are controlling different limitation we should not group MBA together with CAT in our XML files. From poor documentation it looks like that MBA is related to memory controller. Currently the cache allocation in capabilities XML is reported like this: <capabilities> <host> ... <cache> <bank id='0' level='3' type='both' size='30720' unit='KiB' cpus='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46'> <control granularity='1536' unit='KiB' type='code' maxAllocs='8'/> <control granularity='1536' unit='KiB' type='data' maxAllocs='8'/> </bank> <bank id='1' level='3' type='both' size='30720' unit='KiB' cpus='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47'> <control granularity='1536' unit='KiB' type='code' maxAllocs='8'/> <control granularity='1536' unit='KiB' type='data' maxAllocs='8'/> </bank> </cache> ... </host> </capabilities> So the possible capabilities XML could look like this: <capabilities> <host> ... <memory> <bank id='0' cpus='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46'> <control granularity='10' maxAllocs='8'/> </bank> <bank id='1' cpus='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47'> <control granularity='10' maxAllocs='8'/> </bank> </memory> ... </host> </capabilities> The element names 'memory' and 'bank' can be named differently, suggestions are welcome. Then there is the domain XML, for CAT we use this: <domain> ... <cputune> ... <cachetune vcpus='0-3'> <cache id='0' level='3' type='both' size='3' unit='MiB'/> <cache id='1' level='3' type='both' size='3' unit='MiB'/> </cachetune> ... <cputune> ... </domain> so the possible domain XML could look like this: <domain> ... <cputune> ... <memory vcpus='0-3'> <socket id='0' bandwidth='30'/> <socket id='1' bandwidth='20'/> </memory> ... <cputune> ... </domain> Again, the element names 'memory' and 'socket' can be named differently. Pavel

bing.niu

9:56 p.m.

Hi Pavel, Thanks for your valuable inputs here. please see my respond. On 2018年06月05日 20:11, Pavel Hrdina wrote:

...

On Tue, May 29, 2018 at 06:58:01PM +0800, bing.niu@intel.com wrote:

...
From: Bing Niu <bing.niu@intel.com>

This series is to introduce RDT memory bandwidth allocation support by extending current virresctrl implementation.

The Memory Bandwidth Allocation (MBA) feature provides indirect and approximate control over memory bandwidth available per-core. This feature provides a method to control applications which may be over-utilizing bandwidth relative to their priority in environments such as the data-center. The details can be found in Intel's SDM 17.19.7. Kernel supports MBA through resctrl file system same as CAT. Each resctrl group have a MB parameter to control how much memory bandwidth it can utilize in unit of percentage.

In this series, MBA is enabled by enhancing existing virresctrl implementation. The policy employed for MBA is similar with CAT: The sum of each MBA group's bandwidth dose not exceed 100%. The enhancement of virresctrl include two parts:

Patch 1: Add two new structure virResctrlInfoMB and virResctrlAllocMB for collecting host system MBA capability and domain memory bandwidth allocation.

Patch 2: On frontend XML parsing, add new element "llc" in cachetune section for MBA allocation.

Hi,

Thanks for the patches. Before we start with the actual implementation it would be nice to agree on the design. Total agree. The RFC code acts as baseline for discuss.

------------------------------------------------------------------------

So first point is that we should do it similarly as the cache allocation, we will not allow to "share" the bandwidth so the sum should be 100% as you already have that in your patches, but we need to do it in a way that in the future we can allow to "share" the bandwidth. Yes, the memory bandwidth allocation policy is derived from existing CAT in libvirt. no share or overlap. In the patch, I follow the existing CAT behavior. When allocating memory bandwidth. First, calculate the unused memory bandwidth by subtracting all existing RDT groups. If we want to enable memory bandwidth sharing. We can just simply skip this part and do allocation directly. Could this fit your comment " we need to do it in a way that in the future we can allow to "share" the bandwidth."? If there is anything missing or my understanding incorrect, Please point me out. :)

...

Second point is how the XML will look like. There are two parts, one is the capabilities XML and second one is domain XML.

It looks like that your patches don't expose any information in capabilities, we should do that in order to let management applications know that the feature is available and what are the possible values that they can use.

------------------------------------------------------------------------

I've tried to configure MBA on one machine that I have access to witch has this cpu: 'Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz' and it behaves strangely. If I configure 'schemata' the output of 'pqos -s' command is in some situations different:

schemata pqos -s output

MB:0=10 MBA COS0 => 10% available MB:0=20 MBA COS0 => 20% available MB:0=30 MBA COS0 => 30% available MB:0=40 MBA COS0 => 40% available MB:0=50 MBA COS0 => 50% available MB:0=60 MBA COS0 => 60% available MB:0=70 MBA COS0 => 90% available MB:0=80 MBA COS0 => 90% available MB:0=90 MBA COS0 => 90% available MB:0=100 MBA COS0 => 100% available

If you look at the table you can see that for values 70-90 the pqos shows that the available bandwidth is 90%.

Tested using Fedora 28: kernel-4.16.13-300.fc28.x86_64 intel-cmt-cat-1.2.0-2.fc28.x86_64

hmm.., that is strange. I directly manipulate resctrl fs. So I didn't hit such kind of issue. I will take a look at this pqos package and let you know.

...

------------------------------------------------------------------------

Since CAT (cache allocation technology) and MBA (memory bandwidth allocation) are unrelated and they are controlling different limitation we should not group MBA together with CAT in our XML files. From poor documentation it looks like that MBA is related to memory controller. From Intel sdm 17.19. MBA used to control the request rate for flushing data from llc to memory, usually MBA and llc have a 1:1 mapping relation. Yes, I miss exposing capability part. Thanks for pointing out.

Currently the cache allocation in capabilities XML is reported like this:

<capabilities> <host> ... <cache> <bank id='0' level='3' type='both' size='30720' unit='KiB' cpus='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46'> <control granularity='1536' unit='KiB' type='code' maxAllocs='8'/> <control granularity='1536' unit='KiB' type='data' maxAllocs='8'/> </bank> <bank id='1' level='3' type='both' size='30720' unit='KiB' cpus='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47'> <control granularity='1536' unit='KiB' type='code' maxAllocs='8'/> <control granularity='1536' unit='KiB' type='data' maxAllocs='8'/> </bank> </cache> ... </host> </capabilities>

So the possible capabilities XML could look like this:

<capabilities> <host> ... <memory> <bank id='0' cpus='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46'> <control granularity='10' maxAllocs='8'/> </bank> <bank id='1' cpus='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47'> <control granularity='10' maxAllocs='8'/> </bank> </memory> ... </host> </capabilities>

The element names 'memory' and 'bank' can be named differently, suggestions are welcome. How about change bank to node?

Then there is the domain XML, for CAT we use this:

<domain> ... <cputune> ... <cachetune vcpus='0-3'> <cache id='0' level='3' type='both' size='3' unit='MiB'/> <cache id='1' level='3' type='both' size='3' unit='MiB'/> </cachetune> ... <cputune> ... </domain>

so the possible domain XML could look like this:

<domain> ... <cputune> ... <memory vcpus='0-3'> <socket id='0' bandwidth='30'/> <socket id='1' bandwidth='20'/> </memory> ... <cputune> ... </domain>

Again, the element names 'memory' and 'socket' can be named differently. socket --> node?

Since the existing virrestrl implementation only care about cache part during development, So we may need change some names of structure and functions when enable MBA. How do you think

...

Pavel

bing.niu

11:46 p.m.

Hi Pavel, On 2018年06月06日 13:56, bing.niu wrote:

...

...
Then there is the domain XML, for CAT we use this:

<domain> ... <cputune> ... <cachetune vcpus='0-3'> <cache id='0' level='3' type='both' size='3' unit='MiB'/> <cache id='1' level='3' type='both' size='3' unit='MiB'/> </cachetune> ... <cputune> ... </domain>

so the possible domain XML could look like this:

<domain> ... <cputune> ... <memory vcpus='0-3'> <socket id='0' bandwidth='30'/> <socket id='1' bandwidth='20'/> </memory> ... <cputune> ... </domain>

Again, the element names 'memory' and 'socket' can be named differently.

socket --> node?

Since the existing virrestrl implementation only care about cache part during development, So we may need change some names of structure and functions when enable MBA. How do you think

...
Pavel

Is that possible to support MBA by extending CAT in domain XML? Since each <cachetune> will map to one virresctrlalloc structure and create a rdt_group in resctrl fs. Each rdt_group will have it's own closid. this work perfect if CAT only available. However, if MBA coming in with CAT enabled also, ike this. <domain> ... <cputune> ... <cachetune vcpus='0-3'> <cache id='0' level='3' type='both' size='3' unit='MiB'/> <cache id='1' level='3' type='both' size='3' unit='MiB'/> </cachetune> <memory vcpus='2-3'> <socket id='0' bandwidth='30'/> <socket id='1' bandwidth='20'/> </memory> ... <cputune> ... </domain> we have to make sure those two allocating will not have vcpu overlapped. like this, if (virBitmapOverlaps(def->cachetunes[i]->vcpus, vcpus) || virBitmapOverlaps(def->memroy->vcpus, vcpus)) { virReportError(VIR_ERR_XML_ERROR, "%s", _("Overlapping vcpus in cachetunes")); goto cleanup; that looks like introducing some dependency between CAT and MBA. Is that possible we rename cachetune so that handle CAT MBA together one section? Thanks a lot Bing

...

-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

Pavel Hrdina

8 Jun 8 Jun

5 a.m.

On Wed, Jun 06, 2018 at 03:46:37PM +0800, bing.niu wrote:

...

Hi Pavel,

On 2018年06月06日 13:56, bing.niu wrote:

...
...
Then there is the domain XML, for CAT we use this:

<domain> ... <cputune> ... <cachetune vcpus='0-3'> <cache id='0' level='3' type='both' size='3' unit='MiB'/> <cache id='1' level='3' type='both' size='3' unit='MiB'/> </cachetune> ... <cputune> ... </domain>

so the possible domain XML could look like this:

<domain> ... <cputune> ... <memory vcpus='0-3'> <socket id='0' bandwidth='30'/> <socket id='1' bandwidth='20'/> </memory> ... <cputune> ... </domain>

Again, the element names 'memory' and 'socket' can be named differently.

socket --> node?

Since the existing virrestrl implementation only care about cache part during development, So we may need change some names of structure and functions when enable MBA. How do you think

...
Pavel

Is that possible to support MBA by extending CAT in domain XML? Since each <cachetune> will map to one virresctrlalloc structure and create a rdt_group in resctrl fs. Each rdt_group will have it's own closid. this work perfect if CAT only available. However, if MBA coming in with CAT enabled also, ike this.

<domain> ... <cputune> ... <cachetune vcpus='0-3'> <cache id='0' level='3' type='both' size='3' unit='MiB'/> <cache id='1' level='3' type='both' size='3' unit='MiB'/> </cachetune> <memory vcpus='2-3'> <socket id='0' bandwidth='30'/> <socket id='1' bandwidth='20'/> </memory> ... <cputune> ... </domain>

we have to make sure those two allocating will not have vcpu overlapped. like this,

if (virBitmapOverlaps(def->cachetunes[i]->vcpus, vcpus) || virBitmapOverlaps(def->memroy->vcpus, vcpus)) { virReportError(VIR_ERR_XML_ERROR, "%s", _("Overlapping vcpus in cachetunes")); goto cleanup; that looks like introducing some dependency between CAT and MBA. Is that possible we rename cachetune so that handle CAT MBA together one section?

I would like to avoid mixing cache tune and memory bandwidth under 'cachetune' element. But this is a good point, we need to make sure that the 'vcpus' is not overlapping. Renaming existing XML element is not possible, it needs to be backward compatible. Pavel

bing.niu

11:13 p.m.

...

On Wed, Jun 06, 2018 at 03:46:37PM +0800, bing.niu wrote:

...
Hi Pavel,

On 2018年06月06日 13:56, bing.niu wrote:

...
...
Then there is the domain XML, for CAT we use this:

<domain> ... <cputune> ... <cachetune vcpus='0-3'> <cache id='0' level='3' type='both' size='3' unit='MiB'/> <cache id='1' level='3' type='both' size='3' unit='MiB'/> </cachetune> ... <cputune> ... </domain>

so the possible domain XML could look like this:

<domain> ... <cputune> ... <memory vcpus='0-3'> <socket id='0' bandwidth='30'/> <socket id='1' bandwidth='20'/> </memory> ... <cputune> ... </domain>

Again, the element names 'memory' and 'socket' can be named differently.

socket --> node?

Since the existing virrestrl implementation only care about cache part during development, So we may need change some names of structure and functions when enable MBA. How do you think

...
Pavel

Is that possible to support MBA by extending CAT in domain XML? Since each <cachetune> will map to one virresctrlalloc structure and create a rdt_group in resctrl fs. Each rdt_group will have it's own closid. this work perfect if CAT only available. However, if MBA coming in with CAT enabled also, ike this.

<domain> ... <cputune> ... <cachetune vcpus='0-3'> <cache id='0' level='3' type='both' size='3' unit='MiB'/> <cache id='1' level='3' type='both' size='3' unit='MiB'/> </cachetune> <memory vcpus='2-3'> <socket id='0' bandwidth='30'/> <socket id='1' bandwidth='20'/> </memory> ... <cputune> ... </domain>

we have to make sure those two allocating will not have vcpu overlapped. like this,

if (virBitmapOverlaps(def->cachetunes[i]->vcpus, vcpus) || virBitmapOverlaps(def->memroy->vcpus, vcpus)) { virReportError(VIR_ERR_XML_ERROR, "%s", _("Overlapping vcpus in cachetunes")); goto cleanup; that looks like introducing some dependency between CAT and MBA. Is that possible we rename cachetune so that handle CAT MBA together one section?

I would like to avoid mixing cache tune and memory bandwidth under 'cachetune' element. But this is a good point, we need to make sure that the 'vcpus' is not overlapping.

Renaming existing XML element is not possible, it needs to be backward compatible.

Pavel

On 2018年06月08日 21:00, Pavel Hrdina wrote: thanks for clarification ；）. we will do a cross check about cachetune and memory to make sure no vcpus overlapping. I just feel this may confuse people. Since we must tell them vcpus cannot overlapping for cachetune and memory, but vcpus can be equal. this should be the most common case--- create one rdt group for one vm.

...

-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

Pavel Hrdina

9 Jun 9 Jun

4:28 a.m.

On Sat, Jun 09, 2018 at 03:13:56PM +0800, bing.niu wrote:

...

...
On Wed, Jun 06, 2018 at 03:46:37PM +0800, bing.niu wrote:

...
Hi Pavel,

On 2018年06月06日 13:56, bing.niu wrote:

...
...
Then there is the domain XML, for CAT we use this:

<domain> ... <cputune> ... <cachetune vcpus='0-3'> <cache id='0' level='3' type='both' size='3' unit='MiB'/> <cache id='1' level='3' type='both' size='3' unit='MiB'/> </cachetune> ... <cputune> ... </domain>

so the possible domain XML could look like this:

<domain> ... <cputune> ... <memory vcpus='0-3'> <socket id='0' bandwidth='30'/> <socket id='1' bandwidth='20'/> </memory> ... <cputune> ... </domain>

Again, the element names 'memory' and 'socket' can be named differently.

socket --> node?

Since the existing virrestrl implementation only care about cache part during development, So we may need change some names of structure and functions when enable MBA. How do you think

...
Pavel

Is that possible to support MBA by extending CAT in domain XML? Since each <cachetune> will map to one virresctrlalloc structure and create a rdt_group in resctrl fs. Each rdt_group will have it's own closid. this work perfect if CAT only available. However, if MBA coming in with CAT enabled also, ike this.

<domain> ... <cputune> ... <cachetune vcpus='0-3'> <cache id='0' level='3' type='both' size='3' unit='MiB'/> <cache id='1' level='3' type='both' size='3' unit='MiB'/> </cachetune> <memory vcpus='2-3'> <socket id='0' bandwidth='30'/> <socket id='1' bandwidth='20'/> </memory> ... <cputune> ... </domain>

we have to make sure those two allocating will not have vcpu overlapped. like this,

if (virBitmapOverlaps(def->cachetunes[i]->vcpus, vcpus) || virBitmapOverlaps(def->memroy->vcpus, vcpus)) { virReportError(VIR_ERR_XML_ERROR, "%s", _("Overlapping vcpus in cachetunes")); goto cleanup; that looks like introducing some dependency between CAT and MBA. Is that possible we rename cachetune so that handle CAT MBA together one section?

I would like to avoid mixing cache tune and memory bandwidth under 'cachetune' element. But this is a good point, we need to make sure that the 'vcpus' is not overlapping.

Renaming existing XML element is not possible, it needs to be backward compatible.

Pavel

On 2018年06月08日 21:00, Pavel Hrdina wrote: thanks for clarification ；）. we will do a cross check about cachetune and memory to make sure no vcpus overlapping. I just feel this may confuse people. Since we must tell them vcpus cannot overlapping for cachetune and memory, but vcpus can be equal. this should be the most common case--- create one rdt group for one vm.

Yes, I don't like that confusion as well. We will see how the code will look like and how it will behave whether it makes sense or not. If needed we can change it. Pavel

Pavel Hrdina

8 Jun 8 Jun

2:51 a.m.

On Wed, Jun 06, 2018 at 01:56:53PM +0800, bing.niu wrote:

...

Hi Pavel, Thanks for your valuable inputs here. please see my respond.

On 2018年06月05日 20:11, Pavel Hrdina wrote:

...
On Tue, May 29, 2018 at 06:58:01PM +0800, bing.niu@intel.com wrote:

...
From: Bing Niu <bing.niu@intel.com>

This series is to introduce RDT memory bandwidth allocation support by extending current virresctrl implementation.

The Memory Bandwidth Allocation (MBA) feature provides indirect and approximate control over memory bandwidth available per-core. This feature provides a method to control applications which may be over-utilizing bandwidth relative to their priority in environments such as the data-center. The details can be found in Intel's SDM 17.19.7. Kernel supports MBA through resctrl file system same as CAT. Each resctrl group have a MB parameter to control how much memory bandwidth it can utilize in unit of percentage.

In this series, MBA is enabled by enhancing existing virresctrl implementation. The policy employed for MBA is similar with CAT: The sum of each MBA group's bandwidth dose not exceed 100%. The enhancement of virresctrl include two parts:

Patch 1: Add two new structure virResctrlInfoMB and virResctrlAllocMB for collecting host system MBA capability and domain memory bandwidth allocation.

Patch 2: On frontend XML parsing, add new element "llc" in cachetune section for MBA allocation.

Hi,

Thanks for the patches. Before we start with the actual implementation it would be nice to agree on the design. Total agree. The RFC code acts as baseline for discuss.

------------------------------------------------------------------------

So first point is that we should do it similarly as the cache allocation, we will not allow to "share" the bandwidth so the sum should be 100% as you already have that in your patches, but we need to do it in a way that in the future we can allow to "share" the bandwidth. Yes, the memory bandwidth allocation policy is derived from existing CAT in libvirt. no share or overlap. In the patch, I follow the existing CAT behavior. When allocating memory bandwidth. First, calculate the unused memory bandwidth by subtracting all existing RDT groups. If we want to enable memory bandwidth sharing. We can just simply skip this part and do allocation directly. Could this fit your comment " we need to do it in a way that in the future we can allow to "share" the bandwidth."? If there is anything missing or my understanding incorrect, Please point me out. :)

Sounds good to me.

...

...
Second point is how the XML will look like. There are two parts, one is the capabilities XML and second one is domain XML.

It looks like that your patches don't expose any information in capabilities, we should do that in order to let management applications know that the feature is available and what are the possible values that they can use.

------------------------------------------------------------------------

I've tried to configure MBA on one machine that I have access to witch has this cpu: 'Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz' and it behaves strangely. If I configure 'schemata' the output of 'pqos -s' command is in some situations different:

schemata pqos -s output

MB:0=10 MBA COS0 => 10% available MB:0=20 MBA COS0 => 20% available MB:0=30 MBA COS0 => 30% available MB:0=40 MBA COS0 => 40% available MB:0=50 MBA COS0 => 50% available MB:0=60 MBA COS0 => 60% available MB:0=70 MBA COS0 => 90% available MB:0=80 MBA COS0 => 90% available MB:0=90 MBA COS0 => 90% available MB:0=100 MBA COS0 => 100% available If you look at the table you can see that for values 70-90 the pqos shows that the available bandwidth is 90%.

Tested using Fedora 28: kernel-4.16.13-300.fc28.x86_64 intel-cmt-cat-1.2.0-2.fc28.x86_64

hmm.., that is strange. I directly manipulate resctrl fs. So I didn't hit such kind of issue. I will take a look at this pqos package and let you know.

Yes, I was directly manipulating the resctrl fs as well and after the modification to schemata file the content was correct, however, I wanted to validate what was actually configured and in order to do that I used 'pqos' tool. The question now is whether that tool is broken or if you configure some values the actual configuration is different.

...

...
------------------------------------------------------------------------

Since CAT (cache allocation technology) and MBA (memory bandwidth allocation) are unrelated and they are controlling different limitation we should not group MBA together with CAT in our XML files. From poor documentation it looks like that MBA is related to memory controller. From Intel sdm 17.19. MBA used to control the request rate for flushing data from llc to memory, usually MBA and llc have a 1:1 mapping relation. Yes, I miss exposing capability part. Thanks for pointing out.

Currently the cache allocation in capabilities XML is reported like this:

<capabilities> <host> ... <cache> <bank id='0' level='3' type='both' size='30720' unit='KiB' cpus='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46'> <control granularity='1536' unit='KiB' type='code' maxAllocs='8'/> <control granularity='1536' unit='KiB' type='data' maxAllocs='8'/> </bank> <bank id='1' level='3' type='both' size='30720' unit='KiB' cpus='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47'> <control granularity='1536' unit='KiB' type='code' maxAllocs='8'/> <control granularity='1536' unit='KiB' type='data' maxAllocs='8'/> </bank> </cache> ... </host> </capabilities>

So the possible capabilities XML could look like this:

<capabilities> <host> ... <memory> <bank id='0' cpus='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46'> <control granularity='10' maxAllocs='8'/> </bank> <bank id='1' cpus='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47'> <control granularity='10' maxAllocs='8'/> </bank> </memory> ... </host> </capabilities>

The element names 'memory' and 'bank' can be named differently, suggestions are welcome. How about change bank to node?

Definitely can be 'node', this was just to make it consistent with cache allocation.

...

...
Then there is the domain XML, for CAT we use this:

<domain> ... <cputune> ... <cachetune vcpus='0-3'> <cache id='0' level='3' type='both' size='3' unit='MiB'/> <cache id='1' level='3' type='both' size='3' unit='MiB'/> </cachetune> ... <cputune> ... </domain>

so the possible domain XML could look like this:

<domain> ... <cputune> ... <memory vcpus='0-3'> <socket id='0' bandwidth='30'/> <socket id='1' bandwidth='20'/> </memory> ... <cputune> ... </domain>

Again, the element names 'memory' and 'socket' can be named differently. socket --> node?

Same here, I'm OK with 'node'.

...

Since the existing virrestrl implementation only care about cache part during development, So we may need change some names of structure and functions when enable MBA. How do you think

Yes, the existing virresctrl implementation was mainly focused to cache allocation. There is a new patch series with some cleanups for the existing implementation which renames few things but we will probably need to rename some other functions/structures/etc. One important note, the rename should be done in separate patch without any functional changes. Pavel

2822

Age (days ago)

2833

Last active (days ago)

List overview

Download

14 comments

4 participants

participants (4)

bing.niu
bing.niu＠intel.com
Ján Tomko
Pavel Hrdina