[libvirt] [GSOC] project libvirt fuzzing

Dear all,
This is my first post to the list. I am currently a graduate student studying computer science, particularly interested in visualization technologies, and I have been using QEMU for a variety of projects for a while. Two of the courses I am taking this semester that really attracted me to the libvirt community are Advanced Operating Systems and Secure Software Development. I have been learning kernel fuzzing as well as other general fuzzing tools.
Then I found the topic "QEMU command line generator XML fuzzing", which is pretty interesting and totally in line with my interests and background. Though I have read through the documentation on the website, just to make sure I am doing it correctly, could anyone confirm that this project is still available? And what do I need to do next in order to participate in the project this summer? Do I need to find a mentor by myself? Potentially, I could ask my OS or Security professor to be my mentor, but I am not sure yet which would be the best way.
Thanks,
Dan

On 04.03.2017 07:23, Da L wrote:
Dear all,
Hey,
This is my first post in the list.
Very well. Welcome. It is always nice to see people interested in libvirt.
I am currently a graduate student studying computer science, particularly interested in visualization technologies, and I have been using QEMU for a variety of projects for a while. Two of the courses I am taking this semester that really attracted me to the libvirt community are Advanced Operating Systems and Secure Software Development. I have been learning kernel fuzzing as well as other general fuzzing tools.
Then I found the topic "QEMU command line generator XML fuzzing", which is pretty interesting and totally in line with my interests and background. Though I have read through the documentation on the website, just to make sure I am doing it correctly, could anyone confirm that this project is still available? And what do I need to do next in order to participate in the project this summer? Do I need to find a mentor by myself? Potentially, I could ask my OS or Security professor to be my mentor, but I am not sure yet which would be the best way.
Yes, the project is still on. It does not have a mentor assigned yet, but don't worry about that now - there are a lot of mentors around. For now, I can be your point of contact.
So, just to explain some details of the project: libvirt's format for storing domain configuration is XML. However, none of the hypervisors out there uses XML to describe domain configuration. For instance, in qemu it's all about the command line. You want this disk for your domain? You have to put it onto the command line. And so on. Therefore, in a very simplistic way, for qemu libvirt translates the XML into qemu command line language. Now, this process is very complex and sort of tricky. That's why we would like to generate "all" possible combinations of XML, let the command line generator crunch them and produce the qemu command line. Well, that's not entirely true, because the command line generator works over an internal representation of the domain (not XML) that is produced by our XML parser:
XML document -> XML parser -> QEMU cmd line generator -> QEMU cmd line
There are plenty of fuzzing libraries available on the market, so I guess one of the first steps would be to explore our options and pick one that suits our needs. Do you have experience with any of them? Frankly, I have very little.
Regarding the GSoC process, each organization makes its own rules for accepting students. Here at libvirt the rules are described here: http://wiki.libvirt.org/page/Google_Summer_of_Code_FAQ
Please let me know what your thoughts are on all of this, and also don't hesitate to ask anything.
Michal
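To make the two-stage pipeline described above concrete, here is a minimal sketch in Python. It is not libvirt code: `ToyDomainDef`, `parse_domain_xml` and `build_command_line` are hypothetical names invented for this illustration, and the handful of elements and qemu options handled are a tiny subset chosen only to show the shape of "XML document -> internal representation -> command line".

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class ToyDomainDef:
    """Hypothetical stand-in for an internal domain representation."""
    name: str
    memory_kib: int
    disks: list = field(default_factory=list)


def parse_domain_xml(xml_text: str) -> ToyDomainDef:
    """Stage 1: XML document -> internal representation."""
    root = ET.fromstring(xml_text)
    return ToyDomainDef(
        name=root.findtext("name", default="unnamed"),
        memory_kib=int(root.findtext("memory", default="0")),
        disks=[s.get("file") for s in root.iter("source") if s.get("file")],
    )


def build_command_line(d: ToyDomainDef) -> list:
    """Stage 2: internal representation -> argv-style command line."""
    argv = ["qemu-system-x86_64", "-name", d.name, "-m", str(d.memory_kib // 1024)]
    for path in d.disks:
        argv += ["-drive", f"file={path}"]
    return argv


DOMAIN_XML = """
<domain type='qemu'>
  <name>demo</name>
  <memory>1048576</memory>
  <devices>
    <disk type='file'><source file='/tmp/demo.img'/></disk>
  </devices>
</domain>
"""

argv = build_command_line(parse_domain_xml(DOMAIN_XML))
print(" ".join(argv))
```

A fuzzer aimed at the second stage would feed many different XML documents through both functions and watch for crashes or nonsensical command lines; the real generator covers vastly more configuration than this toy does.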

On Sun, Mar 5, 2017 at 2:47 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
> On 04.03.2017 07:23, Da L wrote:
> > Dear all,
> Hey,
> > This is my first post in the list.
> Very well. Welcome. It is always nice to see people interested in libvirt.
Hi Michal,
Thank you very much for the explanation and encouragement. I am so glad to join the community.
> > I am currently a graduate student studying computer science, particularly interested in visualization technologies and I have been using QEMU for a variety of projects for a while. Two of the courses that I am taking this semester really attracted me to the libvirt community are Advanced Operating Systems and Secure Software Development. I have been learning kernel fuzzing as well as other general fuzzing tools.
> > Then I found the topic of "QEMU command line generator XML fuzzing" is pretty interesting and totally in line with my interest and background. Though I have read through the documentations on the website, just to make sure I am doing it correctly, could anyone confirm this project is still available? And what I need to do next in order to participate the project this summer? Do I need to find a mentor by myself? Potentially, I could find my OS or Security professor as my mentor, but I am not sure yet which would be the best way.
> Yes, the project is still on. It does not have a mentor assigned yet, but don't worry about that now - there is a lot of mentors around. For now, I can be your point of contact.
> So, just to explain you some details of the project: libvirt's format for storing domain configuration is XML. However, none of the hypervisors out there uses XML to describe domain configuration. For instance, in qemu it's all about the command line. You want this disk for you domain? You have to put it onto the command line. And so on. Therefore, in a very simplistic way, for qemu libvirt translates the XML into qemu command line language. Now, this process is very complex and sort of tricky. That's why we would like to generate "all" possible combinations of XML, let the command line generator crunch them and produce qemu command line. Well, that's not entirely true, because command line generator works over some internal representation of domain (not XML) that is produced by our XML parser:
Please correct me if I am wrong in the following understanding:
1. Regarding the XML config file, one typical usage with libvirt could be: $ virsh define <domain_config_file.xml <http://your_xml_config_file.ml>>
2. I noticed that the libvirt source code contains several files closely related to XML, including src/util/virxml.{c,h}, which might be the target of this project?
3. libvirt is also compiled with libxml2.
4. Then there is virt-xml-validate, a bash script (in the build/bin directory after make install), which calls xmllint.
I have not yet been able to figure out how the above pieces relate to each other. I spent some time trying to instrument and compile the executables with AFL, but so far with no luck. (The idea is as simple as changing gcc in Makefile/configure to afl-gcc.) The attached figure is just a demo of using AFL to fuzz virt-admin, which is not instrumented (so kind of boring and not very useful). But I think AFL could be one of the candidate fuzzers for this project due to its prevalence and proven effectiveness.
Regarding fuzzing, I think we can try running several fuzzing tools in parallel, as different fuzzers tend to find different kinds of bugs. Thus AFL (American Fuzzy Lop) [1], which is a coverage-guided, mutation-based fuzzer using a genetic algorithm, can take hand-crafted XML seeds to fuzz our libvirt target. Alternatively, we could develop a generation-based grammar module in AFL (which is definitely non-trivial); so far I have not seen active development in the AFL community on XML grammar generation. Another option could be clang's libFuzzer [2].
Several related articles show examples of fuzzing: using AFL to generate SQL [3], llvm-afl [4], and hexml fuzzing with AFL [5]. In combination with lcov, we could compare different fuzzers and guide our fuzzing tuning. NOTE: example [5] is quite interesting; it is fuzzing an XML parser written in Haskell.
I will probably not post further updates until next week; I have three mid-terms this week.
[1] http://lcamtuf.coredump.cx/afl/
[2] http://llvm.org/docs/LibFuzzer.html
[3] https://lcamtuf.blogspot.com/2015/01/afl-fuzz-making-up-grammar-with.html
[4] http://lists.llvm.org/pipermail/llvm-dev/2014-December/079390.html
[5] https://github.com/ndmitchell/hexml/issues/6
Again, thanks a lot. Any guidance, comments, or suggestions would be more than welcome and highly appreciated.
Best,
Dan
> XML document -> XML parser -> QEMU cmd line generator -> QEMU cmd line
> There is plenty of fuzzing libraries available on the market, so I guess one of the first steps would be to explore our options and pick one that suits our needs. Do you have experience with any of them? Frankly, I have very little.
> Regarding the GSoC process, each organization makes their own rules for accepting students. Here at libvirt the rules are described here:
> http://wiki.libvirt.org/page/Google_Summer_of_Code_FAQ
> Please let me know what are your thoughts on all of this, and also don't hesitate to ask anything.
> Michal
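The generation-based alternative mentioned in this message can be sketched in a few lines of Python. This is only an illustration of the idea of producing inputs from a grammar rather than by mutating a seed: the `GRAMMAR` and `TEXT` tables below are a hypothetical, hand-written toy, not derived from libvirt's RNG schemas, and nothing here is tied to AFL.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical toy grammar: tag -> (optional attributes with sample values, child tags).
GRAMMAR = {
    "domain": ({"type": ["qemu", "kvm"]}, ["name", "memory", "devices"]),
    "name": ({}, []),
    "memory": ({"unit": ["KiB", "MiB"]}, []),
    "devices": ({}, ["disk"]),
    "disk": ({"type": ["file", "block"]}, ["source"]),
    "source": ({"file": ["/tmp/a.img", "/tmp/b.img"]}, []),
}
TEXT = {"name": ["demo", "guest1"], "memory": ["512", "1048576"]}


def generate(tag: str, rng: random.Random) -> ET.Element:
    """Build one random element, and its subtree, allowed by the toy grammar."""
    attrs, children = GRAMMAR[tag]
    elem = ET.Element(tag)
    for key, values in attrs.items():
        if rng.random() < 0.5:              # every attribute is optional here
            elem.set(key, rng.choice(values))
    if tag in TEXT:
        elem.text = rng.choice(TEXT[tag])
    for child in children:
        for _ in range(rng.randrange(3)):   # 0, 1 or 2 copies of each child
            elem.append(generate(child, rng))
    return elem


rng = random.Random(0)
doc = ET.tostring(generate("domain", rng), encoding="unicode")
print(doc)
```

Every document produced this way is well-formed and uses only known element names, so it tends to get past the parser and exercise the later stages, which is the usual argument for generation over blind mutation.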

On 03/07/2017 06:27 AM, D L wrote:
Please correct me if I am wrong about my following understanding: 1. Regarding XML config file, one typical usage with libvirt could be: $ virsh define <domain_config_file.xml <http://your_xml_config_file.ml>>
The file has to be stored locally. Libvirt doesn't have an 'url-grabber'. In fact, our APIs expect XML document passed as string (not a filename where it is stored). It's just virsh that allows users to point it to a file which is read and passed to the define API.
2. I noticed in the source code of libvirt, there exist several files in close relation to xml, including src/util/virxml.{c,h}, which might be the target of this project?
Sort of. The virxml.c file contains XML parsing helpers (mostly higher-level APIs over libxml2). The XML parsing is done in src/conf/domain_conf.c (or network_conf.c for libvirt networks, etc.). The entry point for exploring domain XML parsing can be the virDomainDefParseString() function.
BTW: while exploring libvirt sources I strongly advise using so-called tagged sources ("make tags" or "ctags -R ." or some equivalent), because libvirt sources consist of lots of short functions calling other functions. Tagged sources then allow developers to jump to the symbol under the cursor (in vim it is "CTRL + ]", or "g + ]" if the symbol is defined at multiple locations).
Now that we have parsed the domain XML into the internal representation (virDomainDef), we can look into qemu command line generation. I think the whole process is best visible in qemuDomainCreateXML() (e.g. "vim -t qemuDomainCreateXML" ;-)). This is the qemu driver implementation of the public API virDomainCreateXML(). It allows users to create so-called transient domains. Long story short: "here, I have domain XML, start it up for me, will you?". Therefore at the beginning the domain XML is parsed (using the function described above), several not-important-right-now functions are called, and then qemuProcessStart() is called, which calls qemuProcessLaunch(), which calls qemuBuildCommandLine(). Finally, this is the function that takes the virDomainDef (among other arguments) and produces yet another internal representation, this time of the qemu command line (virCommandPtr). This command line is then executed later in the process.
3. And libvirt also is compiled with libxml2.
Yes. This has strong historical background (hint: look who started libvirt and who wrote libxml2 ;-)). Frankly, I don't think we've ever considered a different xml parsing library.
4. Then in virt-xml-validate, which is a bash script, (in build/bin directory after make install) calling xmllint.
Yeah. Writing our XMLs by hand can be overwhelming. Moreover, libvirt has this philosophy of ignoring unknown elements/attributes. So it might happen that, for instance, you have a typo in an element name and you're left wondering why libvirt ignores that particular setting (e.g. path to disk of domain). Therefore we have grammar rules (RNG) that could help you here - virt-xml-validate would error out in this example. Well, even virsh errors out now because it instructs libvirt to do the XML validation before parsing. But that hasn't always been the case.
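The effect of "ignore what you do not recognise" versus validating first can be shown with a small Python sketch. It is an illustration only: the `KNOWN_DISK_CHILDREN` set is a made-up stand-in for a schema, and the real check is done with RELAX NG grammar rules, not with a set lookup like this.

```python
import xml.etree.ElementTree as ET

KNOWN_DISK_CHILDREN = {"source", "target"}  # assumption: tiny illustrative subset


def lenient_disk_path(disk_xml: str):
    """Read only what we recognise; anything else is silently ignored."""
    disk = ET.fromstring(disk_xml)
    source = disk.find("source")
    return source.get("file") if source is not None else None


def unknown_children(disk_xml: str) -> list:
    """Strict check: report child elements the toy schema does not list."""
    disk = ET.fromstring(disk_xml)
    return [child.tag for child in disk if child.tag not in KNOWN_DISK_CHILDREN]


# "sorce" is a deliberate typo for "source".
typo_xml = "<disk><sorce file='/tmp/demo.img'/><target dev='vda'/></disk>"

print(lenient_disk_path(typo_xml))   # the disk path is lost, with no error raised
print(unknown_children(typo_xml))    # the strict check points at the typo
```

For fuzzing this matters in both directions: a lenient parser accepts many mutated inputs that a validator would reject, so a fuzzing setup has to decide whether validation runs in front of the code under test.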
I have not yet been able to figure out how the above pieces relate to each other. I spent some time trying to instrument and compile the executables with AFL, but so far with no luck. (The idea is as simple as changing gcc in Makefile/configure to afl-gcc.) The attached figure is just a demo of using AFL to fuzz virt-admin, which is not instrumented (so kind of boring and not very useful). But I think AFL could be one of the candidate fuzzers for this project due to its prevalence and proven effectiveness.
We don't have to limit ourselves to just domain XML -> qemu cmd line fuzzing. We can look into other areas too (there are a lot of inputs for libvirt), e.g. the RPC protocol (we have our own protocol for communication with a distant server/client over the network), or fuzzing the XML parsers themselves (the domain is not the only object that libvirt manages; we have networks, interfaces, storage pools/volumes, etc.). It's just that qemu cmd line fuzzing seemed complicated enough that the chances of running a fuzzer successfully are high.
Regarding fuzzing, I think we can try several fuzzing tools to run in parallel, as different fuzzers tend to find different kinds of bugs.
True. I had this on my mind as well.
Thus AFL (American Fuzzy Lop) [1], which is a coverage-guided, mutation-based fuzzer using a genetic algorithm, can take hand-crafted XML seeds to fuzz our libvirt target. Alternatively, we could develop a generation-based grammar module in AFL (which is definitely non-trivial);
Yeah, I thought about this when watching a talk on AFL. We might explore other possibilities - they already might have something we want.
so far I have not seen active development in AFL community on xml format grammar generation. Another option could be clang-libfuzzer [2].
Several related articles show examples of fuzzing: using AFL to generate SQL [3], llvm-afl [4], and hexml fuzzing with AFL [5]. In combination with lcov, we could compare different fuzzers and guide our fuzzing tuning.
Yes, good idea.
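As a rough picture of what seed-based mutation fuzzing means, here is a minimal Python sketch. It is deliberately blind: unlike AFL it uses no coverage feedback and no genetic selection, the `target` function is a toy stand-in rather than libvirt code, and `SEED`, `mutate` and `fuzz` are names made up for this illustration.

```python
import random
import xml.etree.ElementTree as ET

SEED = b"<domain type='qemu'><name>demo</name><memory>1048576</memory></domain>"


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Apply one random byte-level mutation: flip a bit, delete or duplicate a byte."""
    buf = bytearray(data)
    pos = rng.randrange(len(buf))
    choice = rng.randrange(3)
    if choice == 0:
        buf[pos] ^= 1 << rng.randrange(8)
    elif choice == 1:
        del buf[pos]
    else:
        buf.insert(pos, buf[pos])
    return bytes(buf)


def target(data: bytes) -> None:
    """Toy code under test: parse the XML and read the memory size."""
    root = ET.fromstring(data)
    int(root.findtext("memory", default="0"))


def fuzz(iterations: int, seed: int = 0) -> dict:
    """Run the target on mutated seeds and tally how each run ended."""
    rng = random.Random(seed)
    outcomes = {"ok": 0, "rejected": 0, "unexpected": 0}
    for _ in range(iterations):
        sample = mutate(SEED, rng)
        try:
            target(sample)
            outcomes["ok"] += 1
        except ET.ParseError:
            outcomes["rejected"] += 1       # malformed XML, cleanly refused
        except Exception:
            outcomes["unexpected"] += 1     # e.g. ValueError from int(): worth a look
    return outcomes


print(fuzz(1000))
```

Most mutated samples are simply rejected as malformed, which is why blind mutation rarely reaches code behind the parser; coverage guidance, good seeds or grammar awareness are the usual ways to spend fewer iterations on inputs that die early. The tally could also be compared across tools, in the spirit of the lcov comparison suggested above.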
NOTE: example [5] is quite interesting; it is fuzzing an XML parser written in Haskell.
Indeed.
I will probably not update more until next week; I am having three mid-terms this week.
Good luck.
[1] http://lcamtuf.coredump.cx/afl/
[2] http://llvm.org/docs/LibFuzzer.html
[3] https://lcamtuf.blogspot.com/2015/01/afl-fuzz-making-up-grammar-with.html
[4] http://lists.llvm.org/pipermail/llvm-dev/2014-December/079390.html
[5] https://github.com/ndmitchell/hexml/issues/6
Again, thanks a lot. Any guidance, comments, or suggestions would be more than welcome and highly appreciated.
Best,
Dan
Michal

On Tue, Mar 7, 2017 at 4:08 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
Hi Michal,
Thank you so much for the detailed description. I will get back to you on each point in detail next week.
By the way, so nice to see the power of vi in a real project.
Best,
Dan

On 03/07/2017 09:22 PM, D L wrote:
Hi Michal,
Thank you so much for the detailed description. I will get back to you for each point in detail next week.
Sure, no problem.
By the way, so nice to see the power of vi in a real project.
http://vim.wikia.com/wiki/Browsing_programs_with_tags
Michal

On Tue, Mar 7, 2017 at 4:08 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
Please correct me if I am wrong about my following understanding: 1. Regarding XML config file, one typical usage with libvirt could be: $ virsh define <domain_config_file.xml <http://your_xml_config_file.ml>>
The file has to be stored locally. Libvirt doesn't have an 'url-grabber'. In fact, our APIs expect XML document passed as string (not a filename where it is stored). It's just virsh that allows users to point it to a file which is read and passed to the define API.
Oh, my bad. What I typed was somehow turned into a URL by my browser. But it is an interesting idea to have config files requested via HTTP.
2. I noticed in the source code of libvirt, there exist several files in close relation to xml, including src/util/virxml.{c,h}, which might be the target of this project?
Sort of. The virxml.c file contains XML parsing helpers (mostly higher-level APIs over libxml2). The XML parsing is done in src/conf/domain_conf.c (or network_conf.c for libvirt networks, etc.). The entry point for exploring domain XML parsing can be the virDomainDefParseString() function. BTW: while exploring libvirt sources I strongly advise using so-called tagged sources ("make tags" or "ctags -R ." or some equivalent), because libvirt sources consist of lots of short functions calling other functions. Tagged sources then allow developers to jump to the symbol under the cursor (in vim it is "CTRL + ]", or "g + ]" if the symbol is defined at multiple locations).
I took a deeper look at domain_conf.c and network_conf.c. It is just so amazing to see a single file having 26K lines of code. I first thought it must be generated automatically, then I found there are ~1640 commits for that single file over 8 years. Yes, ctags is very, very helpful!
Now that we have parsed the domain XML into internal representation (virDomainDef), we can look into qemu command line generation. I think the whole process is best visible in qemuDomainCreateXML() (e.g. "vim -t qemuDomainCreateXML" ;-)). This is qemu driver implementation of public API virDomainCreateXML(). It allows users to create so called transient domains. Long story short: "here, I have domain XML, start it up for me, will you?". Therefore at the beginning the domain XML is parsed (using the function described above), several not-important-right-now functions are called and then qemuProcessStart() is called which calls qemuProcessLaunch() which calls qemuBuildCommandLine(). Finally, this is the function that takes the virDomainDef (among other arguments) and produces yet another internal representation of qemu command line (virCommandPtr). This command line is then executed later in the process.
Here I traced through the invocations starting from qemuDomainCreateXML. Indeed, eventually it returned a _virCommand struct with some process information, like file descriptors, pid, uid, gid, etc. And for different purposes, it is being passed as an argument in about 200 places, for example:
in ./src/qemu/qemu_command.c: qemuBuildMasterKeyCommandLine() and qemuBuildNVRAMCommandLine();
in /src/util/vircommand.c: virCommandSetWorkingDirectory() and virCommandProcessIO();
in /src/rpc/virnetsocket.c: virNetSocketNewConnectCommand();
in /src/storage/storage_util.c: storageBackendCreateQemuImgSetOptions();
etc.
3. And libvirt also is compiled with libxml2.
Yes. This has strong historical background (hint: look who started libvirt and who wrote libxml2 ;-)). Frankly, I don't think we've ever considered a different xml parsing library.
Oh yeah, just out of curiosity, I git cloned libxml2 and found the name of Daniel Veillard, then found out more of the story. Really amazing work. Maybe I should not ask (but since I know nothing yet, it won't hurt): have we ever considered an alternative format to XML? JSON, for example, since XML is kind of hard to parse.
By the way, when I save a VM's state with "virsh save ID FILE_NAME", it generated a huge XML file (500 M to 16G); then I found out that virDomainSnapshotCreateXML() is called when executing that command (https://libvirt.org/formatsnapshot.html). When restoring the state, it calls virDomainRevertToSnapshot(), virDomainSnapshotGetXMLDesc(), etc. For the XML fuzzing project, do we need to consider those situations?
4. Then in virt-xml-validate, which is a bash script, (in build/bin directory after make install) calling xmllint.
Yeah. Writing our XMLs by hand can be overwhelming. Moreover, libvirt has this philosophy of ignoring unknown elements/attributes. So it might happen that for instance you have a typo in an element name and you're still wondering why libvirt ignores that particular setting (e.g. path to disk of domain). Therefore we have grammar rules (RNG) that could help you here - virt-xml-validate would error out in this example. Well, even virsh errors out now because it instructs libvirt to do the XML validation before parsing. But that hasn't always been the case.
I have not been able to get round to figuring out the relations of the above pieces yet. I spent some time trying to instrument and compile the executables with AFL, but so far with no luck. (The idea is as simple as changing gcc in Makefile/configure to afl-gcc). The attached figure is just a demo showing AFL fuzzing virt-admin, which is not instrumented (so kind of boring and not quite useful). But I think AFL could be one of the candidates as a fuzzer for this project due to its prevalence and proven effectiveness.
We don't have to limit ourselves just for domain XML -> qemu cmd line fuzzing. We can look into other areas too (there's a lot of inputs for libvirt), e.g. RPC protocol (we have our own protocol for communication with distant server/client over network), fuzz XML parsers themselves (domain is not the only object that libvirt manages, we have networks, interfaces, storage pools/volumes, etc.). It's just that qemu cmd line fuzzing seemed complicated enough so that the chances of running a fuzzer successfully are high.
All right. I think that's definitely a good idea. I will start looking into this tomorrow and resume the fuzzing experiment that I left. Thank you very much for the detailed explanation. I am having a much better understanding about the scope and how I would plan to confine/manage the timeline of the project.
Dan
Regarding fuzzing, I think we can try several fuzzing tools to run in parallel, as different fuzzers tend to find different kinds of bugs.
True. I had this on my mind as well.
Thus, AFL (American Fuzzy Lop) [1], which is a coverage-guided, mutation-based fuzzer with a genetic algorithm, can take hand-crafted XML seeds to fuzz our libvirt target. Alternatively, we could develop a generation-based grammar module in AFL (which is definitely non-trivial);
Yeah, I thought about this when watching a talk on AFL. We might explore other possibilities - they already might have something we want.
so far I have not seen active development in AFL community on xml format grammar generation. Another option could be clang-libfuzzer [2].
Several related articles show examples of fuzzing with AFL: generating SQL [3], llvm-afl [4], and hexml fuzzing with AFL [5]. In combination with lcov, we could compare different fuzzers and guide our fuzzing tuning.
Yes, good idea.
NOTE the [5] example is quite interesting; it is fuzzing an XML parser written in Haskell.
Indeed.
I will probably not update more until next week; I am having three mid-terms this week.
Good luck.
[1] http://lcamtuf.coredump.cx/afl/
[2] http://llvm.org/docs/LibFuzzer.html
[3] https://lcamtuf.blogspot.com/2015/01/afl-fuzz-making-up-grammar-with.html
[4] http://lists.llvm.org/pipermail/llvm-dev/2014-December/079390.html
[5] https://github.com/ndmitchell/hexml/issues/6
Again, thanks a lot. Any guidance, comments, or suggestions would be more than welcome and highly appreciated.
Best,
Dan
Michal

On 03/16/2017 09:08 AM, D L wrote:
On Tue, Mar 7, 2017 at 4:08 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
The file has to be stored locally. Libvirt doesn't have an 'url-grabber'. In fact, our APIs expect XML document passed as string (not a filename where it is stored). It's just virsh that allows users to point it to a file which is read and passed to the define API.
Oh, my bad. The typing was somehow translated into an url in my browser. But it is an interesting idea to have config files requested via http.
I'm not so sure. This looks like something for tools working on the top of libvirt, e.g. virt-install. If we were to have everything in libvirt, it would become unmanageable.
I took a deeper look at the domain_conf.c and network_conf.c. It is just so amazing to see a single file having 26 K lines of code. I first thought it must be generated automatically, then I found there are ~1640 commits for that single file over 8 years. Yes, ctags is very very helpful!
Yeah, it is our biggest file. The next one is src/qemu/qemu_driver.c. But ctags are useful not because of big files, but because our code is scattered into a lot of files. But that's not important now.
Now that we have parsed the domain XML into internal representation (virDomainDef), we can look into qemu command line generation. I think the whole process is best visible in qemuDomainCreateXML() (e.g. "vim -t qemuDomainCreateXML" ;-)). This is qemu driver implementation of public API virDomainCreateXML(). It allows users to create so called transient domains. Long story short: "here, I have domain XML, start it up for me, will you?". Therefore at the beginning the domain XML is parsed (using the function described above), several not-important-right-now functions are called and then qemuProcessStart() is called which calls qemuProcessLaunch() which calls qemuBuildCommandLine(). Finally, this is the function that takes the virDomainDef (among other arguments) and produces yet another internal representation of qemu command line (virCommandPtr). This command line is then executed later in the process.
Here I traced through the invocations starting from qemuDomainCreateXML. Indeed, eventually, it returned a _virCommand struct with some process information, like file descriptors, pid, uid, gid etc. And for different purposes, it is being passed as an argument in about 200 places, such as in ./src/qemu/qemu_command.c, there are qemuBuildMasterKeyCommandLine(), and qemuBuildNVRAMCommandLine(),
The virCommand type is for generic command execution. Not just qemu. For instance, when creating new storage volumes, libvirt spawns qemu-img tool. That's why you can find some virCommand occurrences all over the place (e.g. in storage_util.c). Moreover, some functions take existing virCommand object and just add something to it - just like qemuBuildMasterKeyCommandLine() is doing. This way, the build process of command line is split into many functions. The main reason to do so is better maintainability. Just like you'd use functions in regular code to semantically divide code into parts. What's important here is qemuBuildCommandLine() which takes domain definition (well, pointer to it), and constructs corresponding qemu command line (represented as a pointer to virCommand which is returned) by calling several functions - each one constructing some part of the command line.
Now the GSoC idea could be to test this qemuBuildCommandLine() function. A fuzzer would create the virDomainDef object, we would run qemuBuildCommandLine() over it and see if it crashed or not, whether a sane output was generated or not. Then to take this one level up, virDomainDef is produced by virDomainDefParse() which takes a string (read XML document) and parses it. At this point, the fuzzer does not need to care about virDomainDef at all, it can just create all possible XML documents and call virDomainDefParse() over them, and then qemuBuildCommandLine() over the result of parser. Therefore I think this is what we should aim at.
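To make the "many small functions each add a part of the command line" idea concrete, here is a toy sketch in Python. It is not libvirt code: the Command class, the helper names and the dict-based domain are all made up for illustration, and only the qemu options -name and -m are real.

```python
# Illustrative sketch only: a toy stand-in for the virCommand idea, not libvirt code.
from dataclasses import dataclass, field


@dataclass
class Command:
    """Hypothetical analogue of virCommand: a binary plus its arguments."""
    binary: str
    args: list = field(default_factory=list)

    def add_arg_pair(self, name, value):
        self.args.extend([name, value])

    def to_argv(self):
        return [self.binary] + self.args


def build_name_args(cmd, domain):
    # Each helper appends only its own part of the command line.
    cmd.add_arg_pair("-name", domain["name"])


def build_memory_args(cmd, domain):
    cmd.add_arg_pair("-m", str(domain["memory_mib"]))


def build_command_line(domain):
    """Top-level builder that delegates to the per-feature helpers."""
    cmd = Command("qemu-system-x86_64")
    build_name_args(cmd, domain)
    build_memory_args(cmd, domain)
    return cmd


argv = build_command_line({"name": "demo", "memory_mib": 512}).to_argv()
print(argv)
```

The point of the split is that a fuzzer only has to drive the top-level builder; every helper gets exercised through it.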
in /src/util/vircommand.c: there are virCommandSetWorkingDirectory(), and virCommandProcessIO() in /src/rpc/virnetsocket.c, there is virNetSocketNewConnectCommand(). in /src/storage/storage_util.c, there is storageBackendCreateQemuImgSetOptions(). etc
3. And libvirt also is compiled with libxml2.
Yes. This has strong historical background (hint: look who started libvirt and who wrote libxml2 ;-)). Frankly, I don't think we've ever considered a different xml parsing library.
Oh yeah, just out of curiosity, I git cloned libxml2 and found the name of Daniel Veillard, then found out more stories. Really amazing work.
Yeah. Daniel wrote libxml2 and then started libvirt. So choosing XML was a no-brainer :-).
Maybe I should not ask, (but since I know nothing yet, it won't hurt), have we ever considered another format alternative to xml? Json, for example, since xml is kind of hard to parse.
It's not any harder than JSON. There's 1:1 mapping between JSON and XML. Anything that can be expressed in one format can be expressed in the other too. And we cannot really switch formats because we try to stay backward compatible. Meaning, if you write a program that co-operates with libvirt, any subsequent update of libvirt should not break your application. Therefore, if your application knows how to create/parse XML documents (because that's what you need currently if you talk to libvirt), we cannot switch to JSON, because your application would stop working with XML parser error.
Just a side note - there are plenty of projects on top of libvirt which create/parse the XML for you so you don't even have to touch XML yourself. You don't even know that there's XML behind the scenes. One such project is libvirt-gobject, for instance. So XML is not an issue here.
By the way, when I save a VM's state with "virsh save ID FILE_NAME", it generated a huge XML file (500 M to 16G)
Yes, because the file does not only contain the domain XML, but also internal state of the domain (=guest memory + qemu memory). Therefore, if you restore from it, you will get the very same state as when doing the save.
then I found out virDomainSnapshotCreateXML() is called when executing that command ( https://libvirt.org/formatsnapshot.html). When restoring the state, it is calling virDomainRevertToSnapshot(), virDomainSnapshotGetXMLDesc etc.
Not really. 'virsh save' calls virDomainSave() which calls qemuDomainSave() (because conn->driver points to qemuHypervisorDriver). qemuDomainSave() -> qemuDomainSaveFlags() -> qemuDomainSaveInternal() -> qemuDomainSaveMemory() where basically all the interesting work takes place.
For the XML fuzzing project, do we need to consider those situations?
I think the most important is to have XML -> qemu cmd line fuzzing in place and only after that focus on extending that to what I'm describing below in previous e-mails.
4. Then in virt-xml-validate, which is a bash script, (in build/bin directory after make install) calling xmllint.
Yeah. Writing our XMLs by hand can be overwhelming. Moreover, libvirt has this philosophy of ignoring unknown elements/attributes. So it might happen that for instance you have a typo in an element name and you're still wondering why libvirt ignores that particular setting (e.g. path to disk of domain). Therefore we have grammar rules (RNG) that could help you here - virt-xml-validate would error out in this example. Well, even virsh errors out now because it instructs libvirt to do the XML validation before parsing. But that hasn't always been the case.
I have not been able to get round to figuring out the relations of the above pieces yet. I spent some time trying to instrument and compile the executables with AFL, but so far with no luck. (The idea is as simple as changing gcc in Makefile/configure to afl-gcc). The attached figure is just a demo showing AFL fuzzing virt-admin, which is not instrumented (so kind of boring and not quite useful). But I think AFL could be one of the candidates as a fuzzer for this project due to its prevalence and proven effectiveness.
We don't have to limit ourselves just for domain XML -> qemu cmd line fuzzing. We can look into other areas too (there's a lot of inputs for libvirt), e.g. RPC protocol (we have our own protocol for communication with distant server/client over network), fuzz XML parsers themselves (domain is not the only object that libvirt manages, we have networks, interfaces, storage pools/volumes, etc.). It's just that qemu cmd line fuzzing seemed complicated enough so that the chances of running a fuzzer successfully are high.
All right. I think that's definitely a good idea. I will start looking into this tomorrow and resume the fuzzing experiment that I left. Thank you very much for the detailed explanation. I am having a much better understanding about the scope and how I would plan to confine/manage the timeline of the project.
Cheers. Michal

On Thu, Mar 16, 2017 at 1:03 PM, Michal Privoznik <mprivozn@redhat.com> wrote:
On 03/16/2017 09:08 AM, D L wrote:
On Tue, Mar 7, 2017 at 4:08 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
The file has to be stored locally. Libvirt doesn't have an 'url-grabber'. In fact, our APIs expect XML document passed as string (not a filename where it is stored). It's just virsh that allows users to point it to a file which is read and passed to the define API.
Oh, my bad. The typing was somehow translated into an url in my browser. But it is an interesting idea to have config files requested via http.
I'm not so sure. This looks like something for tools working on the top of libvirt, e.g. virt-install. If we were to have everything in libvirt, it would become unmanageable.
I took a deeper look at the domain_conf.c and network_conf.c. It is just so amazing to see a single file having 26 K lines of code. I first thought it must be generated automatically, then I found there are ~1640 commits for that single file over 8 years. Yes, ctags is very very helpful!
Yeah, it is our biggest file. The next one is src/qemu/qemu_driver.c. But ctags are useful not because of big files, but because our code is scattered into a lot of files. But that's not important now.
Now that we have parsed the domain XML into internal representation (virDomainDef), we can look into qemu command line generation. I think the whole process is best visible in qemuDomainCreateXML() (e.g. "vim -t qemuDomainCreateXML" ;-)). This is qemu driver implementation of public API virDomainCreateXML(). It allows users to create so called transient domains. Long story short: "here, I have domain XML, start it up for me, will you?". Therefore at the beginning the domain XML is parsed (using the function described above), several not-important-right-now functions are called and then qemuProcessStart() is called which calls qemuProcessLaunch() which calls qemuBuildCommandLine(). Finally, this is the function that takes the virDomainDef (among other arguments) and produces yet another internal representation of qemu command line (virCommandPtr). This command line is then executed later in the process.
Here I traced through the invocations starting from qemuDomainCreateXML. Indeed, eventually, it returned a _virCommand struct with some process information, like file descriptors, pid, uid, gid etc. And for different purposes, it is being passed as an argument in about 200 places, such as in ./src/qemu/qemu_command.c, there are qemuBuildMasterKeyCommandLine(), and qemuBuildNVRAMCommandLine(),
The virCommand type is for generic command execution. Not just qemu. For instance, when creating new storage volumes, libvirt spawns qemu-img tool. That's why you can find some virCommand occurrences all over the place (e.g. in storage_util.c). Moreover, some functions take existing virCommand object and just add something to it - just like qemuBuildMasterKeyCommandLine() is doing. This way, the build process of command line is split into many functions. The main reason to do so is better maintainability. Just like you'd use functions in regular code to semantically divide code into parts. What's important here is qemuBuildCommandLine() which takes domain definition (well, pointer to it), and constructs corresponding qemu command line (represented as a pointer to virCommand which is returned) by calling several functions - each one constructing some part of the command line.
Now the GSoC idea could be to test this qemuBuildCommandLine() function. A fuzzer would create the virDomainDef object, we would run qemuBuildCommandLine() over it and see if it crashed or not, whether a sane output was generated or not. Then to take this one level up, virDomainDef is produced by virDomainDefParse() which takes a string (read XML document) and parses it. At this point, the fuzzer does not need to care about virDomainDef at all, it can just create all possible XML documents and call virDomainDefParse() over them, and then qemuBuildCommandLine() over the result of parser. Therefore I think this is what we should aim at.
in /src/util/vircommand.c: there are virCommandSetWorkingDirectory(), and virCommandProcessIO() in /src/rpc/virnetsocket.c, there is virNetSocketNewConnectCommand(). in /src/storage/storage_util.c, there is storageBackendCreateQemuImgSetOptions(). etc
3. And libvirt also is compiled with libxml2.
Yes. This has strong historical background (hint: look who started libvirt and who wrote libxml2 ;-)). Frankly, I don't think we've ever considered a different xml parsing library.
Oh yeah, just out of curiosity, I git cloned libxml2 and found the name of Daniel Veillard, then found out more stories. Really amazing work.
Yeah. Daniel wrote libxml2 and then started libvirt. So choosing XML was a no-brainer :-).
Maybe I should not ask, (but since I know nothing yet, it won't hurt), have we ever considered another format alternative to xml? Json, for example, since xml is kind of hard to parse.
It's not any harder than JSON. There's 1:1 mapping between JSON and XML. Anything that can be expressed in one format can be expressed in the other too. And we cannot really switch formats because we try to stay backward compatible. Meaning, if you write a program that co-operates with libvirt, any subsequent update of libvirt should not break your application. Therefore, if your application knows how to create/parse XML documents (because that's what you need currently if you talk to libvirt), we cannot switch to JSON, because your application would stop working with XML parser error.
Just a side note - there are plenty of projects on top of libvirt which create/parse the XML for you so you don't even have to touch XML yourself. You don't even know that there's XML behind the scenes. One such project is libvirt-gobject, for instance. So XML is not an issue here.
By the way, when I save a VM's state with "virsh save ID FILE_NAME", it generated a huge XML file (500 M to 16G)
Yes, because the file does not only contain the domain XML, but also internal state of the domain (=guest memory + qemu memory). Therefore, if you restore from it, you will get the very same state as when doing the save.
then I found out virDomainSnapshotCreateXML() is called when executing that command ( https://libvirt.org/formatsnapshot.html). When restoring the state, it is calling virDomainRevertToSnapshot(), virDomainSnapshotGetXMLDesc etc.
Not really. 'virsh save' calls virDomainSave() which calls qemuDomainSave() (because conn->driver points to qemuHypervisorDriver). qemuDomainSave() -> qemuDomainSaveFlags() -> qemuDomainSaveInternal() -> qemuDomainSaveMemory() where basically all the interesting work takes place.
For the XML fuzzing project, do we need to consider those situations?
I think the most important is to have XML -> qemu cmd line fuzzing in place and only after that focus on extending that to what I'm describing below in previous e-mails.
Hi Michal,
I have been digesting your comments. I then switched my focus from general instrumentation and fuzzing to qemuBuildCommandLine(). I have been having difficulty resolving the dependencies/shared objects in order to fuzz a particular function. I have come to the conclusion (though I have not started yet) that, to target specific functions, some helper functions need to be in place to handle the callbacks, and hand-crafted instrumentation also seems necessary. This might be one of the cases where programming is necessary for this project. Given the slow progress, or maybe because I started later than would have been ideal, I am a bit worried whether I can finish the requirements before the submission deadline, not to mention the other libvirt community-specific requirements mentioned on the website. So, to make sure I am on the right track, what are the concrete goals to achieve, specific requirements to meet, or procedures for me to follow in order to submit the application by the deadline?
Thanks,
Dan
4. Then in virt-xml-validate, which is a bash script, (in build/bin directory after make install) calling xmllint.
Yeah. Writing our XMLs by hand can be overwhelming. Moreover, libvirt has this philosophy of ignoring unknown elements/attributes. So it might happen that for instance you have a typo in an element name and you're still wondering why libvirt ignores that particular setting (e.g. path to disk of domain). Therefore we have grammar rules (RNG) that could help you here - virt-xml-validate would error out in this example. Well, even virsh errors out now because it instructs libvirt to do the XML validation before parsing. But that hasn't always been the case.
I have not been able to get round to figuring out the relations of the above pieces yet. I spent some time trying to instrument and compile the executables with AFL, but so far with no luck. (The idea is as simple as changing gcc in Makefile/configure to afl-gcc). The attached figure is just a demo showing AFL fuzzing virt-admin, which is not instrumented (so kind of boring and not quite useful). But I think AFL could be one of the candidates as a fuzzer for this project due to its prevalence and proven effectiveness.
We don't have to limit ourselves just for domain XML -> qemu cmd line fuzzing. We can look into other areas too (there's a lot of inputs for libvirt), e.g. RPC protocol (we have our own protocol for communication with distant server/client over network), fuzz XML parsers themselves (domain is not the only object that libvirt manages, we have networks, interfaces, storage pools/volumes, etc.). It's just that qemu cmd line fuzzing seemed complicated enough so that the chances of running a fuzzer successfully are high.
All right. I think that's definitely a good idea. I will start looking into this tomorrow and resume the fuzzing experiment that I left. Thank you very much for the detailed explanation. I am having a much better understanding about the scope and how I would plan to confine/manage the timeline of the project.
Cheers.
Michal

On 03/21/2017 04:39 AM, D L wrote:
On Thu, Mar 16, 2017 at 1:03 PM, Michal Privoznik <mprivozn@redhat.com> wrote:
Hi Michal,
Hey,
I have been digesting your comments. I then switched my focus from general instrumentation and fuzzing to qemuBuildCommandLine(). I have been having difficulty resolving the dependencies/shared objects in order to fuzz a particular function. I have come to the conclusion (though I have not started yet) that, to target specific functions, some helper functions need to be in place to handle the callbacks, and hand-crafted instrumentation also seems necessary. This might be one of the cases where programming is necessary for this project.
I don't think that we want to fuzz functions called from qemuBuildCommandLine() separately. That indeed would be too overwhelming. I think we would be perfectly okay with fuzzing the qemuBuildCommandLine() itself (well, with help of XML parsing as described in my previous e-mails). So we might focus on generating XMLs for now (e.g. write a grammar that does that? dunno - don't have much experience with fuzzers). The whole idea that I have in my mind is as follows:
1) let fuzzer generate an XML document
2) def = virDomainDefParse*(document);
3) qemuBuildCommandLine(def);
4) if SIGSEGV store XML somewhere for future inspection
5) goto 1)
For points 2) and 3) we might need to create a binary, but that should be fairly easy to do. Does this sound reasonable to you?
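The five steps above could be driven by a small loop like the following Python sketch. It assumes the binary for points 2) and 3) reads one XML document on stdin; that interface, and the names run_once, fuzz and FAKE_HARNESS, are assumptions made for illustration. No such libvirt binary exists yet, so a child Python process that kills itself with SIGSEGV stands in for it.

```python
import os
import signal
import subprocess
import sys
import tempfile


def run_once(harness_cmd, xml_text, crash_dir):
    """Feed one XML document to the harness on stdin; keep it if the harness segfaults."""
    proc = subprocess.run(harness_cmd, input=xml_text.encode(),
                          stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # On POSIX, a negative return code means the child was killed by that signal.
    if proc.returncode == -signal.SIGSEGV:
        fd, path = tempfile.mkstemp(suffix=".xml", dir=crash_dir)
        with os.fdopen(fd, "w") as f:
            f.write(xml_text)
        return path
    return None


def fuzz(harness_cmd, generate, crash_dir, iterations):
    """Steps 1) to 5): generate, run, record crashes, repeat."""
    saved = []
    for _ in range(iterations):
        path = run_once(harness_cmd, generate(), crash_dir)
        if path is not None:
            saved.append(path)
    return saved


# Stand-in harness for demonstration only: a child Python process that kills
# itself with SIGSEGV whenever its input contains the marker "<boom/>".
FAKE_HARNESS = [sys.executable, "-c",
                "import os, signal, sys\n"
                "data = sys.stdin.read()\n"
                "if '<boom/>' in data:\n"
                "    os.kill(os.getpid(), signal.SIGSEGV)\n"]
```

A real setup would replace FAKE_HARNESS with the binary that calls the parser and qemuBuildCommandLine(), and would also capture a backtrace (for example from a core dump) next to the saved XML.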
Given the slow progress, or maybe because I started later than would have been ideal, I am a bit worried whether I can finish the requirements before the submission deadline, not to mention the other libvirt community-specific requirements mentioned on the website.
Well, the requirements for submitting do not include having all the coding ready :-). You can check the requirements here:
http://wiki.libvirt.org/page/Google_Summer_of_Code_FAQ
Then, for the student application you should describe in the form how you want to achieve the goal, design some time line and so on. Don't worry, you can edit it until the deadline.
So, to make sure I am on the right track, what are the concrete goals to achieve, specific requirements to meet, or procedures for me to follow in order to submit the application by the deadline?
Well, you've successfully subscribed to the list and I assume you've cloned and compiled libvirt. So what you need to do is to prove it - send a patch that fixes something in libvirt. There is a link in the FAQ to a list of bite sized tasks. Or I can think of something easy if you want. Michal

On Tue, Mar 21, 2017 at 16:15:35 +0100, Michal Privoznik wrote:
On 03/21/2017 04:39 AM, D L wrote:
On Thu, Mar 16, 2017 at 1:03 PM, Michal Privoznik <mprivozn@redhat.com> wrote:
[...]
necessary. This might be one of the cases where programming is necessary for this project.
I don't think that we want to fuzz functions called from qemuBuildCommandLine() separately. That indeed would be too overwhelming. I think we would be perfectly okay with fuzzing the qemuBuildCommandLine() itself (well, with help of XML parsing as described in my previous e-mails). So we might focus on generating XMLs for now (e.g. write a grammar that does that? dunno - don't have much experience with fuzzers). The whole idea that
Ideally it should take the grammar we have for our XMLs so that we don't have to update it manually all the time.
I have in my mind is as follows:
1) let fuzzer generate an XML document
2) def = virDomainDefParse*(document);
3) qemuBuildCommandLine(def);
4) if SIGSEGV store XML somewhere for future inspection
including backtrace
5) goto 1)

On 03/21/2017 04:34 PM, Peter Krempa wrote:
On Tue, Mar 21, 2017 at 16:15:35 +0100, Michal Privoznik wrote:
On 03/21/2017 04:39 AM, D L wrote:
On Thu, Mar 16, 2017 at 1:03 PM, Michal Privoznik <mprivozn@redhat.com> wrote:
[...]
necessary. This might be one of the cases where programming is necessary for this project.
I don't think that we want to fuzz functions called from qemuBuildCommandLine() separately. That indeed would be too overwhelming. I think we would be perfectly okay with fuzzing the qemuBuildCommandLine() itself (well, with help of XML parsing as described in my previous e-mails). So we might focus on generating XMLs for now (e.g. write a grammar that does that? dunno - don't have much experience with fuzzers). The whole idea that
Ideally it should take the grammar we have for our XMLs so that we don't have to update it manually all the time.
While this would certainly be interesting thing to do I'm afraid of two things here:
1) state explosion - our XML schema is so complicated that trying to generate each state it could be in depending on grammar would lead to "uncountable" many states. Plus calling 2) + 3) over them would take ages to finish. But we can aim on a very basic subset for now and probably expand that later?
2) Reversing the process from RNG to XML generation: how would that even work? I mean, how do you parse RNG schema and reason about it? I know it's an XML document just like any other, but what I am interested in is how to catch the meaning of rules written in the schema. For instance:
<element name="blah"> <zeroOrMore> <element name="subBlah"> <text/> </element> </zeroOrMore> </element>
We all know what this simple grammar can generate. But if I were to write a program that parses the rules and generates XML documents according to them, I'd probably end up hiding under the desk.
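For what it is worth, walking such rules is mechanical for a tiny subset. The Python sketch below handles only element, zeroOrMore, optional and text; everything else in RELAX NG (choice, attribute, ref/define, data types, interleave) is deliberately left out, so this says nothing about how hard the full libvirt schema would be. The function names and the sample strings are made up for illustration.

```python
import random
import xml.etree.ElementTree as ET

RNG_NS = "{http://relaxng.org/ns/structure/1.0}"

# The example grammar from the e-mail, with the RELAX NG namespace added.
SCHEMA = """
<element name="blah" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="subBlah">
      <text/>
    </element>
  </zeroOrMore>
</element>
"""


def generate(pattern, parent, rng, max_repeat=3):
    """Expand one RELAX NG pattern node into content under `parent`."""
    tag = pattern.tag.replace(RNG_NS, "")
    if tag == "element":
        child = ET.SubElement(parent, pattern.get("name"))
        for sub in pattern:
            generate(sub, child, rng, max_repeat)
    elif tag == "zeroOrMore":
        for _ in range(rng.randint(0, max_repeat)):
            for sub in pattern:
                generate(sub, parent, rng, max_repeat)
    elif tag == "optional":
        if rng.random() < 0.5:
            for sub in pattern:
                generate(sub, parent, rng, max_repeat)
    elif tag == "text":
        parent.text = (parent.text or "") + rng.choice(["", "a", "42", "x" * 8])
    else:
        raise NotImplementedError(f"pattern <{tag}> is not handled by this sketch")


def generate_document(schema_text, seed):
    """Return one random document that follows the (tiny) schema."""
    rng = random.Random(seed)
    holder = ET.Element("holder")  # temporary parent for the generated root
    generate(ET.fromstring(schema_text), holder, rng)
    return ET.tostring(holder[0], encoding="unicode")


print(generate_document(SCHEMA, seed=1))
```

Capping repetition with max_repeat is one crude way to keep the number of generated states bounded.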
I have in my mind is as follows:
1) let fuzzer generate an XML document
2) def = virDomainDefParse*(document);
3) qemuBuildCommandLine(def);
4) if SIGSEGV store XML somewhere for future inspection
including backtrace
Ah, sure.
5) goto 1)
Michal

On Tue, Mar 21, 2017 at 17:09:58 +0100, Michal Privoznik wrote:
On 03/21/2017 04:34 PM, Peter Krempa wrote:
On Tue, Mar 21, 2017 at 16:15:35 +0100, Michal Privoznik wrote:
On 03/21/2017 04:39 AM, D L wrote:
On Thu, Mar 16, 2017 at 1:03 PM, Michal Privoznik <mprivozn@redhat.com> wrote:
[...]
necessary. This might be one of the cases where programming is necessary for this project.
I don't think that we want to fuzz functions called from qemuBuildCommandLine() separately. That indeed would be too overwhelming. I think we would be perfectly okay with fuzzing the qemuBuildCommandLine() itself (well, with help of XML parsing as described in my previous e-mails). So we might focus on generating XMLs for now (e.g. write a grammar that does that? dunno - don't have much experience with fuzzers). The whole idea that
Ideally it should take the grammar we have for our XMLs so that we don't have to update it manually all the time.
While this would certainly be interesting thing to do I'm afraid of two things here:
1) state explosion - our XML schema is so complicated that trying to generate each state it could be in depending on grammar would lead to "uncountable" many states. Plus calling 2) + 3) over them would take ages to
Yes, these are the problems of fuzzing. By definition [1] you need to tell the fuzzer what is and what isn't a valid input. Otherwise you'd already get an exploded state. Are you expecting to test any random string as an XML? Or at least any valid XML as a libvirt xml? [2] You also need the schema to produce partially valid input so that other code paths can be reached, otherwise you'd mostly get stuck at the first error check in the parser. Basically the schema is quite the opposite. It very drastically limits the amount of strings (or valid XML files) that you should feed to the parser so that it actually tests reasonable stuff.
finish. But we can aim on a very basic subset for now and probably expand that later?
I'm afraid that if you stick with a subset or don't make it automated, it won't get finished ever.
2) Reversing the process from RNG to XML generation: how would that even work? I mean, how do you parse RNG schema and reason about it? I know it's an XML document just like any other, but what I am interested in is how to catch the meaning of rules written in the schema. For instance:
<element name="blah"> <zeroOrMore> <element name="subBlah"> <text/> </element> </zeroOrMore> </element>
You picked a very bad subset for demonstration since it basically allows everything, which is not very far from the infinite ape theorem. Mostly such elements would be parsed verbatim, so the only failure you could ever get is memory allocation problem. If you pick an <optional> or something mandating an input format (<choice>, etc.), you get a set of valid and invalid settings. The fuzzer should test some of the valid ones along with a few random invalid to see if it fails.
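As a rough illustration of "some of the valid ones along with a few random invalid", here is a Python sketch for a single <choice> of fixed values. The function name and the example value set are assumptions for illustration; a real fuzzer would pull the allowed values out of the schema.

```python
import random
import string


def choice_test_values(valid_values, n_invalid, seed):
    """Return every valid value plus `n_invalid` random strings outside the set."""
    rng = random.Random(seed)
    valid = set(valid_values)
    invalid = []
    while len(invalid) < n_invalid:
        candidate = "".join(rng.choice(string.ascii_lowercase)
                            for _ in range(rng.randint(1, 8)))
        if candidate not in valid and candidate not in invalid:
            invalid.append(candidate)
    # Each entry is (value, expected_to_be_accepted).
    return [(v, True) for v in valid_values] + [(v, False) for v in invalid]


# Example attribute limited to a small set of values; the set is illustrative.
cases = choice_test_values(["file", "block", "dir"], n_invalid=2, seed=0)
for value, should_pass in cases:
    print(value, should_pass)
```

The expected-result flag lets the driver check both directions: valid input that is rejected and invalid input that is silently accepted.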
We all know what this simple grammar can generate. But if I were to write a program that parses the rules and generates XML documents according to them, I'd probably end up hiding under the desk.
Isn't that the job of the fuzzer?
I have in my mind is as follows:
1) let fuzzer generate an XML document
2) def = virDomainDefParse*(document);
3) qemuBuildCommandLine(def);
BTW if you want to check the command line generator too, you need to have a valid XML on input so the schema is actually the way to go.
4) if SIGSEGV store XML somewhere for future inspection
[...] [1] https://en.wikipedia.org/wiki/Fuzz_testing [2] https://en.wikipedia.org/wiki/Infinite_monkey_theorem

On Tue, Mar 21, 2017 at 11:15 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
On 03/21/2017 04:39 AM, D L wrote:
On Thu, Mar 16, 2017 at 1:03 PM, Michal Privoznik <mprivozn@redhat.com> wrote:
Hi Michal,
Hey,
I have been digesting your comments. I then switched focus from general instrumentation and fuzzing to qemuBuildCommandLine(). I have been having difficulty resolving the dependencies and shared objects needed to fuzz a particular function. My tentative conclusion, though I have not started on it yet, is that targeting specific functions requires helper functions to handle the callbacks, and hand-crafted instrumentation also seems necessary. This might be one of the cases where programming is necessary for this project.
I don't think that we want to fuzz functions called from qemuBuildCommandLine() separately. That indeed would be too overwhelming. I think we would be perfectly okay with fuzzing the qemuBuildCommandLine() itself (well, with help of XML parsing as described in my previous e-mails). So we might focus on generating XMLs for now (e.g. write a grammar that does that? dunno - don't have much experience with fuzzers). The whole idea that I have in my mind is as follows:
1) let fuzzer generate an XML document
2) def = virDomainDefParse*(document);
3) qemuBuildCommandLine(def);
4) if SIGSEGV store XML somewhere for future inspection
5) goto 1)
For points 2) and 3) we might need to create a binary, but that should be fairly easy to do. Does this sound reasonable to you?
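The five steps above can be sketched as a small driver loop in Python. This is only an illustration under stated assumptions: the harness binary that performs steps 2) and 3) does not exist yet, so it is represented here by an arbitrary command that reads the XML document on stdin; the names run_once and fuzz, and the crash-file naming scheme, are hypothetical. The one fact relied on is that on POSIX systems subprocess reports a child killed by a signal as a negative return code.

```python
import hashlib
import os
import subprocess


def run_once(harness_cmd, xml_text, crash_dir, timeout=10):
    """Feed one document to the harness; save it if the harness is killed
    by a signal (e.g. SIGSEGV). Returns the saved path, or None."""
    data = xml_text.encode()
    try:
        proc = subprocess.run(
            harness_cmd,
            input=data,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # hangs are ignored in this sketch
    if proc.returncode >= 0:
        return None  # normal exit, including ordinary parse errors
    # POSIX: a negative return code -N means "terminated by signal N".
    signum = -proc.returncode
    digest = hashlib.sha1(data).hexdigest()[:12]
    os.makedirs(crash_dir, exist_ok=True)
    path = os.path.join(crash_dir, "sig%d-%s.xml" % (signum, digest))
    with open(path, "wb") as handle:
        handle.write(data)  # step 4: keep the XML for future inspection
    return path


def fuzz(harness_cmd, generate, crash_dir, iterations):
    """Steps 1-5: generate, run, record crashes, repeat."""
    crashes = []
    for _ in range(iterations):
        document = generate()  # step 1: any callable returning XML text
        saved = run_once(harness_cmd, document, crash_dir)
        if saved is not None:
            crashes.append(saved)
    return crashes
```

A call would look like fuzz(["./harness"], generate, "crashes", 1000), where "./harness" stands for the yet-to-be-written binary and generate is whatever produces the XML. Keeping the generator a plain callable means the same loop works with a schema-driven generator or with a mutation of a seed file.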
That's great. I really appreciate the clarity. Yes, I also think we need to generate binaries. The workflow sounds reasonable to me now.
Given the slow progress, perhaps because I started later than would have been ideal, I am a bit worried about whether I can meet the requirements before the submission deadline, not to mention the other libvirt community-specific requirements mentioned on the website.
Well, the requirements for submitting do not include having all the coding ready :-). You can check the requirements here:
http://wiki.libvirt.org/page/Google_Summer_of_Code_FAQ
Then, for the student application you should describe in the form how you want to achieve the goal, design some time line and so on. Don't worry, you can edit it until the deadline.
Thanks a lot for letting me know again about this.
So to make sure I am on the right track, what are the concrete goals to achieve, the specific requirements to meet, or the procedures for me to follow in order to submit the application by the deadline?
Well, you've successfully subscribed to the list and I assume you've cloned and compiled libvirt. So what you need to do is to prove it - send a patch that fixes something in libvirt. There is a link in the FAQ to a list of bite-sized tasks. Or I can think of something easy if you want.
Michal
Yes, I compiled, installed, and used the binaries successfully.
Could you confirm the location of bug list is the following, please?
https://bugzilla.redhat.com/buglist.cgi?component=libvirt&product=Virtualization%20Tools
https://bugzilla.redhat.com/enter_bug.cgi?product=Virtualization%20Tools&component=libvirt
"This list is too long for Red Hat Bugzilla's little mind"
I will take a look at several of the latest ones to see if I can solve one of them; I will let you know otherwise.
Thanks a lot,
Dan

On 03/21/2017 07:04 PM, D L wrote:
Yes, I compiled, installed, and used the binaries successfully. Could you confirm the location of bug list is the following, please?
https://bugzilla.redhat.com/buglist.cgi?component=libvirt&product=Virtualization%20Tools
This will fetch all bugs there are or ever were for upstream libvirt, regardless of their state. You want just the open ones. Thus this should be: https://bugzilla.redhat.com/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&component=libvirt&product=Virtualization%20Tools&query_format=advanced
https://bugzilla.redhat.com/enter_bug.cgi?product=Virtualization%20Tools&component=libvirt
This is for entering a new bug. Unless you've found one, you don't need this. Michal
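For reference, the open-bugs query above can also be assembled from its individual parameters, which makes the filter easier to read and adjust. A minimal sketch using only the Python standard library; the parameter names and values are exactly those in the URL above, while the function name is hypothetical.

```python
from urllib.parse import quote, urlencode

BASE = "https://bugzilla.redhat.com/buglist.cgi"

# A list of pairs (not a dict) because bug_status appears twice.
PARAMS = [
    ("bug_status", "NEW"),
    ("bug_status", "ASSIGNED"),
    ("component", "libvirt"),
    ("product", "Virtualization Tools"),
    ("query_format", "advanced"),
]


def open_bugs_url():
    """Build the query URL; quote_via=quote encodes spaces as %20."""
    return BASE + "?" + urlencode(PARAMS, quote_via=quote)


if __name__ == "__main__":
    print(open_bugs_url())
```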

On Wed, Mar 22, 2017 at 4:04 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
On 03/21/2017 07:04 PM, D L wrote:
Yes, I compiled, installed, and used the binaries successfully. Could you confirm the location of bug list is the following, please?
https://bugzilla.redhat.com/buglist.cgi?component=libvirt&product=Virtualization%20Tools
This will fetch all bugs there are or ever were for upstream libvirt, regardless of their state. You want just the open ones. Thus this should be:
https://bugzilla.redhat.com/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&component=libvirt&product=Virtualization%20Tools&query_format=advanced
https://bugzilla.redhat.com/enter_bug.cgi?product=Virtualization%20Tools&component=libvirt
This is for entering a new bug. Unless you've found one, you don't need this.
Michal
Hi Michal,

To keep the thread consistent, I am writing here about two bugs which have probably been mentioned in another thread. I am watching two bugs and trying to decide which one to fix: 1431652 and 1434550.

For 1434550, I am reading ./src/nodeinfo.c and several other related files; the bug is about a possibly incorrect number of sockets reported when executing 'virsh capabilities' on a host. It seems this depends entirely on a design decision. Showing cell number = 2 and socket = 1 in the XML CPU topology on a machine with 2 CPU sockets is indeed kind of confusing if one does not read the XML carefully or is not familiar with the notion of "cell". Would people want it to be changed, or left in its current state?

For 1431652, I am able to reproduce the error. I also tried an absolute path, which resulted in different output and behavior, as shown by the following from the 'history' command:

  513 truncate -s 100M test-backing.img
  514 pwd
  515 qemu-img create /var/tmp/test-overlay.img -f qcow2 -b 'json:{"driver":"raw","file": {"driver":"file","filename":"/var/tmp/test-backing.img"}}'
  516 ls -lh
  517 history

  root@<host> :/var/tmp# !507
  qemu-img info test-overlay.img
  image: test-overlay.img
  file format: qcow2
  virtual size: 100M (104857600 bytes)
  disk size: 196K
  cluster_size: 65536
  backing file: json:{"driver":"raw","file":{"driver":"file","filename":"/var/tmp/test-backing.img"}}
  Format specific information:
      compat: 1.1
      lazy refcounts: false
      refcount bits: 16
      corrupt: false

  root@<host> :/var/tmp# !508
  virt-install --import --name tmp-relpaths --memory 1024 --disk /var/tmp/test-overlay.img
  WARNING No operating system detected, VM performance may suffer. Specify an OS with --os-variant for optimal results.
  WARNING Graphics requested but DISPLAY is not set. Not running virt-viewer.
  WARNING No console to launch for the guest, defaulting to --wait -1
  Starting install...
  Creating domain... | 0 B 00:00:00
  Domain installation still in progress. Waiting for installation to complete.

At the end, it hung and I had to terminate 'virt-install' with Ctrl+C. So I guess the expected behavior when using 'virt-install' with a relative path should be the same. Is that right?

I may want to try both bugs, and maybe one of them can be solved in a timely manner. Or would you recommend one of them, or a different one altogether? (I did try searching for XML parser or command line generation bugs, which would be closer to the fuzzing project.)

Dan

On Tue, Mar 07, 2017 at 12:27:58AM -0500, D L wrote:
On Sun, Mar 5, 2017 at 2:47 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
Regarding fuzzing, I think we can try running several fuzzing tools in parallel, as different fuzzers tend to find different kinds of bugs. For example, AFL (American Fuzzy Lop) [1], a coverage-guided, mutation-based fuzzer that uses a genetic algorithm, can take hand-crafted XML seeds to fuzz our libvirt target. Alternatively, we could develop a generation-based grammar module in AFL (which is definitely non-trivial); so far I have not seen active development in the AFL community on XML grammar generation. Another option could be Clang's libFuzzer [2].
Several related articles show examples of such fuzzing: using AFL to generate SQL [3], llvm-afl [4], and hexml fuzzing with AFL [5]. In combination with lcov, we could compare different fuzzers and guide our fuzzing tuning.
FYI, I would very much like to see it use a fuzzer that is open source, because I'd like the end result of the project to ideally produce some test suite or test framework that we can put in to our CI system and run daily to validate future changes. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
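As a rough illustration of what "mutation-based" means in the quoted text, the following Python sketch applies a few random byte-level edits to a hand-crafted XML seed. It is a toy and deliberately not AFL's algorithm: AFL additionally uses coverage feedback to decide which mutated inputs to keep, which is omitted here. The function name and the choice of the four edit operations are assumptions made for this example.

```python
import random


def mutate(seed, rng, max_edits=4):
    """Return a copy of seed (bytes) with 1..max_edits random edits."""
    data = bytearray(seed)
    for _ in range(rng.randint(1, max_edits)):
        op = rng.choice(("flip", "insert", "delete", "duplicate"))
        if op == "flip" and data:
            # Flip a single bit in a random byte.
            i = rng.randrange(len(data))
            data[i] ^= 1 << rng.randrange(8)
        elif op == "insert":
            # Insert one random byte at a random position.
            data.insert(rng.randint(0, len(data)), rng.randrange(256))
        elif op == "delete" and data:
            del data[rng.randrange(len(data))]
        elif op == "duplicate" and data:
            # Copy a short slice (up to 16 bytes) in place, e.g. a repeated tag.
            i = rng.randrange(len(data))
            j = rng.randint(i, min(len(data), i + 16))
            data[i:i] = data[i:j]
    return bytes(data)


if __name__ == "__main__":
    seed = b"<domain type='kvm'><name>demo</name></domain>"
    rng = random.Random(0)
    for _ in range(3):
        print(mutate(seed, rng))
```

Most outputs of such blind mutation are not well-formed XML and are rejected by the first parser check, which is the limitation the schema discussion in this thread is about.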

On Thu, Mar 16, 2017 at 1:29 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
On Tue, Mar 07, 2017 at 12:27:58AM -0500, D L wrote:
On Sun, Mar 5, 2017 at 2:47 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
Regarding fuzzing, I think we can try running several fuzzing tools in parallel, as different fuzzers tend to find different kinds of bugs. For example, AFL (American Fuzzy Lop) [1], a coverage-guided, mutation-based fuzzer that uses a genetic algorithm, can take hand-crafted XML seeds to fuzz our libvirt target. Alternatively, we could develop a generation-based grammar module in AFL (which is definitely non-trivial); so far I have not seen active development in the AFL community on XML grammar generation. Another option could be Clang's libFuzzer [2].
Several related articles show examples of such fuzzing: using AFL to generate SQL [3], llvm-afl [4], and hexml fuzzing with AFL [5]. In combination with lcov, we could compare different fuzzers and guide our fuzzing tuning.
FYI, I would very much like to see it use a fuzzer that is open source, because I'd like the end result of the project to ideally produce some test suite or test framework that we can put in to our CI system and run daily to validate future changes.
Hi Daniel,
Yes, I am definitely focusing on open source fuzzers. There is a question I have had for quite a while. I assumed that behind the scenes of most established open source projects there is a security team doing security testing on a regular basis, with the tool chains and standardized procedures to find, report, and fix security vulnerabilities. They may or may not collaborate with the developer teams. It is also possible that some of those exploitable bugs were discovered purely by interested individuals as a side project, and some of them eventually got a CVE assigned. I was hoping to find some record of how such bugs were discovered, i.e., tutorial-like documentation describing how to work on a large-scale industrial fuzzing project.

I got most of these impressions from the following links about libxml2 AFL fuzzing bug reports:
https://bugzilla.gnome.org/show_bug.cgi?id=744980
https://bugzilla.gnome.org/show_bug.cgi?id=756263
https://bugzilla.gnome.org/show_bug.cgi?id=759020
https://bugzilla.gnome.org/show_bug.cgi?id=759671
https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2015-7115
https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2015-7116

Beyond the libvirt community, is libxml2's situation also similar to that of other major open source projects?

Dan

Regards,
Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

On Thu, Mar 16, 2017 at 1:29 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
On Tue, Mar 07, 2017 at 12:27:58AM -0500, D L wrote:
On Sun, Mar 5, 2017 at 2:47 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
Regarding fuzzing, I think we can try running several fuzzing tools in parallel, as different fuzzers tend to find different kinds of bugs. For example, AFL (American Fuzzy Lop) [1], a coverage-guided, mutation-based fuzzer that uses a genetic algorithm, can take hand-crafted XML seeds to fuzz our libvirt target. Alternatively, we could develop a generation-based grammar module in AFL (which is definitely non-trivial); so far I have not seen active development in the AFL community on XML grammar generation. Another option could be Clang's libFuzzer [2].
Several related articles show examples of such fuzzing: using AFL to generate SQL [3], llvm-afl [4], and hexml fuzzing with AFL [5]. In combination with lcov, we could compare different fuzzers and guide our fuzzing tuning.
FYI, I would very much like to see it use a fuzzer that is open source, because I'd like the end result of the project to ideally produce some test suite or test framework that we can put in to our CI system and run daily to validate future changes.
Regards, Daniel --
Hi all,

I am reviewing the feedback in this thread, and I would like to revisit some topics. I think this project is more about finding bugs in libvirt by fuzzing, especially bugs that could turn out to be security vulnerabilities. Thus the input file could be anything that pretends to be legitimate XML and could potentially crash the target program, such as virsh. Depending on the exact fuzzer, whether mutational, generational, or even hybrid, the fuzzer engine and the executor will take care of most of the work, including input file generation, mutation, testing, recording, and reporting. Fuzzing allows us to reproduce a bug with the recorded culprit XML file, and then we have a case where we have found a bug. It is totally a lazy person's tool for software testing, without writing much code.

Therefore, I am modifying this project a little, towards a CI fuzz-testing framework: potentially a deliverable product presenting a centralized, real-time status of online fuzzing information, integrated with libvirt's existing toolchain. The components of the framework include a fuzzer manager, a panel of open source fuzzer engines, an executor, and a CI and dashboard system. There is related work such as oss-fuzz. However, the most obvious difference is that this framework can potentially be closely integrated into the existing libvirt community workflow, or that of any similar open source community that would like its own fuzzing CI with flexible, versioned configuration.

Dan

|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
participants (6)
- D L
- Da L
- Dan
- Daniel P. Berrange
- Michal Privoznik
- Peter Krempa