========================================
OpenDNSSEC -- a DNS Workflow Description
========================================

The following is my (Rick van Rein's) design based on a day of discussion at
NLNet Labs regarding DNS aspects and implementation concerns for the DNS-aware
parts of the OpenDNSSEC signer. It is a proposed architecture, orthogonal to
the KASP which manages keys for DNSSEC.

Please read this together with the diagram that I've drawn. A quick legend for
the symbols used is included with the graph.

Structure: Concurrency Control
==============================

We need a few clear concepts to guide us through the incredibly versatile
possibilities of concurrency. Without them, we could easily drown and be left
with a scheduling problem that is too broad to be effectively implemented.

Below, the following principles are proposed as guidelines to keep the
concurrency model simple yet flexible:

* Atomic changes
* Queues between processes
* Transactional process steps
* Serialisable domain updates
* State-of-the-art DNS queries

These concepts help to avoid race conditions, and the absence of directed
cycles (that is, cycles found when following arrows in their pointing
direction) makes it possible to be sure that there will be no deadlock either.

Atomic changes: "Diff"
----------------------

The idea is to transport only changes of DNS information, but to do so in
atomic changes. Atomic means that no other process can detect an intermediate
state: the whole thing contained in a single communication package on a "Diff"
channel is either to-be-done or done.

A "Diff" comprises any combination of removals of a name/type combination of a
resource record, followed by new record data (or nothing if there are no new
contents). These "Diff" records are conceptually similar to an IXFR, although
it is likely that a realistic IXFR will be split into smaller blocks, each of
which would separately be treated as an RRset to be signed.
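As an illustration, a minimal "Diff" record might be modelled as below. This is
a sketch only; the class and field names are hypothetical and not part of any
OpenDNSSEC API:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A resource record, reduced to the fields relevant here:
# (name, rrtype, rdata) -- e.g. ("www.example.", "A", "192.0.2.1")
ResourceRecord = Tuple[str, str, str]

@dataclass
class Diff:
    """One atomic change: removals followed by additions.

    Applying a Diff is all-or-nothing; no reader may observe a
    state in which only part of it has been applied.
    """
    removals: List[ResourceRecord] = field(default_factory=list)
    additions: List[ResourceRecord] = field(default_factory=list)

def apply_diff(zone: set, diff: Diff) -> set:
    """Return a new zone state with the Diff applied as one step."""
    new_zone = set(zone)
    for rr in diff.removals:
        new_zone.discard(rr)
    for rr in diff.additions:
        new_zone.add(rr)
    return new_zone
```

An empty ``additions`` list models a pure removal, matching the "or nothing if
there are no new contents" case above.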
The separation of IXFR into minimalistic "Diff" records (and differencing an
AXFR to obtain similarly small "Diff" records) is helpful in enabling other
processes to merge in their tasks even if the work from processing an IXFR or
AXFR is heavy. This helps to keep OpenDNSSEC responsive to emergency
resigning, even when faced with a large zone resigning operation.

Queues between processes
------------------------

A queue is a conceptually straightforward synchronisation construct. Its
principal operation comes down to a first-in, first-out structure. In reality
though, a queue could optimise by sending out high-priority events before
normal-priority ones. This could be used to advance emergency operations like
emergency key rollover. Queues should adhere to the principle of domain
serialisability, described below. Within these constraints, they could
optimise to their liking.

Before a process can write into a queue, it should obtain a mutually exclusive
lock on that queue. When done writing, the lock is to be released. When
reading the next "Diff" record from the queue, a lock is also given on that
record; after being done with it, the lock ought to be released.

This is very often a conceptual design strategy that need not be implemented.
For example, if the steps on the left and right side of Q1 are part of the
same implementation process, there may not be a need to actually lock
anything. Also, an implementation might do everything before Q2 in one process
(probably using ``select()`` to wait for I/O or alerts); in that case there is
effectively no need to lock the writing side of Q2 at all, let alone compete
for it.

Transactional process steps
---------------------------

Every rectangle in the diagram represents what could conceptually be seen as a
simple consumer/producer process. It consumes a "Diff" record (or something
else as annotated on the incoming arrow) from the incoming queue, and
produces one or more "Diff" records, or other output, on its outgoing queue.
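As a sketch of such a queue, Python's standard priority queue can deliver
emergency events first while preserving first-in, first-out order within a
priority level. The class name and priority values here are illustrative, not
prescribed by this design:

```python
import itertools
import queue

# Lower number = more urgent; the design only requires that
# emergency events can overtake normal ones.
EMERGENCY, NORMAL = 0, 1

class DiffQueue:
    """FIFO per priority level; emergencies overtake normal work."""

    def __init__(self):
        self._q = queue.PriorityQueue()
        self._seq = itertools.count()  # tie-breaker keeps FIFO order

    def put(self, diff, priority=NORMAL):
        self._q.put((priority, next(self._seq), diff))

    def get(self):
        _prio, _seq, diff = self._q.get()
        return diff
```

Reordering across priorities must still respect the domain serialisability
constraint described below; this sketch does not enforce that.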
Every process step, that is the whole process of consuming a record, chewing
on it and producing one or more bits of output, can be seen as a transaction,
or an atomic processing step. Conceptually, it appears as if the consumption
of a "Diff" on the input and the production of the corresponding output happen
instantaneously, or more accurately, time cannot be split finely enough to
look at an intermediate state.

Locks are of course the way to achieve this: first acquire a lock on the input
(by consuming an input record), then lock the output and produce whatever
comes out, then release the locks at the *same* time. When the read lock on
the input queue is lifted, the corresponding element is removed from that
queue. By always obtaining the locks in this order (if we need to implement
them at all) it is possible to be certain that no deadlock will ever occur.

Once again, optimising variations would be possible, for example to enable
multiple processes to sign in parallel, but that requires a more complicated
queue. The queue would enable multiple processes to lock consecutive positions
in the queue, and would pass on the information in an order compliant with the
conceptual idea of a queue.

Serialisable domain updates
---------------------------

It is probably good to apply one constraint to any optimisations that relate
to concurrency, and that is the concurrency concept that is also used in
databases. This principle states that it should always be possible to assume a
particular ordering of transactions after a number of transactions have
committed. It could be said that databases enforce a linear model of time, as
opposed to branching time.

There are situations where it is safe to change the order of "Diff" records in
a queue, possibly to the advantage of emergency updates. This would at least
seem possible if they handle two different domains. It may or may not apply to
hierarchically ordered domains, such as a parent and its child.
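The input-first, output-second lock ordering could be sketched as follows; the
queue representation and handler interface are illustrative assumptions, not
part of the design:

```python
import threading
from collections import deque

class LockedQueue:
    """A deque guarded by a mutex, as in the conceptual design."""
    def __init__(self):
        self.lock = threading.Lock()
        self.items = deque()

def process_step(inq: LockedQueue, outq: LockedQueue, handle):
    """One transactional step: consume one record, produce its output.

    Locks are always taken input-first, then output, and released
    together; with this fixed global ordering no deadlock can occur.
    """
    with inq.lock:               # read lock on the input queue
        if not inq.items:
            return False
        record = inq.items[0]    # peek; not removed until commit
        with outq.lock:          # lock output before committing
            for out in handle(record):
                outq.items.append(out)
            inq.items.popleft()  # removal == lifting the read lock
    return True
```

Because both locks are released on leaving the ``with`` blocks, no observer can
see the output produced while the input record is still pending.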
State-of-the-art DNS queries
----------------------------

An important requirement of several of the processing steps is that they ought
to be able to query for the most recent DNS data. With a domain going through
all its various stages of being signed, this is far from simple. It should
suffice if each processing step can rely on the data that it has output
itself -- it does not seem to be necessary to incorporate any updates
performed by later processing steps.

This means that a need for DNS information can be sent along the arrows,
straight through queues and processing steps until it ends in the zoneDB. That
is, if no intermediate "Diff" exists that answers the need for information.
Such a "Diff" would have been output by the current process, and is thus of
concern. Thanks to the serialisability property, the nearest-by "Diff" is
always the most recent one output by the current processing step, so it is the
one to be used. Note that processing steps are atomic, so "Diff" data need
only be looked for in queues. Probably the best approach is to include the
read/output side of a queue even if it is locked, but never to include the
write/input side of a queue if it is locked -- the latter being subject to
change.

For example, if the step "re-list all RR" needs to query for all records in a
domain, it would ask Q2, which could either find a "Diff" that answers the
query, or pass it on to Q3, Q4 and so on until zoneDB. If the same name occurs
in Q2 and zoneDB, the "Diff" in Q2 prevails. If a "Diff" exists in both Q3 and
Q4 covering the same DNS request, the version in Q3 is delivered back to the
inquiring process.

Structure: Timing and scheduling
================================

DNS introduces timing requirements in the form of TTL and SOA parameters.
DNSSEC adds even more such constraints. In essence, the adjoining schema does
not take all these timing requirements into account. Instead, the proposal is
to look upon it as a generic problem of *realtime scheduling*.
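The nearest-"Diff"-wins lookup described above amounts to walking the queues in
order and falling back to the zone database. A sketch under simplifying
assumptions (each queue as a list of dict-like "Diff" records, newest last;
all names hypothetical):

```python
def state_of_the_art(name, rrtype, queues, zonedb):
    """Find the most recent data for (name, rrtype).

    `queues` is ordered nearest-first (e.g. [Q2, Q3, Q4]); each
    queue is a list of Diff-like dicts mapping (name, rrtype) to
    rdata.  The first match wins, and zonedb is only consulted if
    no intermediate Diff answers the question.
    """
    key = (name, rrtype)
    for q in queues:
        # Within a queue, scan the newest Diff first: the most
        # recent output of the current step is authoritative.
        for diff in reversed(q):
            if key in diff:
                return diff[key]
    return zonedb.get(key)
```

This mirrors the Q2-before-Q3-before-zoneDB precedence from the example above.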
Based on this view, it ought to be possible to infer a set of queries that can
be sent to the various blocks of the system to investigate how much work is
still in progress, how much is lying ahead (due to timer-based events) and at
what time each change should be triggered.

I expect a generic realtime scheduler to solve this problem for us, and that
we will only have to supply it with timing details and the proper set of
priorities: low priorities for user-initiated changes (IXFR from the master),
medium for emergency rollovers and medium to high for time-scheduled
operations. The realtime scheduler will be linked in with the processes that
run, as well as the ordering of queues.

Model: NSEC(3) and ordering
===========================

DNS data for signing comprises resource records as well as *holes* between
them. The holes need explicit handling because they will be signed just as
well as the resource records.

The need for multiple orders
----------------------------

It is not likely that we will want to create NSEC as well as NSEC3 for a zone.
It may happen, however, that we will want to create a new NSEC3 chain,
possibly due to a weakness in the hashing algorithm used. Note that this is
not likely to happen at all, since NSEC3 makes multiple passes with a single
hash, which is thought to be good. But if it ever happens, it is best to
resolve it without domain downtime, especially because such changes would
probably apply to many domains at the same time. Efficiency of NSEC3 chain
rollover is an important concern if it is indeed the case that multiple
domains will want to do the same rollover at the same time.

Orders and a few useful properties
----------------------------------

A *partial order* is a mathematical concept that is defined as a tuple::

    (S, R)

where ``S`` defines a set of values, or in DNS terms, the resource records;
and ``R`` defines relations between elements of ``S``.
The relation ``R`` can have a number of useful properties:

* ``R`` is *transitively closed* if for any ``(a,b)`` and ``(b,c)`` in ``R``,
  their composition ``(a,c)`` is also part of ``R``
* ``R`` is *complete* if for any pair of different values ``a`` and ``b`` in
  ``S``, either ``(a,b)`` or ``(b,a)`` is part of ``R``

If we use ``R`` to model the order in which NSEC or NSEC3 records are
ordered, we would assume it was transitively closed, at least conceptually. As
soon as ``R`` is a complete order over ``S``, we would be able to generate all
NSEC or NSEC3 records for the zone with records ``S``.

More practically though, we could also avoid any transitive relations in
``R``; that is, if ``(a,b)`` and ``(b,c)`` are part of ``R``, we would avoid
including ``(a,c)`` as well. This can be done if holes like ``(a,c)``, when
present in ``R``, are cut up into ``(a,b)`` and ``(b,c)`` upon the
introduction of ``b`` in ``S``. Similarly, when removing an element from
``S``, two holes in ``R`` must be composed into one hole.

In this practical way of implementing the NSEC or NSEC3 order, it would be
simple to conclude when the cycle is complete, namely when the number of
resource records in ``S`` is one more than the number of relations in ``R``.
This is only true if transitivity is always removed.

Complexity of creating an order
-------------------------------

In theory, it would be possible to introduce a new order for a set of resource
records by setting up a set of resource records with an empty relation. Then,
when resigning the zone, all resource records would be retracted and
re-inserted, an opportunity for cutting up the holes. However, this would lead
to an order created with the insertion sort algorithm, which is of complexity
O(N*N). More efficient algorithms exist, notably heapsort, which are of
complexity O(N*log(N)). Quicksort is not as good as heapsort, since it will
not always achieve O(N*log(N)).
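The one-shot O(N*log(N)) order creation and the later cut-and-join maintenance
of holes might be sketched as follows. Plain lexicographic comparison stands
in for the canonical NSEC(3) ordering, and all function names are hypothetical:

```python
def build_order(names):
    """Build the transitivity-free order for a zone in one
    O(N*log(N)) operation: sort once, then pair up neighbours.

    Returns (S, R) with len(S) == len(R) + 1, matching the
    completeness check described above.
    """
    S = sorted(set(names))      # one sorting pass over all names
    R = set(zip(S, S[1:]))      # successor holes only, no transitivity
    return S, R

def cut_hole(R, b):
    """Introduce name b into S: the hole (a, c) containing b is
    cut up into (a, b) and (b, c)."""
    for a, c in list(R):
        if a < b < c:
            R.discard((a, c))
            R.add((a, b))
            R.add((b, c))
            return
    raise ValueError("no hole in R contains %r" % b)

def join_holes(R, b):
    """Remove name b from S: its neighbouring holes (a, b) and
    (b, c) are composed into the single hole (a, c)."""
    a = [x for (x, y) in R if y == b][0]
    c = [y for (x, y) in R if x == b][0]
    R.discard((a, b))
    R.discard((b, c))
    R.add((a, c))
```

Incremental "Diff"-driven changes then touch only the holes around one name,
instead of re-sorting the whole zone.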
Since N is the size of the domain for which a new NSEC or NSEC3 cycle would be
generated, the choice of algorithm is a big concern for large zones, notably
TLDs. This means that it is probably best to create an order in one operation
covering all records found in a domain. Later, when records are individually
removed or added as a result of "Diff" records, small changes to the NSEC or
NSEC3 order can be made by way of cutting up and joining holes as described
above.

Description: Diagram blocks
===========================

Following is a description of each block in the diagram.

DNS from master
---------------

This is the interaction with the master that feeds OpenDNSSEC. It is most
likely based on NOTIFY and IXFR/AXFR interactions.

Check SOA
---------

For incoming IXFR, it is good to know that the order in which data comes in is
correct. If not, an AXFR should be requested instead and treated by the "diff"
process. A single IXFR may be split into multiple "Diff" records, each
covering an RRset. This means that the "Diff" records are nice and small.

serial/dom
----------

Stores the latest accepted SOA serial number per domain. This may come from
either IXFR or AXFR interactions with the master.

Note that SOA serial numbers need not be passed on in the rest of the
OpenDNSSEC flow of "Diff" events. The last stages will add a serial number to
match the version of the domain. The reason that incoming and outgoing serial
numbers are not coupled is that, between two consecutive incoming numbers,
OpenDNSSEC may have to insert a version of its own, for reasons like a key
rollover.

diff
----

This processing step groups an AXFR into RRsets and then asks what the
state-of-the-art DNS data for each of them is. This consults Q1, Q2, Q3, Q4
and zoneDB, in that order. The first match is reported back to the "diff"
process, to learn if it warrants sending a "Diff" record with the change or
not.
strip/okay
----------

This processing step could remove record types that would confuse the
following processes. SOA, NSEC3 and NSEC3PARAM are all strictly under the
control of the logic that follows, and should therefore be removed.

Aside from plain records, it is worth noting that several DNSSEC-related
records are welcome to pass through this processing step. These include DS,
DLV and DNSKEY -- the latter because a DNSKEY may have to be prepublished
before a transition to another hoster (and another OpenDNSSEC machine).

KASP's DNSKEY rollover timer
----------------------------

KASP will trigger a Timeout for a DNSKEY under a zone, and trigger its
replacement. It is a matter of taste whether the DNSKEY durability knowledge
is located in the KASP, or whether the KASP is only the source of Timeouts at
given moments. Since an HSM often contains a secure timer, it makes sense to
assign the responsibility of timekeeping to the KASP. Note however that simple
USB tokens do not support secure timing, and signed NTP may have to be relied
upon. Since NTP is not usually signed, this may involve setting up one's own
timing source as part of such a KASP.

key emerg
---------

When a key is compromised, an emergency process is started to replace a
DNSKEY under a zone. The "ZSK emerg" described below also triggers "key
emerg".

DNSKEY RR changes
-----------------

This creates "Diff" records that introduce and/or remove DNSKEY records. Note
that a "Diff" record is an atomic change, so it may be possible to substitute
records immediately. Most likely though, the KASP timer will have to signal
different kinds of timeouts, namely those for introduction and those for
removal of DNSKEY records.

This process also creates a "Diff" record for a SOA to indicate a wish to
publish a zone's intermediate state. To get this, it simply requests the
state-of-the-art SOA record and creates a "Diff" that both removes and
creates that same SOA. Serial numbers are a matter of later concern.
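The publish-wish trick of removing and re-creating the state-of-the-art SOA in
one atomic "Diff" is small enough to sketch directly; the helper names are
hypothetical, and a "Diff" is represented as a plain removals/additions pair:

```python
def soa_publish_diff(get_state_of_the_art_soa):
    """Build the atomic Diff that retracts and reinstates the
    current SOA, signalling a wish to publish intermediate state.

    The serial number is not touched here; serials are assigned
    later in the workflow, by the RRset signer.
    """
    soa_rr = get_state_of_the_art_soa()
    return {"removals": [soa_rr], "additions": [soa_rr]}
```

The same construction recurs in the "NSEC3PARAM generator" and "re-list all
RR" steps below.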
KASP pubkeys
------------

This database contains the public keys in use by the KASP. Since the KASP is
likely to create public keys (from its internally generated private keys) it
is probably a database that is being queried for the keys in use. In addition
to that, public keys (and their complementary private keys) can be removed by
the "DNSKEY RR changes" process.

KASP's NSEC3 rollover timer
---------------------------

Assuming that we will want to replace NSEC3PARAM records, and thus create a
new NSEC3 chain, a process similar to DNSKEY rollover can be initiated. The
"Diff" submitted will also proceed in two phases: first, a new NSEC3PARAM will
be added alongside the state-of-the-art NSEC3PARAM, and second, the older
NSEC3PARAM will be removed.

NSEC3 emerg
-----------

This covers the extremely unlikely event where an NSEC3 hashing algorithm
would cause an emergency.

NSEC3PARAM generator
--------------------

This processing step creates new NSEC3PARAM records and removes old ones after
their useful life. Note that NSEC3 and NSEC chains are created further down
the workflow.

This process also creates a "Diff" record for a SOA to indicate a wish to
publish a zone's intermediate state. To get this, it simply requests the
state-of-the-art SOA record and creates a "Diff" that both removes and
creates that same SOA. Serial numbers are a matter of later concern.

AND
---

This block causes an event after an "NSEC3 emerg" process has led to a
corresponding "Diff".

KASP's zone resign timer
------------------------

This timer fires when it is time to re-sign a zone. As with key rollover, it
is still to be decided whether the KASP knows when a zone must be signed
freshly, or whether the KASP only handles abstract timing requests as required
by the process steps in this workflow.

ZSK emerg
---------

In case of an emergency related to the zone signing key, the zone must be
freshly signed. Note that "key emerg" is also triggered and must be finished
before "ZSK emerg" is useful.
re-list all RR in small, atomic RRsets
--------------------------------------

This processing step walks through the state-of-the-art DNS records, and for
each RRset it finds it constructs a "Diff" which retracts it and, in the same
atomic "Diff", reconstructs it. This is intended to trigger the signing
processes downstream.

This process also creates a "Diff" record for a SOA to indicate a wish to
publish a zone's intermediate state. This may be prudent to do after partial
updates, so that other processes can intervene and the domain updates are
split into separate IXFR records. To get this, it requests the
state-of-the-art SOA record and creates a "Diff" that both removes and creates
that same SOA. Serial numbers are a matter of later concern.

name*order storage
------------------

This database stores the ``(S,R)`` relations of DNS names and the
corresponding order or orders that are currently in use. If multiple
NSEC/NSEC3 chains are being built or used, multiple name*order relations are
stored for a single zone.

add NSEC(3)
-----------

This processing step responds in a special way to NSEC3PARAM changes, and in a
general way to all others.

If an NSEC3PARAM is removed, the corresponding "name*order" is removed from
storage, and by way of newly generated "Diff" records, from the signed zone
data. If a new NSEC3PARAM is supplied, a new order is created in the storage
for "name*order". As part of the same transaction, all NSEC3 records are
supplied as "Diff" records on the process's output.

In general, when an RRset is changed by a "Diff" on the input (and that
includes NSEC3PARAM) the "name*order" storage and the NSEC and/or NSEC3
records are updated to match:

* If a record name is removed, the holes surrounding it are recombined.
* If a record name is inserted, the hole in which it falls is cut in two.
* If a new record type is added to or removed from a name that exists
  beforehand and afterwards, the set of RRtypes in the corresponding NSEC3
  record is altered and a new NSEC3 record created.

Note: Wildcards should also be created in this step.

KASP's privkey ops
------------------

This is an API to the part of the KASP that accesses private key operations.
It is especially used to sign RRsets. It remains to be determined where
knowledge about the private keys to use is stored; this may be part of the
KASP or part of the surrounding DNS workflow.

RRset signer
------------

This process step consumes a "Diff" record and signs all the changes contained
in it. This usually includes NSEC3 records as well as plain records. This step
also creates a new SOA serial number for the newly signed zone. It does this
whenever it comes across a SOA record, which it sees as an indication that the
zone should be published.

The simplicity of the RRset signer mainly stems from the preparation by the
previously added NSEC(3) records and from the separation of "Diff" messages
into atomic RRset changes. The relative independence of the RRsets to be
signed means that an optimisation for parallel signing could be set up. Note
that the best way of doing this is probably to delegate such concurrency to
the KASP's privkey ops, that is, the signer engine.

serial/dom
----------

This is a database similar to its same-named compadre in the input processing
stage. It stores the SOA serial values added to the output records.

zoneDB
------

This database stores the completely signed records for a zone. It supports a
number of operations, all of which are transactional:

* Collecting a new IXFR as a sequence of "Diff" records ending with a SOA
* Retrieving a zone's IXFR and AXFR data
* Replacing a zone's IXFR and AXFR data with fewer IXFRs and updated AXFR data

IXFR to AXFR smasher
--------------------

This process avoids an ever-growing "zoneDB".
It can employ a number of principles, possibly in combination, to achieve this
effect:

* When there is more IXFR data than AXFR data for a zone, any number of the
  oldest IXFRs can be combined with the AXFR
* When an IXFR is past its SOA-indicated lifetime, it can be combined with the
  AXFR
* When all slaves have retrieved the oldest IXFR, it can be combined with the
  AXFR to save storage

zone publish
------------

This I/O operation sends NOTIFY messages and responds to AXFR and IXFR
requests from a zone's slaves.

Scenarios
=========

This section shows a number of scenarios or use cases rolling through the
workflow. It ends by detailing the interactions between the DNS workflow
module and the KASP module.

**TODO**