
Critical dualValue bug? #543

Open

wzzhu opened this issue Feb 22, 2023 · 7 comments

@wzzhu
Contributor

wzzhu commented Feb 22, 2023

In func (instr *execOp) exec(m *tapeMachine) (err error) there is a block that handles binding the dualValue for a gradient node here:

// this is a gradient node then, we should also bind the value to the node's dualValue
	if m.bindDV() && node.derivOf != nil {

Why does it use the add operator to add the newly computed derivative node's value to the existing derivative part of the dual value, in this line:

add := newEBOByType(addOpType, TypeOf(dv.d), TypeOf(v))

Shouldn't the correct behavior be to simply set the derivative part of the dual value to the new derivative node's value (that is, to copy rather than to add)? See the sketch after the context below.

This is critical for training, as it causes incorrect results.

Thanks!

context:

	// this is a gradient node then, we should also bind the value to the node's dualValue
	if m.bindDV() && node.derivOf != nil {
		for _, src := range node.derivOf {
			if len(m.bindNodesDV) > 0 && !m.bindNodesDV.Contains(src) {
				continue
			}

			switch {
			case node.op == (Iop{}):
				// we'll need to put a closure into the closure queue
				closure := func() error {
					dv := dvUnit(src.boundTo)
					add := newEBOByType(addOpType, TypeOf(dv.d), TypeOf(v))
					if _, err := add.UnsafeDo(dv.d, v); err != nil {
						return err
					}
					return nil
				}
				m.closureQueue = append(m.closureQueue, closure)
			default:
				// TODO HERE
				/*
				   e.g. problem
				   z = y * (x + 1)

				   Here, 1 is a constant. But 1 comes early in the expression graph.
				   The final gradient is also 1, so 1 will also be the derivOf `z`
				   But because the graph is sorted, the 1 node will be walked before
				   the `z` node, and this part will cause a panic, as `z` will have no `Value`
				   associated with it yet.
				*/
				dv := dvUnit(src.boundTo)
				add := newEBOByType(addOpType, TypeOf(dv.d), TypeOf(v)) // <==== HERE!

				if d, err := add.UnsafeDo(dv.d, v); err == nil {
					dv.SetDeriv(d)
					src.bind(dv)
				} else {
					return err
				}
			}

		}

	}
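
To make the suggestion concrete, this is roughly what I mean by copying instead of adding (an illustrative sketch only, reusing the identifiers from the snippet above; whether a plain overwrite is safe here is exactly my question):

// hypothetical "copy" variant: set the derivative part of the dual value
// directly from the gradient node's freshly computed value, with no addOp
dv := dvUnit(src.boundTo)
dv.SetDeriv(v) // overwrite instead of accumulate
src.bind(dv)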

@wzzhu wzzhu changed the title dualValue bug? Critical dualValue bug? Feb 22, 2023
@chewxy
Member

chewxy commented Feb 22, 2023

Looking into it.

@wzzhu
Contributor Author

wzzhu commented Feb 23, 2023

Recently I read through the Gorgonia code for backpropagation (reverse-mode derivative calculation) and understood most of it. But I am not sure the dual value is actually needed for learnable nodes. Since the gradient nodes are allocated and computed along with the execution of the expression graph, we can always get the derivative from node.deriv.Value(), which avoids the possible inconsistency when a gradient node is rebound to another value. It could also save a lot of the code that only exists to keep node.deriv's bound value consistent with the derivative part of the dual value bound to the node.

Another comment about Gorgonia. As an active user, what I fear most for this framework is that it does incorrect computation. Such computation bugs can take many days to discover (I remember fixing a bug in Gorgonia's softmax for float64 where it took a few days to find out it was caused by a typo).

Thanks.

@chewxy
Member

chewxy commented Feb 27, 2023

It's taken me a few days to write this; I hope I don't come across as insulting. Please forgive me if I do, as that is not my intention.

The Maths

I want to consider a very simple maths expression: $z = a \times b + b \times c$

The expression graph looks like this:
[expression graph image: expr.dot]

Now, the backpropagation algorithm is basically doing a partial differentiation with regard to each of the inputs. So we would be finding $\frac{\partial z}{\partial a}$, $\frac{\partial z}{\partial b}$ and $\frac{\partial z}{\partial c}$.

Of interest is $\frac{\partial z}{\partial b}$:

$$\begin{align*} \frac{\partial z}{\partial b} &= \frac{\partial (ab + bc)}{\partial b}\\ &= \frac{\partial ab}{\partial b} + \frac{\partial bc}{\partial b}\\ &= a + c \end{align*}$$

In fact this pattern occurs across every expression. The rule of thumb is: in an expression graph like the one above, when computing the gradient of a node, sum all the gradients that flow into it (the red lines go to $b$).
[expression graph image, with the incoming gradient edges to $b$ highlighted in red]
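
A quick numerical sanity check of that rule, in plain Go (nothing Gorgonia-specific; the numbers are arbitrary):

package main

import "fmt"

func main() {
	a, b, c := 2.0, 3.0, 5.0

	// z = a*b + b*c; the analytic gradient dz/db is the sum of the
	// contributions from the two terms b appears in: a + c
	analytic := a + c

	// central finite-difference check of the same gradient
	const eps = 1e-6
	z := func(b float64) float64 { return a*b + b*c }
	numeric := (z(b+eps) - z(b-eps)) / (2 * eps)

	fmt.Println(analytic, numeric) // both ≈ 7
}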

The Implementation

Now let's consider the two lines here:

dv := dvUnit(src.boundTo)
add := newEBOByType(addOpType, TypeOf(dv.d), TypeOf(v))  

The first line creates a *dualValue if src.boundTo isn't already one. When a *dualValue is created, its gradient tensor is allocated with all values set to 0.

We then add each incoming gradient to that gradient tensor. This works because $0+x = x$.
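
In other words, something like this toy illustration of the zero-init-then-accumulate pattern (not the actual Gorgonia code):

// the dual value's gradient starts at zero (dvUnit), and every incoming
// gradient is added into it, so a node with several consumers ends up
// with the sum of their contributions
grad := make([]float64, 4) // zero-initialised, like dv.d
incoming := [][]float64{
	{1, 1, 1, 1}, // gradient flowing in from one consumer
	{2, 2, 2, 2}, // gradient flowing in from another consumer
}
for _, g := range incoming {
	for i := range grad {
		grad[i] += g[i] // the addOp in the snippet above
	}
}
// grad is now [3 3 3 3]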

Re: Comment

I completely agree. The v0.10.0 branch has quite a few changes that do away with a lot of this.

@wzzhu
Contributor Author

wzzhu commented Feb 27, 2023

Thanks for taking the time to write this up. There is no problem with being direct in a technical discussion. Actually it is GREAT!

For the differentiation part, it should indeed use add, and that is also explained in this excellent article on reverse-mode derivative calculation: https://colah.github.io/posts/2015-08-Backprop/ (I saw this link on your blog and read it before I tried to understand how the Gorgonia code works).

But the problem I raised is not about this process. It is about keeping the dualValue synchronized with the derivative node's updated value. If I simply disable dual values by not calling BindDualValues on the learnable nodes, the gradients calculated are correct.

The following explains the bug with an example.

For example, assume a learnable node A whose derivative node is node B, and A is bound to a dual value named dual_value.

Initially, A.dual_value.d = gradient node B's value.

But after some updates, node B is rebound to another value when the tape machine executes the computation graph. At this point, it tries to synchronize B's new bound value back into A.dual_value.d, but it uses add to do so. That produces the error.
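
With made-up numbers, the failure mode I am describing looks like this (illustration only, not Gorgonia code):

dvD := 1.5  // A.dual_value.d after the first sync (it already equals B's old value)
bNew := 2.0 // B's newly bound value after the graph is run again
dvD += bNew // what the addOp does during the sync
// dvD is now 3.5, but the gradient that should end up in A.dual_value.d is 2.0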

Please let me know if my explanation is not clear or my understanding of that part of the code is incorrect. Feel free to ping me on Twitter or elsewhere if more rounds of clarification are needed.

@dcu
Collaborator

dcu commented Mar 20, 2023

Any update on this? The gradients for my TabNet implementation get weird results after a few epochs (it stops learning), and it might be related to this.

@wzzhu
Contributor Author

wzzhu commented Mar 21, 2023

Let me share some of my understanding after reading and tracing the code for this part.

A dual value is defined as

type dualValue struct {
	Value
	d Value // the derivative wrt to each input
}

A dual value will be bound to a learnable node (weight or bias nodes) if BindDualValues is called on the learnable nodes. The VM then sets the bindDV flag and uses dual values when executing the compiled VM instructions.

A derivative node is created inside the computation graph for each node in the learnable set when the user calls gorgonia.Grad(loss, learnables). It is used to compute the gradient of a learnable node with respect to the loss node. (There are other gradient nodes as well, such as the constant gradient node (gradient = 1) that serves as the starting point of the backpropagation algorithm, but they are not related to dual values.) Each learnable node has exactly one corresponding derivative node.

If the TapeVM has the dual value flag (bindDV) set for learnable nodes in the expression graph, it will use the dual value to keep gradients on the learnable node. The dual value is allocated from a memory pool if the node's bound value is not a dual value yet. The d field of the dual value accumulates the gradient node's value computed in each run of the VM. That is why the following code uses the add operator.

dv := dvUnit(src.boundTo)
add := newEBOByType(addOpType, TypeOf(dv.d), TypeOf(v)) // <==== HERE!

if d, err := add.UnsafeDo(dv.d, v); err == nil {
	dv.SetDeriv(d)
	src.bind(dv)
} else {
	return err
}

To help with debugging, we can use gorgonia.WithGrad to manage the storage backing the dual value's gradient field, so that the gradients can easily be printed from normal program code.
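
For example, something along these lines (a rough sketch; the exact WithGrad signature and the option combination are my assumption and should be checked against the source):

// hypothetical usage: back w's gradient with a tensor our code owns,
// so it can be printed after each run of the VM
g := gorgonia.NewGraph()
wVal := tensor.New(tensor.WithShape(2, 2), tensor.Of(tensor.Float64))
wGrad := tensor.New(tensor.WithShape(2, 2), tensor.Of(tensor.Float64))
w := gorgonia.NewMatrix(g, gorgonia.Float64,
	gorgonia.WithShape(2, 2), gorgonia.WithName("w"),
	gorgonia.WithValue(wVal),
	gorgonia.WithGrad(wGrad), // gradient storage managed by our code
)
_ = w
// ... build the rest of the graph and run the tape machine ...
fmt.Println(wGrad) // the accumulated gradient for w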

One thing to note: if RunAll() is called multiple times before solver.Step, the gradients will accumulate, due to the addOp above. This is expected, and it is useful for training with weighted gradients in the policy-gradient case. solver.Step will zero the gradient values; I checked each of the solvers and they all do this.
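
For instance, the accumulate-then-step flow looks roughly like this (an untested sketch using the public API as I understand it; treat the details as assumptions):

package main

import (
	"log"

	"gorgonia.org/gorgonia"
)

func main() {
	g := gorgonia.NewGraph()
	w := gorgonia.NewMatrix(g, gorgonia.Float64, gorgonia.WithShape(2, 2),
		gorgonia.WithName("w"), gorgonia.WithInit(gorgonia.GlorotU(1)))
	x := gorgonia.NewMatrix(g, gorgonia.Float64, gorgonia.WithShape(2, 2),
		gorgonia.WithName("x"), gorgonia.WithInit(gorgonia.Ones()))
	loss := gorgonia.Must(gorgonia.Sum(gorgonia.Must(gorgonia.Mul(w, x))))

	if _, err := gorgonia.Grad(loss, w); err != nil {
		log.Fatal(err)
	}

	// bind dual values so the learnable node w carries its own gradient
	vm := gorgonia.NewTapeMachine(g, gorgonia.BindDualValues(w))
	defer vm.Close()

	solver := gorgonia.NewVanillaSolver(gorgonia.WithLearnRate(0.1))

	// two forward+backward passes before a single Step:
	// the gradient in w's dual value is the sum of both runs
	for i := 0; i < 2; i++ {
		if err := vm.RunAll(); err != nil {
			log.Fatal(err)
		}
		vm.Reset()
	}

	// Step consumes (and then zeroes) the accumulated gradient
	if err := solver.Step(gorgonia.NodesToValueGrads(gorgonia.Nodes{w})); err != nil {
		log.Fatal(err)
	}
}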

But it is strange that sometimes the gradient nodes bind to different values during a run. I haven't been able to reproduce this in simple test cases. When it happens, the dual value's gradient becomes the sum of the gradient node's successive bound values, which is invalid.

@chewxy
Member

chewxy commented Apr 10, 2023

Hi, I know it seems like I've not moved much on this... I've been working on a generics version of the tensor package and I think it might solve all these weird memory issues. (The TL;DR is that when memory is shared, weird things like this can happen.)
