Next: 4 Rank-based monomial representation Up: Monomial Representations for Gröbner Computations Previous: 2 monomial representations

# 3 Vectorized monomial operations

The main idea behind what we call vectorized monomial operations is the following: provided that the size of the machine word is a multiple (say, m) of the size of one exponent, we perform monomial operations on machine words, instead of directly on exponents. By doing so, we process a vector of m exponents with word operations only, thereby reducing the length of the inner loops of the monomial operations and avoiding non-aligned accesses of exponents.

The PoSSo library [14] already uses parts of this idea to speed up monomial additions, assignments, and equality tests, but has not (yet?) carried these ideas over to the much more time-critical monomial comparisons and divisibility tests.

The details of this technique are based on the following lemma whose proof is straightforward and, therefore, omitted:

Lemma 1: Let se, m ∈ N and set sw = m·se. Furthermore, let a_0, ..., a_(m-1) ∈ N and b_0, ..., b_(m-1) ∈ N with 0 ≤ a_i, b_i < 2^(se-1) for all 0 ≤ i < m, such that

a = a_0 + a_1·2^se + ... + a_(m-1)·2^((m-1)·se)  and  b = b_0 + b_1·2^se + ... + b_(m-1)·2^((m-1)·se).

Then the following holds:

(1) a = b  ⟺  a_i = b_i for all 0 ≤ i < m
(2) a > b  ⟺  (a_(m-1), ..., a_0) is lexicographically greater than (b_(m-1), ..., b_0)
(3) (a - b) & M = 0  ⟺  a_i ≥ b_i for all 0 ≤ i < m, where a - b is computed in 2's complement arithmetic on words of size sw, & denotes bitwise-and, and M = 2^(se-1) + 2^(2·se-1) + ... + 2^(m·se-1) is the mask whose set bits are exactly the most significant bit of each exponent field
(4) a + b = (a_0 + b_0) + (a_1 + b_1)·2^se + ... + (a_(m-1) + b_(m-1))·2^((m-1)·se)

Lemma 1 shows how the monomial operations on the right-hand sides of (1)-(4) can be "vectorized": on a 2's complement machine with word-size sw, the checks (resp. operations) on the left-hand sides of (1)-(4) can be performed in at most three single machine instructions (see the source code examples below for further details).

However, before Lemma 1 can be applied in an actual implementation, some technical difficulties have to be overcome:

Firstly, stricter bounds on the values of the single exponents have to be assured (i.e., the exponent values need to be less than 2^(se-1) instead of 2^se).

Secondly, the condition sw = m·se implies that the total length of the exponent vector has to be a multiple of the word-size, so "unnecessary" exponents (whose value is set to and always kept at 0) might have to be added to the exponent vector.

Thirdly, and most importantly, the order and arrangement of the exponents within the exponent vector have to be adjusted, depending on the monomial ordering and on the endianness of the machine used. On big-endian machines, the order of the exponents has to be reversed for reverse lexicographical orderings, whereas on little-endian machines, the order of the exponents has to be reversed for lexicographical orderings. In practice, this fact can be hidden behind appropriate macro (or inline) definitions for accessing single exponents. In our implementation, we used a global variable called pVarOffSet and implemented the exponent access macro as follows:

#define pGetExp(p, i)   \
p->exp[(pVarOffSet ? pVarOffSet - i : i)]


Provided that n is the number of variables of the current ring, we set the value of pVarOffSet as follows:

| type of ordering        | big-endian machine | little-endian machine |
|-------------------------|--------------------|-----------------------|
| lexicographical         | 0                  | n-1                   |
| reverse lexicographical | n-1                | 0                     |

Figure 1: Vectorized monomial operations in SINGULAR

Some source code fragments can probably explain it best: Figure 1 shows (somewhat simplified) versions of our implementation of the vectorized monomial operations. Some explanatory remarks are in order:

n_w is a global variable denoting the length of the exponent vectors in machine words (i.e., if sw is the size of a machine word, se the size of an exponent, and n the number of variables, then n_w = ⌈(n·se)/sw⌉).

pLexSgn and pOrdSgn (used in MonComp) are global variables which are used to appropriately manipulate the return value of the comparison routine and whose values are set as follows:

|         | lp | ls | Dp | Ds | dp | ds |
|---------|----|----|----|----|----|----|
| pOrdSgn | 1  | -1 | 1  | -1 | 1  | -1 |
| pLexSgn | 1  | -1 | 1  | 1  | -1 | -1 |

Notice that MonAdd works on three monomials and is most often used as a "hidden" initializer (or assignment), since monomial additions are the natural "source" of most new monomials.

Our actual implementation contains various optimizations which are tedious to describe but more or less obvious (like loop unrolling, use of pointer arithmetic, replacement of multiplications by bit operations, etc.). Furthermore, we apply the idea of "vectorized operations" to monomial assignments and equality tests, too. However, we shall not describe the details of these routines here, since their implementation is more or less obvious and they have less of an impact on the running time than the basic monomial operations.

Table 1: Detailed timings for vectorized monomial operations: SINGULAR compiled with gcc version 2.7.2 for a Pentium Pro 200 (32-bit, little-endian) running Linux, and averages of timings for an HP C160 (32-bit, big-endian) running HP-UX 10.20 and a DEC Alpha (64-bit, little-endian) running Linux.
| Example        | tn,s   | %mon | %comp | %div | %add | tn,s/ts | tn,c/tc | ts/tc | t1.0/tn,s | t1.0/tc |
|----------------|--------|------|-------|------|------|---------|---------|-------|-----------|---------|
| ecyclic 6      | 3.9    | 79.4 | 68.9  | 1.6  | 8.8  | 1.7     | 2.0     | 1.3   | 1.3       | 2.8     |
| homog cyclic 7 | 125.8  | 72.5 | 33.6  | 11.8 | 27.1 | 1.5     | 1.9     | 1.3   | 1.3       | 2.6     |
| homog cyclic 6 | 0.9    | 69.5 | 30.5  | 13.3 | 25.8 | 1.6     | 1.9     | 1.3   | 1.3       | 2.6     |
| katsura 7      | 8.8    | 71.5 | 41.5  | 3.7  | 26.3 | 1.6     | 1.9     | 1.2   | 1.3       | 2.6     |
| rcyclic 15     | 53.2   | 76.2 | 29.6  | 26.0 | 20.6 | 1.6     | 2.0     | 1.4   | 1.2       | 2.5     |
| rcyclic 14     | 22.9   | 76.0 | 28.1  | 28.5 | 19.4 | 1.5     | 1.9     | 1.3   | 1.2       | 2.5     |
| rcyclic 16     | 106.4  | 78.2 | 26.5  | 30.7 | 21.0 | 1.6     | 2.0     | 1.4   | 1.2       | 2.5     |
| rcyclic 19     | 740.0  | 80.3 | 29.5  | 29.9 | 20.9 | 1.5     | 1.9     | 1.4   | 1.2       | 2.5     |
| rcyclic 18     | 396.5  | 80.0 | 29.4  | 29.2 | 21.4 | 1.5     | 1.9     | 1.4   | 1.2       | 2.4     |
| rcyclic 13     | 10.2   | 74.8 | 27.6  | 26.6 | 20.7 | 1.5     | 1.8     | 1.3   | 1.2       | 2.4     |
| rcyclic 12     | 4.3    | 74.7 | 28.4  | 29.8 | 16.5 | 1.6     | 1.9     | 1.2   | 1.2       | 2.4     |
| rcyclic 11     | 1.6    | 67.4 | 22.9  | 24.7 | 19.8 | 1.6     | 1.8     | 1.2   | 1.2       | 2.4     |
| bjork 8        | 4.0    | 67.8 | 32.0  | 13.2 | 22.5 | 1.5     | 1.8     | 1.2   | 1.3       | 2.3     |
| rcyclic 17     | 210.9  | 78.2 | 26.5  | 30.3 | 21.4 | 1.5     | 1.8     | 1.3   | 1.2       | 2.3     |
| gerhard 1      | 1.7    | 68.7 | 48.1  | 4.5  | 16.0 | 1.3     | 1.5     | 1.2   | 1.5       | 2.3     |
| katsura 8      | 94.3   | 73.1 | 45.1  | 3.5  | 24.5 | 1.5     | 1.7     | 1.2   | 1.3       | 2.3     |
| homog 2mat3    | 126.0  | 81.9 | 77.5  | 1.9  | 2.5  | 1.3     | 1.5     | 1.4   | 1.2       | 2.2     |
| cyclic 7       | 190.0  | 66.9 | 31.0  | 11.2 | 24.7 | 1.4     | 1.8     | 1.3   | 1.2       | 2.2     |
| rcyclic 10     | 0.5    | 53.7 | 16.2  | 20.0 | 17.5 | 1.6     | 1.7     | 1.1   | 1.2       | 2.2     |
| cyclic 6       | 0.7    | 61.6 | 26.8  | 16.3 | 18.6 | 1.4     | 1.7     | 1.2   | 1.2       | 2.1     |
| homog gonnet   | 125.4  | 76.8 | 43.0  | 31.5 | 2.2  | 1.5     | 1.7     | 1.2   | 1.1       | 2.1     |
| gerhard 2      | 13.8   | 72.3 | 54.8  | 3.2  | 14.3 | 1.3     | 1.4     | 1.1   | 1.4       | 2.1     |
| homog alex 2   | 12.9   | 71.7 | 58.2  | 4.8  | 8.8  | 1.3     | 1.4     | 1.1   | 1.4       | 2.0     |
| schwarz 11     | 131.6  | 65.4 | 30.6  | 18.4 | 16.5 | 1.3     | 1.6     | 1.3   | 1.2       | 2.0     |
| 2mat3          | 123.1  | 80.2 | 75.7  | 1.9  | 2.6  | 1.2     | 1.4     | 1.4   | 1.2       | 2.0     |
| schwarz 10     | 21.9   | 62.5 | 28.2  | 18.4 | 15.9 | 1.3     | 1.5     | 1.2   | 1.2       | 2.0     |
| cohn2          | 9.7    | 64.4 | 29.7  | 4.1  | 30.6 | 1.4     | 1.5     | 1.1   | 1.2       | 1.9     |
| gerhard 3      | 24.4   | 67.1 | 45.5  | 13.1 | 8.6  | 1.3     | 1.3     | 1.1   | 1.3       | 1.8     |
| symmetric 6    | 60.6   | 68.0 | 25.6  | 14.5 | 28.0 | 1.4     | 1.4     | 1.1   | 1.2       | 1.8     |
| schwarz 9      | 3.3    | 55.5 | 23.0  | 20.0 | 12.5 | 1.3     | 1.4     | 1.1   | 1.2       | 1.8     |
| alex 2         | 7.1    | 63.6 | 52.1  | 3.3  | 8.2  | 1.3     | 1.4     | 1.1   | 1.2       | 1.8     |
| schwarz 8      | 0.6    | 50.6 | 24.7  | 14.6 | 11.2 | 1.3     | 1.5     | 1.1   | 1.1       | 1.7     |
| ecyclic 7      | 1076.4 | 86.4 | 84.2  | 0.3  | 2.0  | 1.2     | 1.4     | 1.2   | 1.1       | 1.6     |
| alex 3         | 2.0    | 65.9 | 53.5  | 7.7  | 4.7  | 1.3     | 1.3     | 1.1   | 1.2       | 1.6     |
| homog alex 3   | 2.0    | 66.2 | 47.2  | 14.8 | 4.1  | 1.2     | 1.2     | 1.1   | 1.3       | 1.6     |
| gonnet         | 1.2    | 43.8 | 6.2   | 34.2 | 3.4  | 1.1     | 1.3     | 1.2   | 1.1       | 1.5     |

Averages:

| Machine     | tn,s  | %mon | %comp | %div | %add | tn,s/ts | tn,c/tc | ts/tc | t1.0/tn,s | t1.0/tc |
|-------------|-------|------|-------|------|------|---------|---------|-------|-----------|---------|
| Pentium Pro | 103.3 | 69.8 | 38.4  | 15.6 | 15.8 | 1.4     | 1.6     | 1.2   | 1.2       | 2.2     |
| HP C160     | 125.4 | 69.0 | 41.7  | 17.9 | 9.5  | 1.2     | 1.5     | 1.3   | 1.8       | 2.9     |
| DEC Alpha   | 221.4 | 69.8 | 37.7  | 16.1 | 16.0 | 1.7     | 1.7     | 1.2   | 1.1       | 2.3     |

So much for the theory; now let us look at some actual timings: Table 1 shows various timings illustrating the effects of the vectorized monomial operations described in this section. In the first column, we list the examples used -- details about them can be found in the Appendix. All GB computations were done using the degree reverse lexicographical ordering (dp) over the coefficient field Z/32003. We measured the following times (all in seconds):

t1.0
Running time of SINGULAR version 1.0 which uses the straightforward implementation of monomial operations.
tn,s, tn,c
Running time of SINGULAR without vectorized monomial operations and exponents of type short (for tn,s) and char (for tn,c).
ts, tc
Running time of SINGULAR with vectorized monomial operations and exponents of type short (for ts) and char (for tc).
%mon, %comp, %div, %add
Percentage of the running time (of the tn,s runs) spent in basic monomial operations, monomial comparisons, divisibility tests, and additions, respectively (i.e., %mon = %comp + %div + %add).
Before evaluating these numbers, we should like to mention that running the same tests on different machines and/or with different simple monomial orderings leads to a similar picture (see also Appendix B for more timings).

Now, what do these numbers tell us?

Firstly, they support our assertion that for GB computations over finite fields of small characteristic, the running time is largely dominated by basic monomial operations (see column %mon). However, in which of the three basic monomial operations the most time is spent varies considerably from example to example (compare, e.g., line "gonnet" with line "ecyclic 6").

Secondly, and most importantly: the impact of the vectorized monomial operations is quite substantial (see columns tn,s/ts and tn,c/tc, which show the speedup gained by vectorized operations). As expected, the larger the ratio m = sw/se (i.e., the number of exponents packed in one machine word), the more speedup is usually gained (compare columns tn,s/ts and tn,c/tc). However, notice that there is no direct correlation between the percentage of time spent in monomial operations and the efficiency gains of vectorized operations. This is due to the fact that the number of inner loop iterations (and, hence, the reduction in inner loop iterations) for comparisons and divisibility tests is not constant, but depends on the input monomials.

Thirdly: As we would expect, the more exponents are encoded into one machine word, the faster the GB computation is accomplished (see the ts/tc column). This has two main reasons: first, more exponents are handled by one machine operation; and second, less memory is used and, therefore, the memory performance increases (e.g., the number of cache misses is reduced). However, we also need to keep in mind that the more exponents are encoded into one word, the smaller the upper bound on the value of a single exponent becomes. This is especially crucial for computations with char exponents, since these require that an exponent may not be larger than 127. Ideally, we would like to "dynamically" switch from one exponent size to the next larger one whenever it becomes necessary. SINGULAR cannot do this yet, but we intend to implement this feature in one of the next major upgrades.

Fourthly: We would like to point out that the only difference between the SINGULAR versions resulting in t1.0 and tn,s is that in the former, monomial comparisons for the different orderings are not reduced to one inlined routine but are instead realized by function pointers (and, therefore, each monomial comparison requires a function call). Hence, column t1.0/tn,s shows that the inlined realization of monomial comparisons alone already results in a considerable efficiency gain, which by far compensates for the additional overhead incurred by the extra indirection for accessing a single exponent value (i.e., the overhead incurred by the pGetExp macro shown above).

Last, but not least, the combination of all these improvements in our newest and fastest version of SINGULAR leads to a rather significant speedup (see column t1.0/tc) over the "old" SINGULAR version 1.0.
