Principles of the generation of isomers

Principles of the Generation of Constitutional and Configurational Isomers

T. Wieland, A. Kerber, R. Laue

Department of Mathematics, University of Bayreuth, D-95440 Bayreuth, Germany

E-mail: wieland@btm2d1.mat.uni-bayreuth.de

Abstract

In the automatization of molecular structure elucidation, generators for constitutional and configurational isomers are indispensable tools. Sophisticated computer programs of this kind are based on a mathematical theory of the construction of discrete structures. First, we will describe the basic principles for the generation of constitutional isomers, starting with the notions of group actions and ending up with recent improvements with respect to hybridization constraints. The second part is devoted to the generation of all configurational isomers to a given structural isomer. The underlying mathematical concepts as well as the techniques for computing spatial realizations are discussed. The outlined theory is implemented in the program system MOLGEN, which provides a fast and efficient tool for structure elucidation, running under DOS, MS-Windows, OS/2 and SUN-Solaris.

1 INTRODUCTION

In the past two decades, molecular structure elucidation was propeled by the development of computer programs for the automatic construction of isomers. Three types of approach can be distinguished (for details see a review [1,2]): The first, in essence, seeks to assemble structures from standard fragments [3-5]. The next utilizes a precise mathematical representation to reduce the problem to a (possibly rather large) catalog of substructures and computes all irredundant combinations [6-10]. The last Ansatz does not rely on any catalog, but calculates the compounds as connectivity matrices by orderly generation [11-13].

A common maxim of all approaches is that to be useful in everyday's work in a laboratory, a structure generator must fulfil at least two requirements: a) It must be exhaustive, i.e. it must generate all isomers corresponding to the given data. b) It must be irredundant, i.e. no isomers should occur twice. In order to meet these requirements, sophisticated algorithms based on mathematical theory must be applied. The efficiency of the programs depends on how the problem of redundancy is treated; algorithms using algebraic techniques generally excel over pure combinatorial methods.

For the algorithmic treatment, chemical compounds must be modeled on a certain level of abstraction. The following levels are commonly used:

The stoichiometric level: At this level only the empirical formula of a compound is considered, e.g. C4H6O6.
The topological level: Here connectivity, i.e. bonds and bond types, are considered additionally. Constitutional isomers are distinguished, e.g. there are 20,500 isomers of C4H6O6. Some of them are depicted in Figure 1.
The geometric level: The decisive notion here is configuration, usually caused by asymmetric atoms or double bonds with differing ligands. Two molecules are regarded as equal, iff they can be superimposed by rotation of one entire molecule or around single acyclic bonds. Isomers can be represented for example by configuration lists, Fisher-projections or three-dimensional drawings. Figure 2 shows the three configurational isomers of tartaric acid (brutto formula C4H6O6):
Figure 2. Configurational isomers of tartaric acid.
The electro-mechanic level: At this level, intramolecular movements are also taken into account. The distinction is made between conformational isomers. A further refinement in rigid and freely rotatable molecules is possible.

Further levels might be considered, but in many cases, the above abstractions are sufficient in structure elucidation problems. However, in the interpretation of results of mathematical chemistry the underlying model must always be kept in mind. For it is only in the scope of this model that mathematics can provide conclusions.

2 GROUP ACTIONS

Graphs and multigraphs must be labeled to be stored in a computer's memory. But the labeling, i.e. the numbering of the vertices, is not relevant for molecules. So methods for classifying all representations of graphs must be applied. Hence a key notion for structure generation is that of group action.

Let G denote a finite group and X a finite set. An action of G on X is described by a mapping

G x X to X: (g,x) mapsto gx, such that g(g'x)=(gg')x, and 1x=x.

The action defines substructures on G and X:

G(x) := {gx | g in G}, the orbit of x
G\\X := { G(x) | x in X}, the set of all orbits
T(G\\X), a transversal of the orbits, where X is the union over all (G\\X) G(t) with t in T and G(t1) and G(t2) have an empty intersection for different t1 and t2.

Let us further define the set of mappings Y^X := {f | f: X to Y}. If G acts on X and H acts on Y, then we get the following actions:

Y^X: (g,f) mapsto fo g^-1
(H^X x G) x Y^X to Y^X: ((ψ ;g),f) mapsto f', with f'(x):= ψ(x) f(g^-1x), the wreath product [14, 15].

The orbits of these actions are called symmetry classes of mappings.

A labeled m-graph with p vertices is an element f from X to {0,1,...,m} with X:=p^[2]:= {{i,k}| 1 lower or equal i<k lower or equal p}, where f({i,k})=j means that the pair {i,k} of vertices connected by an edge of degree j; if j=0, the vertices are not connected. To derive all (unlabeled) multigraphs we have to compute the orbits of the action of the symmetric group Sp on X to {0,1,...,m}. Early algorithms tried to determine the orbits by comparing each potential representative to all previously found. The expenditure on such computations, however, is enormous. A better way is to calculate canonical representatives only, e.g. representatives that are maximal within their orbits.

By writing f in X to {0,1,...,m} as a vector f=/f(x1),f(x2),.../ we can formulate the basic algorithm as follows:

Start with the empty set T.

Compute in a lexicographically descending way all vectors f in X to {0,1,...,m} and examine each f:

If there is a π in Sp with f > πf, then f is not a canonical orbit representative.
Else save f by adding it to T.

Using this algorithm, certain problems have to be overcome:

In the worst case, all elements of Sp must be constructed and applied to f. So sophisticated methods for a fast and efficient test have been developed [11, 16] using, e.g., stabilizers and partitioning.
A lot of not canonical vectors are tested. Thus rules for skipping some candidates have been put forward [11, 16], for example orderly generation[12, 13] and learning effects.

3 STRUCTURE GENERATION IN MOLGEN

The structure generator incorporated in the program system 10, 17, 18, 19, 20, 21, 22] makes intensive use of the principles described above.

The data contains two elements:

an H-vector, coding the number of hydrogens connected to each non-hydrogen atom.
an adjacency matrix, coding the bonds between all non-hydrogen atoms.

The actual generation process is performed as follows:

All (canonical) H-vectors are generated, corresponding to the given molecular formula.
For each H-vector all canonical adjacency matrices are computed.

In both steps the techniques of canonical transversal construction, fast maximality tests and orderly generation as described above are used.

This approach bears several advantages:

The maximal bond degree can be easily restricted. Moreover, forbidden substructures can be incorporated in the generation process.
If the distribution of hydrogens is known a priori, this information can be used to leave out the first step and construct all corresponding adjacency matrices only. In this way, the number of solutions may be reduced drastically.
By further refining |11] this technique, bond multiplicities of the atoms can be prescribed, e.g. carbon hydridizations as known from 13C-NMR experiments.

An example of the reduction in the number of possible isomers is shown in Table 1:

Restriction Number of isomers

- 2,558,517

max. double bonds 2,434,123

no O-O 360,594

H-distribution
(2,1,1,0,0,0,1,1,1,1,0,0) 35,058

Multiplicities
(Csp3,Csp3,Csp3,Csp2,Csp2,Csp2,
Osp3,Osp3,Osp3,Osp3,Osp3,Osp2) 990

Restriction	Number of isomers
-	2,558,517
max. double bonds	2,434,123
no O-O	360,594
H-distribution (2,1,1,0,0,0,1,1,1,1,0,0)	35,058
Multiplicities (Csp3,Csp3,Csp3,Csp2,Csp2,Csp2, Osp3,Osp3,Osp3,Osp3,Osp3,Osp2)	990

(The current version of the user interface in MOLGEN (vide infra) does not allow the input of hydridizations. This is to be incorporated together with further developments of the generator [23].)

With this algorithm the total variety of structures corresponding to the given data can be computed - regardless as to whether they can be synthesized or even exist in reality. As opinions of what is stable and what is not are subject to changes over the years, the check on plausibility is left to the user. We include, however, a list of rather unlikely substructures (like a 4-ring with an sp-hybridized carbon or a 3-ring with three sp2-hybridized carbons) to cull isomers thought to be unrealistic. Since this list is composed from data files, it can freely be altered according to the individual needs.

4 SUBSTRUCTURE RESTRICTIONS

The most effective type of substructures that can be prescribed in MOLGEN are macroatoms. These are non-overlapping fragments which are treated as single atoms during the generation. Afterwards, they must be brought to their actual size. This raises the problem of calculating all essentially different connections of the macroatom with the environment in the molecule. This question is solved by applying a double coset algorithm [24]. On the one hand, the fragment has an internal topological symmetry, expressed by a permutation group A, and on the other hand, the bonds which lead to the fragment from the environment, can have an equivalency, expressed by a permutation group B. Then a transversal of the double cosets A\Sp/B gives all essentially different connections.

If we take dioxin C12O2H4Cl4, for example, with the skeleton

we have to determine all different bijections {1,2,3,4,5,6,7,8} to {H,H,H,H,Cl,Cl,Cl,Cl}. On the skeleton, the group A isomorphic to C2 x C2 is acting, and on the substituents we have B isomorphic to S4 x S4. This leads to 22 double coset representatives, hence 22 isomers.

5 FINDING CONFIGURATIONAL ISOMERS

Most structure generation programs are only capable of constructions at the second level of abstraction (cf. 1), which also means that spatial properties are almost entirely neglected. But the spatial arrangements of the single atoms determine some important physio-chemical properties of the molecule. It is therefore desirable to have a generator at hand that is working at the third level.

A number of approaches [25-28] have been reported, which are all more or less effective. What we want to discuss here, are the mathematical foundations.

The generation of configurational isomers after the algorithm by Nourse et al. [25] is performed in three steps:

Find all potential stereocenters in the given constitutional isomer.
Construct the configuration symmetry group.
Verify the potential centers and construct the isomers as (0,1)-sequences.

Potential stereocenters are atoms of valence 3 or 4 with one hydrogen at the most, which are not part of an aromatic system, not triply bonded, and not part of cumulenes with H2-ends. To include double bonds in this consideration, the relevant bonds are converted to single bonds with fictitious bivalent nodes.

6 THE CONFIGURATION SYMMETRY GROUP

The configuration symmetry group represents the topological symmetry of the graphs with special respect to the stereocenters. We therefore again regard the molecular structure as a labeled m-graph g in X to {0,1,...,m} with X:=p^[2]. In a p-atomic molecule with n different elements, we call b in {1,...,p} to {1,..,n} a labeling of g. Then we can define the automorphism group of the molecular graph as A(g,b) := { f in Sp | fg=g and fb =b}. For our purposes, we only consider the permutation of the stereocenters [28,30].

The automorphism group is constructed in our algorithm in two steps: First the vertices of the graph are partitioned such that there cannot be automorphisms that map vertices from different partitions onto one another. Next all possible permutations within the partitions are checked as to whether they are also automorphisms of the graph. The result is stored as a SIMS chain [29].

To be able to write the ligands of a center as (ordered) tuples, the environment of the center is considered as locally placed in space. Although this is easily incorporated in the algorithm via the initial numbering of the vertices, it is a useful method for interpreting changes of the order of the ligands as changes of the configuration (depending on the sign). As an example we consider the following structure:

III
Its CSG is shown in Table 2.

Table 2. The CSG of the structure III, where e:=(0)(1) and i:=(01).

(e,e,e,e,e,e,e,e); 1

(e,e,i,e,e,i,e,e); 1

(e,i,e,e,e,e,e,i); 1

(e,i,i,e,e,i,e,i); 1

(i,e,e,i,e,e,e,e); (12)(57)

(i,i,e,i,e,e,e,i); (12)(57)

(i,e,i,i,e,i,e,e); (12)(57)

(i,i,i,i,e,i,e,i); (12)(57)

(e,i,i,e,e,e,e,e); (03)(46)

(e,i,e,e,e,i,e,e); (03)(46)

(e,e,i,e,e,e,e,i); (03)(46)

(e,e,e,e,e,i,e,i); (03)(46)

(i,i,i,i,e,e,e,e); (03)(12)(46)(57)

(i,e,i,i,e,e,e,i); (03)(12)(46)(57)

(i,i,e,i,e,i,e,e); (03)(12)(46)(57)

(i,e,e,i,e,i,e,i); (03)(12)(46)(57)

So the permutations of the automorphism group must be extended by mappings which represent the ability of a ligand permutation to change the configuration of the center. These mappings will be called inversions. (In Nourse's notation [30] these are called superscripts.) They are determined as follows:

If the center i is fixed under a f, consider the permutation of the ligands of the center under f. If it is even, set E(i):=(0)(1), else set E(i):=(01).
If the center is mapped to j, set E(j):=(0)(1) if the ligands are permuted even, and E(j):=(01) otherwise.

Table 3. Stereoisomers of structure III ordered 0,3,4,6,1,2,5,7:

(0, 0, 0, 0, 0, 0, 0, 0) (0, 1, 0, 0, 0, 0, 0, 1)

(0, 0, 0, 0, 0, 0, 0, 1) (0, 1, 0, 0, 0, 0, 1, 0)

(0, 0, 0, 1, 0, 0, 0, 0) (0, 1, 0, 1, 0, 0, 0, 0)

(0, 0, 0, 1, 0, 0, 0, 1) (0, 1, 0, 1, 0, 0, 0, 1)

(0, 0, 0, 1, 0, 0, 1, 0) (0, 1, 0, 1, 0, 0, 1, 0)

(0, 0, 0, 1, 0, 0, 1, 1) (0, 1, 0, 1, 0, 0, 1, 1)

(0, 0, 1, 1, 0, 0, 0, 0) (0, 1, 1, 1, 0, 0, 0, 0)

(0, 0, 1, 1, 0, 0, 0, 1) (0, 1, 1, 1, 0, 0, 0, 1)

(0, 1, 0, 0, 0, 0, 0, 0) (0, 1, 1, 1, 0, 0, 1, 0)

So we can define the configuration symmetry group (CSG) as C(g,b) := { (E;f) | f in A(g,b)}, and it proves [28] to be a subgroup of the wreath product [14,15] of the S2 over {1,...,s} with A(g,b), if there are s stereocenters in the molecule.

The CSG of the cyclobutan-derivative III is listed in table 2. The stereocenters are numbered.

After the construction of C(g,b), we are able to verify the potential stereocenters by the following rule: if there is a permutation in the automorphism group whose restriction to the ligands of a potential stereocenter is odd, this atom is a real stereocenter only if the non-fixed ligands themselves contain at least one stereocenter. The direction of the search should be from the outer to the inner regions of the molecule in order to reduce the expenditure and to avoid overlooking a center.

For the actual generation of the stereoisomers, we remark that with s stereocenters we get all essentially different configurational isomers by computing a transversal of C(,)\\{1,...,s} to {0,1}. This transversal can be computed with (an enhanced version of) the basic algorithm from section 2.

A list with the stereoisomers of III as (0,1)-sequences is shown in table 3. As any structural isomer has one or more configurational isomers, the numbers of possible stereoisomers of a set of constitutional isomers can be computed, as shown in table 4.

It is also interesting to compare the sizes of both sets of isomers with each other. In fig. 3 the ratios of stereoisomers per structural isomer are printed. While the numbers of constitutional isomers of CnHm have peaks at 4 to 8 hydrogens, the ratio decreases monotonously as the number of hydrogens increases.

As discussed above, this algorithm also generates all isomers without regard to their realizability. The evaluation of the results is left to the user.

The generation algorithm in the form discussed above can only handle stereocenters of valence 3 or 4. It is, however, possible to extend it to higher valences [31].

Table 4. Total numbers of configurational isomers, compared with constitutional isomers (rows: carbons, columns: hydrogens)

CiHj 2 4 6 8 10 12 14 16 18 20 22

3
2

3

3

4

2

2

1

1
- - - - - - -

4
7

15

11

22

9

13

5

6

2

2
- - - - - -

5
21

76

40

126

40

100

26

48

10

13

3

3
- - - - -

6
85

518

185

1053

217

958

159

514

77

171

25

38

5

5
- - - -

7
356

4235

920

10313

1230

10820

1031

6464

575

2447

222

620

56

101

9

11
- - -

8
1804

42694

5308

119777

7982

141083

7437

93365

4679

39417

2082

11350

654

2248

139

299

18

24
- -

9
10064

498276

33860

1575402

56437

2053105

57771

1490065

40139

692420

19983

221494

7244

50270

1902

8102

338

875

35

55
-

10
64352

6576033

241297

6576033

439373

32854849

488125

25956181

369067

13133974

201578

4605287

81909

1165876

24938

216341

5568

28977

852

2640

75

136

Click here for Picture

Figure 3. Ratios of stereoisomers per structural isomer

7 SPATIAL REALIZATIONS

It is nice to get the stereoisomers as (0,1)-sequences or even R/S-sequences, but this is often not really useful. Since stereoisomerism is a steric property of a molecule, spatial represention of the isomers is desirable.

This can be done by automatic model builders [32-34] or by geometric construction starting from reference coordinates, e.g. obtained by a molecular mechanics calculation [35]. We followed the second way, basing our method on earlier approaches [27, 36].

The first step is to identify the configuration in the reference placement. This is performed by geometric analysis: Let a stereocenter (position X) have its neighbours in the positions X1, X2, X3 and X4. Calculate n := X1X2 x X1X3. If n.(X1-X) <0, return 1; else return 0. The double bonds are examined analogously.

Next the ligands at the centers are reflected. In the recursive procedure four cases are considered:

The center does not lie in a ring: reflect the ligands containing the least centers.
The center lies in exactly one ring: if the number of centers in the cyclic ligands is lower than the one in the acyclic ligands, reflect the cyclic ligands; else vice versa.
The center lies in two rings of a spiro-compound: reflect the ligands belonging to the ring with fewer centers.
The center lies in more than one ring; the rings have more than one atom in common: here no exact, only approximate constructions [28] are possible.

The first example (case 1) was tartaric acid in fig. 2. As a second, we consider tetramethylcyclobutane (structure IV, case 2).

Click here for Picture

Its four stereoisomers are shown in fig. 4. Another example, now for case 3, is provided by the spiro-compound V. There are 8 configurational isomers of this molecule; two of them are drawn in fig. 5

Click here for Picture

Figure 4. Stereoisomers of structure IV

Click here for Picture

Figure 5. Configurations (0,0,0,0) and (0,1,0,1) of the eight stereoisomers of structure V

8 MOLGEN USER INTERFACE

Since the generation process as described above should be performed in close interaction with the user, a graphical user interface has been created. It consists of three parts:

MOLGEN, the main program, contains the generator for structural isomers, controls the input of restrictions to it and displays the result by computing two-dimensional placements.
MOLED is a simple structure editor to enter the substructure restrictions for the generator.
MOLVIEW calculates spatial coordinates and configurational isomers, and displays them.

As the users of such software differ considerably, MOLGEN is available [37] for DOS, Microsoft Windows, IBM OS/2 and Sun SOLARIS/Motif in different versions. Demos are accessible from the WorldWide Web document http://btm2xd.mat.uni-bayreuth.de/molgen.

9 DISCUSSION

The implementation of the algorithms described above has led to a computer program that is able to generate all isomers of a given brutto formula, optionally subject to other restrictions. The ability to construct configurational isomers crosses the border to stereochemistry. A flexible user environment makes the program available for a large number of research problems.

A prominent application of structure generating programs are automated structure elucidation systems. At present MOLGEN is incorporated in SpecInfo [38]. There a given spectrum is interpreted via a large spectra database, yielding a set of substructures. MOLGEN generates candidates based thereon, which can afterwards be verified.

But a structure generator can also be useful in chemical education [19,20], e.g. to visualize the variety of isomerism, in pharmacology [39], and various other fields.

ACKNOWLEDGEMENT

We are grateful to the German Federal Ministry of Education, Science, Research and Technology (BMBF) for financial support.

REFERENCES AND NOTES

Last modified: 04/09/96, T. Wieland

(e,e,e,e,e,e,e,e);	1
(e,e,i,e,e,i,e,e);	1
(e,i,e,e,e,e,e,i);	1
(e,i,i,e,e,i,e,i);	1
(i,e,e,i,e,e,e,e);	(12)(57)
(i,i,e,i,e,e,e,i);	(12)(57)
(i,e,i,i,e,i,e,e);	(12)(57)
(i,i,i,i,e,i,e,i);	(12)(57)
(e,i,i,e,e,e,e,e);	(03)(46)
(e,i,e,e,e,i,e,e);	(03)(46)
(e,e,i,e,e,e,e,i);	(03)(46)
(e,e,e,e,e,i,e,i);	(03)(46)
(i,i,i,i,e,e,e,e);	(03)(12)(46)(57)
(i,e,i,i,e,e,e,i);	(03)(12)(46)(57)
(i,i,e,i,e,i,e,e);	(03)(12)(46)(57)
(i,e,e,i,e,i,e,i);	(03)(12)(46)(57)

(0, 0, 0, 0, 0, 0, 0, 0)		(0, 1, 0, 0, 0, 0, 0, 1)
(0, 0, 0, 0, 0, 0, 0, 1)		(0, 1, 0, 0, 0, 0, 1, 0)
(0, 0, 0, 1, 0, 0, 0, 0)		(0, 1, 0, 1, 0, 0, 0, 0)
(0, 0, 0, 1, 0, 0, 0, 1)		(0, 1, 0, 1, 0, 0, 0, 1)
(0, 0, 0, 1, 0, 0, 1, 0)		(0, 1, 0, 1, 0, 0, 1, 0)
(0, 0, 0, 1, 0, 0, 1, 1)		(0, 1, 0, 1, 0, 0, 1, 1)
(0, 0, 1, 1, 0, 0, 0, 0)		(0, 1, 1, 1, 0, 0, 0, 0)
(0, 0, 1, 1, 0, 0, 0, 1)		(0, 1, 1, 1, 0, 0, 0, 1)
(0, 1, 0, 0, 0, 0, 0, 0)		(0, 1, 1, 1, 0, 0, 1, 0)