cosmo - Constrained search for motifs in DNA sequences


Constraint file structure

The constraint file you supply contains the definitions for one or more constraint sets. Each constraint set starts with the character @. The only requirement for a given constraint set is that it must contain a definition of the breakdown of the motif into intervals. Apart from this mandatory entry, each constraint set may contain a number of optional constraint definitions that are described below. Examples of valid constraint files are given at the end of this document.
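
The layout described above is simple enough to read with a few lines of code. The following Python sketch is not part of cosmo. The helper name parse_constraint_file is hypothetical, and the sketch assumes that every constraint set begins on a line starting with @, that every entry begins on a line starting with >, and that all other non-blank lines have the form Key: value.

```python
def parse_constraint_file(text):
    """Split constraint-file text into sets, each a list of (entry_name, fields).

    Hypothetical helper, not part of cosmo. fields is a list of (key, value)
    pairs rather than a dict because keys such as Length may repeat.
    """
    sets = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@"):
            sets.append([])                          # start a new constraint set
        elif line.startswith(">"):
            if not sets:
                raise ValueError("entry found before any '@' line")
            sets[-1].append((line[1:].strip(), []))  # start a new entry
        else:
            if not sets or not sets[-1]:
                raise ValueError("field found outside an entry: " + line)
            key, sep, value = line.partition(":")
            if not sep:
                raise ValueError("expected 'Key: value', got: " + line)
            sets[-1][-1][1].append((key.strip(), value.strip()))
    return sets


example = """
@ ConstraintSet 1

>IntervalSetup
Length: 3 bp
Length: variable

>IcBounds
Interval: 1
Bounds: 1.0 to 2.0
"""

parsed = parse_constraint_file(example)
```

Here parsed holds one constraint set with two entries, IntervalSetup and IcBounds, each carrying its fields in file order.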

Motif intervals

Each constraint set must contain an entry like

>IntervalSetup
Length: 3 bp
Length: 30%
Length: variable

that specifies how the motif can be conceptually divided into separate intervals that each correspond to a distinct set of constraints on the position weight matrix. The entry above divides the motif into three separate intervals: the first one always has a length of 3 base pairs, regardless of the motif width under consideration, the second one always takes up 30% of the entire motif width, and the last one is assigned whatever number of base pairs is left after the first two intervals have been allocated. Interval widths must be specified in a way that scales with the motif width because cosmo will generally search through a range of candidate motif widths.

The entry has to start with the line >IntervalSetup. Each following line begins with the token Length: and sets up a new interval. The different interval types are then specified in the way shown above. If you do not want to divide the motif into intervals, you may use the entry

>IntervalSetup
Length: variable
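
To see how such an entry translates into concrete interval widths for one candidate motif width, consider the following Python sketch. It is an illustration only: the function name interval_widths is hypothetical, and the sketch assumes that percentage lengths are rounded to the nearest base pair and that at most one interval is variable. How cosmo itself rounds percentage lengths is not stated here.

```python
def interval_widths(specs, motif_width):
    """Return interval widths in bp for one candidate motif width.

    specs holds the text after "Length:", e.g. ["3 bp", "30%", "variable"].
    Assumptions (not confirmed by the cosmo documentation): percentages are
    rounded to the nearest base pair and at most one interval is "variable".
    """
    widths = []
    variable_index = None
    for i, spec in enumerate(specs):
        spec = spec.strip()
        if spec == "variable":
            if variable_index is not None:
                raise ValueError("more than one variable interval")
            variable_index = i
            widths.append(0)                      # placeholder, filled in below
        elif spec.endswith("%"):
            fraction = float(spec[:-1]) / 100.0
            widths.append(int(round(fraction * motif_width)))
        elif spec.endswith("bp"):
            widths.append(int(spec[:-2]))
        else:
            raise ValueError("unrecognized length: " + spec)
    if variable_index is not None:
        # The variable interval receives whatever is left over.
        widths[variable_index] = motif_width - sum(widths)
    if min(widths) < 0 or sum(widths) != motif_width:
        raise ValueError("intervals do not fit motif width %d" % motif_width)
    return widths
```

For an entry with the lengths 3 bp, 30%, and variable and a motif width of 10, this sketch yields widths of 3, 3, and 4 base pairs.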

Information content

The information content at a position of the motif at which the letters A, C, G, and T occur with probabilities p_A, p_C, p_G, and p_T, respectively, is defined as

IC = 2 + p_A \log_2 p_A + p_C \log_2 p_C + p_G \log_2 p_G + p_T \log_2 p_T

For DNA sequences, the information content is bounded between 0 and 2. It is related to the entropy of this position through the relation

IC = 2 - Entropy

The information content is a measure of how conserved a position in the motif is, with higher information content corresponding to positions that have been highly conserved and lower information content corresponding to positions that have undergone more frequent substitutions.
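
The definition can be checked numerically with a short Python sketch. The function name information_content is chosen for illustration, and a probability of zero is treated as contributing nothing to the sum, which is the usual convention for entropy.

```python
import math


def information_content(p_a, p_c, p_g, p_t):
    """Information content in bits: IC = 2 + sum of p * log2(p)."""
    probs = (p_a, p_c, p_g, p_t)
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute 0, the limit of p * log2(p) as p -> 0.
    return 2.0 + sum(p * math.log2(p) for p in probs if p > 0)


uniform = information_content(0.25, 0.25, 0.25, 0.25)   # no conservation
fixed = information_content(1.0, 0.0, 0.0, 0.0)         # fully conserved
two_way = information_content(0.5, 0.5, 0.0, 0.0)       # two equally likely letters
```

A position at which all four letters are equally likely has information content 0, a position that always shows the same letter has information content 2, and a position split evenly between two letters has information content 1.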

Bound constraints on the information content across an interval

The entry

>IcBounds
Interval: 2
Bounds: 0 to 0.8

specifies that the information content across the second interval of the motif must lie between 0.0 and 0.8. Such a constraint may be useful if we suspect that the information content of the motif follows a certain general pattern such as high-low-high or low-high-low.

The entry has to start with the line >IcBounds or >ICBounds. The next line specifies which interval the bound constraint applies to. The last line gives the lower and upper bounds, respectively, on the information content across the chosen interval.

Shape constraints on the information content profile across an interval

The entry

>IcShape
Interval: 1
Shape: Linear
LeftBounds: 1.0 to 2.0
RightBounds: 0.8 to 1.5
ErrorTol: 0.05

specifies that the information content across interval 1 cannot deviate by more than 0.05 from a linear shape, with the information content at the start of the interval bounded between 1.0 and 2.0 and the information content at the end of the interval bounded between 0.8 and 1.5. Such constraints may be useful if we wish to exclude from consideration position weight matrices whose information content profile is sharply discontinuous across a given interval.

The entry has to start with the line >IcShape or >ICShape. The next line specifies which interval the shape constraint applies to. The next line specifies the functional form of the information content across that interval. The possible entries are given by Linear, MonotoneIncreasing, and MonotoneDecreasing. The next two lines give bounds on the information content at the start and the end of the interval, respectively. The last line sets a limit on how much the actual information content may deviate from the specified shape.
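
The following Python sketch illustrates one way to read the Linear shape with an error tolerance. It is not cosmo's implementation. The function name fits_linear_shape is hypothetical, and the sketch assumes that the deviation is measured as the largest absolute difference between the information content at a position and a straight line running from a given left-edge value to a given right-edge value.

```python
def fits_linear_shape(ic_profile, left, right, error_tol):
    """Check a per-position IC profile against a straight line.

    left and right are the line's values at the first and last position of
    the interval. Measuring deviation as the largest absolute difference per
    position is an assumption; cosmo's exact definition is not given here.
    """
    n = len(ic_profile)
    if n == 0:
        raise ValueError("empty interval")
    for i, ic in enumerate(ic_profile):
        t = i / (n - 1) if n > 1 else 0.0      # relative position in [0, 1]
        expected = left + t * (right - left)   # value of the line at position i
        if abs(ic - expected) > error_tol:
            return False
    return True


smooth = fits_linear_shape([1.8, 1.52, 1.2], left=1.8, right=1.2, error_tol=0.05)
jagged = fits_linear_shape([1.8, 1.0, 1.2], left=1.8, right=1.2, error_tol=0.05)
```

In this sketch the first profile stays within 0.05 of the line from 1.8 to 1.2 and passes, while the second dips to 1.0 in the middle and fails.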

Lower bounds on nucleotide frequencies across an interval

The entry

>NucFreq
Interval: 2
Pos: all
Nuc: GC
LowerBound: 0.7

specifies that the GC content across interval 2 be at least 70%. The entry

>NucFreq
Interval: 1
Pos: 2
Nuc: A
LowerBound: 0.5

requires the nucleotide A to occur at least 50% of the time in position 2 of interval 1.

The entry has to start with the line >NucFreq or >NucProb. The next line specifies which interval the constraint applies to. The next line specifies a position in that interval, with the choice all or avg corresponding to requiring that the average nucleotide frequency across that interval be no less than the given lower bound. Note that a lower bound on a particular position can only be given for intervals whose length is a fixed number of base pairs and thus does not change as the motif width under consideration changes. The following line specifies the nucleotides whose frequency is to be bounded from below, with possible entries given by A, C, G, T, AT, and GC. The last line finally gives the lower bound on the nucleotide frequency.
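
As an illustration of the two variants, the Python sketch below evaluates a lower bound either on the average across an interval or at a single position. The representation of an interval as a list of dictionaries, one per position, and the function names are assumptions made for this example.

```python
def average_frequency(pwm_interval, nucleotides):
    """Average combined frequency of the given nucleotides across positions.

    pwm_interval is a list with one dict per position, mapping "A", "C", "G"
    and "T" to probabilities (a hypothetical representation).
    """
    per_position = [sum(col[n] for n in nucleotides) for col in pwm_interval]
    return sum(per_position) / len(per_position)


def meets_lower_bound(pwm_interval, nucleotides, lower_bound, pos="all"):
    """Mimic Pos: all or avg (average) and Pos: k (1-based position)."""
    if pos in ("all", "avg"):
        return average_frequency(pwm_interval, nucleotides) >= lower_bound
    column = pwm_interval[int(pos) - 1]
    return sum(column[n] for n in nucleotides) >= lower_bound


interval = [
    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # GC frequency 0.8
    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},   # GC frequency 0.6
]
gc_ok = meets_lower_bound(interval, "GC", 0.65)        # average is about 0.7
a_ok = meets_lower_bound(interval, "A", 0.5, pos=2)    # A has 0.2 at position 2
```

With these made-up frequencies the average GC content of about 0.7 satisfies a lower bound of 0.65, whereas the requirement that A occur at least 50% of the time in position 2 is not met.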

Palindromic intervals

The entry

>Palindrome
Intervals: 1 and 3
ErrorTol: 0.1

specifies that intervals 1 and 3 of the motif be palindromes of each other, with corresponding nucleotide frequencies deviating from each other by no more than 0.1. Such constraints are useful, for example, if you know that the DNA-binding domains of the transcription factor under consideration are homodimeric, causing the DNA stretches that are bound by the transcription factor to be palindromes of each other.

The entry has to start with the line >Palindrome or >Pal. The next line gives the two intervals that are required to be palindromes of each other, and the last line defines the error tolerance.
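
The Python sketch below shows one possible reading of this constraint. It assumes that "palindromes of each other" means reverse complements, as is usual for DNA: position i of the first interval is compared with the mirrored position of the second interval, with A paired with T and C paired with G. The function name, the data layout, and the use of the largest absolute difference are assumptions made for this example and may differ from what cosmo does internally.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def are_palindromic(left, right, error_tol):
    """Check whether two intervals are reverse complements within error_tol.

    left and right are lists with one dict per position, mapping "A", "C",
    "G" and "T" to probabilities (a hypothetical representation).
    """
    n = len(left)
    if n != len(right):
        return False
    for i, col in enumerate(left):
        partner = right[n - 1 - i]              # mirrored position
        for nuc, freq in col.items():
            if abs(freq - partner[COMPLEMENT[nuc]]) > error_tol:
                return False
    return True


first = [
    {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
]
# Exact reverse complement of `first`.
third = [
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.02, "C": 0.03, "G": 0.05, "T": 0.90},
]
# Same as `third`, but the last position deviates by 0.2.
third_off = [
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.22, "C": 0.03, "G": 0.05, "T": 0.70},
]
match = are_palindromic(first, third, 0.1)
mismatch = are_palindromic(first, third_off, 0.1)
```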

Submotifs

The entry

>Submotif
Motif: GGAA
MinFreq: 0.90

requires that the motif contain the substring GGAA, with each of these nucleotides occurring in its respective position with a frequency of roughly 90% or more. Such constraints are useful when the transcription factor under consideration is known to belong to a family of transcription factors that is characterized by the occurrence of a certain submotif within the motif. DNA sequences bound by transcription factors with an ETS domain, for example, all contain the stretch GGAA somewhere within the binding site.

The entry has to start with the line >Submotif or >Sub. The next two lines give the nucleotide sequence of the submotif and the approximate minimum frequency with which you want these nucleotides to occur within the motif.
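
A simplified version of this requirement can be sketched in Python as follows. The sketch treats MinFreq as a strict threshold, whereas the constraint described above is only approximate, and the names contains_submotif and column are hypothetical helpers introduced for this example.

```python
def contains_submotif(pwm, submotif, min_freq):
    """Return the first 0-based offset at which every submotif letter has
    frequency >= min_freq, or None if there is no such offset.

    pwm is a list with one dict per position, mapping "A", "C", "G" and "T"
    to probabilities (a hypothetical representation).
    """
    k = len(submotif)
    for start in range(len(pwm) - k + 1):
        if all(pwm[start + j][nuc] >= min_freq for j, nuc in enumerate(submotif)):
            return start
    return None


def column(nuc, p):
    """Hypothetical helper: give `nuc` probability p, split the rest evenly."""
    rest = (1.0 - p) / 3.0
    return {n: (p if n == nuc else rest) for n in "ACGT"}


pwm = [column("T", 0.4), column("G", 0.95), column("G", 0.92),
       column("A", 0.9), column("A", 0.97), column("C", 0.5)]
offset = contains_submotif(pwm, "GGAA", 0.90)
missing = contains_submotif(pwm, "GGAA", 0.95)
```

In this made-up matrix the submotif GGAA is found starting at the second position when the threshold is 0.90, and is not found when the threshold is raised to 0.95.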

Bounds on differences of shape parameters

The entry

>ParmDiff
Parameters: 1b - 1a
Bounds: -2 to 0

specifies that the difference between the information content at the end and the start of the first interval be bounded between -2 and 0. Such a constraint could be combined with a linear shape constraint across interval 1 to require that the information content across interval 1 be linear and decreasing. The entry

>ParmDiff
Parameters: 1b - 1a
Bounds: 0 to 0

specifies that the information content at the end and the start of the first interval be identical. Such a constraint could be combined with a linear shape constraint across interval 1 to require that the information content across interval 1 be constant. The entry

>ParmDiff
Parameters: 2a - 1b
Bounds: 0 to 0

specifies that the information content at the start of interval 2 and at the end of interval 1 be identical, requiring the information content to be continuous at the junction between these two intervals. Example 3 below gives another example of the use of this type of constraint.

The entry has to start with the line >ParmDiff or >ParameterDifference. The next line defines the particular difference of shape parameters that we want to bound. Parameters are specified by the interval number followed by the letter a or b, denoting the left and right edge of the interval, respectively. The last line defines the bounds on this parameter difference.
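
The following Python sketch shows how such an entry could be evaluated once values for the shape parameters are known. It is an illustration only: the function name parm_diff_ok and the dictionary of edge values are assumptions, and the parsing covers only the forms shown in this section.

```python
import re


def parm_diff_ok(parameters, bounds, edge_values):
    """Evaluate a parameter-difference constraint.

    parameters: text such as "1b - 1a"
    bounds:     text such as "-2 to 0"
    edge_values: dict mapping names like "1a" to the information content at
                 that interval edge (a hypothetical representation).
    """
    match = re.fullmatch(r"\s*(\d+[ab])\s*-\s*(\d+[ab])\s*", parameters)
    if match is None:
        raise ValueError("cannot parse parameters: " + parameters)
    lower_text, sep, upper_text = bounds.partition(" to ")
    if not sep:
        raise ValueError("cannot parse bounds: " + bounds)
    lower, upper = float(lower_text), float(upper_text)
    diff = edge_values[match.group(1)] - edge_values[match.group(2)]
    return lower <= diff <= upper


edges = {"1a": 1.8, "1b": 1.0, "2a": 1.0, "2b": 1.9}
decreasing = parm_diff_ok("1b - 1a", "-2 to 0", edges)    # difference is about -0.8
continuous = parm_diff_ok("2a - 1b", "0 to 0", edges)     # difference is 0.0
not_decreasing = parm_diff_ok("2b - 2a", "-2 to 0", edges)  # difference is about 0.9
```

Note that bounds of 0 to 0 demand exact equality in this sketch; a practical check on estimated values would likely need a small tolerance.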

Example 1: Palindromic high-low-high motif

@ ConstraintSet 1

>IntervalSetup
Length: 3 bp
Length: variable
Length: 3 bp

>IcBounds
Interval: 1
Bounds: 1.0 to 2.0

>IcBounds
Interval: 2
Bounds: 0 to 0.8

>Pal
Intervals: 1 and 3
ErrorTol: 0.05

This constraint set divides the motif into three separate intervals, with the outer two having a fixed length of 3 base pairs and the middle one taking up the remaining number of base pairs. The information content across interval 1 is required to be at least 1.0, whereas the information content across interval 2 can be no greater than 0.8. Finally, the outer two intervals are required to be palindromes of each other.

This constraint set would be useful if we knew that the transcription factor under consideration had two homodimeric DNA-binding domains that each bind DNA stretches of length 3 base pairs. Stretches of the motif that are bound by the protein can be expected to be more highly conserved than other portions, causing them to have a higher information content. The homodimeric structure of the transcription factor forces it to bind to two stretches of the motif that are palindromes of each other.

Example 2: Palindromic high-low-high motif with "empty" constraint set

@ ConstraintSet 1

>IntervalSetup
Length: 3 bp
Length: variable
Length: 3 bp

>IcBounds
Interval: 1
Bounds: 1.0 to 2.0

>IcBounds
Interval: 2
Bounds: 0 to 0.8

>Pal
Intervals: 1 and 3
ErrorTol: 0.05

@ ConstraintSet 2

>IntervalSetup
Length: variable

This constraint file is identical to the previous example except that it includes the definition for a second constraint set. This second constraint set is "empty" in the sense that it does not impose any constraints on the motif to be discovered.

When given several constraint sets, cosmo will select the one that appears to be most compatible with the sequences at hand. By including an empty constraint set, we give cosmo the chance to reject the first constraint set as not fitting the data very well and to work in the unconstrained setup instead. This reduces the risk of degrading the performance of cosmo by specifying a set of constraints that the true motif does not in fact satisfy. Note that the input form has an option that will automatically add an "empty" constraint set to the constraint file that you supplied, saving you the need to include the empty constraint set in your file.

Example 3: V-shaped information content profile

@ ConstraintSet 1

>IntervalSetup
Length: 50%
Length: 50%

>IcShape
Interval: 1
Shape: Linear
LeftBounds: 1.0 to 2.0
RightBounds: 0.5 to 1.5
ErrorTol: 0.05

>IcShape
Interval: 2
Shape: Linear
LeftBounds: 0.5 to 1.5
RightBounds: 1.0 to 2.0
ErrorTol: 0.05

>ParmDiff
Parameters: 1b - 1a
Bounds: -2.0 to 0.0

>ParmDiff
Parameters: 2b - 2a
Bounds: 0.0 to 2.0

>ParmDiff
Parameters: 2a - 1b
Bounds: 0.0 to 0.0

This constraint file divides the motif into two intervals of equal length. The information content across both intervals is required to be linear. The parameter difference bounds require the information content to be decreasing across the first interval and increasing across the second interval. Finally, the information content profile is required to be continuous at the junction between the two intervals. This setup specifies a V-shaped information content profile.

Example 4: Submotif or high GC content

@ ConstraintSet 1

>IntervalSetup
Length: variable

>Submotif
Motif: GCCG
MinFreq: 0.80

@ ConstraintSet 2

>IntervalSetup
Length: variable

>NucFreq
Interval: 1
Pos: all
Nuc: GC
LowerBound: 0.7

This constraint file defines two alternative constraint sets. The first requires the motif to contain the submotif GCCG. The second represents a somewhat weaker version of this requirement that only requires the GC content across the entire motif to be at least 70%.