Constraint file structure
The constraint file you supply contains the definitions for one or more constraint sets. Each constraint set starts with the character @. The only requirement for a given constraint set is that it must contain a definition of the breakdown of the motif into intervals. Apart from this mandatory command, each constraint set may contain a number of optional constraint definitions that are described below. You may wish to look at some examples of valid constraint files.
Motif intervals
Each constraint set must contain an entry like
>IntervalSetup
Length: 3 bp
Length: 30%
Length: variable
that specifies how the motif can be conceptually divided into separate intervals that each correspond to a distinct set of constraints on the position weight matrix. The entry above divides the motif into three separate intervals: The first one always has length 3 base pairs, regardless of the motif width under consideration, the second one always takes up 30% of the entire motif width, and the last one is assigned whatever number of base pairs is left after the first two intervals have been allocated. We are forced to specify how interval widths scale with changing motif widths since cosmo will generally search through a range of candidate motif widths.
The entry has to start with the line >IntervalSetup. Each following line begins with the token Length: and sets up a new interval. The different interval types are then specified in the way shown above. If you do not want to divide the motif into intervals, you may use the entry
>IntervalSetup
Length: variable
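To make the scaling rule concrete, the following Python sketch allocates interval lengths for one candidate motif width. It is an illustration only: the function name allocate_intervals, the tuple encoding of the Length: lines, the rounding of percentage intervals, and the restriction to a single variable interval are all assumptions, since the rules cosmo applies internally are not stated here.

```python
def allocate_intervals(specs, motif_width):
    """Return the length in base pairs of each interval for one motif width.

    specs is a list of (kind, value) pairs mirroring the Length: lines:
    ("bp", 3), ("pct", 30) or ("variable", None). Hypothetical encoding.
    """
    lengths = []
    variable_index = None
    for i, (kind, value) in enumerate(specs):
        if kind == "bp":
            lengths.append(int(value))
        elif kind == "pct":
            # Assumption: round to the nearest base pair.
            lengths.append(round(motif_width * value / 100))
        elif kind == "variable":
            if variable_index is not None:
                # Assumption: this sketch supports one variable interval only.
                raise ValueError("more than one variable interval")
            variable_index = i
            lengths.append(0)  # filled in below
        else:
            raise ValueError(f"unknown interval type: {kind}")

    used = sum(lengths)
    if variable_index is not None:
        remaining = motif_width - used
        if remaining < 0:
            raise ValueError("fixed intervals exceed the motif width")
        lengths[variable_index] = remaining
    elif used != motif_width:
        raise ValueError("interval lengths do not add up to the motif width")
    return lengths
```

For the three-interval entry above and a motif width of 10, this sketch assigns 3, 3, and 4 base pairs.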
Information content
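The information content at a position of the motif at which the letters A, C, G, and T occur with probabilities p_A, p_C, p_G, and p_T, respectively, is defined as
$$ IC = 2 + \sum_{j \in \{A,C,G,T\}} p_j \log_2 p_j $$
where a term with p_j = 0 is taken to be 0. For DNA sequences, the information content is bounded between 0 and 2. It is related to the entropy H of this position through the relation
$$ IC = 2 - H, \qquad H = -\sum_{j \in \{A,C,G,T\}} p_j \log_2 p_j $$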
The information content is a measure for how conserved a position in the motif is, with higher information content corresponding to positions that have been highly conserved and lower information content corresponding to positions that have undergone more frequent substitutions.
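As a small numerical illustration, the Python sketch below computes the information content of a single position, assuming the standard definition of 2 bits minus the Shannon entropy in bits. The function name information_content is chosen for this example and is not part of cosmo.

```python
import math


def information_content(probs):
    """Information content in bits of one motif position.

    probs holds the probabilities of A, C, G and T at that position.
    Terms with probability 0 contribute nothing to the entropy.
    """
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 - entropy
```

A uniform position (all four probabilities equal to 0.25) has information content 0, and a perfectly conserved position has information content 2.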
Bound constraints on the information content across an interval
The entry
>IcBounds
Interval: 2
Bounds: 0 to 0.8
specifies that the information content across the second interval of the motif must lie between 0.0 and 0.8. Such a constraint may be useful if we suspect that the information content of the motif follows a certain general pattern such as high-low-high or low-high-low.
The entry has to start with the line >IcBounds or >ICBounds. The next line specifies which interval the bound constraint applies to. The last line gives the lower and upper bounds on the information content across the chosen interval respectively.
Shape constraints on the information content profile across an interval
The entry
>IcShape
Interval: 1
Shape: Linear
LeftBounds: 1.0 to 2.0
RightBounds: 0.8 to 1.5
ErrorTol: 0.05
The entry has to start with the line >IcShape or >ICShape. The next line specifies which interval the shape constraint applies to. The next line specifies the functional form of the information content across that interval. The possible entries are given by Linear, MonotoneIncreasing, and MonotoneDecreasing. The next two lines give bounds on the information content at the start and the end of the interval, respectively. The last line sets a limit on how much the actual information content may deviate from the specified shape.
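One possible reading of the Linear shape constraint is sketched below in Python. It assumes that the shape is a straight line between an information content value at the left edge and one at the right edge, and that ErrorTol bounds the largest absolute deviation at any position. Both the deviation measure and the names linear_profile and fits_linear_shape are assumptions made for this illustration; the measure cosmo actually uses is not specified here.

```python
def linear_profile(left, right, n):
    """Straight-line profile with n positions from left to right."""
    if n == 1:
        return [left]
    step = (right - left) / (n - 1)
    return [left + i * step for i in range(n)]


def fits_linear_shape(ic, left, right, left_bounds, right_bounds, error_tol):
    """Check an observed profile ic against a linear shape.

    left and right are candidate edge values; they must respect the
    LeftBounds and RightBounds ranges, given as (lower, upper) pairs.
    """
    if not left_bounds[0] <= left <= left_bounds[1]:
        return False
    if not right_bounds[0] <= right <= right_bounds[1]:
        return False
    target = linear_profile(left, right, len(ic))
    # Assumption: the tolerance applies to the maximum absolute deviation.
    return max(abs(x - t) for x, t in zip(ic, target)) <= error_tol
```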
Lower bounds on nucleotide frequencies across an interval
The entry
>NucFreq
Interval: 2
Pos: all
Nuc: GC
LowerBound: 0.7
>NucFreq
Interval: 1
Pos: 2
Nuc: A
LowerBound: 0.5
The entry has to start with the line >NucFreq or >NucProb. The next line specifies which interval the constraint applies to. The next line specifies a position in that interval, with the choice all or avg corresponding to requiring that the average nucleotide frequency across that interval be no less than the given lower bound. Note that a lower bound on a particular position can only be given for intervals whose length is a fixed number of base pairs and thus does not change as the motif width under consideration changes. The following line specifies the nucleotides whose frequency is to be bounded from below, with possible entries given by A, C, G, T, AT, and GC. The last line finally gives the lower bound on the nucleotide frequency.
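The check described above can be sketched in Python as follows. The representation of an interval as a list of per-position dictionaries, the function names, and the use of 1-based positions within the interval are assumptions made for this illustration.

```python
def average_frequency(interval, nucleotides):
    """Average combined frequency of the given nucleotides across an interval.

    interval is a list of dicts, one per position, mapping "A", "C", "G"
    and "T" to probabilities. nucleotides is a string such as "GC" or "A".
    """
    per_position = [sum(col[n] for n in nucleotides) for col in interval]
    return sum(per_position) / len(per_position)


def satisfies_lower_bound(interval, nucleotides, lower_bound, pos="all"):
    """Check a NucFreq-style lower bound on an interval."""
    if pos in ("all", "avg"):
        return average_frequency(interval, nucleotides) >= lower_bound
    # Assumption: positions are counted from 1 within the interval.
    col = interval[pos - 1]
    return sum(col[n] for n in nucleotides) >= lower_bound
```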
Palindromic intervals
The entry
>Palindrome
Intervals: 1 and 3
ErrorTol: 0.1
The entry has to start with the line >Palindrome or >Pal. The next line gives the two intervals that are required to be palindromes of each other, and the last line defines the error tolerance.
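The Python sketch below shows one way such a palindrome check could work, assuming that "palindromes of each other" means that one interval is the reverse complement of the other and that ErrorTol bounds the largest difference between corresponding probabilities. The deviation measure and all function names are assumptions; the exact criterion used by cosmo is not given here.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(interval):
    """Reverse the position order and swap each nucleotide with its complement."""
    return [
        {COMPLEMENT[n]: p for n, p in col.items()}
        for col in reversed(interval)
    ]


def max_palindrome_deviation(left, right):
    """Largest absolute probability difference between left and the
    reverse complement of right."""
    if len(left) != len(right):
        raise ValueError("intervals must have the same length")
    rc = reverse_complement(right)
    return max(abs(a[n] - b[n]) for a, b in zip(left, rc) for n in "ACGT")


def is_palindromic(left, right, error_tol):
    # Assumption: the tolerance applies to the maximum deviation.
    return max_palindrome_deviation(left, right) <= error_tol
```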
Submotifs
The entry
>Submotif
Motif: GGAA
MinFreq: 0.90
The entry has to start with the line >Submotif or >Sub. The next two lines give the nucleotide sequence of the submotif and the approximate minimum frequency with which you want these nucleotides to occur within the motif.
Bounds on differences of shape parameters
The entry
>ParmDiff
Parameters: 1b - 1a
Bounds: -2 to 0
>ParmDiff
Parameters: 1b - 1a
Bounds: 0 to 0
>ParmDiff
Parameters: 2a - 1b
Bounds: 0 to 0
The entry has to start with the line >ParmDiff or >ParameterDifference. The next line defines the particular difference of shape parameters that we want to bound. Parameters are specified by the interval number followed by the letter a or b, denoting the left and right edge of the interval, respectively. The last line defines the bounds on this parameter difference.
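To illustrate the parameter notation, the Python sketch below parses an expression such as 1b - 1a and checks the resulting difference against a pair of bounds. It assumes that the shape parameters are available as a dictionary keyed by (interval number, edge letter); that data structure and the function names are hypothetical.

```python
import re

# Matches expressions such as "1b - 1a" or "2a-1b".
PARAM_DIFF = re.compile(r"^\s*(\d+)([ab])\s*-\s*(\d+)([ab])\s*$")


def parse_difference(text):
    """Return the two parameters of a difference as (interval, edge) pairs."""
    match = PARAM_DIFF.match(text)
    if match is None:
        raise ValueError(f"cannot parse parameter difference: {text!r}")
    first = (int(match.group(1)), match.group(2))
    second = (int(match.group(3)), match.group(4))
    return first, second


def difference_in_bounds(text, edge_values, lower, upper):
    """Check lower <= first - second <= upper for the parsed parameters.

    edge_values maps (interval, "a" or "b") to the parameter value at the
    left or right edge of that interval.
    """
    first, second = parse_difference(text)
    diff = edge_values[first] - edge_values[second]
    return lower <= diff <= upper
```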
Example 1: Palindromic high-low-high motif
@ ConstraintSet 1
>IntervalSetup
Length: 3 bp
Length: variable
Length: 3 bp
>IcBounds
Interval: 1
Bounds: 1.0 to 2.0
>IcBounds
Interval: 2
Bounds: 0 to 0.8
>Pal
Intervals: 1 and 3
ErrorTol: 0.05
This constraint set divides the motif into three separate intervals, with the outer two having a fixed length of 3 base pairs and the middle one taking up the remaining number of base pairs. The information content across interval 1 is required to be at least 1.0, whereas the information content across interval 2 can be no greater than 0.8. Finally, the outer two intervals are required to be palindromes of each other.
This constraint set would be useful if we knew that the transcription factor under consideration had two homodimeric DNA-binding domains that each bind DNA stretches of length 3 base pairs. Stretches of the motif that are bound by the protein can be expected to be more highly conserved than other portions, causing them to have a higher information content. The homodimeric structure of the transcription factor forces it to bind to two stretches of the motif that are palindromes of each other.
Example 2: Palindromic high-low-high motif with "empty" constraint set
@ ConstraintSet 1
>IntervalSetup
Length: 3 bp
Length: variable
Length: 3 bp
>IcBounds
Interval: 1
Bounds: 1.0 to 2.0
>IcBounds
Interval: 2
Bounds: 0 to 0.8
>Pal
Intervals: 1 and 3
ErrorTol: 0.05
@ ConstraintSet 2
>IntervalSetup
Length: variable
This constraint file is identical to the previous example except that it includes the definition for a second constraint set. This second constraint set is "empty" in the sense that it does not impose any constraints on the motif to be discovered.
When given several constraint sets, cosmo will select the one that appears to be most compatible with the sequences at hand.
By including an empty constraint set, we give cosmo the chance to reject the first constraint set as not fitting the data very well and to work in the unconstrained setup instead.
This reduces the risk of degrading the performance of cosmo by specifying a set of constraints that the true motif does not in fact satisfy.
Note that the input form has an option that will automatically add an "empty" constraint set to the constraint file that you supplied, saving you the need to include the empty constraint set in your file.
Example 3: V-shaped information content profile
@ ConstraintSet 1
>IntervalSetup
Length: 50%
Length: 50%
>IcShape
Interval: 1
Shape: Linear
LeftBounds: 1.0 to 2.0
RightBounds: 0.5 to 1.5
ErrorTol: 0.05
>IcShape
Interval: 2
Shape: Linear
LeftBounds: 0.5 to 1.5
RightBounds: 1.0 to 2.0
ErrorTol: 0.05
>ParmDiff
Parameters: 1b - 1a
Bounds: -2.0 to 0.0
>ParmDiff
Parameters: 2b - 2a
Bounds: 0.0 to 2.0
>ParmDiff
Parameters: 2a - 1b
Bounds: 0.0 to 0.0
This constraint file divides the motif into two intervals of equal length.
The information content across both intervals is required to be linear.
The parameter difference bounds require the information content to be decreasing across the first interval and increasing across the second interval.
Finally the information content profile is required to be continuous at the junction between the two intervals.
This setup specifies a V-shaped information content profile.
Example 4: Submotif or high GC content
@ ConstraintSet 1
>IntervalSetup
Length: variable
>Submotif
Motif: GCCG
MinFreq: 0.80
@ ConstraintSet 2
>IntervalSetup
Length: variable
>NucFreq
Interval: 1
Pos: all
Nuc: GC
LowerBound: 0.7
This constraint file defines two alternative constraint sets. The first requires the motif to contain the submotif GCCG. The second represents a somewhat weaker version of this requirement that only demands the GC content across the entire motif to be at least 70%.
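The overall file structure described in this document can be summarized with a minimal Python reader. This is a sketch under stated assumptions, not the parser cosmo uses: it assumes one directive per line (as in Example 3), treats every line starting with @ as the start of a constraint set and every line starting with > as the start of an entry, and stores the remaining lines as key and value pairs. The function name parse_constraint_file and the returned data structure are hypothetical.

```python
def parse_constraint_file(text):
    """Split constraint file text into constraint sets and entries.

    Returns a list of dicts of the form
    {"name": ..., "entries": [{"type": ..., "fields": [(key, value), ...]}]}.
    Fields are kept as an ordered list because IntervalSetup repeats
    the Length: key once per interval.
    """
    sets = []
    current_entry = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@"):
            sets.append({"name": line[1:].strip(), "entries": []})
            current_entry = None
        elif line.startswith(">"):
            if not sets:
                raise ValueError("entry found before any constraint set")
            current_entry = {"type": line[1:].strip(), "fields": []}
            sets[-1]["entries"].append(current_entry)
        else:
            if current_entry is None:
                raise ValueError(f"line outside of an entry: {line!r}")
            key, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"expected 'Key: value', got: {line!r}")
            current_entry["fields"].append((key.strip(), value.strip()))

    # Every constraint set must define its breakdown into intervals.
    for constraint_set in sets:
        types = [entry["type"] for entry in constraint_set["entries"]]
        if "IntervalSetup" not in types:
            raise ValueError(
                f"constraint set {constraint_set['name']!r} lacks IntervalSetup"
            )
    return sets
```

Applied to the file in Example 4, this sketch yields two constraint sets, each with an IntervalSetup entry followed by one optional constraint.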