contents
--------

Legalities
    Copyright notice
    Disclaimer
    Citation
    The Package
Introduction
    1. Purpose
    2. DCSE
    3. System requirements
    4. Conventions used
Data formats
    1. The alignment format
        1.1. Symbols for nucleotides
        1.2. Symbols for amino acids
        1.3. Symbol for gaps
        1.4. Symbols describing secondary structure
        1.5. Helix numbering
        1.6. Other symbols
    2. The reference format
Generalities
    1. Defaults
    2. Menus
    3. Editing in DCSE
    4. The pointer
DCSE: getting started
    1. Creating an alignment file
    2. Starting DCSE
        2.1. Selection of organisms and positions
        2.2. The screen
    3. Editing
        3.1. Moving around
        3.2. Shifting characters
        3.3. Dealing with secondary structure signs
DCSE: The menu
    1. Options
        1.1. Display options
        1.2.2. Movesec
        1.2.3. Xchange
        1.2.4. Protect
    2. Filing
        2.1. Saving your data
        2.2. Output
        2.3. Getting help
        2.4. Import
        2.5. Making a reference file from an alignment
        2.6. Moving
        2.7. Selecting other organisms or positions
        2.8. exit
    3. Positioning
        3.1. Goto
        3.2. Markers
        3.3. Finding things
        3.4. Going to the next non standard basepair
    4. Primary structure tools
        4.1. Checking the sequence
        4.2. Aligning sequences
        4.3. Shifting to a position
        4.4. Inserting characters
        4.5. Deleting characters
        4.6. Overwriting sequences
    5. Secondary structure tools
        5.1. Checking the secondary structure
        5.2. Copying secondary structure signs
        5.3. Looking for compensations
        5.4. Base pairing matrix
        5.5. Saving and loading secondary structures
    6. Alignment
        6.1. Inserting and deleting positions
        6.2. Locking and unlocking organisms
        6.3. Grouping and ungrouping sequences
        6.4. Removing sequences
        6.5. Creating a sequence line
        6.6. Changing the name
    7. Divers
        7.1. Information
        7.2. Sequence info
        7.3. Real position
        7.4. Searching sequence motifs
        7.5. Translating a coding sequence
APPENDIX A: the data formats
    1. Alignment format
    2. Reference format
    3. Organism lists
    4. Default files
    5. Marker files
APPENDIX B: Key shortcuts
Hints and tips

Legalities

Copyright notice

Copyright © Peter De Rijk, University of Antwerp (UIA), 1993

You may give this application to anyone, via any medium, so long as it is delivered with ALL the supplied files and UNALTERED, and it is not supplied on a disc you are charging for (except for media and postage costs). I maintain copyright on all the material supplied and reserve the right to amend these conditions in cases which I deem to be misuse.

Disclaimer

This application is supplied free to everyone 'as is'. I do not give any guarantee that it is free of bugs, nor any warranty about its suitability for use. No liability will be accepted for any damage to or loss of data as a result of using this application. However, if there are any problems with it and you notify me of them, I will probably do my best to rectify them.

Citation

A paper about DCSE has been submitted to and accepted by CABIOS. It is, however, not in print yet. If you have used the program to obtain results in a paper you've written, please cite the following reference:

    Peter De Rijk and Rupert De Wachter
    DCSE v2.54, an interactive tool for sequence alignment and secondary structure research.
    Comput. Applic. Biosci.

Reprints of articles in which DCSE is mentioned would be welcomed.

The Package

DCSE has been compiled for the following environments: VMS for VAXstations, Ultrix on DECstations, DOS on IBM compatible PCs and RISC OS on the Acorn Archimedes range. How to install the package depends on which system you use and is explained in the accompanying readme file. The operation of the program (e.g. colour codes, keys) can also differ. These differences are also described in the 'System specifics' section of the readme file.

If you are having problems with the program, contact me. I will do my best to get it fixed. Please report any bugs you have found.
If possible, state your machine's hardware and software configurations. Sending me a full description of the circumstances in which the bug occurs, possibly with the data it happened on, will help me track it down. If you have any suggestions, you can also send them to me. I can be contacted by snail mail or e-mail at the following addresses.

    Peter De Rijk
    University of Antwerp (UIA)
    Department of Biochemistry
    Universiteitsplein 1
    B-2610 Antwerp
    tel.: 32-03-820.23.16
    fax: 32-03-820.22.48
    E-mail: derijkp@reks.uia.ac.be

If you have received this package, please send me your name and address, and if possible your e-mail address and phone number, together with the number of the version you have and the type of computer and operating system, so I can add you to the address list. If you do, I can inform you of new releases and possible bugs. I will do my best to reply as fast as I can to any problems. However, the development of DCSE is not my only task, which is why my response might not always be as fast as you would like.

Introduction
------------

1. Purpose

Although ordinary text editors can be used to make and correct sequence alignments, they are not very well suited for this task, mainly due to the limited number of characters that can be placed on one line. A line of normal text contains about one hundred characters. A single line in an alignment can easily contain several thousand characters. Most editors simply can't handle that many characters. Editors that do handle this are usually so slow when the edited text contains long lines that they are not usable either.

Alignment can be achieved by inserting or deleting gap symbols. However, these classical methods for editing text are not well suited to aligning sequences. Biopolymer sequences have been experimentally determined, and should not be changed unless errors are discovered. Alignment does not involve changing, but shifting characters and groups of characters in the sequences.
In the case of molecules showing a common secondary structure, such as ribosomal RNA, alignment of more variable areas can be guided by this secondary structure information. A sequence editor for these molecules should provide an easy and straightforward method to incorporate this and other information in an alignment. It should also provide tools to locate structural elements, and to check whether proposed elements are correct.

2. DCSE

DCSE (Dedicated Comparative Sequence Editor) originated from the need to maintain the rapidly growing alignment of small subunit ribosomal RNAs (De Rijk, 1992). This alignment now contains almost 2400 sequences and has about 4700 positions. DCSE was created from the start to cope with large alignments. It provides a user-friendly, menu-driven environment to make and maintain sequence alignments.

DCSE offers a lot of useful features for people working on alignments. Dynamic memory allocation lets you use the memory you have at run-time, and you can work on parts of a large alignment file. Sequence and structure information can be combined, and even displayed using colour. Sequences or sequence parts can be automatically aligned. Sequences can be grouped into compound sequences to allow easy multiple alignment. These and other features make DCSE an extremely interesting program for sequence alignments.

Convers is a program which complements DCSE. DCSE itself does not add sequences to an alignment or delete sequences from an alignment. Convers is used for this. New unaligned sequences can be available in several formats (currently reference format or NBRF/PIR format). Convers can add sequences in these formats to an existing alignment, or create an alignment from a set of sequences. The sequences can then be aligned using DCSE.
Convers is also used for a variety of other tasks concerning alignments, such as filtering sequences out of an alignment, conversion to a format for creating trees, converting info from the EMBL databank to the reference format, etc.

3. System requirements

DCSE has been compiled for the following environments: IRIX on a Silicon Graphics Indigo, VMS for VAXstations, Ultrix on DECstations, DOS on IBM compatible PCs (386 or higher) and RISC OS on the Acorn Archimedes range.

DCSE was written in common C. However, some parts are necessarily platform specific. These parts lie mainly in routines interfacing with the operating system, such as screen manipulation, keyboard and filing system routines. Most of the platform specific items are explained in the accompanying readme file.

4. Conventions used

    A
        a key to be pressed. The shortcut keys are case sensitive, so you should apply the shift key and the key marked with 'A' to get A, and the 'A' key alone to get a. Sometimes the keys differ according to the system you are using. The correct keys can be found in the readme file. The manual will present the most usual keys (sometimes alternatives are shown between brackets).

    <something>
        something that must be substituted by a real value when using the program. The text between the signs tells what sort of value this should be.
        e.g. <number>  a number
             <name>    a name

    [value]
        optional item. The value between the brackets can be given, but is not necessary.

    'File'
        refers to a menu item you can select. Sometimes you can use a shortcut to apply the relevant menu item.

    ...
        Something is left out. What is left out should be clear from the context.

Data formats
------------

The main formats used by DCSE, namely the alignment format and the reference format, are described in this chapter. Most other formats, such as those of the default file, are described in the text where appropriate. A short description of all formats used can be found in Appendix A.

1. The alignment format

An alignment file usually has the extension ".ali". It has four info lines and several sequence lines. The first line shows two numbers: the first position shown in the alignment, and the last position. The difference between the two must be equal to the number of positions in the alignment. These positions can be preceded by a 'P' for a protein alignment, a 'D' for a DNA alignment or an 'R' for an RNA alignment. When none of these is present, the alignment is assumed to be an RNA alignment. The other info lines can contain any text. The second info line is usually empty. The third and fourth lines usually contain an indication of the position.

A sequence line consists of the entire sequence (including gaps), followed by a space, a number five characters long, another space, and the species name of at most 40 characters. The number is not essential, but the number of characters between the sequence and its name has to be 7. All sequences should have equal length. Each sequence line consists of symbols for nucleotides or gaps, alternated with positions that are either blank or contain a symbol delimiting a secondary structure element.

1.1. Symbols for nucleotides

Completely identified nucleotides are indicated using the standard codes.

    U, C, A, G

The standard ambiguity codes also apply for partially identified nucleotides:

    Y : U or C
    R : A or G
    M : A or C
    K : U or G
    W : U or A
    S : C or G
    B : U, C, or G
    D : U, A, or G
    H : U, C, or A
    V : C, A, or G
    N : U, C, A, or G

A problem arises because the symbol "N", used by authors when publishing or submitting sequences, can have two different meanings:

(1) A residue could not be properly identified on a sequencing gel. This can be due to template heterogeneity (e.g. in the case of reverse transcriptase sequencing) or to ambiguity in the polymerase reaction. In this case the number of unidentified nucleotides, although not their identity, is known.
(2) The sequence was only partially determined and, after alignment with a complete sequence of a related organism, the undetermined areas were padded with N's. In this case both the number and the identity of the nucleotides are unknown.

Unfortunately, most authors do not explicitly mention which case applies. We assume, somewhat arbitrarily, that case 1 (known number of nucleotides, unidentifiable on a sequencing gel) applies if a single N, or a row of 5 N's at most, is found intercalated between known nucleotides. Rows of more than 5 N's are treated as unsequenced areas of unknown length in a partially sequenced RNA. Two different symbols are used to distinguish the two cases, as follows.

    N   unidentified nucleotide, length of unidentified area probably known.
    o   unidentified nucleotide, length of unidentified area unknown. In this case we intercalate a number of "o" symbols matching the number of nucleotides in the most closely related species.

1.2. Symbols for amino acids

The standard one letter codes apply:

    D  aspartic acid (Asp)      A  alanine (Ala)
    E  glutamic acid (Glu)      V  valine (Val)
    G  glycine (Gly)            L  leucine (Leu)
    N  asparagine (Asn)         I  isoleucine (Ile)
    Q  glutamine (Gln)          P  proline (Pro)
    C  cysteine (Cys)           F  phenylalanine (Phe)
    S  serine (Ser)             M  methionine (Met)
    T  threonine (Thr)          W  tryptophan (Trp)
    Y  tyrosine (Tyr)           K  lysine (Lys)
    R  arginine (Arg)           H  histidine (His)
    X  unidentified

1.3. Symbol for gaps

The symbol "-" is used to denote the presence of a gap at an alignment position. Note that in areas of undetermined sequence, the placement of symbols for nucleotides ("o") and gaps ("-") is hypothetical. The pattern of "o" and "-" is matched to that of the most closely related known sequence. The presence of these symbols in areas of undetermined sequence is required to allow DCSE to check the consistency of the postulated secondary structure patterns.

1.4. Symbols describing secondary structure

The following symbols are used to indicate secondary structure elements:

    [ and ] : beginning and end of one strand of a helix.
    ^       : symbolizes ][, a new helix starting immediately after the previous one.
    { and } : beginning and end of an internal loop or bulge loop interrupting a helix strand.
    ( and ) : enclose a base forming part of a non-standard pair (any pair other than G.C, A.U, or G.U).

Their use is illustrated in the following figure.

1.5. Helix numbering

To allow the identification of secondary structure elements, "helix numbering lines" are intercalated between the sequences. The name of such lines must begin with "Helix numbering". These lines contain the helix names, but otherwise have an empty sequence (only gap characters). The 5'- and 3'-strands of a helix are indicated as <name> and <name>' respectively.

1.6. Other symbols

The characters A-Z are not allowed between the sequence characters, since they are used for describing the sequence. Currently only the symbols describing secondary structure elements ([,],^,{,},(,)) and the asterisk (*) have a special meaning. The user can choose other symbols to indicate other things in an alignment, e.g. '@' for an alpha helix and '#' for a beta sheet in protein alignments.

2. The reference format

A reference file usually has the extension ".ref". Its format is modelled after the format of EMBL or GENBANK files. The sequence itself can be directly copied from those files. The header is different, however, since it must indicate the exact position of the sequence you're interested in. It also gives some other information about the sequence.

Generally, a reference file consists of a reference header followed by the data in a number of sequence blocks, separated by a "//".

    #reference file DCSE references info
    #data
    Header
    Sequence
    //
    Header
    Sequence
    //
    ...

The reference header tells which fields can be present in the sequence headers.
Every line contains a field, and has the following format:

    <field ident>:<type>:<description>:<default value>

    e.g.: acc:S:Accession number:NoAccn

The three letter field ident will also be used in the sequence headers to identify which information is given following it. Current field types are 'S' for string and 'N' for number. The 'description' tells what the field is used for. The 'default value' is the value that will be given for a sequence when this field is not present in its header. Default values can be indirected using a '%'. E.g. when a field has '%org' set as a default, it will return the contents of the org field when the original field is not present.

Every sequence block consists of a sequence header, which can contain any of the fields described in the reference header, followed by the essential sequence information. The lines in the sequence header are given in the following format:

    <field ident>:<data>

    e.g.: acc:X52949

When a sequence header contains the same item several times, the data following each occurrence are concatenated into one line. The start of the essential sequence information is indicated by the field 'seq:', which has the following format:

    seq: <sequence name> <start> <end> <part>/<parts>[<type>]

The sequence name identifies the sequence, and forms the link between the reference files and the alignment files. One sequence can be divided over several sequence blocks having the same sequence name. The order of the different pieces is shown in the last line of the header. This is highly useful when a sequence consists of several exons, or of different fragments.

<start> gives the first position of the sequence in the block. <end> gives the last position. When the last position is given first (<start> is bigger than <end>), the complementary sequence of the one in the block is used. <part> tells you which part of the total sequence this block contains, and <parts> says of how many parts the total sequence consists. After the number of parts, the type of sequence can be given by a P (=protein), D (=DNA) or R (=RNA). If this identifier is omitted, the sequence is assumed to be an RNA sequence.

The sequence is given on the next lines. Characters can be given in upper or lower case. Usually the format of EMBL is adopted. In this format every line contains 60 characters, in blocks of ten, separated by a space. The first block of characters of every line is preceded by 5 spaces. In GENBANK format, there's a gap of 10 characters before the characters. This gap contains a number. However, it is not necessary to follow this format strictly. DCSE just starts to read characters beginning from the first sequence line. All non alphabetic characters will be ignored. The characters before position <start> are skipped, and the following characters are read up to and including the <end>th character.

Generalities
------------

1. Defaults

When DCSE starts up, it looks in the current directory for a file called "dcse.def". In this file, you can put your preferred defaults for a lot of variables, such as the file you start with, your reference file, etc. You can have several defaults for every variable. It is a simple ASCII file.
Every line has the following format:

    <variable> : <value>

    e.g.: file : total.ali
          file : other.ali
          select_file : test.lst

When asked for input, you are presented with the first default. You can switch between the different default values by pressing the up and down arrow keys. The upper choice is the program default. This is often the value you typed in the last time you used the function (during the same session). You can correct the value presented by moving with the left and right arrow keys, and deleting (backspace or delete) or inserting letters (just type the letter). If the first key you type is a normal character (not an arrow key or backspace), the line is cleared, and you begin by typing a new value.

2. Menus

In DCSE, menus are shown on the top line of the screen. You can select an item by pressing the highlighted character in this item. Alternatively, you can go to the item by using the arrow keys. The line under the menu briefly explains the currently selected function. When you are positioned correctly, press RETURN. Quit cancels the menu, and returns you to the alignment you're editing. If present, Back returns you to the previous menu. The main menu can be called with the / key. If there is a shortcut to a certain function, it is shown between brackets after this function in the menu.

3. Editing in DCSE

DCSE looks at an alignment as if it were an abacus. The alignment has a rod or sequence line for every organism. All rods have the same length, which is set by the number of positions. This number can be reduced by removing positions which are empty in all sequence lines, or increased by inserting new positions. Every rod has a fixed number of beads (characters) in a fixed order. The number of characters is smaller than the number of positions, so every position contains either a nucleotide symbol or a gap symbol. This way the characters can be shifted.
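
The abacus model can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the way DCSE (which is written in C) actually implements it: the function name push_left and the use of a plain string for a rod are hypothetical, and secondary structure positions are left out. A bead moved one position to the left takes along any beads it touches, and the gap it closes reappears behind it.

```python
GAP = "-"


def push_left(row: str, index: int) -> str:
    """Move the bead at `index` one position left, pushing beads it touches.

    The nearest gap to the left of the bead is closed and reopened at
    `index`, so the order of the beads never changes. If the position
    holds a gap, or there is no gap to the left, the row is returned
    unchanged. (Hypothetical helper, not part of DCSE.)
    """
    cells = list(row)
    if cells[index] == GAP:
        return row
    # Look for the nearest gap on the left of the bead.
    gap = index - 1
    while gap >= 0 and cells[gap] != GAP:
        gap -= 1
    if gap < 0:
        return row  # the beads are already packed against the left end
    del cells[gap]            # every bead between the gap and the pointer moves left
    cells.insert(index, GAP)  # the gap reappears where the pushed bead was
    return "".join(cells)


def beads(row: str) -> str:
    """Return the primary structure: the beads without the gaps."""
    return row.replace(GAP, "")


if __name__ == "__main__":
    before = "-AU-GC"
    after = push_left(before, 4)  # push the G one position to the left
    print(before, "->", after)
    print("primary structure unchanged:", beads(before) == beads(after))
```

Whatever is pushed, the string returned by beads() stays the same, which is the property the abacus model is meant to guarantee.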
If a character is pushed leftward or rightward, and makes contact with another one, the latter is pushed in the same direction. In this way the fixed order of characters, or primary structure, will always remain correct.

Just as in an ordinary screen editor, DCSE uses the screen as a window on a part of the alignment. This window can be moved in several ways to display other parts of the alignment. The screen can be split to show two windows on different parts of the alignment. It also features a pointer to show the current position in the alignment. This pointer can be moved by the arrow keys, and the window will scroll appropriately in order to keep the pointer on the screen.

In contrast to normal editors, the pointer is not used to insert or delete characters at a certain position, although DCSE has functions to do this as well. It is rather used as a finger, which can push characters leftward or rightward. The pointer can not only push characters, it can also move characters to the other end of a gap, get a character from the other end of a gap, or move a continuous block of characters to either side. The pointer can also be resized so that it covers a number of sequences and positions. The resized pointer can perform the same actions as the small one, but all characters covered by the pointer will keep their relative positions during the process.

The order of the sequence lines is not rigid. One or more lines can be locked in a given position on the screen, while the rest of the alignment can still be moved up- or downward. This makes it easy to compare one sequence to several others, or to rearrange the order of the sequences temporarily.

4. The pointer

When starting DCSE, the pointer will have just the size of one block. The pointer can be resized by pressing d. This puts it in 'drag-mode'. The upper-left corner gets locked in the current position. You can then move the lower-right corner to make the pointer bigger or smaller.
Pressing d again lets you control the position of the pointer again, rather than the size. Pressing p shrinks the pointer to its minimum size, and takes it out of 'drag-mode'. When in 'drag-mode', a 'D' is added to the position which is shown in the first line on the screen.

The characters under the pointer are called the 'selected part'. The organisms containing selected characters are called the 'selected organisms'.

DCSE: getting started
---------------------

1. Creating an alignment file

Any file with data in the alignment format can be edited by DCSE. (The format is described in the chapter Data formats, and if you're not sure, check out an example file.) However, DCSE does not offer the possibility to add sequences. You might wonder how to create your own alignment file, and how to append your own sequences to an alignment. For this, you should use Convers. With it, you can create a new alignment file, or append new sequences to an existing alignment. These sequences can be supplied in several formats (REF and PIR currently). Convers can also be used to convert other well known sequence formats, such as those of EMBL and Genbank sequences, to a usable format. It can further be used to rearrange an alignment, or append two alignments. It even offers limited database facilities on reference files. For more info on Convers, check out the Convers manual.

2. Starting DCSE

2.1. Selection of organisms and positions

An alignment file can become quite big. In DCSE, however, it is not necessary to load the entire alignment into memory; you can choose the organisms and positions you need, and load only those. When the program is started by typing dcse, it first asks the name of the file containing the alignment (default: file). It then shows the following screen:

Here you can select which organisms and positions will be loaded in memory. Scrolling up and down the organisms is done using the arrow keys. PgUp and PgDn move one page up and down.
The organism between the arrows gets selected when you press RETURN. The organism will then be shown in inverse video. Pressing RETURN on a selected organism deselects that organism. When you want to select all organisms between the last selected organism and the one between the arrows, use the f option. d deselects all organisms. a inverts the selection: all selected organisms get deselected, and all unselected organisms get selected. o lets you type in an organism to search for (default: select_org). Pressing s saves the names of all selected organisms to a file (default: select_file). When you want to select the same organisms in a later session, just load (l) the same file. The p key allows you to choose the begin and end position of the part you want to edit. You can start editing by pressing e. x stops the program.

2.2. The screen

The next picture shows a screenshot of DCSE in action under DOS, with some explanation.

The top line shows information about the current position of the pointer, which is displayed as a rectangular area with a grey background. The second line is used for messages and menus. The next line displays help. The following lines are sequence lines, preceded by a three letter abbreviation of the organism name.

3. Editing

3.1. Moving around

You can move the window without moving the pointer (relative to the screen). Moving left and right one position is done using 4 and 6. 8 and 2 shift the window up and down one line. PgUp (Ctrl up) and PgDn (Ctrl down) move the window up and down by an entire page, while Ctrl left (PF2) and Ctrl right (PF3) shift the window left and right by a page.

The pointer can be moved in all directions using the arrow keys. When the pointer reaches the edge of the screen, the window on the alignment is shifted (until it reaches the end of the alignment). With < and > you can move faster (by 10 positions) left and right. Alt up (PF4) and Alt down (keypad -) make the pointer jump to the top or the bottom of the screen.
Home (0) and End (.) on the keypad move the pointer to the beginning and the end of the sequence respectively. 7 (Alt left) and 9 (Alt right) take the pointer to the next gap on the left or the right side. When the pointer is positioned in a gap, it moves to the end of this gap in the selected direction.

You can move to a certain position by typing g (goto). DCSE will ask you on the info line for the position you want to go to. Entering a negative number allows you to move the pointer to a sequence position. E.g. typing 10 moves the pointer to the 10th position in the alignment. Typing -10 moves the pointer to the 10th character in the sequence.

3.2. Shifting characters

Aligning mainly involves shifting characters, so there are a lot of different options to do so. During a shift, the relative positions of the characters in the selected part (the one which the pointer is on) are always preserved.

e moves the selected characters one position to the left. Characters left of the pointer can be pushed by it. When the pointer is moving left and reaches a character, it will push it left. When this character reaches another character, this will be pushed left as well. r moves characters the same way to the right.

You can protect certain areas from being moved accidentally. On a spot normally containing a secondary structure sign, you can put a '*'. This character cannot be pushed, and will stop the pointer when it reaches it. (The '*' can be moved, though, when it is selected.)

Applying the Shift key with e (r) also pulls the continuous group of characters (not containing a gap) right (left) of the pointer in the selected direction.

Ctrl e 'throws' the continuous group of characters left of the pointer across the first gap. The selected block of characters is then thrown to the left, until it reaches the closest character. Ctrl r throws characters in the other direction.

The key 1 draws the first character before the pointer to the position right beside it.
If the position beside the pointer already contains a character, the first gap before the pointer is sought. The last character before this gap is moved so that it becomes the first character after this gap. You can get the same effect in the other direction using 3.

3.3. Dealing with secondary structure signs

To fill in a secondary structure sign, select with the pointer the position(s) where you want the sign to appear. Then just type in the desired character. (If the pointer has characters selected, these will not change.) To wipe the signs, press the space bar with the appropriate position(s) selected. By default, when you are moving characters, secondary structure signs will be adapted where necessary, e.g.

    A]-[G  will automatically change to  - A^G  when you push the A to the right.

DCSE: The menu
--------------

The main menu of DCSE is called with the / key. It offers the following options:

    File PosiTion Primary Secondary Options Alignment Divers Quit

'Quit' leaves the menu. The other options all bring up a submenu which is discussed further on. The 'Options' submenu will be discussed first, since some of its features are important for the other functions.

1. Options

Selecting the 'Options' entry gives you access to a few "switches". You can toggle them between different states. Their state determines how the alignment will be shown, and how it will react to changes.

1.1. Display options

1.1.1. Splitting the screen

When comparing two areas, looking for complementarities, etc., you must be able to look at two different areas at the same time. This can be done with the 'Window split' option (shortcut: w). This option splits the screen into equally sized parts, each offering a different window on the alignment. Each can be controlled independently in the direction of the sequence. In the direction of the organisms they will move together. You can switch between the two windows with the 'Tab' key. Selecting this option again gives you one window again.

1.1.2. Showsec

'Showsec' (shortcut: v) toggles between normal mode and a mode where the secondary structure symbols are not shown. Beware: even when the signs are not being shown, they can still be changed. In the Helix numbering lines, helix names will be shown fully (i.e. not missing the characters at secondary structure symbol positions) as much as possible. When the pointer comes over these lines in this mode, the display can be distorted.

1.1.3. Colour

Handling of colour is specific to the different platforms. Generally, you can link colour codes (these are described in appendix C) to any character. When DCSE encounters the character, the attribute, foreground or background colour is changed according to these codes. The 'Colour' (shortcut: c) menu is used to set the colours and make the links. It gives the following options:

    Colours: Bases Secondary switCh Links Pointer Fgr bGr Back Quit

The 'Bases' entry activates or deactivates the colour codes for bases or amino acids. 'Secondary' will switch colour for secondary structure symbols on or off. These options can be combined, although the result depends very much on the colours used. Helix numbering lines will never show colour. 'switCh' will switch between display of characters in colour and display of secondary structure in colour. 'Pointer' and 'Default' will set the colour codes for the pointer, and for the standard foreground and background colours.

'Links' can be used to change the colour codes for any character(s). When it is selected, you are asked to type in the characters whose colour you want to change. In the following menu you can select the foreground and background colour, and the attribute of the characters. Which ones are available depends on the system. Pointer and default colours are changed in the same way (but you don't have to type in a list of characters, of course).

1.1.4. Refresh

This function wipes and rewrites the entire screen. Under VMS or UNIX the screen can get messed up by other processes, e.g.
notification of finished jobs. This is the cure. 1.2.2. Movesec When you move sequences, the secondary structure signs are moved together with the characters. With 'Movesec' off (shortcut: S), the signs stay in the same position when the characters are moved. When this is selected an 'S' will be shown in the top left corner of the screen, before the position. 1.2.3. Xchange When 'Xchange' is set to 'do', all secondary structure symbols under the pointer will be replaced by any (allowed) character you type in. When this option (shortcut: X) is set to 'ask', an 'X' will be shown in the upper left corner of the screen. When the pointer has a size bigger than one by one, and you type in a secondary structure symbol, DCSE will ask for confirmation of the replacement. It gives you a menu with the following choices: 'All', 'One' and 'Quit'. When you choose 'All', all secondary structure symbols under the pointer will be replaced by the character you typed in. With 'One' you can choose to replace only one type of character, e.g. the space. 1.2.4. Protect This option can be on or off. When it is on, a 'P' is shown in the upper left corner of the screen. In this case the relative alignment of the currently selected sequences will not change while moving them around. e.g.
AUU-GC-G
AUUGGC-G
will become
-AUUGC-G
-AUUGGCG
when pushed right with Protect off, and
-AUU-GCG
-AUUGGCG
with Protect on.
2. Filing In the main menu you can select filing by pressing f, or by going to the 'File' option and pressing RETURN. The filing menu consists of the following options: File: Save(s) Output(o) Ins/del(i) Help(f3) Move(L) seLect(b) eXit(x) Back Quit 2.1. Saving your data You can save your modified file by selecting 'Save' from the file menu, or by pressing s outside the menu. This brings up the save menu. It allows you to save all sequences you've loaded by choosing 'All'. When you choose 'Organisms', only the currently selected organisms are saved to the file.
'No save' results in nothing being written to the file. If you have changed the order of the organisms (using Lock and unlock, see part 6.2 and 6.3), this new order will NOT be reflected in the file. The file will contain the organisms and their corrected (shifted) sequences in the old order. You can change the order of organisms permanently in a file, by loading all organisms, changing the order, and saving them using 'Output' with appropriate parameters. You can also use the 'rearrange ALI' option of the program Convers. 2.2. Output 'Save' saves your corrected data to the same file, in the same format. 'Output' (shortcut: o) however makes a new file. It has several parameters which allow you to change the contents and the format of this new file. 'Output' e.g. allows you to make a new alignment containing just part of the organisms or positions, or to make a printfile with the alignment split in different blocks. You can select which organisms you want to save to the file with the option 'Org'. When set to 'all', 'Output' saves all organisms, even those which are not currently in memory. These will be saved in the order they are in the FILE. When you select 'mem', only the organisms currently in memory are being used for the save. These are saved in the same order they are in at the time of the save! The option 'select' allows you to save only the currently selected organisms. 'Pos' allows you to select the positions for the save. Here you can also choose 'all', 'mem' or 'sel'. Any combination of 'Org' and 'Pos' options is possible. When you want to print an alignment, you can subdivide it in several blocks which are printed one under the other. 'Block' sets the size of these blocks. By default the blocksize is set to the number of positions currently in memory. This means the alignment will be saved as one block (which is necessary to use the output as an alignment in DCSE). 
When you select 'all' for the 'Pos' option, and still want to keep one block, you will have to increase the blocksize. When you input a blocksize bigger than the maximum number of positions that will be saved, it will automatically be lowered to this maximum before outputting. If you want to make a file for printing, containing several blocks, lower the blocksize. The 'File' entry simply allows you to choose which file the output will be directed to (default: output_file). 'Settings' gives you a submenu where you can set the following parameters: With 'Sec' you can choose whether secondary structure signs are printed (y) or not (n). (If you intend to use the generated output as alignment file in DCSE, this should be set to y.) When 'Compact' is set to y(es), the positions not containing a character in the sequences you want to save will be cut out in the new file. This option can only be used when saving organisms and positions from memory (or selection). When you try to use the 'all' option together with compaction, 'Output' will refuse the job. 'Abbreviated' tells 'Output' whether to abbreviate the names of the organisms (y) or use the full names (n). 2.3. Getting help 'Help' (shortcut: f3) offers a short help window, showing some of the key shortcuts. You can scroll the help text up and down using the arrow keys. X stops the help. 2.4. Import The 'Import' function lets you import a (partial) alignment into the loaded alignment. 'File' lets you select the alignment file to import (default: import_file). 'Go' starts the import. The pointer selects the positions in which the alignment will be imported. The first sequence line of the foreign alignment is used as a reference. The sequence bearing the same name in the current alignment is searched, and the sequences (without gaps) compared. If the sequences are not equal, the routine stops. The function then checks whether the foreign alignment will fit. If not, insertions will be made in the appropriate places.
Then the other sequences of the foreign alignment are loaded in. If the sequence corresponds with the sequence having the same name in the current alignment, the foreign alignment will be transposed to the selected positions. This function can be of great use when several persons are working on one alignment. Limited sets of positions or organisms can be saved using the output function and be worked on. The edited partial alignments can be imported back into the total alignment using this function. 2.5. Making a reference file from an alignment The 'Ref' item allows the user to create a reference file (name and sequence only, no extra data) from the currently selected organisms. With 'File' the name of the file to write into can be chosen. 'Go' starts the process. This can be useful e.g. when you have extracted a part of an alignment (output), and are working on this. When you check the sequences against the original reference file, they will naturally be too short. Using this function, you can create a ref file for temporary use. 2.6. Moving The description in the part 'Moving around' applies to moving around in the block of the alignment you have loaded in memory. You can however shift this block as well by selecting the 'Move' entry in the filing menu. You will be asked by which number of positions you want to shift the block. Use negative numbers to load a part closer to the beginning of the file, and positive numbers to move towards the end of the file. You are then asked, by a standard save menu (options: all, organisms, no save), if you want to save the changes you have made in the block you're working on. 2.7. Selecting other organisms or positions When you want to change the number of positions you have in memory, or want to work on other organisms, use the seLect option (shortcut: b). This brings you back to the selection screen. 2.8. exit 'eXit' does just what you would expect it to do: leave the program. Before quitting it asks for confirmation.
Then it shows a save menu to allow you to save your results in case you forgot to do so. 3. Positioning The 'posiTion' menu gives you the possibility to go to or find specific positions or areas in an alignment. It brings up the following submenu: Position: Goto(g) Mark(m) Find(f) Pointer(p) Back Quit 'Goto', 'Mark' and 'Find' are explained in detail. 'Pointer' just brings the pointer to its minimum size, and sets it to 'non-drag' state. 3.1. Goto The 'Goto' entry will ask you for an alignment position to go to (default: goto_pos). When you enter a minus before the number, it will move the pointer to the sequence position. 3.2. Markers In its simplest form, a marker is a spot on an alignment which you give a name, and which you are able to go back to later on. You can have as many markers as you like (memory permitting), and you can save and load them. This system is thus very useful in different areas. While editing, it can be used to mark interesting areas (e.g. about the same sequence in different organisms), or to mark an area you're working on, when you go temporarily to another area. The system can also be used to mark things such as the position of hidden breaks, introns, tertiary structure, etc. Since the format of a marker file is really simple, other programs can easily generate them. When you load these marker files, it is easy to examine the indicated positions. With the 'Mark' (shortcut: m) entry, you can create a new mark on the current position. It shows the following menu: Mark: Control Goto Param Mark Name:MARK: 7 Quit 'Param' allows you to set several parameters. With 'Name' you select the name of the marker to be created. 'Control' controls whether markers are active, visible, will be loaded, saved or deleted. The mark parameters The parameters menu gives you the following options: Parameters: Alipos:y Single:y Organism:n Mark Back Quit 'Alipos' says whether the alignment position or the sequence position is used.
When the sequence is moved, the alignment marker will not move accordingly. The sequence marker will naturally stay in the correct position in the sequence. When you try to position a sequence marker on a gap, it will move to the end of the gap. (Since there are no characters in a gap, it cannot count them.) 'Single' determines whether the marker will contain one or two positions (on the same sequence). When the split screen option (see 1.1) is on, 'Single' will be set automatically to n(o), recording both positions. You can set it to y(es), to record only the position of the current pointer in the marker you're creating. 'Organism' chooses whether the name of the current organism will be incorporated in the marker. When set to y(es), subsequent selection of this marker will move the pointer to the given position in this particular sequence. When set to n(o), it will move to the position on the current sequence line. 'Mark' will, just like the 'Mark' entry in the main markers menu, create the marker. Controlling markers The 'Control' submenu looks like this: Control: Load Save Activate Inactivate Clear Delete Visible(on/off) Back Quit 'Load' loads previously saved sets of markers from a file (default: markers_file). (A marker file can also be made by other programs, such as convers.) 'Save' saves the currently active markers to a file (default: markers_file). Marker files have the following format: ... In a marker without organism, the organism name part will contain only spaces. Positions 1 and 2 are positive numbers when they are alignment positions, and negative when they are sequence positions. When the marker only contains one position, position 2 contains zero. A marker name can be 40 characters long. A line beginning with a '!' is regarded as a comment, and is not loaded. You can 'Activate' or 'Inactivate' a range of markers. An inactive marker is still in memory (not deleted), but you can not go to it, and it will not be saved. 
You select a range by typing in a part of the name. All markers containing this part in their name will be subsequently activated or inactivated. This feature can be used to create different groups of markers: e.g. you have a number of markers called: 'INTRON: org 1', 'INTRON: org 2', ..., and a range of markers called 'BREAK: org 1', 'BREAK: org 2', .... You can inactivate all the break markers, by choosing 'Inactivate' and typing in as a range 'BREAK'. You can of course inactivate one specific marker by typing in its entire name. 'Clear' deletes a number of markers. These markers will be erased from memory and will be lost when not saved. You can choose to clear all the currently 'Active' or 'Inactive' markers. 'Both' will delete all markers. 'Range' allows you to type in a range of markers you want to delete. 'Delete' lets you delete one marker at a time. When you've entered this function, it shows you the following line: Delete (DEL/RET): You can browse through the markers with the up and down arrow keys. When you press the left or right arrow key, the position(s) of the marker are shown in the alignment. Pressing Delete deletes the marker whose name is currently shown on the info-line. Pressing f leads you to the find options. You can type in the name of a marker you want to find. Return stops the delete option. When you apply 'Visible(on/off)', the marks are shown in the alignment by a hash sign (#). When there is a secondary structure sign on the spot where the hash appears, this sign is kept in memory. When you apply 'Visible(on/off)' again, all hash signs will be replaced by their former value. If you have put a new secondary structure sign in place of the hash sign, the old secondary structure sign will be discarded in favour of the new one. Going to markers The 'Goto' item brings you to another menu with the following options. 'Pos' takes you to a specified position. 'Mark' offers the possibility to go to a certain marker.
You can browse through the markers with the up and down arrow keys, and show the selected marker by pressing the left or right arrow key. Pressing Return moves you to this marker. 'Former', 'Current', and 'Next' move you to the former, current and next marker. The 'Logical' option determines the order of the markers. If it is set to n(o), the order of the markers is determined by the order they were loaded in. With 'Logical' set to y(es), the order is determined by the order in which they appear in the alignment. 3.3. Finding things The 'Find' entry (shortcut: f) of the position menu gives the following options: Find: Seq Compl Org Helix Exact:y Forward:y Dif:0 Gaps:0 Quit 'Exact' only applies to the finding of sequences. It says whether the sequence should be exactly the same (e.g. S =/ G), or not (e.g. S = G, but S =/ A). For protein sequences the following 'ambiguity codes' are allowed in the search function when 'Exact' is set to n(o): Z : acidic (D,E) B : basic (K,R,H) U : uncharged polar (G,N,Q,C,S,T,Y) J : polar or charged (D,E,K,R,H,G,N,Q,C,S,T,Y) O : nonpolar (A,V,L,I,P,F,M,W) 'Forward' determines the direction of the search. 'Dif' gives the number of differences that are allowed in the search for a sequence, complementary sequence or pattern. The 'Gaps' item gives the number of (extra) gaps that are allowed when searching a sequence pattern. This value has NO effect on the search for a sequence or complementary sequence. Finding a sequence 'Seq' is used to find a certain sequence in the alignment. After selecting 'Seq' you are asked to type in a sequence, or edit the default one. When the pointer is expanded, the sequence under the pointer is chosen as default sequence. Otherwise the default sequence is the sequence you last used for a search. You can also search for a certain order of characters on one row (one position). This is indicated by starting the sequence with a '|' sign.
Once you've entered the sequence, the screen will split. The first screen will keep to the original position. The other screen will show the position of the found sequence. Finding a complementary sequence 'Compl' works in the same way as 'Seq', though the sequence complementary to the one entered is being searched. 'Exact' does not apply to 'Compl'. Finding a pattern The search pattern option lets you look for sequence patterns. They are specified by a string of characters. When several characters are possible on a certain position in the pattern, they can be put between square brackets. You can also include a '-' between the brackets: this means that the base can be present, but doesn't have to be. e.g.: AGGC[GC][G-]AA will find an AGGC followed by a G or C; this might be followed by a G. The sequence must end with two A's. The value given by the 'Gaps' item gives the extra (apart from the ones indicated in the pattern) number of gaps in pattern or sequence that are allowed. Finding an organism 'Org' allows you to type in the name (default: find_org), or a part of the name of an organism. The pointer will be moved to the next organism containing that text. (If the pointer is after the searched organism, it will not be found!) You can also type in a number. '1' will bring you to the first organism; a number n will bring you to the n'th organism. When the number entered is bigger than the number of organisms, you will just go to the last organism. Finding a helix You can go to a certain helix by choosing 'Helix', and typing in the helix name (default: find_helix). To be able to do this, you must have an organism with a name beginning with 'Helix numbering'. This organism contains the helix names on the correct position. 3.4. Going to the next non standard basepair The options to go to the next non standard basepair are not on the 'Position' menu. They can be called with the keys h and H. They will only have an effect in split window mode.
When you press h, the program checks whether the base right of the pointer in the left window is complementary with the base left of the pointer in the right window. If this is so, the pointer of the left window is moved to the right, and the right pointer to the left. This continues till a non-complementary pair is reached. H does the same thing in the other direction. 4. Primary structure tools 4.1. Checking the sequence You can check whether the primary structure is correct using 'Check' (shortcut: y). This is done by comparing the sequence against a file in the reference format. The sequence in the reference file is never changed, and thus stays correct (assuming the original data was correct). The sequence in the alignment and in the reference file will be compared, and the differences indicated. Characters 'o' are ignored in the checking of the primary structure. When checking nucleic acid sequences, any stretch of more than 5 N's (this number is a bit arbitrary) in the reference sequence will be cut out. Such stretches should be replaced with o's (see the alignment format). T's in a reference file will be replaced by a 'U' for RNA sequences. 'Reference file' lets you input the name for the reference file (default: yref_file). 'Init ref' initialises the reference file. This is necessary when the reference file has been changed (e.g. an organism has been added) during a session. Then select 'One' or 'Go' to check one sequence (the one the pointer is on), or all sequences beginning with the one under the pointer. When a difference is encountered, a correction is proposed. You can reply with Y(es), D(o), I(nsert), N(o), or S(kip). Choosing 'Y' does the correction, possibly pushing characters aside, and halts further checking of the sequence. 'D' does the correction as well, but continues the checking. 'I' does the correction using a global insert if necessary, and continues checking. 'N' stops checking with no correction.
'S' continues checking without correcting the difference. 4.2. Aligning sequences You can automatically align one sequence to another in DCSE. The more closely the sequence you're aligning resembles the one you're aligning it to, the better the result will be (naturally). It uses a combination of two methods. The program starts comparing the two sequences by using a recursive algorithm. It correlates the beginnings of both sequences, and the ends. It then searches for a sub-sequence of specified size (preferably rather big) that appears in both sequences. If it is found, a correlation is made between the sub-sequence in the two sequences. The stretches between the new and the former correlation are aligned by applying the same method using a smaller blocksize. The program then carries on looking for sub-sequences of the first blocksize. The program stops when the end of one of the sequences is reached. When the sub-sequence reaches a certain minimum size, the remaining stretches are aligned using a method derived from the Sellers algorithm (1974, SIAM J. Appl. Math., 26, 787-793). For every alignment, there's a 'distance' between the two sequences. This distance is determined by several parameters. For every substitution, the distance is increased by a certain amount. This is called the 'substitution cost'. Inserting a gap increases the distance by the 'gap penalty', and for every character deleted or inserted there's an 'insertion-deletion cost'. The program searches the optimal alignment by minimizing this distance. Alternatively you can choose to calculate the similarity of two aligned sequences. In this case, the best alignment will be found by maximizing the similarity. Mostly, a sequence is aligned to a sequence that is already aligned to others. This reference sequence will not be altered. The characters of the other sequence will be shifted relative to this sequence in order to reflect the calculated correspondence.
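The distance described above can be sketched for two rows that are already aligned. This is only an illustration, not DCSE's own routine: the function name, the '-' gap symbol and the cost values are assumptions made for the example, and each run of gap characters is assumed to be charged the gap penalty once plus the insertion-deletion cost per character.

```python
def alignment_distance(row_a, row_b,
                       substitution_cost=1.0,
                       gap_penalty=2.0,
                       indel_cost=1.0):
    """Distance between two aligned rows of equal length (hypothetical costs)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    distance = 0.0
    open_gap = None  # which row currently has an open gap: "a", "b" or None
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # position empty in both rows: contributes nothing
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if side != open_gap:
                distance += gap_penalty  # a new gap is opened
                open_gap = side
            distance += indel_cost       # one character inserted or deleted
        else:
            open_gap = None
            if x != y:
                distance += substitution_cost
    return distance


# One substitution (U/C) and one gap of one character: 1 + (2 + 1) = 4
print(alignment_distance("AUUGC-G", "AUCGCAG"))
```

An alignment routine that minimizes the distance would compare such values over all candidate alignments; for similarities the same bookkeeping applies with the sign conventions reversed and the score maximized.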
However, if the newly aligned sequence contains an insertion in a spot where the reference sequence (and the alignment) does not have a corresponding gap, this can not be properly accommodated in the alignment. DCSE's alignment routine can handle this situation in several ways. It can create a global insert in the entire alignment, or it can carry out the insert by pushing the surrounding characters in the newly aligned sequence aside, thereby possibly disrupting the alignment locally. Another option is to leave the insertion out. This option will produce an error in the primary structure, which can be detected easily later on by the primary structure checking routine (y). This will leave it up to the user to decide whether a global insert should be created, or whether the problem can be solved by a local sequence realignment. With the pointer you select which sequences will be aligned. The first sequence will act as reference sequence. The following sequences are aligned to the first. When you have a number of positions selected, only the characters on those positions will be aligned. When the pointer has a minimum width, but occupies several organisms, the entire sequences will be aligned. The 'Align' menu gives you the following options: Align: Parameters Blocksize:30,15 Insert:global Sellers:y Go Float:n Quit 'Insert' determines how insertions that can not be accommodated properly are handled. When set to 'global' DCSE will create a global insert for every insert that couldn't be placed; 'push' will do the insert by locally disrupting the alignment; 'off' will leave those inserts out. The last option ('local') is used to align several sequences to each other, but not to the rest (yet). The alignment will be made by rearranging all selected sequences. This way the reference sequence can be rearranged as well, and alignment to other sequences (not selected) can be disturbed. 'Blocksize' sets the blocksizes that are used in the first alignment algorithm.
When 'Sellers' is set to n(o), only the first alignment algorithm is used. The 'Float' option only has an influence on nucleic acid alignments. When it is set to n(o), an integer implementation of the Sellers algorithm is used. This is faster, and less memory hungry, but also less general. 'Parameters' lets you set the parameters for the Sellers derived algorithm when using Float:n. Protein sequence alignments always use the general implementation. 'Go' starts aligning. The general implementation of the Sellers algorithm uses a file to determine whether distances or similarities are used, and to set gap penalty, indel and substitution costs. This file is called 'nucl.prm' for nucleic acid alignments, and 'prot.prm' for protein alignments. If a file of that name is not found in the current directory, it is searched for in the home directory. For protein sequences the Dayhoff matrix is provided. The files must have the following general format: gap: indel: ... ... ... ... ... ... ... ... ... in which the characters are the alphabetic characters representing a certain base or amino acid, and the values are the amounts which are added to the alignment score/distance when the specified substitution occurs. The first line contains either the word 'Similarities' to indicate that similarity scores will be calculated and compared, or 'Distances' to indicate that the alignment showing the smallest distance will be searched. The gap penalty and indel cost will usually be positive for distances, and negative for similarities. 4.3. Shifting to a position 'Moveto' (shortcut: =) moves the selected characters to a specified position, thereby pushing the characters that they encounter in the direction they are moving. This option does not change the primary structure. It can make big changes in the alignment though. 4.4. Inserting characters The primary structure is usually not changed. You can however do this using 'Insert' (shortcut: I). When you select it, you will go into insert mode.
Any character you type in now will be inserted in the sequence at the position of the pointer. When there is no empty position to insert the character into, surrounding characters will be pushed away to make room for it. When none of the surrounding characters can be pushed away, the character will not be inserted. During insert mode, you can move left and right with the arrow keys. You can also delete characters with the backspace key. Insert mode is stopped by pressing Return. 4.5. Deleting characters You can delete the currently selected characters by selecting 'Delete' (shortcut: D). 4.6. Overwriting sequences 'Overwrite' (shortcut: f1 or PF1) lets you overwrite a part of the sequence. It lets you write a string containing characters not accepted by the other functions. The characters just appear as you type them. You can delete typed in characters with backspace. Overwriting a sequence is for instance useful if you want to type in a helix name on Helix numbering. 5. Secondary structure tools 5.1. Checking the secondary structure 'Check' (shortcut: t) lets you check the (RNA) secondary structure that is shown in the alignment. The secondary structure signs in the alignment show where helices, bulges, and non standard basepairs are located. They do not show which helix segments pair with each other. To be able to figure this out, there must be a pseudo-organism with a name beginning with 'Helix numbering' present. The sequence of 'Helix numbering' is almost entirely empty. It only contains the helix names at the positions where a helix occurs. Every helix contains two segments, so every name must be present twice in 'Helix numbering'. An end quote is not recognised, so name and name' are seen as the same name, and can thus be used to indicate the two complementary segments of a helix. When you first apply 'Check' a table is made containing the positions of all helices using the first organism whose name begins with 'Helix numbering'.
This table is used for further checking. When you make changes to 'Helix numbering', you will have to reinitialize this table. The 'Reinit' entry does this. 'Go' starts the checking. When the program encounters a fault in the proposed secondary structure (a non standard base pair that is not indicated, a helix that's not closed, etc.), it stops, gives an appropriate error message, and indicates the location of this fault in the alignment. When you choose 'One' to start the checking, only one helix is checked. With 'Helix' you can choose which helix this is (default: sec_helix). Faults concerning helices not being closed are still being given though. The 'Parameters' entry of the 'Check' menu allows you to set certain parameters. When 'stop at Wrong place' is set to y(es), the program will also stop when a helix is not positioned on the exact position it should be. (This means that it is not exactly above the helix name in 'Helix numbering'.) With 'Extend' you can choose whether the program checks if helices can be extended. 'bUlge extend' tells the program to look whether bulges can be narrowed. When there's an 'N' at the end of a helix, it can always be extended. With 'N' you can choose whether these "extensions" will be ignored (i) or shown (s). 5.2. Copying secondary structure signs 'Copy' (shortcut: C) lets you copy the secondary structure signs of the upper selected sequence to the selected sequences below. For this option you choose 'Secondary structure' from the copy menu. The other option, 'O's', looks at the top selected sequence for every selected position. If this contains a character, it puts an 'o' at this position in every lower sequence that does not contain a character there. This feature is useful when working on a partial sequence. An 'o' serves as a character that is supposed to be there, judging from comparative evidence. 5.3. Looking for compensations 'Find comps' (shortcut: k) allows you to look for compensating positions between aligned regions.
You can select the two areas that will be compared with the two pointers in split screen mode. When the screen is not split, every position under the pointer will be checked against all other positions under the pointer. As usual, 'File' gives the name of the file the results will be written to. With 'Markers' you can select whether the results will be written as a marker file. 'Settings' allows you to set the different parameters that control the search. For every organism, the program will take a block of bases starting from the first position, and check whether this block is complementary with the block of bases which ends at the second position. The number of bases is given by 'Size'. 'Low pass' sets the minimum number of organisms for which these blocks have to be complementary. If a sufficient number of organisms contain complementary blocks at the two positions, these will be checked for compensations. For every pair of positions a score can be calculated. This is done by counting the number of strong and weak basepairs, and the number of basepairs having the purine left or right. If two positions have both strong and weak basepairs, the score is incremented by 1. Having purines left and right also increments the score by 1. One of the complementary positions within the block has to reach a minimum score set with 'Min score'. Information about every pair of positions that fulfils these conditions is written to the output file. This first gives the two positions, then the score, followed by the percentage of organisms for which the blocks were complementary. You will also find a base bias and a side bias. These are values which give an idea of the distribution of the different basepairings. When 50% of the basepairs have their purine to the 5' side, and 50% to the 3' side, the side bias will be 100. When all pairs have it on the same side, the bias will be 0. The base bias gives the bias for strong and weak basepairs.
The higher the scores are, the better the basepair is supported. If you save with 'Markers' set to 'n', the scores will be given for every position of the complementary blocks. When you save the positions as markers, the scores for every pair of positions in the blocks will be added together. For the side and base bias an average is calculated. Notice that when you save as markers the second position is one higher than when you save as plain text. When you load your markers, the first position will set the pointer to the left of the first block, while the second pointer will be positioned to the right of the complementary block. 5.4. Base pairing matrix 'Matrix' (shortcut: M) lets you make a base pairing matrix of selected positions. This can be used to search for complementary segments in a certain area or between different areas. A base pairing matrix shows a rectangle, with a sequence along the top and along the side of it. When a base of the top sequence is complementary to a base in the side sequence, a mark is shown in the rectangle at the corresponding coordinates. When the screen is not split, the sequence under the pointer will be checked for complementarity to itself. With a split screen, the positions under the first pointer are compared to the positions under the second one. 'Limit' lets you set the minimum number of continuous bases that should be present to show the marks for a basepairing in the matrix. e.g. when it is set to 2, a single possible basepairing will not be shown. The number of bases is counted for every possible complementary stretch. When the number reaches the minimum value, this number is shown in the matrix to mark the complementary spots. When the number is greater than 9, different characters are used. When more than one organism is selected, two positions have to be complementary in all organisms to produce a mark in the matrix. You can however allow a number of faults with 'Allowed faults'.
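The way 'Limit' filters the matrix can be sketched for a single organism. This is a minimal sketch under stated assumptions, not DCSE's implementation: the function name is hypothetical, G-U is assumed to count as a basepair next to A-U and G-C, complementary stretches are assumed to run antiparallel (one position forward in the side sequence, one back in the top sequence), and cells hold plain numbers instead of display characters.

```python
# Pairs treated as complementary in this sketch (G-U wobble is an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pairing_matrix(top, side, limit=2):
    """Return rows of stretch lengths; 0 means no mark is shown in that cell."""
    rows, cols = len(side), len(top)
    pairs = [[(side[r], top[c]) in PAIRS for c in range(cols)]
             for r in range(rows)]
    matrix = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not pairs[r][c]:
                continue
            # Only start counting at the first cell of a stretch.
            if r > 0 and c + 1 < cols and pairs[r - 1][c + 1]:
                continue
            length = 0
            rr, cc = r, c
            while rr < rows and cc >= 0 and pairs[rr][cc]:
                length += 1
                rr += 1
                cc -= 1
            if length >= limit:
                # Mark every cell of the stretch with the stretch length.
                for k in range(length):
                    matrix[r + k][c - k] = length
    return matrix


for row in pairing_matrix("ACGU", "ACGU", limit=2):
    print(row)
```

Extending the sketch to several organisms would mean requiring the pair test to hold in all selected sequences, minus the number of allowed faults.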
The matrix is written to the file specified by 'File' (default: matrix_file). 'Go' starts the creation of the matrix. The next figure shows an example of a matrix.

5.5. Saving and loading secondary structures

Secondary structure information can be saved separately from an alignment by using the 'Save' option from the 'Secondary' menu. With the 'Org' option you can choose whether the structure of all organisms loaded in memory (mem), or only that of the currently selected organisms (sel), is saved. The 'Pos' option determines in the same way whether the structure in the loaded or in the currently selected positions will be saved. 'File' gives the file in which the structure information will be saved. 'Go' starts the save.

When saving, the position of the secondary structure elements is determined using the sequence positions. So even when the sequences are realigned, loading the secondary structure will put the symbols in their correct positions.

The 'Load secondary structure' menu has about the same options as the save menu. With the 'Org' and 'Pos' options you can choose whether the symbols are loaded for all organisms and/or positions in memory, or just for the selected ones. 'File' determines the name of the file to load the structures from.

6.Alignment

6.1. Inserting and deleting positions

The 'Ins/del' entry (shortcut: i) of the filing menu lets you create global insertions or deletions in the alignment. 'Mode' switches between creating an insertion and creating a deletion. 'Size' sets the size of the desired insertion or deletion. 'Go' performs the insertion or deletion.

All insertion and deletion events are recorded by DCSE. When you save one or more organisms, the insertions and deletions are applied to the entire alignment. For this the entire alignment has to be rewritten. In this case you are offered the option of saving to a differently named file.
If you exit or go back to the selection screen (b) without saving, the insertions and deletions recorded since the last save operation are discarded.

6.2. Locking and unlocking organisms

You can lock an organism at a certain height on the screen using 'Lock' (shortcut: l). When the window on the alignment is moved up or down, the locked organism stays in the same spot relative to the screen. So it actually moves in the alignment. 'Unlock' (shortcut: u) unlocks a locked organism, so that it behaves normally again. This system allows you to keep certain organisms at hand to compare them to others, to keep 'Helix numbering' on your screen, and even to change the order of the organisms in the alignment. (When you save, the original order is kept! You can however save the changes using 'Output' with 'Org' set to 'mem'.)

6.3. Grouping and ungrouping sequences

The 'Group' (shortcut: G) item allows you to group the selected sequences into one compound sequence. If at least one of the grouped sequences contains a character at a certain position, a character is put in the compound sequence. This character will represent all the characters of the grouped sequences at this position. Normally the most abundant character is used. The value in 'Coverage' controls the decision which character will be used. The compound character must stand for at least 'Coverage' % of the characters present at this position. E.g. suppose a position contains 3 A's, 2 G's and one U. When coverage is 50%, the compound character will be an A. When coverage is 60%, the compound character will be R.

The 'Coverage' item also determines which secondary structure symbols are shown in the compound sequence. If the most abundant symbol is present in more than 'Coverage' % of the sequences, it is shown. Otherwise a space is shown. When a grouped sequence is realigned, all sequences in the group will be realigned accordingly when they are ungrouped. (Their alignment to each other will not change.)
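The effect of 'Coverage' on the compound character can be sketched as follows. This is an illustration only, with a hypothetical function name. It assumes that characters are added in order of abundance until the required coverage is reached, and that the resulting set is written as the standard IUPAC ambiguity code, which reproduces the A and R of the example above. Gaps are assumed not to count as characters.

```python
from collections import Counter

# Standard IUPAC ambiguity codes (RNA alphabet), keyed by the bases covered.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("U"): "U",
    frozenset("AG"): "R", frozenset("CU"): "Y",
    frozenset("CG"): "S", frozenset("AU"): "W",
    frozenset("GU"): "K", frozenset("AC"): "M",
    frozenset("CGU"): "B", frozenset("AGU"): "D",
    frozenset("ACU"): "H", frozenset("ACG"): "V",
    frozenset("ACGU"): "N",
}
BASES = set("ACGU")


def compound_character(column, coverage=50.0):
    """Pick the compound character for one alignment column."""
    bases = [c.upper() for c in column if c.upper() in BASES]
    if not bases:
        return "-"  # no character present at this position
    chosen, covered = set(), 0
    # Add characters from most to least abundant until the chosen set
    # stands for at least `coverage` percent of the characters present.
    for base, count in Counter(bases).most_common():
        chosen.add(base)
        covered += count
        if 100.0 * covered / len(bases) >= coverage:
            break
    return IUPAC[frozenset(chosen)]
```

With the column from the example (3 A's, 2 G's and one U), compound_character("AAAGGU", 50) gives A and compound_character("AAAGGU", 60) gives R.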
When secondary structure signs are changed in a compound sequence, the change will NOT be reflected in the original sequences when ungrouped.

Sometimes just a few sequences actually have a base on a certain position. You can decide to leave these bases out by setting the minimum number that must be present on a position with the 'Min' item. You can also decide to show these 'sub minimal' bases as o's with the option 'sUbmin'. This is useful when you want to align a sequence to a group of sequences. You can group the sequences, using this option to represent inserts in one or a few sequences as o's. You can then align the sequence to this group. The o's are not taken into account when doing an automatic alignment. They will ensure though that, when ungrouped, the inserts retain their position.

When you select no for 'Submin', the inserts will be left out of the grouped sequence. This will not disturb the alignment of sequences in the group. When ungrouping, however, the position of the inserts relative to sequences that were not in the group is not necessarily preserved in this case. (Inserts that are not represented in the compound sequences are inserted in the first possible place.)

e.g.

Group with Min=2, Submin=o, coverage=100

  A U G - G C G  -->  A U o - o S G  -->  A U G - G C G
  A U - - G G G                           A U - - G G G
  A U - - - G G                           A U - - - G G

Group with Min=2, Submin=no, coverage=100

  A U G - G C G  -->  A U - - - S G  -->  A U G G - C G
  A U - - G G G                           A U - G - G G
  A U - - - G G                           A U - - - G G

The 'Name' item lets you set the name of the group to be formed. 'Group' groups all selected sequences. The abbreviation of the name of a compound sequence is shown highlighted. When saving, groups are expanded automatically, and they are grouped again afterwards. Output will refuse to work on compound sequences, so you will have to ungroup them yourself. 'Ungroup' (shortcut: U) will ungroup all selected compound sequences.
You can save and load grouping information with the 'Save' and 'Load' options in the group menu. When you choose to save the grouping info, you can fill in the name of the file that will contain the grouping info using the 'File' option. The 'Range' option lets you choose whether all groups, or just the currently selected one, will be saved. The order of plain (not grouped) organisms will be saved as well. In a later session you can restore the grouping info by choosing 'Load' and entering the same filename. DCSE will try to restore the order of organisms and the grouping of organisms as specified in the file. This feature can also be used just to change the order of organisms to that of a previously saved session. (There do not have to be groups for this feature to work.)

6.4. Removing sequences

The 'Remove' item lets you remove the currently selected sequences from memory. This can be useful when you do not have enough memory to ungroup or save a compound sequence. You can then first save some of the other sequences, and remove them. This will free enough memory to expand the compound sequence.

6.5. Creating a sequence line

You can create a new empty sequence line from within DCSE by choosing the 'Create' option. This sequence will be appended to the alignment file. It will appear in memory under the current sequence. (Normally you will be adding existing sequences from a reference file using convers.)

6.6. Changing the name

You can change the name of an organism by selecting 'Name'. You have to save the organism if you want to change the name in the file. When you are working on a file not produced by 'Output' or convers, the file sometimes does not have room for names longer than the one the sequence currently has. ('Output' puts in spaces after the name to accommodate 40 characters.) In such a case, you should create a new file with the total alignment using 'Output'. Leave DCSE, and load the new file. You should then be able to change your names properly.
When changing a name, take care that the name of the corresponding sequence in the reference file is changed as well.

7. Divers

7.1. Information

'Info' (shortcut: J) can give you some information concerning the selected sequence(s). It gives the following options:

  Info: Comp. comp./Pos Dinucl. Var. Simil. Helix File: info.txt Quit

'File' lets you enter the file the information will be written to (default: info_file).

'Comp.' gives you the base composition of the selected subsequence. For every organism the number of A's, U's, etc. is counted and given.

'comp./Pos' gives the composition per position. So for every selected position, the program counts how many A's, U's, etc. are present in the selected organisms. These values can be given as percentages or as absolute counts.

'Var.' gives the RNV value (a variability measure for nucleic acid sequences) for the selected positions and organisms. The RNV value gives an idea of the variability of a position. You will be asked to enter a minimum number. When fewer than this number of organisms have a base at a certain position, the position will not be shown in the info file. By default, this number is set to the number of selected organisms. The info file will contain a list of the organisms used, followed by the list of RNV values. Every line of the list contains the following items: the position, the number of organisms containing a base at that position, the RNV value, and a number of stars indicating the size of the RNV value.

'Simil' calculates the similarity between the selected sequences. The option 'Screen' gives the similarity between the first two selected sequences for the selected part. 'File' will write the similarity values between all selected organisms and the first one to file (for the selected part). You can specify a number n, smaller than the selected number of positions, with 'Blocksize'.
In this case, for each selected position, the similarity between the selected sequences will be calculated for each block of n positions starting from this position. The values are written to file using the 'File' option.

'Helix' will create a table in which the presence or absence of all helices of the first 'Helix numbering' line in memory is shown for all selected organisms.

7.2. Sequence info

The 'Seqinfo' option (shortcut: n) shows the extra info concerning the selected organism that is contained in the reference file. The reference file must have been specified in the primary structure checking option.

7.3. Real position

'Realpos' (shortcut: j) will tell you the sequence position at which the pointer is located. With 'N' and 'O' you can select whether N's and o's are counted as characters. 'Go' gives you the number.

7.4. Searching sequence motifs

'Motifs' (shortcut: z) lets you search for sequence motifs. A motif is a piece of sequence which is present in a lot of organisms (with small changes). Pressing z gives you the following options:

  Motifs: Dif:2 Num:20 Block:20 File:motifs.prt Go Quit

'Dif' gives the number of characters by which two sequences may differ and still be regarded as the same sequence. 'Num' gives the minimum number of organisms that must contain the relevant sequence. By default this is all sequences. 'Block' says how many characters a motif must have. The results are written to a file with the name under 'File' (default: motifs_file). The program takes all possible fragments of the correct length from the current sequence (the upper selected one), and checks each fragment against all other sequences to see whether it occurs in enough organisms. This process can take a long time!

7.5. Translating a coding sequence

In order to study coding nucleic acid sequences, it can be handy to see the sequence and what it is coding for, rather than its structure.
By applying the translate function, the selected sequence is translated into its corresponding amino acid sequence. Before every three bases, the lower case symbol of the amino acid they code for is given. When you apply different colours to these amino acid symbols, the information of both can be easily combined in an alignment.

APPENDIX A: the data formats
----------------------------

This section gives a short, schematic description of the formats used by DCSE. The full description can be found in the text. x's denote characters. A space is shown as a bullet character (·).

1. Alignment format

[]·······
··················10 .... ········x
·.·.·.·.·|·.·.·.·.·| .... .·.·.·.·|
·X·X·X[X·X·X]X·X·X·X .... X[X·X·X]X·····1·
·X·X·X[X{X}X]X·X·X·X .... X[X·X]X·X·····2·
·X·X·X[X·X·X]X·X·X·X .... X[X·X·X]X·····3·
·X·X·X[X(X)X]X·X·X·X .... X[X(X)X]X·····4·

2. Reference format

#Reference file DCSE
:::
...
#data
:
...
seq: ··/]
·····xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx
·····xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx
.....
//
...

3. Organism lists

....

4.Default files

·:· ·:· ·:· ..... ·:· ·:· ....

5.Marker files

························ ························

APPENDIX B: Key shortcuts
-------------------------

Moving around:
  Arrow keys                 : move pointer
  > and <                    : move pointer fast
  2,4,6,8                    : move window
  7 (Alt left arrow)         : next block left
  9 (Alt right arrow)        : next block right
  Home (0)                   : beginning of line
  End (.)                    : end of line
  Ctrl left,right (PF2,PF3)  : move position left & right by page
  Ctrl up,down (PF4,'-')     : move position up & down by page
  PgUp,PgDn (PvSc,NxSc)      : move by page up and down
  d                          : toggle drag
  h                          : Go to next non-complementary basepair (outside)
  H                          : Go to next non-complementary basepair (inside)
  -                          : show other side of cursor
  p                          : shrink pointer

Moving sequences around:
  =                          : move block to position
  r                          : push right
  e                          : push left
  Ctrl r                     : throw right to next character
  Ctrl e                     : throw left to next character
  1                          : pull one character to right side of pointer
  3                          : pull one character to left side of pointer
  R                          : pull right
  E                          : pull left

Main menu / Filing
  s                          : Save
  o                          : Output
  i                          : Insert or delete positions in entire alignment
  f3                         : help
  m                          : move block in memory
  b                          : Go back to selection of organisms
  x                          : Quit program

Positioning
  g                          : Goto position
  m                          : Mark
  f                          : Find sequence / complement / organism / helix / basecompensations

Primary structure tools
  y                          : Check primary structure
  a                          : Align selected sequences to upper selected sequence
  =                          : move block to a certain position
  I                          : Insert sequence
  D                          : Delete sequence
  PF1                        : Overwrite with text

Secondary structure tools
  t                          : Check secondary structure
  C                          : Copy secondary structure or o's
  M                          : Make matrix

Options
  w                          : Split windows
  S                          : Do not move secondary structure
  P                          : Protect selected sequences
  X                          : Xchange ask for confirmation or not
  v                          : Do not show secondary structure signs
  c                          : Colours

Alignment
  i                          : Insert or delete positions in entire alignment
  l                          : Lock organism in space
  u                          : Unlock
  G                          : Group sequences
  U                          : Ungroup sequences

Divers
  J                          : Info about selected sequences
  j                          : Position in sequence
  z                          : Find motifs
  n                          : Sequence info

Hints and tips

If you have any handy tips for doing some things in DCSE, you can put them here. If you send them to me, they could be incorporated. They might be helpful to other users as well.

Some actions (e.g. ungroup, saving of groups) require temporary memory.
If your system has limited memory and you have lots of organisms/positions loaded, you can get a memory allocation error, and it can be a problem to get your data saved. One solution might be to remove (File menu) some organisms from memory (save those first), and then try again. However, it is always wise not to fill your memory completely.