contents
--------

Legalities
    Copyright notice
    Disclaimer
    Citation & Package
Introduction
    1. Purpose
    2. Conventions used
    3. Defaults
    4. Main menu
Conversions
    1. IG Suite to EMBL
    2. GENBANK to EMBL
    2. EMBL to reference
    5. NBRF/PIR
    3. Reference format to alignment
    4. Alignment to tree
Utilities
    1. Transfer
    2. Scan
        2.1. Data
        2.2. The format string
        2.3. The filter string
        2.4. Sorting
        2.5. Scanning alignments and EMBL files
    3. Rearrange
Appendix A: the data formats
    1. Alignment format
    2. Reference format
    3. Organism lists
    4. Default files
Hints and tips


Legalities

Copyright notice

Copyright © Peter De Rijk, University of Antwerp (UIA), 1993

You may give this application to anyone, via any medium, so long as it is delivered with ALL the supplied files and UNALTERED, and it is not supplied on a disk you are charging for (except for media and postage costs). I maintain copyright on all the material supplied and reserve the right to amend these conditions in cases where I deem misuse.

Disclaimer

This application is supplied free to everyone 'as is', I do not give any guarantee that it is free of bugs, or supply any warranty about its suitability for use. No liability will be accepted for any damage to or loss of data as a result of using this application. However, if there are any problems with it and you notify me of them, I will probably do my best to rectify them.

Citation & Package

See the manual of DCSE for further info.


Introduction
------------

1. Purpose

CONVERS is a program which complements DCSE in various ways. It will do conversions from formats used by other programs to formats DCSE understands, and the other way round. It can currently convert sequence files from IG Suite format to EMBL format. It has an option to find and extract sequence information from files obtained from EMBL, and to convert them to the reference format used by DCSE. It can be used to convert an alignment or parts of it to the NBRF/PIR format as used in Clustal, and back again.
It can convert an alignment or parts of the alignment to the format used by TREECON (Van de Peer and De Wachter, 1993, CABIOS 9:177-182) for making trees.

It can also perform several functions on the data used by DCSE. Some of these functions can also be performed in DCSE, but are handled in a different way. Convers can append sequences from a reference file to an alignment, sort an alignment or reference file, and merge different alignments of the same length. Information such as literature reference and taxonomic position can be extracted from a reference file and printed in user-configurable formats.

2. Conventions used

    A          a key to be pressed. The shortcut keys are case sensitive, so you should press the shift key together with the key marked 'A' to get A, and the 'A' key alone to get a.
    <text>     something that must be substituted by a real value when using the program. The text between the < and > signs tells what sort of value this should be, e.g. <number> for a number, <name> for a name.
    [value]    optional item. The value between the brackets can be given, but is not necessary.
    'File'     refers to a menu item you can select. Sometimes you can use a shortcut to apply the relevant menu item.
    ...        something is left out. What is left out should be clear from the context.

3. Defaults

When Convers starts up, it looks in the current directory for a file called "convers.def". In this file, you can put your preferred defaults for many variables, such as the file you start with, your reference file, etc. You can have several defaults for every variable. It is a simple ASCII file. Every line has the following format:

    <variable> : <value>

e.g.:

    file : total.ali
    file : other.ali
    select_file : test.lst

When asked for input, you are presented with the first default. You can switch between the different default values using the arrow keys. The upper entry is the program default. This is often the value you typed in the last time you used the function (during the same session).
You can correct the value presented by moving with the left and right arrow keys, and deleting (backspace or delete) or inserting letters (just type the letter). If the first key you type is a normal character (not an arrow key or backspace), the line is cleared, and you begin typing a new value.

4. Main menu

When Convers is started, it shows its main menu. It shows the available functions in a box titled 'Task'. The currently selected function is shown in inverse video. A function can be selected using the arrow keys, and executed by pressing RETURN. When a function stops it can return a message. This will be shown in the box labelled 'Previous task'. It is used to catch error messages. The option 'Quit' will exit Convers.


Conversions
-----------

1. IG Suite to EMBL

The IntelliGenetics Suite (IntelliGenetics, Inc., Mountain View, California 94040) is a series of programs which can be used to search data in the databases of EMBL or GENBANK (and others). The entire sequence and annotations can be saved to a file. The format of this file differs slightly from that used by the EMBL. This function simply asks for the name of the IG Suite file (default: igembl_ig) and the name the EMBL file will have (default: igembl_embl), and does the conversion. Different entries in the EMBL file are separated by a line containing only '//'.

2. GENBANK to EMBL

Files obtained from GENBANK can be converted to files in the EMBL format by this routine. You just have to enter the name of the GENBANK file, and the name the EMBL file will have. Different entries in the GENBANK file are separated by a line containing only '//'.

2. EMBL to reference

Files obtained from EMBL often contain a long stretch of sequence, containing several molecules, whose positions are (sometimes) given in a feature table. A file can also give several literature references. For making an alignment, you will probably only need the sequence of one molecule.
The function selected by 'Convert EMBL format to REF format' makes extracting the information you need easy. First you are asked for the name of the EMBL file (default: embref_emb) and the name of the reference file (default: embref_emb). If a reference file of that name already exists, you are offered the choice to append the new sequences, or to remove the old file. When a new file is created, the header for the reference file is copied from the file 'header.def'. (When no file of this name is found in the current directory, the one in the resources directory is used.) EMBL to reference can create the items in the following header (so these must be present):

    #reference file DCSE
    acc:S:Accesion number:NoAccn
    src:S:Source:Unknown
    str:S:Strain:Unknown
    ogl:S:Organel:Cellular
    par:S:Partial:N
    ta1:S:Taxonomy 1:Unknown
    au1:S:First author:Unknown
    aut:S:Other authors:Unknown
    ttl:S:Title:Unpublished
    jou:S:Journal:
    vol:N:Volume:
    pgs:S:Pages:
    dat:S:Date:
    edt:S:EMBL date:Unknown
    cdt:S:Creation date:Unknown
    mty:S:molecule type:Unknown
    #data

When you have entered the name of the reference file, you are presented with a menu. This contains one item for every EMBL 'record'. Each item shows the EMBL accession number, the organism name and the description for this record. The current item is shown in inverse video. You can move by using PgUp, PgDn and the arrow keys. You can choose the current item by pressing RETURN.

This brings you to the features menu. This gives a list of all features (or types of molecules) described in the record and their positions. Also given are the total sequence (Total), and the options 'Other' and 'Back'. 'Other' will let you type in any positions. 'Back' will return to the previous menu without selecting anything. Selecting one of the others chooses the displayed positions for the creation of a reference item.

You are then asked for the molecule type (default: embref_other). The program default here is the description given with the selected positions.
Remember you can choose this default by pressing the up arrow key. Then a menu is given with the references contained in the EMBL record. Choose one, and all the information about the sequence will be written to the reference file, and the program will return to the first menu (showing accession number, organism name and description).

When a molecule consists of several parts (fragments, several exons, ...), you can make a reference item for every part. These items should have the same name. The reference file will have to be edited manually afterwards to correct the number and order of the parts. E.g. for a sequence called seq1 which consists of two parts, change

    //
    ...
    mty:5.8S rRNA
    seq: seq1 1 10 1/1
         AAAAAAAAAA AAAAAAAAAA
    //
    ...
    mty:LSU rRNA
    seq: seq1 4 8 1/1
         UUUUUUUUUU UUUUUUUUUU
    //

to

    //
    ...
    mty:5.8S rRNA
    seq: seq1 1 10 1/2
         AAAAAAAAAA AAAAAAAAAA
    //
    ...
    mty:LSU rRNA
    seq: seq1 4 8 2/2
         UUUUUUUUUU UUUUUUUUUU
    //

5. NBRF/PIR

There is an option to create a file in the NBRF/PIR format from a DCSE alignment. This routine works the same way as the alignment to tree routine, but PIR files are produced (default: alipir_ali). The routine will ask whether alignment gaps should be removed, producing the plain, unaligned sequences in NBRF/PIR format.

NBRF/PIR to ALI does the opposite. It will create an alignment from a file in the NBRF/PIR format (default: pirali_pir). It asks for the name of this file, and the name of the alignment. If the alignment already exists, it will offer to append the new alignment to the old one, or to replace it. You can select from a menu which sequences will be converted.

NBRF/PIR to REF will create a reference file from a file in the NBRF/PIR format. It asks for the name of this file, and the name of the reference file. If the file already exists, it will offer to append the reference items to the old file, or to replace it. You can select from a menu which sequences will be converted.

These routines offer a great opportunity for exchange with other programs.
For example, a multiple alignment program like Clustal can be used to create a base alignment for further study and refinement. An alignment can be converted to PIR (excluding the gaps), aligned by Clustal, and turned back into a DCSE alignment. You can even extract a part of a big alignment, align it using Clustal, and put the partial alignment back in again using the import function.

3. Reference format to alignment

The option REF to ALI is used to create an alignment file from sequences in a reference file. The selected sequences are put in the alignment file without gaps in the sequence. (Gaps are added at the end of each sequence to fill up all positions in the alignment, whose size will be that of the longest sequence.) The option can also be used to append new sequences to an existing alignment.

The routine first asks for the name of the reference file (default: refali_ref) and of the alignment file (default: refali_ali). If the alignment already exists, you are asked whether to append the sequences, to delete the existing alignment and create a new one, or to quit. When you are creating a new alignment, you are asked which type it will be: P = protein, D = DNA, R = RNA. In the following menu, you can choose which sequences of the reference file will be loaded. The 'Select all' option will select all organisms. 'Load sequences' starts the conversion.

When a protein alignment is selected, nucleic acid sequences in the reference file will be automatically translated. When a nucleic acid alignment is selected, protein sequences will be retro-translated.

4. Alignment to tree

This routine asks for the name of the alignment file (default: alitre_ali) and the name of the tree file (default: alitre_tre). (If the tree file exists, it can be appended to.) In a menu, the sequences to convert can be selected. The option 'Quit' stops the routine. 'Go' lets you go on to the selection of positions to be used. The positions can be entered in the following format:

    <firstpos1>-<lastpos1>,<firstpos2>-<lastpos2>, ...
with firstpos1 < lastpos1 < firstpos2 < lastpos2. By default, all alignment positions are used.


Utilities
---------

1. Transfer

This routine can transfer sequences from one alignment to another. It asks for the source alignment (default: aliali_source) and the destination alignment (default: aliali_dest). If these alignments are of different length, the routine aborts. Otherwise, you can choose from a list of the sequences in the source alignment which ones you want to transfer. Selecting 'Go' starts the transfer. If a sequence exists in the destination alignment, it will be overwritten by the sequence in the source alignment. Sequences not present in the destination alignment are appended to it. The program asks for confirmation on every copy. You can reply with Y (yes), N (no), A (all) or Q (quit). When you reply 'A', no confirmation will be asked for copying the following sequences.

2. Scan

Scan lets you extract information from a reference file in a very flexible way. It provides some database functionality. Data can be filtered out, and the output can be sorted and written in a user-definable format. It asks for the name of the reference file (default: scan_src) and of the output file (default: scan_lst) to be created. It then shows a screen offering the following options:

    P   lets you select whether different parts of the same sequence will be joined (J), worked upon independently from each other (Y), or whether only the first part is used (N).
    V   lets you enter a format string which describes which data will be written to the output file, and in what format. The first string (Vorm) (default: scan_vorm) determines in which format the data of the first piece of every sequence will be printed. The next (part) (default: scan_pvorm) determines the format of the following parts (if there are any).
    S   lets you select the sort order (default: scan_order) for all output and for the parts if they are joined.
    F   lets you enter a filter string (default: scan_filter) which can be used to filter out unwanted data.
    G   starts the output, while Q will take you back to the main menu.

The routine will output messages to the screen about what it is doing (loading, filtering, sorting, writing), and reports the errors it encounters. When it is finished, it asks for a keypress.

2.1. Data

Under the options, data elements are shown which can always be used. They are retrieved from the essential sequence information. Their names should not be used for items in a reference file. They are:

    org   organism or sequence name
    prt   part number
    pts   number of parts the sequence consists of
    len   length of part
    tle   total length of sequence
    tlp   length of sequence up to this part
    ppa   the position of this part in the sequence (=tlp+1)

Three data items are provided to help formatting. (They can e.g. be used for referencing: print the list of organisms with a number, then append the list of references in the same sort order with a number. Presto.)

    bla   returns a blank field (this can be formatted to a specific size)
    lin   gives a line number. This line number is increased after information about one block is printed out.
    num   gives a sequence number. This number is increased after all information about one sequence is printed out (including its parts).

Pressing I gives a list of the fields which are defined in the current reference file, with some explanation. PgUp, PgDn and the arrow keys can be used to scroll them up and down. X goes back to the previous menu.

2.2. The format string

The format string controls which data will be printed, and how it will be printed. Most characters are just copied to the output file. The backslash ('\') acts as an escape character. This is used to create special characters such as:

    \b   backspace
    \f   formfeed
    \n   newline (line feed)
    \r   carriage return
    \t   tab (horizontal)
    \v   vertical tab
    \\   \ (the backslash character)
    \"   double quote

The percentage sign ('%') indicates that data from the reference file will be inserted at this position.
The following three alphabetical characters determine from which field the data will be fetched (= field ident). E.g. %org will print out the sequence (or organism) name.

You can insert a format specification between the '%' and the field ident, which will determine how the data will be printed. This specification has the following format (optional parameters are shown between square brackets):

    %[flags][width][.precision]<field ident>

    [flags]:
        -     left-justified; if not present, right-justified
        +     a positive number always has a plus sign
    [width]:
        n     (number) at least n characters are printed. If the data has fewer than n characters, the output is padded with blanks.
        0n    same, but the output is padded with zeros
    [.precision]:
        .n    at most n characters are printed. If the data has more than n characters, the output is truncated.

The field ident can be replaced by a format string between brackets. The result of this format string will be calculated using the data in the current entry. The resulting string will be formatted according to the format specification before the brackets, e.g.

    %40.40(%org %acc)

The format within the brackets will give a string containing the sequence name and accession number separated by a space. If this string is shorter than 40 characters, it will be padded on the left until it has 40. If it is longer, it will be clipped.

Files can be included in the format string by using the construct

    %{name}

This construct in a format string will be substituted by the contents of the file named 'name.scn' in the current directory. (Line feeds in the file will be replaced by an appropriate "\n" in the format string.) When this file is not present in the current directory, it will be searched for in the DCSE directory. E.g. if the file ref.scn contains:

    %aut (%dat)
    %ttl
    %jou %vol:%pgs

the format string

    %org\n%{ref}\n

will be interpreted as

    %org\n%aut (%dat)\n%ttl\n%jou %vol:%pgs\n

2.3. The filter string

The filter string determines which sequence blocks in the reference file will produce output.
A filter is a comparison. If it is true, the sequence block will be kept to produce output. Otherwise it will be discarded. A filter has the following format:

    <field ident><comparator><value>

e.g. tle>1000 will only output information about sequences whose total length is greater than 1000. Several comparators are allowed:

    =     is equal to
    <>    is not equal to
    <     is smaller than
    <=    is smaller than or equal to
    >     is bigger than
    >=    is bigger than or equal to
    #     contains string: this is only applicable to string fields. It will keep the sequence blocks which have the string in the selected field.
    @     is part of string: same, but keeps blocks whose field is contained in the string.

Different filters can be combined into a new filter by using the following booleans:

    (<filter1>)AND(<filter2>)   keep block if both filter1 and filter2 are true.
    (<filter1>)OR(<filter2>)    keep block if filter1 or filter2 is true.
    (<filter1>)EOR(<filter2>)   keep block if filter1 or filter2 is true, but they are not both true.
    !(<filter>)                 keep block if filter is not true.

This system allows a very flexible way to choose which data will be extracted from the reference file. Complex searches are quite feasible. E.g. the string

    (!(ogl#Cell)AND(len>500))OR(len>1000)

will keep only those sequences which are longer than 1000 characters, or which are longer than 500 characters but not cellular. If the filter string is empty, all blocks are retained to produce output.

2.4. Sorting

The sort string determines in which order the information will be printed. It has the following format:

    <field ident 1><+/->[<field ident 2><+/->]...

The order will be determined by the data in field 1 (indicated by field ident 1). For blocks having the same data in field 1, field 2 will be used, etc. The plus or minus sign after the field ident must be present. It determines whether ascending or descending order is used. E.g. with the sort string

    tle-org+

the information will be sorted on the length of the total sequence, beginning with the longest. When sequences are of equal length, their information is sorted on the organism name (in alphabetic order).

2.5. Scanning alignments and EMBL files

Scanning an alignment gives a list of the organisms present in the alignment. Scanning an EMBL file gives a list of the items present in this EMBL file (accession number, organism and title). None of the formatting, filtering, or sorting options are applicable to these types of file.

3. Rearrange

The Rearrange routine can be used to change the order of the sequences in an alignment or reference file. It can also be used to extract a set of sequences from these files. The routine asks for the name of the alignment or reference file (default: rear_ref), the name of the sequence list (default: rear_lst), and the name that will be given to the rearranged file (default: rear_sorted). The sequence list contains the sequence names in the new order, one name per line. (Everything after the first 40 characters is ignored.) If not all sequences in the original file are mentioned in the list file, you are asked whether you want to write the remaining organisms to a file. If so, you are asked for the name of the file (default: rear_remain) to write these to.


Appendix A: the data formats
----------------------------

This section gives a short, schematic description of the formats used by DCSE. The full description can be found in the DCSE manual. An x denotes a character. A space is shown as a bullet character (·).

1. Alignment format

    []·······
    ··················10 .... ········x
    ·.·.·.·.·|·.·.·.·.·| .... .·.·.·.·|
    ·X·X·X[X·X·X]X·X·X·X .... X[X·X·X]X·····1·
    ·X·X·X[X{X}X]X·X·X·X .... X[X·X]X·X·····2·
    ·X·X·X[X·X·X]X·X·X·X .... X[X·X·X]X·····3·
    ·X·X·X[X(X)X]X·X·X·X .... X[X(X)X]X·····4·

2. Reference format

    #Reference file DCSE
    :::
    ...
    #data
    :
    ...
    seq: ··/[·]
    ·····xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx
    ·····xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx
    .....
    //
    ...

3. Organism lists

    ....

4. Default files

    ·:·
    ·:·
    ·:·
    .....
    ·:·
    ·:·
    ....
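The default-file layout described in section 3 of the Introduction (one variable and one value per line, separated by a colon, with several values allowed per variable) is simple enough to read in a few lines. The following is a minimal Python sketch, not code taken from Convers: the function name read_defaults is hypothetical, and it assumes that whitespace around the colon is insignificant and that lines without a colon can be skipped. The sample lines are the ones from the manual's own example.

```python
from collections import defaultdict


def read_defaults(text):
    """Collect every value given for each variable, in file order.

    Assumption: a line looks like 'variable : value'; spaces around the
    colon are ignored, and lines without a colon are skipped.
    """
    defaults = defaultdict(list)
    for line in text.splitlines():
        # Split at the first colon only, so a value may itself contain colons.
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            continue
        defaults[name.strip()].append(value.strip())
    return dict(defaults)


sample = """file : total.ali
file : other.ali
select_file : test.lst
"""

print(read_defaults(sample))
# {'file': ['total.ali', 'other.ali'], 'select_file': ['test.lst']}
```

Keeping the values in a list per variable mirrors the way Convers presents the first default and lets you step through the others with the arrow keys.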


Hints and tips

If you have any handy tips for doing things in DCSE, you can put them here. If you send them to me, I could incorporate them here. They might be helpful to other users as well. Here is one to whet your appetite.

Scanning a reference file can be used to produce a marker file, e.g.

    V(orm : ) (part:%-40.40org-%ppa 0 begin %mty in %org\n )

will give the begin positions of sequence fragments as markers.
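
To get a feeling for what a format string like the one in this tip produces, the expansion rules of section 2.2 can be imitated outside Convers. The following Python sketch is an illustration only and not the program's actual code: the names expand, TOKEN and ESCAPES are hypothetical, the record values are made up, and it handles just the '-' flag, the width, the precision and a few escapes. Zero padding, the '+' flag, bracketed sub-formats and %{name} includes are left out.

```python
import re

# One token is either an escape (\x) or a field reference of the form
# %[flags][width][.precision]<three-letter field ident>.
TOKEN = re.compile(r"\\(.)|%([-+]*)(\d*)(?:\.(\d+))?([a-z]{3})")

# Only a subset of the escapes listed in section 2.2.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def expand(fmt, record):
    """Expand a format string using the fields of one record (a dict)."""

    def replace(match):
        if match.group(1) is not None:
            # Escape sequence: unknown escapes fall back to the bare character.
            return ESCAPES.get(match.group(1), match.group(1))
        flags, width, precision, ident = match.group(2, 3, 4, 5)
        text = str(record.get(ident, ""))
        if precision is not None:
            text = text[: int(precision)]  # at most n characters
        if width:
            n = int(width)                 # at least n characters
            text = text.ljust(n) if "-" in flags else text.rjust(n)
        return text

    return TOKEN.sub(replace, fmt)


# Made-up record for illustration; the field idents are those of the manual.
record = {"org": "seq1", "ppa": 11, "mty": "LSU rRNA"}
print(expand(r"%-10.10org-%ppa 0 begin %mty in %org\n", record), end="")
# seq1      -11 0 begin LSU rRNA in seq1
```

The sketch uses a width of 10 instead of 40 only to keep the printed line short; the padding and clipping work the same way for any width.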