GA4GH Recommendation and Comparison to Usage Practices
- We recommend the use of the "0-start, half-open" (interbase) coordinate system in all systems
- This is not a retrospective recommendation for existing standards and products
- "1-start, fully-closed" should be used when displaying coordinates through a GUI or report
0-start, half-open genomic coordinate system
Two integers that define the start and end positions of a range of residues, possibly with length zero, and specified using "0-start, half-open" coordinates.
The following also applies to coordinates:
- Coordinates start at 0 and finish at the length of the sequence
- Start must be greater than or equal to 0
- End must be greater than or equal to the start
- The length of an interval is (end - start)
- The reverse start is (sequence length - end)
- The reverse end is (sequence length - start)
- A zero-length interval (start == end) is a point between two residues
- An interval of length 1 is a residue position
- Two intervals are equal if their start and end are equal
- Two intervals intersect if start or end occurs between the start and end of the other
- Two intervals coincide if they intersect or if they are equal
- start (int): start position >= 0 (required)
- end (int): end position >= start (required)
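The rules above can be sketched as a small Python class. This is a minimal illustration, not part of any GA4GH specification: the `Interval` class, its `length` property, and the `reverse` and `intersects` methods are hypothetical names. The overlap test is one common reading of "intersect" for half-open intervals, and the reverse end is taken as (sequence length - start) so that the interval keeps its length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical 0-start, half-open interval (illustrative only)."""

    start: int
    end: int

    def __post_init__(self):
        # Field constraints: start >= 0 and end >= start.
        if self.start < 0:
            raise ValueError("start must be >= 0")
        if self.end < self.start:
            raise ValueError("end must be >= start")

    @property
    def length(self) -> int:
        # The length of an interval is (end - start).
        return self.end - self.start

    def reverse(self, sequence_length: int) -> "Interval":
        # Coordinates of the same interval counted from the other end
        # of the sequence; the length is unchanged.
        if self.end > sequence_length:
            raise ValueError("interval extends past the end of the sequence")
        return Interval(sequence_length - self.end, sequence_length - self.start)

    def intersects(self, other: "Interval") -> bool:
        # Assumed overlap test for half-open intervals.
        return self.start < other.end and other.start < self.end


interval = Interval(4, 10)
print(interval.length)        # 6
print(interval.reverse(20))   # Interval(start=10, end=16)
print(Interval(4, 4).length)  # 0, a point between two residues
```

Equality comes from the dataclass itself: two `Interval` values compare equal when their start and end are equal.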
Circular regions are not considered to be part of GA4GH and not covered here, since human genome data is handled as linear sequence. APIs may choose to support a circular location but must still support "0-start, half-open" coordinates.
The "0-start, half-open" scheme is also known by the following names:
- "0-based, half-open"
- UCSC style
- Chado style
All of these names refer to identical representations of coordinates. Interbase uses the same representation but interprets it differently: coordinates refer to the points between residues, which is useful when considering insertion events. Care should be taken when using these alternative names as they combine representation and interpretation.
How '0-start, half-open' works
     G G T G G A G T G C G C C G C C A T G G
                        1 1 1 1 1 1 1 1 1 1 2
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
            |-----------| GAGTGC

"0-start, half-open" breaks down into two parts, one for each integer position. The first, "0-start", refers to the start coordinate, which uses an indexing scheme starting at 0 to refer to bases within a sequence, similar to array indexes in most C-based programming languages. The second, "half-open", refers to the end coordinate, which is excluded from the range: it is one higher than the 0-based index of the last base in the range (effectively using an indexing system starting at 1).
This scheme makes subsequences very easy to define. In the above example we have highlighted the subsequence GAGTGC, which starts at position 4 and ends at position 10. The length of this subsequence is calculated by subtracting start from end, e.g. (10 - 4) = 6. Other transformations are also less prone to programming errors than in the alternative system, "1-start, fully-closed".
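Python slices use the same start-inclusive, end-exclusive convention, so the example can be checked directly. The sketch below assumes the 20-base sequence from the figure; the variable names are illustrative.

```python
sequence = "GGTGGAGTGCGCCGCCATGG"
start, end = 4, 10

# A 0-start, half-open interval maps directly onto a slice.
subsequence = sequence[start:end]
print(subsequence)   # GAGTGC
print(end - start)   # 6

# Adjacent intervals share a boundary, so the pieces rejoin with no
# gap and no overlap.
assert sequence[:start] + sequence[start:end] + sequence[end:] == sequence
```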
The same coordinate system can be used to flag insertions and deletions. A start and an end that equal each other refer to a space between two residues, e.g. 4,4 would flag an event occurring between the fourth and fifth residues (G and G in the example above).
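As a rough illustration of how one representation covers both event types, the sketch below applies an insertion and a deletion to the example sequence. The `replace_interval` helper is hypothetical and is not taken from any GA4GH API.

```python
def replace_interval(sequence: str, start: int, end: int, replacement: str = "") -> str:
    """Replace sequence[start:end] with `replacement` (hypothetical helper)."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("interval must satisfy 0 <= start <= end <= len(sequence)")
    return sequence[:start] + replacement + sequence[end:]


sequence = "GGTGGAGTGCGCCGCCATGG"

# Insertion: the zero-length interval 4,4 is the point between
# G (index 3) and G (index 4).
inserted = replace_interval(sequence, 4, 4, "TT")
print(inserted)  # GGTGTTGAGTGCGCCGCCATGG

# Deletion: remove the six residues covered by the interval 4,10.
deleted = replace_interval(sequence, 4, 10)
print(deleted)   # GGTGGCCGCCATGG
```

An insertion is a zero-length interval with a non-empty replacement, and a deletion is a non-empty interval with an empty replacement, so no special cases are needed.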
What is '1-start, fully-closed'?
    GGTGGAGTGCGCCGCCATGG
             11111111112
    12345678901234567890
        ^^^^^^

"1-start, fully-closed" is the human-readable coordinate system commonly used in genomic data displays and reports. It indexes sequences starting at 1, and both the start and the end positions are included in the range. This system should be used when displaying genomic data to a human because it matches how people conventionally count positions. The subsequence GAGTGC in "1-start, fully-closed" starts at position 5 and ends at position 10. Length is calculated by subtracting start from end and adding one, e.g. ((10 + 1) - 5) = 6.
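Converting between the two systems only changes the start coordinate. The functions below are a minimal sketch with hypothetical names. Rejecting zero-length intervals is an assumption of this sketch, made because "1-start, fully-closed" has no direct way to express a point between two residues.

```python
def to_one_start_closed(start: int, end: int) -> tuple[int, int]:
    """Convert 0-start, half-open to 1-start, fully-closed (hypothetical helper)."""
    if start < 0 or end <= start:
        # A zero-length interval cannot be written as a closed range.
        raise ValueError("need 0 <= start < end")
    return start + 1, end


def to_zero_start_half_open(start: int, end: int) -> tuple[int, int]:
    """Convert 1-start, fully-closed to 0-start, half-open (hypothetical helper)."""
    if start < 1 or end < start:
        raise ValueError("need 1 <= start <= end")
    return start - 1, end


print(to_one_start_closed(4, 10))      # (5, 10)
print(to_zero_start_half_open(5, 10))  # (4, 10)

# Both systems give the same length for GAGTGC.
assert (10 - 4) == ((10 + 1) - 5) == 6
```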
GA4GH Products and their coordinate systems
Not all GA4GH related products, specifications and APIs use the same system for their coordinates. Refer to the table below for full details.
| Product | "0-start, half-open" | "1-start, fully-closed" | Interbase |
- Andrew Yates (@andrewyatz)
- GA4GH GKS workstream discussions & beyond
- The documentation of the Variant object for the original GA4GH schema and the discussions that led to it: #49 and #121.
- a nice explanation of coordinate systems at Biostars.org by Obi Griffith
- Chado Interbase documentation
- Why numbering should start at zero, by Edsger Dijkstra (1982; PDF | HTML)
- Interbase primer
- Beacon’s support for coordinate systems
- Refget’s support for coordinate systems
- UCSC information on “0-start, half-open”
- Transforming between coordinates in “0-start, half open”