JOINing Two Excel Worksheets

Question
I’m working on combining two excel worksheets. Before I start, I’d like to mention that I also have mysql workbench, so I’m open to working on this issue in either sql or vba (I should learn both). I’m working with .bed files, which are lists of genomic coordinates. In short, the data is indexed by chromosome number (ie:chr2) and then has a numerical start and stop location on the chromosome. These numerical locations can span a large range (ie:100-10,000) or be a single position (ie: 999-1000). I have a list of coordinates that cover a large range, and in a separate file I have a list of single positions.

Example of a file with ranges:

chromosome  start  stop

chr1     45616321

chr3     984211253

Example of file with single positions:

chromosome  start  stop

chr1     52135214

chr3     1025410255

I would like to combine these worksheets such that if a location in my list of single positions is found within the range in my list of ranges, the locations for both are listed in the same row. The lists are 1000s of locations long, so I’d also like this program to loop through every row. Using the example data listed above, I’d like my output to look like the following:

Example of desired output:

chromosome  start    stop  chromosome  start  stop

chr1     45616321   chr1   52135214

chr3     984211253  chr3   1025410255

There is a high probability that multiple single positions will fall within a single range, and I would like these to be listed as separate rows.

Below is a SQL solution. It works but it is complicated, because data needs to be retrieved from and then exported to the database.

select a.chromosome, a.start, a stop, b.chromosome, b.start, b.stop from ranges_table a, positions_table b where b.start >= a.start and b.stop <= a.stop

The following VBA solution also consists of the SQL query and the retrieval and export code.

SubSqlJoin()

Dim oConnAsNew ADODB.Connection

Dim oRSAsNew ADODB.Recordset

Dim sPath

Dim sSQLAsString, wbAsWorkbook

Set wb =ThisWorkbook

sSQL ="select a.chromosome, a.start, a stop,"& _

"b.chromosome, b.start, b.stop"& _

"from <ranges_table> a, <positions_table> b"& _

"where b.start >= a.start and b.stop <= a.stop"

sSQL =Replace(sSQL,“<ranges_table>”, _

Rangename(wb.Worksheets(“Ranges”).Range(“A1”).CurrentRegion))

sSQL =Replace(sSQL,“<positions_table>”, _

Rangename(wb.Worksheets(“Positions”).Range(“A1”).CurrentRegion))

Ifwb.Path<>""Then

sPath = wb.FullName

Else

MsgBox"The workbook must be saved first!"

ExitSub

EndIf

oConn.Open"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=‘"& sPath &"’;"& _

“Extended Properties=‘Excel 12.0;HDR=Yes;IMEX=1’;”

oRS.OpensSQL, oConn

IfNotoRS.EOFThen

wb.Worksheets(“Results”).Range(“A2”).CopyFromRecordsetoRS

Else

MsgBox"No records found"

EndIf

oRS.Close

oConn.Close

EndSub

FunctionRangename(rAsRange)AsString

Rangename=“[”& r.Parent.Name &“$”& _

r.Address(False,False) &“]”

EndFunction

Answer
It’s rather complicated to do this in VBA. Here’s a solution in SPL:

A

1

=file(“D:\\range.xlsx”).xlsimport@t()

2

=file(“D:\\position.xlsx”).xlsimport@t()

3

=xjoin(A1;A2,start>A1.start && stop<A1.stop)

4

=file(“D:\\result.xlsx”).xlsexport(A3,#1.chromosome,#1.start,#1.stop,#1.chromosome,#2.start,#2.stop)

 

Result of executing the SPL script:

undefined

A1: Import one of the Excel worksheet as a table sequence;

A2: Import another Excel worksheet as table sequence;

A3: Perform a cross join over the two table sequences;

A4: Export the resulting table sequence to a new Excel worksheet.